Kamil Chmielewski, Jacek Juraszek - "Hadoop. W poszukiwaniu złotego młotka." ("Hadoop. In Search of the Golden Hammer.")


Kamil Chmielewski, Jacek Juraszek - "Hadoop. W poszukiwaniu złotego młotka." Presentation from j.Piknik, 30.08.12.

Transcript of Kamil Chmielewski, Jacek Juraszek - "Hadoop. W poszukiwaniu złotego młotka."

Page 1

Hadoop – w poszukiwaniu złotego młotka

Kamil Chmielewski Jacek Juraszek

Page 2

Source: IDC's Digital Universe Study, sponsored by EMC, June 2011

Page 3

• Facebook – 30 PB (2011)
  • 2000 servers
  • 22,400 cores
  • 64 TB RAM

• Yahoo – 14 PB (2010)
  • 4000 servers

• eBay – 5.3 PB
  • 532 servers
  • 4256 cores

• Google – 24 PB ???

Page 4

Growth in computing power

Source: The Free Lunch Is Over, Herb Sutter

Page 5

Page 6

HDFS architecture

Page 7

HDFS File System Shell

• hadoop fs -cat file:///file3 /user/hadoop/file4

• hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2

• hadoop fs -du /user/hadoop/dir1

• hadoop fs -get hdfs://nn.example.com/user/hadoop/file localfile

• hadoop fs -ls /user/hadoop/file1

• hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir

• hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2

• hadoop fs -put localfile hdfs://nn.example.com/hadoop/hadoopfile

• hadoop fs -rm hdfs://nn.example.com/file

• hadoop fs -tail pathname
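
The same operations are available from Java through the FileSystem API. A minimal client sketch (the paths here are illustrative, not from the slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClient {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name / fs.defaultFS from core-site.xml on the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Equivalent of: hadoop fs -mkdir /user/hadoop/dir
    fs.mkdirs(new Path("/user/hadoop/dir"));

    // Equivalent of: hadoop fs -put localfile /user/hadoop/dir/hadoopfile
    fs.copyFromLocalFile(new Path("localfile"), new Path("/user/hadoop/dir/hadoopfile"));

    // Equivalent of: hadoop fs -get /user/hadoop/dir/hadoopfile localcopy
    fs.copyToLocalFile(new Path("/user/hadoop/dir/hadoopfile"), new Path("localcopy"));

    fs.close();
  }
}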

Page 8

A distributed client?

Page 9

NameNode HA

Page 10

Parallelization – MapReduce

function map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    emit (w, 1)

function reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  sum = 0
  for each pc in partialCounts:
    sum += pc
  emit (word, sum)

http://en.wikipedia.org/wiki/MapReduce

Page 11

MapReduce – Hadoop JAVA

http://wiki.apache.org/hadoop/WordCount

63 lines !!!
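
For reference, the classic WordCount from the wiki looks roughly like this (a condensed sketch using the "new" org.apache.hadoop.mapreduce API; package declaration omitted):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums the partial counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}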

Page 12

MapReduce – Apache PIG

http://en.wikipedia.org/wiki/Pig_(programming_tool)

input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a group for each word
word_groups = GROUP filtered_words BY word;

-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;

STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

7 lines good, 63 bad
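
A script like this is submitted with the pig launcher (the script file name is assumed here):

pig wordcount.pig            # runs on the cluster (MapReduce mode, the default)
pig -x local wordcount.pig   # runs locally, handy for testing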

Page 13

A real-life example

public static class MetricsMapper extends TableMapper<Text, IntWritable> {

  private final static Logger log = LoggerFactory.getLogger(MetricsMapper.class);

  protected void map(ImmutableBytesWritable key, Result value,
      Mapper<ImmutableBytesWritable, Result, Text, IntWritable>.Context context)
      throws IOException, InterruptedException {
    String query = Bytes.toString(value.getValue(RawDataFamily.CF_B, RawDataFamily.QUERY.getColumn()));
    Map<String, String> infoTags = getValuesFromQuery(query, KEYS);
    for (String key : KEYS) {
      long eventTime = toLong(value.getValue(EvalDataFamily.CF_B, EvalDataFamily.CREATE_TIME.getColumn()));
      long eventTruncatedToDay = timestampToDay(eventTime);
      String tagValue = resolveTagValue(key, value, infoTags);
      int visitCount = toInt(value.getValue(EvalDataFamily.CF_B, EvalDataFamily.VISIT_COUNT.getColumn()));
      context.write(new Text(eventTruncatedToDay + KEY_DELIMITER + infoKey(key) + KEY_DELIMITER + tagValue),
          new IntWritable(visitCount));
    }
  }

  private String resolveTagValue(String attr, Result result, Map<String, String> allTags) {
    String tagValue = allTags.get(attr);
    tagValue = StringUtils.isBlank(tagValue) ? UNDEFINED : tagValue;
    if (SOURCE.equals(attr)) {
      if (!UNDEFINED.equals(tagValue)) {
        return tagValue;
      }
      String direct = Bytes.toString(result.getValue(RawDataFamily.CF_B, RawDataFamily.DIRECT.getColumn()));
      if (StringUtils.isNotBlank(direct)) {
        return retrieveOrigin(direct);
      }
      return DIRECT;
    } else if (MEDIUM.equals(attr)) {
      String source = resolveTagValue(SOURCE, result, allTags);
      return source + VALUE_DELIMITER + tagValue;
    }
    return tagValue;
  }

  private String retrieveHost(String url) {
    if (StringUtils.isNotBlank(url)) {
      try {
        return (new URL(url)).getHost().replaceFirst("www.", "");
      } catch (MalformedURLException e) {
        log.warn("Malformed URL '{}'. Could not retrieve host value.", url);
      }
    }
    return null;
  }

DEFINE extractor pl.allegro.cm.pig.udf.specific.Extractor();
DEFINE isNotBlank pl.allegro.cm.pig.udf.IsNotBlank();
DEFINE concat pl.allegro.cm.pig.udf.Concat();

in = LOAD 'events.$account' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
    'r:userId e:processId e:createTime r:query r:direct e:newprocess',
    '-caster HBaseBinaryConverter')
  AS (userId:chararray, processId:chararray, createTime:chararray,
      query:chararray, direct:chararray, newprocess:chararray);

rows = FILTER in BY (userId IS NOT NULL) AND (processId IS NOT NULL) AND (createTime IS NOT NULL);
rows = FOREACH rows GENERATE SUBSTRING(createTime,0,10) AS createTime, userId, processId, query, direct, newprocess;
rows = FILTER rows BY '$lower' <= createTime AND '$upper' >= createTime;

processs = GROUP rows BY (userId, processId);
processs = FOREACH processs GENERATE concat(group.$0,'|',group.$1) AS countId, COUNT($1) AS count;

firstEvFromEachprocess = FILTER rows BY (newprocess IS NOT NULL);
firstEvFromEachprocess = FOREACH firstEvFromEachprocess GENERATE createTime AS ct,
    concat(userId,'|',processId) AS campId, extractor(query,direct) AS params;

joinedData = JOIN firstEvFromEachprocess BY campId, processs BY countId;
unpackParams = FOREACH joinedData GENERATE ct AS t, FLATTEN(params), count AS c, (count==1 ? 1 : 0) AS b;
dataForWrite = GROUP unpackParams BY (t,$1,$2);
dataForWrite = FOREACH dataForWrite GENERATE group.t, group.$1, group.$2,
    SUM(unpackParams.b), SUM(unpackParams.c), COUNT(unpackParams);

STORE dataForWrite INTO 'metrics' USING org.apache.pig.piggybank.storage.DBStorage(
    '$driver','$url','$usr','$pass',
    'INSERT INTO metrics (account, date, key, value, cripled, events, processs)
     VALUES ("$account", ?, ?, ?, ?, ?, ?)
     ON DUPLICATE KEY UPDATE cripled=VALUES(cripled), events=VALUES(events), processs=VALUES(processs)');

private Map<String, String> getValuesFromQueryString(String query, Set<String> keys) {
    String[] keyVal = split(query, '&');
    Map<String, String> result = new HashMap<String, String>();
    for (String s : keyVal) {
      String[] kv = split(s, '=');
      if (keys.contains(kv[0]) && kv.length > 1) {
        result.put(kv[0], kv[1]);
      }
    }
    return result;
  }

  private String key(String key) {
    if (SOURCE.equals(key)) {
      return SOURCE;
    } else if (MEDIUM.equals(key)) {
      return MEDIUM;
    }
    return key;
  }
}

public static class MetricsReducer extends TableReducer<Text, IntWritable, Writable> {

  protected void reduce(Text key, Iterable<IntWritable> visitCounts,
      Reducer<Text, IntWritable, Writable, Writable>.Context context)
      throws IOException, InterruptedException {
    long visits = 0;
    long pv = 0;
    long bounces = 0;
    for (IntWritable vc : visitCounts) {
      visits++;
      pv += vc.get();
      bounces += vc.get() == 1 ? 1 : 0;
    }
    context.write(null,
        new Put(Bytes.toBytes(key.toString()))
            .add(Constants.CF_B, Constants.VISITS.getColumn(), toBytes(visits))
            .add(Constants.CF_B, Constants.PV.getColumn(), toBytes(pv))
            .add(Constants.CF_B, Constants.BOUNCES.getColumn(), toBytes(bounces)));
  }
}

And this is PIG…

Page 14

Hadoop + MongoDB

[Diagram: MongoDB holds the online data, Hadoop the archive data; flushed data feeds MapReduce jobs, and batch processing results flow back to MongoDB.]

Page 15

Filesystem = HDFS ?

Page 16

HBase

key          timestamp   cf dane       cf adres
80071223097  t3                        miasto=Warszawa
80071223097  t2                        miasto=Gdańsk
80071223097  t1          imie=Jan
86121267222  t2                        ulica=Długa
86121267222  t1          imie=Maria    miasto=Poznań

Page 17

HTable table = new HTable("osoby");
Put event = new Put(Bytes.toBytes("80071223097"))
    .add(Bytes.toBytes("dane"), Bytes.toBytes("imie"), Bytes.toBytes("Jan"))
    .add(Bytes.toBytes("adres"), Bytes.toBytes("miasto"), Bytes.toBytes("Warszawa"));
table.put(event);

// https://github.com/nearinfinity/hbase-dsl
HTable table = new HTable("osoby");
hBase.save(table).row("80071223097")
    .family("dane").col("imie", "Jan")
    .family("adres").col("miasto", "Warszawa");

# http://happybase.readthedocs.org/
table = connection.table('osoby')
table.put('80071223097',
    {'dane:imie': 'Jan', 'adres:miasto': 'Warszawa'})
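
Reading the row back with the plain Java client might look like this (a sketch; configuration and error handling are assumed, not from the slides):

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "osoby");
Get get = new Get(Bytes.toBytes("80071223097"));
Result result = table.get(get);
String imie = Bytes.toString(result.getValue(Bytes.toBytes("dane"), Bytes.toBytes("imie")));      // "Jan"
String miasto = Bytes.toString(result.getValue(Bytes.toBytes("adres"), Bytes.toBytes("miasto"))); // "Warszawa"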

Page 18

# Count rows in a table
def _count_internal(interval = 1000, caching_rows = 10)
  # We can safely set scanner caching with the first key only filter
  scan = org.apache.hadoop.hbase.client.Scan.new
  scan.cache_blocks = false
  scan.caching = caching_rows
  scan.setFilter(org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter.new)

  # Run the scanner
  scanner = @table.getScanner(scan)
  count = 0
  iter = scanner.iterator

  # Iterate results
  while iter.hasNext
    row = iter.next
    count += 1
    next unless (block_given? && count % interval == 0)
    # Allow command modules to visualize counting process
    yield(count, String.from_java_bytes(row.getRow))
  end

  # Return the counter
  return count
end

HBase Shell
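
This is roughly what sits behind the shell's count command; from the shell it is invoked as, e.g. (table name taken from the earlier example):

count 'osoby', INTERVAL => 10000, CACHE => 1000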

Page 19

Package nightmare

org.apache.hadoop.mapred
• Has everything
• Status: legacy
• Chain MR
• JOIN operation on MR

org.apache.hadoop.mapreduce
• Friendly API
• Base classes
• Contexts
• Support for CLI and CoC

Tidbits from the Maven repo:
• Repackaged Guava
• Dependencies on commons-logging
• Distributions only in 3rd-party repos
• HBase with dependencies on jetty and servlet-api
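
In practice this means excluding the unwanted transitive dependencies by hand; a sketch of a pom.xml fragment (coordinates and version are era-appropriate assumptions, adjust to your distribution):

<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase</artifactId>
  <version>0.92.1</version>
  <exclusions>
    <exclusion>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>jetty</artifactId>
    </exclusion>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>servlet-api</artifactId>
    </exclusion>
  </exclusions>
</dependency>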

Page 20

Version mess

Page 21

Example system architecture

Databases still give the application its meaning

MR = Batch

Page 22

Hadoop + SOLR = SOLR Cloud

Page 23

Not every problem is big enough…

Facebook cluster:
• 2k machines
• 12 TB per machine
• 30 PB total capacity
• 1200 machines × 8 cores
• 800 machines × 16 cores

Page 24

Use cases

• Document indexing

• Analysis of web service usage

• Server and firewall logs

• Image and video repositories

• Metrics of system parameters

• Recommendation systems

Page 25

More info …

http://hortonworks.com/blog/

http://www.cloudera.com/blog/

http://hadoopblog.blogspot.com/

http://www.larsgeorge.com/

http://natishalom.typepad.com/nati_shaloms_blog/

http://developer.yahoo.com/blogs/ydn/categories/hadoop/

http://bradhedlund.com/topics/big-data/