Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski

Hadoop: challenge accepted! Arkadiusz Osiński [email protected] Robert Mroczkowski [email protected]

description

Nowadays we produce a huge volume of information, but unfortunately at most 12% of it is analyzed. That is why we should dive into our data lake and pull out the Holy Grail: knowledge. But Big Data means big problems. So, challenge accepted! The perfect solution for achieving this goal is Hadoop. It is a 'data operating system' that allows us to process large volumes of any data in a distributed way. Together, we will take a phenomenal journey around the Hadoop world. First stop: operations basics. Second stop: a short tour of the Hadoop ecosystem. At the end of our journey, we will walk through several examples that show the real power of Hadoop as your data platform.

Arkadiusz Osinski - Works at Allegro Group as a system administrator. From the beginning he has been involved in building and maintaining the Hadoop infrastructure within Allegro Group. Previously he was responsible for maintaining large-scale database systems. Passionate about new technologies and cycling.

Robert Mroczkowski - In 2006 he completed a master's degree in Computer Science at Nicolaus Copernicus University. In 2007 he completed a bachelor's degree in Applied Informatics at Nicolaus Copernicus University. From 2006 to 2011 he was a PhD student in Computer Science. His research field was computer science applied to bioinformatics. In 2012 he started working as a Unix system administrator at Allegro Group. He gained experience in the Hadoop world by building and maintaining a cluster for GA. Every day he works with modern high-performance and high-availability technologies, centrally managed in a cloud environment.

Transcript of Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski

Page 1: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski

Hadoop: challenge accepted!

Arkadiusz Osiński

Robert Mroczkowski

Page 2

ToC
- Hadoop basics
- Gather data
- Process your data
- Learn from your data
- Visualize your data

Pages 3-6

BigData
- Petabytes of (un)structured data
- 12% of data is analyzed
- a lot of data is not gathered
- how to gain knowledge?

Page 7

Word cloud: Power, Big Data, Data Lake, Scalability, Petabytes, MapReduce, Commodity

Pages 8-14

HDFS
- Storage layer
- Distributed file system
- Commodity hardware
- Scalability
- JBOD
- Access control
- No SPOF

Pages 15-19

YARN
- Distributed computing layer
- Operations run where the data resides
- MapReduce…
- and other applications
- Resource management

Page 20

Let's squeeze our data to get the juice!

Page 21

Gather data

flume-twitter.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
flume-twitter.sources.Twitter.channels = MemChannel
flume-twitter.sources.Twitter.consumerKey = (…)
flume-twitter.sources.Twitter.consumerSecret = (…)
flume-twitter.sources.Twitter.accessToken = (…)
flume-twitter.sources.Twitter.accessTokenSecret = (…)
flume-twitter.sources.Twitter.keywords = hadoop, big data, nosql

Pages 22-24

Process your data
- Hadoop Streaming!
- No need to write code in Java
- You can use Python, Perl or Awk

Page 25

Process your data (mapper)

#!/usr/bin/python
import sys
import json
import datetime as dt

keyword = 'hadoop'

for line in sys.stdin:
    data = json.loads(line.strip())
    if keyword in data['text'].lower():
        # Use a separate name so the module alias 'dt' is not overwritten.
        day = dt.datetime.strptime(data['created_at'],
                                   '%a %b %d %H:%M:%S +0000 %Y').strftime('%Y-%m-%d')
        print('{0}\t1'.format(day))

Page 26

Process your data (reducer)

#!/usr/bin/python
import sys

(counter, datekey) = (0, '')

for line in sys.stdin:
    line = line.strip().split("\t")
    if datekey != line[0]:
        # A new date starts: emit the total for the previous one.
        if datekey:
            print("{0}\t{1}".format(datekey, counter))
        datekey = line[0]
        counter = 1
    else:
        counter += 1

# Emit the last date, unless there was no input at all.
if datekey:
    print("{0}\t{1}".format(datekey, counter))
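The mapper and reducer above can be sanity-checked without a cluster. The following is a minimal sketch, assuming Python 3 and four made-up tweets; the sample records and the function names `map_tweets` and `reduce_counts` are illustrative and not from the talk. `sorted()` stands in for the shuffle-and-sort phase that Hadoop Streaming performs between the two scripts.

```python
import json
from datetime import datetime
from itertools import groupby

KEYWORD = "hadoop"
TWITTER_DATE = "%a %b %d %H:%M:%S +0000 %Y"  # same format string as the slide's mapper


def map_tweets(lines):
    """Emit 'YYYY-MM-DD<TAB>1' for every tweet whose text mentions KEYWORD."""
    for line in lines:
        tweet = json.loads(line)
        if KEYWORD in tweet["text"].lower():
            day = datetime.strptime(tweet["created_at"], TWITTER_DATE)
            yield "{0}\t1".format(day.strftime("%Y-%m-%d"))


def reduce_counts(sorted_lines):
    """Sum the counts of consecutive lines that share the same date key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for day, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "{0}\t{1}".format(day, sum(int(count) for _, count in group))


# Hypothetical sample input: one JSON document per line.
SAMPLE = [
    json.dumps({"created_at": "Thu Apr 24 10:00:00 +0000 2014", "text": "Hadoop rocks"}),
    json.dumps({"created_at": "Thu Apr 24 11:30:00 +0000 2014", "text": "I love #hadoop"}),
    json.dumps({"created_at": "Fri Apr 25 09:15:00 +0000 2014", "text": "NoSQL only"}),
    json.dumps({"created_at": "Fri Apr 25 12:00:00 +0000 2014", "text": "hadoop again"}),
]

# map -> sort (stand-in for the shuffle) -> reduce
result = list(reduce_counts(sorted(map_tweets(SAMPLE))))
print(result)
```

Because the reducer only compares each key with the previous one, it relies on its input being sorted by key, which is why the sort step cannot be skipped in a local test.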

Page 27

Process your data

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files ./map.py,./reduce.py \
    -mapper ./map.py \
    -reducer ./reduce.py \
    -input /tweets/2014/04/*/*/* \
    -input /tweets/2014/05/*/*/* \
    -output /tweet_keyword

Page 28

Process your data

(….)
2014-04-24    864
2014-04-25    1121
2014-04-26    593
2014-04-27    649
2014-04-28    1084
2014-04-29    1575
2014-04-30    1170
2014-05-01    1164
2014-05-02    1175
2014-05-03    779
2014-05-04    471
(….)

Page 29

Process your data

Page 30

Recommendations

Which product will a client want?

We have the history of users' interactions with items.

Pages 31-32

Simple Example

Let's just use Mahout - it's easy!

> apt-get install mahout
> cat simple_example.csv
1,101
1,102
1,103
2,101
> hdfs dfs -put simple_example.csv
> mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -b \
    -Dmapred.input.dir=/mahout/input/wikilinks/simple_example.csv \
    -Dmapred.output.dir=/mahout/output/wikilinks/simple_example \
    -Dmapred.job.queue.name=atmosphere_prod
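`SIMILARITY_LOGLIKELIHOOD` scores a pair of items with a log-likelihood ratio test on how often they co-occur. The sketch below is a plain-Python illustration of that statistic (the G² test on a 2x2 contingency table), written from the textbook formula rather than taken from Mahout's source; the function names and the example counts are assumptions made for illustration.

```python
import math


def x_log_x(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalised Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table of co-occurrence counts.

    k11: users who interacted with both items
    k12: users who interacted with item A only
    k21: users who interacted with item B only
    k22: users who interacted with neither item
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


independent = log_likelihood_ratio(10, 10, 10, 10)  # no association between the items
associated = log_likelihood_ratio(10, 0, 0, 10)     # the items always occur together
print(independent, associated)
```

A score near zero means the co-occurrence is what chance alone would produce; larger scores indicate a stronger association, which is what makes an item a candidate recommendation.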

Page 33

Simple Example

Tadadam!

> hdfs dfs -text /mahout/output/wikilinks/simple_example/part-r-00000.snappy
1 [105:1.0,104:1.0]
2 [106:1.0,105:1.0]
3 [103:1.0,102:1.0]
4 [105:1.0,102:1.0]
5 [107:1.0,106:1.0]
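Each output line pairs a user ID with a bracketed list of `item:score` recommendations. As a minimal sketch of how such lines could be loaded for further processing, assuming the ID and the list are separated by whitespace (the function name `parse_recommendations` is hypothetical):

```python
def parse_recommendations(line):
    """Parse 'user [item:score,item:score,...]' into (user, [(item, score), ...])."""
    parts = line.split(None, 1)  # split on the first run of whitespace (space or tab)
    user = int(parts[0])
    payload = parts[1].strip().strip("[]") if len(parts) > 1 else ""
    recommendations = []
    if payload:
        for pair in payload.split(","):
            item, score = pair.split(":")
            recommendations.append((int(item), float(score)))
    return user, recommendations


user, recs = parse_recommendations("1 [105:1.0,104:1.0]")
print(user, recs)
```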

Page 34

Wiki Case

We have the links between Wikipedia articles and want to propose new links between articles.

"Wikipedia (/ˌwɪkɨˈpiːdiə/ or /ˌwɪkiˈpiːdiə/ WIK-i-PEE-dee-ə) is a collaboratively edited, multilingual, free Internet encyclopedia that is supported by the non-profit Wikimedia Foundation. Volunteers worldwide collaboratively write Wikipedia's 30 million articles in 287 languages, including over 4.5 million in the English Wikipedia. Anyone who can access"

Page 35

Wiki Case

Page 36

Wiki Case

http://users.on.net/%7Ehenry/pagerank/links-simple-sorted.zip

#!/usr/bin/awk -f
BEGIN { OFS=","; }
{
    gsub(":", "", $1);
    for (i = 2; i <= NF; i++) {
        print $1, $i
    }
}
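The awk script turns each adjacency line of the link dump into one `source,target` pair per line, the CSV shape the recommender takes as input. An equivalent sketch in Python, for readers less familiar with awk; the function name and the sample lines are made up for illustration:

```python
def adjacency_to_pairs(line):
    """Turn 'src: dst1 dst2 ...' into ['src,dst1', 'src,dst2', ...]."""
    fields = line.split()
    if not fields:
        return []
    source = fields[0].replace(":", "")  # same effect as gsub(":", "", $1)
    return ["{0},{1}".format(source, target) for target in fields[1:]]


pairs = adjacency_to_pairs("2: 3 747213 1664968")
print(pairs)
```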

Page 37

Wiki Case

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -Dmapreduce.job.max.split.locations=24 \
    -Dmapreduce.job.queuename=hadoop_prod \
    -Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -Dmapred.text.key.comparator.options=-n \
    -Dmapred.output.compress=false \
    -files ./mahout/mapper.awk \
    -mapper ./mapper.awk \
    -input /mahout/input/wikilinks/links-simple-sorted.txt \
    -output /mahout/output/wikilinks/fixedinput

Page 38

Wiki Case

The Mahout library computed the similarity matrix and produced recommendations for 824 articles.

Importantly, we did not gather any knowledge a priori; we just ran the algorithms out of the box.

Page 39

Wiki Case

Acadèmia_Valenciana_de_la_Llengua

FIFA                               Valencia
October_1                          Calendar
Prehistoric_Iberia                 Link appears recently
Ceuta                              Spain, city on the north coast of Africa
Roussillon                         Part of France by the border with Spain
Sweden                             :)
Turís                              Municipality in the Valencian Community
Vulgar_Latin                       Language article
Western_Italo-Western_languages    Language article
Àngel_Guimerà                      Spanish writer

Page 40

Wiki Case

Page 41

Tweets

Let's find groups of:
• tags
• users

Page 42

Tweets

• Our data is not random
• We've picked specific keywords
• We'll do the analysis in two orthogonal directions

Page 43

Tweets

{
  "filter_level":"medium",
  "contributors":null,
  "text":"PROMOCIÓN MES DE MAYO. con ...",
  "geo":null,
  "retweeted":false,
  "lang":"es",
  "entities":{
    "urls":[
      { "expanded_url":"http://www.agmuriel.com",
        "indices":[ 69, 91 ],
        "display_url":"agmuriel.com/#!-/c1gz",
        "url":"http://t.co/APpPjRRTXn" } ]
  }
  (…)

Page 44

Tweets (mapper)

#!/usr/bin/python
import json, sys

for line in sys.stdin:
    line = line.strip()
    if '"lang":"en"' in line:
        tweet = json.loads(line)
        try:
            text = tweet['text'].lower().strip()
            if text:
                tags = tweet["entities"]["hashtags"]
                for tag in tags:
                    print(tag["text"] + "\t" + text)
        except KeyError:
            continue

Tweets (reducer)

#!/usr/bin/python
import sys

(lastKey, text) = (None, "")

for line in sys.stdin:
    # Split on the first tab only, in case the tweet text itself contains tabs.
    (key, value) = line.strip().split("\t", 1)
    if lastKey and lastKey != key:
        print(lastKey + "\t" + text)
        (lastKey, text) = (key, value)
    else:
        (lastKey, text) = (key, text + " " + value)

# Emit the last tag, which the loop above never prints.
if lastKey:
    print(lastKey + "\t" + text)

Page 45

Tweets

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -Dmapreduce.job.queuename=atmosphere_time \
    -Dmapred.output.compress=false \
    -Dmapreduce.job.max.split.locations=24 \
    -Dmapred.reduce.tasks=20 \
    -files ~/mahout/twitter_map.py,~/mahout/twitter_reduce.py \
    -mapper ./twitter_map.py \
    -reducer ./twitter_reduce.py \
    -input /project/atmosphere/tweets/2014/04/*/* \
    -output /project/atmosphere/tweets/output \
    -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat

Get a SequenceFile with the proper mapping

Page 46

Tweets

mahout seq2sparse \
    -i /project/atmosphere/tweets/output \
    -o /project/atmosphere/tweets/vectorized -ow \
    -chunk 200 -wt tfidf -s 5 -md 5 -x 90 -ng 2 -ml 50 -seq -n 2

Calculate a vector representation for the text

{10:0.6292275202550768,14:0.7772211575566166}
{10:0.6292275202550768,14:0.7772211575566166}
{3:0.37796447439954967,14:0.37796447439954967,19:0.654653676423271,22:0.534522474858859}
{17:1.0}
{3:0.37796447439954967,14:0.37796447439954967,19:0.654653676423271,22:0.534522474858859}
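The sample vectors are sparse `{term_index: weight}` maps, and each has unit Euclidean length, which is consistent with the `-wt tfidf` and `-n 2` options. The sketch below shows the idea with the textbook weight tf * ln(N / df) followed by L2 normalisation. It is an illustration only: Mahout's exact weighting formula, tokenisation, and dictionary ordering may differ, and the function name and sample documents are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse {term_index: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for doc in tokenised for term in doc})
    index = {term: i for i, term in enumerate(vocabulary)}
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenised for term in set(doc))
    n_docs = len(tokenised)

    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        weights = {
            index[term]: tf * math.log(n_docs / doc_freq[term])
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {i: w / norm for i, w in weights.items()}
        vectors.append(weights)
    return vectors


# Hypothetical mini-corpus: one "document" per hashtag.
docs = ["linux ubuntu patch", "linux windows", "zumba fitness zumba"]
vectors = tfidf_vectors(docs)
print(vectors[2])
```

Normalising every vector to unit length keeps long documents from dominating the distance computations in the clustering step.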

Page 47

Tweets

It's time to begin clustering

Let's find 100 clusters

mahout kmeans \
    -i /tweets_5/vectorized/tfidf-vectors \
    -c /tweets_5/kmeans/initial-clusters \
    -o /tweets_5/kmeans/output-clusters \
    -cd 1.0 -k 100 -x 10 -cl -ow \
    -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
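For readers who have not met k-means before, here is a minimal single-machine sketch of the algorithm (Lloyd's iteration) with a squared Euclidean distance, the same measure the command selects. It is not Mahout's distributed implementation: the toy points, the fixed initial centroids, and the exact-equality stopping rule are assumptions chosen to keep the example deterministic.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, max_iterations=10):
    """Assign each point to its nearest centroid, then move centroids to the cluster means."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iterations):
        assignment = [
            min(range(len(centroids)), key=lambda k: squared_euclidean(p, centroids[k]))
            for p in points
        ]
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster where it was
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, assignment


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, assignment = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids, assignment)
```

In the command above, `-k 100` asks for 100 clusters and `-x 10` caps the number of iterations, which corresponds to `max_iterations` in this sketch.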

Page 48

Tweets

Glance at results

BURN          OPEN        LEATHER
FAT           SOFTWARE    WALLET
WEIGHTLOSS    LINUX       MAN
FITNESS       UBUNTU
ZUMBA         OPENSUSE
              PATCHING

Page 49

Tweets

It was easy because tags are strongly dependent (co-occurrence).

Page 50

Tweets

Bigger challenge: user clustering

LINUX UBUNTU WINDOWS OS PATCH MAC HACKED MICROSOFT

FREE CSRRACING WON RACEYOURFRIENDS ANDROID CSRCLASSIC

Page 51

Tweets

Bigger challenge: user clustering

• Results show that the dataset is strongly skewed towards mobile and games
• The dataset wasn't random: we subscribed to specific keywords
• The OS result is great!

Page 52

Tweets

HADOOP WORLD

run predictive machine learning algorithms on hadoop without even knowing mapreduce.: data scientists are very... http://t.co/gdmqm5g1ar

rt @mapr: google cloud storage connector for #hadoop: quick start guide now avail http://t.co/17hxtvdlir #bigdata

Page 53

Tweets

HADOOP WORLD

Cloudera wants to do big data in Real Time.

Hortonworks wants to replace Cloudera by research.

Page 54

Visualize data

add jar hive-serdes-1.0-SNAPSHOT.jar;

create table tw_data_201404
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\012'
STORED AS TEXTFILE
LOCATION '/tweets/tw_data_201404'
AS
SELECT v_date, LOWER(hashtags.text), lang, COUNT(*) AS total_count
FROM logs.tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
WHERE v_date like '2014-04-%'
GROUP BY v_date, LOWER(hashtags.text), lang

Page 55

Visualize data

add jar elasticsearch-hadoop-hive-2.0.0.RC1.jar;

CREATE EXTERNAL TABLE es_export (
  v_date string,
  tag string,
  lang string,
  total_count int,
  info string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource' = 'trends/log',
  'es.index.auto.create' = 'true'
);

Page 56

Visualize data

INSERT overwrite TABLE es_export
SELECT distinct may.v_date, may.tag, may.lang, may.total_count, 'nt'
FROM tw_data_201405 may
LEFT outer JOIN tw_data_201404 april ON april.tag = may.tag
WHERE april.tag is null
  AND may.total_count > 1;
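The query is an anti-join: the `LEFT OUTER JOIN` combined with `april.tag IS NULL` keeps only the tags seen in May that never appeared in April, i.e. new trends. A small in-memory sketch of the same logic, assuming rows are plain dictionaries (the function name and the sample rows are made up for illustration):

```python
def new_tags(april_rows, may_rows, min_count=2):
    """Return May rows whose tag is absent from April and whose count is at least min_count."""
    april_tags = {row["tag"] for row in april_rows}
    seen = set()
    result = []
    for row in may_rows:
        key = (row["v_date"], row["tag"], row["lang"], row["total_count"])
        is_new = row["tag"] not in april_tags          # april.tag IS NULL
        is_frequent = row["total_count"] >= min_count  # may.total_count > 1
        if is_new and is_frequent and key not in seen:  # SELECT DISTINCT
            seen.add(key)
            result.append(key + ("nt",))
    return result


# Hypothetical sample rows.
april = [{"v_date": "2014-04-30", "tag": "hadoop", "lang": "en", "total_count": 40}]
may = [
    {"v_date": "2014-05-01", "tag": "hadoop", "lang": "en", "total_count": 35},
    {"v_date": "2014-05-10", "tag": "eurovision", "lang": "es", "total_count": 12},
    {"v_date": "2014-05-10", "tag": "typo", "lang": "en", "total_count": 1},
]
trends = new_tags(april, may)
print(trends)
```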

Page 57

Visualize data

Page 58

Visualize data

Tag: eurovisiontve

Page 59

Thank you!

Questions?