Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski


Nowadays we are producing a huge volume of information, but unfortunately at most 12% of it is analyzed. That is why we should dive into our data lake and pull out the Holy Grail - the knowledge. But BigData means big problems. So, challenge accepted! The perfect tool for achieving this goal is Hadoop. It is a 'data operating system' which allows us to process large volumes of any data in a distributed way. Together, we will take a phenomenal journey around the Hadoop world. First stop: operations basics. Second stop: a short tour around the Hadoop ecosystem. At the end of our travels, we will walk through several examples that show you the real power of Hadoop as your data platform.

Arkadiusz Osinski - works at Allegro Group as a system administrator. From the beginning he has been involved in building and maintaining the Hadoop infrastructure within Allegro Group. Previously he was responsible for maintaining large-scale database systems. Passionate about new technologies and cycling.

Robert Mroczkowski - graduated with a master's degree in Computer Science from Nicolaus Copernicus University in 2006, and completed bachelor studies in Applied Informatics there in 2007. In the years 2006-2011 he was a PhD student in Computer Science; his research field was Computer Science applied to Bioinformatics. In 2012 he started working as a Unix system administrator at Allegro Group, where he gained Hadoop experience building and maintaining a cluster for GA. Every day he works with modern high-performance, highly available technologies, centrally managed in a cloud environment.

Transcript of Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski

Hadoop: challenge accepted!

Arkadiusz Osiński arkadiusz.osinski@allegrogroup.com

Robert Mroczkowski robert.mroczkowski@allegrogroup.com

ToC
- Hadoop basics
- Gather data
- Process your data
- Learn from your data
- Visualize your data

BigData
- Petabytes of (un)structured data
- 12% of data is analyzed
- a lot of data is not gathered
- how to gain knowledge?

The power of Big Data: Data Lake, Scalability, Petabytes, MapReduce, Commodity hardware

HDFS
- Storage layer
- Distributed file system
- Commodity hardware
- Scalability
- JBOD
- Access control
- No SPOF
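Besides the hdfs dfs command line, the storage layer also speaks a small REST API (WebHDFS), so it can be inspected from any language. A minimal sketch, assuming WebHDFS is enabled; the NameNode address below is a placeholder, not something from the talk:

#!/usr/bin/python
# List a directory over WebHDFS; the NameNode address is a placeholder.
import json
import urllib2

NAMENODE = "http://namenode.example.com:50070"   # hypothetical host, adjust for your cluster
PATH = "/tweets/2014/04"

url = "{0}/webhdfs/v1{1}?op=LISTSTATUS".format(NAMENODE, PATH)
listing = json.load(urllib2.urlopen(url))

# every FileStatus also carries owner, permission, replication, blockSize, ...
for status in listing["FileStatuses"]["FileStatus"]:
    print("{0}\t{1}\t{2}".format(status["type"], status["length"], status["pathSuffix"]))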

YARN
- Distributed computing layer
- Operations in place of data
- MapReduce...
- ...and other applications
- Resource management
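The ResourceManager exposes a similar REST API for the computing layer, which is handy for checking what is running and in which queue. Again a small sketch, with a placeholder host:

#!/usr/bin/python
# Show running YARN applications via the ResourceManager REST API; the host is a placeholder.
import json
import urllib2

RM = "http://resourcemanager.example.com:8088"   # hypothetical host

apps = json.load(urllib2.urlopen(RM + "/ws/v1/cluster/apps"))
for app in (apps.get("apps") or {}).get("app", []):
    if app["state"] == "RUNNING":
        print("{0}\t{1}\t{2}".format(app["id"], app["queue"], app["name"]))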

Let's squeeze our data to get the juice!

Gather data

flume-twitter.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
flume-twitter.sources.Twitter.channels = MemChannel
flume-twitter.sources.Twitter.consumerKey = (…)
flume-twitter.sources.Twitter.consumerSecret = (…)
flume-twitter.sources.Twitter.accessToken = (…)
flume-twitter.sources.Twitter.accessTokenSecret = (…)
flume-twitter.sources.Twitter.keywords = hadoop, big data, nosql

Process your data
- Hadoop Streaming!
- No need to write code in Java
- You can use Python, Perl or Awk

Process your data

#!/usr/bin/python
import sys
import json
import datetime as dt

keyword = 'hadoop'
for line in sys.stdin:
    data = json.loads(line.strip())
    if keyword in data['text'].lower():
        # use a new name here; reassigning 'dt' would shadow the datetime module
        day = dt.datetime.strptime(data['created_at'],
                                   '%a %b %d %H:%M:%S +0000 %Y').strftime('%Y-%m-%d')
        print '{0}\t1'.format(day)

Process your data

#!/usr/bin/python
import sys

(counter, datekey) = (0, '')
for line in sys.stdin:
    line = line.strip().split("\t")
    if datekey != line[0]:
        if datekey:
            print "{0}\t{1}".format(str(datekey), str(counter))
        datekey = line[0]
        counter = 1
    else:
        counter += 1
# flush the last key as well
print "{0}\t{1}".format(str(datekey), str(counter))
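Hadoop Streaming hands the mapper output to the reducer sorted by key, which is why the reducer above only has to compare consecutive keys. Before submitting the job you can sanity-check both scripts locally by emulating that shuffle with a plain pipe; the sketch below does exactly that on two invented sample tweets, assuming the scripts above are saved as map.py and reduce.py and are executable:

#!/usr/bin/python
# Emulate "cat sample.json | ./map.py | sort | ./reduce.py" in-process,
# on invented sample tweets, just to exercise the logic locally.
import json
import subprocess

sample_tweets = [
    {"created_at": "Thu Apr 24 10:00:00 +0000 2014", "text": "Learning Hadoop today"},
    {"created_at": "Thu Apr 24 12:30:00 +0000 2014", "text": "hadoop streaming is handy"},
    {"created_at": "Fri Apr 25 09:15:00 +0000 2014", "text": "no keyword here"},
]
stdin = "\n".join(json.dumps(t) for t in sample_tweets) + "\n"

# map -> sort (the shuffle) -> reduce, which is what Hadoop Streaming does per reducer
pipeline = subprocess.Popen("./map.py | sort | ./reduce.py", shell=True,
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
print(pipeline.communicate(stdin)[0])    # expect: 2014-04-24  2

Once both scripts behave locally, submit them to the cluster with Hadoop Streaming: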

Process your data

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \

-files ./map.py,./reduce.py \

-mapper ./map.py \

-reducer ./reduce.py \

-input /tweets/2014/04/*/*/* \

-input /tweets/2014/05/*/*/* \

-output /tweet_keyword

Process your data

(…)
2014-04-24  864
2014-04-25  1121
2014-04-26  593
2014-04-27  649
2014-04-28  1084
2014-04-29  1575
2014-04-30  1170
2014-05-01  1164
2014-05-02  1175
2014-05-03  779
2014-05-04  471
(…)

Process your data
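The per-day counts are easier to read as a chart. A minimal matplotlib sketch, reusing a few of the values from the listing above:

#!/usr/bin/python
# Plot the daily keyword counts produced by the streaming job.
import matplotlib.pyplot as plt

rows = """2014-04-28\t1084
2014-04-29\t1575
2014-04-30\t1170
2014-05-01\t1164
2014-05-02\t1175""".splitlines()   # in practice: hdfs dfs -cat /tweet_keyword/part-*

dates, counts = zip(*(r.split("\t") for r in rows))
plt.plot(range(len(dates)), [int(c) for c in counts], marker="o")
plt.xticks(range(len(dates)), dates, rotation=45)
plt.ylabel("tweets matching the keyword")
plt.tight_layout()
plt.savefig("tweets_per_day.png")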

Recommendations

Which product will the client desire?

We've got historical user interactions with items.

Simple Example: let's just use Mahout - it's easy!

> apt-get install mahout

> cat simple_example.csv

1,101

1,102

1,103

2,101

> hdfs dfs -put simple_example.csv

> mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -b \

-Dmapred.input.dir=/mahout/input/wikilinks/simple_example.csv \

-Dmapred.output.dir=/mahout/output/wikilinks/simple_example \

-Dmapred.job.queue.name=atmosphere_prod

Simple Example: Tadadam!

> hdfs dfs -text /mahout/output/wikilinks/simple_example/part-r-00000.snappy
1  [105:1.0,104:1.0]
2  [106:1.0,105:1.0]
3  [103:1.0,102:1.0]
4  [105:1.0,102:1.0]
5  [107:1.0,106:1.0]
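Each output line is "userID [itemID:score,...]" - the items Mahout recommends for that user, with their scores. A small sketch for turning the hdfs dfs -text output into something a service could consume (the parse_recs.py name and the helper are ours, not part of Mahout):

#!/usr/bin/python
# Parse "userID<TAB>[itemID:score,itemID:score,...]" lines from recommenditembased output.
import sys

def parse_line(line):
    user, recs = line.strip().split("\t")
    pairs = [p for p in recs.strip("[]").split(",") if p]
    return user, [(p.split(":")[0], float(p.split(":")[1])) for p in pairs]

# e.g.: hdfs dfs -text /mahout/output/wikilinks/simple_example/part-r-* | ./parse_recs.py
for line in sys.stdin:
    user, items = parse_line(line)
    print("user {0} -> {1}".format(user, items))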

Wiki  Case

We've got links between Wikipedia articles, and want to propose new links between articles.

„Wikipedia (i/ˌwɪkɨˈpiːdiə/ or i/ˌwɪkiˈpiːdiə/ WIK-i-PEE-dee-ə) is a collaboratively edited, multilingual, free Internet encyclopedia that is supported by the non-profit Wikimedia Foundation. Volunteers worldwide collaboratively write Wikipedia's 30 million articles in 287 languages, including over 4.5 million in the English Wikipedia. Anyone who can access”

Wiki Case

http://users.on.net/%7Ehenry/pagerank/links-simple-sorted.zip

#!/usr/bin/awk -f
BEGIN { OFS=","; }
{
    gsub(":", "", $1);
    for (i = 2; i <= NF; i++) { print $1, $i }
}

Wiki  Case

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \

-Dmapreduce.job.max.split.locations=24 \

-Dmapreduce.job.queuename=hadoop_prod \

-Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \

-Dmapred.text.key.comparator.options=-n \

-Dmapred.output.compress=false \

-files ./mahout/mapper.awk \

-mapper ./mapper.awk \

-input /mahout/input/wikilinks/links-simple-sorted.txt \

-output /mahout/output/wikilinks/fixedinput

Wiki Case: the Mahout library computed the similarity matrix and gave recommendations for 824 articles.

What's important, we didn't gather any a-priori knowledge and just ran the algorithms out of the box.

Wiki Case: recommendations for Acadèmia_Valenciana_de_la_Llengua

- FIFA
- Valencia
- October_1 (calendar article)
- Prehistoric_Iberia (the link appeared recently)
- Ceuta (Spanish city on the north coast of Africa)
- Roussillon (part of France by the border with Spain)
- Sweden :)
- Turís (municipality in the Valencian Community)
- Vulgar_Latin (language article)
- Western_Italo-Western_languages (language article)
- Àngel_Guimerà (Spanish writer)

Wiki  Case

Tweets

Let's find groups of: tags and users

Tweets

- Our data is not random
- We've picked specific keywords
- We'll do the analysis in two orthogonal directions

Tweets

{
  "filter_level":"medium",
  "contributors":null,
  "text":"PROMOCIÓN MES DE MAYO. con ...",
  "geo":null,
  "retweeted":false,
  "lang":"es",
  "entities":{
    "urls":[
      { "expanded_url":"http://www.agmuriel.com",
        "indices":[ 69, 91 ],
        "display_url":"agmuriel.com/#!-/c1gz",
        "url":"http://t.co/APpPjRRTXn" } ]
  }
  (…)
}

Tweets

#!/usr/bin/python
import json, sys

for line in sys.stdin:
    line = line.strip()
    if '"lang":"en"' in line:
        tweet = json.loads(line)
        try:
            text = tweet['text'].lower().strip()
            if text:
                tags = tweet["entities"]["hashtags"]
                for tag in tags:
                    print tag["text"] + "\t" + text
        except KeyError:
            continue

#!/usr/bin/python
import sys

(lastKey, text) = (None, "")
for line in sys.stdin:
    # split only on the first tab - the tweet text may contain more tabs
    (key, value) = line.strip().split("\t", 1)
    if lastKey and lastKey != key:
        print lastKey + "\t" + text
        (lastKey, text) = (key, value)
    else:
        (lastKey, text) = (key, text + " " + value)
# emit the last group as well
if lastKey:
    print lastKey + "\t" + text

Tweets

yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \

-Dmapreduce.job.queuename=atmosphere_time \

-Dmapred.output.compress=false \

-Dmapreduce.job.max.split.locations=24 \

-Dmapred.reduce.tasks=20 \

-files ~/mahout/twitter_map.py,~/mahout/twitter_reduce.py \

-mapper ./twitter_map.py \

-reducer ./twitter_reduce.py \

-input /project/atmosphere/tweets/2014/04/*/* \

-output /project/atmosphere/tweets/output \

-outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat

Get a SequenceFile with the proper mapping

Tweets

mahout seq2sparse \

-i /project/atmosphere/tweets/output \

-o /project/atmosphere/tweets/vectorized -ow \

-chunk 200 -wt tfidf -s 5 -md 5 -x 90 -ng 2 -ml 50 -seq -n 2

Calculate  vector  representation  for  text

{10:0.6292275202550768,14:0.7772211575566166}
{10:0.6292275202550768,14:0.7772211575566166}
{3:0.37796447439954967,14:0.37796447439954967,19:0.654653676423271,22:0.534522474858859}
{17:1.0}
{3:0.37796447439954967,14:0.37796447439954967,19:0.654653676423271,22:0.534522474858859}
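The sparse vectors above are tf-idf weights keyed by the term's index in the dictionary that seq2sparse builds. The toy sketch below is not Mahout's exact weighting, just the idea behind it:

#!/usr/bin/python
# Toy tf-idf: term frequency times log inverse document frequency.
import math
from collections import Counter

docs = ["linux ubuntu patching",
        "zumba fitness weightloss",
        "linux opensuse software"]
tokenized = [d.split() for d in docs]
df = Counter(term for doc in tokenized for term in set(doc))   # document frequency
N = len(tokenized)

def tfidf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(float(N) / df[t]) for t in tf}

for doc in tokenized:
    print(tfidf(doc))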

Tweets: it's time to begin clustering.

Let's find 100 clusters.

mahout kmeans \

-i /tweets_5/vectorized/tfidf-vectors \

-c /tweets_5/kmeans/initial-clusters \

-o /tweets_5/kmeans/output-clusters \

-cd 1.0 -k 100 -x 10 -cl -ow \

-dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
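k-means alternates between assigning every vector to its nearest centroid and moving each centroid to the mean of its cluster, for at most -x iterations. Mahout runs this as MapReduce jobs over the tf-idf vectors; the sketch below is the same loop on toy 2-D points, not Mahout's code:

#!/usr/bin/python
# Plain k-means on toy 2-D points - the algorithm Mahout runs in a distributed way.
import random

def kmeans(points, k, iterations=10):
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # assignment step: nearest centroid by squared Euclidean distance
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # update step: move each centroid to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = (sum(p[0] for p in cluster) / float(len(cluster)),
                                sum(p[1] for p in cluster) / float(len(cluster)))
    return centroids, clusters

points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)] + \
         [(random.gauss(5, 1), random.gauss(5, 1)) for _ in range(50)]
centroids, clusters = kmeans(points, k=2)
print(centroids)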

Tweets: a glance at the results

BURN OPEN LEATHER FAT SOFTWARE WALLET WEIGHTLOSS LINUX MAN FITNESS UBUNTU ZUMBA OPENSUSE

PATCHING

Tweets

It was easy because tags are strongly dependent (co-occurrence).

Tweets: a bigger challenge - user clustering

LINUX UBUNTU WINDOWS OS PATCH MAC HACKED MICROSOFT

FREE CSRRACING WON RACEYOURFRIENDS ANDROID CSRCLASSIC

Tweets: a bigger challenge - user clustering

- Results show that the dataset is strongly skewed towards mobile and games
- The dataset wasn't random: we subscribed to specific keywords
- The OS result is great!

Tweets: HADOOP WORLD

run predictive machine learning algorithms on hadoop without even knowing mapreduce.: data scientists are very... http://t.co/gdmqm5g1ar

rt @mapr: google cloud storage connector for #hadoop: quick start guide now avail http://t.co/17hxtvdlir #bigdata

Tweets: HADOOP WORLD

Cloudera wants to do big data in real time.

Hortonworks wants to replace Cloudera through research.

Visualize data

add jar hive-serdes-1.0-SNAPSHOT.jar;
create table tw_data_201404
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\012'
STORED AS TEXTFILE
LOCATION '/tweets/tw_data_201404'
AS SELECT v_date, LOWER(hashtags.text), lang, COUNT(*) AS total_count
FROM logs.tweets LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
WHERE v_date LIKE '2014-04-%'
GROUP BY v_date, LOWER(hashtags.text), lang;

Visualize data

add jar elasticsearch-hadoop-hive-2.0.0.RC1.jar;
CREATE EXTERNAL TABLE es_export (
  v_date string,
  tag string,
  lang string,
  total_count int,
  info string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource' = 'trends/log',
  'es.index.auto.create' = 'true'
);

Visualize data

INSERT OVERWRITE TABLE es_export
SELECT DISTINCT may.v_date, may.tag, may.lang, may.total_count, 'nt'
FROM tw_data_201405 may
LEFT OUTER JOIN tw_data_201404 april ON april.tag = may.tag
WHERE april.tag IS NULL AND may.total_count > 1;
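Once the rows land in Elasticsearch, the trends/log index can be queried directly over HTTP, which is all a dashboard like Kibana does. A hedged sketch with a placeholder Elasticsearch host, assuming the documents carry the fields of es_export:

#!/usr/bin/python
# Top tags in the 'trends/log' index via a terms aggregation; the host is a placeholder.
import json
import urllib2

ES = "http://elasticsearch.example.com:9200"     # hypothetical host, not from the talk
query = {
    "size": 0,
    "aggs": {"top_tags": {"terms": {"field": "tag", "size": 10}}}
}
req = urllib2.Request(ES + "/trends/log/_search", json.dumps(query),
                      {"Content-Type": "application/json"})
result = json.load(urllib2.urlopen(req))
for bucket in result["aggregations"]["top_tags"]["buckets"]:
    print("{0}\t{1}".format(bucket["key"], bucket["doc_count"]))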

Visualize  data

Visualize data - tag: eurovisiontve

Thank you!

Questions?