[WebMuses] Big Data for the Confused

Post on 05-Dec-2014


description

A gentle introduction to Big Data for... the confused!

Transcript of [WebMuses] Big Data for the Confused

Presented by @przemur from

Presentation plan

• choosing replication parameters for a Hadoop node

• Pig or Hive for ETL?

• building your own cluster, or the cloud?

The real plan for the meetup

• What is "Big Data"?

• Robots writing MapReduce jobs

• Invited guests: Harimata, GE Healthcare

• Gnomes and Data Science

Big Data means "a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." (Wikipedia)

http://www.winshuttle.com/big-data-timeline/

http://plyojump.com/classes/mainframe_era.php

http://escience.washington.edu/content/hyak-0


[Diagram: a single block of Data and five Computers; then ever more Data, with five Programs]

[Diagram: five Computers, each holding pieces of Data, and five Programs]

JobTracker, NameNode,

http://www.tik.ee.ethz.ch/~ddosvax/cluster/

2005

[Diagram: five Computers, each holding pieces of Data, and five Programs]

ResourceManager, NameNode, …

HDFS
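The slides show pieces of Data spread over several Computers, with a NameNode listed alongside them. The following is a minimal sketch of that idea, assuming fixed-size blocks and a simple round-robin placement. The class `BlockPlacementSketch`, its methods, and the round-robin rule are hypothetical and chosen only for illustration; real HDFS placement is rack-aware and is not reproduced here. The 128 MiB block size and the replication factor of 3 used in `main` are the defaults in recent Hadoop versions.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: real HDFS placement is rack-aware and is decided by the NameNode.
class BlockPlacementSketch {

    // Number of fixed-size blocks needed to hold a file of the given length.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    // Assign each block to `replication` distinct nodes, round-robin (an assumption).
    static List<List<Integer>> place(long blocks, int nodes, int replication) {
        if (replication > nodes) {
            throw new IllegalArgumentException("replication cannot exceed node count");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (long block = 0; block < blocks; block++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add((int) ((block + r) % nodes));
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        // A 300 MiB file cut into 128 MiB blocks, stored on 5 nodes with 3 replicas each.
        long blocks = blockCount(300L * 1024 * 1024, 128L * 1024 * 1024);
        System.out.println(place(blocks, 5, 3)); // prints [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
    }
}
```

Each inner list names the nodes holding one block, so losing any single node still leaves two copies of every block it stored.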

[Diagram: two Computers holding Data run the Program and produce Map-phase results; two more Computers produce the final results. Stages: Map, Shuffle, Reduce]

MapReduce

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum all the 1s collected for a given word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // … (the rest of the class is elided on the slide)
}
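To see what happens between the Map and Reduce classes, here is a Hadoop-free sketch of the same word count with the Map, Shuffle, and Reduce stages written out as plain methods. The class `MiniMapReduce` and its method names are hypothetical; it runs everything in one JVM and only imitates the data flow, not Hadoop's distribution, sorting, or fault tolerance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process illustration of the Map -> Shuffle -> Reduce flow.
class MiniMapReduce {

    // Map phase: emit a (word, 1) pair for every token in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(Map.entry(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle phase: collect all values emitted for the same key, keys kept sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Run the three phases in order over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(emitted).forEach((word, ones) -> result.put(word, reduce(word, ones)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data", "big cluster")));
        // prints {big=2, cluster=1, data=1}
    }
}
```

In Hadoop the `map` calls run on the machines that hold the data, and the shuffle moves the grouped pairs across the network to the reducers; the sketch keeps only the order of the steps.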

input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

filtered_words = FILTER words BY word MATCHES '\\w+';

word_groups = GROUP filtered_words BY word;

word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

ordered_word_count = ORDER word_count BY count DESC;

STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
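For readers more at home in Java than in Pig Latin, the same pipeline can be sketched with the Java Streams API, one step per Pig statement. The class `PigPipelineSketch` is hypothetical, and two details are simplifying assumptions: tokens are split on whitespace only (Pig's `TOKENIZE` also splits on a few punctuation characters), and ties in the count are broken alphabetically to keep the output stable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical in-memory counterpart of the Pig script; no Pig or Hadoop involved.
class PigPipelineSketch {

    static List<Map.Entry<String, Long>> orderedWordCount(List<String> inputLines) {
        return inputLines.stream()
                // FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one element per token
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                // FILTER words BY word MATCHES '\\w+'
                .filter(word -> word.matches("\\w+"))
                // GROUP ... BY word, then COUNT per group
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet().stream()
                // ORDER ... BY count DESC (alphabetical tie-break is an assumption)
                .sorted((a, b) -> {
                    int byCount = Long.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "data," is dropped by the \w+ filter because of the trailing comma.
        System.out.println(orderedWordCount(List.of("big data, big", "data big")));
    }
}
```

The difference from Pig is where the work runs: Pig compiles the script into MapReduce jobs over the cluster, while this sketch processes a list held in memory.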

CREATE TABLE input (line STRING);

LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;

SELECT word, COUNT(*) FROM input LATERAL VIEW explode(split(line, ' ')) lTable AS word GROUP BY word ORDER BY word;

[Chart: vertical axis from 0 to 200; horizontal axis April, May, June, July]

2003

Data Science vs Big Data ???

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Where to find more information?

• http://www.meetup.com/datakrk/

• https://github.com/onurakpolat/awesome-bigdata

• https://class.coursera.org/datasci-001/lecture

• https://www.codeschool.com/courses/try-r

• …

Special thanks to: