[WebMuses] Big Data for the Confused
Uploaded by: przemek-maciolek
Presented by @przemur
Presentation plan
• choosing replication parameters for a Hadoop node
• Pig or Hive for ETL?
• building your own cluster or using the cloud?
The real plan for this meeting
• What is "Big Data"?
• Robots writing MapReduce jobs
• Invited guests: Harimata, GE Healthcare
• Gnomes and Data Science
Big Data means "a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." (Wikipedia)
http://www.winshuttle.com/big-data-timeline/
http://plyojump.com/classes/mainframe_era.php
http://escience.washington.edu/content/hyak-0
[Diagram: Data, five Computer boxes, many Data blocks, and five Program boxes]
[Diagram: five Data+Computer nodes, each holding several Data blocks and running its own Program; coordinated by JobTracker, NameNode, …]
http://www.tik.ee.ethz.ch/~ddosvax/cluster/
2005
[Diagram: five Data+Computer nodes, each holding several Data blocks and running its own Program; coordinated by ResourceManager, NameNode, …]
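The diagrams above appear to show the program being sent to the machines that already hold the data, rather than the data being pulled to one machine. A minimal sketch of that idea in plain Java, with no Hadoop involved: the class `DataLocalitySketch`, the method names, and the use of a thread pool to stand in for cluster nodes are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration: every "node" keeps its own data blocks and runs the program locally.
final class DataLocalitySketch {

    // The "program" shipped to each node: count the records in that node's local blocks.
    static long countRecords(List<String> localBlocks) {
        return localBlocks.size();
    }

    // The coordinator submits the same program to every node and merges the small partial results.
    static long runOnCluster(List<List<String>> nodes) {
        // One thread per simulated node (an assumption; a real cluster uses separate machines).
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, nodes.size()));
        try {
            List<CompletableFuture<Long>> partials = new ArrayList<>();
            for (List<String> localBlocks : nodes) {
                partials.add(CompletableFuture.supplyAsync(() -> countRecords(localBlocks), pool));
            }
            long total = 0;
            for (CompletableFuture<Long> partial : partials) {
                total += partial.join(); // only a single number travels back per node
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> nodes = List.of(
                List.of("block-a", "block-b", "block-c"),
                List.of("block-d", "block-e"),
                List.of("block-f"));
        System.out.println(runOnCluster(nodes));
    }
}
```

Only the per-node totals cross the "network" here, which is the point of the design: the bulky data stays where it is stored.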
HDFS
[Diagram: MapReduce data flow. Two Data+Computer nodes each run a Program and produce Map-phase results; two further Computer boxes turn these into final results. Phases: Map, Shuffle, Reduce]
MapReduce
…
public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
…
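In the WordCount job above, Hadoop itself performs the shuffle step between `map` and `reduce`, so it never appears in the code. A minimal, Hadoop-free sketch that makes all three phases explicit may help; the class `MapReduceSketch` and its method names are hypothetical and are not part of any Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, in-memory illustration of the Map, Shuffle and Reduce phases.
final class MapReduceSketch {

    // Map phase: emit a (word, 1) pair for every whitespace-separated token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: collect all values emitted for the same key, with keys kept sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases in order over a list of input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(mapped).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        // prints {big=2, cluster=1, data=1}
        System.out.println(wordCount(List.of("big data", "big cluster")));
    }
}
```

On a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; here everything runs in one process purely to show the data flow.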
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
CREATE TABLE input (line STRING);
LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;
SELECT word, COUNT(*) FROM input LATERAL VIEW explode(split(line, ' ')) lTable AS word GROUP BY word ORDER BY word;
[Chart: y-axis from 0 to 200; x-axis April, May, June, July]
2003
Data Science vs Big Data ???
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Where to find more information?
• http://www.meetup.com/datakrk/
• https://github.com/onurakpolat/awesome-bigdata
• https://class.coursera.org/datasci-001/lecture
• https://www.codeschool.com/courses/try-r
• …
Special thanks to: