Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski
Handling realtime and analytic workloads in a single cluster with Hadoop and Cassandra
Piotr Kołaczkowski
[email protected] · @pkolaczk
Basic Cassandra + Hadoop Integration
[Diagram: a Cassandra cluster (C* nodes) next to a separate Hadoop cluster with a NameNode & JobTracker and several DataNodes; data flows between the clusters through CFIF (ColumnFamilyInputFormat) and CFOF (ColumnFamilyOutputFormat)]
ColumnFamilyInputFormat
jim age: 36 car: camaro gender: M
carol age: 37 car: subaru
johnny age: 12 gender: M
suzy age: 10 gender: F
Key: ByteBuffer — the row key
Value: SortedMap<ByteBuffer, IColumn> — columns sorted by column name; each IColumn carries (column name, value, timestamp)
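As a rough illustration of the shape of one CFIF input record, here is a plain-JDK sketch (not actual Cassandra classes — plain String values stand in for real IColumn objects, and the helper names are made up):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of a single CFIF input record: the row key arrives as a
// ByteBuffer and the columns as a SortedMap keyed by column name.
class CfifModel {

    static ByteBuffer bytes(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    static String string(ByteBuffer b) {
        return StandardCharsets.UTF_8.decode(b.duplicate()).toString();
    }

    // The "jim" row from the slides as one map() input value:
    // columns come back sorted by column name, not insertion order.
    static SortedMap<ByteBuffer, String> jimRow() {
        SortedMap<ByteBuffer, String> columns = new TreeMap<>();
        columns.put(bytes("gender"), "M");  // inserted out of order on purpose
        columns.put(bytes("age"), "36");
        columns.put(bytes("car"), "camaro");
        return columns;
    }
}
```

Because the map is keyed by ByteBuffer (which is Comparable), iterating the value always yields columns in name order, regardless of write order.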
ColumnFamilyInputFormat — each row becomes one map input record:

Input Key: jim    → Input Value: age: 36, car: camaro, gender: M
Input Key: carol  → Input Value: age: 37, car: subaru
Input Key: johnny → Input Value: age: 12, gender: M
Input Key: suzy   → Input Value: age: 10, gender: F
CFIF – Wide Row Support — each (row, column) pair becomes one map input record:

Input Key: jim   → Input Value: age: 36
Input Key: jim   → Input Value: car: camaro
Input Key: jim   → Input Value: gender: M
Input Key: carol → Input Value: age: 37
Input Key: carol → Input Value: car: subaru
CFIF – Cassandra Secondary Index Support
IndexExpression expr = new IndexExpression(
    ByteBufferUtil.bytes("car"),
    IndexOperator.EQ,
    ByteBufferUtil.bytes("subaru"));

ConfigHelper.setInputRange(
    job.getConfiguration(), Arrays.asList(expr));
jim age: 36 car: camaro gender: M
carol age: 37 car: subaru
johnny age: 12 gender: M
suzy age: 10 gender: F
ColumnFamilyOutputFormat
● Key: ByteBuffer (row key)
● Value: List<Mutation>
– Mutation: insert or delete a column
[Diagram: ColumnFamilyRecordWriter places mutations on write queues; Thrift clients drain the queues and send the batches to the Cassandra cluster (C* nodes)]
CFOF – Creating Mutations
ByteBuffer rowkey = ByteBufferUtil.bytes("carol");

Column column = new Column();
column.name = ByteBufferUtil.bytes("age");
column.value = ByteBufferUtil.bytes(37);
column.timestamp = System.currentTimeMillis(); // a timestamp is required

List<Mutation> mutations = new ArrayList<Mutation>();
Mutation mutation = new Mutation();
mutation.column_or_supercolumn = new ColumnOrSuperColumn();
mutation.column_or_supercolumn.column = column;
mutations.add(mutation);

context.write(rowkey, mutations);
BulkOutputFormat
[Diagram: BulkRecordWriter buffers writes in a memory buffer and flushes them as local SSTables (SSTable 1, SSTable 2, … SSTable N) into a Hadoop temporary directory]
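The buffering behaviour in the diagram can be modeled roughly as follows (toy simulation, not the real BulkRecordWriter; the threshold and all names are invented):

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of buffer-then-flush output: rows accumulate in a memory
// buffer, and each time the buffer fills up it is flushed as a new "SSTable".
class BulkWriterModel {
    final int threshold;
    final List<String> buffer = new ArrayList<>();
    final List<List<String>> sstables = new ArrayList<>();

    BulkWriterModel(int threshold) {
        this.threshold = threshold;
    }

    void write(String row) {
        buffer.add(row);
        if (buffer.size() >= threshold)
            flush();
    }

    void flush() {
        if (buffer.isEmpty())
            return;
        sstables.add(new ArrayList<>(buffer)); // one flushed buffer = one SSTable
        buffer.clear();
    }

    // Write five rows with a buffer of two: expect SSTables of 2, 2 and 1 rows.
    static List<List<String>> demo() {
        BulkWriterModel writer = new BulkWriterModel(2);
        for (String row : new String[] {"r1", "r2", "r3", "r4", "r5"})
            writer.write(row);
        writer.flush(); // final flush on close
        return writer.sstables;
    }
}
```

The point of this design is that writing whole SSTables sequentially is far cheaper than issuing one live write per output record.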
DataStax Enterprise: Cassandra and Hadoop in a Single Cluster
Basic Features
● Single, simplified component
● Workload separation
● No SPOF
● Peer to peer
● JobTracker failover
● No additional Cassandra config
System Administrator's View
Address          DC         Rack   Workload       Status  State   Load      Owns    Token
                                                                                    148873535527910577765226390751398592512
101.202.204.101  Analytics  rack1  Analytics(JT)  Up      Normal  78.96 GB  12.50%  0
101.202.204.102  Analytics  rack1  Analytics(TT)  Up      Normal  82.65 GB  12.50%  21267647932558653966460912964485513216
101.202.204.103  Analytics  rack1  Analytics(TT)  Up      Normal  74.96 GB  12.50%  42535295865117307932921825928971026432
101.202.204.104  Analytics  rack1  Analytics(TT)  Up      Normal  78.79 GB  12.50%  63802943797675961899382738893456539648
101.202.204.105  Cassandra  rack1  Cassandra      Up      Normal  67.42 GB  12.50%  85070591730234615865843651857942052864
101.202.204.106  Cassandra  rack1  Cassandra      Up      Normal  60.86 GB  12.50%  106338239662793269832304564822427566080
101.202.204.107  Cassandra  rack1  Cassandra      Up      Normal  81.27 GB  12.50%  127605887595351923798765477786913079296
101.202.204.108  Cassandra  rack1  Cassandra      Up      Normal  77.17 GB  12.50%  148873535527910577765226390751398592512
Easy monitoring of your nodes, regardless of their workload type
Wait, but where are my files?
[Diagram: vanilla Hadoop runs M/R on top of HDFS; in DSE, Hadoop M/R runs on top of CFS, served by the Cassandra server]
Cassandra File System Properties
● Decentralized
● Replicated
● HDFS compatible
– compatible with Hadoop filesystem utilities
– allows for running M/R programs on DSE without any change
● Compressed
CFS Architecture
CFS Compaction
● Keeps track of deleted rows (blocks)
● When all blocks in an SSTable have been removed, deletes the whole SSTable
[Diagram: CFS blocks laid out in Cassandra storage across several SSTables; blocks get deleted over time (marked X), and an SSTable whose blocks are all gone is removed entirely]
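The compaction rule above can be simulated in a few lines (pure toy model; all names are made up):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch of the CFS compaction rule: an SSTable can be dropped as a whole
// only once every block it contains has been deleted.
class CfsCompaction {

    // sstables: SSTable name -> ids of the blocks it holds
    // deleted:  ids of all blocks deleted so far
    static Set<String> droppableSSTables(Map<String, Set<Integer>> sstables,
                                         Set<Integer> deleted) {
        Set<String> droppable = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> e : sstables.entrySet())
            if (deleted.containsAll(e.getValue()))
                droppable.add(e.getKey());
        return droppable;
    }

    // Two SSTables of three blocks each; blocks 1-3 and 5 are deleted,
    // so only the first SSTable is fully dead and droppable.
    static Set<String> demo() {
        Map<String, Set<Integer>> sstables = new LinkedHashMap<>();
        sstables.put("sstable-1", new HashSet<>(Arrays.asList(1, 2, 3)));
        sstables.put("sstable-2", new HashSet<>(Arrays.asList(4, 5, 6)));
        Set<Integer> deleted = new HashSet<>(Arrays.asList(1, 2, 3, 5));
        return droppableSSTables(sstables, deleted);
    }
}
```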
Hive Integration
● CassandraHiveMetaStore
– stores Hive database metadata in Cassandra
– no need to run a separate RDBMS
● CassandraStorageHandler
– allows for direct access to C* tables with CFIF and CFOF
Hive Integration – Example
CREATE EXTERNAL TABLE MyHiveTable (
    row_key string,
    col1 string,
    col2 string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
TBLPROPERTIES ("cassandra.ks.name" = "MyCassandraKS");
SELECT count(*) FROM MyHiveTable;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201306041030_0001, Tracking URL = http://192.168.123.10:50030/jobdetails.jsp?jobid=job_201306041030_0001
Kill Command = /usr/bin/dse hadoop job -Dmapred.job.tracker=192.168.123.10:8012 -kill job_201306041030_0001
Hadoop job information for Stage-1: number of mappers: 9; number of reducers: 1
2013-06-04 15:11:54,573 Stage-1 map = 0%, reduce = 0%
2013-06-04 15:11:58,622 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 1.04 sec
2013-06-04 15:11:59,691 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 1.04 sec
...
2013-06-04 15:12:28,288 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
2013-06-04 15:12:29,304 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
2013-06-04 15:12:30,330 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
2013-06-04 15:12:31,339 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
MapReduce Total cumulative CPU time: 31 seconds 910 msec
Ended Job = job_201306041030_0001
MapReduce Jobs Launched:
Job 0: Map: 9  Reduce: 1  Cumulative CPU: 31.91 sec  HDFS Read: 0  HDFS Write: 0  SUCCESS
Total MapReduce CPU Time Spent: 31 seconds 910 msec
OK
1000000
Time taken: 46.246 seconds
Custom Column Mapping
CREATE EXTERNAL TABLE Users (
    userid string,
    name string,
    email string,
    phone string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
    "cassandra.columns.mapping" = ":key,user_name,primary_email,home_phone");
Cassandra: row key | user_name | primary_email | home_phone
Hive:      userid  | name      | email         | phone