Post on 08-Jan-2017
BENEATH RDD IN APACHE SPARK
USING SPARK-SHELL AND WEBUI
Jacek Laskowski / @jaceklaskowski / GitHub / Mastering Apache Spark Notes
Jacek Laskowski is an independent consultant. Contact me at jacek@japila.pl or @JacekLaskowski.
Delivering Development Services | Consulting | Training
Building and leading development teams
Mostly Apache Spark and Scala these days
Leader of Warsaw Scala Enthusiasts and Warsaw Apache Spark
Java Champion
Blogger at blog.jaceklaskowski.pl and jaceklaskowski.pl
SPARKCONTEXT: THE LIVING SPACE FOR RDDS
SPARKCONTEXT AND RDDS
An RDD belongs to one and only one Spark context.
You cannot share RDDs between contexts.
SparkContext tracks how many RDDs were created.
You may see it in the toString output.
SPARKCONTEXT AND RDDS (2)
RDD: RESILIENT DISTRIBUTED DATASET
CREATING RDD - SC.PARALLELIZE
sc.parallelize(col, slices) to distribute a local collection of any elements.
scala> val rdd = sc.parallelize(0 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24
Alternatively, sc.makeRDD(col, slices)
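The slices parameter decides how the elements are split up. As a rough local sketch (an assumption about the boundary arithmetic, not a call into Spark; sliceCollection is a hypothetical helper), the split can be modelled like this:

```scala
// Simplified local sketch of how a collection is split into `numSlices`
// partitions: slice i covers indices [i*n/numSlices, (i+1)*n/numSlices).
def sliceCollection[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] =
  (0 until numSlices).map { i =>
    val start = (i * seq.length) / numSlices
    val end = ((i + 1) * seq.length) / numSlices
    seq.slice(start, end)
  }

val parts = sliceCollection(0 to 10, 4)
// 11 elements over 4 slices: sizes 2, 3, 3, 3
```

Integer division makes the slice sizes differ by at most one element, so work is spread evenly across partitions.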
CREATING RDD - SC.RANGE
sc.range(start, end, step, slices) to create an RDD of long numbers.
scala> val rdd = sc.range(0, 100)
rdd: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[14] at range at <console>:24
CREATING RDD - SC.TEXTFILE
sc.textFile(name, partitions) to create an RDD of lines from a file.
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[16] at textFile at <console>:24
CREATING RDD - SC.WHOLETEXTFILES
sc.wholeTextFiles(name, partitions) to create an RDD of pairs of a file name and its content from a directory.
scala> val rdd = sc.wholeTextFiles("tags")
rdd: org.apache.spark.rdd.RDD[(String, String)] = tags MapPartitionsRDD[18] at wholeTextFiles at <console>:24
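What the resulting pairs look like can be sketched on a single machine (a plain-Scala analogue, not Spark's implementation; wholeTextFilesLocal is a hypothetical helper):

```scala
import java.io.File
import scala.io.Source

// Single-machine analogue of sc.wholeTextFiles: build
// (file path, file content) pairs from a directory.
def wholeTextFilesLocal(dir: String): Seq[(String, String)] =
  new File(dir).listFiles.filter(_.isFile).toSeq.map { f =>
    val src = Source.fromFile(f)
    try { (f.getPath, src.mkString) } finally src.close()
  }
```

Unlike sc.textFile, each file becomes one element, so it suits many small files rather than a few large ones.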
There are many more advanced functions in SparkContext to create RDDs.
PARTITIONS (AND SLICES)
Did you notice the words slices and partitions as parameters?
Partitions (aka slices) are the level of parallelism.
We're going to talk about the level of parallelism later.
CREATING RDD - DATAFRAMES
RDDs are so last year :-) Use DataFrames... early and often!
A DataFrame is a higher-level abstraction over RDDs and semi-structured data.
DataFrames require a SQLContext.
FROM RDDS TO DATAFRAMES
scala> val rdd = sc.parallelize(0 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at <console>:24

scala> val df = rdd.toDF
df: org.apache.spark.sql.DataFrame = [_1: int]

scala> val df = rdd.toDF("numbers")
df: org.apache.spark.sql.DataFrame = [numbers: int]
...AND VICE VERSA
scala> val rdd = sc.parallelize(0 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at <console>:24

scala> val df = rdd.toDF("numbers")
df: org.apache.spark.sql.DataFrame = [numbers: int]

scala> df.rdd
res23: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[70] at rdd at <console>:29
CREATING DATAFRAMES - SQLCONTEXT.CREATEDATAFRAME
sqlContext.createDataFrame(rowRDD, schema)
CREATING DATAFRAMES - SQLCONTEXT.READ
sqlContext.read is the modern yet experimental way.
sqlContext.read.format(f).load(path), where f is:
jdbc
json
orc
parquet
text
EXECUTION ENVIRONMENT
PARTITIONS AND LEVEL OF PARALLELISM
The number of partitions of an RDD is (roughly) the number of tasks.
Partitions are the hint to size jobs.
Tasks are the smallest unit of execution.
Tasks belong to TaskSets.
TaskSets belong to Stages.
Stages belong to Jobs.
Jobs, stages, and tasks are displayed in the web UI.
We're going to talk about the web UI later.
PARTITIONS AND LEVEL OF PARALLELISM (CONT'D)
In local[*] mode, the number of partitions equals the number of cores (the default in spark-shell).
scala> sc.defaultParallelism
res0: Int = 8
scala> sc.master
res1: String = local[*]
Not necessarily true when you use local or local[n] master URLs.
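The core count behind local[*] is just what the JVM reports; as a non-Spark analogue of what sc.defaultParallelism reflects in that mode:

```scala
// local[*] means "use as many threads as logical cores"; this is the
// JVM-reported core count (a plain-JVM analogue, not a Spark call).
val cores = Runtime.getRuntime.availableProcessors
println(s"logical cores: $cores")
```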
LEVEL OF PARALLELISM IN SPARK CLUSTERS
TaskScheduler controls the level of parallelism.
DAGScheduler, TaskScheduler, and SchedulerBackend work in tandem.
DAGScheduler manages a "DAG" of RDDs (aka RDD lineage).
SchedulerBackends manage TaskSets.
DAGSCHEDULER
TASKSCHEDULER AND SCHEDULERBACKEND
RDD LINEAGE
RDD lineage is a graph of RDD dependencies.
Use toDebugString to know the lineage.
Be careful with the hops - they introduce shuffle barriers.
Why is the RDD lineage important?
This is the R in RDD - resiliency.
But deep lineage costs processing time, doesn't it?
Persist (aka cache) it early and often!
RDD LINEAGE - DEMO
What does the following do?
val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
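Ignoring distribution, the same pipeline run eagerly on a local Scala collection shows what it computes - pairs keyed by parity, grouped by that key:

```scala
// Local-collections analogue of the RDD pipeline above: key each number
// by parity (n % 2), then group by that key.
val grouped = (0 to 10).map(n => (n % 2, n)).groupBy(_._1)

val evens = grouped(0).map(_._2) // 0, 2, 4, 6, 8, 10
val odds  = grouped(1).map(_._2) // 1, 3, 5, 7, 9
```

The difference in Spark: groupBy on an RDD redistributes the pairs by key across partitions, which is the shuffle the next slide counts.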
RDD LINEAGE - DEMO (CONT'D)
How many stages are there?
// val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
scala> rdd.toDebugString
res2: String =
(2) ShuffledRDD[3] at groupBy at <console>:24 []
 +-(2) MapPartitionsRDD[2] at groupBy at <console>:24 []
    |  MapPartitionsRDD[1] at map at <console>:24 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:24 []
Nothing happens yet - processing time-wise.
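A rough way to read such output (a heuristic sketch, assuming the indentation convention shown above): each +- hop marks a shuffle boundary, and stages = hops + 1.

```scala
// toDebugString-style lines for the demo pipeline; every "+-" hop marks
// a shuffle boundary, and each boundary adds one stage.
val debugLines = Seq(
  "(2) ShuffledRDD[3] at groupBy at <console>:24 []",
  " +-(2) MapPartitionsRDD[2] at groupBy at <console>:24 []",
  " |  MapPartitionsRDD[1] at map at <console>:24 []",
  " |  ParallelCollectionRDD[0] at parallelize at <console>:24 []")

val stages = debugLines.count(_.contains("+-")) + 1
// stages: 2 - one before the groupBy shuffle, one after
```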
SPARK CLUSTERS
Spark supports the following clusters:
one-JVM local cluster
Spark Standalone
Apache Mesos
Hadoop YARN
You use --master to select the cluster.
spark://hostname:port is for Spark Standalone.
And you know the local master URLs, don't you?
local, local[n], or local[*]
MANDATORY PROPERTIES OF SPARK APP
Your task: Fill in the gaps below.
Any Spark application must specify the application name (aka appName) and the master URL.
Demo time! => spark-shell is a Spark app, too!
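Outside the shell, the two mandatory properties are typically set on a SparkConf (a sketch using Spark's public API; spark-shell sets both for you - appName "Spark shell" and the master from --master or local[*]):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Both mandatory properties set explicitly.
val conf = new SparkConf()
  .setAppName("My Spark app") // aka spark.app.name
  .setMaster("local[*]")      // aka spark.master
val sc = new SparkContext(conf)
```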
SPARK STANDALONE CLUSTER
The built-in Spark cluster
Start the standalone Master with sbin/start-master
Use -h to control the host name to bind to.
Start a standalone Worker with sbin/start-slave
Run a single worker per machine (aka node)
http://localhost:8080/ = web UI for the Standalone cluster
Don't confuse it with the web UI of a Spark application
Demo time! => Run Standalone cluster
SPARK-SHELL: SPARK REPL APPLICATION
SPARK-SHELL AND SPARK STANDALONE
You can connect to Spark Standalone using spark-shell through the --master command-line option.
Demo time! => we've already started the Standalone cluster.
WEBUI: WEB USER INTERFACE FOR SPARK APPLICATION
WEBUI
It is available under http://localhost:4040/
You can disable it using the spark.ui.enabled flag.
All the events are captured by Spark listeners
You can register your own Spark listener.
Demo time! => webUI in action with different master URLs
QUESTIONS?
- Visit Jacek Laskowski's blog
- Follow @jaceklaskowski at twitter
- Use Jacek's projects at GitHub
- Read Mastering Apache Spark notes