Post on 08-Jan-2017
BENEATH RDD IN APACHE SPARK
USING SPARK-SHELL AND WEBUI
Jacek Laskowski / @jaceklaskowski / GitHub / Mastering Apache Spark Notes
Jacek Laskowski is an independent consultant. Contact me at jacek@japila.pl or @JacekLaskowski.
Delivering Development Services | Consulting | Training
Building and leading development teams
Mostly Apache Spark and Scala these days
Leader of Warsaw Scala Enthusiasts and Warsaw Apache Spark
Java Champion
Blogger at blog.jaceklaskowski.pl and jaceklaskowski.pl
SPARKCONTEXT: THE LIVING SPACE FOR RDDS
SPARKCONTEXT AND RDDS
An RDD belongs to one and only one Spark context.
You cannot share RDDs between contexts.
SparkContext tracks how many RDDs were created.
You may see it in the toString output.
SPARKCONTEXT AND RDDS (2)
RDD: RESILIENT DISTRIBUTED DATASET
CREATING RDD - SC.PARALLELIZE
sc.parallelize(col, slices) to distribute a local collection of any elements.
scala> val rdd = sc.parallelize(0 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24
Alternatively, sc.makeRDD(col, slices)
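The slices parameter decides how the elements are split up. As a rough local sketch (an assumption about the boundary arithmetic, not a call into Spark; sliceCollection is a hypothetical helper), the split can be modelled like this:

```scala
// Simplified local sketch of how a collection is split into `numSlices`
// partitions: slice i covers indices [i*n/numSlices, (i+1)*n/numSlices).
def sliceCollection[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] =
  (0 until numSlices).map { i =>
    val start = (i * seq.length) / numSlices
    val end = ((i + 1) * seq.length) / numSlices
    seq.slice(start, end)
  }

val parts = sliceCollection(0 to 10, 4)
// 11 elements over 4 slices: sizes 2, 3, 3, 3
```

Integer division makes the slice sizes differ by at most one element, so work is spread evenly across partitions.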
CREATING RDD - SC.RANGE
sc.range(start, end, step, slices) to create an RDD of long numbers.
scala> val rdd = sc.range(0, 100)
rdd: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[14] at range at <console>:24
CREATING RDD - SC.TEXTFILE
sc.textFile(name, partitions) to create an RDD of lines from a file.
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[16] at textFile at <console>:24
CREATING RDD - SC.WHOLETEXTFILES
sc.wholeTextFiles(name, partitions) to create an RDD of pairs of a file name and its content from a directory.
scala> val rdd = sc.wholeTextFiles("tags")
rdd: org.apache.spark.rdd.RDD[(String, String)] = tags MapPartitionsRDD[18] at wholeTextFiles at <console>:24
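What the resulting pairs look like can be sketched on a single machine (a plain-Scala analogue, not Spark's implementation; wholeTextFilesLocal is a hypothetical helper):

```scala
import java.io.File
import scala.io.Source

// Single-machine analogue of sc.wholeTextFiles: build
// (file path, file content) pairs from a directory.
def wholeTextFilesLocal(dir: String): Seq[(String, String)] =
  new File(dir).listFiles.filter(_.isFile).toSeq.map { f =>
    val src = Source.fromFile(f)
    try { (f.getPath, src.mkString) } finally src.close()
  }
```

Unlike sc.textFile, each file becomes one element, so it suits many small files rather than a few large ones.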
There are many more advanced functions in SparkContext to create RDDs.
PARTITIONS (AND SLICES)
Did you notice the words slices and partitions as parameters?
Partitions (aka slices) are the level of parallelism.
We're going to talk about the level of parallelism later.
CREATING RDD - DATAFRAMES
RDDs are so last year :-) Use DataFrames... early and often!
A DataFrame is a higher-level abstraction over RDDs and semi-structured data.
DataFrames require a SQLContext.
FROM RDDS TO DATAFRAMES
scala> val rdd = sc.parallelize(0 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at <console>:24

scala> val df = rdd.toDF
df: org.apache.spark.sql.DataFrame = [_1: int]

scala> val df = rdd.toDF("numbers")
df: org.apache.spark.sql.DataFrame = [numbers: int]
...AND VICE VERSA
scala> val rdd = sc.parallelize(0 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at <console>:24

scala> val df = rdd.toDF("numbers")
df: org.apache.spark.sql.DataFrame = [numbers: int]

scala> df.rdd
res23: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[70] at rdd at <console>:29
CREATING DATAFRAMES - SQLCONTEXT.CREATEDATAFRAME
sqlContext.createDataFrame(rowRDD, schema)
CREATING DATAFRAMES - SQLCONTEXT.READ
sqlContext.read is the modern yet experimental way.
sqlContext.read.format(f).load(path), where f is:
jdbc
json
orc
parquet
text
EXECUTION ENVIRONMENT
PARTITIONS AND LEVEL OF PARALLELISM
The number of partitions of an RDD is (roughly) the number of tasks.
Partitions are the hint to size jobs.
Tasks are the smallest unit of execution.
Tasks belong to TaskSets.
TaskSets belong to Stages.
Stages belong to Jobs.
Jobs, stages, and tasks are displayed in the web UI.
We're going to talk about the web UI later.
PARTITIONS AND LEVEL OF PARALLELISM (CONT'D)
In local[*] mode, the number of partitions equals the number of cores (the default in spark-shell).
scala> sc.defaultParallelism
res0: Int = 8
scala> sc.master
res1: String = local[*]
Not necessarily true when you use local or local[n] master URLs.
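The core count behind local[*] is just what the JVM reports; as a non-Spark analogue of what sc.defaultParallelism reflects in that mode:

```scala
// local[*] means "use as many threads as logical cores"; this is the
// JVM-reported core count (a plain-JVM analogue, not a Spark call).
val cores = Runtime.getRuntime.availableProcessors
println(s"logical cores: $cores")
```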
LEVEL OF PARALLELISM IN SPARK CLUSTERS
TaskScheduler controls the level of parallelism.
DAGScheduler, TaskScheduler, and SchedulerBackend work in tandem.
DAGScheduler manages a "DAG" of RDDs (aka RDD lineage).
SchedulerBackends manage TaskSets.
DAGSCHEDULER
TASKSCHEDULER AND SCHEDULERBACKEND
RDD LINEAGE
RDD lineage is a graph of RDD dependencies.
Use toDebugString to know the lineage.
Be careful with the hops - they introduce shuffle barriers.
Why is the RDD lineage important?
This is the R in RDD - resiliency.
But deep lineage costs processing time, doesn't it?
Persist (aka cache) it early and often!
RDD LINEAGE - DEMO
What does the following do?
val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
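Ignoring distribution, the same pipeline run eagerly on a local Scala collection shows what it computes - pairs keyed by parity, grouped by that key:

```scala
// Local-collections analogue of the RDD pipeline above: key each number
// by parity (n % 2), then group by that key.
val grouped = (0 to 10).map(n => (n % 2, n)).groupBy(_._1)

val evens = grouped(0).map(_._2) // 0, 2, 4, 6, 8, 10
val odds  = grouped(1).map(_._2) // 1, 3, 5, 7, 9
```

The difference in Spark: groupBy on an RDD redistributes the pairs by key across partitions, which is the shuffle the next slide counts.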
RDD LINEAGE - DEMO (CONT'D)
How many stages are there?
// val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
scala> rdd.toDebugString
res2: String =
(2) ShuffledRDD[3] at groupBy at <console>:24 []
 +-(2) MapPartitionsRDD[2] at groupBy at <console>:24 []
    |  MapPartitionsRDD[1] at map at <console>:24 []
    |  ParallelCollectionRDD[0] at parallelize at <console>:24 []
Nothing happens yet - processing time-wise.
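A rough way to read such output (a heuristic sketch, assuming the indentation convention shown above): each +- hop marks a shuffle boundary, and stages = hops + 1.

```scala
// toDebugString-style lines for the demo pipeline; every "+-" hop marks
// a shuffle boundary, and each boundary adds one stage.
val debugLines = Seq(
  "(2) ShuffledRDD[3] at groupBy at <console>:24 []",
  " +-(2) MapPartitionsRDD[2] at groupBy at <console>:24 []",
  " |  MapPartitionsRDD[1] at map at <console>:24 []",
  " |  ParallelCollectionRDD[0] at parallelize at <console>:24 []")

val stages = debugLines.count(_.contains("+-")) + 1
// stages: 2 - one before the groupBy shuffle, one after
```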
SPARK CLUSTERS
Spark supports the following clusters:
one-JVM local cluster
Spark Standalone
Apache Mesos
Hadoop YARN
You use --master to select the cluster.
spark://hostname:port is for Spark Standalone.
And you know the local master URLs, don't you?
local, local[n], or local[*]
MANDATORY PROPERTIES OF SPARK APP
Your task: Fill in the gaps below.
Any Spark application must specify the application name (aka appName) and the master URL.
Demo time! => spark-shell is a Spark app, too!
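Outside the shell, the two mandatory properties are typically set on a SparkConf (a sketch using Spark's public API; spark-shell sets both for you - appName "Spark shell" and the master from --master or local[*]):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Both mandatory properties set explicitly.
val conf = new SparkConf()
  .setAppName("My Spark app") // aka spark.app.name
  .setMaster("local[*]")      // aka spark.master
val sc = new SparkContext(conf)
```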
SPARK STANDALONE CLUSTER
The built-in Spark cluster
Start the standalone Master with sbin/start-master
Use -h to control the host name to bind to.
Start a standalone Worker with sbin/start-slave
Run a single worker per machine (aka node)
http://localhost:8080/ = web UI for the Standalone cluster
Don't confuse it with the web UI of a Spark application
Demo time! => Run Standalone cluster
SPARK-SHELL: SPARK REPL APPLICATION
SPARK-SHELL AND SPARK STANDALONE
You can connect to Spark Standalone using spark-shell through the --master command-line option.
Demo time! => we've already started the Standalone cluster.
WEBUI: WEB USER INTERFACE FOR SPARK APPLICATION
WEBUI
It is available under http://localhost:4040/
You can disable it using the spark.ui.enabled flag.
All the events are captured by Spark listeners
You can register your own Spark listener.
Demo time! => webUI in action with different master URLs
QUESTIONS?
- Visit Jacek Laskowski's blog
- Follow @jaceklaskowski at twitter
- Use Jacek's projects at GitHub
- Read Mastering Apache Spark notes