Apache Spark
Document Analysis Course (Fall 2015, Scott Sanner)
Zahra Iman
Some slides from Matei Zaharia (UC Berkeley / MIT) and Harold Liu

Reminder
» SparkConf, JavaSparkContext
» RDD: Resilient Distributed Datasets
  - A representation of data coming into your system, in an object format
  - Relies on lineage (in case of failure, recover)
» Transformations
  - What you do to an RDD to get another RDD (open file, filter)
» Actions
  - Asking for an answer the system needs to provide (count, ...)
» Lazy evaluation
  - Work is only done when there is an actual action to perform

What is Spark? A Growing Stack
Fast and expressive cluster computing system, compatible with Apache Hadoop
Improves efficiency through:
» General execution graphs
» In-memory storage
Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell
Up to 10x faster on disk, 100x in memory; 2-5x less code
The stack: Shark (SQL), Spark Streaming (real-time), GraphX (graph), MLbase (machine learning)
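Spark's transformation/action split can be illustrated by analogy with Java 8 Streams, which are also lazily evaluated pipelines. This is a sketch of the idea only, not Spark itself; the class and method names below are invented for illustration:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazyDemo {
    // Returns {evaluations before the action, evaluations after, result of the action}
    public static long[] runDemo() {
        AtomicInteger evaluations = new AtomicInteger();
        // "Transformations": this only builds a pipeline; nothing is computed yet
        Stream<Integer> pipeline = List.of(1, 2, 3, 4).stream()
                .map(x -> { evaluations.incrementAndGet(); return x * 2; })
                .filter(x -> x > 2);
        long before = evaluations.get(); // still 0: the pipeline is lazy
        // "Action": count() forces the whole pipeline to run
        long n = pipeline.count();
        return new long[]{before, evaluations.get(), n};
    }

    public static void main(String[] args) {
        long[] r = runDemo();
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // prints "0 4 3"
    }
}
```

As with an RDD, defining the pipeline costs nothing; only the terminal operation (the "action") triggers evaluation of every step.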
Why a New Programming Model?
» Easy to use; composes well for large applications
» (Implementation) Higher-level computational model
» Fast data sharing and DAGs make the engine more efficient and much simpler for the end users
» Spark's goal was to generalize MapReduce to support new apps within the same engine

A Brief History: RDD
An RDD is an immutable, partitioned, logical collection of records
Spark enabled distributed data processing through functional transformations on distributed collections of data (RDDs)
» Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache
» Parallel operations / Actions (return a result to the driver): reduce, collect, count, save, lookup(key)

RDD Essentials
Transformations create a new dataset from an existing one
All transformations in Spark are lazy:
» They do not compute their results right away
» They remember the transformations applied to some base datasets
» This lets Spark optimize the required calculations and recover from lost data partitions

DataFrame
A distributed collection of data organized into named columns
» Conceptually equivalent to a table in a relational database or a data frame in R/Python
» Under the hood, a DataFrame contains an RDD of Row objects with additional schema information about the types
» Can incorporate SQL while working with DataFrames, using Spark SQL
» Can be constructed from a wide array of sources: structured data files, tables in Hive, external databases, existing RDDs

RDD vs. DataFrame
Goal of the new DataFrame API: enable wider audiences beyond Big Data engineers to leverage the power of distributed processing
» Provides a way to operate on DataFrames using existing RDD transformations like map()
However, a DataFrame provides additional capabilities:
» Register a DataFrame as a temporary table to query it
» Functions with behavior similar to their SQL counterparts, like select()
» Cache tables
» Queries written in SQL return DataFrames, which allows Spark to run certain optimizations on the finalized query, since a DataFrame carries additional metadata due to its tabular format
» Can process JSON data, Parquet data, and HiveQL data at the same time by loading them into DataFrames
DataFrame Example
  JavaSparkContext sc = ...; // An existing JavaSparkContext.
  SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
  DataFrame df = sqlContext.read().json("examples/src/main/resources/people.json");
  // Displays the content of the DataFrame to stdout
  df.show();

DataFrame Operations
  // Print the schema in a tree format
  df.printSchema();
  // Select only the "name" column
  df.select("name").show();
  // Select everybody, but increment the age by 1
  df.select(df.col("name"), df.col("age").plus(1)).show();
  // Select people older than 21
  df.filter(df.col("age").gt(21)).show();
  // Count people by age
  df.groupBy("age").count().show();

Running SQL Queries Programmatically
  SQLContext sqlContext = ...; // An existing SQLContext
  DataFrame df = sqlContext.sql("SELECT * FROM table");

  JavaRDD<Person> people = ...;
  // Apply a schema to an RDD of JavaBeans and register it as a table.
  DataFrame schemaPeople = sqlContext.createDataFrame(people, Person.class);
  schemaPeople.registerTempTable("people");
  // SQL can be run over RDDs that have been registered as tables.
  DataFrame teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19");

DataFrame Supported Operators
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, save, ...

I/O Process in Spark
Write as a text file in one partition?
» By default, Spark creates one partition for each block of the file
» Make the number of partitions equal to n times the number of cores in the cluster, so all partitions are processed in parallel and resources are used equally
What if the data does not fit in memory to write in one partition?
» Use multiple partitions
Different formats of input/output files: Parquet files, CSV files
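The partition-sizing rule above (number of partitions = n times the number of cores) can be sketched as a tiny helper. The names and the multiplier value here are assumptions for illustration, not a Spark API:

```java
public class PartitionSizing {
    // Heuristic from the slide: make #partitions a small multiple n of the
    // cluster's core count, so every core stays busy and partitions stay
    // evenly sized. The multiplier is a tuning knob, not a fixed constant.
    public static int suggestedPartitions(int coresInCluster, int multiplier) {
        return coresInCluster * multiplier;
    }

    public static void main(String[] args) {
        // Stand-in for the cluster's total core count on a single machine
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(suggestedPartitions(cores, 3));
    }
}
```

For example, a 16-core cluster with a multiplier of 3 gives 48 partitions, which matches the spark.default.parallelism value of 48 in the configuration example shown later in these notes.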
Parquet Files
A columnar format supported by many other data processing systems
Spark SQL provides support for both reading and writing Parquet files, and automatically preserves the schema of the original data

Loading & Writing Data
  // sqlContext from the previous example is used in this example.
  DataFrame schemaPeople = ...; // The DataFrame from the previous example.
  // DataFrames can be saved as Parquet files, maintaining the schema information.
  schemaPeople.write().parquet("people.parquet");
  // Read in the Parquet file created above. Parquet files are self-describing, so the schema is preserved.
  // The result of loading a Parquet file is also a DataFrame.
  DataFrame parquetFile = sqlContext.read().parquet("people.parquet");

Performance Tuning
Partitions
» Fragmentation enables Spark to execute in parallel
» The level of fragmentation is a function of the number of partitions in your RDD
Caching Data in Memory
» Spark SQL can cache tables using an in-memory columnar format
  DataFrame schemaPeople = sqlContext.createDataFrame(people, Person.class);
  // Cache the DataFrame in memory
  schemaPeople.cache();
  sqlContext.cacheTable("tableName");
Serialization (something transparent that Spark does)
» Avoid writing back and forth: translate code into an ideally compressed format for transfer over the network => Kryo serialization
Other configuration options: see the Spark documentation

Example Config File
  vi spark/conf/spark-defaults.conf
  spark.eventLog.enabled true
  spark.serializer org.apache.spark.serializer.KryoSerializer
  spark.shuffle.consolidateFiles true
  spark.kryo.referenceTracking false
  spark.driver.extraJavaOptions "-XX:+UseCompressedOops"
  spark.executor.extraJavaOptions "-XX:+UseCompressedOops"
  spark.default.parallelism 48
  spark.driver.memory 2560M

Spark vs. MapReduce Comparison - The Bottom Line
Hadoop MapReduce is meant for data that does not fit in memory, whereas Apache Spark has better performance for data that fits in memory, particularly on dedicated clusters.
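Why does a columnar format like Parquet (and Spark SQL's in-memory columnar cache) help? A toy sketch in plain Java, not Parquet's real encoding: when a query touches one column, a columnar layout reads only that column's values, while a row layout must touch every field of every row.

```java
public class ColumnarDemo {
    // Row layout: each record is stored as {name, age}.
    // Summing "age" still touches every field, because fields of a row
    // are stored (and read) together.
    public static int fieldsTouchedRowLayout(Object[][] rows) {
        int touched = 0;
        for (Object[] row : rows) {
            touched += row.length; // the whole row is read
        }
        return touched;
    }

    // Column layout: the "age" column is stored contiguously on its own,
    // so a query over ages reads only those values.
    public static int fieldsTouchedColumnLayout(int[] ageColumn) {
        return ageColumn.length;
    }
}
```

With three rows of two fields, the row layout touches 6 values to answer an age-only query; the columnar layout touches 3. Real columnar formats add further wins (per-column compression, skipping via min/max statistics) not modeled here.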
Hadoop's processing model is on-disk (disk-based parallelization), while Spark can be in-memory or on-disk
Apache Spark follows a DAG (Directed Acyclic Graph) execution engine
» In a distributed system, a conventional program would not work as the data is split across nodes; a DAG is a programming style for distributed systems
» The DAG scheduler divides operators into stages of tasks; a stage is comprised of tasks based on partitions of the input data
» The DAG scheduler pipelines operators together
» The final result of a DAG scheduler is a set of stages
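The stage-building rule above can be sketched in plain Java. This is a simplified illustration of the idea, not Spark's actual DAGScheduler: narrow operators are pipelined into the current stage, and each wide (shuffle) operator starts a new stage.

```java
import java.util.ArrayList;
import java.util.List;

public class StageSplitter {
    // Splits a linear chain of operators into stages: a wide (shuffle)
    // operator closes the running stage and begins a new one; narrow
    // operators are pipelined together into the same stage.
    // (Simplification: Spark actually splits a wide op across the stage
    // boundary, with its shuffle-write side in the earlier stage.)
    public static List<List<String>> toStages(List<String> ops, List<Boolean> isWide) {
        List<List<String>> stages = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (int i = 0; i < ops.size(); i++) {
            if (isWide.get(i) && !current.isEmpty()) {
                stages.add(current);           // shuffle boundary: close the stage
                current = new ArrayList<>();
            }
            current.add(ops.get(i));
        }
        if (!current.isEmpty()) stages.add(current);
        return stages;
    }
}
```

For the chain map -> filter -> reduceByKey -> map, only reduceByKey needs a shuffle, so the scheduler produces two stages: [map, filter] and [reduceByKey, map].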
DAG Example

Hadoop MapReduce vs. Tez vs. Spark

License
» MapReduce: Open Source, Apache 2.0, version 2.x
» Tez: Open Source, Apache 2.0, version 0.x
» Spark: Open Source, Apache 2.0, version 1.x

Processing Model
» MapReduce: On-disk (disk-based parallelization), Batch
» Tez: On-disk (disk-based parallelization), Batch, Interactive
» Spark: On-disk, In-memory; Batch, Interactive, Streaming (Near Real-Time)

Language Written In
» MapReduce: Java
» Tez: Java
» Spark: Scala

API
» MapReduce: [Java, Python, Scala], User-Facing
» Tez: [Java, ISV/Engine/Tool builder]
» Spark: [Scala, Java, Python], User-Facing

Libraries
» MapReduce: None, separate tools
» Tez: None
» Spark: [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]

Installation
» MapReduce: Bound to Hadoop
» Tez: Bound to Hadoop
» Spark: Isn't bound to Hadoop

Ease of Use
» MapReduce: Difficult to program, needs abstractions; no interactive mode except Hive
» Tez: Difficult to program; no interactive mode except Hive
» Spark: Easy to program, no need of abstractions; interactive mode

Compatibility
» Compatibility to data types and data sources is the same for all three

YARN Integration
» MapReduce: YARN application
» Tez: Ground-up YARN application
» Spark: Spark is moving towards YARN

Conclusion
Why did we need Spark after Hadoop?
» Handles batch, interactive, and real-time within a single framework
» Easier to code: programming at a higher level of abstraction
» More general: map/reduce is just one set of supported constructs
Spark's important data structures and I/O file formats: DataFrames, Parquet files
Performance tuning of Spark: change the default configurations in Spark's default config file
Computational model of Spark: Hadoop for very big datasets, Spark for when the data fits in memory
Spark User Community: 1000+ meetup members, 80+ contributors, 24 companies contributing