Distributed DataFrame on Spark: Simplifying Big Data For The Rest Of Us

1 DATA INTELLIGENCE FOR ALL Distributed DataFrame on Spark: Simplifying Big Data For The Rest Of Us Christopher Nguyen, PhD Co-Founder & CEO

2 Agenda 1. Challenges & Motivation 2. DDF Overview 3. DDF Design & Architecture 4. Demo

3 Christopher Nguyen, PhD
Adatao Inc. Co-Founder & CEO
- Former Engineering Director of Google Apps (Google Founders Award)
- Former Professor and Co-Founder of the Computer Engineering program at HKUST
- PhD Stanford, BS U.C. Berkeley Summa cum Laude
- Extensive experience building technology companies that solve enterprise challenges

4 How Have We Defined Big Data?
Old Definition: Big Data has Problems
- Huge Volume
- High Velocity
- Great Variety
New Definition: BIG DATA + BIG COMPUTE = Opportunitie$
- (Machine) Learn from Data

5 Current Big Data World
[Diagram: a stack of Presentation, Query & Compute, and Data layers, split into Batch and Low-Latency + Real-Time. On MapReduce: Giraph, Hive SQL, Pig, Cascading, Java API. On Storm: SummingBird, Scalding, Trident, Java API. On Spark: Shark SQL, SparkR, PySpark, Scala + Java API, DataFrame, RDD, DStream, GraphX. Presentation: web-based IPython / R-Studio with R, Python, Scala + Java. Data: HDFS / YARN.]

6 Maintenance is too painful!

CREATE EXTERNAL TABLE page_view(viewtime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User',
    country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '<hdfs_location>';

INSERT OVERWRITE TABLE page_view PARTITION(dt=' ', country='us')
SELECT pvs.viewtime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';

public static DistributedRowMatrix runJob(Path markovPath, Vector diag, Path outputPath, Path tmpPath)
    throws IOException, ClassNotFoundException, InterruptedException {
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(markovPath.toUri(), conf);
  markovPath = fs.makeQualified(markovPath);
  outputPath = fs.makeQualified(outputPath);
  Path vectorOutputPath = new Path(outputPath.getParent(), "vector");
  VectorCache.save(new IntWritable(Keys.DIAGONAL_CACHE_INDEX), diag, vectorOutputPath, conf);
  // set up the job itself
  Job job = new Job(conf, "VectorMatrixMultiplication");
  job.setInputFormatClass(SequenceFileInputFormat.class);
  job.setOutputKeyClass(IntWritable.class);

- sample and copy data to local
- lrfit <- glm( using ~ + age + education)
- Export model
- export model to XML
- Deploy model using java

7 I have a dream

8 In the Ideal World
data = load houseprice
data dropna
data transform (duration = now - begin)
data train glm(price, bedrooms)
It just works!!!

9 Make Big-Data API simple & accessible
[Figure: word cloud of Java, Python, MapReduce, HBase, Hive, PySpark, SparkR, RHadoop, monads, monoids]

10 Design Principles

11 DDF is Simple
DDFManager manager = DDFManager.get("spark");
DDF ddf = manager.sql2ddf("select * from airline");

12 It's like a table
ddf.views.project("origin, arrdelay");
ddf.groupby("dayofweek, avg(arrdelay)");
ddf.join(otherddf)

13 It's just nice like this
ddf.dropna()
ddf.getnumrows()
ddf.getsummary()

14 Focus on analytics, not MR
Simple, high-level, data-science oriented APIs, powered by Spark
RDD:
val filerdd = spark.textFile("hdfs://...")
val counts = filerdd.flatMap(line => line.split(" ")).map(arrdelay_byday => (arrdelay_byday.split(",")(0), arrdelay_byday.split(",")(1))).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
DDF:
ddf = manager.load("spark://ddf-runtime/ddf-db/airline").aggregate("dayofweek, sum(arrdelay)")
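To make concrete what the one-line aggregate on this slide expresses, here is a minimal sketch of the same computation (group rows by day of week, sum the arrival delay) over an in-memory list. It uses plain Java collections, not Spark or the DDF API; the `Flight` class, the sample values, and `sumDelayByDay` are hypothetical names introduced only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: plain Java collections, not Spark or the DDF API.
class AggregateSketch {
    // Hypothetical row type holding just the two columns named on the slide.
    static final class Flight {
        final int dayOfWeek;
        final double arrDelay;

        Flight(int dayOfWeek, double arrDelay) {
            this.dayOfWeek = dayOfWeek;
            this.arrDelay = arrDelay;
        }
    }

    // Same idea as "dayofweek, sum(arrdelay)": one summed delay per day of week.
    static Map<Integer, Double> sumDelayByDay(List<Flight> flights) {
        return flights.stream().collect(Collectors.groupingBy(
                f -> f.dayOfWeek,
                TreeMap::new,
                Collectors.summingDouble(f -> f.arrDelay)));
    }

    public static void main(String[] args) {
        // Made-up sample rows, purely for demonstration.
        List<Flight> flights = Arrays.asList(
                new Flight(1, 10.0), new Flight(1, 5.0), new Flight(2, -3.0));
        System.out.println(sumDelayByDay(flights));
    }
}
```

The point of the slide is that the grouping key and the aggregate are stated declaratively in one string, while the engine decides how to distribute the work.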

15 Quick Access to a Rich Set of Familiar ML Idioms
Data Wrangling:
ddf.setmutable(true)
ddf.dropna()
ddf.transform("speed =distance/duration")
Model Building & Validation:
ddf.lm(0.1, 10)
ddf.roc(testddf)
Model Deployment:
manager.loadmodel(lm)
lm.predict(point)
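As a rough illustration of the data-wrangling step, the sketch below drops rows with missing values and then derives speed = distance / duration. It is plain Java over an in-memory list and does not call the DDF API; the `Row` class and the `speeds` method are hypothetical, and a zero duration is not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: plain Java, not the DDF API.
class WranglingSketch {
    // Hypothetical row; a null field marks a missing value.
    static final class Row {
        final Double distance;
        final Double duration;

        Row(Double distance, Double duration) {
            this.distance = distance;
            this.duration = duration;
        }
    }

    // Drop rows with any missing value, then compute speed = distance / duration.
    static List<Double> speeds(List<Row> rows) {
        List<Double> out = new ArrayList<>();
        for (Row r : rows) {
            if (r.distance == null || r.duration == null) {
                continue; // the "dropna" step
            }
            out.add(r.distance / r.duration); // the "transform" step
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample rows, purely for demonstration.
        List<Row> rows = Arrays.asList(
                new Row(100.0, 2.0), new Row(null, 1.0), new Row(30.0, null));
        System.out.println(speeds(rows));
    }
}
```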

16 Seamless Integration with MLlib
//plug-in algorithm
linearregressionwithsgd = org.apache.spark.mllib.regression.LinearRegressionWithSGD
//run algorithm
ddf.ml.train("linearregressionwithsgd", 10, 0.1, 0.1)
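The slide maps a short name to a fully qualified class and then trains by that name. One generic way such a lookup could work is sketched below: a registry from name to class name, with the class instantiated reflectively at call time. This is an assumption about the general pattern, not DDF's actual mechanism; `PluginSketch`, `Algorithm`, and `MeanAlgorithm` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a generic name-to-class plug-in lookup, not DDF's implementation.
class PluginSketch {
    // Hypothetical contract every plug-in algorithm implements.
    interface Algorithm {
        double run(double[] data, double... params);
    }

    // Hypothetical plug-in: returns the mean of the data and ignores params.
    static final class MeanAlgorithm implements Algorithm {
        public MeanAlgorithm() {}

        @Override
        public double run(double[] data, double... params) {
            double sum = 0.0;
            for (double d : data) {
                sum += d;
            }
            return data.length == 0 ? 0.0 : sum / data.length;
        }
    }

    // Short name -> fully qualified class name, as a config entry might declare it.
    private final Map<String, String> registry = new HashMap<>();

    void register(String name, String className) {
        registry.put(name, className);
    }

    // Resolve the short name, instantiate the class reflectively, and run it.
    double train(String name, double[] data, double... params) {
        String className = registry.get(name);
        if (className == null) {
            throw new IllegalArgumentException("Unknown algorithm: " + name);
        }
        try {
            Algorithm algo = (Algorithm) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
            return algo.run(data, params);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load " + className, e);
        }
    }

    public static void main(String[] args) {
        PluginSketch plugins = new PluginSketch();
        plugins.register("mean", MeanAlgorithm.class.getName());
        System.out.println(plugins.train("mean", new double[] {1.0, 2.0, 3.0}, 0.1));
    }
}
```

Because callers only know the short name, a new algorithm can be added by registering another class, without changing the calling code.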

17 Easily Collaborate with Others Can I see your Data? DDF://com.adatao/airline

18 DDF on Multiple Languages

19 DDF Architecture
[Diagram: a Data Scientist/Engineer has access to the DDF Manager (with its Config Handler) and owns DDFs. Each DDF is backed by ETL Handlers, Statistics Handlers, Representation Handlers, and ML Handlers.]
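The handler-based layout in this diagram can be read as delegation: the frame object exposes the operations and forwards each one to a swappable handler. The sketch below shows that pattern in plain Java under that assumption. The interfaces and the `Frame` class are hypothetical and are not taken from the DDF source.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Illustrative only: hypothetical interfaces, not the DDF source code.
class HandlerSketch {
    // One interface per handler category; implementations can be swapped.
    interface EtlHandler {
        List<Double> dropNa(List<Double> column);
    }

    interface StatisticsHandler {
        double mean(List<Double> column);
    }

    // Default handlers, written as lambdas for brevity.
    static final EtlHandler DEFAULT_ETL =
            column -> column.stream().filter(Objects::nonNull).collect(Collectors.toList());
    static final StatisticsHandler DEFAULT_STATS =
            column -> column.stream().mapToDouble(Double::doubleValue).average().orElse(Double.NaN);

    // The frame holds data and delegates every operation to its handlers.
    static final class Frame {
        private final List<Double> column;
        private final EtlHandler etl;
        private final StatisticsHandler stats;

        Frame(List<Double> column, EtlHandler etl, StatisticsHandler stats) {
            this.column = column;
            this.etl = etl;
            this.stats = stats;
        }

        Frame dropNa() {
            return new Frame(etl.dropNa(column), etl, stats);
        }

        double mean() {
            return stats.mean(column);
        }
    }

    public static void main(String[] args) {
        // Made-up single-column data with one missing value.
        Frame frame = new Frame(Arrays.asList(1.0, null, 3.0), DEFAULT_ETL, DEFAULT_STATS);
        System.out.println(frame.dropNa().mean());
    }
}
```

Keeping each concern behind its own interface is what lets a component be replaced or tested in isolation.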

20 Demo
Cluster configuration: 8 nodes x 8 cores x 30 GB RAM
Data size: 12 GB / 120 million rows

21 DDF offers
- Native R Data.Frame Experience: Table-like Abstraction on Top of Big Data
- Focus on Analytics, not MapReduce: Simple, Data-Science Oriented APIs, Powered by Spark
- Easily Test & Deploy New Components: Pluggable Components by Design
- Collaborate Seamlessly & Efficiently: Mutable & Sharable
- Work with APIs Using Preferred Languages: Multi Language Support (Java, Scala, R, Python)

22 Example:
[Diagram: a Business Analyst, a Data Scientist, and a Data Engineer work through a Web Browser, R-Studio, and Python; these connect through the PI Client, PA Client, and DDF Client to an API layer on top of HDFS.]

23 To learn more about Adatao & DDF, contact us
