Introduction to Big Data with Apache Spark (UC Berkeley)




This Lecture
» So What is Data Science?
» Doing Data Science
» Data Preparation
» Roles

What is Data Science?
Data Science aims to derive knowledge from big data, efficiently and intelligently.
Data Science encompasses the set of activities, tools, and methods that enable data-driven activities in science, business, medicine, and government.
http://www.oreilly.com/data/free/what-is-data-science.csp

Data Science: One Definition
Figure (Venn diagram) labels: Machine Learning, Data Science, Domain Expertise
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Contrast: Databases
Element | Databases | Data Science
Data Value | "Precious" | "Cheap"
Data Volume | Modest | Massive
Examples | Bank records, Personnel records, Census, Medical records | Online clicks, GPS logs, Tweets, tree sensor readings
Priorities | Consistency, Error recovery, Auditability | Speed, Availability, Query richness
Structured | Strongly (Schema) | Weakly or none (Text)
Properties | Transactions, ACID+ | CAP* theorem (2/3), eventual consistency
Realizations | Structured Query Language (SQL) | NoSQL: Riak, Memcached, Apache HBase, Apache River, MongoDB, Apache Cassandra, Apache CouchDB, ...
* CAP = Consistency, Availability, Partition Tolerance
+ ACID = Atomicity, Consistency, Isolation, and Durability

Contrast: Databases
Databases: querying the past
Data Science: querying the future
Related: Business Analytics
» Goal: obtain actionable insight in complex environments
» Challenge: vast amounts of disparate, unstructured data and limited time

Contrast: Scientific Computing
Figure: Image, then general-purpose ML classifier, then "Supernova" or "Not" (Dr. Peter Nugent, C3 LBNL)
Scientific Modeling | Data-Driven Approach
Physics-based models | General inference engine replaces model
Problem-structured | Structure not related to problem
Mostly deterministic, precise | Statistical models handle true randomness and unmodeled complexity
Run on supercomputer or high-end computing cluster | Run on cheaper computer clusters (EC2)

Contrast: Traditional Machine Learning
Traditional Machine Learning
» Develop new (individual) models
» Prove mathematical properties of models
» Improve/validate on a few, relatively clean, small datasets
» Publish a paper
Data Science
» Explore many models, build and tune hybrids
» Understand empirical properties of models
» Develop/use tools that can handle massive datasets
» Take action!

Recent Data Science Competitions
Using Data Science to find Data Scientists!

Doing Data Science
The views of three Data Science experts
» Jim Gray (Turing Award-winning database researcher)
» Ben Fry (data visualization expert)
» Jeff Hammerbacher (former Facebook Chief Scientist, Cloudera co-founder)
Cloud computing: Data Science enabler

Jim Gray's Model (Turing Award winner)
1. Capture
2. Curate
3. Communicate

Ben Fry's Model (data visualization expert)
1. Acquire
2. Parse
3. Filter
4. Mine
5. Represent
6. Refine
7. Interact

Jeff Hammerbacher's Model (Facebook, Cloudera)
1. Identify problem
2. Instrument data sources
3. Collect data
4. Prepare data (integrate, transform, clean, filter, aggregate)
5. Build model
6. Evaluate model
7. Communicate results

Key Data Science Enabler: Cloud Computing
Cloud computing reduces computing operating costs
Cloud computing enables data science on massive numbers of inexpensive computers
Figure: http://www.opengroup.org/cloud/cloud/cloud_for_business/roi.htm

The Million-Server Data Center
http://spectrum.ieee.org/tech-talk/semiconductors/devices/what-will-the-data-center-of-the-future-look-like

Data Scientist's Practice
Figure labels: Digging Around in Data; Clean, prep; Hypothesize Model; Large Scale Exploitation; Evaluate, Interpret

Data Science Topics
» Data Acquisition
» Data Preparation
» Analysis
» Data Presentation
» Data Products
» Observation and Experimentation

What's Hard about Data Science?
» Overcoming assumptions
» Making ad-hoc explanations of data patterns
» Not checking enough (validate models, data pipeline integrity, etc.)
» Overgeneralizing
» Communication
» Using statistical tests correctly
» Prototype-to-production transitions
» Data pipeline complexity (who do you ask?)

The Big Picture: Extract, Transform, Load

Data Acquisition (Sources) in Web Companies
Examples from Facebook
» Application databases
» Web server logs
» Event logs
» Application Programming Interface (API) server logs
» Ad and search server logs
» Advertisement landing page content
» Wikipedia
» Images and video

Data Acquisition & Preparation Overview
Extract, Transform, Load (ETL)
» We need to extract data from the source(s)
» We need to load data into the sink
» We need to transform data at the source, sink, or in a staging area
» Sources: file, database, event log, web site, Hadoop Distributed FileSystem (HDFS), ...
» Sinks: Python, R, SQLite, NoSQL store, files, HDFS, Relational DataBase Management System (RDBMS), ...
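The three ETL steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory CSV text, the `clicks` table, and the function names `extract`, `transform`, and `load` are hypothetical and do not come from the lecture. The source is a CSV "file", the staging area is a Python list, and the sink is an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical source: a small CSV "file" held in memory so the sketch is self-contained.
SOURCE_CSV = """name,clicks
alice,3
bob,
carol,7
"""

def extract(text):
    """Extract: read rows from the CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing click count and convert counts to int."""
    return [(r["name"], int(r["clicks"])) for r in rows if r["clicks"].strip()]

def load(records, conn):
    """Load: write the cleaned records into a SQLite table (the sink)."""
    conn.execute("CREATE TABLE IF NOT EXISTS clicks (name TEXT, clicks INTEGER)")
    conn.executemany("INSERT INTO clicks VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)

# Query the sink to confirm what arrived.
total = conn.execute("SELECT SUM(clicks) FROM clicks").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM clicks").fetchone()[0]
print(row_count, total)
```

Here the transform happens in a staging area between source and sink; the same cleaning could instead be pushed into the source query or done with SQL after loading.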

Data Acquisition & Preparation Process Model
The construction of a new data preparation process is done in many phases
» Data characterization
» Data cleaning
» Data integration
We must efficiently move data around in space and time
» Data transfer
» Data serialization and deserialization (for files or network)
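As a rough illustration of serialization and deserialization, the sketch below encodes records as newline-delimited JSON bytes (suitable for a file or a network socket) and decodes them again. The sample records and the helper names `serialize` and `deserialize` are assumptions made for this example; the lecture does not prescribe a format.

```python
import json

# Hypothetical records standing in for sensor readings.
records = [
    {"sensor": "tree-1", "reading": 21.5},
    {"sensor": "tree-2", "reading": 19.0},
]

def serialize(rows):
    """Encode each record as one JSON line and return UTF-8 bytes."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in rows).encode("utf-8")

def deserialize(payload):
    """Decode UTF-8 bytes back into a list of records, one per non-empty line."""
    return [json.loads(line) for line in payload.decode("utf-8").splitlines() if line]

payload = serialize(records)     # bytes that could be written to disk or sent over a network
restored = deserialize(payload)  # the receiving side rebuilds the records
print(restored == records)
```

A text format such as JSON is easy to inspect; for large volumes a compact binary format is often preferred, at the cost of readability.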

Data Acquisition & Preparation Workflow
The transformation pipeline or workflow often consists of many steps
» For example: Unix pipes and filters
» cat data_science.txt | wc | mail -s "word count" myname@some.com
If a workflow is to be used more than once, it can be scheduled
» Scheduling can be time-based or event-based
» Use publish-subscribe to register interest (e.g., Twitter feeds)
Recording the execution of a workflow is known as capturing lineage or provenance
» Spark's Resilient Distributed Datasets do this for you automatically
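The pipes-and-filters idea can be mimicked in Python by chaining small stages, each consuming the output of the previous one. The sketch below is an assumption-laden illustration, not the lecture's code: the stage names `cat`, `wc`, and `report` are hypothetical stand-ins for the Unix tools, the input text is made up, and the final stage only formats a message rather than sending mail.

```python
import io

def cat(stream):
    """Source stage: yield lines from a file-like object, like `cat`."""
    for line in stream:
        yield line

def wc(lines):
    """Filter stage: count lines, words, and characters, like `wc`."""
    n_lines = n_words = n_chars = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())
        n_chars += len(line)
    return n_lines, n_words, n_chars

def report(counts, subject):
    """Sink stage: format a message instead of actually sending mail."""
    return f"{subject}: {counts[0]} lines, {counts[1]} words, {counts[2]} characters"

# Hypothetical input standing in for data_science.txt.
text = io.StringIO("data science is fun\nspark makes it fast\n")
message = report(wc(cat(text)), "word count")
print(message)
```

Because `cat` is a generator, lines flow through the pipeline one at a time, so the input never has to fit in memory all at once.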

Impediments to Collaboration
The diversity of tools and programming/scripting languages makes it hard to share
Finding a script or computed result is often harder than just writing the program from scratch!
» Question: How could we fix this?
View that most analysis work is "throw away"

Data Science Roles
» Businessperson
» Programmer
» Enterprise
» Web Company

The Businessperson
Data Sources
» Web pages
» Excel
ETL
» Copy and paste
Data Warehouse
» Excel
Business Intelligence and Analytics
» Excel functions
» Excel charts
» Visual Basic
Image: http://www.fastcharacters.com/character-design/cartoon-business-man/

The Programmer
Data Sources
» Web scraping, web services API
» Excel spreadsheet exported as Comma Separated Values
» Database queries
ETL
» wget, curl, Beautiful Soup, lxml
Data Warehouse
» Flat files
Business Intelligence and Analytics
» NumPy, Matplotlib, R, MATLAB, Octave
Image: http://doctormo.deviantart.com/art/computer-programmer-ink-346207753

The Enterprise
Data Sources
» Application databases
» Intranet files
» Application server log files
ETL
» Informatica, IBM DataStage, Ab Initio, Talend
Data Warehouse
» Teradata, Oracle, IBM DB2, Microsoft SQL Server
Business Intelligence and Analytics
» SAP Business Objects, IBM Cognos, MicroStrategy, SAS, SPSS, R
Image: http://www.publicdomainpictures.net/view-image.php?image=74743

The Web Company
Data Sources
» Application databases
» Logs from the services tier
» Web crawl data
ETL
» Apache Flume, Apache Sqoop, Apache Pig, Apache Oozie, Apache Crunch
Data Warehouse
» Apache Hadoop/Apache Hive, Apache Spark/Spark SQL
Business Intelligence and Analytics
» Custom dashboards: Oracle Argus, Razorflow
» R
Image: http://www.future-web-net.com/2011/04/un-incendio-blocca-il-server-di-aruba.html