1 Introduction to Big Data! with Apache Spark" UC#BERKELEY#
2 So What is Data Science?" Doing Data Science" Data Preparation" Roles" This Lecture"
3 What is Data Science?" Data Science aims to derive knowledge! from big data, efficiently and intelligently" Data Science encompasses the set of! activities, tools, and methods that! enable data-driven activities in science,! business, medicine, and government " "
4 Data Science One Definition" Machine" Learning" Data" Science" Domain" Expertise" "
5 Element" Databases" Data Science" Data Value" Precious " Cheap " Data Volume" Modest" Massive" Examples" Priorities" Bank records, Personnel records, Census, Medical records" Consistency, Error recovery, Auditability" Contrast: Databases" Online clicks, GPS logs, Tweets, tree sensor readings" Speed, Availability, Query richness" Structured" Strongly (Schema)" Weakly or none (Text)" Properties" Transactions, ACID +" CAP * theorem (2/3), eventual consistency" Realizations" Structured Query Language (SQL)" * CAP = Consistency, Availability, Partition Tolerance" + ACID = Atomicity, Consistency, Isolation and Durability" NoSQL: Riak, Memcached, Apache Hbase, Apache River, MongoDB, Apache Cassandra, Apache CouchDB,, "
6 Contrast: Databases" Databases" Querying the past" Data Science" Querying the future" Related Business Analytics"» Goal: obtain actionable insight in complex environments"» Challenge: vast amounts of disparate, unstructured data and limited time"
7 Contrast: Scientific Computing" Image" General purpose ML classifier" Supernova" Not" Dr Peter Nugent (C3 LBNL)" Scientific Modeling" Data-Driven Approach" Physics-based models" General inference engine replaces model" Problem-Structured" Structure not related to problem" Mostly deterministic, precise" Statistical models handle true randomness, and unmodeled complexity" Run on Supercomputer or High-end Computing Cluster" Run on cheaper computer Clusters (EC2)"
8 Contrast: Traditional Machine Learning" Traditional Machine Learning" Develop new (individual) models" Prove mathematical properties of models" Improve/validate on a few, relatively clean, small datasets" Publish a paper" Data Science" Explore many models, build and tune hybrids" Understand empirical properties of models" Develop/use tools that can handle massive datasets" Take action!"
9 Recent Data Science Competitions" Using Data Science to find Data Scientists!"
10 Doing Data Science" The views of three Data Science experts"» Jim Gray (Turing Award winning database researcher)"» Ben Fry (Data visualization expert)"» Jeff Hammerbacher (Former Facebook Chief Scientist, Cloudera co-founder)" Cloud computing: Data Science enabler" 13"
11 Data Science One Definition" Machine" Learning" Data" Science" Domain" Expertise" "
12 Jim Gray s Model " 1. Capture" 2. Curate" 3. Communicate" Turing award winner" 15"
13 Ben Fry s Model " 1. Acquire" 2. Parse" 3. Filter" 4. Mine" 5. Represent" 6. Refine" 7. Interact" Data visualization expert" 16"
14 Jeff Hammerbacher s Model" 1. Identify problem" 2. Instrument data sources" 3. Collect data" 4. Prepare data (integrate, transform,! clean, filter, aggregate)" 5. Build model" 6. Evaluate model" 7. Communicate results" Facebook, Cloudera" 17"
15 Key Data Science Enabler: Cloud Computing" Cloud computing reduces computing operating costs" Cloud computing enables data science on massive numbers of inexpensive computers " Figure:
16 The Million-Server Data Center" "
17 Data Scientist s Practice" Clean, prep" Large Scale Exploitation" Digging Around! in Data" Hypothesize Model" Evaluate! Interpret"
18 Data Science Topics" Data Acquisition" Data Preparation" Analysis" Data Presentation" Data Products" Observation and Experimentation"
19 What s Hard about Data Science?" Overcoming assumptions" Making ad-hoc explanations of data patterns" Not checking enough (validate models, data pipeline integrity, etc.)" Overgeneralizing" Communication" Using statistical tests correctly" Prototype! Production transitions" Data pipeline complexity (who do you ask?)"
20 Data Science One Definition" Machine" Learning" Data" Science" Domain" Expertise" "
21 Extract Transform Load The Big Picture"
22 Data Acquisition (Sources)! in Web Companies" Examples from Facebook"» Application databases"» Web server logs"» Event logs"» Application Programming Interface (API)! server logs"» Ad and search server logs"» Advertisement landing page content"» Wikipedia"» Images and video"
23 Data Acquisition & Preparation Overview" Extract, Transform, Load (ETL)"» We need to extract data from the source(s)"» We need to load data into the sink"» We need to transform data at the source, sink, or in a staging area"» Sources: file, database, event log, web site,! Hadoop Distributed FileSystem (HDFS), "» Sinks: Python, R, SQLite, NoSQL store, files,! HDFS, Relational DataBase Management System! (RDBMS), "
24 Data Acquisition & Preparation! Process Model" The construction of a new data preparation process is done in many phases"» Data characterization"» Data cleaning"» Data integration" We must efficiently move data around in! space and time"» Data transfer"» Data serialization and deserialization (for files or! network)"
25 Data Acquisition & Preparation Workflow" The transformation pipeline or workflow often consists of many steps"» For example: Unix pipes and filters"» cat$data_science.txt$ $wc$ $mail$1s$"word$count If a workflow is to be used more than once, it can be scheduled"» Scheduling can be time-based or event-based"» Use publish-subscribe to register interest (e.g., Twitter feeds)" Recording the execution of a workflow is known as! capturing lineage or provenance"» Spark s Resilient Distributed Datasets do this for you automatically"
26 Impediments to Collaboration" The diversity of tools and programming/scripting languages makes it hard to share" Finding a script or computed result is! often harder than just writing the! program from scratch!"» Question: How could we fix this?" View that most analysis work is! throw away "
27 Businessperson" Programmer" Enterprise" Web Company" Data Science Roles"
28 Data Sources"» Web pages"» Excel" The Businessperson" ETL"» Copy and paste" Data Warehouse"» Excel" Business Intelligence and Analytics"» Excel functions"» Excel charts"» Visual Basic" Image: "
29 The Programmer" Data Sources"» Web scraping, web services API"» Excel spreadsheet exported as Comma Separated Values"» Database queries" ETL"» wget, curl, Beautiful Soup, lxml" Data Warehouse"» Flat files" Business Intelligence and Analytics"» Numpy, Matplotlib, R, Matlab, Octave" Image: "
30 Data Sources"» Application databases"» Intranet files"» Application server log files" The Enterprise" ETL"» Informatica, IBM DataStage, Ab Initio, Talend" Data Warehouse"» Teradata, Oracle, IBM DB2, Microsoft SQL Server" Business Intelligence and Analytics"» SAP Business Objects, IBM Cognos, Microstrategy, SAS,! SPSS, R" Image: "
31 Data Sources"» Application databases"» Logs from the services tier"» Web crawl data" The Web Company" ETL"» Apache Flume, Apache Sqoop, Apache Pig, Apache Oozie,! Apache Crunch" Data Warehouse"» Apache Hadoop/Apache Hive, Apache Spark/Spark SQL" Business Intelligence and Analytics"» Custom dashboards: Oracle Argus, Razorflow"» R" Image:
CAP4770/5771 Introduction to Data Science Fall 2016 University of Florida, CISE Department Prof. Daisy Zhe Wang Based on notes from CS194 at UC Berkeley by Michael Franklin, John Canny, and Jeff Hammerbacher
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to
MAKING BIG DATA COME ALIVE Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth Steve Gonzales, Principal Manager email@example.com
Native Connectivity to Big Data Sources in MSTR 10 Bring All Relevant Data to Decision Makers Support for More Big Data Sources Optimized Access to Your Entire Big Data Ecosystem as If It Were a Single
HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some
Software Engineering for Big Data CS846 Paulo Alencar David R. Cheriton School of Computer Science University of Waterloo Big Data Big data technologies describe a new generation of technologies that aim
NAME NAMIT EXECUTIVE SUMMARY EXPERTISE DELIVERIES Around 10+ years of experience on Big Data Technologies such as Hadoop and MongoDB, Java, Python, Big Data Analytics, System Integration and Consulting
Integrating a Big Data Platform into Government: Drive Better Decisions for Policy and Program Outcomes John Haddad, Senior Director Product Marketing, Informatica Digital Government Institute s Government
Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014 Defining Big Not Just Massive Data Big data refers to data sets whose size is beyond the ability of typical database software tools
Databricks A Primer Who is Databricks? Databricks was founded by the team behind Apache Spark, the most active open source project in the big data ecosystem today. Our mission at Databricks is to dramatically
Databricks A Primer Who is Databricks? Databricks vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache Spark, a powerful
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class
Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes Highly competitive enterprises are increasingly finding ways to maximize and accelerate
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing
Bringing Big Data to People Microsoft s modern data platform SQL Server 2014 Analytics Platform System Microsoft Azure HDInsight Data Platform Everyone should have access to the data they need. Process
Collaborative Big Data Analytics 1 Big Data Is Less About Size, And More About Freedom TechCrunch!!!!!!!!! Total data: bigger than big data 451 Group Findings: Big Data Is More Extreme Than Volume Gartner!!!!!!!!!!!!!!!
The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah Cofounder & CTO, Cloudera, Inc. Twitter: @awadallah 1 2 Cloudera Snapshot Founded 2008, by former employees of Employees
DATA SCIENCE CURRICULUM Before class even begins, students start an at-home pre-work phase. When they convene in class, students spend the first eight weeks doing iterative, project-centered skill acquisition.
Evaluating NoSQL for Enterprise Applications Dirk Bartels VP Strategy & Marketing Agenda The Real Time Enterprise The Data Gold Rush Managing The Data Tsunami Analytics and Data Case Studies Where to go
Overview of Curriculum ANALYTICS CENTER LEARNING PROGRAM The following courses are offered by Analytics Center as part of its learning program: Course Duration Prerequisites 1- Math and Theory 101 - Fundamentals
INDUS / AXIOMINE Adopting Hadoop In the Enterprise Typical Enterprise Use Cases. Contents Executive Overview... 2 Introduction... 2 Traditional Data Processing Pipeline... 3 ETL is prevalent Large Scale
End to End Solution to Accelerate Data Warehouse Optimization Franco Flore Alliance Sales Director - APJ Big Data Is Driving Key Business Initiatives Increase profitability, innovation, customer satisfaction,
The Inside Scoop on Hadoop Orion Gebremedhin National Solutions Director BI & Big Data, Neudesic LLC. VTSP Microsoft Corp. Orion.Gebremedhin@Neudesic.COM B-orgebr@Microsoft.com @OrionGM The Inside Scoop
Integrating Cloudera and SAP HANA Version: 103 Table of Contents Introduction/Executive Summary 4 Overview of Cloudera Enterprise 4 Data Access 5 Apache Hive 5 Data Processing 5 Data Integration 5 Partner
The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn Presented by :- Ishank Kumar Aakash Patel Vishnu Dev Yadav CONTENT Abstract Introduction Related work The Ecosystem Ingress
So What s the Big Deal? Presentation Agenda Introduction What is Big Data? So What is the Big Deal? Big Data Technologies Identifying Big Data Opportunities Conducting a Big Data Proof of Concept Big Data
Overview of Databases On MacOS Karl Kuehn Automation Engineer RethinkDB Session Goals Introduce Database concepts Show example players Not Goals: Cover non-macos systems (Oracle) Teach you SQL Answer what
BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES Relational vs. Non-Relational Architecture Relational Non-Relational Rational Predictable Traditional Agile Flexible Modern 2 Agenda Big Data
Microsoft SQL Server 2012 with Hadoop Debarchan Sarkar Chapter No. 1 "Introduction to Big Data and Hadoop" In this package, you will find: A Biography of the author of the book A preview chapter from the
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Tapping into Hadoop and NoSQL Data Sources in MicroStrategy Presented by: Trishla Maru Agenda Big Data Overview All About Hadoop What is Hadoop? How does MicroStrategy connects to Hadoop? Customer Case
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster Nov 7, 2012 Who I Am Robert Lancaster Solutions Architect, Hotel Supply Team firstname.lastname@example.org @rob1lancaster Organizer of Chicago
Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture Apps and data source extensions with APIs Future white label, embed or integrate Power BI Deploy Intelligent
The need for data integration tools exists in every company, small to large. Whether it is extracting data that exists in spreadsheets, packaged applications, databases, sensor networks or social media
Data Services Advisory Modern Datastores An Introduction Created by: Strategy and Transformation Services Modified Date: 8/27/2014 Classification: DRAFT SAFE HARBOR STATEMENT This presentation contains
Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools
TURN YOUR DATA INTO KNOWLEDGE 100% open source Business Intelligence and Big Data Analytics www.spagobi.org @spagobi Copyright 2016 Engineering Group, SpagoBI Labs. All rights reserved. Why choose SpagoBI
Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy Presented by: Jeffrey Zhang and Trishla Maru Agenda Big Data Overview All About Hadoop What is Hadoop? How does MicroStrategy connects to Hadoop?
CTOlabs.com White Paper: What You Need To Know About Hadoop June 2011 A White Paper providing succinct information for the enterprise technologist. Inside: What is Hadoop, really? Issues the Hadoop stack
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new
WHITE PAPER Four Key Pillars To A Big Data Management Solution EXECUTIVE SUMMARY... 4 1. Big Data: a Big Term... 4 EVOLVING BIG DATA USE CASES... 7 Recommendation Engines... 7 Marketing Campaign Analysis...
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
IBM BigInsights for Apache Hadoop Efficiently manage and mine big data for valuable insights Highlights: Enterprise-ready Apache Hadoop based platform for data processing, warehousing and analytics Advanced
Global Information Platforms Evolving the Data Warehouse Jeff Hammerbacher Chief Scientist and Vice President of Products, Cloudera April 9, 2009 Presentation Outline Introductions What we ve built Short
Beyond Web Application Log Analysis using Apache TM Hadoop A Whitepaper by Orzota, Inc. 1 Web Applications As more and more software moves to a Software as a Service (SaaS) model, the web application has
Hadoop for Oracle database professionals Alex Gorbachev Calgary, AB September 2013 Alex Gorbachev Chief Technology Officer at Pythian Blogger Cloudera Champion of Big Data OakTable Network member Oracle
Tap into Hadoop and Other No SQL Sources Presented by: Trishla Maru What is Big Data really? The Three Vs of Big Data According to Gartner Volume Volume Orders of magnitude bigger than conventional data
Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies Big Data: Global Digital Data Growth Growing leaps and bounds by 40+% Year over Year! 2009 =.8 Zetabytes =.08
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
Big Data Open Source Stack vs. Traditional Stack for BI and Analytics Part I By Sam Poozhikala, Vice President Customer Solutions at StratApps Inc. 4/4/2014 You may contact Sam Poozhikala at email@example.com.
Certified Big Data and Apache Hadoop Developer VS-1221 Certified Big Data and Apache Hadoop Developer Certification Code VS-1221 Vskills certification for Big Data and Apache Hadoop Developer Certification
HDP Enabling the Modern Data Architecture Herb Cunitz President, Hortonworks Page 1 Hortonworks enables adoption of Apache Hadoop through HDP (Hortonworks Data Platform) Founded in 2011 Original 24 architects,
P4.1 Reference Architectures for Enterprise Big Data Use Cases Romeo Kienzler, Data Scientist, Advisory Architect, IBM Germany, Austria, Switzerland IBM Center of Excellence for Data Science, Cognitive
TECHNOLOGY TRANSFER PRESENTS MIKE FERGUSON BIG DATA MULTI-PLATFORM ANALYTICS JUNE 25-27, 2014 RESIDENZA DI RIPETTA - VIA DI RIPETTA, 231 ROME (ITALY) firstname.lastname@example.org www.technologytransfer.it
WHITE PAPER USING CLOUDERA TO IMPROVE DATA PROCESSING Using Cloudera to Improve Data Processing CLOUDERA WHITE PAPER 2 Table of Contents What is Data Processing? 3 Challenges 4 Flexibility and Data Quality
beyond possible Big Use End-user applications Big Analytics Visualisation tools Big Analytical tools Big management systems The 4 Pillars of Technosoft s Big Practice Overview Businesses have long managed
Tips and Techniques on how to better Monitor, Manage and Optimize your MicroStrategy System InfoCepts 'LJLWDOO\ VLJQHG E\,QIR&HSWV '1 FQ,QIR&HSWV JQ,QIR&HSWV F 8QLWHG 6WDWHV O 86 R,QIR&HSWV RX,QIR&HSWV
W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the
Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores
Ganzheitliches Datenmanagement für Hadoop Michael Kohs, Senior Sales Consultant @mikchaos The Problem with Big Data Projects in 2016 Relational, Mainframe Documents and Emails Data Modeler Data Scientist
Leading a Healthcare Company to the Big Data Promised Land: A Case Study of Hadoop in Healthcare Mohammad Quraishi (IT Senior Principal - Cigna) email@example.com About me BS in Computer Science and Engineering
IBM InfoSphere BigInsights Enterprise Edition Efficiently manage and mine big data for valuable insights Highlights Advanced analytics for structured, semi-structured and unstructured data Professional-grade
Sunnie Chung Cleveland State University Data Scientist Big Data Processing Data Mining 2 INTERSECT of Computer Scientists and Statisticians with Knowledge of Data Mining AND Big data Processing Skills:
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
Big Data Analytics 1 Priority Discussion Topics What are the most compelling business drivers behind big data analytics? Do you have or expect to have data scientists on your staff, and what will be their