Big Data and Data Science Grows Up. Ron Bodkin Founder & CEO Think Big Analy8cs [email protected]
|
|
|
- Dennis Fox
- 10 years ago
- Views:
Transcription
1 Big Data and Data Science Grows Up Ron Bodkin Founder & CEO Think Big Analy8cs 1
2 Source IDC 2
3 Hadoop Open Source Distributed Cluster SoGware Distributed file system Java- based MapReduce Resource manager Started in Nutch project (open source crawler) Inspired by Google MapReduce and GFS 2/7/12
4 Hadoop Components SQL Store Ingest Service Outgest Service SQL Store Logs Key italics: process : MR jobs Primary Master Server Job Tracker Name Node Cluster Secondary Master Server Secondary Name Node Slaves Client Servers Hive, Pig,... cron+bash, Azkaban, Sqoop, Scribe, Monitoring, Management Slave Server Slave Server Slave Server Task Tracker Task Tracker Task Tracker Data Node Data Node Data Node... Disk Disk Disk Disk Disk Disk Disk Disk Disk Disk Disk Disk Disk Disk Disk Disk Disk Disk Disk Disk Disk Disk Disk Disk 27 4
5 Data Processing Models 5
6 Input Mappers Sort, Shuffle Reducers MapReduce 101 Output Hadoop uses MapReduce (doc1, " ") (hadoop, 1) (mapreduce, 1) (uses, 1) 0-9, a-l (a, [1,1]), (hadoop, [1]), (is, [1,1]) a 2 hadoop 1 is 2 There is a Map phase (doc2, " ") (is, 1), (a, 1) m-q (map, 1),(phase,1) (there, 1) (map, [1]), (mapreduce, [1]), (phase, [1,1]) map 1 mapreduce 1 phase 2 (doc3, "") There is a Reduce phase (doc4, " ") (is, 1), (a, 1) (phase,1) (there, 1), (reduce 1) r-z (reduce, [1]), (there, [1,1]), (uses, 1) reduce 1 there 2 uses 1 6
7 Word Count: Mapper public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); } public void map(object key, Text value, Context context ) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasmoretokens()) { word.set(itr.nexttoken()); context.write(word, one); } } 7
8 MapReduce Wiring public static void main(string[] args) throws Exception { Configuration conf = new Configuration(); String[] otherargs = new GenericOptionsParser(conf, args). getremainingargs(); if (otherargs.length!= 2) { System.err.println("Usage: wordcount <in> <out>"); System.exit(2); } Job job = new Job(conf, "word count"); job.setjarbyclass(wordcount.class); job.setmapperclass(tokenizermapper.class); job.setcombinerclass(intsumreducer.class); job.setreducerclass(intsumreducer.class); job.setoutputkeyclass(text.class); job.setoutputvalueclass(intwritable.class); FileInputFormat.addInputPath(job, new Path(otherArgs[0])); FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); System.exit(job.waitForCompletion(true)? 0 : 1); } 8
9 Hive Overview A SQL- based tool for data warehousing using Hadoop clusters. Lowers the barrier for Hadoop adop4on for exis8ng SQL apps and users.. Translates SQL to MapReduce Provides an op4mizer Extensible data types & UDFs The first popular meta- data service for Hadoop 9
10 Word Count in Hive CREATE TABLE docs (line STRING); LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs; CREATE TABLE word_counts AS SELECT word, count(1) as count from (SELECT explode(split(line, '\\s')) AS word FROM docs) w GROUP BY word ORDER BY count DESC, word; 10
11 Pig Overview Pig La4n is a higher- level map/reduce language A simple data flow language designed for produc8vity not Turing complete (yet!) Built- in support for joins, filters, etc. Provides an op4mizer translates into Hadoop map reduce job steps Allows user- defined func8ons With HCatalog will share metadata with Hive 11
12 Sample Pig Script lines = LOAD 'docs/*' USING TextLoader(); words = FOREACH lines GENERATE FLATTEN(TOKENIZE($0)); groups = GROUP words BY $0; counts = FOREACH groups GENERATE $0, COUNT($1); sorted = ORDER counts BY $1 desc, $0; STORE sorted INTO 'output/wc' USING PigStorage('\t'); 12
13 MapReduce Frameworks Cascading Java- based op8mizer & rela8onal operators Crunch Abstract collec8ons and op8mizer Streaming, Pipes Non- Java integra8on (Perl, Python, Ruby, C/C++, ) Tap Simplify 8me series processing, use of diverse tools and data formats 13
14 public static class WordCountMapper extends TapMapper } public void map(string in, Pipe<CountRec> out) { StringTokenizer tokenizer = new StringTokenizer(in); } while (tokenizer.hasmoretokens()) { out.put(countrec.newbuilder(). setword(tokenizer.nexttoken()). } Tap Mapper setcount(1).build())); 14
15 Tap Wiring public static main(string[] args) throws Exception { } CommandOptions o = new CommandOptions(args); Tap tap = new Tap(o); tap.newphase().reads(o.input).map(wordcountmapper.class).combine(wordcountreducer.class).groupby("word").writes(o.output).reduce(wordcountreducer.class); tap.make(); 15
16 IntegraFon 16
17 Reference Architecture Addi8ve data processing power for flexibility. Big Data Strategy is integrated with HBase, rela8onal, exis8ng BI and data warehouse technology. Provides capability to create data science discipline using full set data. Analysis capability on all internal data with capability to add external data at will. 17
18 HBase Tables for Hadoop inspired by Google s Big Table Supports both batch and random access Ad hoc lookup Website serving queries High Consistency Maturing rapidly (e.g., reducing latency variance) S8ll a performance tax vs. DFS 18
19 Unstructured Data IngesFon Batch log shipping No distributed management and monitoring Syslog forwarding No distributed management and monitoring Apache Kakfa Wrinen in Java Distributed message rou8ng Distributed monitoring and management (agents) Apache Flume Wrinen in Java Pluggable sources, adapters, sinks Distributed monitoring and management (agents) Other streaming frameworks Scribe, Chukwa, Honu Messaging 19
20 Hadoop High Availability Name Node & Job Tracker tradi8onally a SPOF Durability excellent (3x replicas careful ops) MapR offers HA filesystem and auto- recovery for Job Tracker crash Hadoop 0.23 promises HA Name Node & Job Tracker 20
21 Ac8ve/Ac8ve op8on for strict SLAs Parallel ingest of data DR: not all or nothing Backup op8ons Recovery Parallel processing Dist cp HBase: snapshots and replica8on 21
22 Version CompaFbility People upgrade Hadoop by installing a new cluster Tradi8onally stop and upgrade all Rolling upgrades in MapR and futures for Hadoop Protocols not backward compa8ble (8l 0.24+) Only one version of processing API (8l 0.23) 22
23 HosFng OpFons Cloud- based Amazon EMR and S3 notable Popular with startups & POCs Private Data Center or Coloca8on Commodity Hardware w/ DAS: has been the main approach for enterprise IT Appliance Commodity Hardware w/ SAS or SAN 23
24 Batch processing ETL Model training Model scoring Fast analy8cs Search Lookup Common Workloads 24
25 Sweet- Spot Machine ConfiguraFon 8-12 cores 8-12 JBOD 2 TB spindles GB of RAM 10 gige Changing quickly (more density, disk drive shortage) 25
26 Common Uses 26
27 IT Log & Security Forensics & Analy8cs Automated Device Data Analy8cs Find New Signal Predict Events React in real 8me Failure Analysis Proac8ve Fixes Product Planning 100% Capture Data Governance Shared Services Cross Sell/Upsell Customer Analy8cs Mone8ze Data Adver8sing Analy8cs Anribu8on Customer Value Segmenta8on Insights Op8miza8on Social Media Big Data Warehouse Analy8cs Hadoop + MPP + EDW Cost Reduc8on Ad Hoc Insight Flexibility Predic8ve Analy8cs 27
28 Automated Device Support Case Study 28
29 The Enterprise Data Warehouse has been a founda8on of analysis for 20 years It shines for structured analysis, repor8ng, etc. Enterprises are dealing with new needs New data types, at new scale Need to build analy8c models The Big Data Warehouse integrates Hadoop with an Enterprise Data Warehouse 29
30 Why? Challenges Cost to store unstructured data Poor response 8me to changing BI needs Data Warehouse access for departments Goals Integrate unstructured data with data warehouse Predic8ve analy8cs based on data science Comprehensive access to cluster for all users 30
31 Hadoop s Role Support semi- structured and unstructured data Large scale storage Transac8on- level detail (e.g., clickstreams) Archival Integrated data: mul8ple warehouses, new data sources, Powerful processing capacity Perform large scale analyses/studies Drill to detail in large fact tables Query without structure: agility to analyze data without preprocessing Transforma8on to build dimensional models, aggregates, and summaries Build predic8ve models 31
32 Data Agility Classic Warehouse ETL Pre- parse all data Normalize up front Feed data marts New ideas need IT projects Big Data Warehouse Store raw data Parse only when proven Approximate parse on demand Capacity for analysis on demand Prove ideas before projects to op8mize 32
33 Common Data Sets User ac8vity logs Website, ad server, mobile Social data Graph, ac8vity, profiles Sensor data Hardware & sogware phone home, IT logs, cell phones, smart grid, energy Databases To join with less structured, handle evolving schemas Time series, text, scien8fic 33
34 Data Value Chain Integra8ng data mul4plies its value Data Provider Data Integrator Data Consumer (Internal Products) Data Marketplace (Amazon, MicrosoG, Infochimps, Buzz Data, Quantbench) 34
35 Data Science 35
36 What is Data Science? aka Machine Learning, Data Mining Exploratory Modeling Iden8fying trends and excep8ons Detec8ng signal Confirmatory Modeling Building models to capture signal Proving at scale Working with data bo?om up Norvig: more data beats be?er algorithms 36
37 People A New Role Exists the Data ScienFst One Part Scien8st/Sta8s8cian Two Parts Sleuth/Ar8st One Part Programmer One Part Entrepreneur Focused on data not models 37
38 Techniques Supervised Decision trees, random forest Logis8c regression Back- propaga8ng Neural Networks Support Vector Machines Unsupervised (probabilis8c and clustering models) Principal Component Analysis K- means clustering Single Value Decomposi8on Bayesian Networks 38
39 Technologies Open Source Libraries Mahout (hnp://mahout.apache.org/) Mallet (hnp://mallet.cs.umass.edu/) Weka (hnp:// OpenNLP (hnp://incubator.apache.org/opennlp/) RHadoop (hnps://github.com/revolu8onanaly8cs/ Rhadoop) Tools Karmasphere Analyst: visualiza8on Greenplum Chorus: collabora8on and annota8on 2/7/12
40 Process Best PracFces Data Science with a Big Data Warehouse Structured + Unstructured Data Make 100 s of mistakes with linle cost Find Promising small signal detec8on Quickly move to produc8on for tes8ng Establish signal detec8on capabili8es Discipline to learn and experiment Innovate with new data sources Retain lessons learned & Voodoo IP Mone8ze your data science investments.! 40
41 Real- Fme Big Data 41
42 Hadoop Processing Today: batch- processing Time to spin up JVM instances. HDFS op8mized for disk scans. Designed for reliability: tolerate failures Not yet suitable for real- 4me event processing Storm, KaHa, message queues,... Futures: shared storage and cluster management With mul8ple processing models 42
43 Edge Serving Needs Scale Out Simple Opera8ons Fast Parallel Export (for profiles, scores) DSS Analy8cs Feed (Fast Parallel Import) Fast Analy8cs (opera8onal repor8ng) 43
44 Edge Serving OpFons Applica8on- clustered SQL mysql, Oracle, etc. nosql clusters MongoDB, CouchBase, Cassandra, HBase, Citrus Leaf scalable RDBMS for OLTP Oracle RAC?, MySQL Cluster VoltDB, Clustrix 2/7/12
45 NoSQL Database Types Key- value stores Distributed Hash Table, single index Tabular/columnar stores Big Table style column families, scans, MapReduce Document stores JSON- style self- contained structures, 2ndary indices Object stores Document store + foreign keys Graphical stores Links among nodes and traversal op8ons 2/7/12
46 HBase Architecture Uses HDFS to handle replica8on Gives us replica8on, resiliency Consists of Master node and Region nodes 3- level hierarchy to get node where data is stored: Client Master Region (Metadata) Region (Data table) * Data table loca8on is cached in client ager lookup, to speed access 2/7/12
47 Cassandra Architecture Uses Amazon Dynamo- style model Distributed hash table All nodes are homogeneous (no master, no SPOF) Nodes organized in a ring client can connect to any node, as they communicate over the ring: Client 2/7/12
48 MongoDB Simpler document database model access keys, simple filter queries no joins You can do secondary indices Including geo indexing Eventual consistency model (allows CAP tradeoffs) Updates normally replicated to a slave, defers disk writes for major performance boost Focus on simple API Mongo- Hadoop integra8on ac8vely developed 2/7/12
49 Streaming Big Data Responding to incoming events at scale Requires keeping state Simple cases applica8on logic with scale- out database SQL- style SQLStream, InfoSphere Streams MapReduce- style emerging Kava, S4, Storm, FlumeBase 2/7/12
50 Futures 50
51 CompuFng Trends The growth of storage density has well outpaced the growth of data transfer rates Storage Transfer Rate
52 CompuFng Trends, cont d. In 1990, you could read all the data from a typical drive in about 5 minutes Today, it would take over 2 hours And, seek 8mes have improved even more slowly than data transfer rates (SSDs improve this) Network speeds in the data center have improved at a comparable speed (60%/yr) So clusters of commodity servers allow throughput Clusters of servers allow RAM density 52
53 Commodity Hardware in 2017? 512 GB of RAM 64 cores 15 TB spinning disks 1 TB SSDs for caching 100 Gigabit (InfiniBand?) 53
54 Hadoop 0.23 (2.0?) Explosion in New Applica8on Models HBase Prominence Data Science Prac8ces, Tools, Technologies Integra8on Trends in Big Data for
55 55
Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab
IBM CDL Lab Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网 Information Management 2012 IBM Corporation Agenda Hadoop 技 术 Hadoop 概 述 Hadoop 1.x Hadoop 2.x Hadoop 生 态
Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl
Big Data for the JVM developer Costin Leau, Elasticsearch @costinl Agenda Data Trends Data Pipelines JVM and Big Data Tool Eco-system Data Landscape Data Trends http://www.emc.com/leadership/programs/digital-universe.htm
Next Gen Hadoop Gather around the campfire and I will tell you a good YARN
Next Gen Hadoop Gather around the campfire and I will tell you a good YARN Akmal B. Chaudhri* Hortonworks *about.me/akmalchaudhri My background ~25 years experience in IT Developer (Reuters) Academic (City
BIG DATA APPLICATIONS
BIG DATA ANALYTICS USING HADOOP AND SPARK ON HATHI Boyu Zhang Research Computing ITaP BIG DATA APPLICATIONS Big data has become one of the most important aspects in scientific computing and business analytics
Apache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc. [email protected], twicer: @awadallah
Apache Hadoop: The Pla/orm for Big Data Amr Awadallah CTO, Founder, Cloudera, Inc. [email protected], twicer: @awadallah 1 The Problems with Current Data Systems BI Reports + Interac7ve Apps RDBMS (aggregated
Enterprise Data Storage and Analysis on Tim Barr
Enterprise Data Storage and Analysis on Tim Barr January 15, 2015 Agenda Challenges in Big Data Analytics Why many Hadoop deployments under deliver What is Apache Spark Spark Core, SQL, Streaming, MLlib,
Mrs: MapReduce for Scientific Computing in Python
Mrs: for Scientific Computing in Python Andrew McNabb, Jeff Lund, and Kevin Seppi Brigham Young University November 16, 2012 Large scale problems require parallel processing Communication in parallel processing
Xiaoming Gao Hui Li Thilina Gunarathne
Xiaoming Gao Hui Li Thilina Gunarathne Outline HBase and Bigtable Storage HBase Use Cases HBase vs RDBMS Hands-on: Load CSV file to Hbase table with MapReduce Motivation Lots of Semi structured data Horizontal
Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems
Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:
Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc. 2011 2014. All Rights Reserved
Hortonworks & SAS Analytics everywhere. Page 1 A change in focus. A shift in Advertising From mass branding A shift in Financial Services From Educated Investing A shift in Healthcare From mass treatment
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
Using RDBMS, NoSQL or Hadoop?
Using RDBMS, NoSQL or Hadoop? DOAG Conference 2015 Jean- Pierre Dijcks Big Data Product Management Server Technologies Copyright 2014 Oracle and/or its affiliates. All rights reserved. Data Ingest 2 Ingest
Introduc8on to Apache Spark
Introduc8on to Apache Spark Jordan Volz, Systems Engineer @ Cloudera 1 Analyzing Data on Large Data Sets Python, R, etc. are popular tools among data scien8sts/analysts, sta8s8cians, etc. Why are these
So What s the Big Deal?
So What s the Big Deal? Presentation Agenda Introduction What is Big Data? So What is the Big Deal? Big Data Technologies Identifying Big Data Opportunities Conducting a Big Data Proof of Concept Big Data
Introduc)on to Map- Reduce. Vincent Leroy
Introduc)on to Map- Reduce Vincent Leroy Sources Apache Hadoop Yahoo! Developer Network Hortonworks Cloudera Prac)cal Problem Solving with Hadoop and Pig Slides will be available at hgp://lig- membres.imag.fr/leroyv/
Big Data Course Highlights
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
Using distributed technologies to analyze Big Data
Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Oracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
Apache Hadoop: Past, Present, and Future
The 4 th China Cloud Computing Conference May 25 th, 2012. Apache Hadoop: Past, Present, and Future Dr. Amr Awadallah Founder, Chief Technical Officer [email protected], twitter: @awadallah Hadoop Past
BIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
Big Data 2012 Hadoop Tutorial
Big Data 2012 Hadoop Tutorial Oct 19th, 2012 Martin Kaufmann Systems Group, ETH Zürich 1 Contact Exercise Session Friday 14.15 to 15.00 CHN D 46 Your Assistant Martin Kaufmann Office: CAB E 77.2 E-Mail:
Internals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
Hadoop: Distributed Data Processing. Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010
Hadoop: Distributed Data Processing Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010 Outline Scaling for Large Data Processing What is Hadoop? HDFS and MapReduce
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
The Hadoop Eco System Shanghai Data Science Meetup
The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related
Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
The Future of Data Management
The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard
Hadoop and Relational base The Best of Both Worlds for Analytics Greg Battas Hewlett Packard The Evolution of Analytics Mainframe EDW Proprietary MPP Unix SMP MPP Appliance Hadoop? Questions Is Hadoop
HDInsight Essentials. Rajesh Nadipalli. Chapter No. 1 "Hadoop and HDInsight in a Heartbeat"
HDInsight Essentials Rajesh Nadipalli Chapter No. 1 "Hadoop and HDInsight in a Heartbeat" In this package, you will find: A Biography of the author of the book A preview chapter from the book, Chapter
Outline of Tutorial. Hadoop and Pig Overview Hands-on
Outline of Tutorial Hadoop and Pig Overview Hands-on 1 Hadoop and Pig Overview Lavanya Ramakrishnan Shane Canon Lawrence Berkeley National Lab October 2011 Overview Concepts & Background MapReduce and
CS54100: Database Systems
CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract
W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
DNS Big Data Analy@cs
Klik om de s+jl te bewerken Klik om de models+jlen te bewerken! Tweede niveau! Derde niveau! Vierde niveau DNS Big Data Analy@cs Vijfde niveau DNS- OARC Fall 2015 Workshop October 4th 2015 Maarten Wullink,
BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES
BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES Relational vs. Non-Relational Architecture Relational Non-Relational Rational Predictable Traditional Agile Flexible Modern 2 Agenda Big Data
Big Data and Data Science: Behind the Buzz Words
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing
Peers Techno log ies Pv t. L td. HADOOP
Page 1 Peers Techno log ies Pv t. L td. Course Brochure Overview Hadoop is a Open Source from Apache, which provides reliable storage and faster process by using the Hadoop distibution file system and
HDP Hadoop From concept to deployment.
HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani
A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani Technical Architect - Big Data Syntel Agenda Welcome to the Zoo! Evolution Timeline Traditional BI/DW Architecture Where Hadoop Fits In 2 Welcome to
Large scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
and HDFS for Big Data Applications Serge Blazhievsky Nice Systems
Introduction PRESENTATION to Hadoop, TITLE GOES MapReduce HERE and HDFS for Big Data Applications Serge Blazhievsky Nice Systems SNIA Legal Notice The material contained in this tutorial is copyrighted
Real Time Big Data Processing
Real Time Big Data Processing Cloud Expo 2014 Ian Meyers Amazon Web Services Global Infrastructure Deployment & Administration App Services Analytics Compute Storage Database Networking AWS Global Infrastructure
End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ
End to End Solution to Accelerate Data Warehouse Optimization Franco Flore Alliance Sales Director - APJ Big Data Is Driving Key Business Initiatives Increase profitability, innovation, customer satisfaction,
The Future of Data Management with Hadoop and the Enterprise Data Hub
The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah Cofounder & CTO, Cloudera, Inc. Twitter: @awadallah 1 2 Cloudera Snapshot Founded 2008, by former employees of Employees
Stinger Initiative: Introduction
Stinger Initiative: Introduction Interactive Query on Hadoop Chris Harris E-Mail : [email protected] Twitter : cj_harris5 Page 1 The World of Data is Changing Data Explosion 1 Zettabyte (ZB) = 1
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected]
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
NoSQL for SQL Professionals William McKnight
NoSQL for SQL Professionals William McKnight Session Code BD03 About your Speaker, William McKnight President, McKnight Consulting Group Frequent keynote speaker and trainer internationally Consulted to
Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan [email protected]
Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan [email protected] Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.
BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand?
BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand? The Big Data Buzz big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database
How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected]
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected] Hadoop, Why? Need to process huge datasets on large clusters of computers
BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014
BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014 Ralph Kimball Associates 2014 The Data Warehouse Mission Identify all possible enterprise data assets Select those assets
I/O Considerations in Big Data Analytics
Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very
Hadoop and Eclipse. Eclipse Hawaii User s Group May 26th, 2009. Seth Ladd http://sethladd.com
Hadoop and Eclipse Eclipse Hawaii User s Group May 26th, 2009 Seth Ladd http://sethladd.com Goal YOU can use the same technologies as The Big Boys Google Yahoo (2000 nodes) Last.FM AOL Facebook (2.5 petabytes
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
Real-time Big Data Analytics with Storm
Ron Bodkin Founder & CEO, Think Big June 2013 Real-time Big Data Analytics with Storm Leading Provider of Data Science and Engineering Services Accelerating Your Time to Value IMAGINE Strategy and Roadmap
Tutorial- Counting Words in File(s) using MapReduce
Tutorial- Counting Words in File(s) using MapReduce 1 Overview This document serves as a tutorial to setup and run a simple application in Hadoop MapReduce framework. A job in Hadoop MapReduce usually
How To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI
Texas Digital Government Summit. Data Analysis Structured vs. Unstructured Data. Presented By: Dave Larson
Texas Digital Government Summit Data Analysis Structured vs. Unstructured Data Presented By: Dave Larson Speaker Bio Dave Larson Solu6ons Architect with Freeit Data Solu6ons In the IT industry for over
BIG DATA TRENDS AND TECHNOLOGIES
BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.
Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
Luncheon Webinar Series May 13, 2013
Luncheon Webinar Series May 13, 2013 InfoSphere DataStage is Big Data Integration Sponsored By: Presented by : Tony Curcio, InfoSphere Product Management 0 InfoSphere DataStage is Big Data Integration
HDP Enabling the Modern Data Architecture
HDP Enabling the Modern Data Architecture Herb Cunitz President, Hortonworks Page 1 Hortonworks enables adoption of Apache Hadoop through HDP (Hortonworks Data Platform) Founded in 2011 Original 24 architects,
Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.
Big Data Technology ดร.ช ชาต หฤไชยะศ กด Choochart Haruechaiyasak, Ph.D. Speech and Audio Technology Laboratory (SPT) National Electronics and Computer Technology Center (NECTEC) National Science and Technology
Big data blue print for cloud architecture
Big data blue print for cloud architecture -COGNIZANT Image Area Prabhu Inbarajan Srinivasan Thiruvengadathan Muralicharan Gurumoorthy Praveen Codur 2012, Cognizant Next 30 minutes Big Data / Cloud challenges
Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84
Index A Amazon Web Services (AWS), 50, 58 Analytics engine, 21 22 Apache Kafka, 38, 131 Apache S4, 38, 131 Apache Sqoop, 37, 131 Appliance pattern, 104 105 Application architecture, big data analytics
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES
THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon [email protected] [email protected] XLDB
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
IBM Big Data Platform
Mike Winer IBM Information Management IBM Big Data Platform The big data opportunity Extracting insight from an immense volume, variety and velocity of data, in a timely and cost-effective manner. Variety:
.nl ENTRADA. CENTR-tech 33. November 2015 Marco Davids, SIDN Labs. Klik om de s+jl te bewerken
Klik om de s+jl te bewerken Klik om de models+jlen te bewerken Tweede niveau Derde niveau Vierde niveau.nl ENTRADA Vijfde niveau CENTR-tech 33 November 2015 Marco Davids, SIDN Labs Wie zijn wij? Mijlpalen
Big Data? Definition # 1: Big Data Definition Forrester Research
Big Data Big Data? Definition # 1: Big Data Definition Forrester Research Big Data? Definition # 2: Quote of Tim O Reilly brings it all home: Companies that have massive amounts of data without massive
Cloudian The Storage Evolution to the Cloud.. Cloudian Inc. Pre Sales Engineering
Cloudian The Storage Evolution to the Cloud.. Cloudian Inc. Pre Sales Engineering Agenda Industry Trends Cloud Storage Evolu4on of Storage Architectures Storage Connec4vity redefined S3 Cloud Storage Use
Real World Big Data Architecture - Splunk, Hadoop, RDBMS
Copyright 2015 Splunk Inc. Real World Big Data Architecture - Splunk, Hadoop, RDBMS Raanan Dagan, Big Data Specialist, Splunk Disclaimer During the course of this presentagon, we may make forward looking
NoSQL Data Base Basics
NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS
MapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside [email protected] About Journey Dynamics Founded in 2006 to develop software technology to address the issues
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
Implement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
Hadoop and its Usage at Facebook. Dhruba Borthakur [email protected], June 22 rd, 2009
Hadoop and its Usage at Facebook Dhruba Borthakur [email protected], June 22 rd, 2009 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed on Hadoop Distributed File System Facebook
SQream Technologies Ltd - Confiden7al
SQream Technologies Ltd - Confiden7al 1 Ge#ng Big Data Done On a GPU- Based Database Ori Netzer VP Product 26- Mar- 14 Analy7cs Performance - 3 TB, 18 Billion records SQream Database 400x More Cost Efficient!
Introduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology [email protected] Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process
Oracle Database 12c Plug In. Switch On. Get SMART.
Oracle Database 12c Plug In. Switch On. Get SMART. Duncan Harvey Head of Core Technology, Oracle EMEA March 2015 Safe Harbor Statement The following is intended to outline our general product direction.
NextGen Infrastructure for Big DATA Analytics.
NextGen Infrastructure for Big DATA Analytics. So What is Big Data? Data that exceeds the processing capacity of conven4onal database systems. The data is too big, moves too fast, or doesn t fit the structures
Open source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld
Tapping into Hadoop and NoSQL Data Sources in MicroStrategy Presented by: Trishla Maru Agenda Big Data Overview All About Hadoop What is Hadoop? How does MicroStrategy connects to Hadoop? Customer Case
Big Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata
BIG DATA: FROM HYPE TO REALITY Leandro Ruiz Presales Partner for C&LA Teradata Evolution in The Use of Information Action s ACTIVATING MAKE it happen! Insights OPERATIONALIZING WHAT IS happening now? PREDICTING
