Building Recommendation Platform with Hadoop. Jayant Shekhar
1 Building Recommendation Platform with Hadoop Jayant Shekhar
2 Agenda: Why Big Data Recommendation Platform? Architecture. Design & Implementation. Common Recommendation Patterns.
3 Recommendations are one of the most common use cases for Hadoop
Recommendations, broader use cases: Product Recommendation; People/Social Recommendation; Merchant Recommendation; Content Recommendation; Query Recommendation; Sponsored Search; Advertising
Recommendations can be: Realtime (News Recommendations; Merchant/Offer Recommendations on mobile); Offline (Similar Profiles/Resumes); In Between
Recommendation patterns: People who viewed this also viewed; Content/Product Recommendation; Frequently Bought Together; Related Searches or Query Recommendation; Related Articles/News; People/Social Recommendation
4 Recommendations are delivered through: Web; Mobile; Postal mail; Newspaper/Magazine ads
Data sets involved: Targeted Items/Products/Content; Transaction Data; User Data; Logs & User Activity; Additional 3rd-party data (Geo, Social, Reviews)
Different time to action: the user would view the content or buy the product now; the user would buy the product within a week (next time he/she goes grocery shopping); the user would buy the product in the next 3 months (TV/dishwasher etc., vacation)
Also need to be able to differentiate between the users in a household
5 In Summary
Users have lots of options for most things, and their attention span is very limited.
Highly targeted recommendations stand a much better chance of conversion and give a better user experience.
Targeting recommendations requires larger datasets of items, user behavior and related data (from 3rd parties etc.).
Recommendation use cases are many, fast-changing and complex.
An agile, flexible, scalable, end-to-end platform is the way to go.
Hence we need a Big Data platform for recommendations.
6 Architecture
7 Components
A Hadoop-based Recommendation Platform has to provide the following: ETL; Feature Generation; Recommendation Algorithms/Model Generation; Workflows & Scheduling; Serving Systems; Support for Testing of the Models; Measuring Performance
8 Recommendation Platform
[Architecture diagram. Components shown: RDBMS, logs, web server logs, SAN, Sqoop, Flume, HDFS, HBase, HIVE/PIG/Impala, Mahout etc., Giraph, Oozie, ZooKeeper, Solr, Recommendation Server, web servers, real-time interactions]
9 Feature Generation & Model Building
[Diagram. Components shown: data on HDFS, HIVE/PIG/Impala, HBase (real-time analytics), Feature Generation, Model Generation, resulting models]
Hadoop enables easy iteration over the process of model generation and testing it out offline.
10 Serving Systems
[Diagram. Components shown: web servers, real-time content, Recommendation Server, HBase, Solr, docs, HDFS, pre-calculated results, features, models]
11 Components/Design
12 ETL
[Diagram. Components shown: RDBMS, logs, HTTP, SAN, Sqoop, Flume, HDFS, HBase, HIVE/PIG/Impala, Mahout etc., Oozie, ZooKeeper]
13 ETL/Flume
[Diagram. Components shown: web servers, Flume agents with interceptors and serializers, HBase, HDFS]
Flume has the concept of Source/Channel/Sink (see the Flume User Guide).
14 Flume
agent.sources.s1.interceptors.i1.type = org.apache.flume.interceptor.HostInterceptor$Builder
agent.sinks.hdfs-sink.serializer = com.cloudera.logs.flume.avro.MySerializer$Builder

agent_foo.sources = s1
agent_foo.sinks = k1
agent_foo.channels = c1
# properties of s1
agent_foo.sources.s1.type = avro
agent_foo.sources.s1.bind = localhost
agent_foo.sources.s1.port =
agent_foo.sources.s1.channels = c1
# properties of c1
agent_foo.channels.c1.type = memory
agent_foo.channels.c1.capacity = 1000
agent_foo.channels.c1.transactionCapacity = 100
# properties of k1
agent_foo.sinks.k1.type = hdfs
agent_foo.sinks.k1.hdfs.path = hdfs://analytics-1.cloudera.com/user/jayant/flume1
agent_foo.sinks.k1.hdfs.rollSize =
agent_foo.sinks.k1.hdfs.rollInterval = 60
agent_foo.sinks.k1.hdfs.rollCount = 0
agent_foo.sinks.k1.hdfs.fileType = DataStream
agent_foo.sinks.k1.hdfs.filePrefix = logs-
agent_foo.sinks.k1.hdfs.writeFormat = Text
agent_foo.sinks.k1.serializer = avro_event
agent_foo.sinks.k1.channel = c1

flume-ng agent --conf-file avro.conf --name agent_foo -Dflume.root.logger=INFO,console
flume-ng avro-client -H localhost -p -F ./error_log
15 ETL
Transform: Flume interceptors, serializers etc.; MapReduce; HIVE; PIG; Streaming. Typical transforms: sessionization, field extraction, match/dedup.
Load: data loaded onto HDFS/HIVE; data loaded into HBase; data loaded into external systems (RDBMS/NoSQL).
16 HIVE & Apache Logs
Sample log fields: [14/Jan/2012:06:25: ] "GET /example.com HTTP/1.1" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://
CREATE EXTERNAL TABLE apachelog (
  remote_ip STRING,
  request_date STRING,
  method STRING,
  request STRING,
  protocol STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) [^ ]* [^ ]* \\[([^\\]]+)\\] \"([^ ]*) ([^ ]*) ([^ \"]*)\" .*",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s"
)
STORED AS TEXTFILE
LOCATION '/user/jayant/apachelog';
SELECT remote_ip FROM apachelog;
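To make the field extraction concrete, here is a minimal Python sketch of the same idea: one regular expression with five capture groups, one per table column. The pattern assumes the common Apache access-log layout (the ident and user fields are skipped), and the sample line, including its IP address and timestamp, is invented for illustration.

```python
import re

# Assumed pattern for the five columns: remote_ip, request_date,
# method, request, protocol. The ident and user fields are skipped.
LOG_PATTERN = re.compile(
    r'([^ ]*) [^ ]* [^ ]* \[([^\]]+)\] "([^ ]*) ([^ ]*) ([^ "]*)" .*'
)

FIELD_NAMES = ("remote_ip", "request_date", "method", "request", "protocol")


def parse_line(line):
    """Return a dict of the five fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(FIELD_NAMES, match.groups()))


# Hypothetical sample line (values invented for illustration).
sample = ('192.0.2.1 - - [14/Jan/2012:06:25:03 -0800] '
          '"GET /example.com HTTP/1.1" 200 1234 "-" "Mozilla/5.0"')
fields = parse_line(sample)
```

Lines that do not match return None here, so malformed log entries can be counted or dropped explicitly.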
17 HIVE & Transforms
ADD FILE t2.pl;
INSERT OVERWRITE TABLE processed_logs
SELECT t.x, t.y, t.z FROM (
  SELECT TRANSFORM(body) USING 't2.pl' AS (x, y, z) FROM mylogs
) t;
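The slide uses a Perl script (t2.pl) whose contents are not shown. As a rough sketch of what such a transform script does, the Python version below reads one row per line and writes three tab-separated output columns, which is how Hive exchanges rows with a TRANSFORM script. The mapping from body to (x, y, z) is purely hypothetical: it takes the first three whitespace-separated tokens.

```python
import io


def transform_row(body):
    """Map one log body to (x, y, z); here simply its first three tokens."""
    tokens = body.split()
    tokens += [""] * (3 - len(tokens))   # pad rows with fewer than 3 tokens
    return tokens[:3]


def run(stream_in, stream_out):
    """Read one row per line, write tab-separated x, y, z per line."""
    for line in stream_in:
        x, y, z = transform_row(line.rstrip("\n"))
        stream_out.write("\t".join((x, y, z)) + "\n")


# In a real script this would be run(sys.stdin, sys.stdout).
out = io.StringIO()
run(io.StringIO("a b c d\nonly two\n"), out)
```

Keeping the per-row logic in its own function makes it easy to test offline before shipping the script with ADD FILE.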
18 Feature Generation
Generates the features needed for the recommendation algorithms.
Various groups can share the features and the work done by others on the platform.
Example features: daily query volume; number of transactions for a user at a merchant in the last X months; sentiment of a user for a given merchant; distance of a user from a merchant; list of merchants visited by a user in a trip; number of times the user viewed an item in the last week; items bought together.
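As an illustration of one of these features, the sketch below counts transactions per (user, merchant) pair within a look-back window. It is a single-machine stand-in for what would be a HIVE/PIG/MapReduce job on the platform; the record layout and the approximation of a month as 30 days are assumptions of this example.

```python
from collections import Counter
from datetime import date, timedelta


def transaction_counts(transactions, as_of, months):
    """Count transactions per (user, merchant) within the look-back window.

    A month is approximated as 30 days (an assumption of this sketch).
    """
    cutoff = as_of - timedelta(days=30 * months)
    counts = Counter()
    for user, merchant, day in transactions:
        if cutoff <= day <= as_of:
            counts[(user, merchant)] += 1
    return counts


# Hypothetical (user, merchant, date) records.
transactions = [
    ("u1", "m1", date(2013, 1, 10)),
    ("u1", "m1", date(2013, 2, 20)),
    ("u1", "m2", date(2012, 6, 1)),
    ("u2", "m1", date(2013, 2, 25)),
]
features = transaction_counts(transactions, as_of=date(2013, 3, 1), months=3)
```

The resulting counts can be written back to a shared table so that other groups reuse the feature instead of recomputing it.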
19 Feature Generation
Hadoop and HBase are ideal for building and sharing features offline on HDFS.
MR/HIVE/PIG/Impala/Mahout/Streaming provide extreme flexibility in writing code for feature generation.
This makes it tremendously easy to share features and results, and to iterate and innovate.
[Diagram. Components shown: data on HDFS, MR/HIVE/PIG/Streaming, HBase (real-time analytics), Data Transformation, Feature Generation, Applying ML]
20 Workflows with Oozie
Control nodes: Start, End, Fork, Join, Decision
Action nodes: MapReduce, PIG, Java/Mahout, FS, HIVE, Sqoop
oozie job -oozie -config job.properties -run
Workflows can also be triggered by time frequency and data availability.
21 ML on Hadoop
Different users prefer different sets of tools for Data Science/ML on Hadoop: Mahout; Impala/HIVE/PIG; RHIPE; RHadoop
22 Model Generation
[Diagram. Components shown: data on HDFS, Mahout/HIVE/PIG/R, Model Generation (Clustering / Collaborative Filtering), resulting models]
23 ML Tools on Hadoop
Mahout: ML algorithms in Java/MR; input as text or SequenceFiles; can write your own embedded driver programs; can control the output format with your own embedded code
PIG: RANK/SAMPLE functions; supports UDFs; has built-in math UDFs; Streaming; foreach/group by/order by
HIVE: Sampling (supports sampling bucketized tables; block sampling); hook into streaming with TRANSFORM; UDFs; group by/order by
CREATE TABLE mymodel AS SELECT fit_logistic(actual, features) model FROM obs1;
CREATE TABLE test_model AS SELECT obs1.actual, predict(features, model) AS predicted FROM obs1 JOIN model_out;
24 ML Tools on Hadoop: Streaming with PIG
trans = LOAD 'my_transactions' AS (id, item_id, dt, amt);
highdivs = STREAM trans THROUGH `calculate.pl` AS (id, score);
calculate.pl is invoked once on every map or reduce task.
25 ML Tools: R
Popular tool with data scientists and analysts.
Has numerical and visualization methods for statistics and machine learning.
Strong support for statistical analysis, predictive analytics, data mining, etc.
Use a reduce script to invoke an R script on a list of records.
RHadoop: rmr (MapReduce); rhdfs (R interface to HDFS); rhbase (R interface to HBase)
RHIPE: provides access to HDFS and MR from within the R environment; requires R, RHIPE and protobuf installed on all the data nodes.
26 ML Algorithms & Mahout
ML algorithms: Collaborative Filtering; Clustering; Classification; Pattern Mining; MR with Aggregation; Decision Tree; Content Analysis
Mahout, Collaborative Filtering: Item Based; User Based; Slope One Recommender
Mahout, Clustering: K-Means Clustering; Canopy Clustering; Fuzzy K-Means
Mahout, Classification: Logistic Regression; Naïve Bayes; Complementary Naïve Bayes; Random Forests
Mahout, Pattern Mining: Parallel FP Growth
27 Mahout - Clustering: K-Means
hadoop jar my.jar com.cloudera.clustering.featuregen.KMeansInputGenerator input output
hadoop jar org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles -i kmeans_input
mahout seq2sparse -i kmeans_input -o tfidf-vectors
hadoop jar my.jar com.cloudera.clustering.kmeans.KMeansDriver -i ../tfidf-vectors -k -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -ow
28 Item Based Collaborative Filtering
Build an item-item matrix determining relationships between pairs of items.
Using the matrix and the data on the current user, infer his/her taste.
UserID, ItemID, Rating
10, 1000, , 1001, , 1004, , 1001, , 1002, , 1003, , 1004, , 1000, , 1001, , 1002,
29 Collaborative Filtering
[Matrix figure: the item-item matrix multiplied by the user's rating vector gives the result vector R]
R value for item 1002: 1(5.0) + 2(3.0) + 2(0.0) + 1(0.0) + 1(2.5) = 13.5
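The R value for item 1002 is a dot product between that item's row of the item-item matrix and the user's rating vector. The sketch below reproduces the arithmetic with the five weights and ratings given on the slide; the item IDs behind each column are not recoverable from the transcript, so the values are treated positionally.

```python
def score(cooccurrence_row, user_ratings):
    """Dot product of one item-item matrix row with the user's rating vector."""
    return sum(c * r for c, r in zip(cooccurrence_row, user_ratings))


# Row of the item-item matrix for item 1002 and the user's ratings,
# in the same (positional) item order as on the slide.
row_1002 = [1, 2, 2, 1, 1]
ratings = [5.0, 3.0, 0.0, 0.0, 2.5]

r_1002 = score(row_1002, ratings)  # 5.0 + 6.0 + 0.0 + 0.0 + 2.5 = 13.5
```

Repeating this for every row gives the full vector R; the items the user has not rated, ranked by their R value, are the recommendation candidates.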
30 Mahout Collaborative Filtering: Item Based Collaborative Filtering
hadoop jar mahout.jar org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob --input /user/xyz/ratings --output preference -m 10
Can also be used to generate features for, say, Logistic Regression.
31 Trends, Aggregation & Counters
Most popular searches/downloads/news articles/movies/products: load results into HBase.
Use HBase where we need near-real-time (NRT) counts of things (categories/products etc.). Impala is very useful here for faster SLAs.
HBase Counters: HBase has the concept of incrementing column values, which avoids the lock row / read value / increment / write back / unlock row cycle. Great for counting specific metrics over time, for example a count per URL/product. Writes to the WAL can be disabled on puts.
32 Graph Processing with Giraph on Hadoop
Provides a more natural way to model graph problems.
One implements a Vertex, which has a value and edges and is able to send messages to and receive messages from other vertices in the graph as the computation iterates.
Uses the Bulk Synchronous Parallel (BSP) computing model: vertex-to-vertex messages sent in one superstep are delivered, and available during compute, in the next superstep.
Computation completes when all components complete.
Runs as a single map-only job, which executes the compute method for every Vertex it is assigned once per superstep.
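The superstep model can be illustrated without Giraph itself. The following plain-Python sketch propagates the maximum vertex value through a small graph: each vertex only sees messages sent in the previous superstep, and the computation stops when no messages are in flight. The function name and the graph are illustrative and are not part of Giraph's API.

```python
def run_supersteps(values, edges):
    """Propagate the maximum vertex value through a graph, BSP style.

    values: dict vertex -> initial value
    edges:  dict vertex -> list of neighbouring vertices
    Returns (final values, number of supersteps executed).
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value to its neighbours.
    inbox = {v: [] for v in values}
    for v in values:
        for n in edges.get(v, []):
            inbox[n].append(values[v])
    supersteps = 1
    while any(inbox.values()):
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in edges.get(v, []):
                    outbox[n].append(values[v])
            # Otherwise the vertex stays silent (it votes to halt).
        inbox = outbox  # messages become visible in the next superstep
        supersteps += 1
    return values, supersteps


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
final_values, steps = run_supersteps({"a": 3, "b": 6, "c": 2}, graph)
```

Because a vertex reads only its inbox, never another vertex's current value, the per-vertex compute calls within a superstep are independent and could run in parallel.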
33 Graph Processing with Giraph on Hadoop
Use cases: recommend missing links in a social network; how users are connected; clustering (find related people in groups); iterative graph ranking
34 Offline Training & Testing of Models
Hadoop provides an excellent platform to train and test out the models and various algorithms.
[Diagram: Training Set -> Train -> Model; Test Set -> Test -> Score]
35 A/B Testing
A/B testing is used to test the performance of the models online: X% of traffic goes to the new model and (100-X)% to the old model.
A/B testing involves partitioning real traffic between two models and then measuring performance against the desired result (maximize CTR, revenue, page views etc.).
The partitioning logic can get complicated. In such cases the partitions can be pre-computed offline on Hadoop and pushed to an online store.
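One simple way to partition traffic is a stable hash of the user ID, sketched below. This is an illustrative approach, not necessarily the one used by the author; the function name and the choice of MD5 as the hash are assumptions. The same function can be run offline over all users to pre-compute the assignments.

```python
import hashlib


def assign_model(user_id, new_model_percent):
    """Deterministically route a user to 'new' or 'old' based on a hash bucket."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100       # stable bucket in [0, 100)
    return "new" if bucket < new_model_percent else "old"


# Pre-compute assignments for a (hypothetical) set of users with X = 10.
users = ["user-%d" % i for i in range(1000)]
assignments = {u: assign_model(u, 10) for u in users}
```

Because the bucket depends only on the user ID, a user sees the same model on every request, which keeps the measured metrics comparable between the two groups.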
36 Serving Systems
[Diagram. Components shown: web servers, real-time content, Recommendation Server, HBase, Solr, index, docs, HDFS, pre-calculated results, features, models]
Load pre-computed results into HBase.
Models generated offline are loaded into the serving systems to compute in real time.
Cache recommendations for certain time periods and re-calculate on expiry.
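The cache-and-recompute-on-expiry step can be sketched as a small time-to-live cache in front of the model. The class and parameter names below are hypothetical, and a production serving system would more likely rely on an external cache; the injected clock exists only to make the behavior easy to test.

```python
import time


class RecommendationCache:
    """Cache recommendations per user and recompute them after ttl_seconds."""

    def __init__(self, compute, ttl_seconds, clock=time.monotonic):
        self._compute = compute          # function user_id -> recommendations
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}               # user_id -> (stored_at, recommendations)

    def get(self, user_id):
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]              # still fresh: serve from cache
        result = self._compute(user_id)  # missing or expired: recompute
        self._entries[user_id] = (now, result)
        return result


# Usage with a fake clock and a stand-in scoring function.
calls = []
fake_time = [0.0]


def score_user(user_id):
    calls.append(user_id)
    return ["item-a", "item-b"]


cache = RecommendationCache(score_user, ttl_seconds=60, clock=lambda: fake_time[0])
first = cache.get("u1")
second = cache.get("u1")     # served from cache
fake_time[0] = 61.0
third = cache.get("u1")      # expired, recomputed
```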
37 Bulk Loading into HBase
Prepare the data via a MR job using HFileOutputFormat, or use importtsv to prepare the StoreFiles from TSV files.
Import the prepared data using the completebulkload tool. If the target table does not already exist in HBase, it will be created automatically.
$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c -Dimporttsv.bulk.output=/user/xyz/myoutput <tablename> <hdfs-inputdir>
$ hadoop jar hbase-version.jar completebulkload /user/xyz/myoutput mytable
completebulkload takes the same output path where importtsv put its results.
38 Measuring Performance
Metrics: CTR; revenue; engagement
Why use Hadoop? Store all tracking events on HDFS; easier to scale to joining a high number of tracking streams; look back far in time; track behavior and performance over time; retrain a model if performance drops; measure the performance of any newly deployed model; try out new things.
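As a small worked example of one of these metrics, the sketch below computes click-through rate (clicks divided by impressions) per model from a list of tracking events. The event layout, a (model, event type) pair, is an assumption made for illustration; on the platform this would be an aggregation over the tracking events stored on HDFS.

```python
from collections import defaultdict


def ctr_by_model(events):
    """Return clicks / impressions per model for (model, kind) events."""
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for model, kind in events:
        if kind == "impression":
            impressions[model] += 1
        elif kind == "click":
            clicks[model] += 1
    # Models without impressions are skipped to avoid dividing by zero.
    return {m: clicks[m] / n for m, n in impressions.items() if n > 0}


# Hypothetical tracking events.
events = [
    ("old", "impression"), ("old", "impression"),
    ("old", "impression"), ("old", "impression"), ("old", "click"),
    ("new", "impression"), ("new", "impression"), ("new", "click"),
]
ctr = ctr_by_model(events)
```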
39 Common Recommendation Patterns
Content/Product Recommendation; People who viewed this Product/Profile also viewed; Frequently bought together; Related Searches or Query Recommendation; Related Articles/News; People/Social Recommendations
40 Content/Product Recommendation
Use cases: recommend products; recommend movies/videos. Design: Collaborative Filtering; Logistic Regression.
Frequently bought together
Use cases: find items that are frequently bought together; related searches. Design: Parallel FP Growth (~echang/recsys08-69.pdf)
41 Parallel FP Growth
mahout fpg -i core/src/test/resources/retail.dat -o patterns \
  -k 50 \
  -method mapreduce \
  -regex '[\ ]
42 People who viewed this Product also viewed
Use cases: view item page; view content (movies, videos, news)
Design: this data does not change often; record the views in a session (co-view); Collaborative Filtering
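Recording co-views per session amounts to counting, for every pair of items, how many sessions contained both. The sketch below does this in memory for a few hypothetical sessions and then ranks the items most often viewed together with a given item; at scale the same pair counting would be a batch job, with the ranked results loaded into the serving store.

```python
from collections import Counter
from itertools import combinations


def co_view_counts(sessions):
    """Count in how many sessions each unordered pair of items was co-viewed."""
    counts = Counter()
    for items in sessions:
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return counts


def also_viewed(item, counts, top_n=3):
    """Items most often co-viewed with `item`, most frequent first."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == item:
            related[b] += n
        elif b == item:
            related[a] += n
    return [other for other, _ in related.most_common(top_n)]


# Hypothetical sessions, each a list of viewed item IDs.
sessions = [["p1", "p2", "p3"], ["p1", "p2"], ["p2", "p4"]]
counts = co_view_counts(sessions)
recommended = also_viewed("p1", counts)
```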
43 Related Searches or Query Recommendation
Design: use query log data; cluster similar queries; use Parallel FP Growth to find the related searches
Query distance: based on keywords or phrases; based on searches in the same session; based on common clicked URLs; based on the distance of the clicked documents
Related Articles/News
Batch clustering with K-Means; NRT clustering using the centroids; perform canopy clustering on left-over articles
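For the "common clicked URLs" variant of query distance, one plausible choice is the Jaccard distance between the sets of URLs clicked for each query. The slide does not name a specific measure, so treat the formula below as an assumption; the URLs are made up.

```python
def click_distance(urls_a, urls_b):
    """Jaccard distance between the sets of URLs clicked for two queries.

    0.0 means identical click sets, 1.0 means no clicked URL in common.
    """
    a, b = set(urls_a), set(urls_b)
    if not a and not b:
        return 1.0   # no click evidence at all: treat the queries as unrelated
    return 1.0 - len(a & b) / len(a | b)


# Hypothetical clicked URLs for two queries.
d = click_distance(["example.com/a", "example.com/b"],
                   ["example.com/b", "example.com/c"])
```

Here the two queries share one of three distinct URLs, so the distance is 1 - 1/3. Distances like this can feed the clustering step that groups similar queries.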
44 Social/People Recommendations
Use cases: recommend missing links in a social network; bipartite matching (recommend men/women)
Design: take existing edges and friends of friends; build regression models based on latest activity; scale easily offline with Hadoop, as the number of friends of friends and activities could be very high; Giraph
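The friends-of-friends step can be sketched as candidate generation over an adjacency map: for a given user, collect everyone reachable in two hops who is not already a friend, and count the mutual friends. The graph below is hypothetical, and the mutual-friend count stands in for one of the features a regression model might consume.

```python
from collections import Counter


def friend_of_friend_candidates(graph, user):
    """Return (candidate, mutual_friend_count) pairs, most mutual friends first.

    graph: dict user -> set of friends (undirected edges stored both ways).
    """
    friends = graph.get(user, set())
    mutual = Counter()
    for friend in friends:
        for candidate in graph.get(friend, set()):
            if candidate != user and candidate not in friends:
                mutual[candidate] += 1
    return mutual.most_common()


# Hypothetical social graph.
graph = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d", "e"},
    "d": {"b", "c"},
    "e": {"c"},
}
candidates = friend_of_friend_candidates(graph, "a")
```

The candidate set grows quickly with the number of friends, which is why the slide suggests generating it offline on Hadoop or with Giraph.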
45 June 13 at SF Marriott Marquis. Call for Speakers is open until April 1. Early Bird registration open as of Monday night, and Strata attendees get a $25 discount (use code promo-strata). hbasecon.com
46 Thank you! Jayant Shekhar, Solutions Architect,
Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
KNIME & Avira, or how I ve learned to love Big Data
KNIME & Avira, or how I ve learned to love Big Data Facts about Avira (AntiVir) 100 mio. customers Extreme Reliability 500 employees (Tettnang, San Francisco, Kuala Lumpur, Bucharest, Amsterdam) Company
The Internet of Things and Big Data: Intro
The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific
The Future of Big Data SAS Automotive Roundtable Los Angeles, CA 5 March 2015 Mike Olson Chief Strategy Officer, Cofounder @mikeolson
The Future of Big Data SAS Automotive Roundtable Los Angeles, CA 5 March 2015 Mike Olson Chief Strategy Officer, Cofounder @mikeolson 1 A New Platform for Pervasive Analytics Multiple big data opportunities
Deploying Hadoop with Manager
Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer [email protected] Alejandro Bonilla / Sales Engineer [email protected] 2 Hadoop Core Components 3 Typical Hadoop Distribution
Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia
Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing
Advanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
Apache Hadoop: Past, Present, and Future
The 4 th China Cloud Computing Conference May 25 th, 2012. Apache Hadoop: Past, Present, and Future Dr. Amr Awadallah Founder, Chief Technical Officer [email protected], twitter: @awadallah Hadoop Past
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到
Up Your R Game. James Taylor, Decision Management Solutions Bill Franks, Teradata
Up Your R Game James Taylor, Decision Management Solutions Bill Franks, Teradata Today s Speakers James Taylor Bill Franks CEO Chief Analytics Officer Decision Management Solutions Teradata 7/28/14 3 Polling
Moving From Hadoop to Spark
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com [email protected] Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
Apache HBase. Crazy dances on the elephant back
Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage
Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p.
Introduction p. xvii Introduction to Big Data Analytics p. 1 Big Data Overview p. 2 Data Structures p. 5 Analyst Perspective on Data Repositories p. 9 State of the Practice in Analytics p. 11 BI Versus
Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP
Big-Data and Hadoop Developer Training with Oracle WDP What is this course about? Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools
Advanced In-Database Analytics
Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??
How To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI
SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse
SQL Server 2012 PDW Ryan Simpson Technical Solution Professional PDW Microsoft Microsoft SQL Server 2012 Parallel Data Warehouse Massively Parallel Processing Platform Delivers Big Data HDFS Delivers Scale
Using distributed technologies to analyze Big Data
Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
Oracle Big Data Spatial & Graph Social Network Analysis - Case Study
Oracle Big Data Spatial & Graph Social Network Analysis - Case Study Mark Rittman, CTO, Rittman Mead OTN EMEA Tour, May 2016 [email protected] www.rittmanmead.com @rittmanmead About the Speaker Mark
Big Graph Analytics on Neo4j with Apache Spark. Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage
Big Graph Analytics on Neo4j with Apache Spark Michael Hunger Original work by Kenny Bastani Berlin Buzzwords, Open Stage My background I only make it to the Open Stages :) Probably because Apache Neo4j
Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
Certified Big Data and Apache Hadoop Developer VS-1221
Certified Big Data and Apache Hadoop Developer VS-1221 Certified Big Data and Apache Hadoop Developer Certification Code VS-1221 Vskills certification for Big Data and Apache Hadoop Developer Certification
Best Practices for Hadoop Data Analysis with Tableau
Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks
