BIG DATA SYSTEM DEVELOPMENT: AN EMBEDDED CASE STUDY WITH A GLOBAL OUTSOURCING FIRM
1 BIG DATA SYSTEM DEVELOPMENT: AN EMBEDDED CASE STUDY WITH A GLOBAL OUTSOURCING FIRM
Prof. Hong-Mei Chen, IT Management, Shidler College of Business, University of Hawaii at Manoa, USA
Prof. Rick Kazman, IT Management, Shidler College of Business, University of Hawaii at Manoa, USA; Software Engineering Institute, Carnegie Mellon University, USA
Serge Haziyev, Olha Hrytsay, SoftServe Inc., Austin, TX, USA
2 OUTLINE: Research Motivation; Research Foundations; Research Method; Results; Future Research Directions; Conclusions
3 Big Data: Big Promise, Big Hype. "Big data is the new oil." "Big data is the new gold."
4 ??????????? HOW??
5 Challenges: 5V requirements; proliferation of big data technology; rapid big data technology changes; complexity; paradigm shifts; short history of big data system development in enterprises
6 CIO Big Data Survey (Jan. 2013): 55% of big data projects were not completed
7 Gartner Survey (Dec. 2014): Big Data Investment Grows but Deployments Remain Scarce in 2014. The hype is wearing thin. Only 13% of respondents said their IT organizations put big data projects into production this year, though that is 5 points higher than last year. 24% of those polled voted against the use of big data technologies in their business.
8 2013 was the year of experimentation and early deployment; so is 2014. 73 percent of respondents have invested or plan to invest in big data in the next 24 months, up from 64 percent in 2013. Like 2013, much of the work today revolves around strategy development and the creation of pilots and experimental projects. Note: the Gartner survey of 302 Gartner Research Circle members worldwide was conducted in June 2014.
9 Research Objectives: To help enterprises navigate through uncharted waters and be better equipped for their big data endeavors. To uncover methodological voids and provide practical guidelines.
10 Research Questions 1. How does big data system development (processes and methods) differ from small (traditional, structured) data system development? 2. How can existing software architecture approaches be extended or modified to address new requirements for big data system design? 3. How can data modeling/design methods in traditional structured database/data warehouse development be extended and integrated with architecture methods for effective big data system design?
11 Small Data System Development: ANSI standard 3-layer DBMS architecture; clear data-program independence (logical and physical data independence); well-established RAD design process (iterative design of 7 phases, clear separation of each design phase); mature conceptual design tools (ER, UML, etc.); relational model dominance (95% market share); relational model easy to understand; SQL easy to use, standardized; architecture choice is relatively simple (n-tier client-server design)
12 Data/Program Independence: ANSI 3-Layer DBMS Architecture (1980s) [figure: external, conceptual, and internal schema layers]
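To make the data-program independence idea concrete, here is a minimal sketch (illustrative schema and data, not from the case studies) of logical data independence using SQLite: an application bound to an external view keeps working even when the conceptual schema underneath it changes.

```python
import sqlite3

# In-memory database standing in for the internal/conceptual layers.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL, dept TEXT)")
con.execute("INSERT INTO employee VALUES (1, 'Ada', 90000, 'R&D'), (2, 'Lin', 80000, 'Ops')")

# External schema: a view exposes only what application programs need.
con.execute("CREATE VIEW employee_public AS SELECT id, name, dept FROM employee")

# The conceptual schema can change (add a column) without breaking
# programs written against the view -- logical data independence.
con.execute("ALTER TABLE employee ADD COLUMN hired TEXT")
rows = con.execute("SELECT name, dept FROM employee_public ORDER BY id").fetchall()
print(rows)  # [('Ada', 'R&D'), ('Lin', 'Ops')]
```

It is precisely this separation of layers that, as the following slides argue, is undone in big data development, where storage layout and program logic are entangled again.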
13 Architecture Design is Critical and Complex in Big Data System Development
I. Volume: distributed and scalable architecture
II. Variety: polyglot persistence architecture
III. Velocity: complex event processing + Lambda architecture
IV. Veracity: architecture design for understanding the data sources and the cleanliness and validation of each
V. Value: new architecture for hybrid, agile analytics, big data analytics cloud, integrating the new and the old (EDW, ETL)
VI. Integration: integrating separate architectures addressing each of the 5V challenges
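As an illustration of the Lambda architecture mentioned under Velocity, the following minimal Python sketch (the page-view events and names are hypothetical, not case data) merges a periodically recomputed batch view with an incrementally maintained speed-layer view at query time.

```python
from collections import Counter

# Hypothetical event stream: page-view events keyed by page.
batch_events  = ["home", "cart", "home", "item"]   # already absorbed by the batch layer
recent_events = ["home", "item", "item"]           # arrived after the last batch run

# Batch layer: recomputes a complete (but stale) view from the master dataset.
batch_view = Counter(batch_events)

# Speed layer: incrementally maintains a view over only the recent data.
speed_view = Counter(recent_events)

# Serving layer: answers queries by merging both views.
def page_views(page: str) -> int:
    return batch_view[page] + speed_view[page]

print(page_views("home"))  # 3
print(page_views("item"))  # 3
```

In a production system the batch view would be rebuilt by Hadoop/Spark jobs and the speed view by a stream processor such as Storm, as in the embedded cases later in the deck; the merge-at-query-time principle is the same.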
14 Research Questions 1. How does big data system development (processes and methods) differ from small (traditional, structured) data system development? 2. How can existing software architecture approaches be extended or modified to address new requirements for big data system design? 3. How can data modeling/design methods in traditional structured database/data warehouse development be extended and integrated with architecture methods for effective big data system design?
15 Research Method. Case study research is deemed suitable: system development, be it big or small data, cannot be separated from its organizational and business contexts; the research questions are "how" and "why" questions; and the research is largely exploratory. Multiple cases increase methodological rigor. Collaborative Practice Research with SSV, a firm in the outsourcing industry that has successfully deployed 10 big data projects that can be triangulated. Embedded case study.
16 Reasons for selecting an outsourcer: Outsourcing is an important and common means to realize a big data strategy. Big data professional services are the largest segment of the big data market and continue to grow. Outsourcing mitigates shortages of skills and expertise in the areas where enterprises want to grow.
17 Big Data Market Is Expected to Grow Rapidly [figure: market forecast chart]
18 Collaborative Practice Research (CPR) Steps in an Iteration: 1) Appreciate problem situation 2) Study literature 3) Develop framework 4) Evolve method 5) Action 6) Evaluate experiences 7) Exit 8) Assess usefulness 9) Elicit research results
19 Collaborative Practice Research (CPR) [figure: three CPR iterations, each cycling through the steps appreciate problem situation, study literature, develop framework, evolve method, action, evaluate experiences, exit, assess usefulness, elicit research results; iteration outcomes: ADD 2.0 (Cases 1-4), ADD 2.5 -> 3.0 (Cases 5-6), BDD (Cases 3-4, 7-10)]
20 ADD. ADD (Attribute-Driven Design) is an architecture design method "driven" by quality attribute concerns. Version 1.0 released 2000 by SEI. Version 2.0 released November 2006 (on the current SEI site). Version 2.5 published in 2013 by the researcher team. Version 3.0 to be published in 2016 by the researcher team. The method provides a detailed set of steps for architecture design, enabling design to be performed in a systematic, repeatable way leading to predictable outcomes.
21 Embedded Cases 1-3

Case 1: Network Security / Intrusion Prevention - US MNC IT corp. (employees > 320,000)
Business goals: provide the ability for security analysts to improve intrusion detection techniques; observe traffic behavior and make infrastructure adjustments; adjust company security policies; improve system performance
Start: late 2010, 8.5 months
Big data: machine-generated data, 7.5 BLN event records per day collected from IPS devices; near real-time reporting; reports touching billions of rows should generate in < 1 min
Technologies: ETL - Talend; storage/DW - InfoBright EE, HP Vertica; OLAP - Pentaho Mondrian; BI - JasperServer Pro
Challenges: high throughput; different device data schemas (versions); keeping system performance at the required level when supporting IP/geography analysis (avoid joins); keeping required performance for complex querying over billions of rows

Case 2: Anti-Spam Network Security System - US MNC networking equipment corp. (employees > 74,000)
Business goals: validation of the newly developed set of anti-spam rules against the large training set of known emails; detection of the best anti-spam rules in terms of performance and efficacy
Start: 2012, ongoing
Big data: K anti-spam rules; 5M training set; 100+ nodes in Hadoop clusters
Technologies: vanilla Apache Hadoop (HDFS, MapReduce, Oozie, Zookeeper); Perl/Python; SpamAssassin; Perceptron
Challenges: MapReduce was written in Python and Hadoop Streaming was used, and the challenge was to optimize job performance; optimal Hadoop cluster configuration to maximize performance and minimize map-reduce processing time

Case 3: Online Coupon Web Analytics Platform - US MNC: world's largest coupon site, 2014 revenue > US$200M
Business goals: in-house web analytics platform for conversion funnel analysis, marketing campaign optimization, user behavior analytics, clickstream analytics, platform feature usage analysis
Big data: 500 million visits a year; 25TB+ HP Vertica data warehouse; 50TB+ Hadoop cluster; near-real-time analytics (15 minutes supported for clickstream data)
Technologies: data lake - Amazon EMR (Hive/Hue/MapReduce/Flume/Spark); DW: HP Vertica, MySQL; ETL/data integration - custom, using Python; BI: R, Mahout, Tableau
Challenges: minimize transformation time for semi-structured data; data quality and consistency; complex data integration; fast-growing data volumes; performance issues with Hadoop MapReduce (moving to Spark)
22 Embedded Cases 4-6

Case 4: Social Marketing Analytical Platform - US MNC Internet marketing (user reviews), 2014 revenue > US$48M
Business goals: build an in-house analytics platform for ROI measurement and performance analysis of every product and feature delivered by the e-commerce platform; provide analysis of how end-users interact with service content, products, and features
Start: 2012, ongoing
Big data: volume - 45 TB; sources - JSON; throughput > 20K/sec; latency - 1 hour for static/pre-defined reports, real-time for streaming data
Technologies: Lambda architecture; Amazon AWS, S3; Apache Kafka, Storm; Hadoop - CDH 5 (HDFS for raw data, MapReduce), Cloudera Manager, Oozie, Zookeeper; HBase (2 clusters: batch views, streaming data); DB: Amazon Redshift, RDS, S3; Jaspersoft; integration: Python
Challenges: Hadoop upgrade from CDH 4 to CDH 5; data integrity and data quality; very high data throughput posed a data-loss-prevention challenge (introduced Apache Kafka as a solution); system performance for data discovery (introduced Redshift, considering Spark)

Case 5: Cloud-based Mobile App Development Platform - US private Internet co., funding > US$100M
Business goals: provide a visual environment for building custom mobile applications; charge customers by usage; analysis of platform feature usage by end-users and platform optimization
Start: 2013, 8 months
Big data: data volume > 10 TB; sources: JSON; data throughput > 10K/sec; analytics - self-service, pre-defined reports, ad hoc; data latency - 2 min
Technologies: elasticity through SDE principles; middleware: RabbitMQ, Amazon SQS, Celery; Elastic Beanstalk; Aria Subscription Billing Platform
Challenges: constraints - public cloud, multi-tenant schema extensibility; minimizing TCO; achieving high data compression without significant performance degradation was quite challenging; technology selection: performance benchmarks and price comparison of Redshift vs. HP Vertica vs. Amazon RDS

Case 6: Telecom E-tailing Platform - Russian mobile phone retailer, 2014 revenue: 108B rubles
Business goals: build an omni-channel platform to improve sales and operations; analyze all enterprise data from multiple sources for real-time recommendations and sales
Start: end of 2013 (discovery only)
Big data: analytics on 90+ TB (30+ TB structured, 60+ TB unstructured and semi-structured data)
Technologies: Hadoop (HDFS, Hive, HBase); Cassandra; HP Vertica/Teradata; MicroStrategy/Tableau
Challenges: data volume for real-time analytics; data variety - data science over data in different formats from multiple sources; elasticity - private cloud, Hadoop as a service with autoscale capabilities
23 Embedded Cases 7-10

Case 7: Social Relationship Marketing Platform - US private Internet co., funding > US$100M
Business goals: build a social relationship platform that allows enterprise brands and organizations to manage, monitor, and measure their social media programs; build an analytics module to analyze and measure results
Start: ongoing (redesign of a 2009 system)
Big data: > one billion social connections across 84 countries; 650 million pieces of social content per day; MySQL (~11 TB), Cassandra (~6 TB), ETL (> 8 TB per day)
Technologies: Cassandra; MySQL; Elasticsearch; SaaS BI platform - GoodData; Clover ETL, custom in Java, PHP; Amazon S3, Amazon SQS; RabbitMQ
Challenges: minimize data processing time (ETL); implement incremental ETL, processing and uploading only the latest data

Case 8: Web Analytics & Marketing Optimization - US MNC IT consulting co. (employees > 430,000)
Business goals: optimization of all web, mobile, and social channels; optimization of recommendations for each visitor; high return on online marketing investments
Start: 2014, ongoing (system redesign)
Big data: data volume > 1 PB; 5-10 GB per customer/day; data sources - clickstream data, web-server logs
Technologies: vanilla Apache Hadoop (HDFS, MapReduce, Oozie, Zookeeper); Hadoop/HBase; Aster Data; Oracle; Java/Flex/JavaScript
Challenges: Hive performance for analytics queries; difficult to support real-time scenarios for ad hoc queries; data consistency between two layers - raw data in Hadoop and aggregated data in a relational DW; complex data transformation jobs

Case 9: Network Monitoring & Management Platform - US OSS vendor, revenue > US$22M
Business goals: build a tool to monitor network availability, performance, events, and configuration; integrate data storage and collection processes with one web-based user interface; IT as a service
Start: 2014, ongoing (redesign of a 2006 system)
Big data: data collected in large datacenters (each: gigabytes to terabytes); real-time data analysis and monitoring (< 1 minute); hundreds of device types
Technologies: MySQL; RRDtool; HBase; Elasticsearch
Challenges: high memory consumption of HBase when deployed in single-server mode; resource optimization - extending/fixing open source frameworks, 90% utilization ratio

Case 10: Healthcare Insurance Operation Intelligence - US health plan provider, employees > 4,500, revenue > US$10B
Business goals: operation cost optimization for 3.4 million members; track anomaly cases (e.g. control of schedule 1 and 2 drugs, refill status control); collaboration tool between 65,000 providers
Start: 2014, phase 1: 8 months, ongoing
Big data: velocity - 10K+ events per second; complex event processing - pattern detection, enrichment, projection, aggregation, join; high scalability, high availability, fault tolerance
Technologies: AWS VPC; Apache Mesos, Apache Marathon, Chronus; Cassandra; Apache Storm; ELK (Elasticsearch, Logstash, Kibana); Netflix Exhibitor; Chef
Challenges: technology selection constrained by HIPAA compliance - SQS (selected) vs. Kafka; constraints: AWS, HIPAA
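Case 10's complex event processing centers on pattern detection over high-velocity streams. A minimal Python sketch of one such pattern, a burst detector over a sliding time window (the event data, window size, and threshold are hypothetical, not from the case):

```python
from collections import deque

# Detect when `threshold` or more error events arrive within a sliding
# time window of `window_sec` seconds (events ordered by timestamp).
def detect_bursts(events, window_sec=60, threshold=3):
    recent = deque()          # timestamps of error events inside the window
    alerts = []
    for ts, kind in events:
        if kind != "error":
            continue
        recent.append(ts)
        while recent and ts - recent[0] > window_sec:
            recent.popleft()  # expire events that left the window
        if len(recent) >= threshold:
            alerts.append(ts)
    return alerts

events = [(0, "error"), (10, "ok"), (20, "error"), (30, "error"), (200, "error")]
print(detect_bursts(events))  # [30]
```

Engines like Apache Storm distribute this kind of windowed stateful logic across a cluster; the sliding-window bookkeeping per key is the core idea.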
24 RESULTS: Big Data System Development Framework; Big Data System Design (BDD) method
25 BDD Framework [figure]
26 BDD Framework
1. New development process: data-program independence undone.
2. Futuring: big data scenario generation for innovation utilizing the Eco-Arch method (Chen & Kazman, 2012).
3. Architecture design integrated with new big data modeling techniques: extended DFD (BD-DFD), big data architecture template, transformation rules.
4. Extended architecture design method ADD 2.0 (by CMU SEI) to ADD 3.0, then to BDD.
5. Use of design concepts databases (reference architectures, frameworks, platforms, architectural and deployment patterns, tactics, data models) and a technology catalogue with quality attribute ratings.
6. Adding architecture evaluation, BITAM (Business and IT Alignment Model), for risk analysis and for ensuring alignment with business goals and innovation desires. BITAM (Chen et al. 2005, 2010) extended ATAM.
27 ECO-ARCH Method (Chen & Kazman, 2012) [figure]
28 ECO-ARCH Method (Chen & Kazman, 2012) [figure]
29 Big Data Architecture Design: Data Element Template
A scenario description includes six elements: source, stimuli, environment, artifacts, response, response metrics.
1) Data sources: what data are used in the scenario, and where are they generated? Answer the questions below for each source.
2) Data source quality: is this data trustworthy? How accurately does it represent the real-world element it stands for (e.g., how was a temperature reading taken)?
3) Data content format: structured, semi-structured, unstructured? Specify subtypes.
4) Data velocity: at what speed and frequency is the data generated/ingested?
5) Data volume and frequency: what is the volume and frequency of the data?
6) Data time to live (TTL): how long will the data live during processing?
7) Data storage: what is the volume and frequency of the generated data that needs to be stored?
8) Data life: how long does the data need to be kept in storage (historical storage/time series or legal requirements)?
9) Data access type: OLTP (transactional), OLAP (aggregates-based), OLCP (advanced analytics)?
10) Data queries/reports, by whom: what questions are asked about the data, and by whom? What reports (real time, minutes, days, monthly)?
11) Access pattern: read-heavy, write-heavy, or balanced?
12) Data read/write frequency: how often is the data read and written?
13) Data response requirements: how fast do data queries need to respond?
14) Data consistency and availability requirements: ACID or BASE (strong, medium, weak)?
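The template's questions can be captured as one record per data source. A hypothetical Python encoding (the field names and the sample values are illustrative, not prescribed by the method):

```python
from dataclasses import dataclass

# One record per data source in a scenario, mirroring the template's questions.
@dataclass
class DataElement:
    source: str             # 1) where the data is generated
    trustworthy: bool       # 2) source quality
    content_format: str     # 3) structured / semi-structured / unstructured
    velocity: str           # 4) generation/ingest speed and frequency
    volume_per_day: str     # 5, 7) volume to process and store
    ttl: str                # 6) time-to-live during processing
    retention: str          # 8) how long kept in storage
    access_type: str        # 9) OLTP / OLAP / OLCP
    access_pattern: str     # 11) read-heavy / write-heavy / balanced
    response_target: str    # 13) required query latency
    consistency: str        # 14) ACID or BASE

clickstream = DataElement(
    source="web clickstream", trustworthy=True, content_format="semi-structured",
    velocity="continuous", volume_per_day="2 TB", ttl="15 min", retention="2 years",
    access_type="OLAP", access_pattern="write-heavy", response_target="< 1 min",
    consistency="BASE")
print(clickstream.access_type)  # OLAP
```

Filling one such record per source makes the downstream architecture choices (polyglot persistence, batch vs. streaming) traceable to explicit data characteristics.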
30 Technology Catalogue: Topology [figure]
31 Ratings on Quality Attributes [figure]
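One way a catalogue with quality-attribute ratings can drive technology selection is weighted scoring. The sketch below is illustrative only: the technologies, attributes, and ratings are invented for the example, not taken from the actual catalogue.

```python
# Hypothetical catalogue entries with quality-attribute ratings (1-5).
catalogue = {
    "HBase":     {"write_throughput": 5, "ad_hoc_query": 2, "consistency": 4},
    "Cassandra": {"write_throughput": 5, "ad_hoc_query": 2, "consistency": 3},
    "Vertica":   {"write_throughput": 3, "ad_hoc_query": 5, "consistency": 5},
}

# Rank candidates by the weighted sum of their ratings for a given scenario.
def rank(weights):
    scores = {tech: sum(weights[q] * r for q, r in ratings.items())
              for tech, ratings in catalogue.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Analytics-heavy scenario: ad-hoc querying matters most.
print(rank({"write_throughput": 1, "ad_hoc_query": 3, "consistency": 1}))
# ['Vertica', 'HBase', 'Cassandra']
```

Changing the weights to favor write throughput flips the ranking, which is the point of keeping ratings in a catalogue: technology choice becomes a repeatable, scenario-driven calculation rather than a one-off debate.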
32 BITAM (Business-IT Alignment Model)
1) Business Model: drivers, strategies, revenue streams, investments, constraints, regulations
2) Business Architecture: applications, business processes, workflow, data flow, organization, skills
3) IT Architecture: hardware, software, networks, components, interfaces, platforms, standards
(Chen, Kazman, & Garg, 2005)
33 Work-in-Progress/Future Research
1. Prototyping vs. architecture analysis
2. Eco-Arch extension: more case studies
3. Decision support system (DSS) for knowledge-based big data technology selection
4. Automation of big data technology cataloguing
5. New big data design patterns for hybrid environments
6. Conceptual design for NoSQL data modeling
7. Metadata management for big data
8. Neo-Metropolis Model: BDaaS, etc.
34 Conclusions (1) 1. The CPR approach balances rigor and relevance. 2. The BDD framework describes a new process of big data system development, which is dramatically different from small data system development, reflecting the paradigm shifts required for big data system development. 3. Paradigm shifts and complexity in big data management underscore the importance of an architecture-centric design approach.
35 Conclusions (2) 4. The BDD method is the first attempt to extend both architecture design methods and data modeling techniques for big data system design and to integrate them in one method for design efficiency and effectiveness. 5. The BDD method focuses on futuring for innovation. 6. BDD advances ADD 2.0 to ADD 3.0. 7. The BDD method embodies best practice in complexity mitigation by utilizing quality-attribute-driven design strategies, reference architectures, a technology catalogue (with ratings), and other design concepts databases for knowledge-based design and agile orchestration of technology.
36 Implications: Disruptive Innovation; Management; Software Engineering Education
37 MAHALO & ALOHA!!!
Modernizing Your Data Warehouse for Hadoop
Modernizing Your Data Warehouse for Hadoop Big data. Small data. All data. Audie Wright, DW & Big Data Specialist [email protected] O 425-538-0044, C 303-324-2860 Unlock Insights on Any Data Taking
QLIKVIEW INTEGRATION TION WITH AMAZON REDSHIFT John Park Partner Engineering
QLIKVIEW INTEGRATION TION WITH AMAZON REDSHIFT John Park Partner Engineering June 2014 Page 1 Contents Introduction... 3 About Amazon Web Services (AWS)... 3 About Amazon Redshift... 3 QlikView on AWS...
An Integrated Analytics & Big Data Infrastructure September 21, 2012 Robert Stackowiak, Vice President Data Systems Architecture Oracle Enterprise
An Integrated Analytics & Big Data Infrastructure September 21, 2012 Robert Stackowiak, Vice President Data Systems Architecture Oracle Enterprise Solutions Group The following is intended to outline our
Agile Business Intelligence Data Lake Architecture
Agile Business Intelligence Data Lake Architecture TABLE OF CONTENTS Introduction... 2 Data Lake Architecture... 2 Step 1 Extract From Source Data... 5 Step 2 Register And Catalogue Data Sets... 5 Step
Why Big Data in the Cloud?
Have 40 Why Big Data in the Cloud? Colin White, BI Research January 2014 Sponsored by Treasure Data TABLE OF CONTENTS Introduction The Importance of Big Data The Role of Cloud Computing Using Big Data
Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations
Beyond Lambda - how to get from logical to physical Artur Borycki, Director International Technology & Innovations Simplification & Efficiency Teradata believe in the principles of self-service, automation
The Internet of Things and Big Data: Intro
The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific
Data Warehouse design
Data Warehouse design Design of Enterprise Systems University of Pavia 10/12/2013 2h for the first; 2h for hadoop - 1- Table of Contents Big Data Overview Big Data DW & BI Big Data Market Hadoop & Mahout
Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.
Beyond Web Application Log Analysis using Apache TM Hadoop A Whitepaper by Orzota, Inc. 1 Web Applications As more and more software moves to a Software as a Service (SaaS) model, the web application has
How to Enhance Traditional BI Architecture to Leverage Big Data
B I G D ATA How to Enhance Traditional BI Architecture to Leverage Big Data Contents Executive Summary... 1 Traditional BI - DataStack 2.0 Architecture... 2 Benefits of Traditional BI - DataStack 2.0...
Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April 9 2013
Integrating Hadoop Into Business Intelligence & Data Warehousing Philip Russom TDWI Research Director for Data Management, April 9 2013 TDWI would like to thank the following companies for sponsoring the
Introducing Oracle Exalytics In-Memory Machine
Introducing Oracle Exalytics In-Memory Machine Jon Ainsworth Director of Business Development Oracle EMEA Business Analytics 1 Copyright 2011, Oracle and/or its affiliates. All rights Agenda Topics Oracle
Executive Summary... 2 Introduction... 3. Defining Big Data... 3. The Importance of Big Data... 4 Building a Big Data Platform...
Executive Summary... 2 Introduction... 3 Defining Big Data... 3 The Importance of Big Data... 4 Building a Big Data Platform... 5 Infrastructure Requirements... 5 Solution Spectrum... 6 Oracle s Big Data
An Integrated Big Data & Analytics Infrastructure June 14, 2012 Robert Stackowiak, VP Oracle ESG Data Systems Architecture
An Integrated Big Data & Analytics Infrastructure June 14, 2012 Robert Stackowiak, VP ESG Data Systems Architecture Big Data & Analytics as a Service Components Unstructured Data / Sparse Data of Value
Hadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
Comprehensive Analytics on the Hortonworks Data Platform
Comprehensive Analytics on the Hortonworks Data Platform We do Hadoop. Page 1 Page 2 Back to 2005 Page 3 Vertical Scaling Page 4 Vertical Scaling Page 5 Vertical Scaling Page 6 Horizontal Scaling Page
Data Integration Checklist
The need for data integration tools exists in every company, small to large. Whether it is extracting data that exists in spreadsheets, packaged applications, databases, sensor networks or social media
SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON
SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON 2 The V of Big Data Velocity means both how fast data is being produced and how fast the data must be processed to meet demand. Gartner The emergence
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.
Big Data Technology ดร.ช ชาต หฤไชยะศ กด Choochart Haruechaiyasak, Ph.D. Speech and Audio Technology Laboratory (SPT) National Electronics and Computer Technology Center (NECTEC) National Science and Technology
Trafodion Operational SQL-on-Hadoop
Trafodion Operational SQL-on-Hadoop SophiaConf 2015 Pierre Baudelle, HP EMEA TSC July 6 th, 2015 Hadoop workload profiles Operational Interactive Non-interactive Batch Real-time analytics Operational SQL
Cloud Ready Data: Speeding Your Journey to the Cloud
Cloud Ready Data: Speeding Your Journey to the Cloud Hybrid Cloud first Born to the cloud 3 Am I part of a Cloud First organization? Am I part of a Cloud First agency? The cloud applications questions
Performance and Scalability Overview
Performance and Scalability Overview This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Analytics platform. PENTAHO PERFORMANCE ENGINEERING
BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014
BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014 Ralph Kimball Associates 2014 The Data Warehouse Mission Identify all possible enterprise data assets Select those assets
Cisco IT Hadoop Journey
Cisco IT Hadoop Journey Srini Desikan, Program Manager IT 2015 MapR Technologies 1 Agenda Hadoop Platform Timeline Key Decisions / Lessons Learnt Data Lake Hadoop s place in IT Data Platforms Use Cases
Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time?
Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time? Kai Wähner [email protected] @KaiWaehner www.kai-waehner.de Disclaimer! These opinions are my own and do not necessarily
Transforming the Telecoms Business using Big Data and Analytics
Transforming the Telecoms Business using Big Data and Analytics Event: ICT Forum for HR Professionals Venue: Meikles Hotel, Harare, Zimbabwe Date: 19 th 21 st August 2015 AFRALTI 1 Objectives Describe
A Big Data Storage Architecture for the Second Wave David Sunny Sundstrom Principle Product Director, Storage Oracle
A Big Data Storage Architecture for the Second Wave David Sunny Sundstrom Principle Product Director, Storage Oracle Growth in Data Diversity and Usage 1.8 Zettabytes of Data in 2011, 20x Growth by 2020
Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014
Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014 Defining Big Not Just Massive Data Big data refers to data sets whose size is beyond the ability of typical database software tools
Ganzheitliches Datenmanagement
Ganzheitliches Datenmanagement für Hadoop Michael Kohs, Senior Sales Consultant @mikchaos The Problem with Big Data Projects in 2016 Relational, Mainframe Documents and Emails Data Modeler Data Scientist
Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop
1 Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop 2 Pivotal s Full Approach It s More Than Just Hadoop Pivotal Data Labs 3 Why Pivotal Exists First Movers Solve the Big Data Utility Gap
SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera
SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP Eva Andreasson Cloudera Most FAQ: Super-Quick Overview! The Apache Hadoop Ecosystem a Zoo! Oozie ZooKeeper Hue Impala Solr Hive Pig Mahout HBase MapReduce
Big Data: What You Should Know. Mark Child Research Manager - Software IDC CEMA
Big Data: What You Should Know Mark Child Research Manager - Software IDC CEMA Agenda Market Dynamics Defining Big Data Technology Trends Information and Intelligence Market Realities Future Applications
Data Virtualization for Agile Business Intelligence Systems and Virtual MDM. To View This Presentation as a Video Click Here
Data Virtualization for Agile Business Intelligence Systems and Virtual MDM To View This Presentation as a Video Click Here Agenda Data Virtualization New Capabilities New Challenges in Data Integration
AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW
AGENDA What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story Hadoop PDW Our BIG DATA Roadmap BIG DATA? Volume 59% growth in annual WW information 1.2M Zetabytes (10 21 bytes) this
Data processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
