Big Data for Big Science. Bernard Doering, Business Development, EMEA Big Data Software





Internet of Things
40 zettabytes of data will be generated worldwide in 2020 (1)
2.8 zettabytes of data were generated worldwide in 2012 (1)
Smart clients: richer user experiences
Intelligent cloud: richer data to analyze
Intelligent things: richer data from devices
Sources: (1) IDC Digital Universe 2020, (2) IDC

Transformative Forces in Computing Science
HPC: enabling exascale (10^18) computing on massive data sets
Cloud: helping enterprises build open, interoperable clouds
Open source: contributing code and fostering the ecosystem

Intel Distribution for Apache Hadoop* software
Hardware-enhanced and optimised for industry-leading performance and security
Strengthens the Apache Hadoop* ecosystem

Intel Distribution for Apache Hadoop* v3.0
Intel Manager for Apache Hadoop software: deployment, configuration, monitoring, alerts, and security
Connectors: ingest, analysis, visual
Sqoop 1.4.1: data exchange
Flume 1.3.0: log collector
ZooKeeper 3.4.5: coordination
Oozie 3.3.0: workflow
Pig 0.9.2: scripting
Mahout 0.7: machine learning
HCatalog: metadata
Hive 0.10.0: SQL query
HBase 0.96.1: columnar store
YARN (MRv2): distributed processing framework
HDFS: Hadoop Distributed File System

Project Gryphon: SQL on Hadoop from Intel
INTEL CONFIDENTIAL

Deploying SQL applications on Hadoop
Problem statement:
HiveQL currently accepts only a small subset of SQL as valid queries
Current approaches to enabling SQL on Hadoop provide incomplete SQL
Enterprises need open source coverage and real-time performance of analytic SQL queries on Hadoop
Diagram labels: SQL-92, HiveQL; MapReduce, Hive, HBase, HDFS, data nodes
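
To make the coverage gap concrete, here is a minimal sketch of an analytic query of the kind SQL-92 permits: an aggregate filtered by a scalar subquery. The slide does not say which constructs HiveQL rejects, so the subquery is only an assumed example. The table and column names are hypothetical, and SQLite stands in for an engine with broader SQL coverage; it is not part of the Hadoop stack described here.

```python
import sqlite3


def above_average_subscribers():
    """Return subscribers whose total minutes exceed the average call length."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical call records, used only to have something to query.
    conn.executescript(
        """
        CREATE TABLE calls (subscriber TEXT, minutes INTEGER);
        INSERT INTO calls VALUES ('a', 10), ('a', 30), ('b', 5), ('c', 50);
        """
    )
    rows = conn.execute(
        """
        SELECT subscriber, SUM(minutes) AS total
        FROM calls
        GROUP BY subscriber
        HAVING SUM(minutes) > (SELECT AVG(minutes) FROM calls)
        ORDER BY subscriber
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(above_average_subscribers())
```

An engine that accepts only a subset of SQL would need such a query rewritten by hand, for example into a join against a pre-aggregated table.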

Introducing Project Gryphon: Panthera meets Phoenix
Enables full SQL-92 coverage for OLAP applications on Hadoop, with Hive as the execution back end
Enables low-latency SQL queries on HBase, with a more efficient storage engine and better-performing JDBC drivers
Enables real-time SQL using the HBase coprocessor framework and several Hive query optimizations
Is open source under the ASL license

Intel Distribution for Apache Hadoop* software
Hardware-enhanced; open platform; enables partner analytics
Performance; management; security

Backed by a portfolio of datacenter products
Software; Cache Acceleration Software; server; storage and memory; network

Intel portfolio delivers balanced performance
Shown to improve a 1-terabyte sort from more than 4 hours to about 7 minutes
Baseline (>4 hours): Intel Xeon 5690, 7200 HDD, 1GbE adapter
Intel Xeon processor: ~50% improved
Intel SSD 520 Series: ~80% improved
Intel 10GbE adapters: ~50% improved
Intel Distribution for Apache Hadoop* software: ~40% improved
Result: ~7 minutes
Other brands and names are the property of their respective owners.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Source: Intel internal testing. For more information go to: intel.com/performance
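
The percentages on this slide appear to compound. A small sketch of the arithmetic follows, assuming each "improved" figure means a cut in the remaining sort time and that the baseline is exactly 4 hours; the slide defines neither, so both are assumptions.

```python
def compound_reductions(baseline_minutes, reductions):
    """Apply each fractional runtime reduction to the time left after the previous one."""
    remaining = baseline_minutes
    steps = []
    for name, fraction in reductions:
        remaining *= 1.0 - fraction
        steps.append((name, remaining))
    return steps


# Figures taken from the slide; reading them as runtime reductions is an assumption.
SLIDE_REDUCTIONS = [
    ("Intel Xeon processor", 0.50),
    ("Intel SSD 520 Series", 0.80),
    ("Intel 10GbE adapters", 0.50),
    ("Intel Distribution for Apache Hadoop software", 0.40),
]

if __name__ == "__main__":
    baseline = 4 * 60  # minutes
    for name, minutes in compound_reductions(baseline, SLIDE_REDUCTIONS):
        print(f"{name}: {minutes:.1f} min remaining")
    final = compound_reductions(baseline, SLIDE_REDUCTIONS)[-1][1]
    print(f"overall speedup: {baseline / final:.1f}x")
```

Under these assumptions the sort time falls from 240 to about 7.2 minutes, a speedup of roughly 33x, which is consistent with the "~7 minutes" result quoted on the slide.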

Why Intel for Hadoop?
Transparent encryption in Hive, Pig, MapReduce, HDFS
Up to 20x faster encryption/decryption with Intel AES-NI (1)
Up to 30x faster TeraSort with Xeon, SSD, 10GbE (1)
Up to 8.5x faster queries in Hive* and HBase (1)
Support for the Lustre* file system
(1) Based on internal testing. * Trademarks belong to others.

Why Hadoop* + Lustre*?
As HPC moves to exascale, bigger simulations require better tools for analytics
Hadoop* is the de facto software platform for big data analytics, but HDFS* expects compute nodes with direct-attached storage
HPC clusters have decoupled storage and compute nodes
Lustre* is the file system of choice for most HPC clusters
Lustre* is POSIX compliant, so Hadoop can use the Java native file system
Lustre* as the single storage platform for HPC and analytics is easier to manage

Use Cases

Computing Sciences to make a better world
Government & Research; Commerce & Industry; New Users & New Uses
"My goal is simple. It is a complete understanding of the universe, why it is as it is, and why it exists at all." (Stephen Hawking)
Better products, faster time to market, reduced R&D
From diagnosis to personalized treatments quickly
Genomics, clinical information, basic science
Business transformation, data-driven discovery
Transform data into useful knowledge

Computing Science to help save lives

Data-Driven Discovery
Hypothesis formation; modeling and prediction
Life sciences: drug discovery, treatment optimization. Data: genome data, EMR, clinical trials
Physical sciences: astronomy, particle physics. Data: sensor data, images, simulation data
Social sciences: public policy, trend analysis. Data: census data, text, A/V, surveys

Data-Driven Discovery in Science
1 human genome = 1 petabyte
Finding patterns in clinical and genome data at scale can help cure cancer and other diseases.

Reducing the Cost of Human Genome Sequencing
(Chart: sequencing cost on a logarithmic axis from $100,000,000 down to $1,000, plotted for 2001 to 2013.)
Source: National Human Genome Research Project

Data-Intensive Discovery: Genomics
Value: enable researchers to discover biomarkers and drug targets by correlating genomic data sets
Analytics: provide curated data sets with pre-computed analysis (classification, correlation, biomarkers); provide APIs for applications to combine and analyze public and private data sets
Data management (Intel Distribution): use Hive and Hadoop for query and search; dynamically partition and scale HBase

Computing with Hadoop to make a better world
Government & Research
80,000 scientific documents: more than any doctor can read or analyse
Mahout library for analytics
Data stored on HDFS
EU project with leading universities and research hospitals

Data-Driven Business
Data value: product innovation, market insight
Data analysis: customer service, network optimization, business efficiency, behavior modeling, fraud analytics, client engagement
Data management: content, CDR, IP traffic, product, shop, customer behavior, transactions
Industries: telco, retail, FSI

Enterprise Data Store with Hadoop
Value: 300 million wireless subscribers; enable subscriber access to billing data; 30x gain in performance; lower TCO
Analytics (subscriber self-service): provides real-time retrieval of 6 months of data; supports new BI with 15 types of queries; enables targeted ad serving and promotions
Data management (CDR): use Hadoop/HBase for search and analysis; 30 TB/month of billing data; 300K reads/second; 800K inserts/second; 133-node cluster with Intel Xeon E5 processors
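
A rough sizing sketch based on the figures above. It assumes the 6 months of retrievable data accumulate at the stated 30 TB/month and that load spreads evenly over the 133 nodes; neither is stated on the slide, and replication and compression are ignored.

```python
def retained_terabytes(tb_per_month, months):
    """Raw volume held online for the retention window, before replication."""
    return tb_per_month * months


def per_node_rate(cluster_ops_per_second, nodes):
    """Average operations per second per node, assuming perfectly even load."""
    return cluster_ops_per_second / nodes


if __name__ == "__main__":
    print(f"online billing data: {retained_terabytes(30, 6)} TB")
    print(f"reads per node: {per_node_rate(300_000, 133):.0f}/s")
    print(f"inserts per node: {per_node_rate(800_000, 133):.0f}/s")
```

On these assumptions the cluster keeps about 180 TB of raw billing data online, and each node averages on the order of 2,300 reads and 6,000 inserts per second.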

Intel IT Big Data Platform Components
MPP* platform: 3rd-party solution; 100x faster than traditional systems; Intel Xeon processor E7 family; blades scale easily
Predictive analytics engine: in-house development; enables real-time, ongoing predictive service; Intel Xeon processor E7 family
Intel Distribution of Hadoop: based on Apache Hadoop; optimized for Intel Xeon processors, SSD and 10GbE (up to 20x performance boost); distributed file system that can scale linearly; HBase NoSQL DB

Big Data in Action at Intel
Test time reduction: predictive analytics in manufacturing to identify failing parts. Improve quality and increase yield. Expected to save ~$200M in 2013.
Malware detection: analyzing ~4B access events per day at the system, network, and application levels to discover new malware threats before they arise. Reduce and prevent network intrusion.

Data-Rich Communities: Smart City
Value: enforce traffic laws and detect license fraud; monitor and predict traffic patterns; in a city of 31 million people
Analytics (detection, prevention): detect traffic law violations automatically; detect driver license fraud by data mining; forecast traffic with predictive analytics
Data management (regional, local): 30,000 cameras; 6 Mb/s stream rate per camera; 15 PB of images in active use; 2 billion records in HBase
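
The camera figures imply a substantial sustained ingest rate. The sketch below works it out, assuming that "Mb/s" means megabits per second, that every camera streams continuously at the full rate, and that decimal units apply (1 PB = 10^9 MB). These are assumptions for illustration, not details from the slide.

```python
SECONDS_PER_DAY = 86_400


def aggregate_megabits_per_second(cameras, mbit_per_camera):
    """Total stream rate across all cameras, in megabits per second."""
    return cameras * mbit_per_camera


def petabytes_per_day(total_mbit_per_second):
    """Convert a sustained megabit-per-second rate to decimal petabytes per day."""
    megabytes_per_second = total_mbit_per_second / 8
    return megabytes_per_second * SECONDS_PER_DAY / 1e9


if __name__ == "__main__":
    rate = aggregate_megabits_per_second(30_000, 6)
    daily = petabytes_per_day(rate)
    print(f"aggregate rate: {rate / 1000:.0f} Gb/s")
    print(f"ingest: {daily:.3f} PB/day")
    print(f"15 PB corresponds to about {15 / daily:.1f} days of full-rate video")
```

Under these assumptions the cameras produce about 180 Gb/s, or close to 2 PB per day, so 15 PB in active use would correspond to roughly a week of full-rate footage.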

Driving innovation with big data analytics
A European car manufacturer uses big data analytics to predict machine failure and build faster and safer cars. Data is collected from sensors and CPUs embedded in the cars, and signals are sent to the big data cloud for analysis. The manufacturer predicts growth to >30 PB by 2015 and ~300 PB by 2018.
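
The growth forecast can be expressed as an annual rate. This is a minimal sketch that takes the two figures as exactly 30 PB in 2015 and 300 PB in 2018 and assumes steady compound growth between them; the slide gives only approximate bounds.

```python
def annual_growth_factor(start_volume, end_volume, start_year, end_year):
    """Constant yearly multiplier that takes start_volume to end_volume."""
    years = end_year - start_year
    return (end_volume / start_volume) ** (1 / years)


if __name__ == "__main__":
    factor = annual_growth_factor(30, 300, 2015, 2018)
    print(f"data volume multiplies by about {factor:.2f}x per year")
```

A tenfold increase over three years works out to slightly more than a doubling every year.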

With strong support from strategic partners *Other brands and names are the property of their respective owners.

Match methods to data
Poly-structured data: Hadoop + NoSQL
Structured data: relational databases
Next-gen analytics
*Other brands and names are the property of their respective owners.

CERN is Big Data

Data-Driven Discovery in Science
CERN: 600 million collisions/sec
Detecting 1 in 1 trillion events to help find the Higgs boson
What else is possible?
OpenLab with Intel: Intel Distribution for Apache Hadoop?
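
Combining the two figures gives a feel for how rare the events of interest are. The sketch assumes the "1 in 1 trillion" ratio applies directly to the 600 million collisions per second, which the slide implies but does not state.

```python
def seconds_between_events(collisions_per_second, one_in):
    """Average wait between events that occur once per `one_in` collisions."""
    events_per_second = collisions_per_second / one_in
    return 1 / events_per_second


if __name__ == "__main__":
    wait = seconds_between_events(600e6, 1e12)
    print(f"about one event of interest every {wait / 60:.0f} minutes")
```

On that reading, an event of interest turns up roughly once every half hour amid hundreds of millions of collisions per second.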

Bringing Hadoop* MapReduce to Lustre* Data
Hadoop* Adaptor for Lustre*
Available with Intel Distribution of Apache Hadoop* software 3.0
Based on YARN (Apache Hadoop 2.x)
Packaged as a single Java* library (JAR)
Easy to deploy with minor changes
No change in the way jobs are submitted
Architecture: Hadoop compute nodes, InfiniBand interconnect, Lustre storage nodes

Addressing the HPC Big Data Challenge
Intel HPC Distribution for Apache Hadoop* Software
Intel Manager for Hadoop* Software: deployment, configuration, monitoring, alerting, and security
Intel Manager for Lustre* Software
Sqoop: data exchange
Flume: log collector
ZooKeeper: coordination
Oozie: workflow
Pig: scripting
Mahout: machine learning
R connectors: statistics
Hive: SQL query
HBase: columnar storage
YARN (MRv2): distributed processing framework
HDFS: Hadoop Distributed File System
Also shown: Moab, Slurm, Lustre, MPI

Intel HPC Distribution: Open Platform for High Performance Data Analytics
Performance: bring compute to the data by running MapReduce* on Lustre* without code changes; run MapReduce* faster by avoiding the intermediate file shuffle with shared storage
Efficiency: avoid Hadoop* islands in the sea of HPC systems; run MapReduce jobs alongside HPC workloads with full access to the cluster resources
Manageability: use the seamless integration to manage one common platform for Hadoop and HPC; develop with multiple programming models and deploy on shared storage
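
The idea of avoiding the shuffle with shared storage can be illustrated with a toy word count. In this sketch map tasks write their partitioned output to a directory that every task can see, and each reduce task reads its partition files in place, so no map output is copied between nodes. The temporary directory merely stands in for a shared mount, and all function and file names are hypothetical; this is not the Intel adaptor's implementation.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def map_task(task_id, text, num_reducers, shared_dir):
    """Count words in one input split and write one partition file per reducer."""
    partitions = [Counter() for _ in range(num_reducers)]
    for word in text.split():
        # Deterministic partitioner: route each word by the sum of its bytes.
        partitions[sum(word.encode()) % num_reducers][word] += 1
    for reducer_id, counts in enumerate(partitions):
        path = Path(shared_dir) / f"map{task_id}-part{reducer_id}.json"
        path.write_text(json.dumps(counts))


def reduce_task(reducer_id, shared_dir):
    """Merge all map outputs for one partition by reading the shared directory."""
    total = Counter()
    for path in sorted(Path(shared_dir).glob(f"map*-part{reducer_id}.json")):
        total.update(json.loads(path.read_text()))
    return dict(total)


def word_count(splits, num_reducers=2):
    # The temporary directory plays the role of storage visible to every task.
    with tempfile.TemporaryDirectory() as shared_dir:
        for task_id, text in enumerate(splits):
            map_task(task_id, text, num_reducers, shared_dir)
        result = {}
        for reducer_id in range(num_reducers):
            result.update(reduce_task(reducer_id, shared_dir))
    return result


if __name__ == "__main__":
    print(word_count(["big data for big science", "big science on lustre"]))
```

With node-local storage, the reduce step would first have to fetch each partition file from the node that produced it; with a shared file system it can open the files directly.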

Join the BETA program
Early adopters of the combined Intel Distribution for Apache Hadoop Software and Intel EE for Lustre Software solution will receive a free, exclusive limited-use version of the software and exchange insights with Intel experts.
To be considered for the BETA, please contact Intel: hpdd-info@intel.com, bernard.doering@intel.com, bruno.riva@intel.com

For more information: hadoop.intel.com, intel.com/bigdata, @intelhadoop

Legal Information
INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel's current plan of record product roadmaps.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number
Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel, Intel Xeon, Intel Xeon Phi, the Intel Xeon Phi logo, the Intel Xeon logo and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase.
Other names and brands may be claimed as the property of others. Copyright 2013, Intel Corporation. All rights reserved.