Continuum Data Analytics Stack
|
|
|
- Amberlynn McCormick
- 10 years ago
- Views:
Transcription
1 Continuum Data Analytics Stack Pycon DE 2013, Köln Yves Hilpisch Continuum Analytics Europe
2 Agenda Big Data and Python Architecting for Data Continuum s Stack
3 Big Data and Python
4 Origin of Big Data Movement Storage disruption: plummeting HDD costs, cloud-based storage also: I/O evolution: 10gE SANs, Flash drives ETL disruption: Hadoop/Hive/HBase Basic analytics & statistics: counting things Facebook, Twitter, Youtube, Instagram...
5 Big Data (circa 2012)
6
7 The Players Data Processing & Low-level infrastructure Traditional BI vendors New BI startups Data-oriented startups Analytics-as-a-service Big data infrastructure platforms (DB & analytical compute as a service)
8 Another perspective (2011)
9 Observed Trends Diversification away from SQL & relational DBs Messy data, agile data processing Dynamic schema management Acknowledgement of heterogenous data environment Focus on high performance Richer simulations, processing more data Modern hardware revolution (SSDs, GPUs, etc.) Advanced visualization Interactive, novel plots Beyond simple reports and dashboards Advanced analytics Richer statistical models, Bayesian approaches Machine learning Predictive databases
10 Summary Data Trends big data: be it in terms of volume, complexity or velocity, available data is growing drastically; new technologies, an increasingly connected world, more sophisticated data gathering techniques and devices as well as new cultural attitudes towards social media are among the drivers of this trend real-time economy: today, decisions have to be made in real-time, business strategies are much shorter lived and the need to cope faster with the ever increasing amount and complexity of decision-relevant data steadily increases
11 However,... Our measurements as well as other recent work shows that the majority of real-world analytic jobs process less than 100 GB of input, but popular infrastructures such as Hadoop/MapReduce were originally designed for petascale processing. We claim that a single scale-up server can process each of these jobs and do as well or better than a cluster in terms of performance, cost, power, and server density. Raja Appuswamy et al. (2013): Nobody Ever Got Fired for Buying a Cluster. Microsoft Research, Cambridge UK.
12 Architecting for Data
13 Data Revolution Internet Revolution True Believer, 1996: Businesses that build network-oriented capability into their core will fundamentally outcompete and destroy their competition. Data Revolution True Believer, 2010: Businesses that build data comprehension into their core will destroy their competition over the next 5-10 years
14 Opportunities Advanced ML & Predictive DBs will provide transformative insights to nearly every business. Mobile & hi-speed connectivity means more dimensions of customer life are being digitized. Every bit of new data makes old data more valuable Analyzing historical data becomes more important Developing internal data analysis capability means you can more easily build data products to sell downstream. This is becoming an industry unto itself.
15 Technical Challenges Hardware & software do not yet make data analysis easy at terabyte scales Current analytics are mostly I/O bound. Next generation advanced analytics will be compute bound (simulations, distributed LinAlg). Efficiency matters. Reproducible analytical environment. Library & language choices can add air gaps between domain expert and analytical infrastructure.
16 Business Challenges Data exploration is new discipline for most businesses. Balancing agility & process for data-oriented processes and analytical libraries. Bad data architecture will generally not cause catastrophic failures. Instead, will erode your ability to compete. It s hard to know when you are sucking.
17 Data Matters Data has mass. Scalability requires minimizing data-movement (only as necessary). Deep/Advanced Analytics needs full computing stack, as accessible as SQL and Excel. Data should only move when it has to (to communicate results, to replicate, to back-up) not because the technology doesn t allow access.
18 Algorithms Matter...a Mac Mini running GraphChi can analyze Twitter s social graph from 2010 which contains 40 million users and 1.2 billion connections in 59 minutes. The previous published result on this problem took 400 minutes using a cluster of about 1,000 computers, Guestrin says. -- MIT Tech Review...Spark, running on a cluster of 50 machines (100 CPUs) runs five iterations of Pagerank on the twitter-2010 in seconds. GraphChi solves the same problem in less than double of the time (790 seconds), with only 2 CPUs.
19 Berkeley Data Stack (BDAS)
20 Memory Matters 1980s 90s-00s 2010s Mechanical disk Mechanical disk Mechanical disk Solid state disk Capacity Main memory Main memory Main memory Speed Level 3 cache Level 2 cache Level 2 cache Central processing unit (CPU) Level 1 cache CPU Level 1 cache CPU (a) (b) (c) Figure 1. Evolution of the hierarchical memory model. (a) The primordial (and simplest) model; (b) the most common current implementation, which includes additional cache levels; and (c) a sensible guess at what s coming over the next decade: three levels of cache in the CPU and solid state disks lying between main memory and classical mechanical disks. implemented several memory layers with different capabilities: lowerlevel caches (that is, those closer to of memory hierarchy (for an example in progress, see the Sequoia project at Programmers should exploit the optimizations inherent in temporal and spatial locality as much as possible.
21 Speed Matters
22 Jeff Hammerbacher s Advice Instrument everything Put all your data in one place Data first, questions later Store first, structure later (often the data model is dependent on the analysis you'd like to perform) Keep raw data forever Let everyone party on the data Introduce tools to support the whole research cycle (think of the scope of the product as the entire cycle, not just the container) Modular and composable infrastructure
23 Architecting for Data Data exploration as the central task. Data visualization as a first-class citizen. Enable agility.
24 Continuum s Stack
25 Continuum Analytics Visualisation Data Analysis Data Processing Domains Finance Geophysics Defense Advertising & Web Analytics Scientific Computing Scalable Computing Enterprise Python Scientific Computing Technologies Array/Columnar data processing Distributed computing, HPC GPU and new vector hardware Machine learning, predictive analytics Interactive Visualization
26 Mission To revolutionize data analytics and visualization by moving high-level Python code and domain expertise closer to data. This vision rests on four pillars: simplicity: advanced, powerful analytics, accessible to domain experts and business users via a simplified programming paradigm interactivity: interactive analysis and visualization of massive data sets collaboration: collaborative, shareable analysis (data, code, results, graphics) scale: out-of-core, distributed data processing
27 Big Picture Empower domain experts with high-level tools that exploit modern hardware Array Oriented Computing
28 Projects Blaze: High-performance Python library for modern vector computing, distributed and streaming data Numba: Vectorizing Python compiler for multicore and GPU, using LLVM Bokeh: Interactive, grammar-based visualization system for large datasets Common theme: High-level, expressive language for domain experts; innovative compilers & runtimes for efficient, powerful data transformation
29 Blaze Objectives Flexible descriptor for tabular and semi-structured data Seamless handling of: On-disk/Out-of-core Streaming data Distributed data Uniform treatment of: arrays of structures and structures of arrays missing values ragged shapes categorical types computed columns
30 Blaze Status DataShape type grammar NumPy-compatible C++ calculation engine (DyND) Synthesis of array function kernels (via LLVM) Fast time series routines (dynamic time warping for pattern matching) Array Server prototype BLZ columnar storage format 0.2 current release, working on
31 Schematic Database Array Server array+sql:// Python REPL, Scripts GPU Node Array Server array:// Blaze Client Synthesized Array/Table view Viz Data Server C, C++, FORTRAN file:// Array Server array:// JVM languages NFS
32 Kiva: Array Server Data Shape + Raw JSON = Web Service type KivaLoan = { id: int64; name: string; description: { languages: var, string(2); texts: json # map<string(2), string>; }; status: string; # LoanStatusType; funded_amount: float64; basket_amount: json; # Option(float64); paid_amount: json; # Option(float64); image: { id: int64; template_id: int64; }; video: json; activity: string; sector: string; use: string; delinquent: bool; location: { country_code: string(2); country: string; town: json; # Option(string); geo: { level: string; # GeoLevelType pairs: string; # latlong type: string; # GeoTypeType } };... {"id":200533,"name":"miawand Group","description":{"languages": ["en"],"texts":{"en":"ozer is a member of the Miawand Group. He lives in the 16th district of Kabul, Afghanistan. He lives in a family of eight members. He is single, but is a responsible boy who works hard and supports the whole family. He is a carpenter and is busy working in his shop seven days a week. He needs the loan to purchase wood and needed carpentry tools such as tape measures, rulers and so on.\r\n \r \nhe hopes to make progress through the loan and he is confident that will make his repayments on time and will join for another loan cycle as well. \r\n\r\n"}},"status":"paid","funded_amount": 925,"basket_amount":null,"paid_amount":925,"image":{"id": ,"template_id": 1},"video":null,"activity":"Carpentry","sector":"Construction","use":"He wants to buy tools for his carpentry shop","delinquent":null,"location": {"country_code":"af","country":"afghanistan","town":"kabul Afghanistan","geo":{"level":"country","pairs":"33 65","type":"point"}},"partner_id": 34,"posted_date":" T20:30:03Z","planned_expiration_date": null,"loan_amount": 925,"currency_exchange_loss_amount":null,"borrowers": [{"first_name":"ozer","last_name":"","gender":"m","pictured":true}, {"first_name":"rohaniy","last_name":"","gender":"m","pictured":true}, {"first_name":"samem","last_name":"","gender":"m","pictured":true}],"ter ms": {"disbursal_date":" t07:00:00z","disbursal_currency":"afn"," disbursal_amount":42000,"loan_amount":925,"local_payments": [{"due_date":" t07:00:00z","amount":4200}, {"due_date":" t07:00:00z","amount":4200}, {"due_date":" t07:00:00z","amount":4200}, {"due_date":" t07:00:00z","amount":4200}, {"due_date":" t07:00:00z","amount":4200}, {"due_date":" t08:00:00z","amount":4200}, {"due_date":" t08:00:00z","amount":4200}, {"due_date":" t08:00:00z","amount":4200}, {"due_date":" t08:00:00z","amount":4200}, {"due_date":" t08:00:00z","amount": 4200}],"scheduled_payments": gb of JSON => network-queryable array: ~5 minutes
33 Akamai Dataset ETL Hardware Memory Hive 8x 16 core, 2 GHz (128 cores) RAM: 8x 382 GB HDD: 8x 15k rpm Python script 1x 8 core, 2.2 GHz RAM: 144 GB HDD: 2x 7200rpm Time (traceroute) Routes/hr/Ghz 5 hrs, 635M routes 11 hrs, 113M routes 496k 584k Python performs ~18% better with almost no optimization resulting IPMap can be used for realtime, online query and aggregation
34 Querying Traceroute in BLZ format Meant for dealing with Big Data (RAM consumption is extremely low) 1k Random Full Scan Time RAM Time RAM BLZ (disk) BLZ (mem) NPY (memmap) NumPy (mem) 3.5s 0.04mb 2.9s 8mb 2.37s 210mb 2.4s 210mb 0.24s 0.2mb 0.23s 602mb.13s 603mb 0.23s 603mb
35 Numba Just-in-time, dynamic compiler for Python Optimize data-parallel computations at call time, to take advantage of local hardware configuration Compatible with NumPy, Blaze Leverage LLVM ecosystem: Optimization passes Inter-op with other languages Variety of backends (e.g. CUDA for GPU support)
36 Numba C++ C Fortran Python LLVM IR x86 ARM PTX Numba turns Python into a compiled language
37 Example Numba
38 LLVM-based architecture
39 Image Processing ~1500x def filter(image, filt, output): M, N = image.shape m, n = filt.shape for i in range(m//2, M-m//2): for j in range(n//2, N-n//2): result = 0.0 for k in range(m): for l in range(n): result += image[i+k-m//2,j+l-n//2]*filt[k, l] output[i,j] = result
40 Example: Mandelbrot Vectorized from numbapro import vectorize sig = 'uint8(uint32, f4, f4, f4, f4, uint32, uint32, target='gpu') def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters): pixel_size_x = (max_x - min_x) / width pixel_size_y = (max_y - min_y) / height x = tid % width y = tid / width real = min_x + x * pixel_size_x imag = min_y + y * pixel_size_y c = complex(real, imag) z = 0.0j for i in range(iters): z = z * z + c if (z.real * z.real + z.imag * z.imag) >= 4: return i return 255 Tesla S2050 Kind Time Speed-up Python 263,6 1.0x CPU 2, x GPU 0, x
41 Example: N-Body Simulation Simulation of movement of N bodies (space objects, particles) Loop-heavy algorithm to calculate the interactions between all bodies Pure Python 70 sec NumPy sec (= 97x speed-up) Numba sec (= 7x speed-up = 670 x total)
42 Bokeh Language-based (instead of GUI) visualization system High-level expressions of data binding, statistical transforms, interactivity and linked data Easy to learn, but expressive depth for power users Interactive Data space configuration as well as data selection Specified from high-level language constructs Web as first class interface target Support for large datasets via intelligent downsampling ( abstract rendering )
43 Bokeh Inspirations: Chaco: interactive, viz pipeline for large data Protovis & Stencil : Binding visual Glyphs to data and expressions ggplot2: faceting, statistical overlays Design goal: Accessible, extensible, interactive plotting for the web even for non-javascript programmers
44 Bokeh & BokehJS Demos BokehJS demos Audio Spectrogram Bokeh Examples - Low-level Python interface - IPython Notebook integration - ggplot example
45 Abstract Rendering Pixels'are'Bins ' and'always'have'been' A' B' B' A' D' C' Z>View' Geometry' D' C' Counts' Pixels'
46 Hi-def Alpha
47 Kiva: Abstract Rendering Basic Abstract Rendering can identify trouble spots in standard plots, and also offer automatic tone mapping, taking perception into account. 37 mil elements, showing adjacency between entities in Kiva dataset
48 Abstract Rendering B# A# D# C# Geometry# Reduce#?????????? Aggregates#( Abstract #Pixels)# Transfer# Pixels#
49 Abstract Rendering of Sparsity Drawing the Dark in Kiva Example Akin to mapping the ocean trenches; typical viz starts at sea level & goes up.
50 Cloud-hosted Python analytics environment Full Linux sandbox for every user IPython notebook Interactive Javascript plotting Easy to share notebooks & code with other users Free plan: 512mb memory, 10gb disk Premium plans include: more powerful machines, more memory/disk, SSH access, cluster support
51
52 Data Summary Explorer
53 Continuum Data Explorer (CDX)
54 Continuum Analytics Europe GmbH Rathausstrasse Voelklingen Germany Dr. Yves J.
Unlocking the True Value of Hadoop with Open Data Science
Unlocking the True Value of Hadoop with Open Data Science Kristopher Overholt Solution Architect Big Data Tech 2016 MinneAnalytics June 7, 2016 Overview Overview of Open Data Science Python and the Big
BIG DATA TRENDS AND TECHNOLOGIES
BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.
Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61
F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase
Data Centric Systems (DCS)
Data Centric Systems (DCS) Architecture and Solutions for High Performance Computing, Big Data and High Performance Analytics High Performance Computing with Data Centric Systems 1 Data Centric Systems
Understanding the Value of In-Memory in the IT Landscape
February 2012 Understing the Value of In-Memory in Sponsored by QlikView Contents The Many Faces of In-Memory 1 The Meaning of In-Memory 2 The Data Analysis Value Chain Your Goals 3 Mapping Vendors to
How To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren
News and trends in Data Warehouse Automation, Big Data and BI Johan Hendrickx & Dirk Vermeiren Extreme Agility from Source to Analysis DWH Appliances & DWH Automation Typical Architecture 3 What Business
Big Data Paradigms in Python
Big Data Paradigms in Python San Diego Data Science and R Users Group January 2014 Kevin Davenport! http://kldavenport.com [email protected] @KevinLDavenport Thank you to our sponsors: Setting up
In-memory computing with SAP HANA
In-memory computing with SAP HANA June 2015 Amit Satoor, SAP @asatoor 2015 SAP SE or an SAP affiliate company. All rights reserved. 1 Hyperconnectivity across people, business, and devices give rise to
Everything you need to know about flash storage performance
Everything you need to know about flash storage performance The unique characteristics of flash make performance validation testing immensely challenging and critically important; follow these best practices
Hybrid Software Architectures for Big Data. [email protected] @hurence http://www.hurence.com
Hybrid Software Architectures for Big Data [email protected] @hurence http://www.hurence.com Headquarters : Grenoble Pure player Expert level consulting Training R&D Big Data X-data hot-line
Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga
Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.
Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database
Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica
QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM
QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM QlikView Technical Case Study Series Big Data June 2012 qlikview.com Introduction This QlikView technical case study focuses on the QlikView deployment
Software-defined Storage Architecture for Analytics Computing
Software-defined Storage Architecture for Analytics Computing Arati Joshi Performance Engineering Colin Eldridge File System Engineering Carlos Carrero Product Management June 2015 Reference Architecture
Cost-Effective Business Intelligence with Red Hat and Open Source
Cost-Effective Business Intelligence with Red Hat and Open Source Sherman Wood Director, Business Intelligence, Jaspersoft September 3, 2009 1 Agenda Introductions Quick survey What is BI?: reporting,
Large-Scale Data Processing
Large-Scale Data Processing Eiko Yoneki [email protected] http://www.cl.cam.ac.uk/~ey204 Systems Research Group University of Cambridge Computer Laboratory 2010s: Big Data Why Big Data now? Increase
Databricks. A Primer
Databricks A Primer Who is Databricks? Databricks vision is to empower anyone to easily build and deploy advanced analytics solutions. The company was founded by the team who created Apache Spark, a powerful
DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION
DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION A DIABLO WHITE PAPER AUGUST 2014 Ricky Trigalo Director of Business Development Virtualization, Diablo Technologies
CitusDB Architecture for Real-Time Big Data
CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing
BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014
BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014 Ralph Kimball Associates 2014 The Data Warehouse Mission Identify all possible enterprise data assets Select those assets
HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect [email protected]
HPC and Big Data EPCC The University of Edinburgh Adrian Jackson Technical Architect [email protected] EPCC Facilities Technology Transfer European Projects HPC Research Visitor Programmes Training
Key Attributes for Analytics in an IBM i environment
Key Attributes for Analytics in an IBM i environment Companies worldwide invest millions of dollars in operational applications to improve the way they conduct business. While these systems provide significant
Oracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
Realizing the True Potential of Software-Defined Storage
Realizing the True Potential of Software-Defined Storage Who should read this paper Technology leaders, architects, and application owners who are looking at transforming their organization s storage infrastructure
Stream Processing on GPUs Using Distributed Multimedia Middleware
Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research
Next Generation GPU Architecture Code-named Fermi
Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
Bringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks [email protected] 2015 The MathWorks, Inc. 1 Data is the sword of the
Overview of HPC Resources at Vanderbilt
Overview of HPC Resources at Vanderbilt Will French Senior Application Developer and Research Computing Liaison Advanced Computing Center for Research and Education June 10, 2015 2 Computing Resources
How To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,
Big Fast Data Hadoop acceleration with Flash. June 2013
Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional
Tableau Server Scalability Explained
Tableau Server Scalability Explained Author: Neelesh Kamkolkar Tableau Software July 2013 p2 Executive Summary In March 2013, we ran scalability tests to understand the scalability of Tableau 8.0. We wanted
Binary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
Big Data Are You Ready? Thomas Kyte http://asktom.oracle.com
Big Data Are You Ready? Thomas Kyte http://asktom.oracle.com The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated
Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software
WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications
Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce
Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of
2009 Oracle Corporation 1
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material,
The most powerful open source data science technologies in your browser.!! Yves Hilpisch
The most powerful open source data science technologies in your browser.!! Yves Hilpisch I. The Market and The Problem II. How We Solve The Problem III. Market Size and Facts IV. Strategic Opportunities
Managing the PowerPivot for SharePoint Environment
Managing the PowerPivot for SharePoint Environment Melissa Coates Blog: sqlchick.com Twitter: @sqlchick SharePoint Saturday 3/16/2013 About Melissa Business Intelligence & Data Warehousing Developer Architect
Databricks. A Primer
Databricks A Primer Who is Databricks? Databricks was founded by the team behind Apache Spark, the most active open source project in the big data ecosystem today. Our mission at Databricks is to dramatically
Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
From Spark to Ignition:
From Spark to Ignition: Fueling Your Business on Real-Time Analytics Eric Frenkiel, MemSQL CEO June 29, 2015 San Francisco, CA What s in Store For This Presentation? 1. MemSQL: A real-time database for
SAP HANA SAP s In-Memory Database. Dr. Martin Kittel, SAP HANA Development January 16, 2013
SAP HANA SAP s In-Memory Database Dr. Martin Kittel, SAP HANA Development January 16, 2013 Disclaimer This presentation outlines our general product direction and should not be relied on in making a purchase
Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack
Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack HIGHLIGHTS Real-Time Results Elasticsearch on Cisco UCS enables a deeper
Reference Architecture, Requirements, Gaps, Roles
Reference Architecture, Requirements, Gaps, Roles The contents of this document are an excerpt from the brainstorming document M0014. The purpose is to show how a detailed Big Data Reference Architecture
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:
RevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
White Paper February 2010. IBM InfoSphere DataStage Performance and Scalability Benchmark Whitepaper Data Warehousing Scenario
White Paper February 2010 IBM InfoSphere DataStage Performance and Scalability Benchmark Whitepaper Data Warehousing Scenario 2 Contents 5 Overview of InfoSphere DataStage 7 Benchmark Scenario Main Workload
Exploring the Synergistic Relationships Between BPC, BW and HANA
September 9 11, 2013 Anaheim, California Exploring the Synergistic Relationships Between, BW and HANA Sheldon Edelstein SAP Database and Solution Management Learning Points SAP Business Planning and Consolidation
Architectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller
In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency
IBM Global Technology Services September 2007. NAS systems scale out to meet growing storage demand.
IBM Global Technology Services September 2007 NAS systems scale out to meet Page 2 Contents 2 Introduction 2 Understanding the traditional NAS role 3 Gaining NAS benefits 4 NAS shortcomings in enterprise
The Internet of Things and Big Data: Intro
The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific
Understanding the Benefits of IBM SPSS Statistics Server
IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster
Tableau Server 7.0 scalability
Tableau Server 7.0 scalability February 2012 p2 Executive summary In January 2012, we performed scalability tests on Tableau Server to help our customers plan for large deployments. We tested three different
Presto/Blockus: Towards Scalable R Data Analysis
/Blockus: Towards Scalable R Data Analysis Andrew A. Chien University of Chicago and Argonne ational Laboratory IRIA-UIUC-AL Joint Institute Potential Collaboration ovember 19, 2012 ovember 19, 2012 Andrew
Big Data Technologies Compared June 2014
Big Data Technologies Compared June 2014 Agenda What is Big Data Big Data Technology Comparison Summary Other Big Data Technologies Questions 2 What is Big Data by Example The SKA Telescope is a new development
SQL Server 2012 Parallel Data Warehouse. Solution Brief
SQL Server 2012 Parallel Data Warehouse Solution Brief Published February 22, 2013 Contents Introduction... 1 Microsoft Platform: Windows Server and SQL Server... 2 SQL Server 2012 Parallel Data Warehouse...
Analytics on Spark & Shark @Yahoo
Analytics on Spark & Shark @Yahoo PRESENTED BY Tim Tully December 3, 2013 Overview Legacy / Current Hadoop Architecture Reflection / Pain Points Why the movement towards Spark / Shark New Hybrid Environment
Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering
Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays Red Hat Performance Engineering Version 1.0 August 2013 1801 Varsity Drive Raleigh NC
Intelligent Business Operations and Big Data. 2014 Software AG. All rights reserved.
Intelligent Business Operations and Big Data 1 What is Big Data? Big data is a popular term used to acknowledge the exponential growth, availability and use of information in the data-rich landscape of
5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014
5 Keys to Unlocking the Big Data Analytics Puzzle Anurag Tandon Director, Product Marketing March 26, 2014 1 A Little About Us A global footprint. A proven innovator. A leader in enterprise analytics for
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
Data-intensive HPC: opportunities and challenges. Patrick Valduriez
Data-intensive HPC: opportunities and challenges Patrick Valduriez Big Data Landscape Multi-$billion market! Big data = Hadoop = MapReduce? No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard,
IT of SPIM Data Storage and Compression. EMBO Course - August 27th! Jeff Oegema, Peter Steinbach, Oscar Gonzalez
IT of SPIM Data Storage and Compression EMBO Course - August 27th Jeff Oegema, Peter Steinbach, Oscar Gonzalez 1 Talk Outline Introduction and the IT Team SPIM Data Flow Capture, Compression, and the Data
Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY
Fast and Expressive Big Data Analytics with Python Matei Zaharia UC Berkeley / MIT UC BERKELEY spark-project.org What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop
Real-Time Analytics on Large Datasets: Predictive Models for Online Targeted Advertising
Real-Time Analytics on Large Datasets: Predictive Models for Online Targeted Advertising Open Data Partners and AdReady April 2012 1 Executive Summary AdReady is working to develop and deploy sophisticated
Fastboot Techniques for x86 Architectures. Marcus Bortel Field Application Engineer QNX Software Systems
Fastboot Techniques for x86 Architectures Marcus Bortel Field Application Engineer QNX Software Systems Agenda Introduction BIOS and BIOS boot time Fastboot versus BIOS? Fastboot time Customizing the boot
Can Flash help you ride the Big Data Wave? Steve Fingerhut Vice President, Marketing Enterprise Storage Solutions Corporation
Can Flash help you ride the Big Data Wave? Steve Fingerhut Vice President, Marketing Enterprise Storage Solutions Corporation Forward-Looking Statements During our meeting today we may make forward-looking
Moving From Hadoop to Spark
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com [email protected] Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
MOC 20467B: Designing Business Intelligence Solutions with Microsoft SQL Server 2012
MOC 20467B: Designing Business Intelligence Solutions with Microsoft SQL Server 2012 Course Overview This course provides students with the knowledge and skills to design business intelligence solutions
The Data Placement Challenge
The Data Placement Challenge Entire Dataset Applications Active Data Lowest $/IOP Highest throughput Lowest latency 10-20% Right Place Right Cost Right Time 100% 2 2 What s Driving the AST Discussion?
Graph Database Proof of Concept Report
Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment
Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya
Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming by Dibyendu Bhattacharya Pearson : What We Do? We are building a scalable, reliable cloud-based learning platform providing services
System Requirements Table of contents
Table of contents 1 Introduction... 2 2 Knoa Agent... 2 2.1 System Requirements...2 2.2 Environment Requirements...4 3 Knoa Server Architecture...4 3.1 Knoa Server Components... 4 3.2 Server Hardware Setup...5
All-Flash Storage Solution for SAP HANA:
All-Flash Storage Solution for SAP HANA: Storage Considerations using SanDisk Solid State Devices WHITE PAPER 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Preface 3 Why SanDisk?
How To Speed Up A Flash Flash Storage System With The Hyperq Memory Router
HyperQ Hybrid Flash Storage Made Easy White Paper Parsec Labs, LLC. 7101 Northland Circle North, Suite 105 Brooklyn Park, MN 55428 USA 1-763-219-8811 www.parseclabs.com [email protected] [email protected]
Fact Sheet In-Memory Analysis
Fact Sheet In-Memory Analysis 1 Copyright Yellowfin International 2010 Contents In Memory Overview...3 Benefits...3 Agile development & rapid delivery...3 Data types supported by the In-Memory Database...4
Seeking Opportunities for Hardware Acceleration in Big Data Analytics
Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who
Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.
Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
Where is... How do I get to...
Big Data, Fast Data, Spatial Data Making Sense of Location Data in a Smart City Hans Viehmann Product Manager EMEA ORACLE Corporation August 19, 2015 Copyright 2014, Oracle and/or its affiliates. All rights
Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
Luncheon Webinar Series May 13, 2013
Luncheon Webinar Series May 13, 2013 InfoSphere DataStage is Big Data Integration Sponsored By: Presented by : Tony Curcio, InfoSphere Product Management 0 InfoSphere DataStage is Big Data Integration
Analyzing Big Data with AWS
Analyzing Big Data with AWS Peter Sirota, General Manager, Amazon Elastic MapReduce @petersirota What is Big Data? Computer generated data Application server logs (web sites, games) Sensor data (weather,
Integrating Apache Spark with an Enterprise Data Warehouse
Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, In-base Analytics Dr. Toni Bollinger, IBM Corporation Senior Software
Data Centric Interactive Visualization of Very Large Data
Data Centric Interactive Visualization of Very Large Data Bruce D Amora, Senior Technical Staff Gordon Fossum, Advisory Engineer IBM T.J. Watson Research/Data Centric Systems #OpenPOWERSummit Data Centric
Testing Big data is one of the biggest
Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing
Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances
INSIGHT Oracle's All- Out Assault on the Big Data Market: Offering Hadoop, R, Cubes, and Scalable IMDB in Familiar Packages Carl W. Olofson IDC OPINION Global Headquarters: 5 Speen Street Framingham, MA
SAP HANA PLATFORM Top Ten Questions for Choosing In-Memory Databases. Start Here
PLATFORM Top Ten Questions for Choosing In-Memory Databases Start Here PLATFORM Top Ten Questions for Choosing In-Memory Databases. Are my applications accelerated without manual intervention and tuning?.
Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles [email protected] hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
Microsoft Big Data. Solution Brief
Microsoft Big Data Solution Brief Contents Introduction... 2 The Microsoft Big Data Solution... 3 Key Benefits... 3 Immersive Insight, Wherever You Are... 3 Connecting with the World s Data... 3 Any Data,
LEVERAGING FLASH MEMORY in ENTERPRISE STORAGE. Matt Kixmoeller, Pure Storage
LEVERAGING FLASH MEMORY in ENTERPRISE STORAGE Matt Kixmoeller, Pure Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies
CE 504 Computational Hydrology Computational Environments and Tools Fritz R. Fiedler
CE 504 Computational Hydrology Computational Environments and Tools Fritz R. Fiedler 1) Operating systems a) Windows b) Unix and Linux c) Macintosh 2) Data manipulation tools a) Text Editors b) Spreadsheets
Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
Applied Micro development platform. ZT Systems (ST based) HP Redstone platform. Mitac Dell Copper platform. ARM in Servers
ZT Systems (ST based) Applied Micro development platform HP Redstone platform Mitac Dell Copper platform ARM in Servers 1 Server Ecosystem Momentum 2009: Internal ARM trials hosting part of website on
Effective Java Programming. efficient software development
Effective Java Programming efficient software development Structure efficient software development what is efficiency? development process profiling during development what determines the performance of
