SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS



Similar documents
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Hadoop Ecosystem B Y R A H I M A.

Integrating Big Data into the Computing Curricula

BIG DATA What it is and how to use?

Search and Real-Time Analytics on Big Data

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Big Data Explained. An introduction to Big Data Science.

Hadoop IST 734 SS CHUNG

Big Data and Fast Data combined is it possible?

How To Use Big Data For Telco (For A Telco)

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Building Scalable Big Data Pipelines

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex

Big Data Course Highlights

Real Time Big Data Processing

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Information Builders Mission & Value Proposition

How To Create A Data Visualization With Apache Spark And Zeppelin

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Openbus Documentation

Elephants and Storms - using Big Data techniques for Analysis of Large and Changing Datasets

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

Lecture Data Warehouse Systems

BIG DATA TECHNOLOGY. Hadoop Ecosystem

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Advanced Big Data Analytics with R and Hadoop

BIG DATA TOOLS. Top 10 open source technologies for Big Data

Big Data and Apache Hadoop s MapReduce

Challenges for Data Driven Systems

Big Data Analysis and HADOOP

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

How To Scale Out Of A Nosql Database

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

NoSQL and Hadoop Technologies On Oracle Cloud

A Brief Outline on Bigdata Hadoop

Reference Architecture, Requirements, Gaps, Roles

The 4 Pillars of Technosoft s Big Data Practice

Big Data and Industrial Internet

Making Sense of Big Data in Insurance

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

Big Data Analytics Nokia

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Big Data and Analytics: Challenges and Opportunities

Big Data Analysis: Apache Storm Perspective

MapReduce with Apache Hadoop Analysing Big Data

Big Data: Tools and Technologies in Big Data

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Apache HBase. Crazy dances on the elephant back

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Large scale processing using Hadoop. Ján Vaňo

Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges

BIG DATA TRENDS AND TECHNOLOGIES

Architectures for Big Data Analytics A database perspective

Thomas Baumann Swiss Mobiliar Bern, Switzerland

CSE-E5430 Scalable Cloud Computing Lecture 2

#TalendSandbox for Big Data

Big Data. Marriage of RDBMS-DWH and Hadoop & Co. Author: Jan Ott Trivadis AG Trivadis. Big Data - Marriage of RDBMS-DWH and Hadoop & Co.

Dell In-Memory Appliance for Cloudera Enterprise

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Unified Big Data Processing with Apache Spark. Matei

Ali Ghodsi Head of PM and Engineering Databricks

So What s the Big Deal?

COMP9321 Web Application Engineering

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

Big Data on Google Cloud

Using Data Mining and Machine Learning in Retail

An Approach to Implement Map Reduce with NoSQL Databases

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations

Using distributed technologies to analyze Big Data

Native Connectivity to Big Data Sources in MSTR 10

WHAT S NEW IN SAS 9.4

Oracle Big Data SQL Technical Update

The Future of Big Data SAS Automotive Roundtable Los Angeles, CA 5 March 2015 Mike Olson Chief Strategy Officer,

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Internals of Hadoop Application Framework and Distributed File System

Hadoop for MySQL DBAs. Copyright 2011 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Large Scale/Big Data Federation & Virtualization: A Case Study

Large-Scale Data Processing

Big Data on Microsoft Platform

Transforming the Telecoms Business using Big Data and Analytics

Cloud Scale Distributed Data Storage. Jürmo Mehine

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Using an In-Memory Data Grid for Near Real-Time Data Analysis

Big Data for Investment Research Management

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Transcription:

Enterprise Data Problems in Investment Banks BigData History and Trend Driven by Google CAP Theorem for Distributed Computer System Open Source Building Blocks: Hadoop, Solr, Storm.. 3548 Hypothetical Solution using Lambda Architecture Where BigData Industry is Going? SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS CHARLES CAI ASHWANI ROY 8 March 2013

Presenter: Charles Cai Charles Cai makes a living by designing and implementing trading and risk systems for investment banks. Currently a Chief Front Office Technical Architect in a global energy trading firm. Twitter: @caidong Linkedin: charlescai

Presenter: Ashwani Roy Ashwani Roy Masters in Finance Student at London Business School and VP at a Tier 1 Investment Bank. Love to mix programming and Applied Mathematics to solve difficult problems in Investment Banking Twitter: @Ashwani_Roy Linkedin: ashwaniroy

3548

Why Finance Industry should care? We care because of Compliance requirements Risk Management Pricing Rise of Machines (Ecommerce) Cost Cutting BTW: Twitter is also part of Market Data

Sample Interest Model / Simulations

A quick Monte Carlo Demo Demo Computing this is functional Some Terminology PV = present value = Cash flows discounted to current time Delta = change in price / change in interest rate Gamma.. Vega.. Rho.. Theta.. Vanna. And other Greeks

Monte Carlo Simulations -Results <results> = func<i,j,k Parallelize computation with mappers Save results and run reducers [[ trade: 1 curveid: Orig PV:100 Delta:200]{ to OLAP}..[ trade: 1 curveid: Sim1 PV:100 Delta:200] {big data}..[ trade: 1 curveid: Sim2 PV:99 Delta:220]{big data} ] ]

Compliance Dodd-Frank requires >= five years records Fast Disaster recovery requirements (Tapes backup not acceptable) All Bloomberg and other chats to be saves in quick reportable form Many more in Basel 3 and Dodd Frank Act You need to # get chats for AshwaniRoy@bloomberg.net and ashwanir@reuters.net # from the 5 years Bloomberg and Reuters log of a global investment bank of 1TB(assume 1MB/Day/Trader * 220 trading days * 1000 traders* 5 years) # for all EURUSD swaps only.. Additional filters and aggregation requirements

Big Data Industry History: Google s Papers 1

1 Google s Big Data Papers: 2003 2006 GFS Google File System 2003 Distributed file system 3 x copies Commodity machines Colossus (2012) MapReduce 2004 Input à Map à Partition à Compare à Shuffle à Sort à Reduce à Output BigTable 2006 Distributed Key- Value columnfamily based database

Hadoop Distributed File System (HDFS) http://ecomcanada.files.wordpress.com/2012/11/hadoop-architecture.png

Google s MapReduce Programming Model 1

Apache Hbase: Column Family Distributed K-V Store

1 Google s Big Data Papers 2: 2010 - now Percolator Dremel Pregel 2010 Incremental update/ compute built on BigTable Adds transactions, locks, notifications SPFs: Stream Processing Frameworks + underlying database 2010 Online analytics and visualization SQL like language for structured data Each row is JSON object in protobuf format Column based Spanner (2012), BigQuery, F1 2010 Scalable graph computing Worker threads à nodes à parallel superstep à messages à nodes à Aggregator/Combiners (global statistics) PageRank, shortest path, bipartite matching Impala Tez/Stinger Microsoft Trinity

Unstructured Data: Index/Search Engine Github Code Search: 17 TB

Apache Lucene/SOLR Open Source Indexing and Search Engine 4,000+ Enterprise users IBM, HP, Cisco Apple, Linkedin Wikipedia CNet, Sky Twitter

What s Next for Hadoop? Real-time! Nathan Marz

Some more use cases Save money to save your jobs Save money to your firm can do more E Commerce is norm Market sentiment analysis cannot be relied on using Bloomberg's sentiment analysis only.. Add some more

2 Lambda Architecture Nathan Marz, BackType/Twitter query = func (data,...) Excel / VBA Java, C#/F#... MatLab 3 rd party ETL Tools R Technical analysis Alerts Join across data sources (e.g. correlation among weather / energy) Curating/cleanse curves Derive curves, building models Back-testing models Visualization of the above! Real-time ticks, events Historical (all history data points) Curated/cleansed curves Derived curves Back-testing models

Lambda Architecture : query = func (data,...) Batch Layer (Hadoop) Servicing Layer All data (HDFS/HBase) Batch recompute Precompute Views (MapReduce) QFD 1 QFD 2 Excel/Apps Access / Centralization / Manipulation New data stream Acquisition Ad- hock Analysis/Writeback: Java/ C#,R/Clojure, HIVE/PIG, Talend/3 rd party,... Quality / Access / Manipulation QFD N Batch views (HDFS/Impala) Merge Speed Layer (Storm) QFD N Realtime views (Apache Hbase) Tableau/Spotfire Visualization QFD 2 Process stream Realtime increment QFD 1 Alerts Automation / Aggregation / Centralization RDBMS/DW + Full- text Search + Graph Database COTS Reporting Tools Visualization RDBMS Full- text Search Graph Database MDX/DW (T+ 1) Metadata / Classification / Curation

Online resources and alternative stacks An Introduction to Data Science.PDF Free e-book on Data Science with R under Creative Commons Licenses Berkeley Data Analytics Stack (Open Source: Mesos cluster management, Spark/Streaming cluster computing, Shark-SQL/DW) Learning Statistics with R, Free Big Data Education: Advanced Data Science DataStax Enterprise (Apache C*/Cassandra, Apache Hadoop, Apache Solr ) An example lambda architecture for real-time analysis of hashtags using Trident, Hadoop and Splout SQL Nathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture Open source clustered Lucene: elasticsearch used by GitHub (17 TB code)

Distributed Computing System: CAP Theorem Consistency all nodes see the same data at the same time Availability a guarantee that every request receives a response about whether it was successful or failed Partition tolerance the system continues to operate despite arbitrary message loss or failure of part of the system https://github.com/thinkaurelius/titan/wiki/storage-backend-overview http://en.wikipedia.org/wiki/cap_theorem http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed

Lambda Architecture : Enterprise Data Data size Retention granular level Speed of change Speed of reaction Volume Velocity Variety Value Data sources Data formats (./ semi-/nonstructured ) Quality of data Ways to improve data quality Discover hidden business insights

Lambda Architecture Nathan Marz, BackType/Twitter Design Principle: Human fault-tolerance Immutability Pre-computation Lambda Architecture: Batch Layer Serving Layer Speed Layer Technology Stack Apache Hadoop/HBase/Cloudera Impala Twitter Storm