Data Analytics Infrastructure

Similar documents

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

Big Data and Data Science: Behind the Buzz Words

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Next-Generation Cloud Analytics with Amazon Redshift

Background on Elastic Compute Cloud (EC2) AMI s to choose from including servers hosted on different Linux distros

TOP 8 TRENDS FOR 2016 BIG DATA

Big Data Architectures. Tom Cahill, Vice President Worldwide Channels, Jaspersoft

Cloud Big Data Architectures

Workshop on Hadoop with Big Data

BIRT in the World of Big Data

Database Usage in the Public and Private Cloud: Choices and Preferences

ANALYTICS CENTER LEARNING PROGRAM

Native Connectivity to Big Data Sources in MSTR 10

Integrating Salesforce Using Talend Integration Cloud

Has been into training Big Data Hadoop and MongoDB from more than a year now

JAVASCRIPT CHARTING. Scaling for the Enterprise with Metric Insights Copyright Metric insights, Inc.

Building Your Big Data Team

Open Source Technologies on Microsoft Azure

Peninsula Strategy. Creating Strategy and Implementing Change

Amazon Redshift & Amazon DynamoDB Michael Hanisch, Amazon Web Services Erez Hadas-Sonnenschein, clipkit GmbH Witali Stohler, clipkit GmbH

Databricks. A Primer

APP DEVELOPMENT ON THE CLOUD MADE EASY WITH PAAS

Cost-Effective Business Intelligence with Red Hat and Open Source

6 Steps to Faster Data Blending Using Your Data Warehouse

Hadoop & Spark Using Amazon EMR

Databricks. A Primer

White Paper: Datameer s User-Focused Big Data Solutions

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Big Data Success Step 1: Get the Technology Right

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Real Time Big Data Processing

Big Data for everyone Democratizing big data with the cloud. Steffen Krause Technical

Bringing Big Data to People

Moving From Hadoop to Spark

Scalable Architecture on Amazon AWS Cloud

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Building your Big Data Architecture on Amazon Web Services

How To Handle Big Data With A Data Scientist

Performance and Scalability Overview

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Big Data & Cloud Computing. Faysal Shaarani

So What s the Big Deal?

Performance and Scalability Overview

WHITE PAPER. Data Migration and Access in a Cloud Computing Environment INTELLIGENT BUSINESS STRATEGIES

Big Data Analytics - Accelerated. stream-horizon.com

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

I/O Considerations in Big Data Analytics

Big Data and Data Science. The globally recognised training program

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Hadoop IST 734 SS CHUNG

AIST Data Symposium. Ed Lenta. Managing Director, ANZ Amazon Web Services

The Inside Scoop on Hadoop

The? Data: Introduction and Future

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.

May 6, 2014 Presented by: Kyle McNerney

WHITEPAPER. A Technical Perspective on the Talena Data Availability Management Solution

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Composite Software Data Virtualization Turbocharge Analytics with Big Data and Data Virtualization

TRAINING PROGRAM ON BIGDATA/HADOOP

Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time?

Tableau Online. Understanding Data Updates

Why NoSQL? Your database options in the new non- relational world IBM Cloudant 1

CPS 516: Data-intensive Computing Systems. Instructor: Shivnath Babu TA: Zilong (Eric) Tan

Real World Big Data Architecture - Splunk, Hadoop, RDBMS

Deploy. Friction-free self-service BI solutions for everyone Scalable analytics on a modern architecture

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Data Doesn t Communicate Itself Using Visualization to Tell Better Stories

Hadoop. for Oracle database professionals. Alex Gorbachev Calgary, AB September 2013

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

Tips and Techniques on how to better Monitor, Manage and Optimize your MicroStrategy System High ROI DW and BI Solutions

Sisense. Product Highlights.

Cloud Computing and Big Data What Technical Writers Need to Know

Big Data Infrastructure at Spotify

BIG DATA APPLIANCES. July 23, TDWI. R Sathyanarayana. Enterprise Information Management & Analytics Practice EMC Consulting

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

Enterprise Operational SQL on Hadoop Trafodion Overview

Big Data Analytics Nokia

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Thing Big: How to Scale Your Own Internet of Things.

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten

Applications for Big Data Analytics

BIG DATA: STORAGE, ANALYSIS AND IMPACT GEDIMINAS ŽYLIUS

Talend Open Studio for Big Data. Release Notes 5.2.1

Big Data Technologies Compared June 2014

Big Data Multi-Platform Analytics (Hadoop, NoSQL, Graph, Analytical Database)

Transcription:

Data Analytics Infrastructure Data Science SG Nov 2015 Meetup Le Nguyen The Dat @lenguyenthedat

Backgrounds ZALORA Group (2013 2014) o Biggest online fashion retails in South East Asia o Data Infrastructure & Data Science

Backgrounds Commercialize.TV (2015 ) o Multi Channel media network focusing on China audiences o Data Infrastructure & Insights

Challenges No central data source: o Data stored in multiple locations o Unclear ownership Data definition and quality: o Little to none documentation o Different formula, rules owned by different departments o Always dirty no matter what Reporting Descriptive analytics: o Immediate needs, automations o Important to do it right (and quick!)

Data Warehouse

Database Technologies SQL Relational Databases: o MySQL, PostgreSQL o MS SQL Server, Oracle SQL NoSQL: o Redis o Cassandra o MongoDB o DynamoDB (AWS) o RethinkDB Map Reduce ecosystem: o Hadoop: HDFS Pig Hive Hbase o Spark: RDD Spork Shark (Spark SQL) Hbase-Spark Massively Parallel Processing (MPP): o Vertica (HP) - Greenplum (EMC) Netezza (IBM) o ParAccel (Amazon Redshift)

Database Technologies SQL Relational Databases: o MySQL, PostgreSQL o MS SQL Server, Oracle SQL NoSQL: (?) o Redis o Cassandra o MongoDB o DynamoDB (AWS) o Neo4j Map Reduce ecosystem: o Hadoop: HDFS Pig Hive Hbase o Spark: RDD Spork Shark (Spark SQL) Hbase-Spark Massively Parallel Processing (MPP): o Vertica (HP) - Greenplum (EMC) Netezza (IBM) o ParAccel (Amazon Redshift)

Data Warehouse Amazon Redshift aws.amazon.com/redshift o Cloud-based, Fully managed o SQL (PostgreSQL 8.0.2) o On-demand ($2000/year) o Scalable (Petabyte-scale) o FAST! amplab.cs.berkeley.edu/benchmark/ All in ONE place! o Product information o Customer information o Tracking data o External data sources (Social Media, 3 rd Party datasets)

Extract-Transform-Load (ETL)

ETL Amazon Redshift s COPY command. docs.aws.amazon.com/redshift/latest/dg/r_copy.html

ETL Custom made: o Simple bash script, python, SQL o Use cases: Scrappers Excel / CSV imports Data Pipeline Frameworks: o Large scale, more complicated o Examples: Spotify s Luigi github.com/spotify/luigi Yelp s Mycroft github.com/yelp/mycroft 3 rd Party Services: o aws.amazon.com/redshift/partners Flydata Rjmetrics

Applications

Self-services Dashboards Intermediate users o SQL / Excel / WYSIWYG Tools o Re:dash www.redash.io Open source: github.com/getredash/redash Try it out: demo.redash.io Self-manage & deployment: o Docker o Pre-baked AMI (Amazon Web Services) o Google Cloud Images Supports lots of database types (Redshift, MySQL, PostgreSQL, Big Query, MongoDB ) Users need to know SQL Web-based, collaborative work type

Demo: re:dash data sources usage http://demo.redash.io/queries/756

Demo: NYC Taxis Tip Amounts http://demo.redash.io/queries/753

Self-services Dashboards Intermediate users o SQL / Excel / WYSIWYG Tools o Tableau www.tableau.com Licensed software (14 days trial) Tableau Public (Free: public.tableau.com) Self-host or Tableau host (fully managed) Supports a lot more database types Group, User management customized access right Drag & Drop software as well as web-based

Demo: Social media dashboard Baidu è Import.io API èdata Warehouse è Tableau

Demo: Market Research - video platform performance SimilarWeb API èdata Warehouse è Tableau

Others (Tableau Public): SEA Games Result history tiny.cc/seagames Rakuten Viki data challenge tiny.cc/viki-viz

Advanced applications Advanced users o Data Warehouse connection (JDBC - PostgreSQL) o Automated, highly customized reports. o Data Science: Recommendation engine Predictive modeling Classifications

Advanced applications Internal reporting tool Data Warehouse è SQL, Python (Django), JS è Product-Finder [ Black dress SKU or ID Tiffany Atmosphere ] Sales Info Tracking Info Product Info

Advanced applications Recommendation Engine Data Warehouse è SQL, Python, Haskell è ZALORA Website Similar products: Similar products: Similar products: Similar products:

Conclusions

Team & Technology stack o Small team of 1-4 programmers o Amazon Web Services No upfront cost Low maintenance Scalability Integrations o Shell Scripts, Python, Haskell, D3.js o Unix, Open-source technologies

Takeaways o (Good) data infrastructure is important: Build it first (before you hire a data scientist!) Build it right: stable fast scalable. o There is no silver bullet: Understand what you need Always do more research o Data infrastructure is NOT that hard! Utilize existing, modern technologies Avoid old, proprietary technology that were built for the 90s!

References Engineering Blogs and Tutorials o o o o o o o Benchmarks o o o https://blog.asana.com/2014/11/stable-accessible-data-infrastructure-startup/ https://medium.com/@samson_hu/building-analytics-at-500px-92e9a7005c83 https://engineering.pinterest.com/blog/powering-interactive-data-analysisredshift http://engineering.ifttt.com/data/2015/10/14/data-infrastructure/ https://blog.rjmetrics.com/2015/10/15/the-data-infrastructure-meta-analysishow-top-engineering-organizations-built-their-big-data-stacks/ https://www.youtube.com/watch?v=reqtxqudpzo https://www.periscopedata.com/amazon-redshift-guide https://amplab.cs.berkeley.edu/benchmark/ https://www.flydata.com/blog/with-amazon-redshift-ssd-querying-a-tb-of-datatook-less-than-10-seconds/ https://www.flydata.com/blog/hive-and-redshift-a-brief-comparison/

Thank you!