Search Big Data with MySQL and Sphinx. Mindaugas Žukas, www.ivinco.com

1 Search Big Data with MySQL and Sphinx Mindaugas Žukas

2 Agenda Big Data Architecture Factors and Technologies MySQL and Big Data Sphinx Search Server overview Case study: building a Big Data search engine System Overview Web Layer Architecture Data Store Architecture Distributed Search Engine Architecture System monitoring overview

4 Big Data Architecture: Technologies
- There are many different technologies for data storage and analytics
- No one tool is the right fit for all Big Data problems
- More than 50% of Big Data projects fail
- With Big Data and High Load projects, people invent their own bicycles every day, and that is OK*
* Consult with the Big Data experts in your circle

5 Big Data Architecture: Factors
- Data characteristics (volume, velocity, variety, quality, complexity, etc.)
- Database workloads (I/O patterns, OLTP/analytics/mixed, real-time/batch, etc.)
- Planned use cases
- Available hardware choices and capabilities (cloud, commodity hardware, state-of-the-art hardware)
- Business requirements and system quality attributes (accuracy, efficiency, scalability, reliability, availability, maintainability, security, etc.)
- Team size and experience level
- Budget constraints
- Time constraints
- And many more

6 MySQL and Big Data

7 MySQL and Big Data
Why MySQL?
- Open Source
- Actively developed and supported
- Very popular, large community
- Many developers familiar with MySQL
- Easier to hire experts

8 MySQL and Big Data
Typical use cases:
- Hadoop (or other) data store + MySQL for aggregated data and reports
- Sharded MySQL used as a Big Data store

9 What is Sharding?
- Without sharding: all data in a single MySQL database, on one state-of-the-art server
- With sharding: data distributed over a number of MySQL databases with the same structure, on a number of commodity servers

10 MySQL Sharding
To shard or not to shard?
- You can keep quite large amounts of data (multi-TB) in MySQL with high performance without sharding
- Sharding adds complexity
- But if you really have large-scale DB growth plans (Big Data), sharding may be the only option (not just with MySQL)
Sharding methods:
- By ID range, by hash, by function, by look-up table
- Using modern tools like Oracle MySQL Fabric, ScaleBase, Vitess, Jetpants
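The sharding methods listed on this slide can be illustrated with a minimal sketch. It is not the presenter's implementation: the function names, the range size, and the shard count of 256 are assumptions chosen only for illustration.

```python
import zlib

def shard_by_range(doc_id, range_size=1_000_000):
    # ID-range sharding: IDs 0..999_999 go to shard 0, the next million to shard 1, etc.
    # range_size is a hypothetical value.
    return doc_id // range_size

def shard_by_hash(key, shard_count=256):
    # Hash sharding: a stable hash of the key, reduced modulo the shard count.
    # CRC32 is used here only because it is deterministic across processes.
    return zlib.crc32(key.encode("utf-8")) % shard_count

def shard_by_lookup(key, lookup_table):
    # Look-up table sharding: the mapping is stored explicitly (for example in a
    # metadata database) and can be changed without touching the routing code.
    return lookup_table[key]
```

Range and hash routing need no extra storage but make rebalancing harder; a look-up table costs one extra read per request and allows individual keys to be moved between shards.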

11 Sphinx Search Server

12 Sphinx Overview
- Implements advanced search features
- Enables apps to respond faster
- Scalable to billions of documents
- Quick to learn. Easy to use. Simple to maintain.

13 Why Sphinx?
- Speed: 10x-1000x faster than built-in search (MySQL, Postgres, MS SQL, Oracle...)
- Real-time indexes
- Feature-rich search: relevancy, synonyms, stopwords, indexing of multiple languages, use of 3rd-party linguistic libraries, etc.
- Scalable: aggregates search results from thousands of boxes (largest known installation: … boxes); 300M queries per day on Craigslist.org
- Built-in High Availability / Load balancing
- Easy to integrate: SphinxQL
- Great documentation
- Easy learning curve

14 Why Sphinx?
- Boolean search (AND, OR, NOT): hello world; hello & world; hello -world
- Per-field search: … world
- Field …body) hello world
- Search within first N: … hello
- Phrase search: "hello world"
- Per-field relevancy ranking weights
- Proximity search: "hello world"~10
- Word distance: hello NEAR/10 world
- Quorum matching
- GEO distance search (with syntax for mi/km/m)
- Add attributes to the index and use WHERE, ORDER, GROUP for integers, floats, strings
- Many more

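15 Sphinx At Work
Components: Application, Sphinx daemon, Sphinx index, Sphinx indexer, Database
- 1. Search query (Application to Sphinx daemon)
- 2. Search results (IDs) (Sphinx daemon to Application)
- 3. Fetch docs by ID (Application to Database)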

16 Sphinx Advanced Indexing
Components: Application, Sphinx forwarder, Delta index, Main index, Big Database
- 1. Search query (Application to Sphinx forwarder)
- 2. Query all indexes and aggregate results
- 3. Search results (IDs)
Delta index: re-index often, only new records and updates
Main index: re-index the whole database once in a while, then reset the delta index
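The "query all indexes and aggregate results" step can be sketched roughly as follows. This is a simplified illustration, not how the Sphinx daemon works internally: the function name and the use of plain dictionaries of doc ID to relevance weight are assumptions.

```python
def aggregate_main_and_delta(main_hits, delta_hits, limit=10):
    """Merge hits from a main index and a delta index.

    Both arguments map doc ID -> relevance weight. A document present in
    both indexes was updated after the last main rebuild, so the delta
    version replaces the main one.
    """
    merged = dict(main_hits)
    merged.update(delta_hits)  # delta entries override main entries
    # Highest weight first; ties are broken by doc ID for a stable order.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _weight in ranked[:limit]]
```

Because the delta index holds only new and updated records, it stays small and can be rebuilt often, while the main index is rebuilt rarely.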

17 Sphinx Indexing (Character Level)
charset_table defines which characters matter:
- Use ranges: a..z, U+410..U+42F
- Char mapping: A->a, A..Z->a..z
ngram_chars indexes CJK characters as separate tokens:
- Source text: 我喜歡iphone, 這是一個偉大的手機
- ngram_chars = U U+2FA1F
- Tokens: 我 喜 歡 iphone 這 是 一 個 偉 大 的 手 機
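The effect of indexing CJK characters as separate tokens can be imitated with a short sketch. It only mimics the behaviour described on the slide and is not Sphinx code; the function name is hypothetical, and the character range U+4E00..U+9FFF (CJK Unified Ideographs) is an assumption made for the example, not the range from the slide.

```python
def tokenize_with_cjk_unigrams(text):
    """Split text into tokens, emitting each CJK ideograph as its own token."""
    tokens = []
    word = []  # characters of the non-CJK word currently being collected

    def flush():
        if word:
            tokens.append("".join(word))
            word.clear()

    for ch in text:
        if 0x4E00 <= ord(ch) <= 0x9FFF:  # assumed CJK range
            flush()
            tokens.append(ch)
        elif ch.isalnum():
            word.append(ch.lower())
        else:
            flush()  # punctuation and whitespace act as separators
    flush()
    return tokens
```

Calling `tokenize_with_cjk_unigrams("我喜歡iphone, 這是一個偉大的手機")` yields one token per ideograph plus the single token `iphone`.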

18 Sphinx Indexing (Word Level)
Stopwords: on, a, the, my
- search on my site = search on a site = search on the site = search my site
Exceptions and wordforms:
- U.S.A. => USA
- U.S. => USA
- US => USA
- United States => USA
- Vitamin a
- The Matrix
- AT&T => AT&T
Stemming (does => do)
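A rough sketch of how stopwords and exceptions make different phrasings equivalent. It imitates the idea only and is not Sphinx's tokenizer: the function name is hypothetical, and the stopword and exception lists are copied from the slide.

```python
import re

STOPWORDS = {"on", "a", "the", "my"}
EXCEPTIONS = {
    "u.s.a.": "usa",
    "u.s.": "usa",
    "us": "usa",
    "united states": "usa",
}

def normalize(text):
    """Lowercase, apply exception mappings, tokenize, and drop stopwords."""
    lowered = text.lower()
    # Apply longer sources first so "u.s.a." is not partially matched by "u.s.".
    for source in sorted(EXCEPTIONS, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(source) + r"(?!\w)"
        lowered = re.sub(pattern, EXCEPTIONS[source], lowered)
    tokens = re.findall(r"\w+", lowered)
    return [token for token in tokens if token not in STOPWORDS]
```

With these lists, "search on my site" and "search the site" both reduce to the tokens `search` and `site`, so they match the same documents.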

19 Sphinx And Big Text
Sphinx can use 3rd-party tools to work with "Big Text".
Chinese phrase ("I like the iphone, it is a great phone"): 我喜歡iphone, 這是一個偉大的手機
- Using ngram_chars (indexing CJK characters as separate tokens), ngram_chars = U U+2FA1F:
  我 喜 歡 iphone 這 是 一 個 偉 大 的 手 機
- Using Basis Rosette linguistics technology:
  我 喜 歡 iphone 這 是 一 個 偉 大 stopword( 的 ) 手 機

20 Sphinx Lithuanian Stemming Example
select * from LTtest where match('lietuvai');
id
select * from LTtest where match('žemės');
id

21 Case Study: Building a Big Data Search Engine

22 Some Stats
- MySQL stores 120TB of compressed data and growing (tens of billions of text documents)
- Incoming data: up to 5,000 new docs/s
- Data indexing latency under 5 minutes
- System uses 200+ different servers (about half of them for HA/redundancy)
- Up to 25,000 queries per second on the main MySQL DB server
- API responses vary from small result sets with a few documents to tens of megabytes of result data

23 The Architecture - Factors
Prerequisites:
- High performance
- High Load
- High Availability
- Scalability (keep up with fast-growing data and usage)
- Near real-time (low latency)
- Feature-rich, quality search (multi-language, boolean, relevancy, synonyms, etc.)
- Efficiency and maintainability (unreasonably small budget, small team, commodity hardware)
- Lots of structured data (Forums, Blogs, Comments, News, Twitter, etc.)

24 The Architecture - Technologies
Main technologies: CentOS, pfSense, Squid, Apache, Percona MySQL Server, PHP, Java, Memcached, RabbitMQ, Kafka, Sphinx Search Server (+ Basis Rosette Linguistics)

25 Building a Big Data Search Engine: The Web Layer
- Data Flow
- Technologies

26 Web Application (Data flow)
Components:
- Web layer: Firewall, Load balancers / Cache, Application web servers
- Data store: Data collection, MySQL clusters (Data DB clusters, Main DB)
- Search engine: Indexing (Indexers), Sphinx Clusters
Steps shown:
- 1. User search query
- 2. Run search query and get matching doc IDs
- 4. Search results

27 Web Layer (Technologies)
- Firewall: pfsense 1 and pfsense 2 (failover)
- Load balancing and cache: Squid 1 to Squid 4
- Load balancing across web servers: Web 1, Web 2, Web 3, Web 4 ... Web N
- Back ends: Logs, MySQL Data Store, MySQL Main DB, Memcached, Distributed Sphinx Search Index, Analytics/Monitoring

28 Building a Big Data Search Engine: The Data Store
- Structure
- Sharding
- High-availability and Backups
- Data loading

29 MySQL Data Store
- MySQL (Percona Server)
- Sharding: Main DB cluster + distributed data storage clusters
- Currently stores more than 120TB
- Scalability (scale out / scale up)
- High Availability (MySQL Replication; Percona Replication Manager)

30 MySQL Data Store Structure
- Different data types are stored in separate Data groups (e.g. Blogs, Twitter, Forums)
- Within Data groups, data is split into a number of Shards (MySQL databases)
- Shards are distributed over a number of DB Clusters
- Sharding/routing information is stored in the Main DB or defined algorithmically (hash)
- Main DB high-availability cluster: system service data and sharding metadata
Diagram: Forum Data Group, Twitter Data Group ... Data Group N, each with DB Cluster 1, DB Cluster 2 ... DB Cluster N; each DB Cluster holds several Shards

31 MySQL Data Store High-availability And Backups
Each DB cluster consists of three servers to ensure high availability:

Server            Instances
{DataType}dbN-1   Master A, Slave B, Backup C
{DataType}dbN-2   Slave A, Backup B, Master C
{DataType}dbN-3   Backup A, Master B, Slave C

Diagram labels: Replication, Backup, Copy to the Big Backup Archive
When one server is down, failover is automatic; when two servers are down, we can manually enable a backup instance to ensure availability.

32 MySQL Main DB High-availability
- Percona Replication Manager (PRM) agents run on all servers
Diagram: Application; MainDB Master (PRM agent) with replication to MainDB Slave 1 (PRM agent) and MainDB Slave 2 (PRM agent)

33 MySQL Main DB High-availability
- When the Master goes down, the PRM agents on the Slaves immediately decide which of them will become the new Master
- On failover, the maindbmaster VIP is assigned to the Slave that becomes the new Master; the application just keeps using maindbmaster
Diagram: Application writes to maindbmaster; MainDB Master (PRM agent), MainDB Slave 1 (PRM agent), MainDB Master (ex-Slave 2) (PRM agent)

34 MySQL - Data Loading
- Incoming data: XML files, Kafka, RabbitMQ, other data sources
- Multi-process Loaders write into the Data store groups (each with Shard 0, Shard 1 ... Shard 255) and produce Logs and Rejected data
Multi-process Loaders:
- validate the data
- insert data into the proper DB shards
- With many shards, we can write large amounts of data in parallel
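The loader steps above (validate, route to the proper shard, set aside rejected data) can be sketched as a single routing function. The validation rule, the document fields `id` and `text`, and routing by `id % 256` are all assumptions for illustration; the slide only states that there are 256 shards per data store group.

```python
from collections import defaultdict

SHARD_COUNT = 256  # Shard 0 .. Shard 255, as on the slide

def is_valid(doc):
    # Hypothetical validation rule: an integer ID and non-empty text.
    return isinstance(doc.get("id"), int) and bool(doc.get("text"))

def route_documents(docs):
    """Group valid documents by target shard and collect rejected ones."""
    batches = defaultdict(list)
    rejected = []
    for doc in docs:
        if not is_valid(doc):
            rejected.append(doc)
            continue
        shard = doc["id"] % SHARD_COUNT  # assumed routing function
        batches[shard].append(doc)
    return dict(batches), rejected
```

Each per-shard batch is independent of the others, so separate loader processes could insert different batches at the same time, which is where the parallel write capacity comes from.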

35 Building a Big Data Search Engine: The Search Engine
- SE Summary
- Distributed Index Architecture
- Dive Into Indexing Configuration
- Centralized Indexer and HA

36 Sphinx Search Index Summary
- The Sphinx Search index is distributed across Search Engine Clusters
- 100% automated centralized data indexing
- High availability
- High scalability (scale up and scale out)

37 Sphinx Distributed Index Architecture
- The application uses Sphinx Forwarders on the Web servers to run search queries for different data groups
- The index in each data group is split into a number of Search Nodes distributed over a number of SE boxes (e.g. blogsse01, blogsse02)
- Each box has a pair with the same Search Nodes for High Availability
Diagram: Web Server N runs a Forums Sphinx Forwarder, a Blogs Sphinx Forwarder, and an X Sphinx Forwarder. They query the Forum Search Engine Group (Forum SE01, Forum SE02 ... Forum SE-N), the Blogs Search Engine Group (Blogs SE01 ... Blogs SE-N), and Search Engine Group X (xse01 with Nodes 1 to 4, xse02 with Nodes 5 to 8, xse03 with Nodes 9, 10, 11 ... through xse-n)

38 Sphinx Distributed Index Architecture
Forum SE Group / ForumSE01:
- Each Search Node serves the index for several DB Shards
- Each server has a pair (ForumSE01-1 and ForumSE01-2) with the same Search Nodes (Node 1 to Node 4) to ensure high availability
- Both servers can be used by Sphinx with automatic load balancing
Diagram: the Search Nodes are fed from Shards in the Data store / Forum Data group (DB Cluster 1, DB Cluster 2, DB Cluster 3 ... DB Cluster N)

39 Sphinx Single Index Node Structure
SE Group X -> SE Cluster SE01 -> Node 1
DELTA (index inc_node1):
- Re-index a few times per minute
- Takes a few seconds
- Sample indexer run: collected 6216 docs, 2.9 MB; sorted 1.2 Mhits, 100.0% done
DELTA-week [daily] (index week_node1):
- Re-index at midnight
- Takes a few minutes
- Resets delta index
DELTA-3month [weekly] (index 3month_node1):
- Re-index every Sunday
- Takes 45 minutes
- Resets other delta indexes
MAIN (index big_node1):
- Re-index the whole node once a month or on demand
- Takes several hours
- Resets all indexes
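The tiered rebuild schedule can be expressed as a small decision function. This is a sketch under assumptions: the function name is hypothetical, the weekly rebuild is assumed to run at midnight on Sunday, and the monthly MAIN rebuild is assumed to run at midnight on the 1st, neither of which the slide states.

```python
from datetime import datetime

def tiers_to_rebuild(now: datetime) -> list[str]:
    """Return the index tiers due for a rebuild at the given scheduler tick."""
    tiers = ["DELTA"]  # rebuilt on every tick (a few times per minute)
    at_midnight = now.hour == 0 and now.minute == 0
    if at_midnight:
        tiers.append("DELTA-week")  # daily rebuild, resets DELTA
        if now.weekday() == 6:  # Sunday (assumed to run at midnight)
            tiers.append("DELTA-3month")
        if now.day == 1:  # assumed day for the monthly rebuild
            tiers.append("MAIN")
    return tiers
```

The point of the tiers is that the cost of a rebuild grows with the tier's size, so the frequently rebuilt tiers stay small and the expensive MAIN rebuild happens rarely.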

40 Sphinx Centralized Indexing
Indexer01 holds the indexing jobs for each search node: Blogs SE Group -> BlogSE01 (jobs for BlogSE01-node1, node2, node3, node4), jobs for BlogSE02 ... jobs for BlogSE-N. Each job list covers Build index 1, Build index 2 ... Build index N, executed by Indexer Workers.
Each Indexer Worker:
- Gets data from MySQL
- Converts it to XML (enrich, normalize, process, etc.)
- Feeds it to the Sphinx indexer
- Validates the index
- Copies the index to the destination boxes (BlogSE01-1 and BlogSE01-2, each with Node 1 ... Node 4 holding Index 1, Index 2 ... Index N)
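The "convert to XML" step might look roughly like the sketch below, which emits a document stream in Sphinx's xmlpipe2 format. The slide does not say which XML format the workers produce, so the use of xmlpipe2, the function name, and the row layout (dictionaries with an `id` key) are assumptions.

```python
from xml.sax.saxutils import escape, quoteattr

def rows_to_xmlpipe2(rows, fields):
    """Render database rows as an xmlpipe2 document stream.

    rows   -- iterable of dicts, each with an "id" key plus the field values
    fields -- names of the full-text fields (assumed to be valid XML names)
    """
    parts = ['<?xml version="1.0" encoding="utf-8"?>', "<sphinx:docset>", "<sphinx:schema>"]
    for name in fields:
        parts.append(f"<sphinx:field name={quoteattr(name)}/>")
    parts.append("</sphinx:schema>")
    for row in rows:
        parts.append(f"<sphinx:document id={quoteattr(str(row['id']))}>")
        for name in fields:
            # Escape &, < and > so user text cannot break the XML stream.
            parts.append(f"<{name}>{escape(str(row.get(name, '')))}</{name}>")
        parts.append("</sphinx:document>")
    parts.append("</sphinx:docset>")
    return "\n".join(parts)
```

Any enrichment or normalization would happen on the row values before they are rendered, which keeps that logic outside the search engine itself.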

41 Building a Big Data Search Engine: System Monitoring
- Main tools
- Best practices

42 System Monitoring
Over 12,000 service checks
Main tools:
- Nagios: monitoring, alerts
- Zabbix: monitoring, charts
- Pingdom: availability, responsiveness
- OpsGenie (alt: PagerDuty): alert escalation, calls/notifications
- VividCortex: DB monitoring, performance analysis
- ThousandEyes: network monitoring

43 System Monitoring
- With a distributed system, good instrumentation is vital: log as much as you can and link log entries by IDs so you can track every query for a request (web request -> sphinx/mysql)
- Watch P95/P99 performance
- Continuously improve monitoring, incident escalation, and alerting systems
- Do incident post-mortem analysis
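As a small illustration of watching P95/P99, the sketch below computes percentiles from a list of response times with the nearest-rank method. The function name is hypothetical, and nearest-rank is one common definition among several; monitoring tools may interpolate differently.

```python
def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it.

    pct is an integer between 1 and 100.
    """
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

Averages hide slow outliers: a service with a low mean latency can still have a P99 several times higher, and that tail is what the slowest one percent of requests experience.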

44 Take-aways
- Carefully consider all your Big Data project factors before choosing tools for your architecture
- Do not be afraid to experiment and invent your own wheel
- MySQL scales well and can be a dependable tool to store large amounts of structured data
- Sphinx Search is a powerful search engine that:
  - can scale to thousands of servers
  - can be used with any data sources (directly or via XML stream)
  - can be useful not only for search, but also for speeding up analytics tasks
- MySQL/Sphinx allowed us to build a successful scalable system that makes large amounts of data searchable and operates under high load

45 Thank You For Your Attention! Questions?


More information

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB Overview of Databases On MacOS Karl Kuehn Automation Engineer RethinkDB Session Goals Introduce Database concepts Show example players Not Goals: Cover non-macos systems (Oracle) Teach you SQL Answer what

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Big Data Infrastructure at Spotify

Big Data Infrastructure at Spotify Big Data Infrastructure at Spotify Wouter de Bie Team Lead Data Infrastructure June 12, 2013 2 Agenda Let s talk about Data Infrastructure, how we did it, what we learned and how we ve failed Some Context

More information

Constructing a Data Lake: Hadoop and Oracle Database United!

Constructing a Data Lake: Hadoop and Oracle Database United! Constructing a Data Lake: Hadoop and Oracle Database United! Sharon Sophia Stephen Big Data PreSales Consultant February 21, 2015 Safe Harbor The following is intended to outline our general product direction.

More information

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming by Dibyendu Bhattacharya Pearson : What We Do? We are building a scalable, reliable cloud-based learning platform providing services

More information

DataStax Enterprise 3.x

DataStax Enterprise 3.x DataStax Enterprise 3.x Realtime Analytics with Solr Jason Rutherglen 2012 DataStax 1 About the Presenter Big Data Engineer at DataStax Co-author of Programming Hive and Lucene and Solr: The Definitive

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Database Scalability and Oracle 12c

Database Scalability and Oracle 12c Database Scalability and Oracle 12c Marcelle Kratochvil CTO Piction ACE Director All Data/Any Data marcelle@piction.com Warning I will be covering topics and saying things that will cause a rethink in

More information

MySQL Security for Security Audits

MySQL Security for Security Audits MySQL Security for Security Audits Presented by, MySQL AB & O Reilly Media, Inc. Brian Miezejewski MySQL Principal Consultat Bio Leed Architect ZFour database 1986 Senior Principal Architect American Airlines

More information

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first

More information

Vectorwise 3.0 Fast Answers from Hadoop. Technical white paper

Vectorwise 3.0 Fast Answers from Hadoop. Technical white paper Vectorwise 3.0 Fast Answers from Hadoop Technical white paper 1 Contents Executive Overview 2 Introduction 2 Analyzing Big Data 3 Vectorwise and Hadoop Environments 4 Vectorwise Hadoop Connector 4 Performance

More information

Database Selection Matrix. January 2015

Database Selection Matrix. January 2015 Database Selection Matrix January 2015 Table of Contents Introduction Development Matrix Data Model Query Model Availability of Developer Training Operations Matrix High Availability Scalability Storage

More information

Katta & Hadoop. Katta - Distributed Lucene Index in Production. Stefan Groschupf Scale Unlimited, 101tec. sg{at}101tec.com

Katta & Hadoop. Katta - Distributed Lucene Index in Production. Stefan Groschupf Scale Unlimited, 101tec. sg{at}101tec.com 1 Katta & Hadoop Katta - Distributed Lucene Index in Production Stefan Groschupf Scale Unlimited, 101tec. sg{at}101tec.com foto by: belgianchocolate@flickr.com 2 Intro Business intelligence reports from

More information

Apache HBase. Crazy dances on the elephant back

Apache HBase. Crazy dances on the elephant back Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage

More information

Testing & Assuring Mobile End User Experience Before Production. Neotys

Testing & Assuring Mobile End User Experience Before Production. Neotys Testing & Assuring Mobile End User Experience Before Production Neotys Agenda Introduction The challenges Best practices NeoLoad mobile capabilities Mobile devices are used more and more At Home In 2014,

More information

SQL Server 2012 Performance White Paper

SQL Server 2012 Performance White Paper Published: April 2012 Applies to: SQL Server 2012 Copyright The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication.

More information

III Big Data Technologies

III Big Data Technologies III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

FIFTH EDITION. Oracle Essentials. Rick Greenwald, Robert Stackowiak, and. Jonathan Stern O'REILLY" Tokyo. Koln Sebastopol. Cambridge Farnham.

FIFTH EDITION. Oracle Essentials. Rick Greenwald, Robert Stackowiak, and. Jonathan Stern O'REILLY Tokyo. Koln Sebastopol. Cambridge Farnham. FIFTH EDITION Oracle Essentials Rick Greenwald, Robert Stackowiak, and Jonathan Stern O'REILLY" Beijing Cambridge Farnham Koln Sebastopol Tokyo _ Table of Contents Preface xiii 1. Introducing Oracle 1

More information

Real-time Big Data Analytics with Storm

Real-time Big Data Analytics with Storm Ron Bodkin Founder & CEO, Think Big June 2013 Real-time Big Data Analytics with Storm Leading Provider of Data Science and Engineering Services Accelerating Your Time to Value IMAGINE Strategy and Roadmap

More information

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world Analytics March 2015 White paper Why NoSQL? Your database options in the new non-relational world 2 Why NoSQL? Contents 2 New types of apps are generating new types of data 2 A brief history of NoSQL 3

More information

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015 Pulsar Realtime Analytics At Scale Tony Ng April 14, 2015 Big Data Trends Bigger data volumes More data sources DBs, logs, behavioral & business event streams, sensors Faster analysis Next day to hours

More information

Internals of Hadoop Application Framework and Distributed File System

Internals of Hadoop Application Framework and Distributed File System International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

More information

the missing log collector Treasure Data, Inc. Muga Nishizawa

the missing log collector Treasure Data, Inc. Muga Nishizawa the missing log collector Treasure Data, Inc. Muga Nishizawa Muga Nishizawa (@muga_nishizawa) Chief Software Architect, Treasure Data Treasure Data Overview Founded to deliver big data analytics in days

More information

In-memory computing with SAP HANA

In-memory computing with SAP HANA In-memory computing with SAP HANA June 2015 Amit Satoor, SAP @asatoor 2015 SAP SE or an SAP affiliate company. All rights reserved. 1 Hyperconnectivity across people, business, and devices give rise to

More information

Decoding the Big Data Deluge a Virtual Approach. Dan Luongo, Global Lead, Field Solution Engineering Data Virtualization Business Unit, Cisco

Decoding the Big Data Deluge a Virtual Approach. Dan Luongo, Global Lead, Field Solution Engineering Data Virtualization Business Unit, Cisco Decoding the Big Data Deluge a Virtual Approach Dan Luongo, Global Lead, Field Solution Engineering Data Virtualization Business Unit, Cisco High-volume, velocity and variety information assets that demand

More information

A Scalable Data Transformation Framework using the Hadoop Ecosystem

A Scalable Data Transformation Framework using the Hadoop Ecosystem A Scalable Data Transformation Framework using the Hadoop Ecosystem Raj Nair Director Data Platform Kiru Pakkirisamy CTO AGENDA About Penton and Serendio Inc Data Processing at Penton PoC Use Case Functional

More information

Performance and Scalability Overview

Performance and Scalability Overview Performance and Scalability Overview This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Analytics Platform. Contents Pentaho Scalability and

More information

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!! Simplifying Big Data Analytics: Unifying Batch and Stream Processing John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!! Streaming Analy.cs S S S Scale- up Database Data And Compute Grid

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti

More information

Architecture and Mode of Operation

Architecture and Mode of Operation Software- und Organisations-Service Open Source Scheduler Architecture and Mode of Operation Software- und Organisations-Service GmbH www.sos-berlin.com Scheduler worldwide Open Source Users and Commercial

More information

I Logs. Apache Kafka, Stream Processing, and Real-time Data Jay Kreps

I Logs. Apache Kafka, Stream Processing, and Real-time Data Jay Kreps I Logs Apache Kafka, Stream Processing, and Real-time Data Jay Kreps The Plan 1. What is Data Integration? 2. What is Apache Kafka? 3. Logs and Distributed Systems 4. Logs and Data Integration 5. Logs

More information

Tushar Joshi Turtle Networks Ltd

Tushar Joshi Turtle Networks Ltd MySQL Database for High Availability Web Applications Tushar Joshi Turtle Networks Ltd www.turtle.net Overview What is High Availability? Web/Network Architecture Applications MySQL Replication MySQL Clustering

More information

Assignment # 1 (Cloud Computing Security)

Assignment # 1 (Cloud Computing Security) Assignment # 1 (Cloud Computing Security) Group Members: Abdullah Abid Zeeshan Qaiser M. Umar Hayat Table of Contents Windows Azure Introduction... 4 Windows Azure Services... 4 1. Compute... 4 a) Virtual

More information

In Memory Accelerator for MongoDB

In Memory Accelerator for MongoDB In Memory Accelerator for MongoDB Yakov Zhdanov, Director R&D GridGain Systems GridGain: In Memory Computing Leader 5 years in production 100s of customers & users Starts every 10 secs worldwide Over 15,000,000

More information

How to choose High Availability solutions for MySQL MySQL UC 2010 Yves Trudeau Read by Peter Zaitsev. Percona Inc MySQLPerformanceBlog.

How to choose High Availability solutions for MySQL MySQL UC 2010 Yves Trudeau Read by Peter Zaitsev. Percona Inc MySQLPerformanceBlog. How to choose High Availability solutions for MySQL MySQL UC 2010 Yves Trudeau Read by Peter Zaitsev Percona Inc MySQLPerformanceBlog.com -2- About us http://www.percona.com http://www.mysqlperformanceblog.com/

More information

TECHNOLOGY WHITE PAPER Jun 2012

TECHNOLOGY WHITE PAPER Jun 2012 TECHNOLOGY WHITE PAPER Jun 2012 Technology Stack C# Windows Server 2008 PHP Amazon Web Services (AWS) Route 53 Elastic Load Balancing (ELB) Elastic Compute Cloud (EC2) Amazon RDS Amazon S3 Elasticache

More information

How Comcast Built An Open Source Content Delivery Network National Engineering & Technical Operations

How Comcast Built An Open Source Content Delivery Network National Engineering & Technical Operations How Comcast Built An Open Source Content Delivery Network National Engineering & Technical Operations Jan van Doorn Distinguished Engineer VSS CDN Engineering 1 What is a CDN? 2 Content Router get customer

More information

How to Choose Between Hadoop, NoSQL and RDBMS

How to Choose Between Hadoop, NoSQL and RDBMS How to Choose Between Hadoop, NoSQL and RDBMS Keywords: Jean-Pierre Dijcks Oracle Redwood City, CA, USA Big Data, Hadoop, NoSQL Database, Relational Database, SQL, Security, Performance Introduction A

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

Big data blue print for cloud architecture

Big data blue print for cloud architecture Big data blue print for cloud architecture -COGNIZANT Image Area Prabhu Inbarajan Srinivasan Thiruvengadathan Muralicharan Gurumoorthy Praveen Codur 2012, Cognizant Next 30 minutes Big Data / Cloud challenges

More information

Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities

Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities Technology Insight Paper Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities By John Webster February 2015 Enabling you to make the best technology decisions Enabling

More information