So What's the Big Deal?
Presentation Agenda
Introduction
What is Big Data?
So What is the Big Deal?
Big Data Technologies
Identifying Big Data Opportunities
Conducting a Big Data Proof of Concept
Big Data Case Study (if we have time)
Q&A
Links to More Information
Introduction
RhinoSource, Inc.
Oracle App/Tech Consulting and Managed Services: Oracle E-Business Suite, Oracle Business Intelligence, Oracle Database (performance, partitioning and replication), Application Development and Advanced PL/SQL Development
Advanced Technology Consulting and Managed Services: Big Data, Mobile Applications, Cloud Computing
CIO-Level Advisory Services: IT Strategy, Planning and Project Management; ERP/CRM Evaluation and Implementation
WHAT IS BIG DATA?
What Makes Up Big Data?
Blog posts, user comments
Emails and messaging
Web server logs
Instrumentation of online stores
Image and video uploads
Process data, such as RFID
Sensor device data
External data sets: census data, weather data, geographical data
Shadow data (replicated copies and change journals)
The 3 V's of Big Data
Velocity: Data is generated at a faster rate than ever before (server logs, smart phones, sensor devices, RFID). Challenge: Existing systems cannot process new data fast enough.
Variety: Data is more varied and complex, structured and unstructured, in many formats: text, document, image, video. Challenge: Existing databases do not handle varying data formats well.
Volume: Orders of magnitude larger. 2.5 zettabytes of new data created in 2012, 8 zettabytes of new data projected to be created in 2015; 3 billion Internet users, 15 billion connected devices. Challenge: Existing databases do not scale cost-effectively to Big Data sizes.
Big Data Growth Trend (chart: zettabytes of data over time, growing at a 40% CAGR)
How Much is 1 Zettabyte?
New Data by End of 2015: 17 ZB of new data by the end of 2015!
SO, WHAT'S THE BIG DEAL?
Can Big Data Show Us The Way? Scientific American, Dec 11: "...the rise of 'big data' [is] a trend that is striking many scientists as being on a par with the invention of the telescope and microscope." "...many experts believe we are on the cusp of opening up new worlds of inquiry."
Big Data Advantages
Better, more accurate predictions
Deeper, richer insights into customers, business partners and operations
Real-time Big Data analytics enables faster decision making
Creates competitive advantage
Improves the bottom line
Big Data Spending
Companies had spent $4.3 billion on Big Data as of the end of 2012.
Gartner predicts those initial investments will in turn trigger a domino effect of upgrades and new initiatives, valued at $34 billion for 2013.
Over a 5-year period, spending is estimated at $232 billion.
BIG DATA TECHNOLOGIES
A Brief History of Big Data (diagram: scale vs. time, progressing from RDBMS to RAC to Data Warehouse to Distributed Big Data Cluster)
How Do We Store Big Data?
NoSQL databases store data records as key-value pairs, or as triplets with a timestamp.
Schema-less or schema-optional; values may be structured or unstructured (developer's choice).
Not relational: no relationships between records, no join support in a NoSQL database.
Does not use SQL to store and retrieve records.
Highly optimized for retrieval and appending operations: high-performance writes, high-performance retrieval by primary key, little functionality beyond record storage and retrieval.
Highly scalable to huge amounts of data (millions or billions of records).
Partitions data across many distributed, inexpensive servers for cost-effective scalability and availability.
Must trade off between Availability and Consistency (CAP Theorem).
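To make the key-value/triplet model above concrete, here is a minimal in-memory sketch in Java. All class and method names are illustrative, not any particular NoSQL product's API: each record is addressed by a row key, columns are free-form, every value carries a write timestamp, and the store supports only put and get by key.

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the key-value / triplet model: (row key, column, value + timestamp).
// No schema is enforced, no relationships exist and there is no join -- only put/get.
public class KeyValueSketch {

    // value plus write timestamp, forming the "triplet" with the key
    record TimestampedValue(String value, long writtenAtMillis) {}

    private final Map<String, Map<String, TimestampedValue>> rows = new HashMap<>();

    public void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new HashMap<>())
            .put(column, new TimestampedValue(value, System.currentTimeMillis()));
    }

    public TimestampedValue get(String rowKey, String column) {
        Map<String, TimestampedValue> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    public static void main(String[] args) {
        KeyValueSketch store = new KeyValueSketch();
        // two rows with different "columns" -- schema-less by design
        store.put("user:1001", "email", "pat@example.com");
        store.put("user:1002", "last_login_ip", "10.1.2.3");
        System.out.println(store.get("user:1001", "email").value());
    }
}

A real NoSQL store adds what this sketch leaves out: partitioning of the row-key space across many servers and replication between them, which is exactly where the Availability-versus-Consistency trade-off below comes in.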
Popular NoSQL Databases: Key-Value Stores, Column-Oriented Databases, Graph Databases, Document Databases
Why Not Relational for Big Data?
Transforming and loading data into an RDBMS requires extensive preprocessing into a pre-defined schema; this doesn't work well for semi-structured and unstructured data, and can take more time than is available before the next batch must be loaded.
Joining multiple data sets at query time is an expensive operation.
RDBMS scaling must be done vertically, to larger and more expensive servers and storage solutions.
RDBMS clustering requires expensive networking and shared storage infrastructure (Fibre Channel, InfiniBand, SAN, NAS).
Distributing data across data centers is challenging; replication strategies are complex add-ons.
Strict consistency is enforced at the cost of write performance and availability (CAP Theorem).
Dr. Brewer's CAP Theorem: pick 2 of Consistency, Availability and Partition Tolerance
CA: Oracle RAC, RDBMS
CP: BigTable, Hadoop/HBase, MongoDB, Oracle NoSQL, Redis
AP: Cassandra, CouchDB, Dynamo, Riak, SimpleDB
Scalability Comparison (charts, logarithmic and linear scales): terabytes vs. server nodes for MongoDB, Oracle RAC, Cassandra and Hadoop.
Reference deployments: 10 TB on 100 nodes at Craigslist; 71 TB on 48 nodes at Amazon; 300 TB on 400 nodes at Digital Reasoning; 21 PB on 2,000 nodes at Facebook.
Feature Comparison
Cassandra
Best of the NoSQLs for cross-data-center replication and high availability
Known to scale to 100s of terabytes (and theoretically to petabytes)
Tunable consistency at the operation level for writes and reads; availability model (AP)
Primary and secondary indexes
Queries are real time (CQL, Thrift); no join support
Masterless peer-to-peer ring architecture = no single point of failure
Provides the most cost-effective HA and scalability of the NoSQLs
Written in Java; minimum of 3 nodes recommended
Easy to install and set up on commodity hardware
Hadoop/HBase
The current gold standard of the NoSQLs for data analysis
Known to scale to petabytes (1000s of terabytes)
Consistency model (CP)
Hadoop queries are batch (MapReduce); HBase provides real-time queries similar to Cassandra
Joins are possible
Master-slave architecture = single point of failure (NameNode/JobTracker)
Written in Java; minimum of 5 nodes recommended
More challenging installation and setup
Warm standby and shared storage required for high-availability failover, so higher infrastructure costs
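As an illustration of the per-operation tunable consistency noted for Cassandra above, here is a short Java sketch assuming the DataStax Java driver (3.x-era API); the contact point, keyspace and events table are hypothetical placeholders.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class TunableConsistencySketch {
    public static void main(String[] args) {
        // Placeholder contact point and keyspace.
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect("app_keyspace");

        // Write at QUORUM: a majority of replicas must acknowledge the write.
        SimpleStatement insert = new SimpleStatement(
                "INSERT INTO events (user_id, payload) VALUES (?, ?)",
                "user-1001", "login");
        insert.setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(insert);

        // Read at ONE: fastest and most available, weakest consistency.
        SimpleStatement select = new SimpleStatement(
                "SELECT payload FROM events WHERE user_id = ?", "user-1001");
        select.setConsistencyLevel(ConsistencyLevel.ONE);
        ResultSet rs = session.execute(select);
        for (Row row : rs) {
            System.out.println(row.getString("payload"));
        }

        cluster.close();
    }
}

This per-statement dial is what "tunable consistency" means in practice: each read or write chooses its own spot on the availability-versus-consistency spectrum rather than the whole database enforcing one setting.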
Best of All Worlds: DataStax Enterprise
Cassandra: real-time database, peer-to-peer HA architecture, cross-data-center replication, real-time low-latency queries
Hadoop: analytics, Map/Reduce, Hive, Pig (joins)
Solr: search, full-text search, rich document handling (Word, PDF)
Plus Cluster Management
Current Big Data Challenges
Integrating Big Data with existing databases and BI/reporting systems (JDBC, ODBC, Sqoop)
Security and encryption: DataStax Enterprise 3.0 (in beta) adds transparent data encryption, internal and external authentication, and data auditing
IDENTIFYING BIG DATA OPPORTUNITIES
Big Data Use Cases
Context for interactions and transactions: reward points, warranty policies, social media chatter, survey response feedback, website requests
Connection with outside patterns: weather data, demographic data, geographical data, government compliance data
Improving disaster and outage response times by spotting trends
Compliance checks and audits
Competitive insights into how your products and services (and your competition's) are used and perceived in the marketplace
Database infrastructure behind mobile and web applications
Great Places to Look for Big Data Opportunities
Server logs: web server and app server logs, call center/phone system logs
Product data: performance data, sensor data, positional data, streamed live or captured in log files/data files
Current RDBMS archive/purge strategies: what data are you deleting every day/month/year? Financial data, operational data, customer interactions
Implementing Big Data
Identify "game changing" Big Data opportunities.
Define a business case.
Identify existing business and functional capabilities.
Augment existing capabilities with 3rd-party assistance.
Conduct a low-cost Proof of Concept project to demonstrate feasibility.
Low Cost Proof of Concept
Take advantage of a cloud platform like Amazon Web Services (AWS) and Amazon EC2: run a multi-node cluster for less than $25/day.
Get started instantly: have a cluster up and running in only a few hours.
NoSQL technologies are perfectly suited for the cloud-deployed model.
Amazon Machine Images (AMIs) exist for most NoSQL products and can be started in just a few minutes, as in the sketch below.
You can make it as secure as you need it to be.
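As a rough sketch of how quickly a PoC cluster can be brought up programmatically, here is an example using the AWS SDK for Java (v1) to launch three instances from a vendor-published AMI. The AMI id, key pair and security group names are placeholders to substitute for your region and vendor.

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.InstanceType;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.ec2.model.RunInstancesResult;

public class LaunchPocCluster {
    public static void main(String[] args) {
        // Credentials and region come from the default AWS provider chain.
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // Placeholder AMI id, key pair and security group: use the NoSQL
        // vendor's published AMI for your region.
        RunInstancesRequest request = new RunInstancesRequest()
                .withImageId("ami-xxxxxxxx")
                .withInstanceType(InstanceType.M1Large)
                .withMinCount(3)
                .withMaxCount(3)
                .withKeyName("poc-keypair")
                .withSecurityGroups("poc-cluster-sg");

        RunInstancesResult result = ec2.runInstances(request);
        result.getReservation().getInstances()
              .forEach(i -> System.out.println("Started " + i.getInstanceId()));
    }
}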
Low Cost Proof of Concept
Now that you have a cluster up and running:
Load up some test data. (Check out Sqoop.)
Get your HiveQL book in hand and start doing some analysis, along the lines of the sketch below.
Delete the servers once you are done; you only pay for the time the servers are running.
You can always bring the cluster in-house for production, but you might find out it's more cost-effective to leave it in the Cloud!
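For the analysis step, here is a minimal sketch of running a HiveQL query from Java over JDBC, assuming a HiveServer2 endpoint is reachable on the PoC cluster and a hypothetical web_logs table has already been loaded (for example with Sqoop).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveAnalysisSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 host, database, user and table names are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://poc-node-1:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // A typical first analysis over log data: request counts per
            // status code, executed by Hive as a batch MapReduce job.
            ResultSet rs = stmt.executeQuery(
                "SELECT status, COUNT(*) AS hits " +
                "FROM web_logs GROUP BY status ORDER BY hits DESC");
            while (rs.next()) {
                System.out.println(rs.getString("status") + "\t" + rs.getLong("hits"));
            }
        }
    }
}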
BIG DATA CASE STUDY (if we have time)
Client Overview
Mobile social networking startup focused on families with kids
Launching in Q1 2013; currently in stealth mode pending launch in the first week of March 2013
Big Data use case: infrastructure behind the mobile app
The Challenge
Big Data application
Semi-structured and unstructured data
Low latency (<100 ms) for user experience
24x7 high availability
Cloud deployment (Amazon AWS)
Analytical capability required
The Solution: DataStax Enterprise Big Data Database Cluster
Cassandra database for low-latency reads and writes
Cluster architecture for high availability
Tunable read and write consistency
Integrated Hadoop workload support for analytics
Integrated Solr workload support for the search feature
DataStax OpsCenter tool for cluster management
Benefits
High-performance reads and writes = good customer experience
Only a single cluster required for Cassandra, Hadoop and Solr
Commercial-grade support
Cost-effective solution
Fast deployment (30 days)
Technical Details
Installed DataStax Enterprise 2.2.1 on Amazon AWS
3 x m1.large nodes, doubling to 6 nodes later in the year
Each node will hold ~800 GB of data
Implemented monitoring and alerts: cluster stats collected every 15 seconds, stored in the database and graphed
Amazon SNS for notifications (email and SMS), as in the sketch below
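A rough sketch of the notification piece, assuming the AWS SDK for Java (v1) and a pre-created SNS topic with email and SMS subscribers; the topic ARN, node name, metric and threshold are placeholders, not the client's actual monitoring code.

import com.amazonaws.services.sns.AmazonSNS;
import com.amazonaws.services.sns.AmazonSNSClientBuilder;
import com.amazonaws.services.sns.model.PublishRequest;

public class ClusterAlertNotifier {
    // Placeholder topic ARN; email and SMS subscribers are attached to the
    // topic in the SNS console or via the API.
    private static final String TOPIC_ARN =
            "arn:aws:sns:us-east-1:123456789012:cluster-alerts";

    public static void main(String[] args) {
        AmazonSNS sns = AmazonSNSClientBuilder.defaultClient();

        // In the monitoring loop, publish whenever a stat sampled every
        // 15 seconds crosses a threshold.
        double pendingCompactions = 42;   // placeholder sampled metric
        if (pendingCompactions > 30) {
            sns.publish(new PublishRequest(
                    TOPIC_ARN,
                    "Node poc-node-2: pending compactions = " + pendingCompactions,
                    "DSE cluster alert"));
        }
    }
}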
Amazon AWS and EC2
OpsCenter Cluster Management
Cluster Ring View
Performance Monitoring
Customized Dashboards
More Custom Dashboards
Q&A
More Reading www.rhinosource.com/bigdata.html
Thank you!