BI/Analytics for NoSQL: Review of Architectures
What we'll answer in 50 minutes
  - Who is this guy?
  - How do I enable ad hoc, self-service reporting on NoSQL?
  - How do I improve the performance of dashboards on top of NoSQL?
  - How do I integrate NoSQL data with my other data that lives outside NoSQL?
  - How do I enable easy-to-build simple reports while preserving the ability to run rich NoSQL queries?
Nicholas Goodman
  - Open Source BI thought leader
      - 50+ Open Source BI customer projects
      - Blogger, whitepapers, etc
  - Entrepreneur
      - DynamoBI Corporation
      - Bayon Technologies, Inc.
  - Data Geek, hacker, tinkerer, committer
  GOAL: Share perspectives, research, opinions.
  DISCLAIMER: Your Mileage...
How do we answer those Q's?
Promise of Big Data
  NoSQL/Hadoop/MapReduce Systems
  - Keep more of it
  - Cost effective analysis
  - Massive scale data, now accessible to everyone (elastic)
  - Not just SQL queries, more complex analysis
  ACCOMPLISHED: WEB SCALE, MASSIVE NEVER BEFORE SEEN SCALE OF DATA STORAGE AND PROCESSING
Reality Check!
  Petabytes?                            Y
  Cheap Storage?                        Y
  Raw Processing?                       Y
  Rich Query Languages?                 Y
  Flexible data structures?             Y
  Reliable, Fault Tolerant?             Y
  Fast Queries?                         N
  Ad Hoc access?                        N
  Accessibility to commodity BI tools?  N
  Easy report authoring?                N
  Levels of Aggregation?                N
  Integrated Data?                      N
  Big Data has solved the INFRASTRUCTURE of raw/core data storage but has provided less value to what BUSINESS users want for analytics.
Data Gaps too!
  NoSQL side:
  - Code, Developers
  - MR, Rich Graph/Access
  - Hierarchical, Unstructured
  BI side:
  - Analysts w/ Excel, Dashboards
  - Simple 2D (tables, charts)
  - Filtering and easy analytics
Levels of Aggregation
  SAME DATA AT VARIOUS LEVELS OF AGGREGATION: HUGELY IMPORTANT IN REAL LIFE IMPLEMENTATIONS!
  1 ROW TO 1 BILLION ROWS
  [Figure labels: 10K, 1 MILLION, 100 MILLION, 100 BILLION]
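To make "same data at various levels of aggregation" concrete, here is a minimal Python sketch. The event tuples and the `rollup` helper are hypothetical and only illustrate the idea; they are not taken from any of the systems discussed in this talk.

```python
from collections import defaultdict

# Hypothetical raw events: (ISO date, page, hits). Purely illustrative data.
EVENTS = [
    ("2011-03-01", "/home", 3),
    ("2011-03-01", "/docs", 1),
    ("2011-03-02", "/home", 2),
    ("2011-04-01", "/home", 5),
]


def rollup(events, key_fn):
    """Sum hits per group, where key_fn picks the aggregation level."""
    totals = defaultdict(int)
    for event in events:
        totals[key_fn(event)] += event[2]
    return dict(totals)


# The same data, summarized at four different levels.
by_day_page = rollup(EVENTS, lambda e: (e[0], e[1]))  # finest grain
by_day = rollup(EVENTS, lambda e: e[0])
by_month = rollup(EVENTS, lambda e: e[0][:7])         # "YYYY-MM"
grand_total = rollup(EVENTS, lambda e: "all")         # a single row
```

A dashboard would typically read `by_month` or `grand_total`, while a drill-down report would read `by_day_page`; keeping each level precomputed is what makes the higher levels cheap to query.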
Architectures
  - NoSQL reports
  - NoSQL thru and thru
  - NoSQL + MySQL
  - NoSQL as ETL Source
  - NoSQL programs in BI Tools
  - NoSQL via BI Database (SQL)
NoSQL reports
  Pay Developer to build applications for reports
  [Diagram label: Apps]
  Pros:
  - 100% Richness of NoSQL
  - Up to date, current
  - Excellent performance on large datasets
  - Custom built, beautiful reports/dashboards
  - Single system to manage
  Cons:
  - $$, developer driven process
  - No commodity BI tools
  - Managing rollups/summaries
  - Schema-less = Harder!
  - Hard to integrate other reporting information
NoSQL thru and thru
  Pay Developer to build FLEXIBLE applications for reports
  [Diagram labels: Indices, Aggs, Advanced Apps]
  Pros:
  - All of NoSQL report advantages
  - Managed aggregations, rollups
  - Guided ad hoc available inside application
  - Higher performance for dashboards/summaries
  Cons:
  - $$, developer driven process
  - $$, app required for aggs
  - No commodity BI tools
  - Hard to integrate other reporting information
  - Limited ad hoc (only developer built combinations)
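As a rough illustration of managed aggregations kept inside the NoSQL store, the sketch below imitates a map/reduce view in plain Python: a map step emits composite keys, the emitted rows are kept as a sorted index, and a query reduces them at a chosen grouping level. The documents, field names, and the `map_doc`, `build_view`, and `query` functions are assumptions made for illustration, not the API of any particular product.

```python
from collections import defaultdict

# Hypothetical documents, purely illustrative.
DOCS = [
    {"campaign": "spring", "channel": "twitter", "clicks": 4},
    {"campaign": "spring", "channel": "email", "clicks": 1},
    {"campaign": "fall", "channel": "twitter", "clicks": 7},
]


def map_doc(doc):
    """Emit (composite key, value) pairs, as a view's map function would."""
    yield (doc["campaign"], doc["channel"]), doc["clicks"]


def build_view(docs):
    """Run the map step once and keep the emitted rows as a sorted index."""
    return sorted(pair for doc in docs for pair in map_doc(doc))


def query(view, group_level):
    """Sum the indexed rows, grouping on the first group_level key parts."""
    totals = defaultdict(int)
    for key, value in view:
        totals[key[:group_level]] += value
    return dict(totals)


view = build_view(DOCS)
per_campaign = query(view, 1)  # rollup by campaign
per_channel = query(view, 2)   # campaign + channel detail
overall = query(view, 0)       # single grand total
```

Only the key combinations the developer chose to emit can be grouped on, which is the "limited ad hoc" trade-off listed above.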
NoSQL + MySQL
  Pay Developer to build FLEXIBLE applications for reports
  [Diagram labels: ETL, App, MySQL]
  Pros:
  - Less IT $$ since developers aren't building reports
  - Rich NoSQL analysis left in place (ETL + NoSQL)
  - Easy, ad hoc reporting via commodity BI tools
  - Easier to understand data for self service reports
  Cons:
  - Data freshness (24 hrs old)
  - Once into MySQL, no rich NoSQL application use (M/R)
  - BI Tool can connect ONLY to data in MySQL, not NoSQL
  - Aggregations still self managed in MySQL
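A minimal sketch of the ETL step in this architecture: summarize documents from the NoSQL side and load them into a relational table that a commodity BI tool can query with SQL. Python's built-in `sqlite3` stands in for MySQL here so the example is self-contained; the documents, the `user_activity` table, and the function names are hypothetical.

```python
import sqlite3

# Hypothetical documents as they might come out of the NoSQL store.
DOCS = [
    {"user": "ann", "country": "US", "events": [{"type": "view"}, {"type": "buy"}]},
    {"user": "bob", "country": "DE", "events": [{"type": "view"}]},
    {"user": "cy", "country": "US", "events": []},
]


def extract_rows(docs):
    """Summarize each document into one flat (user, country, event_count) row."""
    for doc in docs:
        yield doc["user"], doc["country"], len(doc["events"])


def load(conn, rows):
    """Full reload of the reporting table (the nightly batch in this sketch)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_activity ("
        "user_name TEXT PRIMARY KEY, country TEXT, event_count INTEGER)"
    )
    conn.execute("DELETE FROM user_activity")
    conn.executemany("INSERT INTO user_activity VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the MySQL reporting database
load(conn, extract_rows(DOCS))

# The kind of query a BI tool would generate against the relational copy.
report = conn.execute(
    "SELECT country, SUM(event_count) FROM user_activity "
    "GROUP BY country ORDER BY country"
).fetchall()
```

Note that the individual events are gone once the data lands in the table: the relational copy only holds what the ETL step chose to summarize.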
NoSQL as ETL Data Source
  NoSQL treated like any other data source
  [Diagram labels: Informatica, Teradata]
  Pros:
  - Allows use of consolidated BI tool for ad hoc
  - Enables integrated (combined) datasets for reporting
  - Aggregations often managed
  - Best of Breed tools
  Cons:
  - ETL Development Expense
  - Data Latency
  - Loss of NoSQL language richness
  - Traditional DW tools are $$
  - Scaling issues with DW Database
NoSQL programs in BI Tools
  Write a program in BI tool that flattens data, output into report
  Pros:
  - Rich use of NoSQL native language
  - Direct, up to date access
  - Access to 100% of dataset
  - Leverage guided report parameter pages
  - Less expensive than apps
  Cons:
  - Developer required to write program ($$)
  - Slow-er (aggs, summaries)
  - Lacks integration with other datasets
  - Still (usually) no ad hoc access
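The "flattens data" step might look something like the following Python sketch, which turns one nested document into the simple 2D rows a report expects. The `ORDER` document and the `flatten` and `to_rows` helpers are hypothetical; a real BI tool would have its own scripting hooks for this.

```python
def flatten(doc, prefix=""):
    """Turn nested dicts into a single dict with dotted column names."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


def to_rows(doc, list_field):
    """Emit one flat row per element of a nested list, repeating parent fields."""
    parent = {k: v for k, v in doc.items() if k != list_field}
    for item in doc.get(list_field, []):
        row = flatten(parent)
        row.update(flatten({list_field: item}))
        yield row


# Hypothetical hierarchical document, purely illustrative.
ORDER = {
    "id": 17,
    "customer": {"name": "Ann", "city": "Seattle"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

rows = list(to_rows(ORDER, "items"))
```

Each row repeats the parent fields (`id`, `customer.name`, `customer.city`) next to one item's fields (`items.sku`, `items.qty`), so the result drops straight into a table or chart.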
NoSQL via BI Database (SQL)
  Enable NoSQL data access via SQL (gasp!)
  [Diagram labels: Live Query; Cached, 24hr data]
  Pros:
  - Easy reports, easy (SQL)
  - Integration with other data
  - ETL is simple INSERT/MERGEs
  - Live, up to date access
  - High performance, cached data
  - Ad hoc access to Live + Cached
  - Aggregations/Summaries
  Cons:
  - Another system in between
  - Still needs to be refreshed, nightly
  - Not all capabilities for NoSQL richness available via SQL
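A small sketch of what "ETL is simple INSERT/MERGEs" could look like for the cached side. It uses Python's built-in `sqlite3` as a stand-in for the BI database, and SQLite's `INSERT OR REPLACE` as a stand-in for a `MERGE` statement, which SQLite does not provide. The `daily_hits` table and `refresh_cache` function are assumptions made for illustration.

```python
import sqlite3


def refresh_cache(conn, summaries):
    """Merge fresh (day, hits) summaries into the cached table.

    Existing days are overwritten and new days are added. INSERT OR REPLACE
    is used here in place of MERGE, which SQLite lacks.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_hits (day TEXT PRIMARY KEY, hits INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO daily_hits (day, hits) VALUES (?, ?)", summaries
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the BI database

# First nightly refresh, then a second one that corrects a day and adds a new one.
refresh_cache(conn, [("2011-03-01", 40), ("2011-03-02", 55)])
refresh_cache(conn, [("2011-03-02", 60), ("2011-03-03", 12)])

cached = conn.execute("SELECT day, hits FROM daily_hits ORDER BY day").fetchall()
```

Because the refresh is keyed on `day`, rerunning it is safe: a day that was already cached is replaced rather than duplicated.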
Mozilla: NoSQL thru and thru(db)
  - Socorro Project: Crash reports, optionally sent to Mozilla
  - https://crash-stats.mozilla.com
X: NoSQL via SQL
  - Using Splunk (i.e., a commercial NoSQL-eee data aggregator/etc)
  - Desire to use Tableau for advanced analytics/visualization
Meteor Solutions: NoSQL thru and thru
  - Using Cloudant BigCouch solution (SaaS)
  - High performance set of multi-purpose indices on predefined aggregations
  - Up to date aggregation/reports
  - Better fit for Social Media graph structures over relational DB
  - Custom built BI applications (dashboards/reports) providing a flexible guided view through data
  [Diagram label: Advanced Apps]
A,B,C: NoSQL + MySQL
  - Many, many companies (3 we've worked with)
  - All web related companies (semi structured, some, mostly volume)
  - Heavy lifting and storage, and ETL/data preparation inside Hadoop
  - Push summarized, aggregated data into MySQL for analysis by easy dashboarding/BI tools
  [Diagram labels: ETL, App, MySQL]