Big Data Patterns
Ron Bodkin
Founder and President, Think Big
About Me
Ron Bodkin, Founder and President, Think Big
- 9 years of experience working with Big Data and Hadoop
- Founded Think Big in 2010 to help companies realize measurable value from Big Data
- Our expertise spans all facets of data science and data engineering and helps customers drive maximum value from their Big Data initiatives
- Patterns in this talk come from large-scale deployments in high-tech manufacturing and digital marketing
- Follow me at @ronbodkin
Agenda
- Context
- Patterns
- Conclusions
Big Data: The Key is Variety
Definition: datasets so complex and large that they are awkward to work with using standard tools and techniques
- Location, social, images, weblogs, videos, text, audio, sensor data
Size is not what is most important; it's variety.
How is Information Management Changing?
- Schema on read? Yes, as step one, but data still has underlying structure
- It's more like agile modeling: reflect as much structure as needed
- Loosely coupled schemas come without platform guarantees but enable more application flexibility
- Data modeling isn't dead! Metadata is more important than ever
- Data warehouses are embracing Big Data principles (e.g., elasticity, JSON)
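A minimal sketch of "as much structure as needed", assuming a hypothetical raw JSON feed: the same file is read once with inferred structure and once with a small declared schema (PySpark is used purely for illustration).

```python
# Sketch: schema on read as step one, then declare only needed structure.
# The file path and field names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Step one: let the platform infer structure from the raw JSON.
raw = spark.read.json("/data/raw/events.json")

# Agile modeling: declare only the structure the application needs;
# unmodeled attributes can stay in the raw feed until they matter.
declared = StructType([
    StructField("event_id", StringType()),
    StructField("actor_id", StringType()),
    StructField("ts", LongType()),
])
modeled = spark.read.schema(declared).json("/data/raw/events.json")
```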
Changes in the Platform
Entry-level Hadoop cluster circa 2015 (20 nodes): 240 cores, 1 PB spinning disk, 10 TB RAM, 10-40 GbE, low software cost
- Disk transfer times are increasing => many disks => DAS (2005-2020)
- Distributed RAM is increasingly important to expedite computation, although data volumes are increasing faster
- The network will be the computer (really!) => you can distribute disks separately across high-bandwidth fabrics (2020+)
- All of this changes many assumptions in traditional physical modeling
Changes in Logical Modeling
- JSON-like structures: complex collections of relations, arrays, maps of items
- Graphs: storing complex, dynamically changing (not static) relationships
- Binary/CLOB/specialized data: ability to execute specialized programs to interpret and process it
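For instance, a single JSON-like record (hypothetical fields, invented for illustration) can mix a nested relation, an array, a map, and a graph-style link:

```python
# A made-up device event illustrating the structures flat rows handle poorly.
device_event = {
    "device_id": "dev-42",
    "firmware": {"version": "1.8.2", "installed": "2015-01-01"},  # nested relation
    "sensors": ["temp", "vibration"],                             # array
    "readings": {"temp": 71.3, "vibration": 0.02},                # map of items
    "related": [{"type": "assembly", "id": "asm-7"}],             # graph-style link
}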
Changes in Physical Modeling
Big Data unpacks the database metaphor:
- Data distribution: key design, sharding/distribution, file formats
- Multiple computational algorithms, e.g., MapReduce, computational graphs (Spark, Tez), data flow, streaming, graph engines
- Integrity is an application concern
- Storage is cheap: denormalization and materialized views are common, yet compression is popular, often for IO savings
- Summarization is orders of magnitude more powerful
- Index lookups are increasingly costly
- Distributed systems impose eventual consistency and reconciliation demands
Leading Financial Asset Manager (Financial Services)
Challenge: siloed consumer analytics; lack of agility in analysis; slow ETL
Solution: scalable ETL; discovery analytics tech & process; cross-channel data science models; Cloudera Enterprise, HBase, Greenplum
Results: scalable processing; extracted customer behavior signals from raw data for existing and new behavior models; faster time to insight
Leading Enterprise Tech Component Vendor (High Tech Manufacturing)
Challenge: data search parties waste engineers' time; excess scrap waste and slow time to market; reactive analytics model
Solution: scalable data lake; search and deep analytic queries; integrated assembly insights for data science models; Hive, Impala, Redshift, Elasticsearch; big data training and hackathons
Results: supply chain line of sight from R&D and manufacturing through to servicing at customer sites; end-to-end proactive analytics (reduced development time, improved manufacturing yield, increased customer satisfaction); proactive, at-scale analytics led to better engineering theory
Patterns
Important New Patterns
- Denormalized Fact
- Profile
- Event History
- Timeline
- Assembly
- Distributed Sources
- Late Data
- Deep Aggregates
- Recovery
- Multiple Active Clusters
Event History
- Fact table about common events to allow, e.g., cross-channel analytics in context
- E.g., clickstream, posts, purchases, content consumption, device activity
- Stored in columnar format (e.g., Parquet, ORCFile)
- Join "as-was" values of slowly changing dimensions
- Often has an extension column of unparsed/unmodeled JSON-like data
- Partitioned by event-time buckets, perhaps also by other dimension(s)

| Event id | Actor id | Time            | Event col's | Dim id's | Dim col's | Ext. Data   |
|----------|----------|-----------------|-------------|----------|-----------|-------------|
| 123      | uid1     | 1/1/15 13:16:11 | ...         | ...      | ...       | {"TstA": 1} |
| 456      | uid2     | 1/1/15 13:16:14 | ...         | ...      | ...       | {"TstB": 1} |
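A minimal PySpark sketch of landing such a table, with hypothetical column names (ts for event time, ext for the unparsed data): events are written as Parquet, partitioned by an event-date bucket, with the extension column carried along unparsed.

```python
# Sketch: land an event-history fact table as Parquet with daily
# event-time partitions. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("event-history").getOrCreate()

events = spark.read.json("/data/raw/events/")  # one record per event

(events
    .withColumn("event_date", F.to_date("ts"))  # event-time bucket
    .withColumnRenamed("ext", "ext_data")       # unparsed JSON extension column
    .write
    .mode("append")
    .partitionBy("event_date")                  # partition by event-time bucket
    .parquet("/data/warehouse/event_history/"))
```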
Timeline
- Pivot on event history: a table of actors with their events over time
- Customer journey, device history
- Enables support/analysis on specific items, long-lived analysis
- May have a hierarchy of actors (e.g., household, individual, device)
- May be an array of events, many columns, or subsorted (cluster key)
- Also stored in columnar format; may be partitioned
- May be updated in near real-time AND batch
- Often holds cached algorithm values (combined Profile)

| Actor id | Segments  | Ev1:id | Ev1:fact        | Ev1:dims | Ev1:ext     | Ev2:id |
|----------|-----------|--------|-----------------|----------|-------------|--------|
| uid1     | [1, 3, 7] | 123    | 1/1/15 13:16:11 | ...      | {"TstA": 1} | 789    |
| uid2     | [2, 3]    | 456    | 1/1/15 13:16:14 | ...      | {"TstB": 1} | 0ab    |
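A hedged sketch of the pivot, continuing the hypothetical columns from the event-history sketch: events are collected per actor into a time-ordered array of structs (Spark compares structs field by field, so placing ts first sorts by time).

```python
# Sketch: pivot the event history into a per-actor timeline table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("timeline").getOrCreate()
history = spark.read.parquet("/data/warehouse/event_history/")

timeline = (history
    .groupBy("actor_id")
    .agg(F.sort_array(                       # order events by ts (first field)
            F.collect_list(F.struct("ts", "event_id", "ext_data"))
         ).alias("events")))

timeline.write.mode("overwrite").parquet("/data/warehouse/timeline/")
```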
Event Analytics
- Propensity/segmentation
  - May be scored in real time using Timeline/Profile
  - May be hybrid: scored in batch using Event History
  - Trained from the Timeline
- Attribution
  - Score the impact of past events on a new event (e.g., purchase, churn)
  - Algorithms range from simple rules to Shapley value
  - Natural in the Timeline
- Reporting, exploration
  - Often via Deep Aggregates, using HyperLogLog (see sketch below)
- Discovery
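For the HyperLogLog-backed aggregates, a sketch using Spark's approx_count_distinct, which is implemented with an HLL++ sketch (columns as in the earlier hypothetical sketches):

```python
# Sketch: approximate daily unique actors; rsd=0.02 targets roughly +/- 2%.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("deep-aggregates").getOrCreate()
history = spark.read.parquet("/data/warehouse/event_history/")

daily_uniques = (history
    .groupBy("event_date")
    .agg(F.approx_count_distinct("actor_id", rsd=0.02).alias("unique_actors")))

daily_uniques.show()
```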
Event Data Management
- Identity merge
  - Discovery of new identities (e.g., a cookie logs in, Facebook connect)
  - Indirection or rewrites; requires rescoring
- Expiration/archival
  - Efficiency, policy requirements
- Governance
  - Lineage & security
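One common way to implement identity merge by indirection is a union-find over observed identifiers; a toy, self-contained sketch with made-up ids:

```python
# Toy union-find for identity merge: when a cookie logs in, link the
# cookie id and the account id into one canonical identity.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path compression
        x = parent[x]
    return x

def merge(a, b):
    parent[find(a)] = find(b)

merge("cookie:abc123", "user:uid1")     # login event observed
merge("fbid:999", "user:uid1")          # Facebook connect observed

# Both identifiers now resolve to the same canonical identity;
# downstream models must be rescored against the merged identity.
assert find("cookie:abc123") == find("fbid:999")
```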
Network
- Ongoing status of configuration: parts in an assembly, related items (versions), social groups
- Nodes can be people, devices, etc.
- Maintain links in a graph structure; may be current or historical
- Use links to pull full context from Event History or Timeline
- Search -> simple query -> complex analytics, e.g., transitive closure, impact analysis
- Technologies: Giraph, GraphX, TitanDB, Neo4j
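A toy sketch of impact analysis as transitive closure over an in-memory link graph (the adjacency data is invented for illustration); engines like Giraph or GraphX do this at scale:

```python
# Toy BFS transitive closure: find every item downstream of a component,
# e.g., which assemblies and products a component change affects.
from collections import deque

links = {
    "component:X": ["assembly:A", "assembly:B"],
    "assembly:A": ["product:P1"],
    "assembly:B": ["product:P2"],
}

def reachable(start):
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nbr in links.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return seen - {start}

print(reachable("component:X"))  # all downstream items for impact analysis
```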
Distributed Sources
- Unlike simple all-or-nothing feeds, there may be many distributed sources feeding data
- It's critical to know whether all (or enough) data has arrived; this is the root cause of ingestion complexity
- Goals: only produce analytic results when data is sufficient; provide provenance; report timeliness & completeness statistics
- Need SLAs for timeliness and the required fraction of data
- Control totals
- Metadata about process (expected lineage)
- Heartbeats/configuration
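A minimal sketch of a control-total check, assuming a hypothetical manifest of expected row counts per source and a 95% completeness SLA (all numbers invented):

```python
# Sketch: only publish analytic results once enough of each expected
# source feed has landed; otherwise surface completeness statistics.
expected = {"web": 1_000_000, "mobile": 800_000, "pos": 50_000}  # manifest
arrived  = {"web":   998_350, "mobile": 612_900}                 # landed so far

REQUIRED_FRACTION = 0.95  # SLA: 95% of expected rows per source

def sufficient(expected, arrived):
    return all(arrived.get(src, 0) >= n * REQUIRED_FRACTION
               for src, n in expected.items())

if sufficient(expected, arrived):
    print("publish authoritative results")
else:
    print("hold: completeness below SLA; report provenance statistics")
```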
Late Data
- Data may be delayed due to:
  - Upstream system failures (server down, especially with unreliable delivery; network outages)
  - Offline/disconnected devices (endemic with mobile & IoT)
- Metadata to track lineage is critical
- Define a delay time after which, with high confidence, sufficient data has arrived
  - Process authoritative derived data after that time
  - May process incremental/incomplete data earlier (a la economic statistics)
  - May re-process in an emergency (restatement)
  - May include changed data in a later period
- Report on how much data has arrived late
- Implementation: bucket on event time, with a secondary bucket on delay epoch (partitions for late data); see the sketch below
- Arrival delays typically follow a Zipfian distribution (chart omitted)
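A hedged PySpark sketch of that implementation, assuming hypothetical event_ts and arrival_ts columns in the raw feed:

```python
# Sketch: bucket on event time with a secondary delay partition so late
# arrivals land in their own partitions. Paths/columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("late-data").getOrCreate()
events = spark.read.json("/data/raw/events/")

(events
    .withColumn("event_date", F.to_date("event_ts"))
    .withColumn("delay_days",
                F.datediff(F.to_date("arrival_ts"), F.to_date("event_ts")))
    .write
    .mode("append")
    # delay_days == 0 is on-time data; late rows land in delay_days > 0
    # partitions, so restating a day touches only that day's buckets
    .partitionBy("event_date", "delay_days")
    .parquet("/data/warehouse/event_history/"))
```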
Conclusions
Probabilistic Data Structures
Increasingly valuable as an optimization technique, e.g.:
- Bloom filters
  - Hash key values into a bit array; check a key to see if it may be present
  - Used for indexing/filtering, sparse reads
- HyperLogLog, sketch sets
  - Multiple hashes used to estimate the count of unique items
  - Far more compact (KBs to count billions of items, +/- 2%)
  - Can be composed (unlike exact unique counts), e.g., across time, categories
- MinHash
  - Compare the least hashed values between two sets
  - Used to identify duplicates and estimate overlap in arbitrary sets
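Toy, self-contained sketches of a Bloom filter and MinHash to make the ideas concrete (sizes and hash choices are illustrative; production systems use tuned implementations):

```python
# Toy Bloom filter and MinHash; md5 stands in for fast non-crypto hashes.
import hashlib

M, K = 1 << 20, 4  # bit-array size and number of hash functions

def _hashes(key):
    for i in range(K):
        yield int(hashlib.md5(f"{i}:{key}".encode()).hexdigest(), 16) % M

bloom = bytearray(M // 8)

def bloom_add(key):
    for h in _hashes(key):
        bloom[h // 8] |= 1 << (h % 8)

def bloom_maybe_contains(key):              # False => definitely absent
    return all(bloom[h // 8] & (1 << (h % 8)) for h in _hashes(key))

bloom_add("uid1")
assert bloom_maybe_contains("uid1")         # may be present
print(bloom_maybe_contains("never-seen"))   # almost surely False

def minhash(items, k=64):
    """Signature of k least hash values; matches estimate Jaccard overlap."""
    def h(i, x):
        return int(hashlib.md5(f"{i}:{x}".encode()).hexdigest(), 16)
    return [min(h(i, x) for x in items) for i in range(k)]

a, b = minhash({"x", "y", "z"}), minhash({"x", "y", "q"})
similarity = sum(u == v for u, v in zip(a, b)) / len(a)  # ~ Jaccard index
print(similarity)
```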
Anti-Patterns
- 3rd Normal Form, Star Schema, Snowflake Schema
  - Index lookups are slow in general; focus on partitioned reads, not disk seeks
  - Poor results in practice
  - Not natural representations for repeating events or nested structure
  - Use of SSDs, maturing optimizers, and platform updates (Kudu?) are slowly improving this; the industry would love it to happen
  - Expect data marts to work in Big Data before data warehouses do
Conclusions
- Much of Big Data today is trade-craft: learned lore and principles derived from first experience
- As we scale data lakes & analytics, it is critical to have a common vocabulary and shared understandings
- I'd love your input on common patterns & practices
- Look for blogs with more depth on each pattern at http://thinkbig.teradata.com/author/rbodkin/
- Reach me at @ronbodkin, ron.bodkin@thinkbiganalytics.com