Crack Open Your Operational Database
Jamie Martin
jameison.martin@salesforce.com
September 24th, 2013
Analytics on Operational Data
- Most analytics are derived from operational data
- Two canonical approaches:
  - in-situ: run analytics on the operational store
  - ex-situ: move data (ETL) to an optimized store
- Crack open the operational database: direct external access into the live OLTP database
In-Situ Analytics on Operational Data is Limited
- OLTP databases are not built for analytics
  - built for short transactions and simple queries over modest data sets
  - limited query expressiveness
  - storage impedance mismatch (e.g. row vs. column)
  - hybrids exist, but cannot bridge the gap across all dimensions
- Operational databases are usually resource constrained
  - limited CPU, cache, and IOPS for analytics
  - long-running queries cause lock conflicts or MVCC inefficiencies
- Instead, run analytics on an optimized analytics engine
  - optimized columnar stores
  - massively scalable compute engines
  - fast aggregation engines (OLAP)
Ex-situ Analytics is an ETL Nightmare
- Get a snapshot of the operational store
- Run analytics somewhere else
  - use some other compute, perhaps with more optimal storage
- Capture ongoing changes from the OLTP engine
  - must be non-disruptive
  - usually needs to be transactionally consistent
  - often need to keep the analytics live
- In practice this is really, really painful
  - ETL nightmare: expensive, rigid, slow, fragile
  - data governance and provenance problems
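To make the pain concrete: without log access, change capture is often done by re-querying the source. Below is a minimal sketch of query-based change capture, assuming a hypothetical table with an `updated_at` column (all names here are illustrative, not from the deck). It is lossy by construction — it misses deletes and can race with in-flight transactions — which is part of why pipelines built this way are fragile.

```python
import sqlite3

def poll_changes(conn, table, last_seen):
    """Naive query-based change capture: re-scan for rows whose
    updated_at is newer than the last poll cursor. Misses deletes
    and concurrent writers; illustrative only."""
    rows = conn.execute(
        f"SELECT id, val, updated_at FROM {table} "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    # Advance the cursor to the newest change we saw.
    new_last = rows[-1][2] if rows else last_seen
    return rows, new_last

# Demo against an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT, updated_at INTEGER)")
conn.execute("INSERT INTO t VALUES (1, 'a', 10), (2, 'b', 20)")
changes, cursor = poll_changes(conn, "t", last_seen=10)
```

A row deleted between polls simply never appears in `changes`, so the downstream copy silently diverges — the kind of fragility the slide is describing.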
Hadoop: Unconstrained Data Access
- Open data ecosystem
  - distributed storage: HDFS
  - distributed compute: M/R (YARN)
  - really interesting data access possibilities
- Unconstrained access to data
  - data files are all out there in the wild on HDFS
  - storage formats are typically public (implement M/R Input/OutputFormat)
  - ecosystem encourages integration (can run M/R directly on HBase HFiles)
  - this is very different from your typical DBMS
OLTP Enabled Analytics: Snapshots
- Goal: take a snapshot directly against an active database
- Getting to the bytes is complicated
  - semantics of the data: columns and datatypes for table T
  - logical-to-physical mapping: where is the data?
  - physical consistency: coordination with writers
  - transactional consistency
  - persistence formats: understanding the data layout (rows, columns, etc.)
- A traditional database is a black box
  - contents of a table: system catalogs, JDBC metadata, etc.
  - where the data is: table spaces -> dbs -> partitions -> ... -> extents -> pages
  - physical consistency: in-memory latches, pinning
  - transactional consistency: in-memory lock tables, MVCC information
  - persistence formats: proprietary
Direct Snapshots
- An approach to direct external access
  - logical-to-physical data mapping externalized through a public catalog service
    - find the specific persistent artifacts that contain the desired data
    - the DBMS abdicates space management
  - physical consistency without latching
    - immutable storage (not ARIES)
    - anyone can read persisted data without coordination
  - transactional consistency through MVCC
    - records contain transaction information
    - consistent point in time via filters on the data (not PiTR)
  - published persistence formats
- These are the same techniques needed to scale up and out
  - MVCC and immutable data to scale up
  - a cross-node catalog describing persistence to scale out
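The MVCC filtering idea — records carry transaction information, and a reader selects a consistent point in time by filtering versions — can be sketched as follows. The field names (`begin_txn`, `end_txn`) are assumptions for illustration; the deck does not specify a record layout.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Record:
    key: str
    value: str
    begin_txn: int           # transaction that created this version
    end_txn: Optional[int]   # transaction that superseded it, if any

def visible(record: Record, snapshot_txn: int) -> bool:
    """A version is visible at a snapshot if it was created at or
    before the snapshot point and not yet superseded as of it."""
    return (record.begin_txn <= snapshot_txn
            and (record.end_txn is None or record.end_txn > snapshot_txn))

# Two versions of the same row, both sitting in immutable storage.
versions = [
    Record("acct:1", "balance=100", begin_txn=5, end_txn=9),
    Record("acct:1", "balance=80",  begin_txn=9, end_txn=None),
]

# Any external reader can apply the same filter, with no coordination.
snapshot = [r for r in versions if visible(r, snapshot_txn=7)]
```

Because the versions are immutable and carry their own transaction metadata, an external reader needs no latches or lock tables — exactly the property the slide relies on.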
Taking a Snapshot
- Snapshot acquisition
  - obtain a snapshot for table T
    - locate the immutable artifacts that may have data for table T
    - register interest in them as of a point in time (MVCC)
    - get a consistent snapshot
  - access the data directly, with impunity
    - direct analytics, e.g. M/R on OLTP data
    - or dump into a secondary system for subsequent analytics
  - release the snapshot
- Consistency without fine-grained coordination
Change Detection
- Allow direct external access to the OLTP transaction log
- Externalized access
  - the transaction log as an externally meaningful data stream
  - track transaction logs in an external catalog
  - physical consistency: logs are already append-only/immutable
  - transactional consistency: tie data MVCC to log records
  - published log formats
- Models
  - pull: fetch log chunks as needed and apply them to snapshots
  - push: publish log records on a data bus; enables streaming analytics
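The pull model — fetch a log chunk and roll a snapshot forward to a later transaction — might look like the sketch below. The record shape (`txn`, `op`, `key`, `value`) is an assumed stand-in for whatever the published log format would actually contain.

```python
def apply_log_chunk(state, chunk, up_to_txn):
    """Pull model: replay committed log records in order onto a
    snapshot's state, stopping at the requested transaction."""
    for rec in chunk:
        if rec["txn"] > up_to_txn:
            break                      # beyond the target point in time
        if rec["op"] == "put":
            state[rec["key"]] = rec["value"]
        elif rec["op"] == "delete":
            state.pop(rec["key"], None)
    return state

# A snapshot taken at some earlier transaction...
snapshot = {"acct:1": 100}
# ...and a chunk of the externalized transaction log.
log = [
    {"txn": 6, "op": "put",    "key": "acct:1", "value": 80},
    {"txn": 7, "op": "put",    "key": "acct:2", "value": 50},
    {"txn": 9, "op": "delete", "key": "acct:1"},
]
state = apply_log_chunk(snapshot, log, up_to_txn=7)
```

The push model is the same replay logic driven by records arriving on a data bus rather than by explicit chunk fetches.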
Challenges
- Schema evolution
  - snapshots cannot require DDL coordination
  - it is hard to receive schema changes from the firehose of changes
- Externalizing persistence formats is easier said than done