Reusable Data Access Patterns


1 Reusable Data Access Patterns Gary Helmling, Software HBaseCon May 7

2 Agenda
- A brief look at data storage challenges
- How these challenges have influenced our work at Cask
- Exploration of Datasets and how they help
- Review of some common data access patterns and how Datasets apply
- A look underneath at the tech that makes it work

3 Data Storage Challenges
- Many HBase apps solve similar problems with common needs
- Developers rebuild these solutions on their own, sometimes repeating the work for each app
- Effective schema (row key) design is hard
  - byte[] conversions, no real native types
  - No composite keys
  - Some efforts exist, such as the Orderly library and HBase types (HBASE-8089), but nothing complete
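To make the byte[] and composite key points concrete, here is a minimal sketch of hand-rolling a composite row key in plain Java. The class name, the (id, timestamp) layout, and the zero-byte separator are assumptions for illustration only; nothing here comes from CDAP, Orderly, or HBase types.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: packs (id, timestamp) into one byte[] row key.
final class CompositeKey {
  // Encodes the pair so that unsigned byte-wise comparison (the order HBase
  // uses for row keys) sorts by id bytes first, then by timestamp.
  static byte[] encode(String id, long timestampMillis) {
    byte[] idBytes = id.getBytes(StandardCharsets.UTF_8);
    ByteBuffer buf = ByteBuffer.allocate(idBytes.length + 1 + Long.BYTES);
    buf.put(idBytes);
    buf.put((byte) 0); // separator; assumes ids never contain a 0 byte
    // Flip the sign bit so negative timestamps sort before positive ones.
    buf.putLong(timestampMillis ^ Long.MIN_VALUE);
    return buf.array();
  }

  // Recovers the timestamp from the last 8 bytes of the key.
  static long decodeTimestamp(byte[] key) {
    return ByteBuffer.wrap(key, key.length - Long.BYTES, Long.BYTES).getLong() ^ Long.MIN_VALUE;
  }

  // Unsigned lexicographic comparison of two keys (Java 9+).
  static int compare(byte[] a, byte[] b) {
    return Arrays.compareUnsigned(a, b);
  }
}
```

Even this small case needs a separator convention and a sign-bit trick, which is the kind of detail each application otherwise re-implements.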

4 Cask Data Application Platform
- CDAP is a scale-out application platform for Hadoop and HBase
  - Enables easy scaling of application components
  - Combines real-time and batch data processing
  - Abstracts data storage details with Datasets
- Built from experience by Hadoop and HBase users and contributors
- CDAP and the components it builds on are open source (Apache License v2.0):
  - Tephra, a scalable transaction engine for HBase and Hadoop
  - Apache Twill, which makes writing distributed apps on YARN as simple as running threads

5 CDAP Datasets
- Encapsulate a data access pattern in a reusable, domain-specific API
- Establish best practices in schema definition
- Abstract away the underlying storage platform
[Diagram: App 1 and App 2 access HBase Table 1 and Table 2 through a Dataset]

6 How CDAP Datasets Help
- Reusable as data storage templates
- Easy sharing of stored data:
  - Between applications
  - Between batch and real-time processing
- Integrated testing
- Extensible to create your own solutions
- Leverage common services to ease development:
  - Transactions
  - Readless increments

7 Reusing Datasets
- CDAP integrates lifecycle management
- Centralized metadata provides key configuration
- Transparent integration with other systems:
  - Hive metastore
  - MapReduce Input/OutputFormats
  - Spark RDDs

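8 Reusing Datasets

    /**
     * Counter Flowlet.
     */
    public class Counter extends AbstractFlowlet {
      // Injected by CDAP. The annotation on this field (@UseDataSet in CDAP)
      // and its dataset name are not preserved in this transcript.
      private KeyValueTable wordCountsTable;
    }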

9 Testing Datasets
- CDAP provides three storage backends: in-memory, local (standalone), distributed

    Dataset APIs
      In-Memory          Local          Distributed
      NavigableMap       LevelDB        HBase
      Local temp files   Local files    HDFS

- Development lifecycle: in-memory, then local, then distributed

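10 Extending Datasets
- Existing datasets can be used as building blocks for your own
- The @EmbeddedDataset annotation injects the wrapped instance into custom code
- Operations are seamlessly wrapped in the same transaction, no need to re-implement

    public class UniqueCountTable extends AbstractDataset {
      // The constructor's parameter list is garbled in this transcript; the
      // parameter names below are a best-effort reading, and the body is elided.
      public UniqueCountTable(DatasetSpecification spec,
                              Table uniqueCountTable,
                              Table entryCountTable) {
      }
    }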

11 HBase Data Patterns & Datasets
- Secondary indexes
- Object-mapping
- Timeseries
- Data cube

12 Secondary Indexing
- Example use case: entity storage, e.g. store customer records indexed by location
- HBase sorts data by row key
- Retrieving by a secondary value means storing a reference in another table
- Two types: global and local
  - Global: efficient reads, but updates can be inconsistent
  - Local: updates can be made consistent, but reads require contacting all servers
- IndexedTable Dataset performs global indexing
  - Uses two tables: data table, index table
  - Uses global transactions to keep updates consistent
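The two-table layout of a global index can be sketched with two sorted maps standing in for the data table and the index table. This is an in-memory illustration under assumed names (IndexedStore, put, readByIndex), not the IndexedTable API, and it has no transactions: a failure between the two map updates would leave them inconsistent, which is the gap the slide says global transactions close.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a data table plus a global index table.
final class IndexedStore {
  // data table: row key -> indexed value (e.g. customer id -> location)
  private final TreeMap<String, String> data = new TreeMap<>();
  // index table: value + '\0' + row key -> row key, sorted by value first
  private final TreeMap<String, String> index = new TreeMap<>();

  void put(String rowKey, String value) {
    String old = data.put(rowKey, value);
    if (old != null) {
      index.remove(old + '\0' + rowKey); // drop the stale index entry
    }
    index.put(value + '\0' + rowKey, rowKey);
  }

  // One range scan over the index table returns all row keys for a value.
  List<String> readByIndex(String value) {
    String from = value + '\0';      // inclusive lower bound
    String to = value + '\u0001';    // exclusive upper bound
    return new ArrayList<>(index.subMap(from, to).values());
  }
}
```

For example, after storing customers c1 and c3 under "NYC", `readByIndex("NYC")` scans a single contiguous range of the index instead of every data row.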

13 Object-Mapping
- Example use case: entity storage, e.g. easily store User instances for user profiles
- Easy serialization / deserialization of Java objects (think Hibernate)
  - Maps property fields to HBase columns
- No defined schemas in HBase
  - Accessing data by other means requires knowledge of the object structure
- ObjectMappedTable Dataset: automatically persists object properties as columns in HBase
- Metadata managed by CDAP
  - Stores the object's schema
  - Automatically registers a table definition in the Hive metastore with the same schema
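The idea of mapping property fields to columns can be illustrated with plain reflection. This is only a sketch: the class names (ObjectMapperSketch, UserProfile) are made up, values are stored as strings rather than typed bytes, and it says nothing about how ObjectMappedTable actually serializes objects or records their schema.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical entity used for illustration.
final class UserProfile {
  final String name;
  final int age;

  UserProfile(String name, int age) {
    this.name = name;
    this.age = age;
  }
}

// Hypothetical mapper: one column per instance field, named after the field.
final class ObjectMapperSketch {
  static Map<String, String> toColumns(Object obj) {
    Map<String, String> columns = new TreeMap<>();
    for (Field f : obj.getClass().getDeclaredFields()) {
      if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) {
        continue; // only real instance fields become columns
      }
      try {
        f.setAccessible(true);
        columns.put(f.getName(), String.valueOf(f.get(obj)));
      } catch (IllegalAccessException e) {
        throw new IllegalStateException("Cannot read field " + f.getName(), e);
      }
    }
    return columns;
  }
}
```

A reader that does not have the UserProfile class still needs to know the column names and types, which is why the slide stresses storing the schema as metadata.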

14 Timeseries Data
- Example use case: any data organized around a time dimension
  - System metrics
  - Stock ticker data
  - Sensor data, e.g. smart meters
- Constructing keys to avoid hotspotting and support efficient retrieval can be tricky
- TimeseriesTable Dataset: for each data key, stores a set of (timestamp, value) records
  - Each stored value may have a set of tags used to filter results
  - Each row represents a time bucket; individual values in that bucket are stored as columns
  - When reading data, projects entries back into a simple Iterator for easy consumption
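The row-per-time-bucket layout can be sketched in memory as follows. The one-hour bucket size, the key format, and the class and method names are assumptions; tags and hotspot avoidance are left out, timestamps are assumed non-negative, and the result is returned as a list rather than an Iterator for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model: one row per (data key, time bucket),
// one column per timestamp inside that bucket.
final class TimeseriesSketch {
  static final long BUCKET_MILLIS = 3_600_000L; // assumed bucket size: one hour

  private final Map<String, TreeMap<Long, Double>> rows = new HashMap<>();

  private static String rowKey(String dataKey, long bucket) {
    return dataKey + ":" + bucket;
  }

  void write(String dataKey, long ts, double value) {
    String key = rowKey(dataKey, ts / BUCKET_MILLIS);
    rows.computeIfAbsent(key, k -> new TreeMap<>()).put(ts, value);
  }

  // Visits only the buckets that overlap [startTs, endTs] and flattens the
  // matching columns back into a simple time-ordered list.
  List<Double> read(String dataKey, long startTs, long endTs) {
    List<Double> out = new ArrayList<>();
    if (startTs > endTs) {
      return out;
    }
    for (long b = startTs / BUCKET_MILLIS; b <= endTs / BUCKET_MILLIS; b++) {
      TreeMap<Long, Double> row = rows.get(rowKey(dataKey, b));
      if (row != null) {
        out.addAll(row.subMap(startTs, true, endTs, true).values());
      }
    }
    return out;
  }
}
```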

15 Data Cube
- Example use case: retail product sales reports, web analytics
- Stores fact entries, with aggregated values along configured combinations of the fact dimensions
- Pre-aggregation is necessary for efficient retrieval
  - HBase increments can be costly in a write-heavy workload
  - Querying requires knowledge of the pre-aggregation structure
- Reconfiguration can be difficult
  - Need metadata around configuration
- Cube Dataset: uses readless increments for efficient aggregation
  - Transactions keep pre-aggregations consistent
  - Dataset framework manages metadata
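Pre-aggregation along configured dimension combinations can be sketched as one counter update per configured combination for every incoming fact. The names (CubeSketch, addFact, query) and the string cell key format are illustrative assumptions, not the Cube Dataset API; each `merge` call stands in for what would be an increment against storage.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical in-memory cube: sums a measure per configured dimension combination.
final class CubeSketch {
  private final List<List<String>> aggregations; // configured dimension combinations
  private final Map<String, Long> cells = new HashMap<>();

  CubeSketch(List<List<String>> aggregations) {
    this.aggregations = aggregations;
  }

  // Each fact updates one pre-aggregated cell per configured combination.
  void addFact(Map<String, String> dimensions, long measure) {
    for (List<String> agg : aggregations) {
      cells.merge(cellKey(agg, dimensions), measure, Long::sum);
    }
  }

  // Queries must name a configured combination; anything else finds no cell.
  long query(List<String> agg, Map<String, String> dimensions) {
    return cells.getOrDefault(cellKey(agg, dimensions), 0L);
  }

  private static String cellKey(List<String> agg, Map<String, String> dims) {
    StringJoiner key = new StringJoiner("|");
    for (String d : agg) {
      key.add(d + "=" + dims.get(d));
    }
    return key.toString();
  }
}
```

Because every fact fans out into several increments, the write path is increment-heavy, which is where the readless increments described later in the deck come in.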

16 Transactions
- Provided by Tephra, an open-source, distributed, scalable transaction engine designed for HBase and Hadoop
- Each transaction is assigned a time-based, globally unique transaction ID
- Transaction =
  - Write pointer: timestamp for HBase writes
  - Read pointer: upper-bound timestamp for reads
  - Excludes: list of timestamps to exclude from reads
- HBase cell versions provide MVCC for snapshot isolation
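The three parts of a transaction translate into a simple visibility rule for cell timestamps. The sketch below uses an assumed class name (TxSnapshot) and is not Tephra's Transaction class; treating a transaction's own writes as visible is also an assumption made here for illustration.

```java
import java.util.Set;

// Hypothetical model of the transaction state described above.
final class TxSnapshot {
  final long writePointer;   // timestamp used for this transaction's writes
  final long readPointer;    // upper bound on timestamps visible to reads
  final Set<Long> excludes;  // timestamps at or below readPointer to skip

  TxSnapshot(long writePointer, long readPointer, Set<Long> excludes) {
    this.writePointer = writePointer;
    this.readPointer = readPointer;
    this.excludes = excludes;
  }

  // A cell version is visible if it is this transaction's own write
  // (assumption), or if it falls at or below the read pointer and is
  // not in the exclude list.
  boolean isVisible(long cellTimestamp) {
    if (cellTimestamp == writePointer) {
      return true;
    }
    return cellTimestamp <= readPointer && !excludes.contains(cellTimestamp);
  }
}
```

Since each cell version carries the write pointer of the transaction that wrote it, filtering versions this way gives every transaction a consistent snapshot.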

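17 Tephra Architecture
[Diagram: clients start and commit transactions through the active Tx Manager, with a second Tx Manager on standby; HBase clients read from and write to region servers RS 1 and RS 2, each running a Tx coprocessor (Tx CP)]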

18 Transactional Writes
- Client sets the write pointer (transaction ID) as the timestamp on all writes
- Client maintains a set of change coordinates (row-level or column-level granularity, depending on needs)
- On commit, the client sends the change set to the Transaction Manager
  - If it overlaps with the change sets of any commits since the transaction started, the manager returns failure
- On commit failure, the client attempts to roll back any persisted changes
- Deletes use special markers instead of HBase deletes
  - HBase deletes cannot be rolled back
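The commit-time conflict check can be sketched as an overlap test between the committing change set and the change sets committed since the transaction started. This is a simplified, single-process illustration with assumed names (TxManagerSketch, start, commit); it uses a plain counter rather than time-based IDs and never prunes old change sets.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical transaction manager that only does conflict detection.
final class TxManagerSketch {
  private long nextId = 1; // counter stands in for time-based ids
  // commit id -> change coordinates (e.g. row keys) written by that commit
  private final Map<Long, Set<String>> committed = new HashMap<>();

  synchronized long start() {
    return nextId++;
  }

  // Returns false if any commit made after this transaction started touched
  // one of the same coordinates; the caller would then roll back its writes.
  synchronized boolean commit(long txId, Set<String> changeSet) {
    for (Map.Entry<Long, Set<String>> e : committed.entrySet()) {
      boolean committedAfterStart = e.getKey() > txId;
      if (committedAfterStart && !Collections.disjoint(e.getValue(), changeSet)) {
        return false;
      }
    }
    committed.put(nextId++, new HashSet<>(changeSet));
    return true;
  }
}
```

Whether a coordinate is a row key or a row plus column decides how often unrelated writers collide, which is the granularity trade-off the slide mentions.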

19 Transactional Reads
- The TransactionAwareHTable client sets the transaction state as an attribute on all read operations (Get, Scan)
- The TransactionProcessor RegionObserver translates the transaction state into request properties:
  - max versions
  - time range
  - TransactionVisibilityFilter, which excludes cells from:
    - Invalid transactions (failed but not cleaned up)
    - In-progress transactions
    - Delete markers
    - TTL'd cells

20 Increment Performance
- HBase increments perform a read-modify-write cycle
- This happens server-side, but the read operation still incurs overhead
- The read cost is unnecessary if we don't care about the return value
- Not a great fit for write-heavy workloads

21 HBase Increments
Example: word count on "Hello, hello, world", counting the 2nd "hello"
Table columns: row:col, timestamp, value (row:col is hello:count)
  1. read: value = 1
  2. modify: value += 1
  3. write: value = 2

22 Readless Increments
- Readless increments store individual increment values for each write
- Mark the cell value as an Increment instead of a normal Put
- Increment values are summed up on read

23 Readless Increments
Example: word count on "Hello, hello, world", counting the 2nd "hello"
Table columns: row:col, timestamp, value (row:col is hello:count)
  write: increment =

24 Readless Increments
Example: word count on "Hello, hello, world", reading the current count for "hello"
Table columns: row:col, timestamp, value (row:col is hello:count)
  read: = 2 (total value)

25 Readless Increments
- Increments become simple writes
- Good for write-heavy workloads (many uses of increments)
- Reads incur extra cost from reading all versions up to the latest full sum
- An HBase RegionObserver merges increments on flush and compaction
  - Limits the cost of coalesce-on-read
- Work well with transactions: increments do not conflict!
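The write, read, and merge behavior of readless increments can be modeled for a single cell as a list of versions. The class and method names are assumptions, and the `compact` method only imitates what the slides describe the RegionObserver doing on flush and compaction; this is not the CDAP coprocessor code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one cell's version history, oldest version first.
final class ReadlessCounter {
  private static final class Cell {
    final long value;
    final boolean isDelta; // true: increment marker, false: full value (Put)

    Cell(long value, boolean isDelta) {
      this.value = value;
      this.isDelta = isDelta;
    }
  }

  private final List<Cell> versions = new ArrayList<>();

  void put(long value) {
    versions.add(new Cell(value, false));
  }

  // An increment is a plain write of the delta; nothing is read first.
  void increment(long delta) {
    versions.add(new Cell(delta, true));
  }

  // Sums deltas from newest to oldest, stopping at the latest full value.
  long read() {
    long sum = 0;
    for (int i = versions.size() - 1; i >= 0; i--) {
      Cell c = versions.get(i);
      sum += c.value;
      if (!c.isDelta) {
        break;
      }
    }
    return sum;
  }

  // Collapses all versions into one full value to keep later reads cheap.
  void compact() {
    long total = read();
    versions.clear();
    versions.add(new Cell(total, false));
  }

  int versionCount() {
    return versions.size();
  }
}
```

The cost moves from the write path to the read path, and the merge step bounds how many versions a read has to sum.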

26 Want to Learn More?
- Open-source (Apache License v2)
- Website:
- Mailing List:

27 QUESTIONS? Want to work on these and other challenges?
