Where is Hadoop Going Next?



1. Where is Hadoop Going Next?
  Owen O'Malley
  November 2014

2. Who am I?
  Worked at Yahoo Search
  WebMap in a Week
  Dreadnaught to Juggernaut to Hadoop
  MapReduce
  Security
  Hive
  Apache/Open Source Champion
  PhD in Software Engr from UC Irvine

3. Topics
  Hadoop History
  Themes
    Storage
    Computation
    Security
  "A beginning is the time for taking the most delicate care that the balances are correct." - Herbert

4. What was the Problem?
  Yahoo needed to build WebMaps faster
    Whole web analysis for Yahoo Search
    WebMap in a Week
  WebMap used Dreadnaught
    Roughly like MapReduce and HDFS
    Scaled to 800 machines
    Assigned nodes in backup pairs
    Single application cluster
  Started on C++ DFS & MapReduce


5. What did Hadoop Do Right?
  Focus on a few customers
    Helped Yahoo Search analytics team
    Terasort benchmarks
  Expected Failures
    Storage corrects automatically
    Healthy in minutes instead of hours
    Nodes are automatically assigned
  No chokepoints
    Data never travels through singleton
    RAM isn't large enough


6. What did Hadoop Do Right?
  Simplified FileSystem abstraction
    No random writes
  Apache
    Many companies working together
    Open governance
  Open Source
    Many hands and eyes
    Use the source, Luke
  Open platform

7. Storage
  "The more storage you have, the more stuff you accumulate." - Stewart

8. HDFS Phases
  Single HDFS NameNode
  Cross cluster references
  Federated HDFS NameNodes
  Need HDFS Block Storage factored out
    Wider variety of applications
  Need co-location of files
    Not entire table, but sections of the table
    ACID (and HBase) base and delta files
    Correlated tables


9. File Formats Phases
  Text and Sequence File
  RCFile
  Avro
  ORC and Parquet
    Columnar formats
    Type specific encoding
    Self describing metadata at end

10. ORC
  Light-weight indexes
    Predicate pushdown
    Answers from metadata
    Seeking within file
  Available from Hive, Pig, & MapReduce
  C++ reader/writer coming
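A minimal sketch of how light-weight indexes can support predicate pushdown and answers from metadata, assuming each stripe of a column stores only its minimum and maximum value. The class `MinMaxIndexSketch` and its methods are hypothetical names chosen for illustration; this is not the ORC reader API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; this is not the ORC reader API. */
final class MinMaxIndexSketch {

    /** Min/max statistics kept for one stripe of a long column (assumed layout). */
    static final class StripeStats {
        final long min;
        final long max;

        StripeStats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /** Returns the positions of stripes that may contain a value in [lo, hi]. */
    static List<Integer> stripesToRead(List<StripeStats> index, long lo, long hi) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < index.size(); i++) {
            StripeStats s = index.get(i);
            // A stripe is skipped only when its range cannot overlap the predicate.
            if (s.max >= lo && s.min <= hi) {
                selected.add(i);
            }
        }
        return selected;
    }

    /** Answers MAX(column) from the index alone, without reading any rows. */
    static long maxFromMetadata(List<StripeStats> index) {
        long max = Long.MIN_VALUE;
        for (StripeStats s : index) {
            max = Math.max(max, s.max);
        }
        return max;
    }

    public static void main(String[] args) {
        List<StripeStats> index = new ArrayList<>();
        index.add(new StripeStats(0, 99));
        index.add(new StripeStats(100, 199));
        index.add(new StripeStats(200, 299));
        System.out.println(stripesToRead(index, 150, 160));
        System.out.println(maxFromMetadata(index));
    }
}
```

With a predicate such as `value BETWEEN 150 AND 160`, only the middle stripe has to be read; the reader can seek straight to it and skip the rest.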

11. Computation
  "A process cannot be understood by stopping it. Understanding must move with the flow of the process, must join it and flow with it." - Herbert

12. Why does Hadoop Need ACID?
  Hadoop and Hive have always worked without ACID
    Perceived as tradeoff for performance
    Add or replace entire partitions
  But, your data isn't static
    It changes daily, hourly, or faster
    Managing change makes the user's life better
  Need consistent views of changing data!


13. Use Cases
  Updating a Dimension Table
    Changing a customer's address
  Delete Old Records
    Remove records for compliance
  Update/Restate Large Fact Tables
    Fix problems after they are in the warehouse
  Streaming Data Ingest
    A continual stream of data coming in


14. Longer Term Use Cases
  Multiple statement transactions
    Group statements that need to work together
  Query tables as they appeared in past
    Configurable length of history
  Row-level lineage
    Track users and queries that updated each row


15. Design
  HDFS Does Not Allow Arbitrary Writes
    Store changes as delta files
    Stitched together by client on read
  Writes get a Transaction ID
    Sequentially assigned by Metastore
  Reads get Committed Transactions
    Provides snapshot consistency
    No locks required
    Provide a snapshot of data from start of query
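The design on this slide can be sketched roughly as follows, assuming a base file is a map from row ID to value, each delta record carries the transaction ID that wrote it, and a reader is handed the set of transactions that were committed when its query started. The names `DeltaMergeSketch`, `Change`, and `readSnapshot` are hypothetical and are not Hive's actual classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical illustration of merge-on-read; not Hive's implementation. */
final class DeltaMergeSketch {

    /** One record in a delta file; a null value marks a delete. */
    static final class Change {
        final long txnId;
        final long rowId;
        final String value;

        Change(long txnId, long rowId, String value) {
            this.txnId = txnId;
            this.rowId = rowId;
            this.value = value;
        }
    }

    /**
     * Stitches the base and the deltas together on read. Only changes from
     * transactions committed at query start are applied, so the reader sees
     * a consistent snapshot without taking any locks.
     */
    static Map<Long, String> readSnapshot(Map<Long, String> base,
                                          List<Change> deltas,
                                          Set<Long> committedAtStart) {
        Map<Long, String> view = new TreeMap<>(base); // the base is never modified
        List<Change> visible = new ArrayList<>();
        for (Change c : deltas) {
            if (committedAtStart.contains(c.txnId)) {
                visible.add(c);
            }
        }
        // Apply changes in transaction order; IDs are assigned sequentially.
        visible.sort(Comparator.comparingLong((Change c) -> c.txnId));
        for (Change c : visible) {
            if (c.value == null) {
                view.remove(c.rowId);
            } else {
                view.put(c.rowId, c.value);
            }
        }
        return view;
    }

    public static void main(String[] args) {
        Map<Long, String> base = new TreeMap<>();
        base.put(1L, "a");
        base.put(2L, "b");
        List<Change> deltas = Arrays.asList(
                new Change(5, 1, "a2"),  // update row 1
                new Change(6, 2, null),  // delete row 2
                new Change(7, 3, "c"));  // insert row 3, not yet committed
        Set<Long> committed = new HashSet<>(Arrays.asList(5L, 6L));
        System.out.println(readSnapshot(base, deltas, committed));
    }
}
```

A query that starts before transaction 7 commits never sees row 3, even if the delta record is already on disk, which is the snapshot behaviour the slide describes.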


16. Vectorization
  MapReduce's RecordReader
    boolean next(K key, V value);
  Better to process 1000 rows at a time
    Amortizes the cost of method calls
    Use primitive arrays for tight inner loops
    No access methods
  Extremely important for operator trees
    Branches (including virtual dispatch) kill pipelining
  Can run at 100M rows/second
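The contrast between row-at-a-time and batch processing might look like the sketch below. It assumes a single `long` column and a batch size of 1000; `VectorizationSketch`, `RowSource`, and the method names are hypothetical and are not Hive's vectorization classes.

```java
/** Hypothetical illustration of batch processing; not Hive's vectorization API. */
final class VectorizationSketch {
    static final int BATCH_SIZE = 1000;

    /** Row-at-a-time source in the spirit of RecordReader.next(key, value). */
    interface RowSource {
        boolean next(long[] holder); // fills holder[0]; returns false at end
    }

    /** Simple in-memory RowSource used for the comparison. */
    static final class ArrayRowSource implements RowSource {
        private final long[] data;
        private int pos;

        ArrayRowSource(long[] data) {
            this.data = data;
        }

        @Override
        public boolean next(long[] holder) {
            if (pos >= data.length) {
                return false;
            }
            holder[0] = data[pos++];
            return true;
        }
    }

    /** One interface call per row. */
    static long sumRowAtATime(RowSource source) {
        long[] holder = new long[1];
        long sum = 0;
        while (source.next(holder)) {
            sum += holder[0];
        }
        return sum;
    }

    /** Tight loop over a primitive array: no per-row calls, no accessors. */
    static long sumBatch(long[] values, int size) {
        long sum = 0;
        for (int i = 0; i < size; i++) {
            sum += values[i];
        }
        return sum;
    }

    /** One call per batch of up to BATCH_SIZE rows. */
    static long sumVectorized(long[] data) {
        long[] batch = new long[BATCH_SIZE];
        long sum = 0;
        for (int start = 0; start < data.length; start += BATCH_SIZE) {
            int size = Math.min(BATCH_SIZE, data.length - start);
            System.arraycopy(data, start, batch, 0, size);
            sum += sumBatch(batch, size);
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[2500];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println(sumRowAtATime(new ArrayRowSource(data)));
        System.out.println(sumVectorized(data));
    }
}
```

Both paths compute the same result; the batch path pays the method-call cost once per 1000 rows instead of once per row, and its inner loop has no virtual dispatch.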


17. Tez
  Replacing MapReduce as basis for Hive, Pig, Cascading
  Executes entire DAG of tasks
  More options for shuffle
  Scales up and down dynamically
  Queries scheduled as one application instead of waves of jobs


18. Hive Cost Based Optimizer
  Current optimizer is a mess of rules
    Rule interactions are complex
  Optiq provides a framework
    YACC for optimizers
  Make better choices
    Huge impact on performance
    Obsoletes lots of old advice


19. LLAP
  Live Long and Process
  Persistent Hive execution engine
    JVM startup costs are huge
    JIT cost alone is staggering
  Hot Table Data Caching
    Keep hot columns and partitions in memory
  Sub-second answers

20. Security
  "There is no such thing as perfect security, only varying levels of insecurity." - Rushdie

21. Audit and Authorization
  Three A's of security
    Authentication, Authorization, and Audit
  Phases
    No users
    Users, but no authentication
    Authorization
    Next: centralized authorization and audit
    Encryption

22. Encryption
  Underlying file system
    Thief breaks into data center
  HDFS encryption
    Parallels HDFS authorization
    Prevents AFN attacks
  Column encryption
    Encrypt just PII columns, rolling keys
  Value encryption
    No salt (weak sauce), so joins work
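The "no salt, so joins work" trade-off can be illustrated with the sketch below. It assumes AES from the standard `javax.crypto` package, with ECB mode standing in for unsalted value encryption and CBC with a random IV standing in for salted encryption. The class `ValueEncryptionSketch` and the fixed demo key are hypothetical; the slide does not say which scheme Hadoop or Hive would use.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical illustration; not a recommendation or Hadoop's actual scheme. */
final class ValueEncryptionSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Unsalted: equal plaintexts give equal ciphertexts, so equality joins work. */
    static byte[] encryptDeterministic(byte[] key, String value) {
        try {
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
            return cipher.doFinal(value.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Salted with a random IV: stronger, but equal values no longer match. */
    static byte[] encryptSalted(byte[] key, String value) {
        try {
            byte[] iv = new byte[16];
            RANDOM.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                    new IvParameterSpec(iv));
            byte[] body = cipher.doFinal(value.getBytes(StandardCharsets.UTF_8));
            // Prepend the IV so the value could be decrypted later.
            byte[] out = new byte[iv.length + body.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(body, 0, out, iv.length, body.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Demo key only (16 bytes for AES-128); real keys come from a key store.
        byte[] key = "0123456789abcdef".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(
                encryptDeterministic(key, "alice"), encryptDeterministic(key, "alice")));
        System.out.println(Arrays.equals(
                encryptSalted(key, "alice"), encryptSalted(key, "alice")));
    }
}
```

The deterministic form lets two tables be joined on the encrypted column without decrypting it, at the cost of revealing which rows share a value, which is the weakness the slide points out.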

23. Thank You!
  Questions & Answers
  Hortonworks Inc
