Big Data and Scripting Systems built on top of Hadoop

Transcription

1. Big Data and Scripting Systems built on top of Hadoop

2. Pig/Latin
- high-level map reduce programming platform
- interactive execution of map reduce jobs
- Pig is the name of the system
- Pig Latin is the provided programming language
- Pig Latin is similar to query languages like SQL
  - still procedural, not declarative
  - commands describe actions to execute, not the desired result
- can be extended using various languages
- developed at Yahoo, moved to the Apache Software Foundation in 2007
- top-level project (pig.apache.org)

3. Pig/Latin
- aim: provide means to quickly express algorithms that can be run on a Hadoop cluster
- simple, easy-to-learn language
- enables rapid prototyping of map reduce applications
- runs on local or distributed Hadoop
- interactive or batch mode
- commands from a script or the command line are translated into Hadoop jobs
- jobs are executed on the Hadoop system

4. an example session
    bds]$ pig -x local
    ...
    A = load '/etc/passwd' using PigStorage(':');
    dump A;
    ...
- start in local mode (access to local file system)
- assign file content to variable A
- dump file content to console
- load does not result in an actual action
- lazy evaluation: content of the file is only loaded when necessary for execution
- dump A; starts a map reduce job that reads and prints the file content
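The lazy evaluation in the session above can be imitated in plain Java. This is only a sketch of the idea, not how Pig is implemented: the class LazyRelationSketch, its methods, and the counter loadsExecuted are hypothetical names, and an in-memory list stands in for the file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a Pig relation: it stores HOW to produce rows, not the rows.
class LazyRelationSketch {
    static int loadsExecuted = 0; // counts how often the data was actually read
    private final Supplier<List<String[]>> plan;

    private LazyRelationSketch(Supplier<List<String[]>> plan) {
        this.plan = plan;
    }

    // Analogue of LOAD: only records the plan, nothing is read yet.
    static LazyRelationSketch load(List<String> lines, String delimiter) {
        return new LazyRelationSketch(() -> {
            loadsExecuted++;
            List<String[]> rows = new ArrayList<>();
            for (String line : lines) {
                rows.add(line.split(delimiter));
            }
            return rows;
        });
    }

    // Analogue of DUMP: forces execution of the plan and formats each row.
    List<String> dump() {
        List<String> out = new ArrayList<>();
        for (String[] row : plan.get()) {
            out.add("(" + String.join(",", row) + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        LazyRelationSketch a = load(List.of("root:x:0:0", "bds:x:1000:1000"), ":");
        System.out.println("reads after load: " + loadsExecuted);
        a.dump().forEach(System.out::println);
        System.out.println("reads after dump: " + loadsExecuted);
    }
}
```

The read counter stays at zero until dump() is called, which mirrors the point that load alone triggers no job.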

5. concepts in Pig Latin
- commands (operators) act on relations
- a relation is typically a CSV file from HDFS
- example: assign a relation with named fields to variable A
    A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
- further operators then transform relations into other relations
- example: group items by field age
    B = GROUP A BY age;
- again, nothing is executed until needed
- relations can have schemas (column names and types)
- schemas can be used to ensure type safety
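What GROUP ... BY produces can be pictured in plain Java: one entry per key, each holding the bag of tuples with that key. A minimal sketch, assuming Java 16+ for records; the class name and the sample students are invented for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class GroupByAgeSketch {
    // Mirrors the schema (name:chararray, age:int, gpa:float) of the LOAD example.
    record Student(String name, int age, float gpa) {}

    // Rough analogue of "B = GROUP A BY age;": age -> bag of matching tuples.
    static Map<Integer, List<Student>> groupByAge(List<Student> relation) {
        return relation.stream()
                .collect(Collectors.groupingBy(Student::age, TreeMap::new, Collectors.toList()));
    }

    public static void main(String[] args) {
        // Invented sample data.
        List<Student> a = List.of(
                new Student("John", 18, 4.0f),
                new Student("Mary", 19, 3.8f),
                new Student("Bill", 18, 3.9f));
        groupByAge(a).forEach((age, bag) -> System.out.println(age + " -> " + bag));
    }
}
```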

6. overview
- use LOAD to assign relations from files to variables
- modify relations with the various operators (FILTER, GROUP, UNION, JOIN)
- store results using STORE
- Pig statements can be embedded in a number of languages
- example in JavaScript:
    importPackage(Packages.org.apache.pig.scripting.js)
    Pig = org.apache.pig.scripting.js.JSPig
    function main() {
        input = "original"
        output = "output"
        P = Pig.compile("A = load '$in'; store A into '$out';")
        result = P.bind({'in':input, 'out':output}).runSingle()
        if (result.isSuccessful()) {...

7. extending Pig Latin
- new functions can be added by implementing them in various languages: Java, Python, JavaScript, Ruby
- most extensive support is provided in Java:
  - extend class org.apache.pig.EvalFunc
  - register in Pig: REGISTER myudfs.jar;
- Java programs can be run natively on Hadoop
- special interfaces allow more efficient integration of special types of functions

8. aggregation interfaces
- generic (EvalFunc)
  - gets one input tuple at a time
- Algebraic: incremental aggregation of values
  - can be executed distributed
  - must be able to compute its own intermediate results
  - produces intermediate and final results
- Accumulator
  - gets all tuples belonging to one key
  - produces one (final) output value for all tuples of one key
- FilterFunc
  - decides for each input tuple whether it passes
- specified Java types relate to information from the relation schema
- implemented functions can be overloaded, providing fitting implementations for different types
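The Algebraic idea (initial, intermediate and final results) is easiest to see with an average, which can be computed from mergeable (sum, count) pairs. The sketch below is plain Java with hypothetical names and does not use Pig's actual interfaces; it only shows why such a function can run distributed.

```java
import java.util.List;

class AlgebraicAverageSketch {
    // Intermediate result: small, and mergeable with other intermediate results.
    record Partial(double sum, long count) {}

    // Initial step: turn one chunk of raw values into a partial result.
    static Partial initial(List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return new Partial(sum, values.size());
    }

    // Intermediate step: merge partial results coming from different chunks.
    static Partial intermediate(List<Partial> partials) {
        double sum = 0;
        long count = 0;
        for (Partial p : partials) {
            sum += p.sum();
            count += p.count();
        }
        return new Partial(sum, count);
    }

    // Final step: derive the output value from the merged partial result.
    static double finalValue(Partial p) {
        return p.count() == 0 ? Double.NaN : p.sum() / p.count();
    }

    public static void main(String[] args) {
        Partial chunk1 = initial(List.of(1.0, 2.0));
        Partial chunk2 = initial(List.of(3.0));
        System.out.println(finalValue(intermediate(List.of(chunk1, chunk2))));
    }
}
```

A median, by contrast, has no such small mergeable intermediate result, so it would need all tuples of a key in one place.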

9. Apache Hive
- top-level Apache project
- built on top of Hadoop
- distributed data warehouse allowing queries and transformations
- uses various file systems as backend (HDFS, Amazon S3 fs, ...)
- SQL-like query language HiveQL
- execution by translation into map reduce jobs
- indexing to accelerate queries
- command line interface Hive CLI
- originally developed by Facebook, turned into an open source project

10. HiveQL
- create a table with two columns:
    hive> create table student (sid string, sname string)
        > ROW FORMAT DELIMITED
        > FIELDS TERMINATED BY ',';
- tables correspond to directories in the underlying file system, stored as CSV files
- load some data:
    hive> LOAD DATA INPATH '/tmp/students.txt' INTO TABLE student;
- imports the file content into Hive's storage
- dropping the table deletes data and index from Hive storage, does not affect external data
    select * from student;

11. HiveQL
- multiple tables can be joined
- only equality joins
    SELECT * FROM student JOIN scores ON (student.sid = scores.sid);
- many standard SQL statements available, e.g. GROUP BY
    INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    FROM pv_users
    GROUP BY pv_users.gender;
- grouping and aggregation
- write result into new table
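To make the semantics of the GROUP BY query with count(DISTINCT ...) concrete, here is a single-machine Java sketch of the same computation. It says nothing about how Hive plans the map reduce jobs; the class name and sample rows are invented, and records assume Java 16+.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class CountDistinctSketch {
    // Only the two columns of pv_users that the query touches.
    record PageView(String gender, String userid) {}

    // Analogue of: SELECT gender, count(DISTINCT userid) FROM pv_users GROUP BY gender
    static Map<String, Integer> distinctUsersByGender(List<PageView> pvUsers) {
        Map<String, Set<String>> seen = new TreeMap<>();
        for (PageView pv : pvUsers) {
            seen.computeIfAbsent(pv.gender(), g -> new HashSet<>()).add(pv.userid());
        }
        Map<String, Integer> result = new TreeMap<>();
        seen.forEach((gender, users) -> result.put(gender, users.size()));
        return result;
    }

    public static void main(String[] args) {
        // Invented sample rows; user u1 appears twice and is counted once.
        List<PageView> rows = List.of(
                new PageView("f", "u1"), new PageView("f", "u1"),
                new PageView("f", "u2"), new PageView("m", "u3"));
        System.out.println(distinctUsersByGender(rows));
    }
}
```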

12. indexing
- indexes allow speed-up of certain operations
- indexes are stored independently of the actual data, in an RDBMS
- example:
    CREATE INDEX index_name ON TABLE base_table_name (col_name,...) AS index_type...
- Hive allows plugin indexes
  - implement create, rebuild, drop in interface HiveIndexHandler (Java)

13. data organization
- databases
- tables: correspond to (top-level) directories
- partitions: subdirectories of the table directory
  - tables can be partitioned by some column
    CREATE TABLE table (col1 INT, col2 STRING) PARTITIONED BY (col3 DATE);
- buckets: further breakdowns of partitions, allow better organization with respect to map reduce
- storage of actual data in flat files
  - arbitrary formats can be used, description via regular expressions
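The mapping from a row to its place in the directory tree can be sketched as follows. The partition directory name follows the column=value pattern; the bucket file name and the use of String.hashCode() are simplifying assumptions for illustration, not Hive's exact naming or hash function, and the warehouse path is invented.

```java
class PartitionPathSketch {
    // Hypothetical layout: <warehouse>/<table>/<partitionColumn>=<value>/bucket_<n>
    static String pathFor(String warehouse, String table,
                          String partitionColumn, String partitionValue,
                          String bucketKey, int numBuckets) {
        // floorMod keeps the bucket number non-negative even for negative hash codes.
        int bucket = Math.floorMod(bucketKey.hashCode(), numBuckets);
        return warehouse + "/" + table + "/"
                + partitionColumn + "=" + partitionValue + "/bucket_" + bucket;
    }

    public static void main(String[] args) {
        // All rows with the same col3 value land in the same partition directory,
        // so a query restricted to one date only has to read that directory.
        System.out.println(pathFor("/warehouse", "table", "col3", "2014-01-01", "a", 4));
    }
}
```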

14. Hive summary
- brings together SQL functionality and the scaling features of Hadoop
- subset of table operations specified by SQL
- no low-latency queries, optimized for scalability
- storage in flat files on a distributed file system
- querying/processing by translation into map reduce jobs on Hadoop
- extending storage by indexing in an RDBMS
- due to distributed storage: no individual updates
- appending to tables not (yet) possible

15. HBase
- sparse, distributed, persistent multidimensional sorted map
- implementation of the Bigtable [1] idea (Google)
- uses HDFS and Hadoop, ZooKeeper for storage and execution
- implements servers for administration and storage/computation
- mapping keys to values
  - keys are structured, values are arbitrary
- implements random read/write access on top of HDFS
- provides consistency
- accessible via shell or Java API
[1] research.google.com/archive/bigtable.html

16. structure of HBase
- keys are stored sorted, allowing range queries
- keys are highly structured into:
  - rowkey
  - column family
  - column
  - timestamp
- tables are stored sparsely
  - missing values are not encoded but simply not stored
  - every value has to be stored with its full address
- data is distributed, load balancing is automated
- consistency is guaranteed on row-key level:
  - all changes within one rowkey are atomic
  - all data of one rowkey is stored on a single machine
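Sorted keys and sparse storage can be illustrated with an ordinary Java TreeMap. This is a toy model, not HBase's storage format: the key encoding rowkey/family:column and all names are assumptions made for the example, and timestamps are left out.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

class SortedKeyScanSketch {
    // Hypothetical flattening of the structured key into one sortable string.
    static String key(String row, String family, String column) {
        return row + "/" + family + ":" + column;
    }

    // Range query: all cells of rows in [startRow, stopRow), cheap because keys are sorted.
    static NavigableMap<String, String> scan(NavigableMap<String, String> table,
                                             String startRow, String stopRow) {
        return table.subMap(startRow, true, stopRow, false);
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        // Sparse storage: row2 simply has no "info:age" entry, nothing encodes the gap.
        table.put(key("row1", "info", "age"), "31");
        table.put(key("row1", "info", "name"), "ann");
        table.put(key("row2", "info", "name"), "bob");
        table.put(key("row3", "info", "name"), "cy");
        scan(table, "row1", "row3").forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```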

17. storage and access
- data is partitioned by keys
- column families define storage properties
- columns are only labels for the corresponding values
- principal operations:
  - put: insert/update value
  - delete: delete value
  - get: retrieve single value
  - scan: retrieve collection of values, sequential reading

18. timestamp
- each value is stored together with a timestamp, the version
- there can be arbitrarily many values in the same row/column family/column, differing only by their version
- get/scan can retrieve more than one version of a value
- put uses the current time as version by default, can be changed to another timestamp
- it is possible to delete only certain versions (e.g. the old ones) of a value
- column families can have a time-to-live value, versions older than this will be deleted automatically
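A single cell with several versions can be modelled as a map from timestamp to value, sorted newest first. The sketch below uses hypothetical names and plain Java collections; it mimics the described behaviour of get, selective delete and time-to-live, not the HBase client API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class VersionedCellSketch {
    // timestamp -> value, iterated from newest to oldest.
    private final NavigableMap<Long, String> versions = new TreeMap<>(Comparator.reverseOrder());

    // put with an explicit timestamp as version.
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // get: return up to maxVersions values, newest first.
    List<String> get(int maxVersions) {
        List<String> out = new ArrayList<>();
        for (String value : versions.values()) {
            if (out.size() == maxVersions) {
                break;
            }
            out.add(value);
        }
        return out;
    }

    // delete only the versions with a timestamp at or before the given one.
    void deleteUpTo(long timestamp) {
        versions.tailMap(timestamp, true).clear();
    }

    // time-to-live: drop versions strictly older than now - ttl.
    void expire(long now, long ttl) {
        versions.tailMap(now - ttl, false).clear();
    }

    public static void main(String[] args) {
        VersionedCellSketch cell = new VersionedCellSketch();
        cell.put(100, "a");
        cell.put(200, "b");
        cell.put(300, "c");
        System.out.println(cell.get(2));
        cell.deleteUpTo(100);
        cell.expire(350, 100);
        System.out.println(cell.get(10));
    }
}
```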

19. coprocessors
- HBase provides hooking into events
- coprocessors run on the server side
- are triggered by all kinds of specific events
  - reading/scanning
  - writing/deletion
  - before start and after completion of individual operations
- possible use cases:
  - secondary indexes
  - reference integrity
  - aggregating views
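The secondary-index use case can be sketched with a pair of hooks that run before and after each write. Everything here is a hypothetical in-memory model (class, interface and method names are invented), meant only to show the hook pattern rather than HBase's coprocessor API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class CoprocessorHookSketch {
    // Hypothetical observer with one hook before and one after each put.
    interface PutObserver {
        default void prePut(String row, String oldValue) {}
        default void postPut(String row, String newValue) {}
    }

    private final Map<String, String> table = new HashMap<>(); // rowkey -> value
    private final List<PutObserver> observers = new ArrayList<>();
    final Map<String, Set<String>> rowsByValue = new TreeMap<>(); // secondary index

    CoprocessorHookSketch() {
        // Observer that keeps the secondary index in sync with the table.
        observers.add(new PutObserver() {
            @Override
            public void prePut(String row, String oldValue) {
                if (oldValue != null) {
                    Set<String> rows = rowsByValue.get(oldValue);
                    rows.remove(row);
                    if (rows.isEmpty()) {
                        rowsByValue.remove(oldValue);
                    }
                }
            }

            @Override
            public void postPut(String row, String newValue) {
                rowsByValue.computeIfAbsent(newValue, v -> new TreeSet<>()).add(row);
            }
        });
    }

    void put(String row, String value) {
        String old = table.get(row);
        for (PutObserver o : observers) {
            o.prePut(row, old);
        }
        table.put(row, value);
        for (PutObserver o : observers) {
            o.postPut(row, value);
        }
    }

    public static void main(String[] args) {
        CoprocessorHookSketch t = new CoprocessorHookSketch();
        t.put("r1", "red");
        t.put("r2", "red");
        t.put("r1", "blue");
        System.out.println(t.rowsByValue);
    }
}
```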

20. guarantees in HBase
- atomicity
  - mutations are atomic within a row
  - operation result is reported
  - not atomic over multiple rows (parts may fail, others succeed)
- consistency and isolation
  - returned rows consist of complete rows that existed at some point in time
  - contained data may have changed in between; the data returned refers to a single point in the past
  - scans are not consistent over multiple rows
  - different rows may refer to different points in time

21. guarantees in HBase
- visibility
  - after successful writing, data is immediately visible to all clients
  - versions of rows strictly increase
- durability
  - refers to data being stored on disk
  - data that has been read from a cell is guaranteed to be durable
  - successful operations are durable, failed operations are not
- visibility and durability may be tuned for performance
  - individual reads without visibility guarantees
  - instead of durability only periodic writing of data

22. what HBase can and can't do
- joins: no
  - not supported natively
  - joins have to be implemented on the application level
- select: yes
  - it is possible to create a scan reading the contents of a complete table or a subset of columns
  - scans can be provided with a filter that selects subsets of rows/columns
  - filters are executed on the server side
- group by/aggregate: not yet
  - is a planned feature
  - workarounds using Hive (external table) or map reduce

23. comparison
- Pig Latin
  - allows viewing data as tables
  - provides ad hoc queries
  - extendable to arbitrary map reduce jobs
- Hive
  - tries to provide SQL functionality
  - slow, large-scale queries
  - structured query language, query planner
- HBase
  - more like a NoSQL database or key/value store
  - no SQL operations, only storage and retrieval
  - guarantees for operations
  - optimized for random, real-time access
- note: Pig and Hive can access data from HBase directly
- note: Cassandra is a database system similar to HBase, optimized for security

24. Mahout
- implementations of data mining/learning algorithms
- top-level Apache project
- provides a library for easy access to machine learning implementations
- provides algorithms for the most common problems:
  - clustering
  - classification
  - frequent pattern mining, ...
- optimized for practical (e.g. business) usage
- language: Java

25. some implemented algorithms (examples)
- classification
  - logistic regression
  - Bayesian classification
  - random forests
  - hidden Markov models
- clustering
  - (fuzzy) k-means
  - EM clustering
  - latent Dirichlet allocation
  - spectral clustering
- frequent item set mining
  - Parallel FP Growth
- spectral decomposition
  - (approximated) singular value decomposition
- recommenders/collaborative filtering
  - item-based
  - matrix factorization

26. usage, parameters, environment
- provides a large set of Java classes
  - these can be used directly in other applications
  - some (most) are implemented as map reduce jobs
- implementations can be adapted to specific problems
  - provide individual I/O classes
  - individual similarity/distance functions, ...
- most algorithms can be started from the command line
- some provide standalone servers for integration into other systems
- runs on a single machine or distributed (on Hadoop)
- integration of Apache Lucene (document search engine)

27. example [2]
- a recommendation engine can be specified by
  - configuring a standard framework
  - applying it to data
- configure classes implementing Java interfaces:
  - DataModel: reading information from storage in files
  - UserSimilarity: define similarity function between users
  - ItemSimilarity: define similarity function between items
  - Recommender: the actual recommender implementation
  - UserNeighborhood: compute neighborhood of similar users
- Mahout provides off-the-shelf implementations for all of these
- can be extended by individual implementations
- uses library Taste
[2] source:

28. configuring the parts
    // Create the data model
    FileDataModel dataModel = new FileDataModel(new File(recsFile));
    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
    // Optional:
    userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer(dataModel));
    // Get a neighborhood of users
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(neighborhoodSize, userSimilarity, dataModel);
- data model reads directly from file
- use Pearson correlation as user similarity
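For reference, the Pearson correlation used as user similarity can be computed by hand as below. This is a textbook sketch over two aligned rating arrays, with invented names and sample ratings; it is not Mahout's PearsonCorrelationSimilarity, which additionally has to deal with items rated by only one of the two users.

```java
class PearsonSketch {
    // x[i] and y[i] are the ratings two users gave to the same item i.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        int n = x.length;
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0;
        double varX = 0;
        double varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        double denominator = Math.sqrt(varX * varY);
        // Undefined if one user gave the same rating to every item.
        return denominator == 0 ? Double.NaN : cov / denominator;
    }

    public static void main(String[] args) {
        // Invented ratings: the second user rates consistently higher but with the same taste.
        System.out.println(pearson(new double[] {1, 2, 3}, new double[] {2, 4, 6}));
        // Invented ratings: opposite taste.
        System.out.println(pearson(new double[] {1, 2, 3}, new double[] {3, 2, 1}));
    }
}
```

Values close to 1 mean similar taste, values close to -1 opposite taste; a neighborhood of the N most similar users can then be picked by sorting on this score.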

29. generating recommendations
    // Create the recommender
    Recommender recommender =
        new GenericUserBasedRecommender(dataModel, neighborhood, userSimilarity);
    User user = dataModel.getUser(userId);
    System.out.println("User: " + user);
    // Print out the user's own preferences first
    TasteUtils.printPreferences(user, handler.map);
    // Get the top 5 recommendations
    List<RecommendedItem> recommendations = recommender.recommend(userId, 5);
    TasteUtils.printRecs(recommendations, handler.map);
- Mahout provides classes that allow running the individual computations directly on a Hadoop cluster