Media Upload and Sharing Website using HBASE

1 Media Upload and Sharing Website using HBASE Tushar Mahajan, Santosh Mukherjee, Shubham Mathur

2 Agenda Motivation for the project Introduction Summary of how we used Hadoop Why HBASE not RDBMS? Current Status Challenges Future Work

3 Motivation Facebook and StumbleUpon use HBASE.

4 Motivation Cont. "Web 2.0 is the business revolution in the computer industry caused by the move to the internet as platform and an attempt to understand the rules for success on that new platform. Chief among those rules is this: Build applications that harness network effects to get better the more people use them. (This is what I've elsewhere called 'harnessing collective intelligence.')" - Tim O'Reilly, Grand Poobah 2.0

5 Why HBASE and not RDBMS? RDBMS is powerful and ideal for small-scale use. But what if, someday, my site ranks top in Google search? How do I scale my performance? You could run several instances of MySQL on different machines.

6 But will it help?

7 No!

8 Scaling MySQL is hard; Oracle is expensive (and hard). Machine cost goes up faster than speed. You end up turning off all relational features to scale (!!), and secondary (!) indexes too. That is not the power of an RDBMS; its power is to build indexes and scale the number of rows.

9 Problem (Contd..) Tables become hard to scale at sizes as low as 500 GB. Data is hard to read at these sizes.

10 In case of schema change? What about schema changes or migrations? MySQL is not your friend here, and it only gets harder with more data.

11 HBASE Mostly schema-less. Dynamic distribution. Motivation for HBASE? Google Bigtable.

12 HBASE is an Apache open source project whose goal is to provide Bigtable-like storage for the Hadoop distributed computing environment.

13 Data Model Similar to that of Bigtable. Applications store data rows in labeled tables. A data row has a sortable row key and an arbitrary number of columns. A column name has the form <family>:<label> where <family> and <label> can be arbitrary byte arrays.

14 HBASE storage model Column-oriented database. Column names are arbitrary data, and each row can have a variable number of columns. Supports random reads and writes. Tables are split into roughly equal-sized regions. Regions split as they grow, thus dynamically adjusting to your data set.

15 Hbase Query Language (HQL) ${HBASE_HOME}/bin/hbase shell [--help] Usage: ./bin/hbase shell [--master:ip_address:port] [--html] Running the above command on the command line presents the following prompt: hql>

16 Sample Hbase Query To create a table: CREATE TABLE table_name (column_family_definition [, column_family_definition] ...) column_family_definition: column_family_name [MAX_VERSIONS=n] [MAX_LENGTH=n] [COMPRESSION=NONE|RECORD|BLOCK] [IN_MEMORY] [BLOOMFILTER=NONE|BLOOMFILTER|COUNTING_BLOOMFILTER|RETOUCHED_BLOOMFILTER VECTOR_SIZE=n NUM_HASH=n]
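As an illustration (not part of the original slides), a table for this project's uploads could be created with a statement along these lines; the table name media_uploads is hypothetical, the post column family is the one used later in the code snippet, and exact option casing may differ between HQL versions:

CREATE TABLE media_uploads (post MAX_VERSIONS=1 COMPRESSION=NONE);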

17 Sample HBASE Queries (Contd..) SELECT Syntax: SELECT { column_name [, column_name] ... | expr[alias] | * } FROM table_name [WHERE row='row_key' | STARTING FROM 'row-key' [UNTIL 'stop-key']] [NUM_VERSIONS = version_count] [TIMESTAMP 'timestamp'] [LIMIT = row_count] [INTO FILE 'file_name']
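Continuing the hypothetical media_uploads example above, a single-row lookup might look like this (the row key value is a placeholder):

SELECT post:name FROM media_uploads WHERE row='some_row_key';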

18 Sample HBASE Queries (contd..) Insert data into table Syntax: INSERT INTO table_name (column_name, ...) VALUES ('value', ...) WHERE row='row_key' [TIMESTAMP 'timestamp']; column_name: column_family_name | column_family_name:column_label_name
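And a matching insert into the same hypothetical table, again with placeholder values:

INSERT INTO media_uploads (post:name) VALUES ('example_song.mp3') WHERE row='some_row_key';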

19 HQL FACTS The hql shell prompt has now been deprecated; it has been replaced by a newer shell version. PS: Don't bother mentioning hql in IRC.

20 Sample PHP to communicate with HBase
// open a new connection to the rest server (on the HBase master's default port)
$hbase = new hbase_rest($ip, $port);
// get list of tables
$tables = $hbase->list_tables();
// get table column family names and compression stuff
$table_info = $hbase->table_schema("search_index");

21 Sample Php File (Cont)
// get start and end row keys of each region
$regions = $hbase->regions($table);
// select data from hbase
$results = $hbase->select($table, $row_key);
// insert data into hbase; the $column and $data can be arrays with more than one column inserted in one request
$hbase->insert($table, $row, $column(s), $data(s));

22 Scaling HBASE Add more machines to scale. The base model (Bigtable) scales past 1000 TB.

23 No Inherent reason why HBASE couldn't

24

25 How to store data in HBASE? Maybe not your raw log data... Store the results of processing it with Hadoop. By storing this processed version in HBASE, you can keep up with huge data demands and serve it to your website.

26 Website access Using the Thrift gateway, PHP code accesses HBASE. No additional caching other than what HBase provides.

27 Large data storage Over 9 billion rows and 1300 GB in HBase. Can MapReduce a 700 GB table in ~20 min; this is about 6 million rows/sec.

28 Challenges Lack of documentation: it's new, so it is hard to find any documentation, library or tutorial. Hostel wireless issues: need at least 2 computers to test. Thrift is still at an early stage. Lots of PHP issues :(, no help nearby. The Freenode IRC #hbase channel was very helpful (but the process is slow).

29 Alternatives Cassandra Hypertable

30 References HBase Home Page, HBase Wiki, Freenode IRC #hbase

31

32

33 Thank You for Your Patience

34 HBASE is an Apache open source project whose goal is to provide Bigtable-like storage for the Hadoop distributed computing environment.

35 Data Model Similar to that of Bigtable. Applications store data rows in labeled tables. A data row has a sortable row key and an arbitrary number of columns. A column name has the form <family>:<label> where <family> and <label> can be arbitrary byte arrays.

36 HBASE QUERY LANGUAGE (HQL) ${HBASE_HOME}/bin/hbase shell [--help] Usage: ./bin/hbase shell [--master:ip_address:port] [--html] Running the above command on the command line presents the following prompt: hql>

37 Sample HBASE Queries To create a table: Syntax: CREATE TABLE table_name (column_family_definition [, column_family_definition] ...) column_family_definition: column_family_name [MAX_VERSIONS=n] [MAX_LENGTH=n] [COMPRESSION=NONE|RECORD|BLOCK] [IN_MEMORY] [BLOOMFILTER=NONE|BLOOMFILTER|COUNTING_BLOOMFILTER|RETOUCHED_BLOOMFILTER VECTOR_SIZE=n NUM_HASH=n]

38 Sample HBASE Queries (Contd..) SELECT Syntax: SELECT { column_name [, column_name] ... | expr[alias] | * } FROM table_name [WHERE row='row_key' | STARTING FROM 'row-key' [UNTIL 'stop-key']] [NUM_VERSIONS = version_count] [TIMESTAMP 'timestamp'] [LIMIT = row_count] [INTO FILE 'file_name']

39 Sample HBASE Queries (contd..) Insert data into table Syntax: INSERT INTO table_name (column_name, ...) VALUES ('value', ...) WHERE row='row_key' [TIMESTAMP 'timestamp']; column_name: column_family_name | column_family_name:column_label_name

40 HQL FACTS The hql shell prompt has now been deprecated; it has been replaced by a newer shell version. PS: Don't bother mentioning hql in IRC.

41 Sample Php File To Communicate With HBASE
// open a new connection to the rest server (on the HBase master's default port)
$hbase = new hbase_rest($ip, $port);
// get list of tables
$tables = $hbase->list_tables();
// get table column family names and compression stuff
$table_info = $hbase->table_schema("search_index");

42 Sample Php File (Contd..)
// get start and end row keys of each region
$regions = $hbase->regions($table);
// select data from hbase
$results = $hbase->select($table, $row_key);
// insert data into hbase; the $column and $data can be arrays with more than one column inserted in one request
$hbase->insert($table, $row, $column(s), $data(s));

43 Sample Php File (Contd..)
// start a scanner on a set range of the table
$handle = $hbase->scanner_start($table, $cols, $start_row, $end_row);
// pull the next row of data for a scanner handle
$results = $hbase->scanner_get($handle);
// delete a scanner handle
$hbase->scanner_delete($handle);

44 Overview Basically, the file uploaded through the web page is inserted into HBASE in the form of its byte representation. When the file is requested, depending upon the key, we select a region of HBASE and return the corresponding file to the user.

45 Table Schema and a Lot More!

46 The Hbase Table The table consists of a row key, which is unique. Associated with the row key is a column family. The column family comprises two columns: one stores the file name, while the other stores the actual file data.

47 HBASE Schema Row key: TempAddress+timestamp (e.g. Hdfs://Downloads :12:07). Column family Post, with columns Name (e.g. DiaryofJane.mp) and Data (in bytes).

48 HBASE Schema (Contd..) Give each entry a unique row key corresponding to its column family. Associate a time-stamp with the temporary download location of each file. The associated time-stamp includes both the time of upload and the date of upload, to nullify clashes.
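A minimal sketch (not from the original slides) of how such a row key could be composed in Java; the class name, variable names and date format are assumptions, while the "timestamp + temporary location" scheme is the one described above and in the code snippet later:

import java.text.SimpleDateFormat;
import java.util.Date;

public class RowKeyBuilder {
    // Build a unique row key from the fixed temporary download location plus the
    // upload date and time, so that two uploads to the same location cannot clash.
    public static String buildRowKey(String tempDownloadLocation) {
        String timestamp = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date());
        return timestamp + "." + tempDownloadLocation;
    }
}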

49 Backend Associated The file available in the temporary download location is copied into HBASE. PHP is used as the framework. The Thrift API acts as a bridge for PHP to communicate with HBASE; it enables the socket connection. The PHP code runs the HBASE code written in Java in the hbase directory.

50 Backend Associated-Thrift A software library and set of code generation tools. Developed by Facebook. Used for implementing efficient and scalable backend services. Goal: To enable efficient and reliable communication across programming languages.

51 Backend Associated (contd..) The Java code takes as arguments the download location and the actual file along with the file name. A time-stamp is then associated with the download location. Since the download location is fixed for every user, we are able to generate a unique key using the time-stamp.

52 Backend Associated (contd..) Associate a file stream to read the file. Change the file into its corresponding byte representation using Java methods. Create a Put object associated with the table, using the row key. The byte representation of the file and the file name are then fed into this Put object. The Put object then inserts the data into HBASE.

53 Code Snippet
public class HbaseClient {
    public static void main(String[] args) throws IOException {
        String rowkey = time + "." + temp;
        Put p = new Put(Bytes.toBytes(rowkey));
        p.add(Bytes.toBytes("post"), Bytes.toBytes("name"), Bytes.toBytes(temp));
    }
}
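The snippet above leaves out where the table handle and the file bytes come from. Below is a minimal sketch of the full insert path described on the previous slide, assuming the HBase 0.20-era Java client API; the class name MediaUploader, the table name media_uploads and the data column label are illustrative assumptions, while the post family and name column appear in the original snippet:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class MediaUploader {
    // Read the uploaded file into bytes and store both its name and its contents
    // under the "post" column family, keyed by the row key built from the
    // timestamp and temporary download location.
    public static void upload(String rowkey, File mediaFile) throws IOException {
        byte[] data = new byte[(int) mediaFile.length()];
        FileInputStream in = new FileInputStream(mediaFile);
        try {
            in.read(data); // a real implementation should loop until all bytes are read
        } finally {
            in.close();
        }

        HTable table = new HTable(new HBaseConfiguration(), "media_uploads");
        Put p = new Put(Bytes.toBytes(rowkey));
        p.add(Bytes.toBytes("post"), Bytes.toBytes("name"), Bytes.toBytes(mediaFile.getName()));
        p.add(Bytes.toBytes("post"), Bytes.toBytes("data"), data);
        table.put(p);
        table.close();
    }
}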

54 Program Execution The URL associated with the file is then returned to the user. When clicked, the URL is passed as an argument to another Java file interacting with HBASE. This Java file creates a Get object. Using the URL, which is also a unique row key for the table, it returns the data to the user.
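A companion sketch of that retrieval step, under the same assumptions as the upload sketch above (hypothetical media_uploads table, post:name and post:data columns, 0.20-era client API):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MediaFetcher {
    // Look up a stored file by its row key (the value embedded in the returned URL)
    // and hand the raw bytes back to the caller, which serves them to the browser
    // using the stored file name.
    public static byte[] fetch(String rowkey) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "media_uploads");
        Get g = new Get(Bytes.toBytes(rowkey));
        Result result = table.get(g);
        String fileName = Bytes.toString(result.getValue(Bytes.toBytes("post"), Bytes.toBytes("name")));
        byte[] data = result.getValue(Bytes.toBytes("post"), Bytes.toBytes("data"));
        table.close();
        System.out.println("Serving " + fileName + " (" + data.length + " bytes)");
        return data;
    }
}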

55 Conclusion The Thrift API enables the PHP code to communicate with the Java code written in the HBase directory. There still remains an issue with the scalability of the project. We could handle that by storing the files in a distributed HDFS cluster as the temporary location.

56 Conclusion (contd..) We could then scale the Hbase tables across multiple nodes in case the size of the table grows large. This scaling of the HBASE table could be done on the basis of the regions associated with the table.

57 Problems Faced Lack of documentation on the Thrift API. Lack of sample code or tutorials. Even the Thrift home page contains links to tutorials that don't work. Lots of scenarios to stumble upon and explore.
