Apache HBase NoSQL on Hadoop
Apache HBase NoSQL on Hadoop
Project Report, Advanced Databases
Muhammed Mohiyudheen Ziyad ( ), Sharma Akshi ( )
IT4BI Masters, 12/2015
Table of Contents
- Abstract
- Apache HBase Introduction: History, The Need for HBase, When to Use HBase?
- Quick Start Guide: Run Modes, Accessing HBase, HBase Configuration Properties, Web Consoles
- Architecture: Master Server, Region Server, Region, HBase Data Architecture
- Data Model: Namespace, Table, Row, Column, Column Family, Column Qualifier, Cell, Timestamp
- Project Environment: HBase Setup, Data Set: Northwind
- Data Definition Language: Create a Namespace, Create a Table, Drop Table/Namespace, Truncate Table, Alter Table
- Data Model Operations: Loading Data (Using the Java API, Using importtsv for HDFS Files, BulkLoad for huge data sets), Data Retrieval (Get, Scans), Delete, Aggregations, Joins
- Features of HBase: Automatic Versioning, Dynamic Schema, Pre Region Splits, Compression, Encoding, Time to Live (TTL), HBase and MapReduce, Client Filters (Partial Key Scan, SingleColumnValueFilter, RegexStringComparator, FamilyFilter, QualifierFilter, RowFilter), ACID in HBase, High Availability
- HBase as an Object Store: NorthwindCustomer Class, Data Loading, Data Retrieval
- Scalability and Performance: Data Load Operation, Update and Reads, Read Only Workload
- Example for HBase Schema Design: Core Design Concepts, Designed Schema (Customers, Employees, Products, Orders)
- Phoenix
- References
Abstract
With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), the ability to store and process huge amounts of any kind of data has become critical. Apache Hadoop is a framework supporting an ecosystem of tools used to store and manage such huge volumes of data. Apache HBase, the NoSQL database on HDFS (the Hadoop Distributed File System), provides real-time read/write access to the large datasets managed with Hadoop. In this project, we try to understand the architecture of HBase and explore the different features offered, with suitable applications. This project also aims to identify how to use HBase effectively for various use cases, and where it should be avoided.
Apache HBase Introduction
Apache HBase provides random, real-time access to your data in Hadoop. It was created for hosting very large tables, making it a great choice for storing multi-structured or sparse data.
History
The HBase story begins in 2006, when the San Francisco-based startup Powerset was trying to build a natural language search engine for the Web and was looking for a storage alternative that could scale. The Google BigTable paper had just been published. Building an open source system to run on top of Hadoop's Distributed Filesystem (HDFS), in much the same way that BigTable ran on top of the Google File System, seemed like a good approach because:
1. it was a proven, scalable architecture
2. it could leverage existing work on Hadoop's HDFS
3. it could both contribute to and get additional leverage from the growing Hadoop ecosystem.
The need for HBase
Out of the box, Hadoop can handle a high volume of multi-structured data. But it cannot handle a high velocity of random reads and writes, and it is unable to change a file without completely rewriting it. Fast random reads require the data to be stored structured (ordered). The only way to modify a file stored on HDFS without rewriting it is appending, and fast random writes into sorted files by appending alone is impossible. The solution to this problem is the Log-Structured Merge Tree (LSM Tree). Designed on top of Hadoop as a NoSQL database, HBase deals with these drawbacks of HDFS; the HBase data structure is based on LSM Trees.
When to use HBase?
Of course, the first prerequisite for using HBase is an existing Hadoop infrastructure. After that, the following points need to be considered in order to use HBase for your use case:
1. Huge amount of data (hundreds of millions or billions of rows)
2. Fast random reads and/or writes
3. Well-known access patterns
HBase might not be a good fit if your use case falls into any of the categories below:
1. New data only needs to be appended
2. Batch processing instead of random reads
3. Complicated access patterns, and data de-normalization is not an option
4. Full ANSI SQL support required
5. A single node can deal with the volume and the velocity of the complete data set
Quick start guide
In this section we discuss the things you should know in order to get started with HBase.
Run Modes
1. Stand-alone mode: In standalone mode, HBase does not use HDFS; it uses the local filesystem instead, and it runs all HBase daemons and a local ZooKeeper in the same JVM.
2. Pseudo-distributed mode: Pseudo-distributed mode means that HBase still runs completely on a single host, but each HBase daemon (HMaster, HRegionServer, and ZooKeeper) runs as a separate process, and the data can be stored in HDFS configured on the single node. Use this configuration for testing and prototyping on HBase.
3. Fully distributed: In a distributed configuration, the cluster contains multiple nodes, each of which runs one or more HBase daemons. These include primary and backup Master instances, multiple ZooKeeper nodes, and multiple RegionServer nodes.
Accessing HBase
HBase data can be accessed using:
1. HBase shell (command line access): The Apache HBase Shell is (J)Ruby's IRB with some HBase-specific commands added. Anything you can do in IRB, you should be able to do in the HBase Shell.
2. Client API: The native API provided by HBase is in Java, and HBase supports many other APIs as well.
HBase configuration properties
Apache HBase uses the same configuration system as Apache Hadoop. All configuration files are located in the conf/ directory, which needs to be kept in sync on each node of your cluster. hbase-site.xml is the main configuration file.
Web Consoles
The details of your HBase cluster can be easily accessed through the web consoles provided by HBase. By default, the Master console runs on port 16010, and the RegionServer console runs on its own port.
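As a minimal sketch of programmatic access (assuming the HBase 1.x client libraries are on the classpath and hbase-site.xml is available; the table name northwind:customer and the row key are the ones used later in this report), a Java client can connect and issue a simple request like this:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseConnectionCheck {
    public static void main(String[] args) throws IOException {
        // Picks up hbase-site.xml (ZooKeeper quorum, ports) from the classpath
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("northwind:customer"))) {
            // Issue a single Get as a connectivity check
            Get get = new Get(Bytes.toBytes("ALFKI"));
            Result result = table.get(get);
            System.out.println("Row exists: " + !result.isEmpty());
        }
    }
}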
Architecture
HBase has three major components: the client library, a master server, and region servers.
Master server
The HMaster is responsible for assigning the regions to the HRegionServers when HBase starts, and it uses Apache ZooKeeper, a reliable, highly available, persistent and distributed coordination service, to facilitate that task. The master server is also responsible for handling load balancing of regions across region servers, to unload busy servers and move regions to less occupied ones. The master is not part of the actual data storage or retrieval path. It negotiates load balancing and maintains the state of the cluster, but never provides any data services to either the region servers or the clients, and is therefore lightly loaded in practice. In addition, it takes care of schema changes and other metadata operations, such as creation of tables and column families.
Region Server
Region servers are responsible for all read and write requests for all regions they serve, and they also split regions that have exceeded the configured region size thresholds. Clients communicate directly with them to handle all data-related operations.
Region
In HBase, rows of data are stored in tables. Tables are split into chunks of rows called regions. The regions are distributed across the cluster, hosted and made available to client processes by the RegionServer process. All the rows in the table that sort between a region's start key and end key are stored in the same region. Regions are non-overlapping, i.e. a single row key belongs to exactly one region at any point in time. A region is only served by a single region server at any point in time, which is how HBase guarantees strong consistency within a single row.
HBase Data Architecture
A Region, in turn, consists of many Stores, which correspond to column families. Each Store instance can in turn have one or more StoreFile instances, which are lightweight wrappers around the actual storage file called HFile. A Store also has one MemStore and zero or more StoreFiles. The data for each column family is stored and accessed separately. HBase handles basically two kinds of file types: one is used for the write-ahead log (WAL) and the other for the actual data storage. The files are primarily handled by the HRegionServer.
8 Apache HBase NoSQL on Hadoop 7 Data Model In HBase, data is stored in tables, which have rows and columns, in the form of a multidimensional map (note that it is not similar to a table in a relational database) Namespace A namespace is a logical grouping of tables analogous to a database in relation database systems. A namespace can be created, removed or altered. If namespace is not specified while creating a table, the default namespace will be used. Table An HBase table consists of multiple rows. Row A row in HBase consists of a row key and one or more columns with values associated with them. Rows are sorted lexicographically by the row key. For this reason, the design of the row key is very important. The goal is to store data in such a way that related rows are near each other. A common row key pattern is a website domain. If your row keys are domains, you should probably store them in reverse (org.apache.www, org.apache.mail, org.apache.jira). This way, all of the Apache domains are near each other in the table, rather than being spread out based on the first letter of the subdomain. Please note that rowkeys cannot be changed. The only way they can be "changed" in a table is if the row is deleted and then re-inserted.
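To illustrate the reversed-domain pattern, here is a small hypothetical sketch (the helper method and the page-visit column are made up for illustration and are not part of the project code) that builds such row keys before writing:

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ReversedDomainKey {
    // Turns "www.apache.org" into "org.apache.www" so related domains sort together
    static String reverseDomain(String domain) {
        String[] parts = domain.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) sb.append('.');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical page-visit row keyed by the reversed domain;
        // the Put would then be written with table.put(put)
        Put put = new Put(Bytes.toBytes(reverseDomain("www.apache.org")));
        put.add(Bytes.toBytes("c"), Bytes.toBytes("visits"), Bytes.toBytes("42"));
        System.out.println("Row key: " + reverseDomain("www.apache.org")); // org.apache.www
    }
}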
Column
A column in HBase consists of a column family and a column qualifier, which are delimited by a : (colon) character.
Column Family
Column families physically collocate a set of columns and their values, often for performance reasons. Each column family has a set of storage properties, such as whether its values should be cached in memory, how its data is compressed or its row keys are encoded, and others. Each row in a table has the same column families, though a given row might not store anything in a given column family. One could argue that a column family is to HBase what a table is to a relational DB.
Column Qualifier
A column qualifier is added to a column family to provide the index for a given piece of data. Given a column family content, a column qualifier might be content:html, and another might be content:pdf. Though column families are fixed at table creation, column qualifiers are mutable and may differ greatly between rows.
Cell
A cell is a combination of row, column family, and column qualifier, and contains a value and a timestamp, which represents the value's version.
Timestamp
A timestamp is written alongside each value, and is the identifier for a given version of a value. By default, the timestamp represents the time on the RegionServer when the data was written, but you can specify a different timestamp value when you put data into the cell.
Project Environment
The project is executed using two instances of HBase.
HBase setup
1. Stand-alone HBase installation on OS X
2. Pseudo-distributed HBase on Hortonworks Sandbox (HDP 2.3.2)
In both instances the latest stable release of HBase (version 1.1.2) is used. The Java version used for developing the client code is 1.7.
Note: Code snippets are shown as separate blocks.
Data Set: Northwind
The Northwind database captures all the sales transactions that occur between an imaginary company, Northwind Traders, and its customers, as well as the purchase transactions between Northwind and its suppliers. In this project, we are using this dataset for showcasing various data operations on HBase. The primary data used is the customer table. This table is exported as a flat file from MS SQL Server and made available for the HBase data load.
(Sample of the Northwind customer data.)
Data Definition Language
In this section we will discuss the data definition language of HBase, along with best practices on schema design. HBase schemas can be created or updated using the Apache HBase Shell or by using Admin in the Java API.
Create a namespace
From the HBase shell, execute the command:
create_namespace 'northwind'
Create a table
Create a table in the namespace with one column family c, able to maintain at most three versions:
create 'northwind:customer', {NAME => 'c', VERSIONS => 3}
Note that:
1. HBase currently does not do well with anything above two or three column families, so keep the number of column families in your schema low.
2. Try to make do with one column family if you can in your schemas. Only introduce a second and third column family in the case where data access is usually column scoped; i.e. you query one column family or the other, but usually not both at the same time.
3. Use the smallest possible names for column families (in this example, c), as the column family name is stored in each cell of a row.
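The same DDL can also be issued programmatically. The following is a minimal sketch using the Admin interface of the HBase 1.x Java client (the namespace and table names match the ones above; error handling is kept short, and createNamespace throws if the namespace already exists):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.NamespaceDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class NorthwindDDL {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Create the namespace 'northwind'
            admin.createNamespace(NamespaceDescriptor.create("northwind").build());
            // Describe the table: one column family 'c' keeping up to 3 versions
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("northwind:customer"));
            HColumnDescriptor family = new HColumnDescriptor("c");
            family.setMaxVersions(3);
            table.addFamily(family);
            admin.createTable(table);
        }
    }
}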
Drop table/namespace
In order to drop any table, it first has to be disabled:
disable 'northwind:customer'
drop 'northwind:customer'
drop_namespace 'northwind'
Truncate table
Truncate will disable the table first, then drop it and recreate it using the same schema.
Alter table
Alter the table to add an additional column family d. The VERSIONS field indicates how many versions should be kept per row for this column family:
alter 'northwind:customer', {NAME => 'd', VERSIONS => 5}
Note: All the above operations can also be performed through the native Java API.
Data Model Operations
The four primary data model operations are Get, Put, Scan, and Delete.
Loading Data
The different ways of loading data into an HBase table are:
1. Using the Java API
It supports real-time updates using the Put class. Put either adds new rows to a table (if the key is new) or updates existing rows (if the key already exists). Example:

public class HBaseDataLoadUsingPut {
    public static void main(String[] args) throws MasterNotRunningException, ZooKeeperConnectionException, IOException {
        // Define the configuration to connect to the HBase table
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "northwind:customer");
        BufferedReader reader = null;
        String[] record;
        String customerId;
        String companyName;
        String contactName;
        String contactTitle;
        String address;
        String city;
        String region;
        String postalCode;
        String country;
        String phone;
        String fax;
        try {
            // Update the file path to point to the Northwind customer file
            File file = new File("/data/northiwind_customer.txt");
            reader = new BufferedReader(new FileReader(file));
            String line;
            while ((line = reader.readLine()) != null) {
                record = line.split("\t");
                customerId = record[0];
                companyName = record[1];
                contactName = record[2];
                contactTitle = record[3];
                address = record[4];
                city = record[5];
                region = record[6];
                postalCode = record[7];
                country = record[8];
                phone = record[9];
                fax = record[10];
                // Create a Put object to store the record in the HBase table
                Put put = new Put(Bytes.toBytes(customerId)); // row key
                put.add(Bytes.toBytes("c"), Bytes.toBytes("companyName"), Bytes.toBytes(companyName));
                put.add(Bytes.toBytes("c"), Bytes.toBytes("contactName"), Bytes.toBytes(contactName));
                put.add(Bytes.toBytes("c"), Bytes.toBytes("contactTitle"), Bytes.toBytes(contactTitle));
                put.add(Bytes.toBytes("c"), Bytes.toBytes("address"), Bytes.toBytes(address));
                put.add(Bytes.toBytes("c"), Bytes.toBytes("city"), Bytes.toBytes(city));
                put.add(Bytes.toBytes("c"), Bytes.toBytes("region"), Bytes.toBytes(region));
                put.add(Bytes.toBytes("c"), Bytes.toBytes("postalCode"), Bytes.toBytes(postalCode));
                put.add(Bytes.toBytes("c"), Bytes.toBytes("country"), Bytes.toBytes(country));
                put.add(Bytes.toBytes("c"), Bytes.toBytes("phone"), Bytes.toBytes(phone));
                put.add(Bytes.toBytes("c"), Bytes.toBytes("fax"), Bytes.toBytes(fax));
                table.put(put);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                reader.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
Output (sample):
2. Using importtsv for HDFS Files
ImportTsv is a utility that loads data in TSV format from HDFS into HBase. The example below shows how it is used for loading data via Put. First, our data set is loaded to HDFS, and then ImportTsv is called from the command line to load this data into the pre-created HBase table.
(Screenshot of the HDFS directory, taken from the Ambari web console.)
Usage:
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>
Example:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,c:CompanyName,c:ContactName,c:ContactTitle,c:Address,c:City,c:Region,c:PostalCode,c:Country,c:Phone,c:Fax northwind:customer /tmp/northiwind_customer.txt
Output (sample):
3. BulkLoad (for huge data sets)
The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly loads the generated StoreFiles into a running cluster. Using bulk load uses less CPU and network resources than simply using the HBase API.
Step 1: The data import is prepared, either by using the ImportTsv tool with the importtsv.bulk.output option or by some other MapReduce job using HFileOutputFormat.
$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c -Dimporttsv.bulk.output=hdfs://storefile-outputdir <tablename> <hdfs-data-inputdir>
Example:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,c:CompanyName,c:ContactName,c:ContactTitle,c:Address,c:City,c:Region,c:PostalCode,c:Country,c:Phone,c:Fax -Dimporttsv.bulk.output=hdfs:///tmp/northiwind_customer_load northwind:customer /tmp/northiwind_customer.txt
Output: HBase data files (StoreFiles) are generated on HDFS, as shown below:
/tmp/northiwind_customer_load/c/c4d1e85e045a4b2d8f4645c5be9cd97f
Step 2: The completebulkload tool is used to import the prepared data into the running cluster. This command-line tool iterates through the prepared data files and, for each one, determines the region the file belongs to.
$ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hdfs://storefileoutput> <tablename>
-OR-
$ hadoop jar hbase-server-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/user_name/myoutput mytable
Example:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/northiwind_customer_load northwind:customer
Output (sample):
Note that, in this case, all the cells are stored with the same timestamp.
Data Retrieval
1. Get
Get returns the attributes for a specified row. Example:
get 'northwind:customer', 'WILMK'
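The same Get can be issued from Java. A minimal sketch (using the same HTable-style API as the other examples in this report, and assuming the data was loaded with the Java Put example above, where the qualifier is companyName):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseGetExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "northwind:customer");
        // Fetch a single row by its row key and read one column from family 'c'
        Get get = new Get(Bytes.toBytes("WILMK"));
        Result result = table.get(get);
        byte[] company = result.getValue(Bytes.toBytes("c"), Bytes.toBytes("companyName"));
        System.out.println("companyName = " + (company == null ? "<not found>" : Bytes.toString(company)));
        table.close();
    }
}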
2. Scans
Scan allows iteration over multiple rows for specified attributes:
scan 'northwind:customer'
The same operations (Get and Scan) can be done using the Java API as well. This allows the user to use the HBase data effectively in client applications. An example of an HBase Scan is given below:

public class HBaseTableScan {
    public static void main(String[] args) {
        // Create a configuration to connect to HBase
        Configuration hConf = HBaseConfiguration.create();
        /* Uncomment the code below to use a remote client.
         * The example uses the properties for the HDP sandbox.
         * hConf.set(Constants.HBASE_CONFIGURATION_ZOOKEEPER_QUORUM, "sandbox.hortonworks.com");
         * hConf.setInt(Constants.HBASE_CONFIGURATION_ZOOKEEPER_CLIENTPORT, 2181);
         */
        try {
            HTable hTable = new HTable(hConf, "northwind:customer");
            byte[] family = Bytes.toBytes("c");
            // Create a Scan object. Optionally you can include start and stop row keys.
            Scan scan = new Scan();
            scan.addFamily(family);
            // Get the result scanner and iterate through the results
            ResultScanner rs = hTable.getScanner(scan);
            for (Result r = rs.next(); r != null; r = rs.next()) {
                // Use the result object
            }
            rs.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Delete
Delete removes a row from a table. HBase does not modify data in place, so deletes are handled by creating new markers called tombstones. These tombstones, along with the dead values, are cleaned up on major compactions.
Example: In this example, we delete the Fax column for a particular customer. After the delete operation, the Fax column is no longer displayed.
Aggregations
HBase does not have its own built-in aggregation operations, but aggregations can be handled using the two methods below:
1. You can write your own MapReduce job working directly on the HBase data sitting in HFiles in HDFS. This is the most efficient way, but not simple, and the data you process would be somewhat stale. It is most efficient because the data is not transferred via the HBase API; instead it is accessed right from HDFS in a sequential manner.
2. Register the HBase table as a table in Hive and do the aggregations there. Data is accessed via the HBase API, which is not as efficient, but it is the most powerful way to group HBase data. It does imply running MapReduce jobs, but launched by Hive, not by HBase.
Usage:
CREATE TABLE hive_managed_table (key string, value1 string, value2 int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,columnfamilyname:val1,columnfamilyname:val2")
TBLPROPERTIES ("hbase.table.name" = "namespace:table_name");
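For small result sets, a simple alternative is to aggregate on the client side while scanning. The sketch below is an illustration only, not a recommended pattern for large tables: it counts customers per country by scanning the whole northwind:customer table, which is reasonable for the small Northwind data set but not for billions of rows.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class CustomersPerCountry {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "northwind:customer");
        // Only fetch the column we aggregate on
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("c"), Bytes.toBytes("country"));
        Map<String, Integer> counts = new HashMap<String, Integer>();
        ResultScanner scanner = table.getScanner(scan);
        for (Result r = scanner.next(); r != null; r = scanner.next()) {
            byte[] raw = r.getValue(Bytes.toBytes("c"), Bytes.toBytes("country"));
            if (raw == null) continue; // skip rows without a country value
            String country = Bytes.toString(raw);
            Integer current = counts.get(country);
            counts.put(country, current == null ? 1 : current + 1);
        }
        scanner.close();
        table.close();
        System.out.println(counts);
    }
}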
Joins
Whether HBase supports joins is a common question, and there is a simple answer: it doesn't, at least not in the way that RDBMSs support them (e.g., with equi-joins or outer joins in SQL). The read data model operations in HBase are Get and Scan. However, that doesn't mean that equivalent join functionality can't be supported in your application, but you have to do it yourself. The two primary strategies are either de-normalizing the data when writing to HBase, or having lookup tables and doing the join between HBase tables in your application or MapReduce code. So which is the best approach? It depends on what you are trying to do, and as such there isn't a single answer that works for every use case.
Features of HBase
In this section, we will explore the striking features of HBase, which make it a strong choice as the NoSQL database on Hadoop. The notable features of HBase include:
1. Strongly consistent reads/writes: HBase is not an "eventually consistent" data store. This makes it very suitable for tasks such as high-speed counter aggregation.
2. Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.
3. Automatic RegionServer failover
4. Hadoop/HDFS integration: HBase supports HDFS out of the box as its distributed file system.
5. MapReduce: HBase supports massively parallelized processing via MapReduce, using HBase as both source and sink.
6. Java Client API: HBase provides an easy-to-use Java API for programmatic access.
7. Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends.
8. Block Cache and Bloom Filters: HBase supports a block cache and Bloom filters for high-volume query optimization.
9. Operational management: HBase provides built-in web pages for operational insight, as well as JMX metrics.
Automatic Versioning
The maximum number of versions to store for a given column is part of the column schema and is specified at table creation, or via an alter command; the default is HColumnDescriptor.DEFAULT_VERSIONS.
Example: create a table named emp with a single column family f that keeps a maximum of 5 versions of all columns in the column family, and then update the salary of the employee several times.
create 'emp', {NAME => 'f', VERSIONS => 5}
put 'emp', '10000', 'f:salary', '65000'
put 'emp', '10000', 'f:salary', '70000'
put 'emp', '10000', 'f:salary', '75000'
put 'emp', '10000', 'f:salary', '80000'
HBase allows scanning the latest n versions, where n varies from 1 to the maximum value, in this case 5. For example:
scan 'emp', {VERSIONS => 3}
ROW      COLUMN+CELL
 10000   column=f:salary, timestamp= , value=80000
 10000   column=f:salary, timestamp= , value=75000
 10000   column=f:salary, timestamp= , value=70000
1 row(s)
Note that, on addition of the 6th version of the data, the oldest version will be lost, as the maximum number of versions is set to 5.
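Versions can also be read from the Java API. A small sketch (assuming the emp table above) that retrieves the last three versions of the salary column:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadSalaryVersions {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "emp");
        Get get = new Get(Bytes.toBytes("10000"));
        // Ask the server to return up to three versions of each cell
        get.setMaxVersions(3);
        Result result = table.get(get);
        List<Cell> cells = result.getColumnCells(Bytes.toBytes("f"), Bytes.toBytes("salary"));
        for (Cell cell : cells) {
            System.out.println(cell.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(cell)));
        }
        table.close();
    }
}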
Dynamic Schema
Column qualifiers in HBase are mutable and may differ greatly between rows. One row in a table can have 1 column, whereas the next row in the same table can have 1 million columns. This dynamic schema is useful in many applications, as we don't have to specify the number of columns at table creation time. Let's take the example of the Northwind customer data; the association between customer and country is as below:

CustomerID  Country
ALFKI       Germany
ANATR       Mexico
ANTON       Mexico
AROUT       UK
BERGS       Sweden
BLAUS       Germany
BLONP       France
BOLID       Spain
BONAP       France
BOTTM       Canada
BSBEV       UK
CACTU       Argentina
CENTC       Mexico
CHOPS       Switzerland
COMMI       Brazil
CONSH       UK
DRACD       Germany
DUMON       France

This association can be stored in an HBase table with country as the row key. The table has one column family named c, and the customers associated with a country are stored as column qualifiers. If some extra information needs to be added to the country-customer pair, we can simply add it to the value, e.g. if the complete address of the person needs to be stored. Below is a snapshot of the northwind country data. As you can see, there are multiple column qualifiers associated with a row key. Whenever a new customer is added, a new column qualifier is added to the country (row) he belongs to. In real-world scenarios, one country will have thousands of customers, and all of them can be stored in just a single row in HBase. The advantage is that if your application needs to retrieve all the customers belonging to a particular category (in this case, country), a single Get operation is enough, provided the row key of the table is the category (in this case, country).
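A sketch of how such a row grows (the table name northwind:country_customer and the stored value are assumptions for illustration): every new customer simply becomes a new qualifier on the country row.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AddCustomerToCountryRow {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "northwind:country_customer");
        // Row key = country, column qualifier = customer id, value = any extra detail
        Put put = new Put(Bytes.toBytes("Germany"));
        put.add(Bytes.toBytes("c"), Bytes.toBytes("ALFKI"), Bytes.toBytes("full customer address"));
        table.put(put);
        table.close();
    }
}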
Pre region splits
With a process called pre-splitting, you can create a table with many regions by supplying the split points at table creation time. Since pre-splitting ensures that the initial load is more evenly distributed throughout the cluster, you should always consider using it if you know your key distribution beforehand. However, pre-splitting also has the risk of creating regions that do not truly distribute the load evenly, because of data skew or in the presence of very hot or large rows. If the initial set of region split points is chosen poorly, you may end up with a heterogeneous load distribution, which will in turn limit your cluster's performance. There is no short answer for the optimal number of regions for a given load, but you can start with a lower multiple of the number of region servers as the number of splits, and then let automated splitting take care of the rest. We will go through an example to understand how pre-region splitting helps in distributing the data.
Without pre-region split
First, let's create the northwind customer table without any pre-region splits. If you go to the HBase web console, you can see that there is only one region for this table.
Applying pre-region split
Here, we apply the pre-region split while creating the customer table (the row key is CustomerID). CustomerID in this dataset is a varchar starting with a letter from A to Z. With the splits defined below, the data will be distributed amongst 27 different regions depending on the starting letter of the row key. For example, all CustomerIDs starting with A will go to the A region. (Here we assume that the first character of CustomerID is uniformly distributed from A to Z.)
create 'northwind:customer', {NAME => 'c', VERSIONS => 3}, {SPLITS => ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']}
The web console now shows that this table has 27 online regions. After loading the data, you can see that different regions have received the data based on the value of CustomerID.
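The same pre-split table can be created from the Java Admin API by passing explicit split keys. A minimal sketch (HBase 1.x client assumed, and assuming the table does not already exist):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePreSplitTable {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("northwind:customer"));
            desc.addFamily(new HColumnDescriptor("c").setMaxVersions(3));
            // One split key per letter A..Z => 27 regions (including one region before 'A')
            byte[][] splits = new byte[26][];
            for (char ch = 'A'; ch <= 'Z'; ch++) {
                splits[ch - 'A'] = Bytes.toBytes(String.valueOf(ch));
            }
            admin.createTable(desc, splits);
        }
    }
}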
Compression
HBase comes with support for a number of compression algorithms that can be enabled at the column family level, or specifically for compaction output. The available algorithms are NONE (no compression), GZ, LZO, LZ4 and SNAPPY (which generally performs better than LZO). Enabling compression requires installation of the corresponding native libraries (unless you only want to use the Java-based GZIP compression) and specifying the chosen algorithm in the column family schema.
Column family compression design considerations:
1. Already compressed data (such as JPEG) should be in an uncompressed column family.
2. Small but very frequently used families should not be compressed.
The different cases where compression can be applied are explained below.
a) To enable compression during table creation:
create 'northwind:products', {NAME => 'colfam1', COMPRESSION => 'LZ4'}
b) To enable/change/disable compression algorithms for existing tables, use the alter command:
create 'northwind:customer', 'c'
disable 'northwind:customer'
alter 'northwind:customer', {NAME => 'c', COMPRESSION => 'LZ4'}
enable 'northwind:customer'
Note that only store files newly flushed after the change will use the new compression format.
c) To force all existing HFiles to be rewritten with the newly selected compression format, issue a major_compact '<tablename>' in the shell to start a major compaction process in the background. It will rewrite all files and therefore use the new settings.
Recommendations
1. If the values are large (and not pre-compressed, such as images), use a data block compressor.
2. Use GZIP for cold data, which is accessed infrequently. GZIP compression uses more CPU resources than Snappy or LZO, but provides a higher compression ratio.
3. Use Snappy or LZO for hot data, which is accessed frequently. Snappy and LZO use fewer CPU resources than GZIP, but do not provide as high a compression ratio.
4. In most cases, enabling Snappy or LZO by default is a good choice, because they have a low performance overhead and provide space savings.
5. Before Google released Snappy in 2011, LZO was the default choice. Snappy has similar qualities to LZO but has been shown to perform better.
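Compression can also be set from the Java API when describing a column family. A sketch (assuming the Snappy native libraries are installed on the RegionServers and the table does not already exist):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;

public class EnableSnappyCompression {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            HColumnDescriptor family = new HColumnDescriptor("p");
            // Compress newly written store files of this family with Snappy
            family.setCompressionType(Compression.Algorithm.SNAPPY);
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("northwind:products"));
            desc.addFamily(family);
            admin.createTable(desc);
        }
    }
}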
Encoding
Data block encoding attempts to limit the duplication of information in keys, taking advantage of some of the fundamental designs and patterns of HBase, such as sorted row keys and the schema of a given table. The encoding methods include Prefix, Diff, Fast Diff, Prefix Tree, etc. We will explore prefix encoding as an example.
Prefix encoding
Often, keys are very similar. Specifically, keys often share a common prefix and only differ near the end. For instance, one key might be RowKey:Family:Qualifier0 and the next key might be RowKey:Family:Qualifier1. In prefix encoding, an extra column is added which holds the length of the prefix shared between the current key and the previous key. Assuming the first key here is totally different from the key before it, its prefix length is 0. The second key's prefix length is 23, since the keys have the first 23 characters in common. Obviously, if the keys tend to have nothing in common, prefix encoding will not provide much benefit.
(The first image shows a hypothetical column family with no data block encoding; the second shows the same column family with prefix encoding.)
Time to Live (TTL)
Column families can set a TTL length in seconds, and HBase will automatically delete rows once the expiration time is reached. This applies to all versions of a row, even the current one. The TTL time encoded in HBase for the row is specified in UTC. Store files which contain only expired rows are deleted on minor compaction. Setting hbase.store.delete.expired.storefile to false disables this feature. Setting the minimum number of versions to a value other than 0 also disables it.
Recent versions of HBase also support setting a time to live on a per-cell basis. Cell TTLs are submitted as an attribute on mutation requests (Appends, Increments, Puts, etc.) using Mutation#setTTL. If the TTL attribute is set, it will be applied to all cells updated on the server by the operation. There are two notable differences between cell TTL handling and column family TTLs: cell TTLs are expressed in units of milliseconds instead of seconds, and a cell TTL cannot extend the effective lifetime of a cell beyond a column family level TTL setting.
To specify a TTL for a table using the shell (TTL value in seconds):
create 'northwind:products', {NAME => 'p', VERSIONS => 1, TTL => <seconds>}
To specify a TTL for a table using the HBase API, HColumnDescriptor provides the following getter and setter to read and write the TTL:
int getTimeToLive();
void setTimeToLive(int timeToLive);
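A short sketch of the per-cell TTL described above (Put#setTTL takes milliseconds; the table, row key, qualifier and one-hour value here are assumptions for illustration):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutWithCellTTL {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "northwind:products");
        Put put = new Put(Bytes.toBytes("PROD-1"));
        put.add(Bytes.toBytes("p"), Bytes.toBytes("stockLevel"), Bytes.toBytes("17"));
        // Cell-level TTL in milliseconds: this cell expires one hour after being written
        put.setTTL(60 * 60 * 1000L);
        table.put(put);
        table.close();
    }
}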
HBase and MapReduce
Apache MapReduce is a software framework used to analyze large amounts of data, and it is the framework used most often with Apache Hadoop. HBase is tightly integrated with MapReduce and allows its tables to be accessed from MapReduce jobs for both reads and writes. The different types of MapReduce jobs for HBase include:
1. MapReduce job writing a summary to an HBase table
2. Reading from an HBase table in a MapReduce job and writing a summary to an HDFS file
3. HBase MapReduce summary to an RDBMS table
The code below shows how a MapReduce job writes its output to an HBase table. The data set is read from HDFS and processed in parallel using the MapReduce framework, and the output is loaded into the HBase table northwind:customer.
Driver class:

public class HBaseMapreduceWrite {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration config = HBaseConfiguration.create();
        Job job = new Job(config, "HBaseMapreduceWrite");
        job.setJarByClass(HBaseMapreduceWrite.class); // class that contains the mapper
        FileInputFormat.addInputPath(job, new Path(args[0])); // set the input path
        job.setMapperClass(HBaseWriteMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        TableMapReduceUtil.initTableReducerJob(
                "northwind:customer", // output table
                null,                 // reducer class
                job);
        job.setNumReduceTasks(0); // to create a map-only job
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Mapper class:

public class HBaseWriteMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    public void map(LongWritable row, Text value, Context context) throws IOException, InterruptedException {
        String[] record;
        String customerId;
        String companyName;
        String contactName;
        String contactTitle;
        String address;
        String city;
        String region;
        String postalCode;
        String country;
        String phone;
        String fax;
        // Get the attributes of the customer record from the mapper input value
        record = value.toString().split("\t");
        customerId = record[0];
        companyName = record[1];
        contactName = record[2];
        contactTitle = record[3];
        address = record[4];
        city = record[5];
        region = record[6];
        postalCode = record[7];
        country = record[8];
        phone = record[9];
        fax = record[10];
        // Create the Put object using the customer attributes to be stored in HBase
        Put put = new Put(Bytes.toBytes(customerId));
        put.add(Bytes.toBytes("c"), Bytes.toBytes("companyName"), Bytes.toBytes(companyName));
        put.add(Bytes.toBytes("c"), Bytes.toBytes("contactName"), Bytes.toBytes(contactName));
        put.add(Bytes.toBytes("c"), Bytes.toBytes("contactTitle"), Bytes.toBytes(contactTitle));
        put.add(Bytes.toBytes("c"), Bytes.toBytes("address"), Bytes.toBytes(address));
        put.add(Bytes.toBytes("c"), Bytes.toBytes("city"), Bytes.toBytes(city));
        put.add(Bytes.toBytes("c"), Bytes.toBytes("region"), Bytes.toBytes(region));
        put.add(Bytes.toBytes("c"), Bytes.toBytes("postalCode"), Bytes.toBytes(postalCode));
        put.add(Bytes.toBytes("c"), Bytes.toBytes("country"), Bytes.toBytes(country));
        put.add(Bytes.toBytes("c"), Bytes.toBytes("phone"), Bytes.toBytes(phone));
        put.add(Bytes.toBytes("c"), Bytes.toBytes("fax"), Bytes.toBytes(fax));
        // Write the Put object to the context, using customerId as the key
        context.write(new ImmutableBytesWritable(customerId.getBytes()), put);
    }
}
Client Filters
Get and Scan instances can optionally be configured with filters, which are applied on the RegionServer. HBase supports many types of filters to retrieve the required data at the row, family and column level. In this section, we will explore the major filters and comparators offered by HBase.
1. Partial Key Scan
The most efficient way to retrieve the required rows is to use a start and stop row in the scan operation, provided the rows you are looking for are contiguous. In most cases the row key is designed so that rows are clubbed together based on the most common access pattern. Also remember that HBase stores the data sorted by row key. The following example shows how to use this feature to retrieve all the customers whose ID starts with the letters WAR. The stop row is given as WAS, which means: do not retrieve the customers that come after WAR. Note that the stop row is not included in the range.
scan 'northwind:customer', {STARTROW => 'WAR', STOPROW => 'WAS'}
ROW    COLUMN+CELL
WARTH  column=c:address, timestamp= , value=Torikatu 38
WARTH  column=c:city, timestamp= , value=Oulu
WARTH  column=c:companyName, timestamp= , value=Wartian Herkku
WARTH  column=c:contactName, timestamp= , value=Pirkko Koskitalo
WARTH  column=c:contactTitle, timestamp= , value=Accounting Manager
WARTH  column=c:country, timestamp= , value=Finland
WARTH  column=c:fax, timestamp= , value=
WARTH  column=c:phone, timestamp= , value=
WARTH  column=c:postalCode, timestamp= , value=90110
WARTH  column=c:region, timestamp= , value=NULL
1 row(s)
2. SingleColumnValueFilter
SingleColumnValueFilter can be used to test column values for equivalence (CompareOp.EQUAL), inequality (CompareOp.NOT_EQUAL), or ranges (e.g., CompareOp.GREATER). The following example tests the equivalence of a column to the String "my value":
SingleColumnValueFilter filter = new SingleColumnValueFilter(cf, column, CompareOp.EQUAL, Bytes.toBytes("my value"));
scan.setFilter(filter);
3. RegexStringComparator
RegexStringComparator supports regular expressions for value comparisons:
RegexStringComparator comp = new RegexStringComparator("my."); // any value that starts with 'my'
SingleColumnValueFilter filter = new SingleColumnValueFilter(cf, column, CompareOp.EQUAL, comp);
scan.setFilter(filter);
4. FamilyFilter
FamilyFilter can be used to filter on the column family. It is generally a better idea to select column families in the Scan than to do it with a filter:
Scan scan = new Scan();
scan.addFamily(family); // (optional) limit to one family
5. QualifierFilter
QualifierFilter can be used to filter based on the column (a.k.a. qualifier) name. Examples are ColumnPrefixFilter and ColumnRangeFilter:
Scan scan = new Scan(row, row); // (optional) limit to one row
scan.addFamily(family); // (optional) limit to one family
Filter f = new ColumnPrefixFilter(prefix);
scan.setFilter(f);
ColumnRangeFilter is used for slicing the data:
Scan scan = new Scan(row, row); // (optional) limit to one row
scan.addFamily(family); // (optional) limit to one family
Filter f = new ColumnRangeFilter(startColumn, true, endColumn, true);
scan.setFilter(f);
6. RowFilter
As discussed earlier, it is a better idea to use the startRow/stopRow methods on Scan for row selection; however, RowFilter can also be used:
RowFilter filter = new RowFilter(CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes(myRowId.toString())));
ACID in HBase
Apache HBase is not an ACID-compliant database. However, it does guarantee certain specific properties. HBase employs a kind of MVCC (multi-version concurrency control), and HBase has no mixed read/write transactions. In a nutshell, each RegionServer maintains what we call "strictly monotonically increasing transaction numbers". When a write transaction (a set of puts or deletes) starts, it retrieves the next highest transaction number; in HBase this is called a WriteNumber. When a read transaction (a Scan or Get) starts, it retrieves the transaction number of the last committed transaction; HBase calls this the ReadPoint. Each created KeyValue is tagged with its transaction's WriteNumber (this tag is called the memstore timestamp in HBase; note that it is separate from the application-visible timestamp).
The high-level flow of a write transaction in HBase looks like this:
1. Lock the row(s), to guard against concurrent writes to the same row(s).
2. Retrieve the current WriteNumber.
3. Apply the changes to the WAL (write-ahead log).
4. Apply the changes to the MemStore (using the acquired WriteNumber to tag the KeyValues).
5. Commit the transaction, i.e. attempt to roll the ReadPoint forward to the acquired WriteNumber.
6. Unlock the row(s).
The high-level flow of a read transaction looks like this:
1. Open the scanner.
2. Get the current ReadPoint.
3. Filter out all scanned KeyValues with a memstore timestamp greater than the ReadPoint.
4. Close the scanner (this is initiated by the client).
In reality it is more complicated, but this explanation illustrates it at a high level. Note that a reader acquires no locks at all, but we still get all of ACID. It is important to realize that this only works if transactions are committed strictly serially; otherwise an earlier uncommitted transaction could become visible when one that started later commits first. In HBase transactions are typically short, so this is not a problem, and HBase does exactly that: all transactions are committed serially. Committing a transaction in HBase means setting the current ReadPoint to the transaction's WriteNumber, and hence making its changes visible to all new Scans. HBase keeps a list of all unfinished transactions, and a transaction's commit is delayed until all prior transactions have committed. Note that HBase can still make all changes immediately and concurrently; only the commits are serial. Also note that a scan will always reflect a view of the data at least as new as the beginning of the scan.
High Availability
To achieve high availability for reads, HBase provides a feature called region replication. In this model, for each region of a table, there will be multiple replicas that are opened in different RegionServers. By default, the region replication is set to 1, so only a single region replica is deployed and there are no changes from the original model. If region replication is set to 2 or more, then the master will assign replicas of the regions of the table. The load balancer ensures that the region replicas are not co-hosted in the same region servers, and also not in the same rack (if possible).
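Region replication is configured per table. As a minimal sketch (HBase 1.1+ client assumed; the table name, family and replication factor of 2 are illustrative assumptions):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateTableWithReadReplicas {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("northwind:orders"));
            desc.addFamily(new HColumnDescriptor("o"));
            // Two replicas per region: one primary plus one read-only secondary
            desc.setRegionReplication(2);
            admin.createTable(desc);
        }
    }
}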
HBase as an Object Store
Instead of storing different attributes as column qualifiers, HBase can also store the data as a serialized object. This is a very useful feature when:
1. You have thousands of fields for an entity, and whenever a single field changes you need to create a new version of the entity. In the conventional design you would have to update all the column qualifiers; if the entity is stored as a single object (in one column qualifier), you can modify the object and put it back as a new version.
2. Your application demands that the data be preserved as the objects used in the application. Any change in the parameters of the objects can then be handled at the application level, and not at the database.
We will explain this concept using the same northwind:customer example. Instead of storing the different attributes of the customer as different columns, we create a customer object and store it in the HBase table. We also show how it can be retrieved.
NorthwindCustomer Class

public class NorthwindCustomer implements Writable {
    public Map<String, String> customerMap;

    public NorthwindCustomer() {
        customerMap = new HashMap<String, String>();
    }

    public NorthwindCustomer(Map<String, String> customerMap) {
        super();
        this.customerMap = customerMap;
    }

    public void setCustomerDetails(Map<String, String> customerMap) {
        this.customerMap = customerMap;
    }

    public Map<String, String> getCustomerDetails() {
        return customerMap;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        customerMap.clear();
        int entries = in.readInt();
        String key;
        String value;
        customerMap = new HashMap<String, String>();
        for (int i = 0; i < entries; i++) {
            key = in.readUTF();
            value = in.readUTF();
            customerMap.put(key, value);
        }
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(customerMap.size());
        for (String key : customerMap.keySet()) {
            out.writeUTF(key);
            out.writeUTF(customerMap.get(key));
        }
    }
}
Data Loading

public class HBaseLoadDataAsObject {
    public static void main(String[] args) throws MasterNotRunningException, ZooKeeperConnectionException, IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "northwind:customer");
        BufferedReader reader = null;
        String[] record;
        Map<String, String> customerMap = new HashMap<String, String>();
        NorthwindCustomer northwindCustomer;
        try {
            File file = new File("/data/northiwind_customer.txt");
            reader = new BufferedReader(new FileReader(file));
            String line;
            while ((line = reader.readLine()) != null) {
                record = line.split("\t");
                customerMap.clear();
                customerMap.put("customerId", record[0]);
                customerMap.put("companyName", record[1]);
                customerMap.put("contactName", record[2]);
                customerMap.put("contactTitle", record[3]);
                customerMap.put("address", record[4]);
                customerMap.put("city", record[5]);
                customerMap.put("region", record[6]);
                customerMap.put("postalCode", record[7]);
                customerMap.put("country", record[8]);
                customerMap.put("phone", record[9]);
                customerMap.put("fax", record[10]);
                northwindCustomer = new NorthwindCustomer(customerMap);
                Put put = new Put(Bytes.toBytes(record[0]));
                put.add(Bytes.toBytes("c"), Bytes.toBytes("profile"), serialize(northwindCustomer));
                table.put(put);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                reader.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    // Serialize the Writable object into a byte array
    public static byte[] serialize(Writable writable) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);
        dataOut.close();
        return out.toByteArray();
    }
}
Data Retrieval
(Screenshot of the HBase object store.)

public class HBaseGetObjectData {
    public static void main(String[] args) {
        Configuration hConf = HBaseConfiguration.create();
        /* Uncomment the code below to use a remote client (HDP sandbox):
         * hConf.set(Constants.HBASE_CONFIGURATION_ZOOKEEPER_QUORUM, "sandbox.hortonworks.com");
         * hConf.setInt(Constants.HBASE_CONFIGURATION_ZOOKEEPER_CLIENTPORT, 2181);
         */
        Result r;
        NorthwindCustomer northwindCustomer;
        Map<String, String> customerMap;
        try {
            HTable hTable = new HTable(hConf, "northwind:customer");
            byte[] family = Bytes.toBytes("c");
            byte[] qualifier = Bytes.toBytes("profile");
            Get get = new Get(Bytes.toBytes("WOLZA"));
            r = hTable.get(get);
            byte[] value = r.getValue(family, qualifier);
            northwindCustomer = new NorthwindCustomer();
            deserialize(northwindCustomer, value);
            customerMap = northwindCustomer.getCustomerDetails();
            for (String property : customerMap.keySet()) {
                System.out.println(property + " == " + customerMap.get(property));
            }
            hTable.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Deserialize the byte array back into the Writable object
    public static void deserialize(Writable writable, byte[] bytes) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        DataInputStream dataIn = new DataInputStream(in);
        writable.readFields(dataIn);
        dataIn.close();
    }
}
The result of the above operation is given below:
country == Poland
contactTitle == Owner
address == ul. Filtrowa 68
city == Warszawa
phone == (26)
contactName == Zbyszek Piestrzeniewicz
companyName == Wolski Zajazd
postalCode ==
customerId == WOLZA
region == NULL
fax == (26)
Scalability and Performance
The real performance of HBase becomes visible when the amount of data is huge and the number of nodes in the cluster is adequate. In our project setup we are using only a pseudo-distributed cluster, which is not enough to test the scalability and performance of HBase. In order to get an idea of the performance of HBase, we refer to a vendor-independent case study of NoSQL databases done by Altoros Systems Inc., in which the databases were tested under the same conditions, regardless of their specifics, using the Yahoo! Cloud Serving Benchmark (YCSB). The complete case study can be found in the references section of this report. Some of the interesting results are given below.
Data load operation
100 million records, each containing 10 fields of 100 randomly generated bytes, were imported to a four-node cluster. HBase demonstrated by far the best write speed: with pre-created regions and deferred log flush enabled, it reached 40K ops/sec. Cassandra also showed great performance during the loading phase, with around 15K ops/sec.
Update and reads
Next, an update-heavy scenario was run that simulates the database work during which typical actions of an e-commerce solution user are recorded. As you can see, during updates HBase and Cassandra went far ahead of the main group, with the average response latency not exceeding two milliseconds; HBase was even faster. The HBase client was configured with AutoFlush turned off; the updates were aggregated in the client buffer and pending writes were flushed asynchronously, as soon as the buffer became full. To accelerate update processing on the server, deferred log flush was enabled and WAL edits were kept in memory during the flush period. During reads, per-column-family compression provides HBase and Cassandra with faster data access. HBase was configured with the native LZO codec and Cassandra with Google's Snappy compression codec. Although the computation ran longer, the compression reduces the number of bytes read from the disk.
Read-only workload
This read-only workload simulated a data caching system. The data was stored outside the system, while the application was only reading it. Thanks to B-tree indexes, sharded MySQL became the winner in this competition.
Example for HBase schema design
In this section, we will design an HBase data schema for the Northwind database (assuming that its data size grows huge and it is therefore being ported to HBase), whose relational schema is given below. Please note that the ideal design of any HBase database will depend heavily on the real access patterns to the data. In this example, we will try to explain the principles of HBase schema design, and how to gain more efficiency based on the access patterns.
(Relational schema diagram of the Northwind database.)
Core Design Concepts
The core concepts we should consider in the initial design are:
1. There is no referential integrity offered by HBase, so it is up to the application designer/developer to take care of referential integrity in the database.
2. Since storage is not a problem in HDFS, de-normalization should be applied wherever applicable. It also facilitates fast data retrieval with a single row scan, compared to multiple rows (potentially across multiple tables) in a normalized design.
3. Since the ROW_KEY is the single identifier for a row, it should be unique and should contain a range of values in order to get the data distributed equally across the cluster.
4. Keep the ROW_KEY length as short as is reasonable such that it is still useful for the required data access (e.g. Get vs. Scan). A short key that is useless for data access is not better than a longer key with better get/scan properties. Expect tradeoffs when designing row keys.
5. Heavy joins should be avoided by using composite row keys, based on the access patterns.
6. Try to make do with one column family if you can in your schemas. Only introduce a second and third column family in the case where data access is usually column scoped; i.e. you query one column family or the other, but usually not both at the same time.
7. Try to keep the column family names as small as possible, preferably one character (e.g. "d" for data/default).
8. Also note that row keys cannot be changed. The only way they can be "changed" in a table is if the row is deleted and then re-inserted.
9. Finally, if there is a requirement for secondary indexes, we can create separate index tables and update them periodically.
Designed Schema
Considering these factors, the designed HBase schema is given below.
1. Customers
Three tables in the Northwind relational database (Customers, CustomerCustomerDemo, CustomerDemographics) are combined to form a single table. In the demographics column family, a dynamic column qualifier is used instead of a static column qualifier. This allows us to handle the n-to-n relationship between customers and customer demographics.
Note: Throughout the design, unless mentioned otherwise, the column values are the values from the relational DB for the corresponding column qualifiers.
2. Employees
Four tables in the Northwind relational database (Employees, EmployeeTerritories, Region, Territories) are combined to form a single table. In order to handle the multiple-territory information for a particular employee, the territory column family (t) uses a composite row key of EmployeeID and TerritoryID, separated by a delimiter. (The hyphen character is a good choice of delimiter, since it sorts before alphanumeric characters in ASCII.)
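A small sketch of building and splitting such a composite key (the helper methods and the example IDs are illustrative, not part of the project code):

import org.apache.hadoop.hbase.util.Bytes;

public class CompositeKeys {
    private static final String DELIMITER = "-";

    // EmployeeID + "-" + TerritoryID, e.g. "5-02116"
    static byte[] territoryRowKey(String employeeId, String territoryId) {
        return Bytes.toBytes(employeeId + DELIMITER + territoryId);
    }

    // Split a composite key back into its two parts
    static String[] parse(byte[] rowKey) {
        return Bytes.toString(rowKey).split(DELIMITER, 2);
    }

    public static void main(String[] args) {
        byte[] key = territoryRowKey("5", "02116");
        String[] parts = parse(key);
        System.out.println("employee=" + parts[0] + ", territory=" + parts[1]);
    }
}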
3. Products
Three tables in the Northwind relational database (Products, Categories, Suppliers) are combined to form a single table. Note that the CategoryID column is not really necessary, as all the category information is stored in the same row for each product. It is kept in the design in order to keep a reference to the RDBMS system.
4. Orders
Three tables in the Northwind relational database (Orders, OrderDetails, Shippers) are combined to form a single table. In order to retain the order details of each product in an order, the order details column family uses a composite row key of OrderID and ProductID. Note that the order of the keys in this design allows us to get all the products of a particular OrderID, but not vice versa. If we want to see all the orders of a particular product, we can create an index table with ProductID as the row key and OrderIDs as column qualifiers. This table needs to be updated whenever a new order entry is inserted in the Orders table. As we discussed earlier, the access patterns matter a lot in HBase schema design!
Phoenix
Apache Phoenix is a relational database layer over HBase, delivered as a client-embedded JDBC driver targeting low-latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.
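Since Phoenix is exposed as a JDBC driver, using it from Java is plain JDBC. A minimal sketch (all of the following are assumptions for illustration: the Phoenix client jar is on the classpath, the ZooKeeper quorum is localhost, and a Phoenix table named CUSTOMER with a COUNTRY column exists):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixQueryExample {
    public static void main(String[] args) throws Exception {
        // The JDBC URL points at the ZooKeeper quorum used by HBase
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNTRY, COUNT(*) FROM CUSTOMER GROUP BY COUNTRY")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}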
References
1. HBase: The Definitive Guide, Lars George, O'Reilly Media.
2. Apache HBase Reference Guide, The Apache Software Foundation.
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
Scaling Up 2 CSE 6242 / CX 4242. Duen Horng (Polo) Chau Georgia Tech. HBase, Hive
CSE 6242 / CX 4242 Scaling Up 2 HBase, Hive Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Le
Data storing and data access
Data storing and data access Plan Basic Java API for HBase demo Bulk data loading Hands-on Distributed storage for user files SQL on nosql Summary Basic Java API for HBase import org.apache.hadoop.hbase.*
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
Big Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform interactive execution of map reduce jobs Pig is the name of the system Pig Latin is the
Hypertable Architecture Overview
WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for
COURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
Introduction to Hbase Gkavresis Giorgos 1470
Introduction to Hbase Gkavresis Giorgos 1470 Agenda What is Hbase Installation About RDBMS Overview of Hbase Why Hbase instead of RDBMS Architecture of Hbase Hbase interface Summarise What is Hbase Hbase
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Peers Techno log ies Pv t. L td. HADOOP
Page 1 Peers Techno log ies Pv t. L td. Course Brochure Overview Hadoop is a Open Source from Apache, which provides reliable storage and faster process by using the Hadoop distibution file system and
Storage of Structured Data: BigTable and HBase. New Trends In Distributed Systems MSc Software and Systems
Storage of Structured Data: BigTable and HBase 1 HBase and BigTable HBase is Hadoop's counterpart of Google's BigTable BigTable meets the need for a highly scalable storage system for structured data Provides
Scaling Up HBase, Hive, Pegasus
CSE 6242 A / CS 4803 DVA Mar 7, 2013 Scaling Up HBase, Hive, Pegasus Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko,
Integrating VoltDB with Hadoop
The NewSQL database you ll never outgrow Integrating with Hadoop Hadoop is an open source framework for managing and manipulating massive volumes of data. is an database for handling high velocity data.
Cloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
Open source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
HADOOP MOCK TEST HADOOP MOCK TEST II
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
Big Data Development CASSANDRA NoSQL Training - Workshop. March 13 to 17-2016 9 am to 5 pm HOTEL DUBAI GRAND DUBAI
Big Data Development CASSANDRA NoSQL Training - Workshop March 13 to 17-2016 9 am to 5 pm HOTEL DUBAI GRAND DUBAI ISIDUS TECH TEAM FZE PO Box 121109 Dubai UAE, email training-coordinator@isidusnet M: +97150
Getting to know Apache Hadoop
Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the
Practical Cassandra. Vitalii Tymchyshyn [email protected] @tivv00
Practical Cassandra NoSQL key-value vs RDBMS why and when Cassandra architecture Cassandra data model Life without joins or HDD space is cheap today Hardware requirements & deployment hints Vitalii Tymchyshyn
Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН
Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН Zettabytes Petabytes ABC Sharding A B C Id Fn Ln Addr 1 Fred Jones Liberty, NY 2 John Smith?????? 122+ NoSQL Database
Integration of Apache Hive and HBase
Integration of Apache Hive and HBase Enis Soztutar enis [at] apache [dot] org @enissoz Page 1 About Me User and committer of Hadoop since 2007 Contributor to Apache Hadoop, HBase, Hive and Gora Joined
Non-Stop for Apache HBase: Active-active region server clusters TECHNICAL BRIEF
Non-Stop for Apache HBase: -active region server clusters TECHNICAL BRIEF Technical Brief: -active region server clusters -active region server clusters HBase is a non-relational database that provides
Big Data Primer. 1 Why Big Data? Alex Sverdlov [email protected]
Big Data Primer Alex Sverdlov [email protected] 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab
IBM CDL Lab Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网 Information Management 2012 IBM Corporation Agenda Hadoop 技 术 Hadoop 概 述 Hadoop 1.x Hadoop 2.x Hadoop 生 态
Workshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Distributed File Systems
Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.
Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
A Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop Ecosystem Raj Nair Director Data Platform Kiru Pakkirisamy CTO AGENDA About Penton and Serendio Inc Data Processing at Penton PoC Use Case Functional
THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES
THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon [email protected] [email protected] XLDB
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens
Realtime Apache Hadoop at Facebook Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens Agenda 1 Why Apache Hadoop and HBase? 2 Quick Introduction to Apache HBase 3 Applications of HBase at
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM
HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction
Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
Using distributed technologies to analyze Big Data
Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/
HADOOP MOCK TEST HADOOP MOCK TEST I
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
Data processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah
Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big
Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications
Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications White Paper Table of Contents Overview...3 Replication Types Supported...3 Set-up &
Apache Hadoop FileSystem and its Usage in Facebook
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System [email protected] Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
Hypertable Goes Realtime at Baidu. Yang Dong [email protected] Sherlock Yang(http://weibo.com/u/2624357843)
Hypertable Goes Realtime at Baidu Yang Dong [email protected] Sherlock Yang(http://weibo.com/u/2624357843) Agenda Motivation Related Work Model Design Evaluation Conclusion 2 Agenda Motivation Related
Integrating Big Data into the Computing Curricula
Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big
BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
Ankush Cluster Manager - Hadoop2 Technology User Guide
Ankush Cluster Manager - Hadoop2 Technology User Guide Ankush User Manual 1.5 Ankush User s Guide for Hadoop2, Version 1.5 This manual, and the accompanying software and other documentation, is protected
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Accelerating Hadoop MapReduce Using an In-Memory Data Grid
Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for
In Memory Accelerator for MongoDB
In Memory Accelerator for MongoDB Yakov Zhdanov, Director R&D GridGain Systems GridGain: In Memory Computing Leader 5 years in production 100s of customers & users Starts every 10 secs worldwide Over 15,000,000
THE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
ITG Software Engineering
Introduction to Apache Hadoop Course ID: Page 1 Last Updated 12/15/2014 Introduction to Apache Hadoop Course Overview: This 5 day course introduces the student to the Hadoop architecture, file system,
Similarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
Bigtable is a proven design Underpins 100+ Google services:
Mastering Massive Data Volumes with Hypertable Doug Judd Talk Outline Overview Architecture Performance Evaluation Case Studies Hypertable Overview Massively Scalable Database Modeled after Google s Bigtable
Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013
Petabyte Scale Data at Facebook Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013 Agenda 1 Types of Data 2 Data Model and API for Facebook Graph Data 3 SLTP (Semi-OLTP) and Analytics
Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.
Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in
Bigdata High Availability (HA) Architecture
Bigdata High Availability (HA) Architecture Introduction This whitepaper describes an HA architecture based on a shared nothing design. Each node uses commodity hardware and has its own local resources
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected]
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected] Hadoop, Why? Need to process huge datasets on large clusters of computers
MongoDB Developer and Administrator Certification Course Agenda
MongoDB Developer and Administrator Certification Course Agenda Lesson 1: NoSQL Database Introduction What is NoSQL? Why NoSQL? Difference Between RDBMS and NoSQL Databases Benefits of NoSQL Types of NoSQL
Benchmarking Cassandra on Violin
Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract
Real-Time Handling of Network Monitoring Data Using a Data-Intensive Framework
Real-Time Handling of Network Monitoring Data Using a Data-Intensive Framework Aryan TaheriMonfared Tomasz Wiktor Wlodarczyk Chunming Rong Department of Electrical Engineering and Computer Science University
HareDB HBase Client Web Version USER MANUAL HAREDB TEAM
2013 HareDB HBase Client Web Version USER MANUAL HAREDB TEAM Connect to HBase... 2 Connection... 3 Connection Manager... 3 Add a new Connection... 4 Alter Connection... 6 Delete Connection... 6 Clone Connection...
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
low-level storage structures e.g. partitions underpinning the warehouse logical table structures
DATA WAREHOUSE PHYSICAL DESIGN The physical design of a data warehouse specifies the: low-level storage structures e.g. partitions underpinning the warehouse logical table structures low-level structures
Working With Hadoop. Important Terminology. Important Terminology. Anatomy of MapReduce Job Run. Important Terminology
Working With Hadoop Now that we covered the basics of MapReduce, let s look at some Hadoop specifics. Mostly based on Tom White s book Hadoop: The Definitive Guide, 3 rd edition Note: We will use the new
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya
Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming by Dibyendu Bhattacharya Pearson : What We Do? We are building a scalable, reliable cloud-based learning platform providing services
Using RDBMS, NoSQL or Hadoop?
Using RDBMS, NoSQL or Hadoop? DOAG Conference 2015 Jean- Pierre Dijcks Big Data Product Management Server Technologies Copyright 2014 Oracle and/or its affiliates. All rights reserved. Data Ingest 2 Ingest
A very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
Agenda. ! Strengths of PostgreSQL. ! Strengths of Hadoop. ! Hadoop Community. ! Use Cases
Postgres & Hadoop Agenda! Strengths of PostgreSQL! Strengths of Hadoop! Hadoop Community! Use Cases Best of Both World Postgres Hadoop World s most advanced open source database solution Enterprise class
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Big Data and Scripting map/reduce in Hadoop
Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb
Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation
Facebook: Cassandra Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/24 Outline 1 2 3 Smruti R. Sarangi Leader Election
How To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
Hadoop Project for IDEAL in CS5604
Hadoop Project for IDEAL in CS5604 by Jose Cadena Mengsu Chen Chengyuan Wen {jcadena,mschen,[email protected] Completed as part of the course CS5604: Information storage and retrieval offered by Dr. Edward
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
The Hadoop Distributed File System
The Hadoop Distributed File System The Hadoop Distributed File System, Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, Yahoo, 2010 Agenda Topic 1: Introduction Topic 2: Architecture
Benchmarking Couchbase Server for Interactive Applications. By Alexey Diomin and Kirill Grigorchuk
Benchmarking Couchbase Server for Interactive Applications By Alexey Diomin and Kirill Grigorchuk Contents 1. Introduction... 3 2. A brief overview of Cassandra, MongoDB, and Couchbase... 3 3. Key criteria
