Data storing and data access


Plan
- Basic Java API for HBase
- Demo
- Bulk data loading
- Hands-on: distributed storage for user files
- SQL on NoSQL
- Summary

Basic Java API for HBase
   import org.apache.hadoop.hbase.*

Data modification operations
- Storing the data
  - Put: inserts new cells or updates existing ones
  - Append: updates a cell by appending to its value
  - Bulk loading: loads the data directly into store files
- Reading
  - Get: single-row lookup (one or many cells)
  - Scan: range scan over rows
- Deleting data
  - delete (column families, rows, cells)
  - truncate

Adding a row with Java API
1. Create a configuration
   Configuration config = HBaseConfiguration.create();
2. Establish a connection
   Connection connection = ConnectionFactory.createConnection(config);
3. Open the table
   Table table = connection.getTable(TableName.valueOf(table_name));
4. Create a Put object
   Put p = new Put(key);
5. Set values for columns
   p.addColumn(family, col_name, value);
6. Push the data to the table
   table.put(p);

Getting data with Java API
1. Open a connection and instantiate a Table object
2. Create a Get object
   Get g = new Get(key);
3. (optional) Restrict the lookup to a certain family or column
   g.addColumn(family_name, col_name); // or g.addFamily(family_name);
4. Get the data from the table
   Result result = table.get(g);
5. Get values from the Result object
   byte[] value = result.getValue(family, col_name);
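Every value returned by a Get is a raw byte[], so the client decides how to interpret it. A minimal sketch of that encoding and decoding using only the JDK follows. The class CellCodec and its method names are hypothetical and chosen for illustration; HBase ships its own Bytes utility class for the same job.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of HBase): maps typed values to byte arrays and back.
final class CellCodec {
    // Strings are stored as their UTF-8 bytes.
    static byte[] encodeString(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String decodeString(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    // Longs are stored as 8 bytes, big-endian (ByteBuffer's default order).
    static byte[] encodeLong(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    static long decodeLong(byte[] b) {
        return ByteBuffer.wrap(b).getLong();
    }

    public static void main(String[] args) {
        byte[] name = encodeString("Baranowski");
        byte[] id = encodeLong(1232323L);
        System.out.println(decodeString(name) + " / " + decodeLong(id)
                + " / id stored in " + id.length + " bytes");
    }
}
```

The reader has to know which decoder to apply to which column, because nothing in the stored bytes records the original type.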

Scanning the data with Java API
1. Open a connection and instantiate a Table object
2. Create a Scan object (the stop row is exclusive)
   Scan s = new Scan(start_row, stop_row);
3. (optional) Set the columns to be retrieved
   s.addColumn(family, col_name);
4. Get a ResultScanner object
   ResultScanner scanner = table.getScanner(s);
5. Iterate through the results
   for (Result row : scanner) { /* do something with the row */ }

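Filtering scan results
- Filtering does not prevent the data set from being read on the server side; it only reduces network utilization!
- Many filters are available (for column names, values, etc.) and they can be combined
- Each setFilter call replaces the previously set filter; to apply several filters at once, wrap them in a FilterList
   scan.setFilter(new ValueFilter(CompareOp.GREATER_OR_EQUAL, new BinaryComparator(Bytes.toBytes(1500))));
   scan.setFilter(new PageFilter(25));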

Demo
- Let's store persons data in HBase
- Description of a person:
  - id
  - first_name
  - last_name
  - date of birth
  - profession
  - ...?
- Additional requirement: fast record lookup by last name

Demo source data in CSV
   1232323,Zbigniew,Baranowski,M,1983-11-20,Poland,IT,CERN
   1254542,Kacper,Surdy,M,1989-12-12,Poland,IT,CERN
   6565655,Michel,Jackson,M,1966-12-12,USA,Music,None
   7633242,Barack,Obama,M,1954-12-22,USA,President,USA
   5323425,Andrzej,Duda,M,1966-01-23,Poland,President,Poland
   5432411,Ewa,Kopacz,F,1956-02-23,Poland,Prime Minister,Poland
   3243255,Rolf,Heuer,M,1950-03-26,Germany,DG,CERN
   6554322,Fabiola,Gianotti,F,1962-10-29,Italy,Particle Physici
   1232323,Lionel,Messi,M,1984-06-24,Argentina,Football Player,

Demo - designing
- Generate a new id when inserting a person
  - It has to be a unique sequence of incremented numbers
  - Incrementing has to be an atomic operation
  - The most recent id value has to be stored (in a table)
- Row key = id? Or maybe row key = last_name+id?
  - Let's keep: row key = id
- Fast last_name lookups
  - Additional indexing table
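The atomicity requirement can be illustrated without a cluster. The sketch below uses an in-memory map as a stand-in for the counters table; this is an assumption for illustration, and the class IdCounter and its methods are hypothetical. In HBase itself the comparable server-side operation is Table.incrementColumnValue. Several threads request ids at the same time and no id is handed out twice.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// In-memory stand-in for the counters table (assumption: one counter per row key).
final class IdCounter {
    private final ConcurrentHashMap<String, AtomicLong> counters = new ConcurrentHashMap<>();

    // Atomically increments the counter stored under the given row key and returns the new value.
    long nextId(String counterRowKey) {
        return counters.computeIfAbsent(counterRowKey, k -> new AtomicLong()).incrementAndGet();
    }

    // Requests ids from several threads at once and returns the set of distinct ids obtained.
    static Set<Long> generateConcurrently(IdCounter counter, int threads, int perThread) {
        Set<Long> ids = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < perThread; i++) {
                    ids.add(counter.nextId("users"));
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("id generation did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return ids;
    }

    public static void main(String[] args) {
        Set<Long> ids = generateConcurrently(new IdCounter(), 4, 1000);
        System.out.println("distinct ids generated: " + ids.size());
    }
}
```

A read-then-write sequence (get the counter, add one, put it back) would not give this guarantee, because two clients could read the same value before either writes.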

Demo - Tables
- users: holds the users data
  - row_key = userid
- counters: for userid generation
  - row_key = main_table_name
- usersindex: for indexing the users table
  - row_key = last_name+userid?
  - row_key = column_name+value+userid
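The index key layout column_name+value+userid turns a lookup by value into a range scan over sorted keys. The sketch below models this with a TreeMap standing in for the sorted row keys of usersindex. The class UsersIndexSketch, its methods, and the "|" separator are assumptions for illustration; the slides only fix the order of the key parts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory model of the usersindex table: a sorted map plays the role of HBase's sorted row keys.
// For ASCII keys, String ordering matches the byte-wise ordering HBase uses.
final class UsersIndexSketch {
    // Separator between key parts (an assumption, not taken from the slides).
    static final String SEP = "|";

    private final NavigableMap<String, String> index = new TreeMap<>();

    // Builds an index row key: column_name + value + userid.
    static String indexKey(String column, String value, String userId) {
        return column + SEP + value + SEP + userId;
    }

    // Adds one index entry pointing back to the row in the users table.
    void add(String column, String value, String userId) {
        index.put(indexKey(column, value, userId), userId);
    }

    // Range scan over [startValue, stopValue): returns the user ids whose indexed value falls in the range.
    List<String> lookup(String column, String startValue, String stopValue) {
        String startRow = column + SEP + startValue;
        String stopRow = column + SEP + stopValue;
        return new ArrayList<>(index.subMap(startRow, true, stopRow, false).values());
    }

    public static void main(String[] args) {
        UsersIndexSketch idx = new UsersIndexSketch();
        idx.add("last_name", "Baranowski", "1232323");
        idx.add("last_name", "Surdy", "1254542");
        idx.add("last_name", "Obama", "7633242");
        // Same range as the hands-on command: start "Baranowski", stop "Baranowskj".
        System.out.println(idx.lookup("last_name", "Baranowski", "Baranowskj"));
    }
}
```

The stop value "Baranowskj" is the start value with its last character incremented, so the half-open range covers exactly the keys that begin with "Baranowski". The returned ids are then used for Get lookups in the users table.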

Demo Java classes
- UsersLoader: loading the data
  - generates userid from the counters table
  - loads the users data into the users table
  - updates the usersindex table
- UsersScanner: performs range scans
  - scans the usersindex table over ranges provided by the caller
  - gets the details of the matching records from the users table

Hands on
- Get the scripts
   wget cern.ch/zbaranow/hbase.zip
   unzip hbase.zip
   cd hbase/part1
- Preview: UsersLoader.java, UsersScanner.java
- Create tables
   hbase shell -n tables.txt
- Compile and run
   javac -cp `hbase classpath` *.java
   java -cp `hbase classpath` UsersLoader users.csv 2>/dev/null
   java -cp `hbase classpath` UsersScanner last_name Baranowski Baranowskj 2>/dev/null

Schema design consideration

Key values
- The row key is the most important aspect of the design
  - fast data reading vs fast data storing
- Fast data access (range scans)
  - keep in mind the right order of row key parts
    - username+timestamp vs timestamp+username
  - for fast retrieval of recent data it is better to insert new rows into the first regions of the table
    - Example: key = 10000000000 - timestamp
- Fast data storing
  - distribute rows across regions
    - Salting
    - Hashing
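The three key layouts named above can be sketched with plain JDK code. The class RowKeys, the zero-padding width, the two-digit salt prefix, and the 4-byte MD5 prefix are all assumptions made for illustration; only the formula key = 10000000000 - timestamp comes from the slide.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helpers illustrating row key layouts; not part of the HBase API.
final class RowKeys {
    // Constant from the slide's example: key = 10000000000 - timestamp.
    static final long MAX = 10_000_000_000L;

    // Reverse timestamp: newer rows get smaller keys and therefore sort first.
    // Zero-padding keeps lexicographic order equal to numeric order.
    static String reverseTimestampKey(long epochSeconds) {
        return String.format(Locale.ROOT, "%011d", MAX - epochSeconds);
    }

    // Salting: a small bucket prefix derived from the key spreads writes over regions.
    // The salt is deterministic, so a reader can recompute it from the key.
    static String saltedKey(String key, int buckets) {
        int salt = Math.floorMod(key.hashCode(), buckets);
        return String.format(Locale.ROOT, "%02d-%s", salt, key);
    }

    // Hashing: prefix the key with the first 4 bytes of its MD5 digest (hex encoded).
    static String hashedKey(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 4; i++) {
                hex.append(String.format(Locale.ROOT, "%02x", digest[i]));
            }
            return hex + "-" + key;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(reverseTimestampKey(1443000000L));
        System.out.println(saltedKey("baranowski", 16));
        System.out.println(hashedKey("baranowski"));
    }
}
```

Salted and hashed keys trade away range scans over the original key order: rows that were adjacent end up in different parts of the table, which is exactly what spreads the write load.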

Tables
Two options:

Wide - large number of columns (Region 1)

   Key  F1:COL1  F1:COL2  F2:COL3  F2:COL4
   r1   r1v1     r1v2     r1v3     r1v4
   r2   r2v1     r2v2     r2v3     r2v4
   r3   r3v1     r3v2     r3v3     r3v4

Tall - large number of rows (Region 1, Region 2, Region 3, Region 4)

   Key      F1:V
   r1_col1  r1v1
   r1_col2  r1v2
   r1_col3  r1v3
   r1_col4  r1v4
   r2_col1  r2v1
   r2_col2  r2v2
   r2_col3  r2v3
   r2_col4  r2v4
   r3_col1  r3v1
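The relation between the two layouts is mechanical: each cell of a wide row becomes its own row in the tall table, with the column name folded into the row key. A minimal sketch follows, assuming the "_" separator seen in the example keys; the class WideToTall is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: reshapes one wide row into the equivalent tall rows.
final class WideToTall {
    // wideRow maps column name -> value; the result maps "rowKey_column" -> value.
    static Map<String, String> toTall(String rowKey, Map<String, String> wideRow) {
        Map<String, String> tall = new LinkedHashMap<>();
        for (Map.Entry<String, String> cell : wideRow.entrySet()) {
            tall.put(rowKey + "_" + cell.getKey(), cell.getValue());
        }
        return tall;
    }

    public static void main(String[] args) {
        Map<String, String> r1 = new LinkedHashMap<>();
        r1.put("col1", "r1v1");
        r1.put("col2", "r1v2");
        // Each entry of the printed map corresponds to one row of the tall table.
        System.out.println(toTall("r1", r1));
    }
}
```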

Bulk data loading

Bulk loading
- Why?
  - For loading big data sets already available on HDFS
  - Faster: data is written directly to HBase store files
  - No footprint on region servers
- How?
  1. Load the data into HDFS
  2. Generate HFiles with the data using MapReduce
     - write your own job
     - or use ImportTsv (it has some limitations)
  3. Embed the generated files into HBase

Bulk load demo
1. Create a target table
2. Load the CSV file to HDFS
3. Run ImportTsv
4. Run LoadIncrementalHFiles
All commands are in: bulkloading.txt

Part 2: Distributed storage (hands on)

Hands on: distributed storage
- Let's imagine we need to provide a backend storage system for a large-scale application
  - e.g. for a mail service or a cloud drive
- We want the storage to be
  - distributed
  - content addressed
- In the following hands-on we'll see how HBase can do this

Distributed storage: insert client
- The application will be able to upload a file from a local file system and save a reference to it in the users table
- A file will be referenced by its SHA-1 fingerprint
- General steps:
  - read a file and calculate its fingerprint
  - check whether the file already exists
  - save it in the files table if it does not exist
  - add a reference in the users table, in the media column family
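The insert steps can be sketched with in-memory maps standing in for the files and users tables. The maps are an assumption made so the example runs without a cluster, and the class ContentStoreSketch and its members are hypothetical. The fingerprint calculation uses the JDK's MessageDigest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// In-memory model of the content-addressed store (not the exercise solution).
final class ContentStoreSketch {
    // Stand-in for the files table: fingerprint -> file content.
    final Map<String, byte[]> files = new HashMap<>();
    // Stand-in for the media column family of the users table: userId -> (media name -> fingerprint).
    final Map<String, Map<String, String>> userMedia = new HashMap<>();

    // Computes the SHA-1 fingerprint of the content as a lowercase hex string.
    static String sha1Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format(Locale.ROOT, "%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Stores the content once per fingerprint and records a reference for the user.
    String insert(String userId, String mediaName, byte[] content) {
        String fingerprint = sha1Hex(content);
        files.putIfAbsent(fingerprint, content);   // save only if the file does not exist yet
        userMedia.computeIfAbsent(userId, k -> new HashMap<>()).put(mediaName, fingerprint);
        return fingerprint;
    }

    public static void main(String[] args) {
        ContentStoreSketch store = new ContentStoreSketch();
        byte[] photo = "same bytes".getBytes(StandardCharsets.UTF_8);
        store.insert("1232323", "photo.jpg", photo);
        store.insert("1254542", "copy.jpg", photo.clone());
        System.out.println("stored files: " + store.files.size());
    }
}
```

Because the row key of the files table is derived from the content, two users uploading identical files share one stored copy and differ only in their references.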

Distributed storage: download client
- The application will be able to download a file given a user ID and a file (media) name
- General steps:
  - retrieve the fingerprint from the users table
  - get the file data from the files table
  - save the data to a local file system

Distributed storage: exercise location
- Get to the source files
   cd ../part2
- Fill in the TODOs
  - use the docs and the previous examples for support
- Compile with
   javac -cp `hbase classpath` InsertFile.java
   javac -cp `hbase classpath` GetMedia.java

SQL on HBase

Running SQL on HBase
- From Hive or Impala
- An HTable is mapped to an external table
- Some DML statements are supported
  - insert (but not overwrite)
  - updates are available by duplicating a row with an insert statement

Use cases for SQL on HBase
- Data warehouses
  - fact tables: big data scanning -> Impala + Parquet
  - dimension tables: random lookups -> HBase
- Read-write storage
  - Metadata
  - Counters

How to?
- Create an external table with Hive
  - Provide column names and types (the key column should always be a string)
  - STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  - WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,main:first_name,main:last_name...")
  - TBLPROPERTIES ("hbase.table.name" = "users");
- Try it out!
   hive -f ./part2/sqlonhbase.txt

Summary

What was not covered
- Writing co-processors (stored procedures)
- HBase table permissions
- Filtering of data scanner results
- Using MapReduce for data storing and retrieving
- Bulk data loading with custom MapReduce
- Using different APIs (Thrift)

Summary
- HBase is a key-value, wide-column store
  - Horizontal (regions) + vertical (column families) partitioning
  - Row key values are indexed within regions
  - Data is type-free: values are stored as byte arrays
  - Tables are semi-structured
- Fast random data access by key
- Not for massively parallel data processing!
- Stored data can be modified (updated, deleted)

Other similar NoSQL engines
- Apache Cassandra
- MongoDB
- Apache Accumulo (on Hadoop)
- Hypertable (on Hadoop)
- HyperDex
- Berkeley DB / Oracle NoSQL

Announcement: Hadoop users forum
- Why?
  - Exchange knowledge and experience about
    - The technology itself
    - Current successful projects on Hadoop@CERN
    - Service requirements
- Who?
  - Everyone who is interested in Hadoop (and not only)
- How?
  - e-group: it-analytics-wg@cern.ch
- When?
  - Every 2-4 weeks
  - Starting from the 7th of October