A distributed, scalable database management system

A distributed, scalable database management system. http://cassandra.apache.org/ Molnár András (modras@ilab.sztaki.hu), Garzó András (garzo@ilab.sztaki.hu). Friday, July 13, 2012

Origin: In Greek mythology, Cassandra was the daughter of King Priam and Queen Hecuba of Troy. Cassandra was so beautiful that the god Apollo gave her the ability to see the future. But when she refused his amorous advances, he cursed her such that she would still be able to accurately predict everything that would happen but no one would believe her. Cassandra foresaw the destruction of her city of Troy, but was powerless to stop it. The Cassandra distributed database is named for her. I speculate that it is also named as kind of a joke on the Oracle at Delphi, another seer for whom a database is named. (Cassandra: The Definitive Guide)

CAP: [figure: RDBMSs, Voldemort, Cassandra, HBase and Bigtable placed on the CAP diagram] (Cassandra: The Definitive Guide)

Tuneable consistency: set the consistency level... (Cassandra: The Definitive Guide)
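A small sketch may make "tuneable" more concrete. It assumes the usual Dynamo-style rule that a read is guaranteed to see the latest acknowledged write when the read and write replica counts together exceed the replication factor (R + W > N). The function names are hypothetical; only the level names ONE, QUORUM and ALL are Cassandra's.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Replicas that must acknowledge an operation at the given level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # A quorum is a strict majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level in this sketch: {level}")


def read_sees_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when read and write replica sets must overlap (R + W > N)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, writing and reading at QUORUM touches 2 + 2 replicas, so the sets overlap; writing and reading at ONE does not give that guarantee, which is the trade-off being tuned.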

What does it say about itself? Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon's Dynamo and its data model on Google's Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web. That's exactly 50 words. Of course, if you were to recite that to your boss in the elevator, you'd probably get a blank look in return. (Cassandra: The Definitive Guide)

What does it say about itself? [annotation overlaid on the quote above: indexed, schema-free, row-oriented data store] (Cassandra: The Definitive Guide)

When is it worth using? (e.g.) Cassandra vs RDBMS: large data volumes, server clusters... Cassandra vs HBase: always writable... Cassandra vs Voldemort: dozens or hundreds of columns...

Use case examples
Large deployments: many nodes
Lots of writes, statistics & analysis: always writable
Geographical distribution: data locality
Evolving applications: no strict schema
(Cassandra: The Definitive Guide)

Development status: current version 1.1.2, released 2012-07-02 [v1.1.1 released 2012-06-04]

Who supports it? Third-party solution providers, e.g. (Cassandra wiki)

Who uses it? Twitter is using Cassandra for analytics: for real-time analytics, for geolocation and places of interest data, and for data mining over the entire user store. Mahalo uses it for its primary near-time data store. Facebook still uses it for inbox search, though they are using a proprietary fork. Digg uses it for its primary near-time data store. Rackspace uses it for its cloud service, monitoring, and logging. Reddit uses it as a persistent cache. Cloudkick uses it for monitoring statistics and analytics. Ooyala uses it to store and serve near real-time video analytics data. SimpleGeo uses it as the main data store for its real-time location infrastructure. Onespot uses it for a subset of its main data store. (Cassandra: The Definitive Guide)

Who uses it? Cassandra is in use at Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Digg, CloudKick, Ooyala, and more companies that have large, active data sets. The largest known Cassandra cluster has over 300 TB of data in over 400 machines. More: http://www.datastax.com/cassandrausers (cassandra.apache.org)

Data Model: sparse table [figure] (Cassandra: The Definitive Guide)

Data Model: the super column family feature ("becoming deprecated"). Not officially deprecated, but not highly recommended either. (Ed Anuff: Cassandra Indexing Techniques)

Column
name: byte[]. Queried against (predicates); determines sort order.
value: byte[]. Opaque to Cassandra.
timestamp: long. Conflict resolution (Last Write Wins).
(Cassandra: The Definitive Guide)
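The three fields above can be sketched as a small record type, together with a last-write-wins merge of two versions of the same column. This is an illustrative model only: the class and function names are hypothetical, and on equal timestamps this sketch simply keeps the first argument, which is an assumption rather than Cassandra's actual tie-breaking rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes      # queried against; determines sort order within the row
    value: bytes     # opaque payload
    timestamp: int   # supplied by the client, used to resolve conflicts


def reconcile(a: Column, b: Column) -> Column:
    """Last Write Wins: keep the version with the higher timestamp."""
    if a.name != b.name:
        raise ValueError("can only reconcile versions of the same column")
    # Assumption of this sketch: ties keep the first argument.
    return a if a.timestamp >= b.timestamp else b
```

For example, reconciling a version written at timestamp 1 with one written at timestamp 2 returns the second, regardless of the order in which the replicas received them.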

Column sorting: column names are stored in sorted order according to the value of compare_with: AsciiType, BytesType, LexicalUUIDType, IntegerType, LongType, TimeUUIDType, UTF8Type, CompositeType... Custom... (Cassandra: The Definitive Guide)
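To see why the choice of comparator matters, the following sketch sorts the same three column names twice: once by raw bytes, as a BytesType-like comparator would, and once after decoding them as 8-byte signed integers, as a LongType-like comparator would. The helper names are hypothetical and the encoding (big-endian, 8 bytes) is an assumption of the sketch.

```python
import struct


def encode_long(n: int) -> bytes:
    """Encode an integer as 8 big-endian bytes (assumed wire form here)."""
    return struct.pack(">q", n)


def decode_long(raw: bytes) -> int:
    return struct.unpack(">q", raw)[0]


names = [encode_long(n) for n in (2, 10, -1)]

# BytesType-like ordering: plain lexicographic comparison of the bytes.
bytes_order = [decode_long(raw) for raw in sorted(names)]

# LongType-like ordering: compare the decoded numeric values.
long_order = [decode_long(raw) for raw in sorted(names, key=decode_long)]
```

Under byte ordering the negative value sorts last because its leading byte is 0xff; under numeric ordering it sorts first. The stored bytes are identical in both cases, only the comparator differs.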

Row storing & sorting: Column sorting is controllable, but key sorting isn't; row keys always sort in byte order. Rows are stored in an order defined by the partitioner (for example, with RandomPartitioner, they are in random order, etc.). (Cassandra: The Definitive Guide)
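The "random order" of RandomPartitioner comes from hashing the row key. As a simplified sketch, assume the token is just the MD5 digest of the key read as an unsigned integer (the real partitioner's exact token arithmetic is not reproduced here, and the function names are hypothetical):

```python
import hashlib


def token(row_key: bytes) -> int:
    """Simplified token: the MD5 digest interpreted as an unsigned integer."""
    return int.from_bytes(hashlib.md5(row_key).digest(), "big")


def storage_order(row_keys):
    """Order row keys by token, as a hash-based partitioner would place them."""
    return sorted(row_keys, key=token)


keys = [b"alice", b"bob", b"carol", b"dave"]
by_token = storage_order(keys)   # order determined by the hash
by_bytes = sorted(keys)          # byte order of the keys themselves
```

The token order is deterministic but unrelated to the byte order of the keys, which is why range scans over row keys are not meaningful with a hash-based partitioner.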

Alternate indexes
Native secondary indexes
Wide rows as lookup and grouping tables
Custom secondary indexes
(Ed Anuff: Cassandra Indexing Techniques)

Example

Example: User and Username
User: stores users. Keyed on a unique ID (UUID). Columns for username and password.
Username: indexes users. Keyed on username. One column, the unique UUID for the user.
[diagram labels: UUID, Username, Password; Username, UUID]
(Eric Evans: Hands-on Cassandra)

Example: Friends and Followers
Friends: maps a user to the users (s)he follows. Keyed on user ID. Columns for each user being followed.
Followers: maps a user to those following her/him. Keyed on username. Columns for each user following.
[diagram labels: UUID, Username, Follows (followees), Followers (followers)]
(Eric Evans: Hands-on Cassandra)

Example: Tweets
Tweets: stores tweets and maps them to users. Keyed on a unique identifier. Columns: unique identifier, user ID, body of the tweet, timestamp.
[diagram labels: TweetID, UUID, Body, Timestamp]
(Eric Evans: Hands-on Cassandra)

Example: Timeline and Userline
Timeline: the materialized view of Tweets for a user. Keyed on user ID. Columns that map timestamps to Tweet ID.
Userline: the collection of Tweets attributed to a user. Keyed on user ID. Columns that map timestamps to Tweet ID.
[diagram labels: UUID, UUID, TweetID (tweetidsof), TweetID (tweetidsto)]
(Eric Evans: Hands-on Cassandra)
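The seven column families of this example can be mimicked with plain dictionaries to show how one write fans out into several of them. This is an in-memory illustration only, not client code: the function names are hypothetical, every family except Username is keyed by user ID here for simplicity, and a tweet is copied only into the timelines of the author's followers.

```python
import time
import uuid
from collections import defaultdict

# Each column family is modelled as: row key -> {column name: value}.
User = {}                        # user ID   -> username, password
Username = {}                    # username  -> user ID
Friends = defaultdict(dict)      # user ID   -> users being followed
Followers = defaultdict(dict)    # user ID   -> users following
Tweets = {}                      # tweet ID  -> tweet columns
Timeline = defaultdict(dict)     # user ID   -> {timestamp: tweet ID}
Userline = defaultdict(dict)     # user ID   -> {timestamp: tweet ID}


def add_user(username: str, password: str) -> str:
    user_id = str(uuid.uuid4())
    User[user_id] = {"username": username, "password": password}
    Username[username] = {"id": user_id}   # index row for lookup by name
    return user_id


def follow(follower_id: str, followee_id: str) -> None:
    now = time.time_ns()
    Friends[follower_id][followee_id] = now
    Followers[followee_id][follower_id] = now


def post_tweet(user_id: str, body: str) -> str:
    tweet_id = str(uuid.uuid4())
    ts = time.time_ns()
    Tweets[tweet_id] = {"id": tweet_id, "user_id": user_id,
                        "body": body, "timestamp": ts}
    Userline[user_id][ts] = tweet_id            # the author's own tweets
    for follower_id in Followers[user_id]:      # materialize into each timeline
        Timeline[follower_id][ts] = tweet_id
    return tweet_id
```

Reading a user's timeline is then a single-row lookup, which is the point of maintaining the materialized view at write time.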

Example 2 (Cassandra: The Definitive Guide)

Design patterns Materialized View, Valueless Column, Aggregate Key,...? (Cassandra: The Definitive Guide)

Wide Rows: "If your data model has no rows with over a hundred columns, you're either doing something wrong or shouldn't be using Cassandra."
Wide row for grouping
Wide row as a simple index
Composite column names, e.g. Indexes = { "User_Keys_By_Last_Name" : { {"adams", 1} : "e5d...", {"anderson", 1} : "e5f...", {"anderson", 2} : "e71...",...
(Ed Anuff: Cassandra Indexing Techniques)
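The "wide row as a simple index" idea can be sketched with Python tuples standing in for composite column names. The row below reuses the truncated values from the slide verbatim as placeholders; the function name is hypothetical, and the sketch sorts the names on each lookup, whereas Cassandra keeps columns sorted on disk.

```python
import bisect

# One wide row of the Indexes column family:
# composite column name (last_name, sequence) -> user key.
user_keys_by_last_name = {
    ("adams", 1): "e5d...",
    ("anderson", 1): "e5f...",
    ("anderson", 2): "e71...",
}


def users_by_last_name(row: dict, last_name: str) -> list:
    """Return the user keys of all columns whose name starts with last_name."""
    names = sorted(row)
    # (last_name,) sorts before every (last_name, n), so this finds the
    # first matching column, like the start of a column slice.
    start = bisect.bisect_left(names, (last_name,))
    matches = []
    for name in names[start:]:
        if name[0] != last_name:
            break
        matches.append(row[name])
    return matches
```

Because the column names are kept in sorted order, all entries for one last name are contiguous, so the lookup is a single slice of one row rather than a scan.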

Model the queries first Start with your queries. Ask what queries your application will need, and model the data around that instead of modeling the data first, as you would in the relational world. (Cassandra: The Definitive Guide)

How it works: writes [figure] (Cassandra: The Definitive Guide)

How it works: reads [figure] (Cassandra: The Definitive Guide)

Limitations
All data for a single row must fit (on disk) on a single machine in the cluster. Because row keys alone are used to determine the nodes responsible for replicating their data, the amount of data associated with a single key has this upper bound.
A single column value may not be larger than 2GB. (However, large values are read into memory when requested, so in practice "small number of MB" is more appropriate. [might be changed?])
The maximum number of columns per row is 2 billion.
The key (and column names) must be under 64K bytes.
No subcolumn indexing [might be changed?]
(Cassandra wiki)

Hadoop Map/Reduce: Map/Reduce jobs can be written over data stored in Cassandra. Pig and Hive can also work on Cassandra data.

Installation and sample run: see README.txt
tar -zxvf apache-cassandra-$version.tar.gz
... (set up the log and lib directories)
... (edit the config file, or leave the defaults)
server: bin/cassandra -f
CLI client: bin/cassandra-cli --host localhost

Installation and sample run: cassandra-cli database commands, e.g.
create keyspace Keyspace1;    ~ schema / database
use Keyspace1;
create column family Users with...    ~ table (no explicit schema)
set Users[jsmith][first] = 'John';
set Users[jsmith][last] = 'Smith';
set Users[jsmith][age] = long(42);    ~ insert/update: set columns of the row with id jsmith
get Users[jsmith];    ~ select * from Users where id=jsmith
=> (column=last, value=smith, timestamp=1287604215498000)
=> (column=first, value=john, timestamp=1287604214111000)
=> (column=age, value=42, timestamp=1287604216661000)
Returned 3 results.
del Users[jsmith][age];
del Users[jsmith];    ~ delete / update set null: remove columns of a row, or the full row
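The semantics of the set, get and del commands above can be imitated with a small in-memory class. This is a teaching sketch with hypothetical names, not a client for a running server: it stores a microsecond timestamp with every column, and it deletes data immediately, whereas a real node handles deletes differently.

```python
import time


class ColumnFamily:
    """In-memory stand-in for a column family: row key -> {column: (value, ts)}."""

    def __init__(self):
        self.rows = {}

    def set(self, key, column, value):
        # Insert and update are the same operation: the column is overwritten.
        timestamp = time.time_ns() // 1000   # microseconds
        self.rows.setdefault(key, {})[column] = (value, timestamp)

    def get(self, key):
        # Like "get Users[jsmith]": all columns of one row, without timestamps.
        return {col: val for col, (val, _ts) in self.rows.get(key, {}).items()}

    def delete(self, key, column=None):
        # With a column: remove that column only; without: remove the whole row.
        if column is None:
            self.rows.pop(key, None)
        else:
            self.rows.get(key, {}).pop(column, None)


users = ColumnFamily()
users.set("jsmith", "first", "John")
users.set("jsmith", "last", "Smith")
users.set("jsmith", "age", 42)
```

Note that there is no schema step before the set calls: a row simply has whatever columns were written to it, which mirrors the "no explicit schema" remark on the slide.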

Our own experience: on one machine, with a single node... On several machines...

That's the end. Thank you for your attention!