A Distributed, Scalable Database Management System


A distributed, scalable database management system http://cassandra.apache.

1 A distributed, scalable database management system
Molnár András, Garzó András
Friday, July 13


3 Origin
In Greek mythology, Cassandra was the daughter of King Priam and Queen Hecuba of Troy. Cassandra was so beautiful that the god Apollo gave her the ability to see the future. But when she refused his amorous advances, he cursed her such that she would still be able to accurately predict everything that would happen but no one would believe her. Cassandra foresaw the destruction of her city of Troy, but was powerless to stop it. The Cassandra distributed database is named for her. I speculate that it is also named as kind of a joke on the Oracle at Delphi, another seer for whom a database is named. (Cassandra: The Definitive Guide)


4-8 CAP
[Figure: where RDBMSs, Voldemort, Cassandra, HBase, and Bigtable sit with respect to the CAP theorem] (Cassandra: The Definitive Guide)

9 Tuneable consistency set consistency level... (Cassandra: The Definitive Guide)
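Tuneable consistency comes down to choosing, per operation, how many replicas must answer. A minimal sketch of that arithmetic, assuming a replication factor N, W replicas acknowledging each write, and R replicas consulted on each read. The helper names are hypothetical and are not part of any Cassandra client API. A read is guaranteed to overlap the latest acknowledged write when R + W > N.

```python
def quorum(replication_factor: int) -> int:
    """Size of a majority of the replicas."""
    return replication_factor // 2 + 1


def read_sees_latest_write(n: int, w: int, r: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n


N = 3  # assumed replication factor
levels = {"ONE": 1, "QUORUM": quorum(N), "ALL": N}
for w_name, w in levels.items():
    for r_name, r in levels.items():
        verdict = "strong" if read_sees_latest_write(N, w, r) else "eventual"
        print(f"write={w_name:<6} read={r_name:<6} -> {verdict}")
```

With N = 3, writing and reading at QUORUM (2 + 2 > 3) gives overlapping replica sets, while ONE/ONE trades that guarantee for lower latency and higher availability.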

10 What does it say about itself?
Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon's Dynamo and its data model on Google's Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web. That's exactly 50 words. Of course, if you were to recite that to your boss in the elevator, you'd probably get a blank look in return. (Cassandra: The Definitive Guide)


11 What does it say about itself?
[Slide annotation overlaid on the 50-word definition from The Definitive Guide: "Indexed, schema-free, row-oriented data store"]


12 When is it worth using? (examples)
Cassandra vs RDBMS: large data volumes, server clusters...
Cassandra vs HBase: always writable...
Cassandra vs Voldemort: dozens or hundreds of columns...

13 Use case examples
Large deployments: many nodes
Lots of writes, statistics & analysis: always writable
Geographical distribution: data locality
Evolving applications: no strict schema
(Cassandra: The Definitive Guide)


14 Development status
Current version: 1.1.2, released [v1.1.1 released: ]

15 Who supports it?
Third-party solution providers (source: Cassandra wiki)

16 Who uses it?
Twitter is using Cassandra for analytics: for real-time analytics, for geolocation and places of interest data, and for data mining over the entire user store. Mahalo uses it for its primary near-time data store. Facebook still uses it for inbox search, though they are using a proprietary fork. Digg uses it for its primary near-time data store. Rackspace uses it for its cloud service, monitoring, and logging. Reddit uses it as a persistent cache. Cloudkick uses it for monitoring statistics and analytics. Ooyala uses it to store and serve near real-time video analytics data. SimpleGeo uses it as the main data store for its real-time location infrastructure. Onespot uses it for a subset of its main data store. (Cassandra: The Definitive Guide)


17 Who uses it?
Cassandra is in use at Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Digg, CloudKick, Ooyala, and more companies that have large, active data sets. The largest known Cassandra cluster has over 300 TB of data in over 400 machines. More: cassandra.apache.org


18-20 Data Model
Sparse table [Figure] (Cassandra: The Definitive Guide)

21 Data Model
Super column family feature ("becoming deprecated")
"Not officially deprecated, but not highly recommended either" (Ed Anuff: Cassandra Indexing Techniques)

22 Column
name       byte[]  Queried against (predicates); determines sort order
value      byte[]  Opaque to Cassandra
timestamp  long    Conflict resolution (Last Write Wins)
(Cassandra: The Definitive Guide)
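The three parts of a column and the "last write wins" rule can be sketched in a few lines. This is an illustration only: the `Column` type and `reconcile` function are hypothetical names, and the tie-breaking here (keep the first argument on equal timestamps) is a simplification chosen for the sketch.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: bytes      # queried against; determines sort order
    value: bytes     # opaque to the store
    timestamp: int   # supplied by the writer, used for conflict resolution


def reconcile(a: Column, b: Column) -> Column:
    """Last write wins: keep the version with the higher timestamp."""
    return a if a.timestamp >= b.timestamp else b


old = Column(b"email", b"jsmith@example.org", timestamp=1000)
new = Column(b"email", b"john.smith@example.org", timestamp=2000)
print(reconcile(old, new).value)
```

Because the outcome depends only on the timestamps, replicas that receive the two versions in different orders still converge on the same value.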

23 Column sorting
Column names are stored in sorted order according to the value of compare_with:
AsciiType, BytesType, LexicalUUIDType, IntegerType, LongType, TimeUUIDType, UTF8Type, CompositeType... Custom...
(Cassandra: The Definitive Guide)
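The effect of the comparator is easiest to see by sorting the same column names two ways. A minimal sketch, assuming column names held as decimal strings for readability (the real types compare encoded bytes); the `COMPARATORS` table and `sorted_columns` helper are hypothetical stand-ins, not Cassandra classes.

```python
# Hypothetical stand-ins for two comparator types: each maps a column
# name to the key used for ordering the columns within a row.
COMPARATORS = {
    "UTF8Type": lambda name: name,        # compare as text
    "LongType": lambda name: int(name),   # compare as a number
}


def sorted_columns(row: dict, compare_with: str) -> list:
    """Return the row's column names in the comparator's order."""
    return sorted(row, key=COMPARATORS[compare_with])


row = {"10": "c", "9": "b", "100": "d", "1": "a"}
print(sorted_columns(row, "UTF8Type"))  # ['1', '10', '100', '9']
print(sorted_columns(row, "LongType"))  # ['1', '9', '10', '100']
```

The choice matters for range queries: a slice from 9 to 100 only returns a contiguous run of columns when the names are compared numerically.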

24 Row storing & sorting
Column sorting is controllable, but key sorting isn't; row keys always sort in byte order. Rows are stored in an order defined by the partitioner (for example, with RandomPartitioner, they are in random order, etc.). (Cassandra: The Definitive Guide)
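Why rows appear in "random" order under RandomPartitioner can be sketched by ordering keys by a hash instead of by their bytes. This is a simplified illustration that assumes an MD5-derived token; the `token` function is a hypothetical helper and does not reproduce Cassandra's exact token computation.

```python
import hashlib


def token(row_key: bytes) -> int:
    """MD5-based token, in the spirit of RandomPartitioner (simplified)."""
    return int.from_bytes(hashlib.md5(row_key).digest(), "big")


keys = [b"alice", b"bob", b"carol", b"dave"]
byte_order = sorted(keys)               # how the keys compare as bytes
token_order = sorted(keys, key=token)   # how a hash partitioner lays them out
print(byte_order)
print(token_order)
```

Hashing spreads neighbouring keys across the cluster, which balances load but means a range scan over row keys no longer follows key order.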

25 Alternate indexes
Native secondary indexes
Wide rows as lookup and grouping tables
Custom secondary indexes
(Ed Anuff: Cassandra Indexing Techniques)

26 Example

27 Example
User: stores users. Keyed on a unique ID (UUID). Columns for username and password. [Diagram: UUID -> Username, Password]
Username: indexes users. Keyed on username. One column, the unique UUID for the user. [Diagram: Username -> UUID]
(Eric Evans: Hands-on Cassandra)

28 Example
Friends: maps a user to the users (s)he follows. Keyed on user ID. Columns for each user being followed.
Followers: maps a user to those following her/him. Keyed on username. Columns for each user following.
[Diagram labels: UUID, Username, Follows (followees), Followers (followers)]
(Eric Evans: Hands-on Cassandra)

29 Example
Tweets: stores tweets and maps them to users. Keyed on a unique identifier.
Columns: unique identifier, user ID, body of the tweet, timestamp.
[Diagram: TweetID -> UUID, Body, Timestamp]
(Eric Evans: Hands-on Cassandra)

30 Example
Timeline: the materialized view of Tweets for a user. Keyed on user ID. Columns that map timestamps to Tweet ID.
Userline: the collection of Tweets attributed to a user. Keyed on user ID. Columns that map timestamps to Tweet ID.
[Diagram labels: UUID, TweetID (tweetidsof), TweetID (tweetidsto)]
(Eric Evans: Hands-on Cassandra)
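The column families of this example fit together roughly as follows. This is a minimal in-memory sketch using plain dicts (row key -> {column name: value}) rather than a real Cassandra client. The function names are hypothetical, the Friends column family is omitted for brevity, and the write-time fan-out into each follower's Timeline is an assumption about how the materialized view gets populated.

```python
import uuid

# Column families as plain dicts: row key -> {column name: value}.
users, usernames = {}, {}
followers = {}                 # username -> {follower username: ""}
tweets = {}                    # tweet ID -> columns
timeline, userline = {}, {}    # user ID -> {timestamp: tweet ID}


def add_user(username, password):
    user_id = str(uuid.uuid4())
    users[user_id] = {"username": username, "password": password}
    usernames[username] = {"uuid": user_id}   # index row for lookups by name
    return user_id


def follow(follower, followee):
    followers.setdefault(followee, {})[follower] = ""   # valueless column


def post_tweet(username, body, timestamp):
    author_id = usernames[username]["uuid"]
    tweet_id = str(uuid.uuid4())
    tweets[tweet_id] = {"user_id": author_id, "body": body, "timestamp": timestamp}
    userline.setdefault(author_id, {})[timestamp] = tweet_id
    # Assumed fan-out: copy the reference into each follower's timeline.
    for name in followers.get(username, {}):
        reader_id = usernames[name]["uuid"]
        timeline.setdefault(reader_id, {})[timestamp] = tweet_id
    return tweet_id


def read_timeline(user_id):
    """Bodies of the tweets in a user's timeline, newest first."""
    row = timeline.get(user_id, {})
    return [tweets[row[ts]]["body"] for ts in sorted(row, reverse=True)]


alice, bob = add_user("alice", "pw1"), add_user("bob", "pw2")
follow("bob", "alice")
post_tweet("alice", "hello", 1)
post_tweet("alice", "world", 2)
print(read_timeline(bob))  # ['world', 'hello']
```

Reading a timeline touches a single row plus the referenced tweets, with no join; the cost is paid at write time, when each tweet reference is duplicated per follower.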

31 Second example
[Figure] (Cassandra: The Definitive Guide)

32 Design patterns Materialized View, Valueless Column, Aggregate Key,...? (Cassandra: The Definitive Guide)

33 Wide Rows
If your data model has no rows with over a hundred columns, you're either doing something wrong or shouldn't be using Cassandra.
wide row for grouping
wide row as a simple index
composite column names, e.g.
Indexes = { "User_Keys_By_Last_Name" : { {"adams", 1} : "e5d...", {"anderson", 1} : "e5f...", {"anderson", 2} : "e71...",...
(Ed Anuff: Cassandra Indexing Techniques)
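A wide row used as an index works because its columns are kept sorted, so all entries sharing a prefix are adjacent and can be read as one slice. A minimal sketch of that lookup using a sorted Python list and `bisect`; the `slice_by_last_name` helper is hypothetical, and the elided key values are copied from the slide as-is.

```python
import bisect

# One wide row: composite column names (last_name, sequence) -> user key,
# kept sorted the way columns are kept sorted within a row.
index_row = sorted([
    (("adams", 1), "e5d..."),
    (("anderson", 1), "e5f..."),
    (("anderson", 2), "e71..."),
])


def slice_by_last_name(row, last_name):
    """Return the user keys whose composite name starts with last_name."""
    names = [name for name, _ in row]
    start = bisect.bisect_left(names, (last_name,))
    end = bisect.bisect_left(names, (last_name, float("inf")))
    return [value for _, value in row[start:end]]


print(slice_by_last_name(index_row, "anderson"))  # ['e5f...', 'e71...']
```

The second component of the composite name only serves to keep entries with the same last name distinct, which is why two users named "anderson" can coexist in the row.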

34 Model the queries first Start with your queries. Ask what queries your application will need, and model the data around that instead of modeling the data first, as you would in the relational world. (Cassandra: The Definitive Guide)

35 How it works: writes
[Figure] (Cassandra: The Definitive Guide)

36 How it works: reads
[Figure] (Cassandra: The Definitive Guide)

37 Limitations
All data for a single row must fit (on disk) on a single machine in the cluster. Because row keys alone are used to determine the nodes responsible for replicating their data, the amount of data associated with a single key has this upper bound.
A single column value may not be larger than 2GB. (However, large values are read into memory when requested, so in practice "small number of MB" is more appropriate. [might be changed?])
The maximum number of columns per row is 2 billion.
The key (and column names) must be under 64K bytes.
No subcolumn indexing [might be changed?]
(Cassandra wiki)

38 Hadoop Map/Reduce
Map/Reduce jobs can be written against data stored in Cassandra.
Pig and Hive can also work on Cassandra data.

39 Installation and sample run
See README.txt
tar -zxvf apache-cassandra-$version.tar.gz
... (set up the log and lib directories)
... (edit the config file or leave it at the defaults)
server: bin/cassandra -f
CLI client: bin/cassandra-cli --host localhost

40 Installation and sample run
cassandra-cli database commands, e.g.:
create keyspace Keyspace1;             ~ schema / database
use Keyspace1;
create column family Users with...     ~ table (no explicit schema)
set Users[jsmith][first] = 'John';     ~ insert/update: set columns of the row with id jsmith
set Users[jsmith][last] = 'Smith';
set Users[jsmith][age] = long(42);
get Users[jsmith];                     ~ select * from Users where id=jsmith
=> (column=last, value=smith, timestamp= )
=> (column=first, value=john, timestamp= )
=> (column=age, value=42, timestamp= )
Returned 3 results.
del Users[jsmith][age];                ~ delete / update set null: remove columns of a row, or the full row
del Users[jsmith];

41 Our own experience
On a single machine, with one node...
On multiple machines...

42 The end. Thank you for your attention!
