Data-Intensive Information Processing Applications! Session #7. A database perspective on the cloud
|
|
|
- Loraine Roberts
- 10 years ago
- Views:
Transcription
1 Data-Intensive Information Processing Applications! Session #7 Web-Scale Databases A database perspective on the cloud Andreas Thor University of Maryland & University of Leipzig Thursday, March 31,
2 Agenda Cloud! Hadoop/MapReduce 1. Object Storage (Amazon S3) 2. Cloud Database (BigTable/Hbase) Add Database stuff to Hadoop/MapReduce 3. SQL to MapReduce 4. Data Warehouse on top of Hadoop/MapReduce (Hive) DeWitt & Stonebraker. MapReduce: A major step backwards. Stonebraker et. al: MapReduce and parallel DBMSs: friends or foes? CACM
3 Storage Services [Amazon] DeCandia et al. Dynamo: Amazon s Highly Available Key-value Store. SOSP 07 3
4 Object Storage Services Service for storing objects (binary data) in the cloud Upload, storage on multiple nodes, download Simple structure Buckets: simple (flat) containers Objects: arbitrary data (e.g., files), arbitrary size Authentication, access rights Simple API HTTP requests (REST-ful API): PUT, GET, DELETE Used by applications, e.g., DropBox (online backup & sync tool) Performance fast, scalable, high availability Costs pay as you go : #requests, data size, upload/download size Example: Microsoft Azure Storage, Amazon Simple Storage Service (S3) 4
5 Problems Concurrent user access YouTube videos, collaborative work on documents, Problem: concurrent writes Conditional Updates: IF current version =X THEN Update Node-based versioning Data copies on multiple nodes Reliability: Redundancy against node outage Read performance: Multiple clients can read different copies in parallel (locality) Problem: replica synchronization Strong Consistency: Any read access will return the updated version Eventual Consistency: All accesses will eventually return the updated version 5
6 Amazon S3/Dynamo: Overview Amazon S3 is based on Amazon Dynamo Distributed, scalable key-value store designed for small data objects (1MB / key) Characteristics high availability low latency Eventually consistent data store Write access always possible relaxed consistency in favor of availability Performance SLA (Service Level Agreement) response within 300ms for 99.9% of requests for peak client load of 500 requests per second P2P-like structure no master nodes, all nodes have the same functionality each node is aware of data at peers 6
7 Amazon Dynamo: Partitioning Each node is assigned a position in a ring Position= random value of a hash function Node assignment Compute hash value of key Choose next N nodes on ring (clock-wise) Example: Hash(key) between A and B " for N=3: nodes B, C, and D Performant node insert / delete / remove because neighboring nodes affected only Preference list List of N nodes that are assigned for a given key each node has a preference list for all keys Consistent hashing appropriate hash function needed for data locality and load balancing F G E A D Hash(key) B C 7
8 Amazon Dynamo: Data access Key value store interface Primary key access, no complex queries Request to any node of the ring Request will be forwarded to one (first) node of the key s preference list Put (Key, Context, Object) Coordinator creates vector clock (versioning) based on request s context Coordinator writes object + vector clock Replication Write requests to N-1 other nodes out of the preference list Success, if (at least) W-1 nodes succeed asynchronous replica updates for W<N " consistency problems Get (Key) Read request to N nodes of the preference list Return responses from R nodes " may contain multiple versions; list of (object, context) pairs 8
9 Amazon Dynamo: Replication Read/Write Quorum R/W = minimal number of replica nodes that must be synchronized for successful read/write operation Application can adjust (N,R,W) to meet needs for performance, availability, and durability Consistency if R + W > N User/application-controlled conflict resolution for different versions Variants Read-optimized: R=1, W=N Write-optimized: R=N, W=1 Default: (3,2,2) 9
10 Amazon Dynamo: Versioning Example of object versioning Vector Clocks represent dependencies between different versions of the same object " reconcile multiple versions version counter per replica node, e.g., D ([S x, 1]) for object D, node S x, version 1 Vector clock: list of (node, counter) pairs to indicate available object versions 10
11 Amazon Dynamo: Versioning (2) Vector Clocks to determine dependencies between 2 object versions Counters of 1 st vector clock! all counters of 2 nd vector clock! 1 st version is (direct) ancestor and can be deleted otherwise: conflict resolution Read returns all known versions incl. vector clocks subsequent update merges all version Application determines conflict resolution vector clocks part of get/put requests 11
12 Amazon Dynamo: Temporary failures Temporary node failure should be transparent to the user Sloppy Quorum (N, R, W) All operations performed on first N healthy nodes still writable if replica not available (e.g., W=N) Hinted Handoff If node is unavailable, replication request is sent to another node ( hinted replica ) Background job: When original node has recovered, send hinted replica to original node 12
13 Amazon Dynamo: Replica synchronization Hash-Tree (Merkle Tree) for key range Leafs = hash value of key value Parents = hash value of respective child node values Advantages Efficient check if two replicas are identical = roots have same value Efficient recursive identification of out-of-sync sub trees Disadvantages Computational costs during repartitioning (e.g., new nodes) K1 V1 K2 V2 K3 V3 H(...) H(H(H(k 1 ), H(k 2 )), H(H(k 3 ), H(k 4 ))) H(H(k 1 ), H(k 2 )) H(H(k 3 ), H(k 4 )) H(k 1 ) H(k 2 ) H(k 3 ) H(k 4 ) K4 V
14 Amazon Dynamo: Techniques (Summary) Problem Technique Advantage Partitioning Consistent Hashing Scalability High availability of writes Temporary node failure Recovering Vector Clocks + conflict resolution during reads Sloppy Quorum and Hinted Handoff Hash Tree (Merkle Tree) Versioning independent from update frequency High availability; reliable Efficient background synchronization of replicas Additional techniques Gossip protocol for P2P network (new nodes, failure identification, ) 14
15 Amazon S3/Dynamo vs. Azure Storage Amazon Dynamo Azure Storage Partitioning Hash function Object name Dynamically extensible + + Routing P2P hierarchical Replication asynchronous synchronous Consistency Eventual Consistency Strong Consistency Handling concurrent writes Performance during read; multiple versions with vector clock Adjustable by read/write quorum 15 during write; conditional updates Read optimized; CDN (eventual consistency)
16 Web (nonsql) Databases [BigTable] Chang et al. Bigtable: A Distributed Storage System for Structured Data. OSDI 06 [HBase] 16
17 Web Database: Usage scenario Web table Table contains crawled web pages incl. date, time,... Key: web page URL millions/billions of pages Random access Crawler adds / updates web pages Search engine delivers cached version of web pages Batch processing Build search engine index Dynamic web applications (e.g., Facebook) need fast random access to (semi-) structured data 17
18 Google s BigTable Distributed data storage system column-oriented key-value store multi-dimensional Versioning High availability High performance Goals Billions of rows, millions of columns, thousands of version Real-time read/write random access Large data (PB) linear scalability with the number of nodes Idea / techniques Architecture allows efficient but simple data access method no additional overhead (e.g., ACID) HBase is Hadoop implementation of BigTable 18
19 Data model Distributed, multi-dimensional, sorted map (row:string, column:string, time:int64)! string Keys for row and columns time stamp Arbitrary data (Strings / Byte strings) Rows Read and write operations are atomic per row only Data stored in (lexicographical) order of row keys 19
20 Data model (2) Columns can be added dynamically at run-time Column families Group together n similar columns column key = family: qualifier Disk/memory storage w.r.t. to column families (columns of the same family are stored close together ) Time stamp different versions of data per cell garbage collection of older versions ( keep t versions only ) 20
21 Data model (3) Conceptual (alternative) Row Key Time Stamp Column Contents Column Family Anchor com.cnn.www T9 Anchor:cnnsi.com CNN T8 Anchor:my.look.ca CNN.COM T6 <html>.. T5 <html>.. Physical storage Row Key Time Stamp Contents com.cnn.www T6 <html>.. T5 <html>.. Row Key Time Stamp Anchor com.cnn.www T9 Anchor:cnnsi.com CNN T5 Anchor:my.look.ca CNN.COM 21
22 Architecture Data partitioning Rows sorted by key Horizontal table partitioning into tablets Tablet distribution across multiple tablet servers Master Server Assignment: Tablet # Tablet Server Add/delete tablet servers Load balancing for tablet servers Tablet Server Manages 10-1,000 tablets Realizes read and write access Tablet split if tablet too large ( MB) Client Communication with tablet server for reading / writing 22 Tablet Server (GFC Chunk Server) Master Server (GFS Master Server)... Tablets (Chunks) Tablet Server (GFC Chunk Server)...
23 Tablet Location 2-level catalog management with Root and METADATA table Root table Links to all tablets of a METADATA table Stored in 1 Tablet (never split) METADATA table Links to all tablets (of user tables) Identifier: table name + key of last row Table are sorted by key Address space Entry size: 1KB Tablet size: 128MB Addressable tablets: METADATA: 128MB / 1KB = 2 17 tablets User Table: 2 17 $ 2 17 = 2 34 tablets Size of all user tablets: 2 34 $ 128 MB = 2 41 MB = 2 million TB 23
24 Tablet: Read and write access SSTable File (Sorted String Table ) Immutable sorted map Bloom Filter to check if SSTable contains data for row+column Write access Write to transaction Log (for redo) Write to MemTable (RAM) Asynchronous: Compaction Minor: Copy data from MemTable to SSTable (and delete from log) Merge: Merge MemTable and SSTable(s) to new SSTable Major: Remove deleted data (=merge to one SSTable) Read access Read from MemTable and SSTables to find data 24
25 Performance #Read/WriteOps per second for 1000Byte Good scalability for up to 250 tablet servers Write is faster than read Commit-Log is append only; Read requires access to MemTable + SSTable Random reads slowest Access (all) SSTables Scanning and sequential reads are more efficient Make use of sorted keys 25
26 Bigtable vs. RDBMS BigTable / HBase RDBMS Assumption (hardware) failures are prevalent (hardware) failures are rare Replication built-in external Normalization unnormalized data normalized data (3NF) (wide, sparse tables) (compact, redundant free tables) Query key-based access: point and range SQL Scalability linear, unlimited limited (due to ACID, foreign keys, views, trigger,...) Index primary key primary key + secondary indexes Transactions - + Atomicity row level transaction level Consistency Isolated execution No integrity constraints, no referential integrity - + Durability + + Integrity constraints and referential integrity 26
27 MapReduce and SQL [CouchDB] [Data] 27
28 Query transformation (manual) rewrite from SQL to MapReduce Example: CouchDB Document-oriented data store no schema JSON format simple versioning concept Query/view definition specify map and reduce function in Javascript (or other language) 28
29 Example data Conceptional: nested table id name time user camera info tags width height size 1 fish.jpg 17:46 bob nikon [tuna, shark] 2 trees.jpg 17:57 john canon [oak] 3 snow.png 17:56 john canon [tahoe, powder] 4 hawaii.png 17:59 john nikon [maui, tuna] 5 hawaii.gif 17:58 bob canon [maui] 6 island.gif 17:43 zztop nikon [maui] Internal representation as document set (JSON format) { _id : 1, "name":"fish.jpg", time": 17:46","user":"bob,"camera":"nikon", "info":{"width":100,"height":200,"size":12345},"tags":["tuna","shark"]} { _id : 2, "name":"trees.jpg", time": 17:57,"user":"john,"camera":"canon", "info":{"width":30,"height":250,"size":32091},"tags":["oak"]}... 29
30 Selection Selection = attribute value condition SQL:... WHERE attr = xy Map check condition using IF statement return selected document Reduce id function Example SQL: SELECT * FROM table WHERE user = bob id name time user camera info tags width height size 1 fish.jpg 17:46 bob nikon [tuna, shark] 5 hawaii.gif 17:58 bob canon [maui] 30
31 Selection: Example map function (doc) { if (doc.user == bob ) emit (doc.id, doc); } reduce function (key, values) { return values[0]; } {id:1,user:bob...} key value key values {id:1...} {id:2,user:john...} {id:3,user:john...} {id:4,user:john...} {id:5,user:bob...} map 1 {id:1...} 5 {id:5...} shuffle + sort 1 [{id:1...}] 5 [{id:5...}] reduce {id:5...} {id:6,user:zztop...} Alternative emit (null, doc); return values; map key value null {id:1...} null {id:5...} 31 shffl+srt key values null [{id:1...}, {id:5...}] reduce [{id:1...},{id:5...}]
32 Projection Projection = restrict set of attributes SQL: SELECT Attr1, Attr2 FROM... Map create new ( restricted ) document Reduce id function Duplicate removal map: key = projected attributes reduce: return first value Example SQL: SELECT (DISTINCT) user FROM table 32 user bob john john john bob zztop user bob john zztop
33 Projection: Example (w/o duplicate removal) map function (doc) { emit(doc.id,{ user :doc.user}); } reduce function (key, values) { return values[0]; } key value key value {id:1,user:bob...} 1 {user:bob } 1 [{user:bob }] {user:bob } {id:2,user:john...} {id:3,user:john...} {id:4,user:john...} {id:5,user:bob...} map 2 {user:john} 3 {user:john} 4 {user:john} 5 {user:bob} shuffle + sort 2 [{user:john}] 3 [{user:john}] 4 [{user:john}] 5 [{user:bob}] reduce {user:john} {user:john} {user:john} {user:bob} {id:6,user:zztop...} 6 {user:zztop} 6 [{user:zztop}] {user:zztop} 33
34 Projection: Example (w/ duplicate removal) map function (doc) { emit(doc.user,{ user :doc.user}); } reduce function (key, values) { return values[0]; } {id:1,user:bob...} {id:2,user:john...} {id:3,user:john...} {id:4,user:john...} {id:5,user:bob...} {id:6,user:zztop...} map key value bob {user:bob } john {user:john} john {user:john} john {user:john} bob {user:bob} zztop {user:zztop} shuffle + sort 34 key value bob [{user:bob }, {user:bob }] john [{user:john}, {user:john}, {user:john}] zztop [{user:zztop}] reduce {user:bob } {user:john} {user:zztop}
35 Grouping and aggregate functions Grouping Map Divides records into groups based on shared attribute values Produces one record (row) per group Aggregate functions to compute aggregated values (per group), e.g., SUM Key = group attribute values Reduce Return first key value Optional: Apply aggregate function(s) Example SELECT camera, AVG(info.size) as avgsize FROM Table GROUP BY camera camera avgsize canon nikon
36 Grouping and aggregate functions: Example map function (doc) { emit(doc.camera, doc.info.size); } {id:1,user:bob...} {id:2,user:john...} {id:3,user:john...} {id:4,user:john...} {id:5,user:bob...} {id:6,user:zztop...} map key reduce function (key, values) { sum = 0; for (i=0; i<values.length; i++) { sum = sum + values[i]; } return {"camera":keys, avgsize":sum/values.length}; } value nikon canon canon 1253 nikon canon nikon shuffle + sort 36 key value canon [32091, 1253, 49287] nikon [12345, 92834, 50398] reduce {camera:canon, avgsize: } {camera:nikon, avgsize: 51859}
37 Equi-join + multi-valued attribute Equi-join = combine records from two relations based on attribute equality SQL:... WHERE Tab1.Attr1 = Tab2.Attr2 Multi-valued attribute in 1NF 1-to-many, many-to-many relationships equi-joins needed Map Key = join attribute value Reduce Iteration over all value pairs Example (SQL) SELECT Tab1.name AS name1, Tab2.name AS name2 FROM table AS Tab1, table AS Tab2, tagtab AS Tag1, tagtab AS Tag2 WHERE Tag1.id=Tab1.id AND Tag2.id=Tab2.id AND Tag1.tag = Tag2.tag AND Tab1.name < Tab2.name 37 id tag 1 tuna 1 shark 4 maui 4 tuna 5 maui name1 hawaii.png hawaii.gif hawaii.gif fish.jpg name2 island.gif hawaii.png island.gif hawaii.png
38 Equi join + multi-valued attribute: Example (1) map function (doc) { for (i=0; i<doc.tags.length; i++) { emit (doc.tags[i], doc.name); } reduce function (key, values) { var result = new Array(); for (i=0; i<values.length; i++) { for (k=0; k<values.length; k++) { if (values[i]<values[k] { result.push ({name1:values[i], name2:values[k]}); } } } return result; } 38
39 Equi join + multi-valued attribute: Example (2) key value key value {id:1,...} {id:2,...} {id:3,...} {id:4,...} {id:5,...} {id:6,...} map tuna shark oak tahoe powder maui tuna maui maui fish.jpg fish.jpg tree.jpg snow.png snow.png hawaii.png hawaii.png hawaii.gif island.gif shuffle + sort maui oak [hawaii.png, hawaii.gif, island.gif] [tree.jpg] power [snow.png] shark [fish.jpg] tahoe [snow.png] tuna [fish.jpg, hawaii.png] reduce [{name1:hawaii.png, name2: island.gif}, {name1:hawaii.gif, name2:hawaii.png}, {name1:hawaii.gif, name2:island.gif}] [] [] [] [] [{name1:fish.jpg, name2:hawaii.png}] 39
40 MapReduce and Data Warehouses [Hive] Thusoo et.al.: Hive-a petabyte scale data warehouse using hadoop. ICDE 2010 [HiveUrl] [Hive1] [Hive2] [Hive3] 40
41 Hadoop/MR vs. Parallel DBS Hadoop/MR advantages Scalability, fault tolerance configuration effort, costs no initial data loading Parallel DBS advantages Declarative query language Queries run faster by order of magnitude Support for compressed data Random access Common use cases MapReduce ETL Data mining, data clustering Analysis of semi-structured data (e.g., web log files) Ad-hoc data analysis 41
42 Data analysis: Facebook Facebook 4TB compressed data per day 135TB compressed data are analyzed per day Aggregations #clicks/page views per day/month/... Ad-hoc analysis How many uploaded pictures per county / state on New Year s Eve? Data Mining User profiles based on attributes (#pageviews, #sessions, time,...) Spam detection (suspicious) frequent patterns in user generated content Analysis / optimization of online advertisement #AdClicks per user (type)... 42
43 Hive Data Warehouse based on Hadoop Hive = MapReduce + SQL SQL is simple and widely-used MapReduce scalability Automatic translation SQL to MapReduce necessary Programs hard to maintain, almost no reuse Difficult for non experts Limited expressiveness, e.g., long code (development time!) to realize simple count/ average queries 43
44 Hive: Overview Management and analysis of structured data using Hadoop no OLTP database, high latency File-based data storage (HDFS) metadata for mapping files to tables complex data types (e.g., list, map) direct file access, different data formats HiveQL queries are executed using MapReduce include scripts (e.g., written in Python) in queries metadata, e.g., for optimizing joins Scalability and fault tolerance HDFS + MapReduce Extensibility User-Defined Table-Generating Functions (UDTF) User-Defined Aggregate Functions (UDAF) 44
45 Hive: Architecture Metastore Tables, columns (type) Location, partitions Information on (de)serialization CLI / Web-GUI Browse metastore Send queries Thrift Cross-language Service " HiveQL Compiler + Optimizer Query optimization and translation of HiveQL query to DAG of MapReduce jobs Executor Execute MR-jobs of DAG 45
46 Hive: Data type & data access Data types simple and composite data types list, map Flexible (de)serialization of tables multiple (user-defined) format, e.g. XML, JSON, CSV multiple storage engines, e.g., file Advantages no initial data loading into data warehouse (no data replication!) no data transformation to relational model but direct file access Disadvantages no pre-processing, e.g., indexing always full (file) table scan necessary 46
47 Hive: Tables, partitions, and files Table links to existing file(s) in HDFS Table has corresponding HDFS directory: /wh/pvs Definition of columns for data partitioning /wh/pvs/ds= /ctry=us /wh/pvs/ds= /ctry=ca Bucketing: Split data of a directory based on hash value /wh/pvs/ds= /ctry=us/part /wh/pvs/ds= /ctry=us/part Clicks table ds= ds= partitions (multiple levels possible) HDFS files (Hash buckets possible) 47
48 Hive: Table Create CREATE EXTERNAL TABLE pvs (userid int, pageid int, ds string, stry string) PARTITIONED ON(ds string, ctry string) STORED AS textfile LOCATION /path/to/existing/file Load status_updates (user_id int, status string, ds string) LOAD DATA LOCAL INPATH /logs/status_updates INTO TABLE status_updates PARTITION (ds= ) 48
49 Hive-QL Similar to SQL Selection, projection, equi-join, union, sub-queries, group by, aggregate functions Sort by vs. order by Extend queries by MapReduce scripts UDF, may operate on complex data structures (lists, map) FROM ( FROM pv_users SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS(dt, uid) CLUSTER BY(dt) ) map INSERT INTO TABLE pv_users_reduced SELECT TRANSFORM(map.dt, map.uid) USING 'reduce_script' AS (date, count); 49
50 Hive-QL: Query transformation Hive-QL query is transformed into DAG (directed acyclic graph) Nodes: operators TableScan Select, Extract Filter Join, MapJoin, Sorted Merge Map Join GroupBy, Limit Union, Collect FileSink, HashTableSink, ReduceSink UDTF Graph represents data flow multiple (parallel) Map/Reduce phases possible 50
51 Hive-QL: Query transformation (Example) Example SELECT * FROM status_updates WHERE status LIKE michael jackson 51
52 Hive-QL: Query transformation (Example) (2) SELECT COUNT(*) FROM status_updates WHERE ds= All updates Materialize map output Updates per Map 52
53 Hive: Query transformation and optimization DAG can become very complex Optimization techniques Ignore unnecessary columns Apply selection as early as possible Ignore unnecessary partitions 53
54 Hive: Join page_view pageid userid... key value INSERT INTO TABLE pv_users SELECT pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid) key value pv_users pageid age <R,1> 111 <R,1> user userid age... map 111 <R,2> 222 <R,2> key value shuffle + sort 111 <R,2> 111 <S,25> key value reduce 2 25 pageid age <S,25> 222 <R,2> <S,32> 222 <S,32> Key = Join-Key, Value has flag (R or S) to distinguish between tables Multi-way join using the same join key " 1 MapReduce job Multi-way join using n join keys " n MapReduce jobs 54
55 MapJoin: Performance improvement MapJoin small table as additional map input can be transformed into hash table no reduce necessary n way join possible if n-1 tables can be made available as additional map input Dynamic optimization Determine small/large table at runtime Apply MapJoin if possible, e.g., if small table(s) fit into memory page_view pageid userid user userid age HashTable pageid userids 1 [111,222] 2 [111] pv_users pageid age map
56 Hive: Group By pv_users pageid age pageid age map key value <1,25> 2 key value <1,25> 1 <2,32> 1 INSERT INTO TABLE pageid_age_sum SELECT pageid, age, count(*) FROM pv_users GROUP BY pageid, age shuffle + sort key <1,25> 2 <1,25> 1 key <2,32> 1 value value reduce pageid_age_sum pageid age count pageid age count Key = group attributes Reduce = aggregation function pre-aggregation using a map combiner is possible (e.g., (<1,25>,2)) 56
57 User-defined scripts Include user-defined scripts in HiveQL queries using TRANSFORM operator Data (de)serialization Transfer via stdin/stdout user userid age userid authority_value computeauthorityvalue.py import sys for line in sys.stdin: id = line.strip()... compute authval... print '\t'.join([id, authval]) ADD FILE computeauthorityvalue.py; SELECT TRANSFORM (userid) USING computeauthorityvalue.py' AS id, authority_value FROM user 57
58 Hadoop/MR vs. Parallel DBS Hadoop / MapReduce Data size PB TB-PB 58 Shared Nothing-RDBMS Data structure semi-structured data static schema Partitioning Blocks in DFS (byte-wise) Horizontal Query MapReduce programs Declarative (SQL) Data access Batch via indexes (e.g., range) Updates Write once read many times Read and write many times Scheduling Runtime Compile-time Processing Parse tuples at runtime efficient access to attributes (Storage Manager) Data flow Pull materialize intermediate results Push tuple pipelining between operators Fault tolerance Restart map/reduce task query restart (operator restart) Scalability linear, unlimited linear, limited Hardware heterogeneous (cheap commodity homogeneous (expensive high end hardware) hardware) Software free, open source very expensive
59 Summary New database-like developments in the cloud Database techniques integrated in Hadoop/MR There is many many more Pig Latin a programming language for MapRedue-based data processing HadoopDB a hybrid of Hadoop/MR and RDBMS Megastore BigTable + ACID Dremel ad-hoc query system for analysis of read-only nested data RDBMS in the Cloud e.g., IBM DB2 running on Amazon EC2 Data management optimizations in the cloud e.g., load balancing... 59
Cloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13
Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2012/13 Hadoop Ecosystem Overview of this Lecture Module Background Google MapReduce The Hadoop Ecosystem Core components: Hadoop
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Big Data. Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich
Big Data Donald Kossmann & Nesime Tatbul Systems Group ETH Zurich MapReduce & Hadoop The new world of Big Data (programming model) Overview of this Lecture Module Background Google MapReduce The Hadoop
Open source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Big Table A Distributed Storage System For Data
Big Table A Distributed Storage System For Data OSDI 2006 Fay Chang, Jeffrey Dean, Sanjay Ghemawat et.al. Presented by Rahul Malviya Why BigTable? Lots of (semi-)structured data at Google - - URLs: Contents,
Using distributed technologies to analyze Big Data
Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/
Lecture Data Warehouse Systems
Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores
Xiaoming Gao Hui Li Thilina Gunarathne
Xiaoming Gao Hui Li Thilina Gunarathne Outline HBase and Bigtable Storage HBase Use Cases HBase vs RDBMS Hands-on: Load CSV file to Hbase table with MapReduce Motivation Lots of Semi structured data Horizontal
Hypertable Architecture Overview
WHITE PAPER - MARCH 2012 Hypertable Architecture Overview Hypertable is an open source, scalable NoSQL database modeled after Bigtable, Google s proprietary scalable database. It is written in C++ for
Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12
Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12 What is a NoSQL Database? 1. A key/value store Basic index manager, no complete query language
Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop
Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop Why Another Data Warehousing System? Data, data and more data 200GB per day in March 2008 12+TB(compressed) raw data per day today Trends
Cloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
HIVE. Data Warehousing & Analytics on Hadoop. Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team
HIVE Data Warehousing & Analytics on Hadoop Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team Why Another Data Warehousing System? Problem: Data, data and more data 200GB per day in March 2008 back to
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
Can the Elephants Handle the NoSQL Onslaught?
Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented
Lecture 10: HBase! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the
Big Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform interactive execution of map reduce jobs Pig is the name of the system Pig Latin is the
Comparing SQL and NOSQL databases
COSC 6397 Big Data Analytics Data Formats (II) HBase Edgar Gabriel Spring 2015 Comparing SQL and NOSQL databases Types Development History Data Storage Model SQL One type (SQL database) with minor variations
SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford
SQL VS. NO-SQL Adapted Slides from Dr. Jennifer Widom from Stanford 55 Traditional Databases SQL = Traditional relational DBMS Hugely popular among data analysts Widely adopted for transaction systems
Hadoop/Hive General Introduction
Hadoop/Hive General Introduction Open-Source Solution for Huge Data Sets Zheng Shao Facebook Data Team 11/18/2008 Data Scalability Problems Search Engine 10KB / doc * 20B docs = 200TB Reindex every 30
Distributed Systems. Tutorial 12 Cassandra
Distributed Systems Tutorial 12 Cassandra written by Alex Libov Based on FOSDEM 2010 presentation winter semester, 2013-2014 Cassandra In Greek mythology, Cassandra had the power of prophecy and the curse
How To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI
Data Warehouse Overview. Namit Jain
Data Warehouse Overview Namit Jain Agenda Why data? Life of a tag for data infrastructure Warehouse architecture Challenges Summarizing Data Science peace.facebook.com Friendships on Facebook Data Science
Introduction to Apache Hive
Introduction to Apache Hive Pelle Jakovits 1. Oct, 2013, Tartu Outline What is Hive Why Hive over MapReduce or Pig? Advantages and disadvantages Running Hive HiveQL language Examples Internals Hive vs
How To Improve Performance In A Database
Some issues on Conceptual Modeling and NoSQL/Big Data Tok Wang Ling National University of Singapore 1 Database Models File system - field, record, fixed length record Hierarchical Model (IMS) - fixed
MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context
MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able
Big Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected]
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected] Hadoop, Why? Need to process huge datasets on large clusters of computers
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
Practical Cassandra. Vitalii Tymchyshyn [email protected] @tivv00
Practical Cassandra NoSQL key-value vs RDBMS why and when Cassandra architecture Cassandra data model Life without joins or HDD space is cheap today Hardware requirements & deployment hints Vitalii Tymchyshyn
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
Data Management in the Cloud -
Data Management in the Cloud - current issues and research directions Patrick Valduriez Esther Pacitti DNAC Congress, Paris, nov. 2010 http://www.med-hoc-net-2010.org SOPHIA ANTIPOLIS - MÉDITERRANÉE Is
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Hadoop and Hive Development at Facebook. Dhruba Borthakur Zheng Shao {dhruba, zshao}@facebook.com Presented at Hadoop World, New York October 2, 2009
Hadoop and Hive Development at Facebook Dhruba Borthakur Zheng Shao {dhruba, zshao}@facebook.com Presented at Hadoop World, New York October 2, 2009 Hadoop @ Facebook Who generates this data? Lots of data
Comparison of the Frontier Distributed Database Caching System with NoSQL Databases
Comparison of the Frontier Distributed Database Caching System with NoSQL Databases Dave Dykstra [email protected] Fermilab is operated by the Fermi Research Alliance, LLC under contract No. DE-AC02-07CH11359
A Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
Alternatives to HIVE SQL in Hadoop File Structure
Alternatives to HIVE SQL in Hadoop File Structure Ms. Arpana Chaturvedi, Ms. Poonam Verma ABSTRACT Trends face ups and lows.in the present scenario the social networking sites have been in the vogue. The
Apache Hadoop FileSystem and its Usage in Facebook
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System [email protected] Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
Bigtable is a proven design Underpins 100+ Google services:
Mastering Massive Data Volumes with Hypertable Doug Judd Talk Outline Overview Architecture Performance Evaluation Case Studies Hypertable Overview Massively Scalable Database Modeled after Google s Bigtable
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Big Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform Pig is the name of the system Pig Latin is the provided programming language Pig Latin is
!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets
!"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.
Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation
Facebook: Cassandra Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/24 Outline 1 2 3 Smruti R. Sarangi Leader Election
extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010
System/ Scale to Primary Secondary Joins/ Integrity Language/ Data Year Paper 1000s Index Indexes Transactions Analytics Constraints Views Algebra model my label 1971 RDBMS O tables sql-like 2003 memcached
Distributed Data Stores
Distributed Data Stores 1 Distributed Persistent State MapReduce addresses distributed processing of aggregation-based queries Persistent state across a large number of machines? Distributed DBMS High
Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
A Review of Column-Oriented Datastores. By: Zach Pratt. Independent Study Dr. Maskarinec Spring 2011
A Review of Column-Oriented Datastores By: Zach Pratt Independent Study Dr. Maskarinec Spring 2011 Table of Contents 1 Introduction...1 2 Background...3 2.1 Basic Properties of an RDBMS...3 2.2 Example
Data Management in the Cloud
Data Management in the Cloud Ryan Stern [email protected] : Advanced Topics in Distributed Systems Department of Computer Science Colorado State University Outline Today Microsoft Cloud SQL Server
Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu
Lecture 4 Introduction to Hadoop & GAE Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu Outline Introduction to Hadoop The Hadoop ecosystem Related projects
COURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University 2013-03-05
Introduction to NoSQL Databases Tore Risch Information Technology Uppsala University 2013-03-05 UDBL Tore Risch Uppsala University, Sweden Evolution of DBMS technology Distributed databases SQL 1960 1970
A programming model in Cloud: MapReduce
A programming model in Cloud: MapReduce Programming model and implementation developed by Google for processing large data sets Users specify a map function to generate a set of intermediate key/value
Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13
Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex
MapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside [email protected] About Journey Dynamics Founded in 2006 to develop software technology to address the issues
Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013
Petabyte Scale Data at Facebook Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013 Agenda 1 Types of Data 2 Data Model and API for Facebook Graph Data 3 SLTP (Semi-OLTP) and Analytics
Introduction to Apache Hive
Introduction to Apache Hive Pelle Jakovits 14 Oct, 2015, Tartu Outline What is Hive Why Hive over MapReduce or Pig? Advantages and disadvantages Running Hive HiveQL language User Defined Functions Hive
Hadoop Architecture and its Usage at Facebook
Hadoop Architecture and its Usage at Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System [email protected] Presented at Microsoft Research, Seattle October 16, 2009 Outline Introduction
NoSQL Data Base Basics
NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS
Introduction to Apache Cassandra
Introduction to Apache Cassandra White Paper BY DATASTAX CORPORATION JULY 2013 1 Table of Contents Abstract 3 Introduction 3 Built by Necessity 3 The Architecture of Cassandra 4 Distributing and Replicating
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
Challenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
SQL Server 2016 New Features!
SQL Server 2016 New Features! Improvements on Always On Availability Groups: Standard Edition will come with AGs support with one db per group synchronous or asynchronous, not readable (HA/DR only). Improved
Hive User Group Meeting August 2009
Hive User Group Meeting August 2009 Hive Overview Why Another Data Warehousing System? Data, data and more data 200GB per day in March 2008 5+TB(compressed) raw data per day today What is HIVE?» A system
COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015
COSC 6397 Big Data Analytics 2 nd homework assignment Pig and Hive Edgar Gabriel Spring 2015 2 nd Homework Rules Each student should deliver Source code (.java files) Documentation (.pdf,.doc,.tex or.txt
Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
Big Data Storage, Management and challenges. Ahmed Ali-Eldin
Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big
Hadoop Distributed File System. -Kishan Patel ID#2618621
Hadoop Distributed File System -Kishan Patel ID#2618621 Emirates Airlines Schedule Schedule of Emirates airlines was downloaded from official website of Emirates. Originally schedule was in pdf format.
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
Hadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System [email protected] Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
Map Reduce / Hadoop / HDFS
Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview
Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.
Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new
Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens
Realtime Apache Hadoop at Facebook Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens Agenda 1 Why Apache Hadoop and HBase? 2 Quick Introduction to Apache HBase 3 Applications of HBase at
THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES
THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon [email protected] [email protected] XLDB
Big Data Development CASSANDRA NoSQL Training - Workshop. March 13 to 17-2016 9 am to 5 pm HOTEL DUBAI GRAND DUBAI
Big Data Development CASSANDRA NoSQL Training - Workshop March 13 to 17-2016 9 am to 5 pm HOTEL DUBAI GRAND DUBAI ISIDUS TECH TEAM FZE PO Box 121109 Dubai UAE, email training-coordinator@isidusnet M: +97150
Big Data Primer. 1 Why Big Data? Alex Sverdlov [email protected]
Big Data Primer Alex Sverdlov [email protected] 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.
A Comparison of Approaches to Large-Scale Data Analysis
A Comparison of Approaches to Large-Scale Data Analysis Sam Madden MIT CSAIL with Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, and Michael Stonebraker In SIGMOD 2009 MapReduce
Dynamo: Amazon s Highly Available Key-value Store
Dynamo: Amazon s Highly Available Key-value Store Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and
Map Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
Cassandra A Decentralized, Structured Storage System
Cassandra A Decentralized, Structured Storage System Avinash Lakshman and Prashant Malik Facebook Published: April 2010, Volume 44, Issue 2 Communications of the ACM http://dl.acm.org/citation.cfm?id=1773922
Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de
Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft
Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
Introduction to Hadoop
Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh [email protected] October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples
Big Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
Hadoop and its Usage at Facebook. Dhruba Borthakur [email protected], June 22 rd, 2009
Hadoop and its Usage at Facebook Dhruba Borthakur [email protected], June 22 rd, 2009 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed on Hadoop Distributed File System Facebook
ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
How to Choose Between Hadoop, NoSQL and RDBMS
How to Choose Between Hadoop, NoSQL and RDBMS Keywords: Jean-Pierre Dijcks Oracle Redwood City, CA, USA Big Data, Hadoop, NoSQL Database, Relational Database, SQL, Security, Performance Introduction A
