NoSQL Motivation. RDBMS values. Admin [1] Title: CS61 Lecture 15 Author: Charles C. Palmer Date: May 26, 2015
|
|
|
- Ursula Neal
- 10 years ago
- Views:
Transcription
1 Title: CS61 Lecture 15 Author: Charles C. Palmer Date: May 26, 2015 CS61 Lecture Notes [1] NoSQL Admin Slides: 15 - NoSQL.ppt.pdf SQL: FriendsOfFriends.sql NoSQL Motivation SLIDE 15 1 In NoSQL, SQL refers to traditional RDBMS s. It arose due to the fact that for some systems we might want to use some other system. Thus, Not Only SQL. RDBMS values Efficient, reliable, convenient, and safe multi-user storage of and access to massive amounts of persistent data Widom ACID Persistent data access We want our data when we want it! Archives, warehousing, or even this morning. Analytics and BI main memory is fast, but finite Concurrency Never let the customer wait internal as well as external Integration with business processes large # of processes and applications need to share data. resulting in a shared database environment
2 Some of these capabilities may be more than we need, Some of these capabilities may prevent us from doing what we want! Impedance Mismatch Developers think in data structures that reside in memory. These in-memory data structures are typicaly more flexible and more representative of the real world. Example: The data making up a product order appears as an entity on the UI, but it really represents several entries in various RDBMS tables, such as customers, orders, order items, creditcardinfo, etc., with all the foreign keys and such to ties them together. In memory it s likely a single or a few data structures. The issue is the requisite simplicity of RDBMS entries. the values in an RDBMS tuple must be simple - no vectors or structure or nested data structures. So, if it makes sense to use a richer in-memory data structure, you have to struggle sometimes to fit it into an RDBMS. Thus, the term impedance mismatch : two different representations requiring translation back and forth. The rise of Web Services There are other mismatches * in the RDBMS world common schemas must match. - If the customers table columns change then some queries and applications may fail. - Indexes needed by one application might get in the way of another Place a single application in front of the RDBMS and have it channel, convert, map, etc., requests to it. Only this one application has to know about the DB structure. This moves interoperability requirements to the interfaces, and thus to the network. The deluge Next came the problem of the three V s, leading to scalability challenges. Volume of data large data sets become difficult to use when stored in RDBMS queries begin to take too long, joins get HUGE ( join pain ) due to the underlying data model which tends to build a set of all possible answers to a query before filtering to get the right one. Velocity of data
3 rate that data changes over time rarely static - bursty data store must be able to handle regular high-speed writes as well as bursts the values of specific properties can change, sometimes quite rapidly The structure of the data can vary over time as well, sometimes quite rapidly causes business needs can change quickly the data being collected can also change quickly Variety of data data may sometimes be well structured, and unstructured other times dense, sparse, connected or not Emergence This Led to the NoSQL (Not Only SQL) meaning a non-relational data store NoSQL consistency models are more lenient SO companies had to choose between bigger and bigger centralized computers, but at the time that only worked so far. Concerns about resilience arose, and the very real drive toward widely distributed, alwayson, applications encouraged clusters of commodity hardware instead. The RDBMS s of the day were NOT designed to run on clusters Some systems could handle writing to shared disk subsystems, but that wasn t always possible in widely distributed systems AND it didn t help resilience. The web has led to applications that must easily store and process data which is bigger in volume, changes more rapidly, is often sparse across the domains, and is more structurally varied than traditional RDBMS can handle. CAP Theorem (Brewer 2000) Proven by Lynch and Gilbert in 2003 It is impossible for a web service to provide the following three guarantees
4 Consistency The same definition we ve been using. (like the ATM example) Availability Brewer means if you can talk to a node in a cluster, the node can read and write data. Partition-tolerance the cluster can survive comm link breaks that separate the cluster into multiple parts Slide 15 2 Example Alice in London, Bob in Mumbai, both want to reserve a room at Hotel Zed in Paris Reservation system has two nodes, one in London and the other in Mumbai, with a comm-link There s only one room left Problems can arise If there was only one server It s consistent and available (if it s up) It can t be partitioned This is the typical DB today The two nodes (systems) have to agree on serializing their requests * This fails if the link goes down One node can be Primary, the other a Secondary Requests from the Secondary must go thru the Primary If the link goes down, we have a failure of Availability in Brewer s terms since Alice can talk to London but can t make a reservation Both nodes can be allowed to continue to take reservations, but overbooking can result Broken consistency This is might be acceptable since there are usually no-shows to remedy any overbooking (OPTIONAL) Other implications of relaxing ACID
5 Relaxing durability In-memory DB would have great performance, but all is lost after a shutdown Capturing telemetry at high speed might be more important than if you miss some data is the server crashes Replicated data might be lost if the Secondary sends it up to the Primary butthe Primary crashes before replicating it back out to the others. ACID is nice, but it comes with a price ACID vs. BASE (from Brewer) Basic Availability: works most of the time. Trade-off is really latency rather than availability. Soft-state: Stores don t have to be write-consistent, nor do different replicas have to be mutually consistent at all times. Eventually consistent: Stores exhibit consistency at some later point Amazon example of ALWAYS being able to write to your shopping cart, even if you end up with several. On checkout just collect them all and let the customer decide. e.g., lazily at read time). Of course, this more lenient data store simply transfers to the program the responsibility of figuring out any data problems. For example, what about durability? Surely this is inviolate? * Consider in-memory DB s that support extremely high-throughput apps. Perhaps a lazy replication of the data to disk store is sufficient to cover the transactions and if there s a crash only a few transactions may be lost. * strict durability systems depend on transaction logs. These can be slow without hardware assist, but using hardware allows for the possibility that the system will fail due to a power loss. What kinds of applications are like/tolerate this? * data acquisition systems - can afford to miss a data point * in high performance scanarios you may use hardware buffering to allow faster writes to NoSQL emergence led to another advancement: Polyglot Persistence Instead of always using an RDBMS, we choose a data storage mechanism based on the nature of the data being stored and how we want to use it.
6 As a result, an organization may employ a veriety of persistent data stores. whichever one is best for the data and the circumstances. Primary reasons to consider NoSQL 1. ability to handle data access with sizes and performance that necessitate a cluster of machines. 2. improvement of productivity of app development by using a more convenient or intuitive data interaction style. 3. 3V s data, and including sparse data 4. desire for highly distributed systems running on commodity hardware Shift to aggregates RDBMS tables cannot hold nested data. The table entries must be simple types and consistent (no variable types). Nice, consistent, and enables RDBMS efficient operation, but limiting. Aggregate orientation assumes that you sometimes want to operate on data with more complex structures than simple scalars of base types. Aggregates also more closely match the in-memory datastructures used by developers. on board example: think of the Amazon order system RDBMS would use several tables: * customer, * billing address, * shipping address (variable, so must select from a table with a ShippingAddressID), * Cart containing Items (each of which is a product with attributes), * Each item has shipping preferences (speed, giftwrapping), and payment info * payment info is ccard/giftcard/etc. Show with multiplicities An aggregate system might group them this way (JSON): // customer info collection { "id":42, "fname":"charles", "mi":"c", "lname":"palmer", "billingaddress":[ "streetaddr1":"2103 Xenon Way", "city":"santa Fe", "St":"NM" ] } // carts
7 { } "id": , "customer":42, "itemlist":[ { "UPC": , "price":79.95, "name":"apple 85w power adapter", "shippingspeed":"2nd day" }, { "UPC": , "price":59.95, "name":"apple touchpad mouse" } ], "shippingaddress":[ "streetaddr1":"42 Wrinkled Way", "city":"taos", "St":"NM" ] "paymentinfo":[ { "ccard":" abc-def0", "exp":"0115", "ccxact":"111111" } ] Survey of NoSQL Databases SLIDE 15 3 Key-Value and Document Store Databases Primarily thought of as a collection of aggregates Key-Value DB SLIDE 15 4 the aggregate is opaque to the database - a blob Store whatever you want - no structure. Only access the blob via a key. You can only search via a key - you cannot search within the blob. it s opaque.
8 Very different from RDBMS, where ad hoc queries are allowed. The application can store its data in a schema-less manner. It can be stored as a datatype of a programming language, such as an object. There is no fixed data model. Fields may be added at will and may be nonuniform or even nested. Fields may not be updated, however, only replaced. They act like large, distributed hash maps where these usually opaque values are stored and retrieved by a key. The Key space of the hash map is spread across multiple buckets on systems in the network. Keys like Usernames, addresses, Lat-Long, SSN s, zip codes, etc. are all typical for Fault tolerance you replicate the buckets on multiple systems. The various machines are not full replicas of each other, to aid in load-balancing Some systems do consistent hashing If the target system is down for a WRITE, the hash is temporarily remapped and the data stored. When the system comes back up, the data is moved to where it should be. If you don t have the key, most of these DB s won t provide you with a listing So you need to ensure you can get the key (or generate it), or it s based on a timestamp or other external info, or that users will always remember it ;-) Typical uses Storing session information User preferences Shopping Cart Data Document Store DB Example
9 First example is a row in a regular RDBMS Second won t fit into the same RDBMS since it has different attribute names This isn t a problem in a Document Store DB * the database can see the structure in the aggregate. A document is essentially a list of keys and values. Each field can have 0, 1, or many values Limits what you can put in it, but you gain flexibility in accessing it queries can be based on the fields within the document aggregate there are no strict schemas of the document content - you can if you need to You can search on any of the fields, and retrieval can be parts instead of the whole aggregate indices can be built on those fields Typical Uses Event logging lots of different formats, depending on what subsystem is making a log entry Blogging platforms Web analytics it s easy to update certain fields of a document as the information being monitored changes page hit counts network traffic analyses Column-Family Stores From Google s BigTable 2006 paper static.googleusercontent.com bigtable-osdi06.pdf Essentially a two-level aggregate store. A key is used to select a row which holds a map of links to the various aggregates
10 These second-level aggregates are referred to as Columns Queries can pick out specific columns, such as GET ( 1234, name ) The Columns are gathered into column families with other columns that are often referenced together. Typical uses Blogs Store blog entries with tags, categories, links, etc., in separate columns. Comments can be in the same row, or place in a separate keyspace. Usage/license expiration A column can be set to expire in some of these systems, and when that happens it is removed from the DB Shards Since these NoSQL data stores are focused on performance, it can be useful to distribute different parts of the data on different servers sharding Something in the key identifies which node will have the data Each of those servers handles the reads and writes for that particular set of shards Simple sharing can bring trouble if the server for a set of keys goes down The aggregate model offers a convenient way to chop up the data e.g., if certain aggregates are more likely to be accessed from Boston, place them in shards nearby e.g., to balance the load on aggregates equally likely to be accessed, distribute them evenly across the servers. Replication may also be employed for performance and resilience Example: Lotus Notes
11 The Secure Distributed Storage idea arises here Graph Databases SLIDE 15 9 While the others were motivated by the need to run on clusters, Graph DB s were motivated by a need for smaller records with complex and dynamic interconnections (relations). While you can do similar things with RDBMS s, as the relationships get increasingly complex, the joins required to find what you re looking for leads to poor performance. When we store a graph-like structure in a RDBMS, a single relationship isn t too hard: Manages As we need to add more relationships, it begins to get complicated In the RDBMS, we essentially have to think of the traversals we will want, and then to design the RDBMS accordingly If you want to add/remove relationships dynamically, well you can t really. Graph DB s shift the bulk of the work of navigating the graph to INSERTs, leaving QUERYs as fast as possible. QUERY is really just a very fast traversal of the graph from some starting point. The relationship between nodes isn t calculated during a QUERY, the relationship is always already there. There is no limit to the number and kind of relationships a node can have Relationships can be dynamically created / deleted Some graph DB systems like Neo4j allow you to specify Java objects as a relationship, while others have simpler relationship capabilities Example queries SLIDE FIND AN EMPLOYEE OF BigCO WITH A FRIEND WHO LIKES THE SAME BOOK IN THE DATABASE CATEGORY THAT WAS WRITTEN BY TWO FRIENDS. FIND THE BOOKS IN THE DATABASE CATEGORY THAT ARE WRITTEN BY SOMEONE WHOM A FRIEND OF MINE LIKES no match Example vs. RDBMS
12 See this simple SQL script (in MySQLWorkbench) Query for BOB s friends: SELECT PersonFriend.friend Query for the inverse: who are friends with Bob? (directional) SELECT Person.person NOW we ask Who are Alice s friends of friends? SELECT pf1.person as PERSON, pf3.person as FRIEND F RIEND O F Typical uses are apps requiring lots of relationships and very fast QUERYs Connected Data, such as Social Networks or (like Facebook or tracking down members of criminal groups) Routing, dispatch, and location-based services relationships between nodes can have a distance property Using distance and location properties in a graph of points of interest can enable an application to make recommendations of nearby services Recommendation engines which products are usually bought together which products when bought together should raise an alarm (fertilizer bomb components) searching for patterns in relationships that might indicate fraudulent transactions MongoDB Example Mongo Shell 1. mongo 2. use test 3. db.restaurant.find() // finds everything 4. db.restaurants.find().sort( { borough : 1, address.zipcode : 1 } ) 5. db.restaurants.find({ name : Juni }) // find one 6. db.restaurants.update( { name : Juni }, { $set: { cuisine : American (New) },
13 $currentdate: { lastmodified : true } } ) 7.db.restaurants.find({ name : Juni }) // see the change 7. //aggregation db.restaurants.aggregate( [ { $group: { _id : $borough, count : { $sum: 1 } } } ] ); 8. Filter and group aggregates db.restaurants.aggregate( [ { $match: { borough : Queens, cuisine : Brazilian } }, { $group: { "id : $address.zipcode, count": { $sum: 1 } } } ] ); The result has id field representing the zipcode group-by spec. pymongo from pymongo import MongoClient... client = MongoClient() //create a collection on localhost client = MongoClient("mongodb://mongodb0.example.net:27019") // or somewhere else... db = client.test // assign db to database named test //or db = client['test']... from datetime import datetime result = db.restaurants.insert_one( { "address": { "street": "2 Avenue", "zipcode": "10075", "building": "1480", "coord": [ , ] }, "borough": "Manhattan", "cuisine": "Italian", "grades": [ { "date": datetime.strptime(" ", "%Y-%m-%d"), "grade": "A", "score": 11 }, { "date": datetime.strptime(" ", "%Y-%m-%d"), "grade": "B", "score": 17 } ], "name": "Vella", "restaurant_id": " " } ) using Python
14 cursor = db.restaurants.find() for document in cursor: print(document) cursor = db.restaurants.find({"borough": "Manhattan"}) for document in cursor: print(document) cursor = db.restaurants.find({"grades.score": {"$gt": 30}})... cursor = db.restaurants.find({"cuisine": "Italian", "address.zipcode": "10075"})... // sorting cursor = db.restaurants.find().sort([ ("borough", pymongo.ascending), ("address.zipcode", pymongo.descending) ])... // update many fields result = db.restaurants.update_many( {"address.zipcode": "10016", "cuisine": "Other"}, { "$set": {"cuisine": "Category To Be Determined"}, "$currentdate": {"lastmodified": True} } )... // aggregates cursor = db.restaurants.aggregate( [ {"$group": {"_id": "$borough", "count": {"$sum": 1}}} ] ) MapReduce Framework While NoSQL has its advantages, it does lack some of the nice things that the more mature RDBMS systems have. This means more resonsibility falls to the programmer. Came from Google, and Hadoop is the open source implementation of it. See original paper: MapReduce: Simplified Data Processing on Large Clusters designed for data analysis applications no data model - data stored in files User provides five key functions and the MR system provides the glue to make them work, and it adds fault tolerance map, reduce, reader, writer, combine General Flow
15 Input records into map, map produces key, value pairs. each moves into a reduce that is handling those keys, and then reduce produces output records based upon the stream of values associated with the key. k 0.. k n Map SLIDE Divide and conquer! divide analysis problem into subproblems User creates ONE Map function that takes an input record and produces zero or more <key, value> pairs. Typically separates out subproblems based on the key (with limited processing) The key is not the same kind of key as in RDBMS s this key may have many values associated with it. In fact, that s the point of MapReduce. The framework will run the map function in multiple processes on one or more processors and systems. Example from Garcia-Molina Suppose we want to build an inverted index for a collection of documents identified as D i. D each having a document id. Thus, a document is i We build the Map function to take a D i and read it character by character to find each word w j. As the words are identified, the w, i Map function emits the <key, value> pair: < >. D i (w, [ i 1, i 2,..., i n ]) Thus, the output of Map for a single is a stream of word and document id pairs. This will likely include duplicate pairs, but that can be handled later. Reduce SLIDE User supplies one Reduce function that takes a set of values associated with a given key and produces zero or more outputs. The framework will run multiple Reduce functions on one or more processors and systems. Example from Garcia-Molina (continued) The intermediate results from the Map functions are the pairs (w, [ i 1, i 2,..., i n ]). The Reduce function should take a list of document ID s, removes duplicates, and sorts the new list of now unique document ID s.
16 Note that the Map function works on a single document, and the reduce works on a single word. Reader and Writer User creates READER and WRITER for reading records from disk and writing records to disk, respectively. Reader extracts records from files, feeds the maps, and the reduce feeds into the writer functions. Combine The combine() works with MAP and takes a set of records for a given key to send a combined version of those <key, value> pairs to feed the reducer more efficiently. Sort of a pre-processor for Reduce. What the framework provides SLIDE The user provides just those 4 5 functions and the framework provides the infrastructure to support all of this activity. The framework fistributes the pieces to multiple machines - everything is parallelizable by its very nature. It also supports scalability by enabling adaptation to the addition of new machines. Fault tolerance can recover reducer failures, etc. Here s graphical example of the simple word counter SLIDE Example 2 from Garcia-Molina Suppose we want to modify the previous example to produce a word count across all the documents. That is, for each word that occurs in at least one document in the collection, we want the final output to be, where is the word and is the total count of its occurrences across all the documents. The Map function works the same, only emitting the <key, value> pair: < > this time. The Reduce function will now receive the pair, with a for each occurrence of the word. The Reduce function simply adds up all the s. Other capabilities 1 w, 1 (w, c) w c (w, [1, 1,..., 1]) 1 w w You can even implement selection, projhectiob EXAMPLE: Web Log (Widom)
17 record: Userid, URL, timestamp, addlinfo task: count # of accesses for each domain map: takes a record, look inside, and extract the URL, and emit <domain, NULL> reduce: (domain, list of NULL values representing one access to that domain) - so it just has to count how many NULL values it gets for each domain, and then emits <domain,count> map(record) <domain(key), NULL> reduce(domain(key), list of NULLs) <domain, count> combiner takes the domain and list of nulls, and produces the <domain,count summary for EACH Map function. Insert combine() after Map combine takes the output from Map and preprocesses it to do the counting there, producing a list of <domain, count> pairs which are sent on to reduce. More efficient to count in the Map node. Example MapReduce in JavaScript and MongoDB From Isuru Suriarachchi s Blog SLIDE 15 16,20 Add on s HIVE (Schemas, SQL like language) and PIG (more imperative, but with relational operators) Both compile to a workflow of Hadoop (MapReduce) jobs. Both very popular Examples of the various kinds of NoSQL Data Stores For a list of the NoSQL databases of each type see nosql.mypopescu.com nosql
18 1. Key references used for this NoSQL lecture: Fowler, et.al., NoSQL Distilled ; Robinson, et.al., Graph Databases (Early Release) ; McCreary & Kelly, Making Sense of NoSQL. Other references include texts by Coronel, Widom, Ullman, Jukic, and Silberschatz.
SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford
SQL VS. NO-SQL Adapted Slides from Dr. Jennifer Widom from Stanford 55 Traditional Databases SQL = Traditional relational DBMS Hugely popular among data analysts Widely adopted for transaction systems
NoSQL Databases. Nikos Parlavantzas
!!!! NoSQL Databases Nikos Parlavantzas Lecture overview 2 Objective! Present the main concepts necessary for understanding NoSQL databases! Provide an overview of current NoSQL technologies Outline 3!
NoSQL systems: introduction and data models. Riccardo Torlone Università Roma Tre
NoSQL systems: introduction and data models Riccardo Torlone Università Roma Tre Why NoSQL? In the last thirty years relational databases have been the default choice for serious data storage. An architect
Cloud Scale Distributed Data Storage. Jürmo Mehine
Cloud Scale Distributed Data Storage Jürmo Mehine 2014 Outline Background Relational model Database scaling Keys, values and aggregates The NoSQL landscape Non-relational data models Key-value Document-oriented
Lecture Data Warehouse Systems
Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores
NoSQL Databases. Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015
NoSQL Databases Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015 Database Landscape Source: H. Lim, Y. Han, and S. Babu, How to Fit when No One Size Fits., in CIDR,
Integrating Big Data into the Computing Curricula
Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big
Cloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
Can the Elephants Handle the NoSQL Onslaught?
Can the Elephants Handle the NoSQL Onslaught? Avrilia Floratou, Nikhil Teletia David J. DeWitt, Jignesh M. Patel, Donghui Zhang University of Wisconsin-Madison Microsoft Jim Gray Systems Lab Presented
X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released
General announcements In-Memory is available next month http://www.oracle.com/us/corporate/events/dbim/index.html X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released
Structured Data Storage
Structured Data Storage Xgen Congress Short Course 2010 Adam Kraut BioTeam Inc. Independent Consulting Shop: Vendor/technology agnostic Staffed by: Scientists forced to learn High Performance IT to conduct
NoSQL for SQL Professionals William McKnight
NoSQL for SQL Professionals William McKnight Session Code BD03 About your Speaker, William McKnight President, McKnight Consulting Group Frequent keynote speaker and trainer internationally Consulted to
Databases 2 (VU) (707.030)
Databases 2 (VU) (707.030) Introduction to NoSQL Denis Helic KMI, TU Graz Oct 14, 2013 Denis Helic (KMI, TU Graz) NoSQL Oct 14, 2013 1 / 37 Outline 1 NoSQL Motivation 2 NoSQL Systems 3 NoSQL Examples 4
How To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 [email protected] www.scch.at Michael Zwick DI
Understanding NoSQL Technologies on Windows Azure
David Chappell Understanding NoSQL Technologies on Windows Azure Sponsored by Microsoft Corporation Copyright 2013 Chappell & Associates Contents Data on Windows Azure: The Big Picture... 3 Windows Azure
Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world
Analytics March 2015 White paper Why NoSQL? Your database options in the new non-relational world 2 Why NoSQL? Contents 2 New types of apps are generating new types of data 2 A brief history of NoSQL 3
An Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
Challenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.
Big Data Technology ดร.ช ชาต หฤไชยะศ กด Choochart Haruechaiyasak, Ph.D. Speech and Audio Technology Laboratory (SPT) National Electronics and Computer Technology Center (NECTEC) National Science and Technology
Hadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
Infrastructures for big data
Infrastructures for big data Rasmus Pagh 1 Today s lecture Three technologies for handling big data: MapReduce (Hadoop) BigTable (and descendants) Data stream algorithms Alternatives to (some uses of)
ADVANCED DATABASES PROJECT. Juan Manuel Benítez V. - 000425944. Gledys Sulbaran - 000423426
ADVANCED DATABASES PROJECT Juan Manuel Benítez V. - 000425944 Gledys Sulbaran - 000423426 TABLE OF CONTENTS Contents Introduction 1 What is NoSQL? 2 Why NoSQL? 3 NoSQL vs. SQL 4 Mongo DB - Introduction
Introduction to Big Data Training
Introduction to Big Data Training The quickest way to be introduce with NOSQL/BIG DATA offerings Learn and experience Big Data Solutions including Hadoop HDFS, Map Reduce, NoSQL DBs: Document Based DB
MongoDB in the NoSQL and SQL world. Horst Rechner [email protected] Berlin, 2012-05-15
MongoDB in the NoSQL and SQL world. Horst Rechner [email protected] Berlin, 2012-05-15 1 MongoDB in the NoSQL and SQL world. NoSQL What? Why? - How? Say goodbye to ACID, hello BASE You
Comparing SQL and NOSQL databases
COSC 6397 Big Data Analytics Data Formats (II) HBase Edgar Gabriel Spring 2015 Comparing SQL and NOSQL databases Types Development History Data Storage Model SQL One type (SQL database) with minor variations
The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect
The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect IT Insight podcast This podcast belongs to the IT Insight series You can subscribe to the podcast through
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example
MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design
Why NoSQL? Your database options in the new non- relational world. 2015 IBM Cloudant 1
Why NoSQL? Your database options in the new non- relational world 2015 IBM Cloudant 1 Table of Contents New types of apps are generating new types of data... 3 A brief history on NoSQL... 3 NoSQL s roots
NoSQL Databases. Polyglot Persistence
The future is: NoSQL Databases Polyglot Persistence a note on the future of data storage in the enterprise, written primarily for those involved in the management of application development. Martin Fowler
these three NoSQL databases because I wanted to see a the two different sides of the CAP
Michael Sharp Big Data CS401r Lab 3 For this paper I decided to do research on MongoDB, Cassandra, and Dynamo. I chose these three NoSQL databases because I wanted to see a the two different sides of the
How To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12
Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12 What is a NoSQL Database? 1. A key/value store Basic index manager, no complete query language
A survey of big data architectures for handling massive data
CSIT 6910 Independent Project A survey of big data architectures for handling massive data Jordy Domingos - [email protected] Supervisor : Dr David Rossiter Content Table 1 - Introduction a - Context
Practical Cassandra. Vitalii Tymchyshyn [email protected] @tivv00
Practical Cassandra NoSQL key-value vs RDBMS why and when Cassandra architecture Cassandra data model Life without joins or HDD space is cheap today Hardware requirements & deployment hints Vitalii Tymchyshyn
Transactions and ACID in MongoDB
Transactions and ACID in MongoDB Kevin Swingler Contents Recap of ACID transactions in RDBMSs Transactions and ACID in MongoDB 1 Concurrency Databases are almost always accessed by multiple users concurrently
Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
Big Systems, Big Data
Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,
Comparison of the Frontier Distributed Database Caching System with NoSQL Databases
Comparison of the Frontier Distributed Database Caching System with NoSQL Databases Dave Dykstra [email protected] Fermilab is operated by the Fermi Research Alliance, LLC under contract No. DE-AC02-07CH11359
Microsoft Azure Data Technologies: An Overview
David Chappell Microsoft Azure Data Technologies: An Overview Sponsored by Microsoft Corporation Copyright 2014 Chappell & Associates Contents Blobs... 3 Running a DBMS in a Virtual Machine... 4 SQL Database...
Understanding NoSQL on Microsoft Azure
David Chappell Understanding NoSQL on Microsoft Azure Sponsored by Microsoft Corporation Copyright 2014 Chappell & Associates Contents Data on Azure: The Big Picture... 3 Relational Technology: A Quick
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
Introduction to Polyglot Persistence. Antonios Giannopoulos Database Administrator at ObjectRocket by Rackspace
Introduction to Polyglot Persistence Antonios Giannopoulos Database Administrator at ObjectRocket by Rackspace FOSSCOMM 2016 Background - 14 years in databases and system engineering - NoSQL DBA @ ObjectRocket
Introduction to NoSQL and MongoDB. Kathleen Durant Lesson 20 CS 3200 Northeastern University
Introduction to NoSQL and MongoDB Kathleen Durant Lesson 20 CS 3200 Northeastern University 1 Outline for today Introduction to NoSQL Architecture Sharding Replica sets NoSQL Assumptions and the CAP Theorem
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010
System/ Scale to Primary Secondary Joins/ Integrity Language/ Data Year Paper 1000s Index Indexes Transactions Analytics Constraints Views Algebra model my label 1971 RDBMS O tables sql-like 2003 memcached
Big Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 4. Basic Principles Doc. RNDr. Irena Holubova, Ph.D. [email protected] http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ NoSQL Overview Main objective:
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Sentimental Analysis using Hadoop Phase 2: Week 2
Sentimental Analysis using Hadoop Phase 2: Week 2 MARKET / INDUSTRY, FUTURE SCOPE BY ANKUR UPRIT The key value type basically, uses a hash table in which there exists a unique key and a pointer to a particular
Hacettepe University Department Of Computer Engineering BBM 471 Database Management Systems Experiment
Hacettepe University Department Of Computer Engineering BBM 471 Database Management Systems Experiment Subject NoSQL Databases - MongoDB Submission Date 20.11.2013 Due Date 26.12.2013 Programming Environment
Domain driven design, NoSQL and multi-model databases
Domain driven design, NoSQL and multi-model databases Java Meetup New York, 10 November 2014 Max Neunhöffer www.arangodb.com Max Neunhöffer I am a mathematician Earlier life : Research in Computer Algebra
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
Big Data Development CASSANDRA NoSQL Training - Workshop. March 13 to 17-2016 9 am to 5 pm HOTEL DUBAI GRAND DUBAI
Big Data Development CASSANDRA NoSQL Training - Workshop March 13 to 17-2016 9 am to 5 pm HOTEL DUBAI GRAND DUBAI ISIDUS TECH TEAM FZE PO Box 121109 Dubai UAE, email training-coordinator@isidusnet M: +97150
So What s the Big Deal?
So What s the Big Deal? Presentation Agenda Introduction What is Big Data? So What is the Big Deal? Big Data Technologies Identifying Big Data Opportunities Conducting a Big Data Proof of Concept Big Data
http://glennengstrand.info/analytics/fp
Functional Programming and Big Data by Glenn Engstrand (September 2014) http://glennengstrand.info/analytics/fp What is Functional Programming? It is a style of programming that emphasizes immutable state,
Preparing Your Data For Cloud
Preparing Your Data For Cloud Narinder Kumar Inphina Technologies 1 Agenda Relational DBMS's : Pros & Cons Non-Relational DBMS's : Pros & Cons Types of Non-Relational DBMS's Current Market State Applicability
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing
Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu
Lecture 4 Introduction to Hadoop & GAE Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu Outline Introduction to Hadoop The Hadoop ecosystem Related projects
Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University 2013-03-05
Introduction to NoSQL Databases Tore Risch Information Technology Uppsala University 2013-03-05 UDBL Tore Risch Uppsala University, Sweden Evolution of DBMS technology Distributed databases SQL 1960 1970
16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
Open Source Technologies on Microsoft Azure
Open Source Technologies on Microsoft Azure A Survey @DChappellAssoc Copyright 2014 Chappell & Associates The Main Idea i Open source technologies are a fundamental part of Microsoft Azure The Big Questions
A Review of Column-Oriented Datastores. By: Zach Pratt. Independent Study Dr. Maskarinec Spring 2011
A Review of Column-Oriented Datastores By: Zach Pratt Independent Study Dr. Maskarinec Spring 2011 Table of Contents 1 Introduction...1 2 Background...3 2.1 Basic Properties of an RDBMS...3 2.2 Example
A COMPARATIVE STUDY OF NOSQL DATA STORAGE MODELS FOR BIG DATA
A COMPARATIVE STUDY OF NOSQL DATA STORAGE MODELS FOR BIG DATA Ompal Singh Assistant Professor, Computer Science & Engineering, Sharda University, (India) ABSTRACT In the new era of distributed system where
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
NoSQL. Thomas Neumann 1 / 22
NoSQL Thomas Neumann 1 / 22 What are NoSQL databases? hard to say more a theme than a well defined thing Usually some or all of the following: no SQL interface no relational model / no schema no joins,
bigdata Managing Scale in Ontological Systems
Managing Scale in Ontological Systems 1 This presentation offers a brief look scale in ontological (semantic) systems, tradeoffs in expressivity and data scale, and both information and systems architectural
CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #13: NoSQL and MapReduce
CS 4604: Introduc0on to Database Management Systems B. Aditya Prakash Lecture #13: NoSQL and MapReduce Announcements HW4 is out You have to use the PGSQL server START EARLY!! We can not help if everyone
Introduction to NOSQL
Introduction to NOSQL Université Paris-Est Marne la Vallée, LIGM UMR CNRS 8049, France January 31, 2014 Motivations NOSQL stands for Not Only SQL Motivations Exponential growth of data set size (161Eo
NoSQL: Going Beyond Structured Data and RDBMS
NoSQL: Going Beyond Structured Data and RDBMS Scenario Size of data >> disk or memory space on a single machine Store data across many machines Retrieve data from many machines Machine = Commodity machine
NoSQL replacement for SQLite (for Beatstream) Antti-Jussi Kovalainen Seminar OHJ-1860: NoSQL databases
NoSQL replacement for SQLite (for Beatstream) Antti-Jussi Kovalainen Seminar OHJ-1860: NoSQL databases Background Inspiration: postgresapp.com demo.beatstream.fi (modern desktop browsers without
Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН
Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН Zettabytes Petabytes ABC Sharding A B C Id Fn Ln Addr 1 Fred Jones Liberty, NY 2 John Smith?????? 122+ NoSQL Database
Advanced Data Management Technologies
ADMT 2014/15 Unit 15 J. Gamper 1/44 Advanced Data Management Technologies Unit 15 Introduction to NoSQL J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE ADMT 2014/15 Unit 15
Scaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant
nosql and Non Relational Databases
nosql and Non Relational Databases Image src: http://www.pentaho.com/big-data/nosql/ Matthias Lee Johns Hopkins University What NoSQL? Yes no SQL.. Atleast not only SQL Large class of Non Relaltional Databases
Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB
Overview of Databases On MacOS Karl Kuehn Automation Engineer RethinkDB Session Goals Introduce Database concepts Show example players Not Goals: Cover non-macos systems (Oracle) Teach you SQL Answer what
An Open Source NoSQL solution for Internet Access Logs Analysis
An Open Source NoSQL solution for Internet Access Logs Analysis A practical case of why, what and how to use a NoSQL Database Management System instead of a relational one José Manuel Ciges Regueiro
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344
Where We Are Introduction to Data Management CSE 344 Lecture 25: DBMS-as-a-service and NoSQL We learned quite a bit about data management see course calendar Three topics left: DBMS-as-a-service and NoSQL
NoSQL Database Options
NoSQL Database Options Introduction For this report, I chose to look at MongoDB, Cassandra, and Riak. I chose MongoDB because it is quite commonly used in the industry. I chose Cassandra because it has
FIS GT.M Multi-purpose Universal NoSQL Database. K.S. Bhaskar Development Director, FIS [email protected] +1 (610) 578-4265
FIS GT.M Multi-purpose Universal NoSQL Database K.S. Bhaskar Development Director, FIS [email protected] +1 (610) 578-4265 Outline M a Universal NoSQL database Using GT.M as a Multi-purpose NoSQL
NoSQL in der Cloud Why? Andreas Hartmann
NoSQL in der Cloud Why? Andreas Hartmann 17.04.2013 17.04.2013 2 NoSQL in der Cloud Why? Quelle: http://res.sys-con.com/story/mar12/2188748/cloudbigdata_0_0.jpg Why Cloud??? 17.04.2013 3 NoSQL in der Cloud
GigaSpaces Real-Time Analytics for Big Data
GigaSpaces Real-Time Analytics for Big Data GigaSpaces makes it easy to build and deploy large-scale real-time analytics systems Rapidly increasing use of large-scale and location-aware social media and
NoSQL - What we ve learned with mongodb. Paul Pedersen, Deputy CTO [email protected] DAMA SF December 15, 2011
NoSQL - What we ve learned with mongodb Paul Pedersen, Deputy CTO [email protected] DAMA SF December 15, 2011 DW2.0 and NoSQL management decision support intgrated access - local v. global - structured v.
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
MongoDB Developer and Administrator Certification Course Agenda
MongoDB Developer and Administrator Certification Course Agenda Lesson 1: NoSQL Database Introduction What is NoSQL? Why NoSQL? Difference Between RDBMS and NoSQL Databases Benefits of NoSQL Types of NoSQL
HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367
HBase A Comprehensive Introduction James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367 Overview Overview: History Began as project by Powerset to process massive
WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS
WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS Managing and analyzing data in the cloud is just as important as it is anywhere else. To let you do this, Windows Azure provides a range of technologies
Using RDBMS, NoSQL or Hadoop?
Using RDBMS, NoSQL or Hadoop? DOAG Conference 2015 Jean- Pierre Dijcks Big Data Product Management Server Technologies Copyright 2014 Oracle and/or its affiliates. All rights reserved. Data Ingest 2 Ingest
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
Making Sense ofnosql A GUIDE FOR MANAGERS AND THE REST OF US DAN MCCREARY MANNING ANN KELLY. Shelter Island
Making Sense ofnosql A GUIDE FOR MANAGERS AND THE REST OF US DAN MCCREARY ANN KELLY II MANNING Shelter Island contents foreword preface xvii xix acknowledgments xxi about this book xxii Part 1 Introduction
Lecture 10: HBase! Claudia Hauff (Web Information Systems)! [email protected]
Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the
Scalable ecommerce with NoSQL. Dipali Trivedi
Scalable ecommerce with NoSQL Dipali Trivedi ECommerce entities and schema Key aspect of NoSQL adoption Denomarlization: Key Aspect of NoSQL adoption Question oriented schema design: A. What are the products
Not Relational Models For The Management of Large Amount of Astronomical Data. Bruno Martino (IASI/CNR), Memmo Federici (IAPS/INAF)
Not Relational Models For The Management of Large Amount of Astronomical Data Bruno Martino (IASI/CNR), Memmo Federici (IAPS/INAF) What is a DBMS A Data Base Management System is a software infrastructure
Getting Started with Hadoop. Raanan Dagan Paul Tibaldi
Getting Started with Hadoop Raanan Dagan Paul Tibaldi What is Apache Hadoop? Hadoop is a platform for data storage and processing that is Scalable Fault tolerant Open source CORE HADOOP COMPONENTS Hadoop
