MongoDB

1. Introduction

MongoDB is a document-oriented database, not a relational one. It replaces the concept of a row with a document, which makes it possible to represent complex hierarchical relationships with a single record. There are no predefined schemas, which makes adding or removing fields easier. Data is also easier to scale out.
- Scaling up = getting a better machine.
- Scaling out = adding another server to your cluster.
Missing features: joins and complex multi-row transactions. Whenever possible, the database server offloads processing and logic to the client side.

2. Getting Started

- Document: an ordered set of keys with associated values. Ex: { "foo" : 3 }. MongoDB is type- and case-sensitive. No duplicate keys.
- Collection: a group of documents. Analogous to a table.
- Database: a group of collections. A database has its own permissions and is stored in separate files on disk.
- Namespace: a database name + a collection name = a fully qualified collection name.

Shell basics:
- db : print the currently selected database
- use test2 : switch to the database test2
- db.blog.insert(post) : insert the JS variable post
- db.blog.find() : retrieve the contents of a collection
- db.blog.findOne() : retrieve a single document from a collection
- db.blog.update({"field" : value}, post) : replace the document matching the criteria with post
- db.blog.remove({"field" : value});
- show dbs
- show collections

3. Creating, Updating, and Deleting Documents

- db.foo.insert({"bar" : "baz"})
- db.foo.batchInsert([{"_id" : 0}, {"_id" : 1}, {"_id" : 2}])
- db.foo.remove() : remove all documents from the collection
- db.foo.remove({"opt-out" : true})
- db.foo.drop() : drop the entire collection
- Concurrent updates: the last one wins.

Update modifiers:
- $inc : increment the value of a key. Ex: {"$inc" : {"pageviews" : 1}}
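As a sketch of the document model described above, here is a hypothetical blog post (the field names and values are illustrative, not from the book): hierarchical data such as tags and embedded comments lives in a single record instead of several joined rows, runnable as plain JavaScript without a server.

```javascript
// A hypothetical blog post document: nested arrays and subdocuments
// represent a hierarchy that a relational schema would split across tables.
const post = {
  title: "My Blog Post",
  tags: ["mongodb", "notes"],
  comments: [
    { author: "joe", text: "Nice post" },
    { author: "sam", text: "Thanks" }
  ]
};
// In the mongo shell, this whole structure is stored with one call:
// db.blog.insert(post)
console.log(post.comments.length); // 2
```
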
- $set : set the value of a field.
- $unset : remove a key and its value.
- $push : add elements to the end of an array.
- $each : modifier available for use with $push and $addToSet; adds several values at once.
- $slice, $sort : trim and order an array when used with $push.
- $ne : not equal.
- $addToSet : add a value only if it is not already present (prevents duplicates).
- $pop : remove elements from either end of an array, like a queue or a stack.
- $pull : remove all elements of an array that match the given criteria.
- $ : positional operator, refers to the array element matched by the query.
- Upsert: update or insert.
- $setOnInsert : only set the value of a field when the document is being inserted.
- getLastError : return info about the last operation.
- findAndModify : return the item and update it in a single operation.
- Unacknowledged writes: do not return any status response (for low-value data).

4. Querying

The find method is used to perform queries. The first argument determines which documents get returned. Ex: db.users.find({"age" : 27}). The second argument specifies the keys you want. Ex: db.users.find({}, {"mail" : 1, "_id" : 0}).

Conditionals: $lt, $lte, $gt, $gte, $in, $nin, $or, $and, $exists
- Ex: db.users.find({"age" : {"$gte" : 18, "$lte" : 30}})
- Ex: db.raffle.find({"ticket_no" : {"$in" : [725, 542, 390]}})
- Ex: db.raffle.find({"$or" : [{"ticket_no" : 725}, {"winner" : true}]})

- Regular expressions: db.users.find({"name" : /joe/i})
- Arrays: each query clause can match a different array element.
- Embedded documents: use "dot" notation to access inner fields.
- $where clause: allows arbitrary JS to be executed (dangerous and slow). Ex: db.foo.find({"$where" : function() { return true; /* or false */ }});
- Cursors: hasNext(), next(), forEach(function(x) { ... }), limit(), skip(), sort({...})
- snapshot() : make sure each document is returned only once (slower).
- DB commands: db.runCommand({"drop" : "test"});
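To make the range-conditional semantics concrete, here is a plain-JavaScript simulation (no server required) of what db.users.find({"age" : {"$gte" : 18, "$lte" : 30}}) would match; the users array is made-up sample data.

```javascript
// Made-up sample collection.
const users = [
  { name: "joe", age: 27 },
  { name: "ann", age: 17 },
  { name: "bob", age: 30 }
];

// $gte / $lte are inclusive bounds, so this is equivalent to
// u.age >= 18 && u.age <= 30.
const matches = users.filter(u => u.age >= 18 && u.age <= 30);

console.log(matches.map(u => u.name).join(",")); // joe,bob
```
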
5. Indexing

- Table scan = a query that does not use an index.
- Creating an index: db.people.ensureIndex({"profession" : 1});
- Compound index: db.users.ensureIndex({"age" : 1, "username" : 1});
- Inefficient operators: $where, $exists, $ne, $not
- explain() is a tool for diagnosing slow queries.
- Unique indexes: db.users.ensureIndex({"username" : 1}, {"unique" : true});
- Sparse indexes: indexes that need not include every document as an entry. Ex: db.users.ensureIndex({"email" : 1}, {"unique" : true, "sparse" : true})
- To retrieve indexes: db.<collection>.getIndexes()

6. Special Index and Collection Types

- Capped collections: fixed-size collections that behave like circular queues. Documents cannot be explicitly removed; the oldest are aged out as new ones are inserted. Useful for logging.
- Tailable cursors: cursors that continue to fetch new results as documents are added to the collection.
- TTL indexes: remove documents after a given timeout.
- Full-text indexes: for indexing large amounts of text.
- Geospatial indexes: use GeoJSON, a format for encoding a variety of geographic data structures.
- GridFS: a mechanism for storing large binary files in MongoDB.

7. Aggregation

The aggregation framework lets you transform and combine documents in a collection.

Pipeline operators:
- $project : extract fields from subdocuments, rename fields, perform operations on them.
- $match : filter documents.
- $group : group documents based on some fields.
- $sort, $limit
- $unwind : split each element of an array into a separate document.

Expressions:
- Mathematical: $sum, $add, $subtract, $multiply, $divide, $mod
- Date: $year, $month, $week, $dayOfMonth
- String: $substr
- Logical: $cmp, $eq, $lt, $and
- Control statements: $cond, $ifNull
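As a sketch of what the $unwind pipeline operator does, here is a plain-JavaScript simulation (no server required); the input document is made-up sample data, not an example from the book.

```javascript
// One document whose "tags" field is an array.
const doc = { _id: 1, author: "joe", tags: ["db", "nosql"] };

// $unwind produces one output document per array element, with the array
// field replaced by that single element.
const unwound = doc.tags.map(tag => ({
  _id: doc._id,
  author: doc.author,
  tags: tag
}));

console.log(JSON.stringify(unwound[0])); // {"_id":1,"author":"joe","tags":"db"}
```
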
Ex:
> db.articles.aggregate(
    {"$project" : {"author" : 1}},
    {"$group" : {"_id" : "$author", "count" : {"$sum" : 1}}},
    {"$sort" : {"count" : -1}},
    {"$limit" : 5}
  )

MapReduce

A powerful and flexible tool for aggregating data that can be easily parallelized across multiple servers. It splits up a problem, sends chunks of it to different machines, and lets each machine solve its part of the problem. When all the machines are finished, they merge all the pieces of the solution back into a full solution.

Two steps: map and reduce.
1. Map: maps an operation onto every document in a collection.
2. Reduce: takes the list of emitted values and reduces it to a single element.

Ex: db.runCommand({"mapreduce" : "foo", "map" : myMap, "reduce" : myReduce});

Other aggregation commands: count, distinct, group

8. Application Design

- Normalization: dividing data into multiple collections with references between collections. MongoDB has no joining facilities, so gathering documents from multiple collections requires multiple queries. Each piece of data lives in one collection, and multiple documents may reference it.
- Denormalization: embedding all of the data in a single document. Many documents may have copies of the data, so multiple documents need to be updated if the information changes, but all related data can be fetched with a single query.
- Normalizing makes writes faster; denormalizing makes reads faster.
- Ex: users and addresses: it is best to embed the address in the user document (faster reads), since an address does not change often.
- Cardinality: one-to-one, one-to-many, many-to-many. Split "many" into "many" and "few": "few" relationships work better with embedding, "many" with references.
- If a document has only one field that grows, try to keep it as the last field in the document.
- When not to use MongoDB: when you need transactions or to join many different types of data.
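The two map-reduce steps above can be simulated in plain JavaScript without a server; the documents below are made-up sample data, and emit() stands in for the function MongoDB provides to the map phase. The example counts documents per author, mirroring the $group/$sum pipeline shown earlier.

```javascript
// Made-up sample collection.
const docs = [{ author: "joe" }, { author: "sam" }, { author: "joe" }];

// emit() collects key/value pairs produced by the map phase.
const emitted = {};
function emit(key, value) {
  (emitted[key] = emitted[key] || []).push(value);
}

// Map phase: run once per document.
function map(doc) { emit(doc.author, 1); }

// Reduce phase: collapse the list of values for one key into a single value.
function reduce(key, values) { return values.reduce((a, b) => a + b, 0); }

docs.forEach(map);
const counts = {};
for (const key of Object.keys(emitted)) counts[key] = reduce(key, emitted[key]);

console.log(counts.joe, counts.sam); // 2 1
```

MongoDB runs the same two phases server-side, in parallel across shards, which is why reduce must be able to combine partial results for the same key.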
9. Setting Up a Replica Set

Replication is a way of keeping identical copies of the data on multiple servers. A replica set is a group of servers with one primary, the server taking client requests, and multiple secondaries, servers that keep copies of the primary's data. If the primary crashes, the secondaries can elect a new primary from among themselves.

10. Components of a Replica Set

MongoDB handles replication by keeping a log of operations (the oplog) containing every write the primary performs. Secondary servers query this collection for operations to replicate.

11. Connecting to a Replica Set from Your Application

From an application's point of view, a replica set behaves much like a standalone server. To ensure that writes will be persisted no matter what happens to the set, you must ensure that each write propagates to a majority of the members of the set. You can use the getLastError command to check that a write was successful:

db.runCommand({"getLastError" : 1, "w" : "majority"})

Applications that require strongly consistent reads should not read from secondaries.

12. Administration

This chapter covers replica set administration.

13. Introduction to Sharding

Sharding refers to the process of splitting data up across machines (also called partitioning). By putting a subset of the data on each machine, it becomes possible to store more data and handle more load (unlike replication, which creates exact copies of the data on multiple servers). MongoDB supports auto-sharding, which tries to abstract the architecture away from the application and simplify the administration of such a system. MongoDB automates balancing data across shards and makes it easier to add and remove capacity.
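The "majority" write concern above resolves to a concrete number of members; a small sketch of that arithmetic (the formula is the standard floor(n/2) + 1 majority rule, not code from the book):

```javascript
// Number of members that must acknowledge a write for it to count as
// having reached a majority of an n-member replica set.
function majority(members) {
  return Math.floor(members / 2) + 1;
}

console.log(majority(3), majority(5)); // 2 3
```

So with "w" : "majority", a 3-member set needs 2 acknowledgements and a 5-member set needs 3 before getLastError reports success.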
14. Configuring Sharding

Do not shard too early, or too late. Use sharding to:
- increase available RAM
- increase available disk space
- reduce load on a server
- read or write data with greater throughput than a single mongod can handle

15. Choosing a Shard Key

The most important and difficult task when using sharding is choosing how your data will be distributed. A shard key is a field used to split up the data. Three types of keys: ascending, random, and location-based.

16. Sharding Administration

This chapter gives advice on performing administrative tasks on all parts of a cluster, including: inspecting the cluster's state; adding, removing, and changing members of a cluster; administering data movement; and manually moving data.

17. Seeing What Your Application Is Doing

- Current operations: db.currentOp()
- Killing an operation: db.killOp(opid)
- Profiling: db.system.profile.find().pretty()
- Calculating size: Object.bsonsize(db.users.findOne())
- Stats: db.users.stats() ...

18. Data Administration

- Adding a root user: use admin; db.addUser("root", "abcd");
- To enable security: the --auth command-line option
- To authenticate: db.auth("user", "password"); ...
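To illustrate how an ascending shard key distributes data, here is a plain-JavaScript sketch of range-based chunk routing: a document is sent to whichever shard owns the chunk whose key range contains its shard key value. The chunk ranges and shard names are hypothetical, not from the book.

```javascript
// Hypothetical chunk table: each chunk owns a half-open key range [min, max).
const chunks = [
  { min: -Infinity, max: 100, shard: "shard0" },
  { min: 100, max: Infinity, shard: "shard1" }
];

// Route a document by comparing its shard key value against chunk ranges.
function route(keyValue) {
  return chunks.find(c => keyValue >= c.min && keyValue < c.max).shard;
}

console.log(route(42), route(250)); // shard0 shard1
```

With an ascending key, new documents always land in the last chunk, which is why the book also discusses random and location-based keys.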
19. Durability

Durability is the guarantee that a committed operation will survive permanently. Use db.foo.validate() to check a collection for corruption. ...

20. Starting and Stopping MongoDB

- Startup options: --dbpath, --port, --config
- Shutting down cleanly: the {"shutdown" : 1} command, run against the admin database

21. Monitoring MongoDB

22. Making Backups

23. Deploying MongoDB
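The --config option above points mongod at a config file; a minimal sketch of one, assuming the ini-style key = value format used by mongod in this era (paths and values are hypothetical):

```ini
# Hypothetical mongod config file, passed with: mongod --config /etc/mongod.conf
dbpath = /var/lib/mongodb
port = 27017
logpath = /var/log/mongod.log
fork = true
```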