How SourceForge is Using MongoDB
Rick Copeland, @rick446, rick@geek.net
Geeknet, page 1
SF.net "BlackOps": FossFor.us
- User Editable!
- Web 2.0! (ish)
- Not Ugly!
Moving to NoSQL
- FossFor.us used CouchDB (NoSQL)
- "Just adding new fields was trivial, and was happening all the time" (Mark Ramm)
- Scaling up to the level of SF.net needs research
  - CouchDB
  - MongoDB
  - Tokyo Cabinet/Tyrant
  - Cassandra
  - ... and others
Rewriting "Consume"
- Most traffic on SF.net hits 3 types of pages:
  - Project Summary
  - File Browser
  - Download
- Pages are read-mostly, with infrequent updates from the "Develop" side of sf.net
- Original goal is 1 MongoDB document per project
- Later split release data because some projects have lots of releases
- Periodic updates via RSS and AMQP from "Develop"
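The move from one document per project to separate release documents can be sketched with plain Python dicts. This is only an illustration: the field names (`releases`, `project_id`, `release_count`) and the sample project are hypothetical and are not taken from the SF.net schema.

```python
import copy


def split_releases(project_doc):
    """Split embedded releases out of a project document.

    Returns (summary_doc, release_docs). Each release document points
    back at its project through a hypothetical `project_id` field.
    """
    summary = copy.deepcopy(project_doc)
    releases = summary.pop("releases", [])
    release_docs = [
        dict(release, project_id=summary["_id"]) for release in releases
    ]
    # Keep a cheap counter so a summary page needs no second query.
    summary["release_count"] = len(release_docs)
    return summary, release_docs


# Made-up project used only for illustration.
project = {
    "_id": "example-project",
    "name": "Example Project",
    "releases": [
        {"version": "1.0", "files": ["example-1.0.tar.gz"]},
        {"version": "1.1", "files": ["example-1.1.tar.gz"]},
    ],
}

summary_doc, release_docs = split_releases(project)
```

The summary document stays small no matter how many releases a project accumulates, and the release documents can be loaded only by the pages that need them.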
Deployment Architecture
[Diagram: Load Balancer / Proxy; four Apache mod_wsgi / TG 2.0 web servers; four MongoDB slaves; Develop; Master DB Server (MongoDB Master); Gobble Server]
Deployment Architecture (revised)
[Diagram: Load Balancer / Proxy; four Apache mod_wsgi / TG 2.0 web servers; Develop; Master DB Server (MongoDB Master); Gobble Server]
- Scalability is good
- Single-node performance is good, too
SF.net Downloads
- Allow non-sf.net projects to use SourceForge mirror network
- Stats calculated in Hadoop and stored/served from MongoDB
- Same deployment architecture as Consume (4 web, 1 db)
Allura (SF.net beta devtools)
- Rewrite developer tools with new architecture
- Wiki, Tracker, Discussions, Git, Hg, SVN, with more to come
- Single MongoDB replica set manually sharded by project
- Release early & often
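"Manually sharded by project" means the application, not the database, decides where each project's data lives. A minimal sketch of one way to do that follows. It is not Allura's actual routing scheme: the database names, the `overrides` table, and the hash fallback are all assumptions made for illustration.

```python
import zlib

# Hypothetical database names on the single replica set.
DATABASES = ["project-data-0", "project-data-1",
             "project-data-2", "project-data-3"]


def database_for(project_shortname, overrides=None):
    """Pick the database that holds a project's data.

    `overrides` is a hypothetical hand-maintained table for projects
    that have been placed explicitly (for example, very large ones).
    Everything else falls back to a stable hash of the project name.
    """
    overrides = overrides or {}
    if project_shortname in overrides:
        return overrides[project_shortname]
    # crc32 is stable across processes, unlike the built-in hash().
    index = zlib.crc32(project_shortname.encode("utf-8")) % len(DATABASES)
    return DATABASES[index]
```

Because the routing lives in one function, moving a project only requires copying its data and adding an entry to the override table.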
What We Liked
- Performance, performance, performance
  - Easily handle 90% of SF.net traffic from 1 DB server, 4 web servers
- Schemaless server allows fast schema evolution in development, making many migrations unnecessary
- Replication is easy, making scalability and backups easy
  - Keep a "backup slave" running
  - Kill backup slave, copy off database, bring back up the slave
  - Automatic re-sync with master
- Query Language
  - You mean I can have performance without map-reduce?
- GridFS
Pitfalls
- Too-large documents
  - Store less per document
  - Return only a few fields
- Ignoring indexing
  - Watch your server log; bad queries show up there
- Ignoring your data's schema
- Using many databases when one will do
- Using too many queries
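Two of these pitfalls, returning whole documents and querying unindexed fields, have short fixes at the driver level. The sketch below assumes `collection` behaves like a PyMongo `Collection` (`find_one` with a projection argument, `create_index`). The field names are hypothetical.

```python
def load_summary(collection, shortname):
    """Fetch only the fields a summary page needs.

    `collection` is expected to behave like a pymongo Collection.
    The second argument to find_one is a projection: only the listed
    fields (plus _id) are returned. Field names are hypothetical.
    """
    projection = {"name": 1, "summary": 1, "release_count": 1}
    return collection.find_one({"shortname": shortname}, projection)


def ensure_summary_index(collection):
    """Index the field queried above so lookups avoid a full scan."""
    return collection.create_index("shortname")
```

Calling `ensure_summary_index` once at startup and `load_summary` per request keeps both the query cost and the amount of data sent to the web server small.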
Ming: an "Object-Document Mapper"?
- Your data has a schema
  - Your database can define and enforce it
  - It can live in your application (as with MongoDB)
  - Nice to have the schema defined in one place in the code
- Sometimes you need a "migration"
  - Changing the structure/meaning of fields
  - Adding indexes
  - Sometimes lazy, sometimes eager
- "Queuing up" all your updates can be handy
- Python dicts are nice; objects are nicer
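The difference between a lazy and an eager migration can be shown without a database. This is a sketch in plain Python, not Ming's migration API: the `schema_version` field and the `author` to `authors` change are invented for the example.

```python
CURRENT_VERSION = 2


def migrate(doc):
    """Upgrade one document dict in place to CURRENT_VERSION."""
    version = doc.get("schema_version", 1)
    if version < 2:
        # Hypothetical change: a single `author` string became a list.
        doc["authors"] = [doc.pop("author")] if "author" in doc else []
        doc["schema_version"] = 2
    return doc


def load_lazily(raw_doc):
    """Lazy: upgrade a document only when the application reads it."""
    return migrate(dict(raw_doc))


def migrate_eagerly(raw_docs):
    """Eager: upgrade every document in one pass."""
    return [migrate(dict(doc)) for doc in raw_docs]
```

A lazy migration spreads the cost over normal reads and leaves untouched documents in the old shape. An eager one pays the cost up front, so the rest of the code only ever sees the new shape.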
Ming Concepts
- Inspired by SQLAlchemy
- Group of classes to which you map your collections
- Each class defines its schema, including indexes
- Convenience methods for loading/saving objects and ensuring indexes are created
- Migrations
- Unit of Work: great for web applications
- MIM ("Mongo in Memory"): nice for unit tests
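The Unit of Work idea is to queue changed objects during a request and write them out in one step at the end. The class below is a simplified stand-in written for illustration, not Ming's implementation. The `save` callable and the `_id`-keyed queue are assumptions.

```python
class UnitOfWork:
    """Collect changed documents and persist them together."""

    def __init__(self, save):
        self._save = save      # callable that persists one document
        self._pending = {}     # keyed by _id so repeats collapse

    def register(self, doc):
        """Remember a changed document; repeated changes become one write."""
        self._pending[doc["_id"]] = doc

    def flush(self):
        """Write every pending document once, then clear the queue."""
        count = 0
        for doc in self._pending.values():
            self._save(doc)
            count += 1
        self._pending.clear()
        return count


# Usage: a list stands in for the database.
written = []
uow = UnitOfWork(written.append)
page = {"_id": 1, "title": "Home"}
uow.register(page)
page["title"] = "Home page"
uow.register(page)
flushed = uow.flush()
```

In a web application the flush typically happens once at the end of a successful request, so a request that fails partway through writes nothing.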
Ming Example

    from ming import schema
    from ming.orm import MappedClass
    from ming.orm import FieldProperty, ForeignIdProperty, RelationProperty

    class WikiPage(MappedClass):
        class __mongometa__:
            session = session   # ORM session, assumed to be created elsewhere
            name = 'wiki_page'  # MongoDB collection name

        _id = FieldProperty(schema.ObjectId)
        title = FieldProperty(str)
        text = FieldProperty(str)
        comments = RelationProperty('WikiComment')

    MappedClass.compile_all()  # Lets Ming know about the mapping
Open Source
- Ming
  - http://sf.net/projects/merciless/
  - MIT License
- Allura
  - http://sf.net/p/allura/
  - Apache License
Future Work
- mongos
- New Allura Tools
- Migrating legacy SF.net projects to Allura
- Stats all in MongoDB rather than Hadoop?
- Better APIs to access your project data
Questions?
Contact
Rick Copeland, @rick446, rick@geek.net