Hacettepe University Department Of Computer Engineering BBM 471 Database Management Systems Experiment Subject NoSQL Databases - MongoDB Submission Date 20.11.2013 Due Date 26.12.2013 Programming Environment Platform Indepented (Java/C#/C++) Advisors R.A. Ahmet Selman BOZKIR (Dr. Sevil Şen) Background For the past forty-some years, relational databases have ruled the data world. Relational models first appeared in the early 1970s thanks to the research of computer science pioneers such as E.F. Codd. Early versions of SQL-like languages were also developed in the early 70s, with modern SQL appearing in the late 1970s, and becoming popular by the mid-1980s [1]. Interactive applications have changed dramatically over the last 15 years, and so have the data management needs of those apps. Today, three interrelated megatrends Big Data, Big Users, and Cloud Computing are driving the adoption of NoSQL technology. And NoSQL is increasingly considered a viable alternative to relational databases, especially as more organizations recognize that operating at scale is better achieved on clusters of standard, commodity servers, and a schema-less data model is often better for the variety and type of data captured and processed today [2]. NoSQL databases offer various benefits over traditional relational databases such as MySQL, SQL Server or Oracle. These benefits can be listed briefly as follows [2]: Big Users Not that long ago, 1,000 daily users of an application was a lot and 10,000 was an extreme case. Today, with the growth in global Internet use, the increased number of hours users spend online, and the growing popularity of smartphones and tablets, it's not uncommon for apps to have millions of users a day. Big Data Data is becoming easier to capture and access through third parties such as Facebook, D&B, and others. Personal user information, geo location data, social graphs, user-generated content, machine logging data, and sensor-generated data are just a few examples of the everexpanding array of data being captured. 1
More Flexible Schema-less Data Model Relational and NoSQL data models are very different. The relational model takes data and separates it into many interrelated tables that contain rows and columns. Tables reference each other through foreign keys that are stored in columns as well. When looking up data, the desired information needs to be collected from many tables (often hundreds in today s enterprise applications) and combined before it can be provided to the application. Similarly, when writing data, the write needs to be coordinated and performed on many tables. NoSQL databases have a very different model. For example, a document-oriented NoSQL database takes the data you want to store and aggregates it into documents using the JSON format. Each JSON document can be thought of as an object to be used by your application. A JSON document might, for example, take all the data stored in a row that spans 20 tables of a relational database and aggregate it into a single document/object. Aggregating this information may lead to duplication of information, but since storage is no longer cost prohibitive, the resulting data model flexibility, ease of efficiently distributing the resulting documents and read and write performance improvements make it an easy trade-off for webbased applications. Another major difference is that relational technologies have rigid schemas while NoSQL models are schemaless. Relational technology requires strict definition of a schema prior to storing any data into a database. Changing the schema once data is inserted is a big deal, extremely disruptive and frequently avoided the exact opposite of the behavior desired in the Big Data era, where application developers need to constantly and rapidly incorporate new types of data to enrich their apps. Adopted from [2] Scalability Prior to NoSQL databases, the default scaling approach at the database tier was to scale up. This was dictated by the fundamentally centralized, shared-everything architecture of relational database technology. To support more concurrent users and/or store more data, you need a bigger and bigger server with more CPUs, more memory, and more disk storage to keep all the tables. Big servers tend to be highly complex, proprietary, and disproportionately expensive, unlike the low-cost, commodity hardware typically used so effectively at the web/application server tier. NoSQL databases were developed from the ground up to be distributed, scale out databases. They use a cluster of standard, physical or virtual servers to store data and support database operations. To scale, additional servers are joined to the cluster and the data and database operations are spread across the larger cluster. Since commodity servers are expected to fail from time-to-time, NoSQL databases are built to tolerate and recover from such failure making them highly resilient. 2
Because of the reality of new demands in applications, NoSQL database systems are being much preferred. In this experiment it is aimed to employ a very well-known and well-documented mature NoSQL database system named Mongo. MONGODB MongoDB (from "HUMONGOUS") is a cross-platform document-oriented database system. Classified as a "NoSQL" database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster [3]. MongoDB serves the schema-less database design and present those useful features [4]: Document-Oriented Storage JSON-style documents with dynamic schemas offer simplicity and power. Full Index Support Index on any attribute, just like you're used to. Replication & High Availability Mirror across LANs and WANs for scale and peace of mind. Auto-Sharding Scale horizontally without compromising functionality. Querying Rich, document-based queries. Map/Reduce Flexible aggregation and data processing. GridFS Store files of any size without complicating your stack. EXPERIMENT In this experiment you are given a sample person database depicted in Figure 1. In this database there exist some tables: Person, PersonPhone, EmailAddress, Password, Business EntityAddress, Address. Figure 1. Person database 3
These tables are adopted from SQL Server 2008 Adventure Works DB and exported as comma seperatad files format (CSV) for your usage. As can be easily seen at CSV file, some of the fields at tables are removed for simplicity. Currently the fields of tables are listed as below: Person { BusinessEntityID,PersonType,Title,FirstName,MiddleName,LastName,ModifiedDate} PersonPhone { BusinessEntityID,PhoneNumber,PhoneNumberTypeID} Password { BusinessEntityID, PasswordHash, PasswordSalt, ModifiedDate} EmailAddress { BusinessEntityID, EmailAddress} BusinessEntityAddress {BusinessEntityID,AddressID} Address { AddressID, AddressLine1, AddressLine2, City, StateProvinceID, PostalCode} All the tables are connected via primary key BusinessEntityID. Some fields at various fields can contain null values. In CSV files all fields are separated via a column separator character. In this experiment, you are expected to develop an application accessing to a relational database system (You are free to choose a RDBMS in this list: MySQL, SQL Server, PostgreSQL) and MongoDB database system working on this sample database with the following requirements. Please read them carefully before design and implementation. Person table holds 19972 people. First of all you must increase the row number to more than 5.000.000 records by randomly picking the field values and appending to CSV file again by considering unique BusinessEntityID key. You should also do this step for the other tables. (Hint: Develop an application to achieve this) Do not duplicate the same rows. Instead, prefer an original random selection method for generating new records. Then these modified CSV files must be exported to the relational database system and define the primary/foreign key relations at RDBMS. Third, according to NoSQL database paradigm you must create your MongoDB database. This stage requires your research about NoSQL and MongoDB. Note that NoSQL databases are differing than relational ones in terms of db design and usage. You must develop a visual application for querying people by first name, last name, email, city and phone number. Users can select arbitrary numbers of filter. In other words, user can filter records by adding first name = Richard and city = New Jersey or only one filter. However field name will be used only once. Your application must enable users to connect relational DB and Mongo DB with a radio button and user should have the ability to define query terms via text boxes and drop down lists (Dropdown lists will hold field names). If relational DB is selected, your application should connect to RDBMS and run SQL query to retrieve the results in tabular format. If MongoDB is selected then your application must connect to MongoDB server and do the similar operations. Required time (as milliseconds) to complete the querying operations must be written on screen (a label can be used) upon querying. By this way, we can understand the difference of amount of time in querying between relational and NoSQL DB systems for same result set. After querying title, firstname, middlename, lastname, persontype, email,phone number, passwordsalt, addressline1, city and postal code fields must shown to user in 4
a tabular format. At UI side, you can use GridView or DataGrid controls for tabular viewing. Even a large multiline text box for result set viewing is allowed. At implementation stage of your program you are free to select either Java or C#. Microsoft Visual C++ environment is also accepted. Due to selection of your environment you must download and use required MongoDB driver and API. Without an appropriate driver you cannot access MongoDB. This information is also valid for your relational database server access. For instance, if you are using MySQL and C# then you must obtain appropriate drivers. GENERAL RULES First of all, you must learn the theoretical background of NoSQL database paradigm and MongoDB You are free to complete your homework with these defined relational databases (MySQL, SQL Server, PostgreSQL, Oracle) and those defined programming environments (Java, C#, Visual C++). Due to your selection you must use appropriate driver that can be freely downloaded from MongoDB web site. Do not duplicate your original CSV file contents for increasing the number of records. Use random selection and assignment. NOTES: SAVE all your work until the experiment is graded. The assignment must be original, INDIVIDUAL work. You will work as groups. Each group will contain 2 person. SUBMISSONS: The experiment code will be tested under custody of advisor (R.A. Selman Bozkır). Therefore you will not submit the application and your development environment. Advisor will announce the time and place for evaluating your experiment. REFERENCES: 1. http://slashdot.org/topic/bi/sql-vs-nosql-which-is-better/, 17.11.2013 2. http://www.couchbase.com/why-nosql/nosql-database 17.11.2013 3. http://en.wikipedia.org/wiki/mongodb 18.11.2013 4. http://www.mongodb.org/ 18.11.2013 5