CASE STUDY OVERVIEW

Educational Collaborative Develops Big Data Solution with MongoDB

INDUSTRIES: Education, Nonprofit
LOCATION: Durham, NC
PROJECT LENGTH: 1 year, 5 months
APPLICATION SUPPORTED: Data-driven software providing tools to enhance the quality of education

"Education is the most powerful weapon which you can use to change the world." - Nelson Mandela

The Project

In early 2012, Mammoth Data began working with a non-profit educational collaborative. The organization was eager to develop a robust application that would facilitate a more fluid collection and analysis of data for educators. The project consisted of a set of services designed to provide access to a unified database for student data across the United States.

Initially, Mammoth Data provided project management services on several of the smaller projects within the overall development umbrella. Following these successful contracts, Mammoth Data was asked to join the customer's new in-house development team in taking over development of the product. The product consisted of an extensive RESTful API developed in Java/Spring and backed by a MongoDB data store, with management, data access, data manipulation, and other applications implemented in Java, Ruby, FreeMarker, and other languages.

Mammoth Data was not involved at the beginning of this project, so some time was spent bringing the existing codebase up to Mammoth Data's standards for architecture, code quality, build infrastructure, and documentation.

The project ran on industry-standard two-week iterations. Mammoth Data managed one scrum team and worked closely with the client as they brought some of the work in-house and created an internal scrum team. Mammoth Data diligently followed the established development practices and maintained clear communication throughout that transition and for the duration of the project.
TECHNOLOGIES & TECHNIQUES

- Test-driven development
- Behavior-driven development with Cucumber
- Continuous integration with Jenkins
- Peer-reviewed code with Review Board and Crucible
- Issue tracking with Rally and JIRA
- Agile/Scrum sprints
- Revision control with Git

Accomplishments

Instructions and Installation

Mammoth Data refined and fixed installation scripts while writing new instructions for the setup of these components, which included multiple Java application servers, Rails servers, a JMS broker, a MongoDB database, and a Liferay installation. New developers who joined the project were able to reduce their onboarding and development-environment setup time from two days to three hours by using Mammoth Data's new scripts and instructions. These included bash scripts that would download and install portions of the environment, as well as scripts that automated routine tasks such as emptying the test database and re-importing a clean set of data when switching between tasks.

When Mammoth Data began working on this project, there was no documentation on how to build or test the system. The Mammoth Data team assessed the actions required to bring a whole system up and running while documenting those steps, along with others useful for maintaining a working local system. As part of this effort, a markdown file was created to document how to build the system; it was later publicly released along with the code when the system was open sourced.

Language Updates

We updated the project from Ruby 1.9.2 to Ruby 1.9.3 and then to Ruby 2.0.0, and from Java 1.6 to Java 1.7, identifying and correcting compatibility issues in project code, tests, and the included libraries.
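The database-reset automation described above can be sketched as a short bash script. The database name, fixture path, and exact commands here are illustrative assumptions, not the project's actual scripts:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a reset script like those described above: drop the
# test database, then re-import a known-clean fixture dump. All names and
# paths are illustrative assumptions, not the project's actual code.
set -euo pipefail

DB_NAME="${DB_NAME:-app_test}"
FIXTURE_DIR="${FIXTURE_DIR:-fixtures/clean}"

reset_test_db() {
  mongo "$DB_NAME" --quiet --eval 'db.dropDatabase()'   # empty the test database
  mongorestore --db "$DB_NAME" --drop "$FIXTURE_DIR"    # re-import clean data
}

# Only touch a real server when the mongo shell is available; otherwise
# report what would happen, so the script is safe to dry-run.
if command -v mongo >/dev/null 2>&1; then
  reset_test_db
else
  echo "mongo not found; would reset '$DB_NAME' from '$FIXTURE_DIR'"
fi
```

A script along these lines, wired into the onboarding instructions, lets a developer return to a known database state in seconds when switching between tasks.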
Security

Through code analysis and extended regression tests, we discovered multiple defects of varying severity that allowed application clients to execute functions against privileged and security-related entities that should not have been permitted. We designed, implemented, and tested appropriate solutions to close these issues. Additionally, we found that improperly sanitized API parameters allowed clients to corrupt the database by wiping out entity metadata with blank posts; we disallowed such posts in all of the services we modified. To prevent these issues from recurring, we implemented new regression tests that attempted each of the forbidden posts and accesses, verified that the system did not permit them, and confirmed that events were logged appropriately.

To expose security events through the REST API, Mammoth Data created a new resource (controller) and a new service to capture the data from MongoDB. Because the data was in a format similar to the rest of the entities, it was easy to extend and reuse existing code for the database connections, which also simplified retrieving representative objects. A new right was added, using the existing security infrastructure, to verify that only approved users could view the security data.
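The blank-post defect can be illustrated with a minimal sketch in the project's primary language. The class and method names here are hypothetical, not the project's actual code; the idea is a guard that rejects update bodies whose metadata values are all missing or blank, of the kind the new regression tests exercised:

```java
import java.util.Map;

// Hypothetical sketch of the kind of guard added to the modified services:
// reject POST bodies that would wipe out an entity's metadata with blank
// values. Names and shapes are illustrative assumptions only.
public final class MetadataGuard {

    private MetadataGuard() {}

    /**
     * Returns true when the body is missing, empty, or every metadata value
     * is null or blank - i.e. accepting the request would erase existing
     * metadata, so the service should reject it.
     */
    public static boolean isBlankUpdate(Map<String, String> metadata) {
        if (metadata == null || metadata.isEmpty()) {
            return true;
        }
        return metadata.values().stream()
                .allMatch(v -> v == null || v.trim().isEmpty());
    }
}
```

A service layered on a check like this would reject the request and log a security event, matching the regression tests described above that attempt the forbidden posts and verify the system's response.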
REST API

We performed various upgrades to the REST API to expose new entities and MongoDB collections for client consumption, including data import job status reports and security event logs. These new implementations augmented the project's use of Spring Data for MongoDB.

"This project was a unique opportunity to secure the importance of big data in our nation's future, while also providing a seamless solution to increase the quality of our education." - DREW NELSON, CONSULTANT, MAMMOTH DATA

We added API endpoints for ingestion data (ingestion is the process used to import data from another source into the system). In the system, the API and Ingestion were separate modules, and each had implemented its own communication with MongoDB. We created new Spring Data MongoDB DAOs, services, and controllers to provide better access to the ingestion data, which unified the data access layer; the only component that could be reused was the model. This change also required a new right to be implemented.

Databrowser

We adapted a Ruby on Rails application that provided a raw view of the data to summarize it and provide an administrator-focused view with numerous UI refinements. The administrator-focused view required extensive visual and functional changes. We designed our extension of the tool to serve this new audience while maintaining its original function of representing the REST API and MongoDB data in visual form.

A fair number of the changes cleaned up the UI. These were mostly text changes that explained what was going on behind the scenes and provided more appropriate data as required. For example, the columns in tables of displayed data were inconsistent across different data types, and even across the same data type displayed in different locations; Mammoth Data standardized the column headers across the entities viewed in the system. Another addition was pagination.
Previously, a maximum of 50 entities would ever be displayed for a particular entity type, with no way to see any entities beyond that point. A new pagination system was built into the Databrowser to synchronize the display with the retrieval calls to the API. This allowed the display to page through all of the entities made available by the API while returning only the specified number of entities at a time for performance reasons. All of this was made configurable through yml files used throughout the Databrowser product.

Various tables and counts were also added to the Databrowser UI. These tables gave statistical information about the entities associated with the specific entity being viewed. This data only made sense for certain entities, so controls were added to ensure the tables were displayed only in appropriate locations. Counts were also placed next to various links throughout the system to show the number of entities residing below those links.

Data Schema Analysis

We identified inefficiencies in the data model that were an artifact of the project's prior transition from an RDBMS to MongoDB. We found that the existing implementation used associative tables between collections instead of sub-documents, which are more
performant for NoSQL. The client's API exposed these association tables, introducing additional dependency on them. Mammoth Data recommended changes to the data model to bring it in line with standard best practices for a document database, and implemented an interim solution at the client level to fulfill our feature requirements.

Test Coverage

All new code was covered by unit and/or integration tests, and we uncovered and corrected broken or skipped tests. At each step along the way, new tests were created and existing tests were modified to evaluate the new functionality being added, using JUnit and Cucumber/Gherkin. Each requirement in the work being done was tested to verify its viability within the system.

We developed a process in which appropriate tests were designed for each relevant piece of code and run locally. Once each test passed locally, a module-level test run verified there was no impact to the module and other relevant sections of code. After everything passed locally, the full suite of integration tests ran in a Jenkins environment. While those tests ran, the modified code was placed into a review system for peer review; changes at this stage were generally about ensuring appropriate comments were in place and that there were no glaring issues that would hurt performance or introduce bugs. Upon successful completion of the Jenkins tests and code review, the code was pushed to a Release Candidate environment, which housed the system on multiple servers matching production as closely as possible. At this point, another full suite of tests was administered. While the code was in the Release Candidate environment, a demo would be presented to the client.
The demo showed what had been accomplished and verified that the functionality and presentation met the client's expectations. Only after all of these steps were completed, and the code had passed the necessary tests while meeting the client's requirements, would the code be merged into the master branch. After the merge, a full set of Jenkins tests ran on the master branch to verify that no issues had been introduced. Additionally, at the end of every sprint cycle, master was pushed to the Release Candidate environment and the full suite of tests was run there to ensure that master was ready for release. The operations and release management team would then carry out the release process.
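The Cucumber/Gherkin layer mentioned above expresses requirements of this kind as plain-language scenarios backed by step definitions. A hypothetical sketch, tying back to the security-event right described earlier (the endpoint path and right name are illustrative assumptions, not taken from the project's actual suite):

```gherkin
Feature: Security event access
  Only users holding the security-event right may read security events

  Scenario: An unprivileged client is denied
    Given a client authenticated without the security-event right
    When the client requests the security events resource
    Then the request is denied
    And a security event is logged for the denied access
```

Scenarios in this style double as regression tests: each forbidden access is attempted on every run, so a reintroduced defect fails the build in Jenkins.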
ABOUT US

Mammoth Data is a Big Data consulting firm specializing in Hadoop, NoSQL databases, and designing modern data architectures that enable companies to become data-driven. By combining cutting-edge technologies with a high-level strategy, we are able to craft systems that capture, organize, and turn unstructured information into real business intelligence. Mammoth Data was founded as Open Software Integrators in 2008 by open source software developer, evangelist, and now president Andrew C. Oliver. Mammoth Data is headquartered in downtown Durham, North Carolina.

MAIN OFFICE
345 W. MAIN ST., SUITE 201
DURHAM, NC 27701
(919) 321-0119
info@mammothdata.com
@mammothdataco
mammothdata.com