MongoDB and Python. Key Ingredients for a Perfect Big Data Recipe WHITEPAPER. Firoz Mohamed Kasim, PMP


WHITEPAPER

MongoDB and Python: Key Ingredients for a Perfect Big Data Recipe

By Firoz Mohamed Kasim, PMP

To discover how GAVS can help you innovate and bring greater value to your business, write to inquiry@gavstech.com or visit www.gavstech.com.

Contents

MongoDB and Python: Key Ingredients for a Perfect Big Data Recipe
Open-source is the way to go for developing Big Data solutions
The elements of a Big Data solution
Leveraging MongoDB to enhance performance and scalability
Implementing Analytics Framework using Python to accelerate time and value efficiencies
Developing a Customized Dashboard Solution with Python
Efficient Data Sourcing with Bubbles
Log Management in Python
Heralding a new direction in Big Data with Open-source software

Abstract

In today's highly connected world, enterprises face exponential growth in the volume of data, in both structured and unstructured formats. Broadly referred to as Big Data, the sheer volume and complexity of this data make it difficult to process with traditional data processing methods. Big Data is useful for companies because it leads to deeper insights through more accurate analyses. As a result, an increasing number of organizations are eager to harness the power of Big Data. However, to derive accurate and actionable insights from the data, best-fit solutions that use cost-effective and agile technologies are required. Innovative open-source products accelerate accessibility and productivity, supporting comprehensive data management and more informed decisions. This paper highlights how open-source technologies such as MongoDB and Python can help achieve a viable, long-term big data solution. MongoDB provides high-performance storage, and Python enables efficient big data analytics with the help of its powerful libraries.

Open-source is the way to go for developing Big Data solutions

Big data analytics has emerged as a key component of the analytics and information management domain. It enables integrated analysis of both structured and unstructured data, and offers powerful insights for making informed decisions and enhancing productivity. However, to derive real business value from big data, the right tools are needed for capturing and organizing data for analysis and for acquiring business insights. Several challenges have to be addressed before deploying an analytics platform using big data. These include selecting the right set of technologies for the diverse needs of the business, integrating myriad data into the platform by synchronizing various data sources, and ensuring easy data accessibility and syndication.

Cost-effective open-source products offer strong capabilities, such as faster time-to-market and advanced technology features, for developing compelling solutions to big data challenges. By leveraging open-source products such as MongoDB and Python, it is easier to perform big data analysis, accelerate strategic decisions and derive business value. An idea can be prototyped with free open-source software within a short span of time and demonstrated to the target business audience. The next section discusses a generalized recipe for an effective big data solution using open-source software.

Core elements of a Big Data solution

A typical big data solution requires a front-end dashboard, an analytics framework that acts as the backbone infrastructure, a data store, and a data sourcing solution. The front-end dashboard displays the results of data crunching, the analytics framework performs in-depth analysis, and a reliable, agile, scalable data store holds the actual data and the processed information.
Another important element of the solution is a reliable, easily replicated channel for sourcing data through Extract, Transform and Load (ETL) processes from transactional applications, social networking sites, mobile platforms, etc.

Leveraging MongoDB to enhance performance and scalability

Various traditional methods and tools can be used for building dashboards, performing analytics, sourcing data from various platforms, and storing a variety of data. However, while building viable big data solutions, it is important to consider the escalating volume of data, which is expanding beyond terabytes into exabytes and zettabytes. The unstructured nature of the data, which may include graphical content, adds another layer of complexity. Data is the main actor in any big data solution, and no enterprise can afford to lose it permanently or even have it temporarily unavailable for processing. This reiterates the need for a reliable, highly available and high-performance storage solution.

A natural choice for a NoSQL store capable of processing high volumes of semi-structured and unstructured data is MongoDB, an increasingly popular cross-platform, document-oriented database that is being adopted across industries. It is free and open-source, allowing for prototyping without any expenditure, while providing easy scalability, high performance and availability. As a NoSQL database, MongoDB is non-relational: it can store data as-is, and a predefined structure is not required. The data is stored as documents made up of key-value pairs.
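
As a rough illustration of what storing data as-is means, the sketch below represents three differently shaped records as Python dictionaries, which is the form in which a driver such as PyMongo accepts documents (for example through `insert_one`). The sample records and field names are assumptions made for illustration, and no database connection is made. The snippet only shows that each record is a set of key-value pairs and that no two records need to share a structure.

```python
import json

# Hypothetical sample records; the sources and field names are illustrative only.
documents = [
    {"source": "crm", "customer": "Acme", "order_total": 1250.0},
    {"source": "twitter", "user": "@example", "text": "Great service!",
     "tags": ["feedback"]},
    {"source": "mobile", "device": {"os": "android", "version": "5.1"},
     "events": 42},
]


def field_names(docs):
    """Return the sorted set of top-level keys used across all documents."""
    names = set()
    for doc in docs:
        names.update(doc.keys())
    return sorted(names)


# Each document serializes to a JSON object, i.e. a set of key-value pairs.
# Values may themselves be lists or nested documents.
serialized = [json.dumps(doc, sort_keys=True) for doc in documents]

if __name__ == "__main__":
    for line in serialized:
        print(line)
    # The three documents share only the "source" field.
    print(field_names(documents))
```

The only field common to all three records is `source`; everything else varies per record, which is exactly the flexibility (and the risk) that the following discussion of structure addresses.
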
However, it is advisable to have a primary structure in place, at least for long-term integrated solutions, so that data is stored in an organized way and can be retrieved effectively. Ming, a Python framework popular in enterprise circles for use with MongoDB, assists in this organized storage of data.

There are some trade-offs to consider when using MongoDB for enterprise solutions. Though MongoDB offers an extremely simple programming interface for handling large volumes of data and scales out horizontally very well, it has traditionally not supported multi-document transactions or relational integrity constraints such as foreign keys. Hence, full ACID behavior across documents could not be assumed (multi-document transactions were introduced only in MongoDB 4.0). Also, without an appropriate plan for storing data, such as one built on the Ming framework, queries can take a very long time to retrieve the right results from enterprise-size databases.

MongoDB deployments commonly use a replication factor of three, which means each piece of data is kept in three copies. This makes the storage highly reliable and available for processing at all times (high availability). Sharding is another feature of MongoDB, in which data is spread across multiple machines to support ever-growing data volumes (performance and scalability). However, sharding requires careful selection of the shard key so that the data spreads evenly across the machines.
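
To make the point about key selection concrete, here is a minimal simulation, assuming a simple hash-and-modulo placement rule. This is not MongoDB's actual sharding algorithm; the function names, the sample data and the four-shard setup are all hypothetical. It only illustrates why a key with many distinct values spreads documents evenly while a key with few distinct values does not.

```python
import hashlib
from collections import Counter


def shard_for(key_value, num_shards):
    """Map a key value to a shard index using a stable hash (illustrative rule)."""
    digest = hashlib.sha256(str(key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribution(docs, key, num_shards):
    """Count how many documents land on each shard for the given key."""
    counts = Counter(shard_for(doc[key], num_shards) for doc in docs)
    return [counts.get(i, 0) for i in range(num_shards)]


# Hypothetical data set: 10,000 orders, 90% of them from a single country.
docs = [{"order_id": i, "country": "IN" if i % 10 else "US"}
        for i in range(10000)]

by_order = distribution(docs, "order_id", 4)   # high-cardinality key
by_country = distribution(docs, "country", 4)  # low-cardinality key

if __name__ == "__main__":
    print("order_id:", by_order)
    print("country: ", by_country)
```

With `order_id` as the key, each of the four simulated shards receives roughly a quarter of the documents. With `country`, at most two shards receive any data and one of them holds at least 90% of it, which is the kind of imbalance a poorly chosen key produces.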

Implementing Analytics Framework using Python to accelerate time and value efficiencies

Python, as a programming language, simplifies the development life cycle. It is easy to learn, its libraries are simple to use, and strong community support makes code easy to adapt. It can also process large amounts of data using simple data structures. R, MATLAB and Octave are other advanced analytical tools in this category with high processing capabilities. Though they are powerful in their statistical libraries, they are less suited to general-purpose programming such as web and server-side programming or graphical interface support. Python, being a general-purpose language, does not disappoint in these aspects.

Python's easy-to-understand syntax emphasizes readability, minimizing the cost of program maintenance. Python supports both structured and object-oriented programming for application development. Libraries like NumPy and SciPy provide utilities for number crunching and scientific applications; Django and Flask provide frameworks suited for web development and deployment. The Python ecosystem also offers libraries for myriad computing functions such as Cryptography, Game Development, Geographic Information System (GIS), GUI programming, Multimedia processing, Image manipulation, Indexing and Searching, Networking, Plotting, Multi-language Processing, etc.

PyMongo is a Python library that contains tools for connecting to and working with MongoDB; it serves as the native Python driver. The Ming framework can be used to channel data from a MongoDB data store for analytical processing. Ming helps enforce schema-based behavior for documents obtained from the MongoDB data store within Python applications.
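
The idea of enforcing schema-based behavior on otherwise free-form documents can be sketched in plain Python. The code below is not Ming's API; the schema layout, the `validate` helper and the `SchemaError` class are hypothetical names chosen for illustration. It only shows the general technique of checking a document against a declared set of fields and types before it reaches the analytics code.

```python
from datetime import datetime

# Hypothetical schema: field name -> (expected type, required?)
ORDER_SCHEMA = {
    "order_id": (int, True),
    "customer": (str, True),
    "total": (float, True),
    "created": (datetime, False),
}


class SchemaError(ValueError):
    """Raised when a document does not match the declared schema."""


def validate(doc, schema):
    """Return a copy of doc restricted to schema fields, or raise SchemaError."""
    unknown = set(doc) - set(schema)
    if unknown:
        raise SchemaError(f"unexpected fields: {sorted(unknown)}")
    cleaned = {}
    for name, (expected_type, required) in schema.items():
        if name not in doc:
            if required:
                raise SchemaError(f"missing required field: {name}")
            continue  # optional field that is absent
        value = doc[name]
        if not isinstance(value, expected_type):
            raise SchemaError(
                f"field {name!r} should be {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
        cleaned[name] = value
    return cleaned


if __name__ == "__main__":
    good = {"order_id": 7, "customer": "Acme", "total": 99.5}
    print(validate(good, ORDER_SCHEMA))
    try:
        validate({"order_id": "7", "customer": "Acme", "total": 99.5},
                 ORDER_SCHEMA)
    except SchemaError as exc:
        print("rejected:", exc)
```

A document with the wrong type for `order_id` is rejected before any analysis runs, which is the kind of guarantee a schema layer adds on top of a schemaless store.
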
Developing a Customized Dashboard Solution with Python

Python frameworks such as Flask, Django or Pyramid can be used to create front-end dashboards. Django Dash is a customizable, modular dashboard application framework that allows users to create bespoke dashboards. Flask can be employed to develop dashboards from scratch, and Flask-based dashboard solutions can power interactive visualization and reporting.

Efficient Data Sourcing with Bubbles

Now that front-end dashboard, analytical processing framework and data storage options are available, let us turn our attention to how data is sourced into a MongoDB data store from external applications or databases. In the traditional computing sphere, various tools perform this Extract, Transform and Load (ETL) process from multiple sources. Going the traditional route, open-source tools like Pentaho Data Integration and Talend Big Data Studio fit the bill. While these tools have their own advantages and disadvantages, Python also provides ETL frameworks that rely on metadata for data sourcing, such as Bubbles. Bubbles provides abstract data objects that represent sources such as CSV files, SQL tables, MongoDB collections, Twitter API objects, etc.

Log Management in Python

A necessary feature, even for a prototyping project, is effective log management. Logs are essential for tracking events that occur in an application. Error, warning and informational messages enable debugging in the event of failures. Runtime exceptions that prevent code from executing can be investigated only if logs are maintained in persistent storage. Python enables logging at various levels, such as information, debug, warning and error, through its logging module. Numerous open-source monitoring tools, typically referred to as log aggregators, such as Sentry, Graylog2 and Scribe, can also be used for log management. Raven is an open-source Python client for Sentry. Graylog2 has a graphical interface for searching through log events and has libraries for major languages including Python.
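
The data sourcing and logging steps described above can be sketched together using only the standard library. This is a minimal illustration, not Bubbles' API: the `extract`, `transform` and `load` helpers, the sample CSV and the logger name are hypothetical, and an in-memory list stands in for the MongoDB collection. Log records are written to an in-memory stream here so the example is self-contained; for persistent storage, `logging.FileHandler` could be used in place of `StreamHandler`.

```python
import csv
import io
import logging

# Send log records to an in-memory stream (swap in logging.FileHandler
# with a file path to keep logs in persistent storage).
log_stream = io.StringIO()
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
logger = logging.getLogger("bigdata.etl")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.propagate = False

# Hypothetical source data; the second row has a malformed amount.
RAW_CSV = "customer,amount\nAcme,120.50\nGlobex,not-a-number\nInitech,75\n"


def extract(text):
    """Extract: read CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert amounts to float, skipping and logging bad rows."""
    docs = []
    for row in rows:
        try:
            docs.append({"customer": row["customer"],
                         "amount": float(row["amount"])})
        except ValueError:
            logger.warning("skipping row with bad amount: %r", row)
    return docs


def load(docs, store):
    """Load: append documents to the store (a list standing in for a collection)."""
    store.extend(docs)
    logger.info("loaded %d documents", len(docs))


store = []
load(transform(extract(RAW_CSV)), store)

if __name__ == "__main__":
    print(store)
    print(log_stream.getvalue())
```

The malformed row is skipped with a warning rather than aborting the run, and the informational record states how many documents were loaded, so a failed or partial load can be investigated afterwards from the log alone.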

Fig. 1: Integration of the various components of the Big Data solution discussed above. Components shown: Dashboard (Flask, Django, Pyramid, Pyxley); Analytics Framework & General-purpose Application Features (Python programs); Log Management (Logging); Data Store (MongoDB); ETL (Bubbles); External Data Sources (Internet, databases).

Heralding a new direction in Big Data with Open-source software

Open-source software such as MongoDB and Python aims to bring agility, speed and flexibility to the software development process, thus revolutionizing the way ideas are transformed into marketable solutions. Such software heralds a new direction in the Big Data arena by accelerating the maturity of the ecosystem. In the near future, we can expect these complex custom solutions to be developed using graphical plug-and-play architectures with ready-to-use, off-the-shelf, open-source components requiring zero or minimal configuration tweaks.

About the Author

Firoz Mohamed Kasim works as a Project/Program Manager at GAVS Technologies Pvt. Ltd., Chennai. He is a certified Project Management Professional (PMP) with around 1 years of experience in the software sector. He also has ITIL Foundation and FLMI LOMA certifications to his credit. His interests include exploring new technologies and software products, showcasing architecture feasibility using new technologies, mobile app development, etc.

About GAVS

GAVS Technologies (GAVS) is a global IT services & solutions provider for customers across multiple industry verticals. GAVS offers services and solutions aligned with strategic technology trends to enable enterprises to take advantage of futuristic technologies such as Cloud, IoT, Managed Infrastructure, and Security services. GAVS has been recognized as an emerging player in the Healthcare Provider IT outsourcing sector by Everest Group, and as a prominent India-based Remote Infrastructure Management player by Gartner.

Offices (USA, UK, Middle East, India):

GAVS Technologies N.A., Inc, 10901 W 120th Avenue, Suite 110, Broomfield CO 80021, USA. Tel: +1 303 782 0402 Fax: +1 303 782 0403
GAVS Technologies (Europe) Ltd., 3000 Hillswood Drive, Hillswood Business Park, Chertsey KT16 ORS, United Kingdom. Tel: + 44 (0) 1932 79664
GAVS Technologies LLC, Office No. 11, Bldg No : 4, Knowledge Oasis Muscat, Rusayl, Sultanate of Oman. Tel: +968 24449301
GAVS Technologies N.A., Inc, 116 Village Blvd, Suite 200, Princeton, New Jersey 0840, USA. Tel: +1 609 91 226/7 Fax: +1 609 20 1702
GAVS Technologies Pvt. Ltd., No.11, Old Mahabalipuram Road, Sholinganallur, Chennai, India - 600 119. Tel: +91 44 6669 4287
GAVS Technologies, P.O.Box : 12419, Office no 202, Al Thuraiya Tower 1, Dubai Internet City, Dubai, UAE. Tel: +971-4-441234