CASE STUDIES OF SUCCESSFUL APPLICATIONS OF BIG DATA IN THE INDUSTRY

Sai Harinya Turaga 1207065245
Ramya Meruva 1206988844
Tejaswini Kantheti 1207053558
Introduction

Big data is the name given to data sets so large and complex that they cannot be processed by traditional database management systems. The trend toward such data sets comes from the additional information that can be derived by analyzing a single large set of related data, rather than multiple smaller sets holding the same total amount of data. This analysis helps in identifying business trends, improves the quality of research, and allows business organizations to make better decisions. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization [1].

Definition [2]

In 2012, Gartner defined big data as "high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."

Big Data in Industry

As the amount of big data generated increased, so did the need for information management. Companies such as Microsoft, IBM, HP, Dell, and Oracle Corporation have spent more than $15 billion on software firms specializing in data management and analytics. In 2010, this industry on its own was worth more than $100 billion and was growing at almost 10 percent a year: about twice as fast as the software business as a whole [3].

Big data software: Apache Hadoop, MongoDB, Splunk

APACHE HADOOP

Definition [4]

Apache Hadoop is open-source software for reliable, scalable, distributed computing. It is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. Apache Hadoop has the following modules [5]:

Hadoop Common - contains the libraries and utilities needed by other Hadoop modules
Hadoop Distributed File System (HDFS) - a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications
Hadoop MapReduce - a programming model for large-scale data processing

As Hadoop evolved, many companies started using it for both research and production purposes. More than half of the Fortune 50 companies use Hadoop, including Yahoo! and Facebook.

Case Study: Yahoo!

Yahoo is not only the first large-scale user of Hadoop but also a tester and a contributor. Fig 1 [6] represents the volumes of data generated each year since 2006 and also the internal Hadoop usage. Figure 1

Initially, Yahoo used Hadoop to speed up the indexing of web crawl results for its search engine, though its search results today come entirely from Microsoft's Bing. Yahoo's relationship with Hadoop today, according to Scott Burke, senior VP of advertising and data platforms, is that the company is still betting on it. "Yahoo remains committed to Hadoop, more than ever before. It's the only platform we use globally. We have the largest Hadoop footprint in the world," said Burke in his keynote address at the Hadoop Summit 2012 [4]. Fig 2 [6] represents statistics before and after Yahoo started using Hadoop.
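To make the MapReduce programming model listed among the modules above concrete, here is a minimal word-count sketch in pure Python. This is a conceptual illustration only, not Hadoop code: a real Hadoop job would implement Mapper and Reducer classes against the Hadoop Java API, with the shuffle performed by the framework across the cluster.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input document
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["hadoop"])  # 2
```

Because each map and reduce call depends only on its own inputs, the framework can run thousands of them in parallel across a cluster, which is what lets Hadoop scale to the data volumes described in this section.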
Architecture

Currently, Yahoo runs Hadoop on 42,000 servers in 1,200 racks spread across four data centers. The largest cluster has 4,000 nodes, and after the release of Apache Hadoop 2.0, Yahoo plans to grow it to 10,000 nodes. Fig 3 [6] represents the architecture of Hadoop. Figure 3

Usage of Hadoop in Yahoo

Today, Hadoop has become an integral part of how Yahoo does business. Yahoo uses Hadoop to block spam messages that try to get into its email servers, which has elevated its spam detection capabilities: on average, Yahoo blocks around 20.5 billion spam messages per day. Hadoop sits at the bottom of the powerful software stack inside Yahoo that powers the personalization Yahoo aims to give its end-users. Hadoop already holds the web history of end-users from their previous visits, so if you search for sports or finance news, you'll get a high-value package that reflects your interests. To achieve this kind of personalization, Yahoo uses a combination of automated analysis and human editors to define packages. If there were no manual involvement, everyone would get packages about celebrities, which draw the highest traffic on the site [7]. Instead of using Hadoop as a stand-alone application, Yahoo uses it as an information foundation for an Oracle database system that pulls presorted, indexed data out and feeds it into a Microsoft SQL Server cube for highly detailed analysis. The resulting data is displayed in
either Tableau or MicroStrategy visualization systems for Yahoo business analysts, who in turn use it to advise advertisers on how their campaigns are faring soon after launch [8]. Yahoo can tell advertisers the kind and percentage of people who respond best to their campaigns, so advertisers can adjust their ads to appeal to a wider audience. This leads to improvements in search results and response time, and also lifts the business. Yahoo provides advertisers with significant data such as who is viewing their advertisement, how much time they are spending on it, and what they do on the site after reading it. This kind of data is very valuable and helps advertisers plan and improve their future advertising strategies, leading to better decisions and ultimately better business. Fig 4 [6] depicts the Yahoo webpage and shows where Hadoop is actually being utilized. Figure 4

In addition to all of this, Yahoo developed a set of open-source code written in JavaScript called Cocktails. Cocktails increase data availability to advertisers by providing them with tools that extract data directly from Yahoo's Hadoop. Manhattan is the server-side cocktail, while Mojito is the client-side one; to get the information an advertiser is seeking, a cocktail can run the same reporting code on either a server or an end-user's laptop, respectively. The reporting system runs on a Hadoop-Oracle-Pentaho Mondrian (open-source code for building multi-view cubes) stack [9]. Advertisers can see all the valuable data Yahoo gathered about their customers; if this takes hold, there is a real possibility that Cocktails could replace PHP in the business. Apart from all this, Yahoo also uses Hadoop for internal analysis of the data collected from user interactions, which accumulates to 140 petabytes in Hadoop.
Since Hadoop maintains three replicas of every data set, over 400 petabytes of storage are required to maintain the system. Yahoo adopted Hadoop in its earliest form, nurtured it, open-sourced it, benefited from it, and is still betting its business on it by continuing to develop it. The company is confident that its commitment to Hadoop will keep Yahoo at the front of its business initiatives.
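The 400-petabyte figure above follows directly from HDFS's default replication factor of three: raw capacity must cover roughly three times the logical data size. A quick sketch of the arithmetic (the 140 PB figure comes from the text; the helper function is illustrative):

```python
def raw_storage_needed(logical_pb, replication_factor=3):
    # HDFS stores `replication_factor` copies of every block,
    # so raw capacity is the logical data size times that factor.
    return logical_pb * replication_factor

# Yahoo's 140 PB of user-interaction data at the default factor of 3:
print(raw_storage_needed(140))  # 420 PB of raw storage
```

The replication factor is configurable per file in HDFS, so the 3x overhead is a default trade-off between storage cost and fault tolerance, not a fixed property of the system.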
In an interview [9] with InformationWeek, Scott Burke said Yahoo is becoming an expert at capitalizing on the big shift from the offline analysis model to something much closer to a predictive model: "Get a response in front of the customer with the right offer at the right time." To achieve this goal, Yahoo has had to implement "science at scale," that is, invest in an open-source system that seemed to have broad potential but wasn't yet proven in prime time. "We've bet the business on this platform on a global scale," Burke says, "and there's no turning back."

MongoDB [10]

MongoDB is a cross-platform, document-oriented database system. Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas, making the integration of data in certain types of applications easier and faster. MongoDB is free and open-source software, and it addresses the online (operational) side of big data. The main features of MongoDB are:

1. Ad-hoc queries: MongoDB supports field queries, range queries, and regular expression searches. Queries return the matching documents and can also execute JavaScript functions.
2. Replication: MongoDB provides replication, which increases throughput and availability. A replica set consists of a primary replica, which performs the read and write operations on the data, and one or more secondary replicas, which maintain copies of the data in the background. If the primary becomes unavailable, the replica set automatically holds an election to decide which secondary becomes the new primary, and read and write operations are then performed by that elected replica.
3. Indexing: Any field in a MongoDB document can be indexed; the indices are similar to those in a relational database management system.
4.
File storage: MongoDB also provides a file storage mechanism, using the replica system to store files on different machines with load balancing. GridFS, MongoDB's file storage specification, is exposed through functions in the supported driver languages that help developers manipulate files. Distributing files over multiple machines in this way also provides fault tolerance.
5. Aggregation: MongoDB provides a MapReduce facility used for batch processing of data and for aggregation operations. Aggregation is used where a GROUP BY clause would be used in SQL to obtain results.
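The aggregation feature described above is MongoDB's analogue of SQL's GROUP BY. Against a running mongod one would use the aggregation pipeline's $group stage; the grouping it performs can be sketched in plain Python over JSON-like documents (the order data below is invented for illustration):

```python
from collections import defaultdict

# JSON-like documents, in the shape MongoDB would store them
orders = [
    {"item": "pizza", "qty": 2},
    {"item": "soda",  "qty": 1},
    {"item": "pizza", "qty": 3},
]

def group_and_sum(docs, key, field):
    # Equivalent in spirit to the aggregation pipeline stage
    # {"$group": {"_id": "$item", "total": {"$sum": "$qty"}}}
    totals = defaultdict(int)
    for doc in docs:
        totals[doc[key]] += doc[field]
    return dict(totals)

print(group_and_sum(orders, "item", "qty"))  # {'pizza': 5, 'soda': 1}
```

In SQL the same result would come from SELECT item, SUM(qty) FROM orders GROUP BY item; MongoDB executes the equivalent $group stage server-side across the collection.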
How do MongoDB and Hadoop work together for typical big data? [11]

There are several ways in which Hadoop and MongoDB can be combined to fit a typical big data workload. Two of them are described briefly here:
1. Batch aggregation
2. Data warehouse

Batch aggregation: In many cases, data analysis can be done with the built-in aggregation functionality of MongoDB; for complex data, however, Hadoop substitutes for MongoDB and provides a powerful framework for complex data analysis. In this scenario, the data to be processed is taken from MongoDB and sent to Hadoop; after processing in Hadoop, the results are sent back to MongoDB for ad-hoc analysis. Applications built on top of MongoDB can then use this information and present it to end-users.

Data warehouse: Hadoop acts as a data warehouse in which large amounts of data, taken from different data sources, are stored.
In this scenario, the data stored in MongoDB can be pulled via MapReduce jobs and stored in Hadoop, where data from other sources is also available. Data analysts most commonly use Pig and MapReduce to create jobs that query these larger data sets, incorporating the data from MongoDB.

Case study [12]

MongoDB is an open-source, document-oriented NoSQL database written in C++. It has a large number of customers, including:
1. MTV
2. Forbes
3. Craigslist
4. Intuit
5. Rangespan
6. Foursquare
7. CustomInk

A closer look at one of these, MTV Networks:

MTV Networks: MTV Networks operates different web properties such as thedailyshow.com, gametrailers.com, and nick.com. MTV chose MongoDB as a data repository for its document data model and flexible schema.

The problem faced by MTV Networks: MTV's web properties were built on a Java-based content management system that suited their data model. When they planned to migrate to a new relational model for content management, the variety of their web properties turned the migration into a roadblock. So MTV began looking for a database solution with a flexible data model.

Solution: MongoDB overcame this problem by allowing MTV to store hierarchical data without needing queries to build pages. A MongoDB schema could be designed around the structures and data elements of each brand. MongoDB's rich queries and secondary indexing, which the alternatives MTV evaluated did not provide, were a further advantage: documents carry indices that allow fast access of
data, and MTV also found that MongoDB's ability to query nested data helped overcome the problem above.

Results: With MongoDB, MTV found a solution to its database problem and met the needs of its web properties. In this way MongoDB enables online big data: applications serve all end-users without processing delays, even as new data is created all the time, delivering modern application performance.

SPLUNK

Splunk is an American multinational corporation headquartered in San Francisco, California. Splunk began as a tool for troubleshooting IT issues by poring through log files, but it now produces software for monitoring, searching, and analyzing machine-generated big data through a web-style interface [13]. Splunk gives IT professionals a way to log any piece of information and quickly index it, find it, and run a number of analytics functions on it [14]. Splunk co-founder and CTO Erik Swan says Splunk is even used as a replacement for, or at least a complement to, Hadoop in certain instances. Hadoop might be great for social-networking data and creating a social graph; Splunk is ideal for time-related tasks like monitoring profile changes. Web sites such as Facebook, Myspace, and Zynga use Splunk to analyze operational data [15].

Customers [16]

Some of the customers of Splunk are:
Domino's Pizza: uses Splunk to strengthen its online business through customer interaction and understanding
T-Mobile: uses Splunk to search across all their log sources, run ad hoc reports, visualize all their IT data, secure the infrastructure, and more
Hortonworks: uses both Hadoop and Splunk, but mainly uses Splunk because it makes getting data easy and fast
Apollo: uses Splunk for security information and a faster search engine
Cricket Communications: uses Splunk to monitor the performance of an in-house middleware environment and to gain end-to-end visibility into various critical infrastructures
Case Study: Domino's Pizza transforms e-commerce with Splunk

Domino's is a world leader in pizza delivery. Founded in 1960, it reached $7.5 billion in sales by 2012 and comprises more than 10,000 corporate and franchised stores in US and international markets. It has an online ordering system and apps for Kindle and iPhone. Domino's first used Splunk in 2009, initially because it needed a solution to aggregate and analyze logging data from its operating systems (Linux and Solaris) and middleware in a timely manner. The Information Security team used HP ArcSight for log aggregation, but Splunk offered the following advantages [17]:
Faster and easier searches in Splunk
Real-time insights
Better reporting with Apache access logs
Much faster alerting in Splunk
Cost and scalability
Ease of deployment

Splunk usage at Domino's [17]

The range of uses from monitoring to business insights includes:
What is selling, orders per minute, coupon usage, etc.
Online ordering trends, efficiency of marketing promotions
Splunk provides answers 24-48 hours before the data warehousing tools do
Significant reduction in troubleshooting time
Streamlined developer insight into debugging development code

Architecture

Splunk is a high-performance, scalable software server written in C/C++ and Python. It indexes and searches logs and other IT data in real time, and deals with data generated by any application, server, or device. The Splunk Developer API is accessible through REST, SOAP, or the command line. After downloading, installing, and starting Splunk, you'll find two Splunk server processes running on your host: splunkd and splunkweb. splunkd is a distributed C/C++ server that accesses, processes, and indexes streaming IT data, and also handles search requests. splunkd processes and indexes your data by streaming it through a series of pipelines.
Pipelines are single threads inside the splunkd process, each configured with a single snippet of XML. Processors are individual, reusable C/C++ or Python functions that act on the stream of IT data passing through a
pipeline. Pipelines can pass data to one another via queues. splunkd supports a command-line interface for searching and viewing results. splunkweb is a Python-based application server providing the Splunk Web user interface. It allows users to search and navigate IT data stored by Splunk servers and to manage their Splunk deployment through the browser. splunkweb communicates with the web browser via REST and with splunkd via SOAP [18].

Splunk at Domino's today

Splunk is deployed across two data centers (live and failover), with four different production environments. 25-40 GB of data is indexed per day, and there are a dozen unique users per month.
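The pipeline-and-processor design described above can be mimicked in miniature: processors are small, reusable functions applied in sequence to each event flowing through a pipeline. This is a toy Python simulation of the concept, not Splunk's actual C/C++ implementation, and the processor names and log format are invented for illustration:

```python
def parse_timestamp(event):
    # Processor: split a leading ISO timestamp off the raw log line
    ts, _, message = event["raw"].partition(" ")
    event.update(timestamp=ts, message=message)
    return event

def tag_errors(event):
    # Processor: flag events whose message mentions an error
    event["is_error"] = "ERROR" in event["message"]
    return event

def run_pipeline(raw_lines, processors):
    # Pipeline: stream each event through every processor in order,
    # the way splunkd pushes IT data through its configured pipelines
    for line in raw_lines:
        event = {"raw": line}
        for processor in processors:
            event = processor(event)
        yield event

logs = ["2024-01-01T10:00:00 ERROR payment timeout",
        "2024-01-01T10:00:01 INFO order placed"]
events = list(run_pipeline(logs, [parse_timestamp, tag_errors]))
print(sum(e["is_error"] for e in events))  # 1
```

In real splunkd each pipeline runs as its own thread configured by an XML snippet, and queues connect pipelines so stages can run concurrently; the sketch above keeps only the data-flow idea.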
Teams using Splunk: the Site Reliability team, InfoSec, and web developers.
Splunk apps in use: Deployment Monitor, Google Maps [19]

Results with Splunk: Domino's Splunk environment

Before Splunk:
Gathering logs manually
Sifting through aggregated Java messages from middleware (grep)
Reactive

After Splunk:
"A million times easier with Splunk"
Proactive alarms alerting on dips in sales
Baselining and trending

Splunk for operational analysis of payment processing:
Measuring response time for various order channels
Instant analysis of cash vs. credit card ordering performance
Troubleshooting card processor issues

Splunk for GEO sales tracking: Splunk APIs integrate with Domino's Java-based GEO sales tracking applications to monitor sales by region. Domino's has been able to identify ISP outages in certain regions.

Splunk for Domino's marketing:
Before Splunk: someone struggled with data and numbers daily
After Splunk: automated information, reports submitted to the leadership team including the CIO and CEO, and monitoring of promotion success in real time

Splunk at Domino's in the future:
Create real-time dashboards for any department to view
Use Splunk for more key performance analysis
Expand the Splunk apps deployment: Linux and Unix monitoring, the VMware app, F5 integration
Optimize middleware application logs for Splunk consumption
Start to leverage Splunk to monitor corporate applications built on the stack [19]
References

[1] http://en.wikipedia.org/wiki/Big_data
[2] http://hadoop.apache.org/
[3] http://en.wikipedia.org/wiki/Apache_Hadoop#Papers
[4] http://www.informationweek.com/database/yahoo-and-hadoop-in-it-for-the-longterm/d/d-id/1104866
[5] http://www.informationweek.com/database/yahoo-and-hadoop-in-it-for-the-longterm/d/d-id/1104866
[6] http://www.slideshare.net/ydn/hadoop-yahoo-internet-scale-data-processing
[7] http://www.informationweek.com/database/yahoo-and-hadoop-in-it-for-the-longterm/d/d-id/1104866
[8] http://www.informationweek.com/database/yahoo-and-hadoop-in-it-for-the-longterm/d/d-id/1104866
[9] http://www.informationweek.com/database/yahoo-and-hadoop-in-it-for-the-longterm/d/d-id/1104866
[10] http://en.wikipedia.org/wiki/MongoDB
[11] http://docs.mongodb.org/ecosystem/use-cases/hadoop/
[12] http://www.mongodb.com/customers/mtv-networks
[13] http://en.wikipedia.org/wiki/Splunk
[14] http://venturebeat.com/2011/02/08/splunk-seattle-office-opens/
[15] http://gigaom.com/2010/12/17/how-splunk-is-riding-it-search-toward-an-ipo/
[16] http://www.splunk.com/view/sp-caaah92
[17] http://pt.slideshare.net/splunk/splunklive-detroit-april-2013-dominos-pizza#btnnext
[18] http://docs.splunk.com/Documentation/Splunk/6.0.1/Installation/Splunksarchitectureandwhatgetsinstalled#Splunk_and_Windows_in_safe_mode
[19] http://pt.slideshare.net/splunk/splunklive-detroit-april-2013-dominos-pizza#btnnext