Big Data: Big Challenge, Big Opportunity
A Globant White Paper
By Sabina A. Schneider, Technical Director, High Performance Solutions Studio
1. Big Data: The Challenge

While it is popular to say that advances in technology are making the world grow smaller, anyone involved in the collection, storage and processing of data knows this statement is far from true. As technology advances, so too does the size, scope and complexity of data collected. Facebook, a social platform which only became available to the masses about five years ago, sees 30 billion pieces of content shared each month. Beyond the explosion of social media, exponential increases in the scope and use of remote monitoring and sensing technologies, as well as constant growth in the collection and indexing of data by online search engines and other Internet tools, have combined to create a flow of data the likes of which the world has never experienced. Further exacerbating the situation is the proliferation of portable consumer internet devices, such as smartphones and laptops.

Consider that the US Library of Congress, one of the world's great stores of information, had collected 235 terabytes (or 235 trillion bytes) of data by April 2011. Also consider that each second of high-definition video generates more than 2,000 times as many bytes as are required to store a single text page. As reported in a 2011 study from McKinsey & Company (1), structured data is growing at an annual rate of 22%, while unstructured data is growing at an astounding 72% every year.

One notable Big Data expert, Victor Sanchez, Big Data Architect at Globant, describes Big Data as extra-large, unstructured datasets that are hard to gather, analyze and store using conventional tools. Those datasets can reach terabytes or petabytes in size and contain valuable information related to the core business of a company.
Sanchez goes on to say that Big Data challenges modern organizations by forcing them to develop and use tools that can extract valuable business information from any kind of dataset as fast as possible (near real time) and at low cost, in order to make the business more agile and dynamic. This creates the chance to explore new opportunities to add customers and, hence, increase revenues. Simply put, no computing device currently on the market can store and process this volume of data alone. Even sophisticated relational databases by themselves cannot manage Big Data with the efficiency and effectiveness modern businesses require.

2. Big Data: The Opportunity

As with most difficult circumstances, the continuing expansion of Big Data holds the potential for great opportunity as well as great challenge. Hidden within streams of what at first glance is meaningless data lie gems of rich information, which may have been unavailable for analysis even a few short years ago. And as overall volumes of data rise, so too do volumes of actionable information, although they usually are not immediately apparent. This type of valuable information can be buried in any number of seemingly insurmountable streams of data: tweets containing critical keywords, financial transactions executed by key consumer demographics in key places at key times, or any significant event, opinion or action which must be filtered and retrieved from billions or even trillions of similar but meaningless events, opinions or actions.

By processing the new data which comes from the internet and mobile devices (and even what used to be considered miscellaneous data and was discarded), companies can now create a new wealth of information to make better decisions towards making their business grow, comments Globant Big Data Architect Juan Manuel Palacios. Leading-edge tools now provide new processing techniques specifically tailored to deal with large volumes of data within affordable time frames and on affordable hardware.

3. Big Data: The Solution

Saying there is any one solution to the challenge of Big Data is oversimplistic, like saying there is one solution to the challenges of staying healthy or rectifying a damaged economy. However, companies can identify their own unique Big Data challenges and then solve them to find the opportunities hidden within, provided they are armed with the proper tools. Think of the valuable information hidden inside Big Data as a vein of gold hidden inside a large concentration of solid rock. Humans have been mining through rock to obtain gold for thousands of years.
But mining for gold with the most sophisticated drilling equipment available today will give your gold mining efforts a much higher chance of succeeding much more quickly than will mining for gold with a pickaxe.

4. Big Data: The Market

If experts agree on one thing about the monetary size of the Big Data market, it's that Big Data is a market which is extremely hard to quantify. In a December 2010 blog post (2), The 451 Group pointed out that beyond the fact there is no single, strict definition of Big Data, impediments to estimating the revenues that can be gained by solving Big Data problems include that Big Data can conceivably be considered a total
amalgamation of all available data, or as that specific class of data which cannot be managed through total means. However, Gartner estimates the size of the data warehousing/analytics market, which is essentially the market for tools to solve Big Data headaches, to be $7 billion.

5. Apache Hadoop Helps Tame Big Data

Globant High Performance Solutions Studio creates high-availability and performance-critical software to manage large volumes of information. One key weapon in Globant's Big Data arsenal is a software framework known as Apache Hadoop (3), an open source framework, developed largely at Yahoo, that is modeled on Google's distributed file system (4) and related products designed for large-scale data processing. Hadoop is specifically designed to support distributed processing of large data sets across clusters of computers, based on a simple programming model. Hadoop is fully scalable, and can be used on a single server or across thousands of machines, each with its own computation and storage.

Hadoop works by processing data inside each individual node of a network. Using a map reduce (5) program, users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. The map reduce program, which is automatically parallelized and executed on a large cluster of commodity machines, is transferred to the data nodes, thus avoiding data transfer or distribution between nodes beyond a single shuffle before the reduce phase. Files are replicated on multiple nodes, providing failsafe data backup. Hadoop therefore does not work like a traditional data clustering system and does not require expensive hardware to support processing. It is also easily scalable, as capacity can be linearly increased by adding nodes.
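
The map and reduce functions described above can be illustrated with a minimal, single-process sketch in Python. It does not use Hadoop itself: the names map_fn, reduce_fn and run_map_reduce are hypothetical, and the in-memory grouping step only stands in for the shuffle that a real cluster performs between nodes.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)


def run_map_reduce(records, mapper, reducer):
    """Hypothetical single-process driver (an assumption, not a Hadoop API)."""
    # Map phase plus shuffle: group intermediate values by intermediate key.
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    # Reduce phase: one reducer call per distinct intermediate key.
    return dict(reducer(key, values) for key, values in grouped.items())


lines = ["big data big challenge", "big opportunity"]
counts = run_map_reduce(enumerate(lines), map_fn, reduce_fn)
print(counts)
```

On a real cluster the framework would run the map calls in parallel on the nodes that hold each chunk of input, so only the grouped intermediate pairs need to cross the network.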

5.1 Hadoop: When To Use It, When Not To Use It

While Hadoop can be invaluable to Big Data processing efforts, it is not a guaranteed cure-all for every Big Data-related problem. Following are some general tips to help determine whether Hadoop should or should not be used to solve a specific Big Data problem.

Hadoop should be used for:
- Business statistic reports
- Generating aggregated valuable information out of big volumes of data
- Unstructured data
- Long-term analysis
- Social network analysis
- Ticketing systems analysis
- Financial behavior predictions based on pattern recognition

Hadoop should NOT be used for:
- Smaller volumes of data
- Structured data
- Data which frequently changes
- Transactional data

To further explain exactly how Hadoop can be applied to help turn Big Data into Valuable Information, let's take a look at two real-world Globant client success stories.

5.2 Success Story: Scoring with Big Data

For a recent major sporting event, Globant helped a client use the Apache Pig platform, a high-level analysis language which supplies default mapper and reducer implementations behind the scenes. The client extracted four gigabytes of tweets and performed detailed analysis on top subtopics, popular tweeting sources and geolocation aggregation. Specifically, the client was able to identify the 10 users who received the most re-tweets when mentioning the event. In addition, the client was able to use the Twitter API to track relevant tweets by location and time, as well as count tweets relating to individual subtopics and quantify the percentage of tweets with positive, neutral and negative sentiment regarding each subtopic.

This is just one example. Globant has also used Hadoop to help clients such as a financial company that wanted to analyze 40 years of historical data for 500 stock symbols, and an entertainment company which was able to analyze gigabytes of tweets to determine the popularity of upcoming movie releases.

6. Other Software and Infrastructure Options

While Hadoop can often be a valuable tool in unlocking the hidden potential of Big Data, it is not the only one. There are a variety of software and infrastructure options available for companies looking to sort, sift, analyze and store massive volumes of data. Sanchez details several IT tools that can assist companies in managing Big Data.
Hardware can be leveraged at low cost using existing cloud services such as Amazon AWS, along with tools such as vertical databases that are designed to perform real-time analysis tasks, says Sanchez. One of the most important requirements that companies have, and will continue to have, of Big Data is real-time tooling that balances performance against the manipulation of large datasets. Some frameworks, like Hadoop MapReduce, are not designed to execute real-time processes, but rather to handle large amounts of data, and this is where solutions like Flume, Scribe, Storm, Chukwa, Splunk and Rainbird arise to meet real-time requirements. Some of these options are being used right now in top companies like Twitter and Facebook, and can be viewed as great
alternatives in the medium term for achieving real-time processing over extra-large datasets.

However, Sanchez cautions that most leading-edge Big Data tools are still in an immature state. Developing reliable, scalable, stable and extendable frameworks to handle Big Data, with the appropriate support and documentation, will convince companies to take a step forward regarding the importance of Big Data and its analysis, he states. It is also important for companies to share their experience of solving real-world problems with Big Data, in order to improve those tools and make them more robust. Following is a further brief review of several options.

6.1 RDBMS/SQL vs. NoSQL

A relational database management system (RDBMS) stores data in tables, is queried with Structured Query Language (SQL) and follows the Atomic, Consistent, Isolated, Durable (ACID) transactional model. This model ensures that data transactions are atomic (the entire transaction must work properly or it is not executed), consistent (only data valid according to all rules is written), isolated (transactions cannot interfere with one another or simultaneously change the same data) and durable (once executed, the system cannot lose a transaction). Following this model means that when a user queries data, the latest version of that data is immediately returned. RDBMS databases are best used for activities such as financial transaction processing, where data consistency and accuracy are crucial for the user, but they have difficulty handling tera- and petabyte volumes of data.

NoSQL data management systems do not use SQL to query data and operate according to the BASE (Basically Available, Soft state, Eventual consistency) model, which does not offer immediate consistency of data that is queried, so a user may not always get the most current version of data in response to a query. NoSQL data management systems are mainly used for marketing and business analysis purposes, rather than to execute core business functions.
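
The atomicity property of the ACID model described above can be sketched with Python's built-in sqlite3 module. This is an illustrative assumption rather than anything from Globant's projects: the accounts table, the account names and the transfer helper are all hypothetical.

```python
import sqlite3

# Hypothetical in-memory database with a rule that balances cannot go negative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds in one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit broke the CHECK rule, so the earlier credit was undone too.
        return False


ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer credits bob before the debit fails, yet the final balances show no trace of that credit, which is the all-or-nothing behavior that makes an RDBMS suitable for financial transaction processing.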

According to the CAP theorem, no distributed data store can guarantee all three of Consistency, Availability and Partition tolerance at the same time. It is all about finding the best solution for your problem.

6.2 Open Source Solution: Apache Hadoop

Open source computing with frameworks such as Hadoop offers several advantages to companies trying to manage Big Data volumes. These include scaling with low difficulty and cost, as source code is publicly available, as well as the capability to provide the processing power of possibly thousands of machines working in tandem. HDFS (Hadoop Distributed File System)
is open source software written in Java for the Hadoop framework, designed for reliable, scalable, distributed computing. Computation is moved to the data nodes, which makes it ideal for processing large volumes of data in a distributed, parallel way. In addition, data is replicated in a configurable way across the data nodes to ensure fault tolerance. The main node splits raw data into chunks and distributes them. By spreading data between different machines, the HDFS architecture provides distributed access to large volumes of data, and also avoids network latency by preventing data transfer between nodes and only transferring data once, from the program to the nodes. Once raw data is uploaded to the HDFS cluster, the map reduce program helps aggregate it.

There are other modules living in the Apache Hadoop environment: Apache Pig and Hive, which are very good tools for users not familiar with Java or a scripting language such as bash. They include a default map reduce implementation, which is not always the best for the problem at hand. If you need fine tuning, we strongly recommend implementing your own map reduce in Java or your favorite scripting language supported by Hadoop: Perl, Python or bash, among others.

7. Summary/Conclusion

Big Data is becoming a major problem for modern business, which holds the potential to become a strategic advantage for companies that properly manage it. Companies are collecting exponentially growing volumes of data produced by an explosion of social media, internet, sensing and mobile technology, which provides little to no value if simply stored in a database. However, there are now a number of high-performance back-end data analysis tools capable of sorting through even the largest volumes of data, providing intelligent analysis, filtering, sorting and retrieval that would otherwise be unavailable. Companies can thus turn what is otherwise an unwieldy assortment of data into actionable information that provides unparalleled insight into their operations.

This information can provide a significant competitive edge and allow extreme personalization in providing services and products to customers and clients, essentially breaking the traditional data analysis paradigm and creating a new, innovative method of deep-dive data analysis. While no two companies will have the same Big Data problem, solution or benefit, every company dealing with Big Data has a few things in common. Big Data is here to stay, will only get bigger over time, and can sink a company or elevate it to new competitive heights, depending on the response, or lack thereof.

8. References

1. James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers, McKinsey Global Institute, Big data: The next frontier for innovation, competition, and productivity, May 2011.
2. The 451 Group, Sizing the big data problem: big data is the problem, December 2010.
3. Chuck Lam, Hadoop in Action, Manning Publications Co.
4. Jeff Dean, Sanjay Ghemawat, Google, Inc., MapReduce: Simplified Data Processing on Large Clusters.
5. Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, Google, Inc., The Google File System.

ABOUT GLOBANT

Globant is the Latin American leader in the creation of innovative software products that appeal to global audiences, specializing in the use of agile methodologies and a rational blend of open source and proprietary software. For us, that means a process where the best engineers team up with the art design studios and innovation labs to deliver a superb user experience. We believe that our differentiation isn't based only on a product vision. Our experience working with innovative companies and state-of-the-art technology has led us to provide Innovation as a Service to customers that, while keeping the focus on their business, require external help to understand the possibilities these technologies offer. Today, Globant's clients consist of innovative companies such as Google, EMC, LinkedIn, Electronic Arts, PRNewswire, Dreamworks, JWT and Coca Cola.