Chapter 1
Understanding Big Data Analytics

In This Chapter
- Defining Big Data
- Understanding Big Data Analytics
- Contrasting traditional and visual analytics approaches

The era of Big Data is upon us. The race is on to extract insight and value from this abundant resource. The opportunities are enormous, and so are the challenges. Organizations that master the emerging discipline of Big Data Analytics can reap significant rewards and separate themselves from their competitors; those that fail to do so will be left in the dust. Big Data is here to stay. And you're part of it!

In this chapter, I define Big Data and Big Data Analytics and explore the challenges of harvesting value from an ever-growing sea of digital information.

What Is Big Data?

Big Data is a term applied to data sets so large that common software tools aren't capable of capturing, managing, and processing their data within a tolerable period. Big Data is colossal, unstructured (or loosely structured), distributed, fluid, and often unconnected. The amount of Big Data varies by organization, but its volume (and variety) tends to increase astonishingly quickly and exponentially.
Big Data Analytics For Dummies, Centrifuge Special Edition

In the following sections, I give you some basic background on Big Data: exactly how big it is, where it comes from, how to evaluate it, and how to use it.

Putting the Big in Big Data

Analysts estimate that approximately 300 million terabytes (TB) of data exist in the world today. But what's staggering is that 90 percent of this data was created in the last two years! In a recent study titled "The Digital Decade: Are You Ready?" the market research company IDC projected that by 2020 the digital universe will encompass a staggering 35 zettabytes (ZB). Step aside, petabytes and exabytes! The word is now zettabytes! And with 1ZB being equivalent to 1 billion terabytes, that's a whole lot of data.

To put this into perspective, from its founding in April 1800 to April 2011, the U.S. Library of Congress had amassed about 235TB of data. Currently, it's adding about 5TB of new data each month. So if IDC is correct, by 2020, computers will collectively store roughly 150 million times more data than is archived in the entire Library of Congress today!

Seeing where the data comes from

You may wonder where all this data comes from. It comes from almost everywhere. Enterprises and government agencies aggregate data from myriad private and public data sources.

Private data is information that your organization specifically collects and that is available only to your organization, such as employee data, customer data, and machine data (such as user transactions, customer behavior, computer system health, and cybersecurity threats). Commercial examples include credit-card, pharmacy, and mortgage transactions. Government examples include Social Security data, Medicare transactions, and passport paperwork.

Public data is information that's generally available to the public, for a fee or at no charge.
Examples include stock prices, company and individual credit ratings, social media content (such as Facebook and Twitter), and computer IP
blacklists (such as known hacking sites), along with all other content found on the public Internet.

When you stop and think about it, it's no wonder the world is drowning in data. If an organization can record something, it usually does: environmental data, financial data, medical data, surveillance data, and on and on.

Figuring out what to do with the data

The most significant challenges of Big Data no longer involve aggregation and storage but rather what to do with all the accumulated data. Today, common concerns for commercial enterprises and government agencies include the following:

- Deriving actionable value from Big Data, despite information overload.

- Analyzing the connections between structured, semistructured, and unstructured data sets. Structured data is stored in relational databases, in columns and rows. Semistructured data doesn't conform to the formal structure of tables and rows but contains tags or other markers that separate semantic elements and enforce hierarchies of records and fields; examples include web pages and XML (Extensible Markup Language) documents. Unstructured data is information that doesn't have a predefined data model and thus can't be stored in a relational database; it can be textual (such as e-mail messages, PowerPoint presentations, and Word documents) or nontextual (such as JPEG images, MP3 audio files, and Flash video files). According to the market-research firm IDC, semistructured and unstructured data account for more than 90 percent of the data in today's organizations.

- Uncovering patterns of useful data when you don't know what questions to ask in the first place.

As you discover throughout this book, a Big Data Analytics solution can help your organization meet all these challenges.
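To make the semistructured category concrete, here's a minimal sketch in Python; the customer record, tag names, and values are invented for illustration. Although the data doesn't fit a fixed table schema, its tags let software locate semantic elements and walk the hierarchy:

```python
import xml.etree.ElementTree as ET

# A tiny semistructured record (hypothetical data): no fixed table
# schema, but tags mark the semantic elements and their hierarchy.
doc = """
<customer id="c-1001">
  <name>Ada Lovelace</name>
  <orders>
    <order total="42.50"/>
    <order total="17.25"/>
  </orders>
</customer>
"""

root = ET.fromstring(doc)

# Navigate by tag name rather than by column position.
name = root.find("name").text
total = sum(float(o.get("total")) for o in root.iter("order"))
print(name, total)  # Ada Lovelace 59.75
```

Contrast this with unstructured data, such as the body of an e-mail: there are no markers at all, so software must infer meaning from the raw text itself.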
Popular Big Data infrastructure solutions

The following is a list of popular Big Data infrastructure solutions that you're likely to encounter when using Big Data Analytics applications:

- Apache Hadoop is an open-source software framework that supports data-intensive distributed applications working with thousands of computers and petabytes of data.

- Cloudera offers Apache Hadoop-based software and services that make it easier to run Hadoop in a production environment.

- EMC Greenplum is a commercial data warehouse based on the open-source database PostgreSQL and intended for large-scale enterprise and cloud deployments.

- HP Vertica is a commercial database software platform that uses a column-oriented analytic database to process large amounts of data for quick analysis.

- IBM Netezza is a commercial data-warehouse solution based on proprietary technology, scaling to more than 10 petabytes (PB) of data.

- NetApp specializes in enterprise-class data-warehouse solutions and is a thought leader in Big Data. NetApp's highest-end platform can accommodate up to 4PB of raw data.

- Oracle is one of the most successful enterprise data-warehouse providers today, with all of the Fortune 100 as customers. Oracle offers a line of Big Data Appliances that can accommodate up to 648TB of raw storage in a single rack and up to 5PB within an eight-rack cluster.

- Splunk is a software application that enables users to search, monitor, and analyze machine-generated data from applications, systems, and IT infrastructure via a web-based interface.

- Sybase, an SAP company, is an enterprise software and services company offering software to manage, analyze, and mobilize information, using relational databases, analytics, data-warehousing solutions, and mobile application development platforms.

- Teradata offers commercial relational database management system (RDBMS) hardware and software.
The company launched the Petabyte Power Players club to recognize customers with petabyte-plus data warehouses, including Dell (1PB), Bank of America (1.5PB), Wal-Mart Stores (2.5PB), and eBay (5PB).
Evaluating Big Data for analysis: The Four Vs

Diamonds are evaluated on what are commonly known as the Four Cs: color, cut, clarity, and carat weight. Similarly, Big Data is commonly evaluated on the Four Vs:

- Volume describes the relative size of the data, typically in terabytes or petabytes.

- Velocity describes the frequency at which data is generated, captured, and shared.

- Variety describes the types of data in a data set, such as transactional, social, content, geospatial, location-based, log, and radio-frequency identification (RFID) data.

- Value describes the business benefits the organization reaps, such as fraud detection, loan risk analysis, and customer-behavior analytics.

All four of these Big Data characteristics are important to consider when you're evaluating solutions for Big Data Analytics, which I introduce next. See Chapter 2 for a lot more information about the Four Vs.

What Is Big Data Analytics?

Big Data Analytics is the process, whether manual or automated, of analyzing Big Data to extract meaning and actionable intelligence. Put another way, it makes Big Data useful.

Only a short time ago, companies spent considerable time and resources to identify and procure useful data. Today, most companies have the opposite problem. Aggregating useful data is relatively easy; analyzing that data is the challenge. In the following sections, I explore two approaches to Big Data Analytics: the traditional approach and the visual analytics approach.
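As a toy illustration, the Four Vs described earlier in this chapter can be captured as a simple record type; the class name, fields, and sample numbers below are hypothetical, not drawn from any real product or data set:

```python
from dataclasses import dataclass

# Hypothetical helper for profiling a data set on the Four Vs.
# Field names and values are illustrative only.
@dataclass
class FourVProfile:
    volume_tb: float        # Volume: relative size, in terabytes
    events_per_sec: float   # Velocity: how fast data is generated
    source_types: list      # Variety: transactional, social, log, ...
    business_value: str     # Value: the benefit the data supports

profile = FourVProfile(
    volume_tb=500.0,
    events_per_sec=12_000,
    source_types=["transactional", "social", "geospatial"],
    business_value="fraud detection",
)
print(profile.business_value)  # fraud detection
```

A profile like this is one way to compare candidate data sets side by side before committing analytics resources to them.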
Traditional analytics approach

You may be surprised that many organizations still employ data analysts who use manual techniques to extract useful information from large data warehouses. Such techniques typically include ad hoc database queries followed by a series of univariate (single-variable distribution), bivariate, and, more often, multivariate analyses. These analysts often have advanced degrees in mathematics and/or statistics and pride themselves on their ability to perform advanced regression analyses. They often view data in columns and rows and then periodically create charts and graphs manually, using spreadsheets or basic business-intelligence reporting tools (see Figure 1-1).

Figure 1-1: Manual data analysis.

Even with automation, this type of analytical approach is limited in its ability to detect unknown or undiscovered patterns (link analysis). Assumptions are often hard-coded, leading to false outcomes. It's like trying to find a needle in a haystack! Ultimately, the results are too little and too late.
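The columns-and-rows workflow described above can be sketched in a few lines of Python; the query result below is made up, and the point is simply that each manual analysis examines one column's distribution at a time:

```python
import statistics

# Toy stand-in for an ad hoc warehouse query result: rows and
# columns, as a traditional analyst views data (values invented).
rows = [
    {"region": "east", "sales": 120.0},
    {"region": "west", "sales": 95.0},
    {"region": "east", "sales": 143.0},
    {"region": "west", "sales": 101.0},
]

# Univariate analysis: summarize a single column's distribution.
sales = [r["sales"] for r in rows]
print(statistics.mean(sales))   # 114.75
print(statistics.stdev(sales))  # sample standard deviation
```

Each question requires another hand-written query and another pass over the data, which is why this approach struggles to surface connections nobody thought to ask about.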
Visual analytics approach

Today's data analysts take an entirely different approach. They prefer to work smarter, not harder, to uncover hidden meanings in Big Data, leveraging visual analytics tools to integrate, visualize, and collaborate with data in ways that old-school data analysts have never seen.

Visual analytics applications extract value from Big Data through advanced analytics and interactive visualization. Advanced analytics, such as link analysis, integrate complex information into simple visualizations for pattern discovery (for example, seeing the forest through the trees). Interactive visualization refers to the ability to do it yourself through prebuilt charts, graphs, and timelines that tell the complete story.

You've often heard that a picture is worth a thousand words. Would you rather try to extract useful information from the table of data shown in Figure 1-1 or through the interactive visualizations displayed in Figure 1-2?

Figure 1-2: Data analysis with visual analytics software.

For centuries, visualization has been used to support the understanding of complex information. A better understanding of relationships and context is the key to visual analytics.
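To give a feel for what link analysis does under the hood, here is a minimal sketch in Python on invented data. Real visual analytics tools operate at far larger scale and render the result graphically, but the core idea, connecting records through shared attributes, is the same:

```python
from collections import defaultdict, deque

# Hypothetical records: two accounts share a phone number, two
# share an IP address. Link analysis finds the chain between
# records that a row-by-row view would miss.
edges = [
    ("acct:A", "phone:555-0101"),
    ("acct:B", "phone:555-0101"),
    ("acct:B", "ip:10.0.0.7"),
    ("acct:C", "ip:10.0.0.7"),
]

graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

def path(start, goal):
    # Breadth-first search for the shortest chain of links.
    seen, queue = {start}, deque([[start]])
    while queue:
        p = queue.popleft()
        if p[-1] == goal:
            return p
        for nxt in graph[p[-1]] - seen:
            seen.add(nxt)
            queue.append(p + [nxt])
    return None

print(path("acct:A", "acct:C"))
# ['acct:A', 'phone:555-0101', 'acct:B', 'ip:10.0.0.7', 'acct:C']
```

Account A and account C never appear in the same row of any table, yet the graph reveals that they're two hops apart, exactly the kind of hidden relationship a visual analytics tool surfaces as a picture.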
Visual analytics software can improve time-to-discovery by more than 50 percent and make data analysts 10 to 20 times more productive than analysts who use traditional manual methods. Organizations typically recoup their investments in visual analytics tools in a matter of months. They also find it easier to fill data-analyst positions, because advanced degrees in mathematics and statistics are no longer required.

Analysts who leverage visual analytics applications instantly become data scientists, because they now have the ability to test new hypotheses and experiment with data in ways never before possible. Visual representation of the data sharpens focus on what's important (so you can see clearly).

If you're excited by the prospects of visual analytics, read on. Chapter 2 describes how to get started.