BIG DATA FUNDAMENTALS

Timeframe: Minimum of 30 hours

Learning outcomes
- Use the concepts of volume, velocity, variety, veracity and value to define big data
- Critically evaluate the need for big data management across the spheres of government, business, the environment, and society
- Evaluate the changing and expanding role of information technology in the organisation
- Discuss the concepts of correlation and prediction as they apply to big data

Recommended articles
- Cloud Security Alliance, 2014, 'Big Data Taxonomy', Big Data Working Group, https://cloudsecurityalliance.org/research/big-data/ (Accessed 6 October 2014).
- Evans, P.C. and Annunziata, M. 2012, 'Industrial Internet: Pushing the Boundaries of Minds and Machines', GE Imagination at Work, http://www.ge.com/docs/chapters/industrial_internet.pdf (Accessed 28 August 2014).

Section overview

This opening section deals with some of the fundamentals of big data, including how it is defined. It also looks at the business need and where big data belongs in the organisation. This section points to the importance of correlation in big data, whereas Section 7.3 (Big Data Analytics) will provide a more in-depth discussion of statistical analysis and of the role of quants versus the role of management.

Due to the nature of the subject, some terms may be unfamiliar to you. Refer to the glossary of terms at the end of this study guide for explanations that do not appear in the text.

What is Big Data: the Five Vs?

Companies and governments around the world are collecting vast quantities of digital information about us and our environments using information exchanges (eg e-mails, mobile phones, etc) and sensory devices (eg cameras, heat sensors, etc). The prospect of every electronic device being connected to the internet and producing data (the internet of things) is imminent.

This proliferation of data has exceeded many organisations' and governments' capacities to store, compute, and analyse the information, quite apart from the implications for security and privacy.

"Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures [constraints of the structure] of your database architectures." (Dumbill, 2014)

Every day we create 2.5 quintillion bytes of data. The issues of storing, computing, security and privacy, and analytics are all magnified by the velocity, volume, and variety of big data, such as large-scale cloud infrastructures, diversity of data sources and formats, streaming nature of data acquisition and high volume inter-cloud migration. (Cloud Security Alliance, 2014)

The following are types of data that contribute to the big data environment:
- Traditional enterprise data (eg from CRM systems, transactional ERP data, web store transactions, general ledger data, etc);
- Machine-generated or sensor data (eg call detail records, weblogs, smart meters, manufacturing sensors, equipment logs, and trading systems data); and
- Social data (eg customer feedback streams, micro blogging sites such as Twitter, and social media platforms like Facebook).
(Oracle, 2013a)

The authors cited above, and many other writers, emphasise the significance of big data, why big data is important for business, and where big data fits into organisational structures and decision-making processes. We begin by describing the concept of big data using the characteristics of volume, velocity, variety, veracity and value (the 5Vs). In late 2001, the term 3Vs first appeared in a research note titled '3-D Data Management: Controlling Data Volume, Velocity and Variety' (Laney, 2001). Many authors have since added other distinguishing factors (ie veracity and value). While these two additional factors do not necessarily define big data, they do emphasise the importance of data trustworthiness and business value.

FIGURE 1: 5Vs OF BIG DATA: volume, velocity, variety, veracity and value (Marr, 2014)

High volume

IBM (2014) states that every day we create 2.5 quintillion bytes of data, with 90% of the data in the world today created in the last two years alone (since 2011). Data comes from everywhere:
- Sensors, eg used to gather climate information;
- Posts to social media sites and social media advertising, eg Facebook, Qzone (China only), WhatsApp, Google+, etc (refer to Appendix 1 for largest social networks in the world);
- Digital pictures and videos, eg YouTube;
- Websites, eg purchase transaction records; and
- GPS signals, to name a few.
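To get a feel for the daily figure quoted from IBM (2014), it can be converted into an average per-second rate. The following is a minimal sketch rather than anything taken from the source: the function and variable names are illustrative, the units are decimal (SI), and it assumes data is created evenly through the day, which the source does not claim.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# 2.5 quintillion bytes (short scale: 2.5 x 10**18), as quoted from IBM (2014).
BYTES_PER_DAY = 2.5e18


def bytes_per_second(bytes_per_day: float) -> float:
    """Average creation rate, assuming an even spread over the day (an assumption)."""
    return bytes_per_day / SECONDS_PER_DAY


rate = bytes_per_second(BYTES_PER_DAY)
exabytes_per_day = BYTES_PER_DAY / 1e18  # 1 exabyte = 10**18 bytes (SI)
terabytes_per_second = rate / 1e12       # 1 terabyte = 10**12 bytes (SI)

print(f"{exabytes_per_day:.1f} exabytes per day")
print(f"{terabytes_per_second:.1f} terabytes per second on average")
```

On these assumptions, 2.5 quintillion bytes a day is 2.5 exabytes a day, or roughly 29 terabytes every second.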

Consider the following example of the New York Stock Exchange's management of data volumes (Melnyk, 2014).

New York Stock Exchange

NYSE Euronext operates multiple securities exchanges, most notably the New York Stock Exchange (NYSE). NYSE ingests approximately 8 billion transactions per day, which can rise to as much as 15 billion during a crash or surge in the market. Analysts track this data, eg the value of listed companies, performance trends, and fraudulent activity. This market surveillance and analysis includes every transaction from each trading day. Like most other large organisations, NYSE was moving data back and forth between its storage systems and its analytic engines, which could take over 26 hours to complete. NYSE also has global customers who require 24/7 access without any downtime. NYSE has now reduced the time needed to access business-critical data from 26 hours to 2 minutes. It can also carry out ad hoc searches over a petabyte of data (one thousand million million bytes, or 10^15 bytes), and it has opened up new analytical capabilities. (Melnyk, 2014)

It is possible to hold very large data sets due to the decreasing cost of different types of storage and the availability of cloud-based services. To appreciate the size of a petabyte: if you counted all the bits in one petabyte at one bit per second, it would take 285 million years, and if you counted one byte per second, it would take 35.7 million years (McKenna, 2014). Refer to Appendix 2 for more on bytes.

Digital mapping

Google first entered digital mapping in 2004 and launched Google Maps and Google Earth in 2005. Today, Google offers its users over 20 petabytes (21.5 billion megabytes) of imagery, from satellite images to aerial photos to 360-degree street view images. (McKenna, 2014)

High velocity

A second characteristic that implies 'big' is the high velocity (or frequency) of the data. Velocity, as defined by McKenna (2014), is the rate at which data arrives at the enterprise and is processed or well understood.
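Velocity ties volume to time, so the figures quoted earlier in this section can be restated as rates. The sketch below is illustrative only: the helper names are hypothetical, the NYSE rates assume transactions are spread evenly over a 24-hour day (real trading is concentrated in market hours), and a year is taken as 365.25 days.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY  # assumption: average year length


def per_second(events_per_day: float) -> float:
    """Average arrival rate, assuming an even spread over 24 hours."""
    return events_per_day / SECONDS_PER_DAY


def years_to_count(units: int, units_per_second: float = 1.0) -> float:
    """Years needed to count `units` items at a fixed counting rate."""
    return units / units_per_second / SECONDS_PER_YEAR


# Arrival rates implied by the NYSE transaction volumes (Melnyk, 2014).
nyse_typical = per_second(8e9)
nyse_peak = per_second(15e9)

# Processing side: data access time cut from 26 hours to 2 minutes.
speedup = (26 * 60) / 2

# Counting a petabyte at one unit per second (McKenna, 2014).
PETABYTE_BINARY = 2**50    # 1,125,899,906,842,624 bytes
PETABYTE_DECIMAL = 10**15  # the value given in the text

bytes_years_binary = years_to_count(PETABYTE_BINARY)
bits_years_binary = years_to_count(8 * PETABYTE_BINARY)
bytes_years_decimal = years_to_count(PETABYTE_DECIMAL)

print(f"NYSE typical: {nyse_typical:,.0f} transactions per second")
print(f"NYSE peak:    {nyse_peak:,.0f} transactions per second")
print(f"Access speed-up: {speedup:.0f}x")
print(f"Binary petabyte:  {bytes_years_binary / 1e6:.1f} million years by byte, "
      f"{bits_years_binary / 1e6:.0f} million years by bit")
print(f"Decimal petabyte: {bytes_years_decimal / 1e6:.1f} million years by byte")
```

Under these assumptions the NYSE volumes average about 92,600 transactions per second, rising to about 173,600 at peak, and the access improvement is a factor of 780. The quoted counting times of 285 million and 35.7 million years are reproduced with a binary petabyte of 2^50 bytes; with the decimal 10^15 bytes, counting by byte would take about 31.7 million years.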

But there is a big difference between the speed at which information is received and the speed at which it is processed, as the example below shows.

The doubling of computing power every 18 months is nothing compared to a big algorithm: a set of rules that can be used to solve a problem a thousand times faster than conventional computational methods could. One colleague, faced with a mountain of data, figured out that he would need a $2-million computer to analyse it. Instead they came up with an algorithm within two hours that would do the same thing in 20 minutes on a laptop: a simple example, but illustrative. (Shaw, 2014)

McKenna (2014) argues that, from a governance perspective, if powerful analytics engines can apply analytics to data as it flows across the wire and glean insight from it without having to store it, the data might not have to be subject to retention policies, which can result in huge savings for the IT department. Consider that social media messages go viral in seconds, and that technology allows organisations to analyse that data while it is being generated without ever putting it into their databases. Cloud-based information exchange gives organisations an opportunity to pull varying data sets into a single view.

Social media

Twitter users are estimated to generate nearly 100,000 tweets a minute. This is in addition to 700,000 Facebook posts and more than 100,000 million e-mails a minute. (State Tech, 2013)

Driverless cars

Big data projects include using surrounding (big) data to get a car from A to B; this requires high-velocity data in real time.

High variety

Structured data is what we typically find in our traditional databases, eg customer relationship management records and statistics relating to financial transactions. Big data brings together structured data and the unstructured data available from the multiple sources we described under data volume above: social media conversations, photos, sensor data, video or voice recordings, etc (Marr, 2014; State Tech, 2014).

Retailers

Retailers combine data from social media such as Twitter with their own in-house data collected from point-of-sale terminals and loyalty cards to produce rich and detailed information for marketing.
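The retail case can be sketched in a few lines to show what combining structured and unstructured data means in practice. Everything below is hypothetical and for illustration only: the sales rows, the posts, the field names and the simple word-matching rule are assumptions, not a description of how any retailer or tool actually works.

```python
from collections import Counter

# Structured, in-house data: hypothetical point-of-sale rows tied to loyalty cards.
pos_sales = [
    {"loyalty_id": "C001", "product": "trainers", "quantity": 1},
    {"loyalty_id": "C002", "product": "jacket", "quantity": 2},
    {"loyalty_id": "C001", "product": "jacket", "quantity": 1},
]

# Unstructured, external data: hypothetical free-text social media posts.
posts = [
    "Love my new trainers, so comfy!",
    "That jacket is overpriced...",
    "Queued an hour for the trainers. Worth it.",
]


def units_sold(sales):
    """Total quantity sold per product from the structured rows."""
    totals = Counter()
    for row in sales:
        totals[row["product"]] += row["quantity"]
    return totals


def mentions(texts, products):
    """Number of posts mentioning each product (naive exact-word match)."""
    counts = Counter()
    for text in texts:
        words = {word.strip(".,!?").lower() for word in text.split()}
        for product in products:
            if product in words:
                counts[product] += 1
    return counts


sold = units_sold(pos_sales)
mentioned = mentions(posts, sold.keys())

# Single view per product, combining both sources.
combined = {
    product: {"units_sold": sold[product], "mentions": mentioned[product]}
    for product in sorted(sold)
}

for product, view in combined.items():
    print(product, view)
```

The structured rows can be summed directly, whereas the free text first has to be turned into something countable. Real systems would replace the word match with proper text analytics, but the joining step is the same idea.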