Big Data & Analytics: Your concise guide (note the irony) Wednesday 27th November 2013
Housekeeping
1. Any questions coming out of today's presentation can be discussed in the bar this evening
2. OCF is sponsoring the networking session with drinks and nibbles
Big Data Confused?
In 2012 the Experton Group carried out a study, on behalf of BT Germany GmbH & Co, on the question of how big data is changing business and IT. The study, Data Explosion in Business IT, was carried out with 100 decision makers working at companies with more than 500 employees.
An accredited researcher working for a global company
Using an informed, if small, dataset
A total of 67% agree or strongly agree that Big Data is "Marketing Hype", yet in the same group 79% agree or strongly agree that Big Data is a "New Generation of Database and Analytics Technology".
What is Big Data?
The 3 Vs represented graphically
In 2001 Doug Laney defined data growth challenges and opportunities as being three dimensional, i.e. increasing Volume (amount of data), Velocity (speed of data in and out), and Variety (range of data types and sources). Since then many vendors and pundits have attempted to enhance his definition with clever(?) Vs of their own. However, the 3Vs were intended to define the proportional dimensions and challenges specific to big data; other Vs like veracity, validity, value, viability, etc. are aspirational qualities of all data, not defining qualities of big data.
Characteristics of Big Data

Volume
Description: The sheer amount of data generated, or data intensity, that must be ingested, analysed and managed to make decisions based on complete data analysis.
Attribute: According to IDC's Digital Universe Study, the world's "digital universe" is in the process of gathering 1.8 zettabytes of information, with continuing exponential growth projecting to 35 zettabytes in 2020.
Driver: Increase in data sources; higher resolution sensors.

Velocity
Description: How fast data is being produced and changed, and the speed with which data must be received, understood and processed.
Attribute: Accessibility: information when, where and how the user wants it, at the point of impact. Applicable: relevant, valuable information for an enterprise at a torrential pace becomes a real time phenomenon. Time Value: real time analysis yields improved data driven decisions.
Driver: Increase in data sources; improved throughput connectivity; enhanced computing power of data generating devices.

Variety
Description: The rise of information coming from new sources both inside and outside the confines of the enterprise or organisation creates management, governance and architectural pressures on IT.
Attribute: Structured: 15% of data today is structured with rows and columns. Unstructured: 85% is unstructured or human generated information. Semi structured: the combination of structured and unstructured data is becoming paramount. Complexity: where data sources are moving and residing.
Driver: Mobile; Social Media; Videos; Chat; Genomics; Sensors.

Note: 1 zettabyte = 1 000 000 000 000 000 000 000 bytes, or 10^21 bytes.
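To put the Volume figures above into more familiar units, the short sketch below converts the quoted 1.8 zettabytes into bytes and terabytes and works out how many times larger the projected 35 zettabytes would be. It only restates the arithmetic of the slide; the variable names are illustrative, and the decimal definition of a terabyte (10^12 bytes) is an assumption chosen to match the slide's decimal definition of a zettabyte.

```python
from fractions import Fraction

# Decimal (SI) units. The zettabyte value comes from the slide's footnote;
# the terabyte value is an assumption used only for comparison.
BYTES_PER_ZETTABYTE = 10 ** 21
BYTES_PER_TERABYTE = 10 ** 12

# Figures quoted on the slide (IDC Digital Universe Study).
current_zb = Fraction("1.8")   # exact fraction, avoids float rounding
projected_zb = Fraction(35)

current_bytes = current_zb * BYTES_PER_ZETTABYTE
current_tb = current_bytes / BYTES_PER_TERABYTE
growth_factor = projected_zb / current_zb

print(f"1.8 ZB = {int(current_bytes):,} bytes")
print(f"1.8 ZB = {int(current_tb):,} TB")
print(f"35 ZB is about {float(growth_factor):.1f} times 1.8 ZB")
```

In other words, the projection implies a digital universe roughly nineteen times its quoted size.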
Big Data landscape Mind Map courtesy of Gary Crawford www.garycrawford.co.uk
Some use cases
E-tailing: Recommendation engines. Cross channel sales attribution, average order value, lifetime value. Event analytics: what series of steps (golden path) led to a desired outcome (e.g., purchase, registration).
Financial Services: Compliance and regulatory reporting. Risk analysis and management. Fraud detection and security analytics. CRM and customer loyalty programs. Credit scoring and analysis. Trade surveillance.
Government: Fraud detection and cyber security. Compliance and regulatory analysis. Energy consumption and carbon footprint management.
Manufacturing: Predictive maintenance scheduling. Product design and modification. Asset monitoring. Fault logging and cost predictions (motor manufacturing).
Academia: Student retention. Research data publication (via Hadoop cluster). Collaborative research with external organisations in ecological projects, design projects, and patient care & diagnosis.
Utilities: Smart meters. Energy use prediction.
Health & Life Sciences: Campaign and sales program optimization. Brand management. Patient care quality and program analysis. Supply chain management. Drug discovery and development analysis.
Retail/CPG: Merchandizing and market basket analysis. Campaign management and customer loyalty programs. Supply chain management and analytics. Event and behaviour based targeting. Mood mapping.
Telecommunications: Revenue assurance and price optimization. Customer churn prevention. Campaign management and customer loyalty. Call Detail Record (CDR) analysis. Network performance and optimization.
Web & Digital Media Services: Large scale clickstream analytics. Ad targeting, analysis, forecasting and optimization. Abuse and click fraud prevention. Social graph analysis and profile segmentation. Campaign management and loyalty programs.
Policing: Suspect tracking combining CCTV images, facial recognition software, travel trends and identifiers on travel cards. Event prediction.
Big Data in action
Google Flu Trends (www.google.org/flutrends/)
In 2009 Google was able to apply big data to the search terms run through its search engine to track the spread of the H1N1 pandemic, and here's how they did it:
First, researchers at Google obtained data from the Centers for Disease Control and Prevention (CDC) regarding the spread of the seasonal flu between 2003 and 2008.
Next, the researchers took the 50 million most common search terms punched into their search engine, and traced where and when they were punched in during the flu seasons between 2003 and 2008.
The researchers then plugged both data sets into their computers to look for correlations. As the authors explain, all their system did was look for correlations between the frequency of certain search queries and the spread of the flu over time and space. In total, they processed a staggering 450 million different mathematical models in order to test the search terms... And they struck gold: their software found a combination of 45 search terms that, when used together in a mathematical model, had a strong correlation between their prediction and the official figures nationwide.
Then, in 2009, when the H1N1 flu pandemic struck, Google used its predictive model, together with the search terms that were being registered in its search engine in real time, to track the spread of the virus. It worked like a charm. As was reported at the time, when the H1N1 crisis struck in 2009, Google's system proved to be a more useful and timely indicator than government statistics with their natural reporting lags.
Now, the H1N1 virus did not turn out to be as deadly as many predicted. Nevertheless, there is no guarantee that we will be so fortunate with future flu pandemics. Should we ever find ourselves up against a very deadly flu pandemic, Google's predictive model could end up saving many, many lives.
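The correlation step described above can be illustrated with a very small sketch. This is not Google's actual model: it simply ranks a handful of search terms by the Pearson correlation between their weekly frequency and the official flu rate. The terms, the weekly numbers and the function names (`pearson`, `rank_terms`) are all invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no signal
    return cov / math.sqrt(var_x * var_y)

def rank_terms(term_frequencies, official_rates):
    """Return (term, correlation) pairs, most strongly correlated first."""
    scores = {
        term: pearson(freqs, official_rates)
        for term, freqs in term_frequencies.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented weekly data: official flu rate and search counts per term.
official_rates = [1.0, 2.0, 4.0, 7.0, 5.0, 2.0]
term_frequencies = {
    "flu symptoms": [10, 21, 39, 72, 48, 19],
    "cough remedy": [12, 18, 35, 60, 50, 25],
    "football scores": [50, 48, 52, 49, 51, 50],
}

ranking = rank_terms(term_frequencies, official_rates)
for term, score in ranking:
    print(f"{term}: {score:.3f}")
```

In this toy data the two flu-related terms track the official rate closely and the unrelated term does not. The real system tested combinations of terms at a vastly larger scale, as the slide describes.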
MSc Big Data Analytics @ Sheffield Hallam University
Developed in response to a need from Business.
Highly practical and skills oriented.
Utilises industry leading software packages.
Partnered with Industry to give real experience & real challenges to students.
Semester 1 Semester 2 Semester 3 Data Quality, Data Mining, Handling Data in the Cloud, Statistical Modelling, Social & Economical aspects Advanced Statistical Modelling, Big Data & Distributed of Cloud Dissertation, Vendor Certification Vendor Certification Systems
Chris Brown Consultant Data Analytics OCF Plc +44 7943 594084 cbrown@ocf.co.uk http://blog.ocf.co.uk/
Definitions of Big Data from various influential organisations
1. Gartner. In 2001, a Meta (now Gartner) report noted the increasing size of data, the increasing rate at which it is produced and the increasing range of formats and representations employed. This report predated the term "big data" but proposed a three-fold definition encompassing the three Vs: Volume, Velocity and Variety. This idea has since become popular and sometimes includes a fourth V, veracity, to cover questions of trust and uncertainty.
2. Oracle. Big data is the derivation of value from traditional relational database driven business decision making, augmented with new sources of unstructured data.
3. Intel. Big data opportunities emerge in organizations generating a median of 300 terabytes of data a week. The most common forms of data analyzed in this way are business transactions stored in relational databases, followed by documents, e-mail, sensor data, blogs, and social media.
4. Microsoft. Big data is the term increasingly used to describe the process of applying serious computing power (the latest in machine learning and artificial intelligence) to seriously massive and often highly complex sets of information.
5. The Method for an Integrated Knowledge Environment open source project. The MIKE project argues that big data is not a function of the size of a data set but its complexity. Consequently, it is the high degree of permutations and interactions within a data set that defines big data.
6. The National Institute of Standards and Technology. NIST argues that big data is data which "exceed(s) the capacity or capability of current or conventional methods and systems". In other words, the notion of "big" is relative to the current standard of computation.
What is Big Data?