The Changing Nature and Uses of Data




Jim Pinkelman, Senior Director, Microsoft Research Connections
jimpi@microsoft.com

Outline: Changes in Data; Effect and Behavior; Challenges and Needs

Data Size and Speed are Growing
Human genome: the entire sequence of DNA for the human body consists of around 3 billion base pairs. The human genome requires ~750 megabytes of storage.
Large Hadron Collider: 150 million sensors delivering data 40 million times per second. Data flow: ~700 MB/sec, ~15 PB/year. 1000s of scientists around the world; institutions in 34 different countries.
SKA: thousands of small antennas spread over a distance of more than 3000 km. Data flow: ~60 GB/sec; 1 million PB/day. The SKA supercomputer will perform 10^18 operations per second (~100M PCs).
Timeline: 1990, 2000, 2010, 2020
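The figures on this slide can be sanity-checked with a little arithmetic. The sketch below is only illustrative and rests on assumptions the slide does not state: each base is stored in 2 bits (enough for four letters), units are decimal (1 MB = 10^6 bytes, 1 PB = 10^15 bytes), and the yearly volume is what a constant data flow would produce if it ran all year without interruption. The function names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def genome_storage_bytes(base_pairs, bits_per_base=2):
    """Bytes needed for a sequence, assuming 2 bits per base (A, C, G, T)."""
    return base_pairs * bits_per_base // 8

def yearly_volume_pb(rate_mb_per_s):
    """Petabytes per year for a constant flow, assuming it never pauses."""
    return rate_mb_per_s * 1e6 * SECONDS_PER_YEAR / 1e15

def ops_per_pc(total_ops_per_s, pc_count):
    """Operations per second each PC would need to supply."""
    return total_ops_per_s / pc_count

genome_mb = genome_storage_bytes(3_000_000_000) / 1e6  # 750.0 MB
continuous_pb = yearly_volume_pb(700)                  # about 22 PB/year
per_pc = ops_per_pc(1e18, 100e6)                       # 1e10 operations/sec per PC

print(genome_mb, round(continuous_pb, 2), per_pc)
```

Under these assumptions the ~750 MB genome figure falls out directly. A flow of ~700 MB/sec sustained all year would give roughly 22 PB, so the slide's ~15 PB/year is consistent with a flow that does not run continuously, though the slide does not say so.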

Data Complexity is Increasing
"Data means recorded information, regardless of the form or medium on which it may be recorded, and includes writings, films, sound recordings, pictorial reproductions, drawings, designs, or other graphic representations, procedural manuals, forms, diagrams, work flow charts, equipment descriptions, data files, data processing or computer programs (software), statistical records, and other research data."
Source: The National Institutes of Health (NIH) Grants Policy Statement, http://grants.nih.gov/grants/policy/nihgps_2012/nihgps_ch8.htm

Data Sensors have become Digital
"... vision of processing power so distributed throughout the environment that computers per se effectively disappear" (Adam Greenfield, Head of design and UI design, Nokia)
Smaller, Inexpensive, Wireless, Sophisticated, Abundant, Digital

Data can now be transmitted Wirelessly
Harbor Research
"... not only 'hard sensors' that track physical attributes such as light, heat, pressure and motion, but also 'soft sensors' such as a user's calendar, social network activity and Web browsing habits."
"What context awareness does is collect all of that, some of which is up-to-the-minute on the physical sensors and some of which is accumulated incrementally over a long expanse of time through these soft senses, to create devices that really anticipate your needs." (Intel CTO Justin Rattner)
Distributed, Connected, Flexible, Power Efficient, Internet of Things

The Cloud is Available
Omnipresent Services: uploading data; download commands; streaming signals; network between devices
Compute & Storage Elasticity: lower barriers to adoption; lower barriers to scaling; lower overheads
Accelerates Collaboration: sharing data; sharing algorithms; co-authoring; reproducible research


Uses of Analytics Continue to Grow
Business Intelligence in Industry
1. More data; incorporation of external data sources
2. Emergence of near-real-time analytics
3. Less structured data and non-RDBMS data warehouses
Chaudhuri, S., Dayal, U., Narasayya, V. An Overview of Business Intelligence Technology. Communications of the ACM, Vol. 54, No. 8

Unstructured Data is Valuable
MapReduce -> Hadoop
Uses: targeting advertisements; fraud detection; financial modeling; business analytics; predicting markets; social network analysis; audience sentiment
http://www.datameer.com/blog/uncategorized/the-hadoop-ecosystem-visualized-in-datameer.html
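To make the MapReduce idea named on this slide concrete, here is a minimal single-process sketch of the pattern applied to unstructured text: a word count over short posts. It is not Hadoop's API and says nothing about how Hadoop distributes work; the function names and the sample posts are made up for illustration.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.lower().split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(records):
    return reduce_phase(shuffle(map_phase(records)))

# Hypothetical unstructured input, e.g. short audience comments.
posts = ["great show tonight", "great cast", "show was great"]
counts = word_count(posts)
print(counts)
```

In a real cluster the map and reduce steps run in parallel on many machines and the shuffle moves data between them; the sketch only keeps the three-stage structure.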

The Nature of Research is Changing
Equation shown on the slide: $\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - K\frac{c^2}{a^2}$
Thousand years ago: description of natural phenomena
Last few hundred years: Newton's laws, Maxwell's equations
Last few decades: simulation of complex phenomena
Today and the future: unify theory, experiment and simulation with large multidisciplinary data, using data exploration and data mining (from instruments, sensors, humans ...)
Jim Gray

DNA and The Tree of Life
"Inferring ancient divergences requires genes with strong phylogenetic signals", Leonidas Salichos & Antonis Rokas, Nature, Vol. 497, Issue 7449
St. George Jackson Mivart. Illustration: Ernst Haeckel
"... nearly 150 years later, scientists have vast amounts of data with which to build so-called phylogenetic trees, the modern version of Mivart's structure. Advances in DNA sequencing technology and bioinformatics enable them to compare the sequence of hundreds of genes, sometimes entire genomes, among many different species, creating a tree of life more detailed than ever before." (Emily Singer, Simons Foundation)
https://www.simonsfoundation.org/features/science-news/a-new-approach-to-building-the-tree-of-life/
http://www.nature.com/nature/journal/v497/n7449/full/nature12130.html

Science is now Data-Intensive
The Long Tail of Science (axes: Data Complexity, Number of Researchers)
Extremely large data sets: expensive to move; domain standards; high computational needs; supercomputers, HPC, grids; e.g. High Energy Physics, Astronomy
Large data sets: some standards within domains; shared datacenters & clusters; research collaborations; e.g. Genomics, Financial
Medium & small data sets: flat files, Excel; widely diverse data; few standards; local servers & PCs; e.g. Social Sciences, Humanities
"A paper from Microsoft Research, aptly titled 'Nobody ever got fired for buying a cluster,' which points out that a lot of the problems solved by engineers at even the most data-hungry firms don't need to be run on clusters. ... there are vast classes of problems for which clusters are a relatively inefficient or even totally inappropriate solution." (Christopher Mims, Science and Technology Editor, Quartz)

Industry is building out massive Infrastructure

Microsoft Cloud Research Engagement Project
Worked with international funding agencies to grant access to cloud resources to researchers. 90 projects worldwide.
http://research.microsoft.com/en-us/projects/azure/
US: Penn, Louisiana, Washington, New York, New Mexico, California, Colorado, Michigan, South Carolina, Texas, Florida, Georgia, Mass., Virginia, North Carolina, Indiana, Delaware
Europe: Brussels; Venus-C; England - University of Nottingham; Inria in France; plus Italy, Spain, Greece, Denmark, Switzerland, Germany
China - Now!
Partners: NICTA, ANU, CSIRO

Bioinformatics Research on Windows Azure
Protein Folding: The University of Washington is studying the ways proteins from salmonella inject DNA into cells. Used 2000 concurrent cores.
Joint Genetic and Neuroimaging Analysis: France's premier research institute INRIA is using 1000 cores of Azure to study large cohorts of subjects to understand links between genetic patterns and brain anomalies.
Comparative Genomics: Researchers at the University of North Carolina Charlotte are doing large-scale operon prediction using Windows HPC Scheduler on Azure, using 300 cores to do BLAST analysis. Used 1,000,000 hours.
Drug Discovery: Researchers at Newcastle University in the U.K. are using Azure to model the properties (toxicity, solubility, biological activity) of molecules for potential use as drugs.
Systems Biology: The University of Trento Centre for Computational and Systems Biology has developed an Azure-based tool, BetaSIM, for modeling and simulating biological systems.

Machine Learning is becoming more Applicable
Data and massive parallelism change the game.
Supervised machine learning: inferring knowledge from labeled training data.
Unsupervised: finding the hidden structure in data without labels.
"Used an array of 16,000 processors to create a neural network with more than one billion connections. They then fed it random thumbnails of images, one each extracted from 10 million YouTube videos." (Stanford University, Andrew Ng; Google fellow Jeff Dean)
"... turning my English into Chinese in two steps. The first takes my words and finds the Chinese equivalents, and while non-trivial, this is the easy part. The second reorders the words to be appropriate for Chinese, an important step for correct translation between languages." (Microsoft Research, Rick Rashid)
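The supervised/unsupervised distinction on this slide can be shown with a toy example. The sketch below is an illustration only, using two deliberately simple stand-in techniques that the slide does not mention: a nearest-centroid classifier for the supervised case and k-means with two clusters for the unsupervised case, both on one-dimensional data. All names and data values are hypothetical.

```python
from statistics import mean

def fit_centroids(points, labels):
    """Supervised: average the training points that share a label."""
    by_label = {}
    for x, y in zip(points, labels):
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}

def predict(centroids, x):
    """Assign x the label whose centroid is closest."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))

def two_means(points, iterations=10):
    """Unsupervised: split unlabeled 1-D points into two clusters (k-means, k=2).

    Assumes the points contain at least two distinct values.
    """
    low, high = min(points), max(points)
    for _ in range(iterations):
        left = [x for x in points if abs(x - low) <= abs(x - high)]
        right = [x for x in points if abs(x - low) > abs(x - high)]
        low, high = mean(left), mean(right)
    return low, high

points = [1, 2, 3, 10, 11, 12]

# Supervised: labels are given, and the model learns one centroid per label.
labels = ["small", "small", "small", "large", "large", "large"]
centroids = fit_centroids(points, labels)
guess = predict(centroids, 8)

# Unsupervised: no labels, yet the same two groups emerge from the data.
cluster_centers = two_means(points)

print(centroids, guess, cluster_centers)
```

Both routes end with the same two group centers here; the difference is that the supervised route was told which point belongs to which group, while the unsupervised route had to find the grouping on its own.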

Collaboration and Sharing of Data is Expected and Growing
"... expects investigators to share with other researchers, at no more than incremental cost and within a reasonable time, the data, samples, physical collections and other supporting materials created or gathered in the course of the work."
"NIH reaffirms its support for the concept of data sharing. We believe that data sharing is essential for expedited translation of research results into knowledge, products, and procedures to improve human health. The NIH expects and supports the timely release and sharing of final research data from NIH-supported studies for use by other researchers."
"A primary goal of Data.gov is to improve access to Federal data and expand creative use of those data beyond the walls of government by encouraging innovative ideas (e.g., web applications). Data.gov strives to make government more transparent and is committed to creating an unprecedented level of openness in Government."

Data Visualization is leading to Digital Storytelling
Napoleon's March, Charles Joseph Minard. Edward Tufte: "Probably the best statistical graphic ever drawn"
Hans Rosling on CNN: US in a converging world
http://www.gapminder.org/videos/hans-rosling-on-cnn-us-in-a-converging-world/
Hans Rosling, Karolinska Institute, Gapminder Foundation


The Data Lifecycle
Lifecycle stages: Archiving & Preserving; Dissemination & Sharing; Acquisition & Modelling; Data Analysis & Data Mining; Collaboration & Visualisation
Technical Challenges: Privacy & Security; Integrity; Data Portability; Communication Protocols; Metadata Standards; Data Curation and Archiving
Social & Cultural: Economic Sustainability; Trust; Institutional Agility; Talent Pipeline; Scholarly Communications; Semantic Diversity

Data Science
"Columbia will offer new master's and certificate programs heavy on data. The University of San Francisco will soon graduate its charter class of students with a master's in analytics. Other institutions teaching data science include New York University, Stanford, Northwestern, George Mason, Syracuse, University of California at Irvine and Indiana University." (Claire Cain Miller, NY Times)
http://wikipedia.datascience101.wordpress.com/

http://columbiadatascience.files.wordpress.com/

Talent - A Data Scientist?
Data Engineer: People who are expert at operating at low levels close to the data and who write code that manipulates it. They may have some machine learning background. Large companies may have teams of them in-house, or they may look to third-party specialists to do the work.
Data Analyst: People who explore data through statistical and analytical methods. They may know programming, or they may be a spreadsheet wizard. Either way, they can build models based on low-level data. They eat and drink numbers; they know which questions to ask of the data. Every company will have lots of these.
Data Steward: People who think about managing, curating, and preserving data. They are information specialists, archivists, librarians and compliance officers. This is an important role: if data has value, you want someone to manage it, make it discoverable, look after it and make sure it remains usable.
Source: "What is a data scientist?", Microsoft UK Enterprise Insights Blog, Kenji Takeda
http://blogs.msdn.com/b/microsoftenterpriseinsight/archive/2013/01/31/what-is-a-data-scientist.aspx

Real Evolution
Computing Efficiencies; Internet of Things; Interdisciplinary Research; Digital Storytelling; Cloud Storage; Cloud Analytics; Reproducible Research; Sensing & Sensors; Digitization; Advanced Analytics and Viz; Growth of ML for Prediction; Data Scientists; Sharing and Collaboration

Thank you
Jim Pinkelman, Microsoft Research Connections
jimpi@microsoft.com