Applying Semantics to Unstructured Data (Big and Getting Bigger)
Wednesday, November 30, 2012, 4:00-5:00
Bryan Bell, Vice President, Enterprise Solutions, Expert System
Lynda Moulton, Analyst & Consultant, LWM Technology Services
Peter O'Kelly, Principal Analyst, O'Kelly Associates
Overall Session Agenda Introduction and context-setting "Big Data" 101 for Business Semantics and the Big Data Opportunity
Big Data 101 Agenda
- Big data in context
- Recap
- Risks
- Recommendations
Big Data in Context
- What is big data?
  - Unhelpfully, both big data and NoSQL, generally considered a key part of the big data wave, are defined more in terms of what they aren't than what they are
  - A typical big data definition (Wikipedia): "[...] data sets that grow so large that they become awkward to work with using on-hand database management tools"
  - Often associated with Gartner's volume, variety (and complexity), and velocity model
  - Also value and veracity considerations
Big Data in Context
- Why is big data a big deal now?
  - Commoditized hardware, software, and networking
    - Capability and price/performance curves that continue to defy all economic laws
    - Cloud services with radical new capability/cost equations
  - Maturation and uptake of related open source software, especially Hadoop
    - Powerful and often no- or low-cost
Big Data in Context
- Why is big data a big deal now (continued)?
  - Market enthusiasm for NoSQL systems
  - Useful and often open source/public domain data sources and services
  - Mainstreaming of semantic tools and techniques
A Prime Minicomputer, c. 1982
Fast-Forward to 2012
Google BigQuery
Hadoop
- Hadoop is often considered central to big data
- Originating with Google's MapReduce architecture, Apache Hadoop is an open source architecture for distributed processing on networks of commodity hardware
- From Wikipedia:
  - "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes
  - "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output, the answer to the problem it was originally trying to solve
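The map and reduce steps described above can be sketched in a few lines of plain Python. This is a minimal single-process illustration of the pattern, not Hadoop's API: the function names (`map_step`, `shuffle`, `reduce_step`, `word_count`) and the word-count task are assumptions chosen for the example, and a real cluster would run the map calls on separate worker nodes.

```python
from collections import defaultdict


def map_step(document):
    """Map: turn one input split into (word, 1) pairs."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the intermediate values by key, as the framework would between steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(key, values):
    """Reduce: combine all partial answers for one key."""
    return key, sum(values)


def word_count(documents):
    """Run map over every split, then reduce each group (hypothetical driver)."""
    mapped = []
    for document in documents:  # each call could run on a different worker
        mapped.extend(map_step(document))
    return dict(reduce_step(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data is big", "data about data"])
```

The point of the split is that `map_step` needs only its own piece of the input, so the sub-problems can be distributed freely; only the grouping and reduce phases need to see results from more than one worker.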
Hadoop
- Commercial application domains include (from Wikipedia)
  - Log and/or clickstream analysis of various kinds
  - Marketing analytics
  - Machine learning and/or sophisticated data mining
  - Image processing
  - Processing of XML messages
  - Web crawling and/or text processing
  - General archiving, including of relational/tabular data, e.g., for compliance
Hadoop
- Hadoop is popular and rapidly evolving
- Most leading information management vendors have embraced Hadoop
- There is now a Hadoop ecosystem
Meanwhile, Back in the Googleplex
- Dremel, BigQuery, Spanner, and other really big data projects
Google Now
A NoSQL Taxonomy (from the NoSQL Wikipedia article)
A View of the NoSQL Landscape
Another NoSQL Landscape View
NoSQL Perspectives
- The NoSQL meme confusingly conflates
  - Document database requirements
    - Best served by XML DBMS (XDBMS)
  - Physical database model decisions on which only DBAs and systems architects should focus
    - And which are more complementary than competitive with DBMS
  - Object databases, which have floundered for decades
    - But with which some application developers are nonetheless enamored, for minimized "impedance mismatch," despite significant information management compromises
  - Semantic (e.g., RDF) models
    - Also more complementary than competitive with RDBMS/XDBMS
- Also consider: the traditional DBMS players can leverage the same underlying technology power curves
Data as a Service
- The (single source of) truth is out there?...
- High-quality data sources are being commoditized
  - Value is shifting to the ability to discern and leverage conceptual connections, not just to manage big databases
- Some resources and developments to explore
  - Social networking graphs and activities
  - Data.com (Salesforce.com)
  - Data.gov
  - Google Knowledge Graph
  - Linked Data
  - Microsoft Windows Azure Data Marketplace
  - Wikidata.org
  - Wolfram Alpha
Mainstreaming Semantics
- Tools and techniques applied in search of more meaning, e.g.,
  - Vocabulary management
  - Disambiguation and auto-categorization
  - Text mining and analysis
  - Context and relationship analysis
- It's still ideal to help people capture and apply data and metadata in context
  - Semantic tools/techniques are complementary
Mainstreaming Semantics
- The Semantic Web is still more vision than reality
- But Google, Microsoft, Yahoo, and Yandex, for example, are improving Web searches by capturing and applying more metadata and relationships via schema.org schemas in Web pages
- And Google's Knowledge Graph is about "things, not strings," with, as of mid-2012, 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects
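To make "metadata and relationships via schema.org schemas" and "things, not strings" concrete, here is a small sketch, offered as an illustration rather than as any search engine's actual pipeline. It describes a person using the schema.org `Person` and `Organization` types in JSON-LD (one of the syntaxes schema.org markup can be written in today) and then flattens the description into subject/property/value facts. The entity values are taken from this deck's title slide purely as sample data; `to_triples` is a hypothetical helper and a deliberate simplification, not a conforming JSON-LD processor.

```python
import json

# Sample entity using schema.org types and properties (values are illustrative).
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Peter O'Kelly",
    "jobTitle": "Principal Analyst",
    "worksFor": {"@type": "Organization", "name": "O'Kelly Associates"},
}


def to_triples(node, subject=None):
    """Flatten a nested description into (subject, property, value) facts.

    Simplification: a node is identified by its "name", and only nested
    dicts are followed (no lists, no @id handling).
    """
    subject = subject or node.get("name", "_:anonymous")
    triples = []
    for key, value in node.items():
        if key == "@context":
            continue
        if isinstance(value, dict):
            # A nested node is another "thing", linked by a relationship.
            obj = value.get("name", "_:anonymous")
            triples.append((subject, key, obj))
            triples.extend(to_triples(value, obj))
        else:
            triples.append((subject, key, value))
    return triples


markup = json.dumps(person, indent=2)  # what would be embedded in a Web page
facts = to_triples(person)
```

The `worksFor` fact links two typed objects rather than two text fragments, which is the distinction the "things, not strings" phrase draws.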
Recap
- Commoditization and cloud
  - Very significant new opportunities
- Hadoop and related frameworks
  - Complementary to RDBMS and XDBMS
- NoSQL
  - Likely headed for meme-bust
- Data services
  - Game-changing potential
- Semantic tools and techniques
  - Rapidly gaining momentum
Risks
- The potential for an ever-expanding set of information silos
  - Focus on minimized redundancy and optimized integration
- GIGO (garbage in, garbage out) at super-scale
  - New opportunities for unprecedented self-inflicted damage, for organizations that don't model or query effectively
- Cognitive overreach
  - The potential for information workers to create and act on nonsensical queries based on poorly designed and/or misunderstood information models
- Skills gaps can create competitive disadvantages
  - Modeling, query formulation, and data analysis
  - Critical thinking and information literacy
Recommendations
- Aim high: big data is in many respects just getting started
  - A lot of technology recycling but also significant and disruptive innovation
- Work to build consensus among stakeholders on the opportunities and risks
- Focus on human skills, e.g., critical thinking and information literacy
  - For now, an instance of the most creative and powerful type of semantic big data processor we know of is between your ears