Data First Framework: How to Build Your Enterprise Data Hub. Luis Campos, Big Data Solutions Director, Oracle Europe, Middle East and Africa. @luigicampos. June 2014. Copyright 2015 Oracle and/or its affiliates. All rights reserved. No DBAs were hurt in the making of this presentation.
Why do we measure things? Ultimately, to convert Known Unknowns!
Oracle Big Data Discovery: The Visual Face of Hadoop. The Hidden Face of Spark, and Dgraph, and WebLogic, Oozie, HUE. Luis Campos, Director, Big Data Solutions EMEA. #BudapestData @luigicampos @oraclebigdata. Budapest Data Forum, June 2015.
Who is the Data Scientist? Give me the data. Give me the computing power. I will show you the FUTURE!
Data Science: Limits and Promises
Data Scientist Skillset (Unicorn profile)
Data Engineer + Business Analyst + Data Scientist: You need them all. Data Engineer (DBA, ETL, etc.): representation, reporting and summarization. Data Scientist: extrapolation, movement. Business Analyst: interpretation, prescription.
Storyboard Terrorboard: The Arrival of a New Data Source. Load data from the new data source into Hadoop. Extracting new insight from the newly added data set proves to be almost impossible due to data quality issues. Data is very inconsistent, inaccurate, and incomplete. Data needs to be cleansed, so data cleansing rules need to be set. Go through many transform-evaluate iterations with product marketing until the data is in the desired format. Too much time is spent on manual data wrangling tasks. Focus is directed away from generating valuable insights for the business. By the time insights can be extracted, product marketing has already moved on to the next problem.
Introducing Oracle Big Data Discovery 1.0: Explore, Analyze, Discover, Transformation, Augment. Business Analytics Product Group.
Oracle Big Data Discovery. The Visual Face of Hadoop: find, explore, transform, discover, share.
Oracle Big Data Discovery. The Visual Face of Hadoop. Projects, Data Sets, Attributes, Scratchpad: find, explore, transform, discover, share.
Catalog: Projects are composed of Data Sets. Search and guided navigation for ease of use. See data set summaries, user annotations and recommendations. Load data to Hadoop via self-service.
Explore: Visualize all attributes by type. Sort attributes by name, information potential*, or relation. Assess attribute statistics, data quality and outliers. Use the scratchpad to uncover correlations between attributes. (* Shannon entropy-based algorithm.)
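The "information potential" sort above is described only as Shannon entropy based. As a minimal sketch of the general idea, and not BDD's actual algorithm, the snippet below scores each attribute by the plain Shannon entropy of its observed values and ranks attributes from most to least varied. The function names and the sample records are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Return the Shannon entropy, in bits, of the observed value distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum over values of p * log2(1 / p), with p = count / total
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def rank_attributes(rows):
    """Rank attribute names by entropy, highest first, over a list of dict records."""
    names = {name for row in rows for name in row}
    scores = {name: shannon_entropy([row.get(name) for row in rows]) for name in names}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample: "country" varies, "flag" is constant and carries no information.
records = [
    {"country": "HU", "flag": "y"},
    {"country": "PT", "flag": "y"},
]
ranking = rank_attributes(records)
```

An attribute with a single repeated value scores zero, so it sinks to the bottom of the ranking, which matches the intuition that it offers little to explore.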
Scratchpad: Cool and Addictive! Explore: Graphic type changes as additional attributes are added. Auto-selects the best visualization. Offers next-best graphics option(s).
Transform: Intuitive, user-driven data wrangling. Extensive library of powerful data transformations and enrichments. Preview results, undo, commit and replay transforms. Test on sample data, then apply to the full data set in Hadoop.
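The preview / undo / commit / replay cycle above can be pictured as an ordered script of transforms that is tried on a sample and later replayed on the full data. The sketch below is an illustration under that assumption only; the class, the transform functions and the data are hypothetical and do not reflect BDD's transform engine.

```python
class TransformScript:
    """Ordered list of named row-level transforms that can be previewed, undone and replayed."""

    def __init__(self):
        self.steps = []  # list of (name, function) pairs, in application order

    def add(self, name, func):
        self.steps.append((name, func))

    def undo(self):
        """Drop the most recently added step, if any."""
        if self.steps:
            self.steps.pop()

    def run(self, rows):
        """Apply every step to a copy of each row; the input rows are never mutated."""
        result = []
        for row in rows:
            new_row = dict(row)
            for _, func in self.steps:
                new_row = func(new_row)
            result.append(new_row)
        return result


def trim_city(row):
    return {**row, "city": row["city"].strip()}


def upper_city(row):
    return {**row, "city": row["city"].upper()}


full = [{"city": " Budapest "}, {"city": "lisbon"}, {"city": " Porto"}]
sample = full[:2]

script = TransformScript()
script.add("trim", trim_city)
script.add("upper", upper_city)
preview = script.run(sample)   # preview both steps on the sample only
script.undo()                  # decide against upper-casing
committed = script.run(full)   # replay the remaining step on the full data
```

Keeping the script separate from the data is what makes the "test on a sample, then apply to everything" workflow cheap: the same steps are simply run again against a larger input.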
Oracle Big Data Discovery. The Visual Face of Hadoop. Unlock big data for everyone: find, explore, transform, discover, share.
Discovery: Dashboard creation. Control over layout, filtering behavior, and metrics. Formatting controls at Project level and at Component level. D3 charts.
Discover: Join and blend data for deeper perspectives. Compose project pages via drag and drop. Use powerful search and guided navigation to ask questions. See new patterns in rich, interactive data visualizations.
Share: Share projects, bookmarks and snapshots with others. Build galleries and tell big data stories. Collaborate and iterate as a team. Publish blended data to HDFS for use in other tools.
Now let's dive in
The Hadoop Ecosystem. Standard Hadoop Node. Hadoop Analytic & Data Processing Tools: Spark, MapReduce, Sqoop, MLlib, R-on-Hadoop, Hive. Hadoop Management Tools: HCatalog, Oozie (Workflow), YARN, ZooKeeper. HDFS.
Big Data Discovery in Hadoop. Hadoop Node: Hadoop Analytic & Data Processing; BDD Data Processing; Hadoop Management Tools; HDFS. BDD Node: BDD Server Components. Indexing & Transformation of Data; Management; Visual Tool; Indexing.
Big Data Discovery in Detail. Requests. Hadoop Node: Hadoop Analytic & Data Processing; Hadoop Management Tools; HDFS; Data Processing (Spark); Dgraph HDFS Agent; Sync & Transformations; Self Service Load; Transformations. BDD Node: Data Processing CLI; DP Workflows (Oozie); Hive Table Detector; Studio Visual Interface (J2EE); DP Workflows (Oozie); Dgraph Gateway (J2EE), Caching + Business Logic; Dgraph Instance(s), Indexing; EM Plug-in; EM Agent. Note that although a BDD data set can be deleted by a Studio user, the Data Processing software can never delete a Hive table. Therefore, it is up to the Hive administrator to delete obsolete Hive tables.
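The note above implies a one-way relationship: BDD data sets follow Hive tables, never the other way round. A minimal sketch of that reconciliation, assuming a simple name-based mapping, is shown below. The function, its return keys and the sample names are hypothetical; this is not the Hive Table Detector's real logic, and what happens to orphaned data sets is left to the caller.

```python
def reconcile(hive_tables, bdd_data_sets):
    """Compare Hive table names with the source tables of BDD data sets.

    hive_tables: iterable of Hive table names
    bdd_data_sets: dict mapping data set name -> source Hive table name

    The result only ever proposes actions on the BDD side. It never lists
    a Hive table for deletion; that stays with the Hive administrator.
    """
    tables = set(hive_tables)
    covered = set(bdd_data_sets.values())
    return {
        # Hive tables that have no BDD data set yet
        "create_data_set_for": sorted(tables - covered),
        # BDD data sets whose source Hive table no longer exists
        "orphaned_data_sets": sorted(
            name for name, table in bdd_data_sets.items() if table not in tables
        ),
    }


plan = reconcile(
    ["sales", "web_logs", "tweets"],
    {"Sales 2015": "sales", "Old CRM": "crm_export"},
)
```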
Data Ingest: 2 Methods. (1) Self-Service Upload via BDD Studio: the preferred method for the Business Analyst. (2) Command Line Interface (CLI): the preferred method for IT / Data Engineer / Data Scientist / anyone who loves CLIs. Remember: BDD does not hold data, only index and metadata!
Data Ingest - Personal Data Upload. Big Data Discovery supports personal data upload in a variety of formats. Flat Files: a user can upload a personal file in the following formats: Delimited (CSV, tab, pipe, etc.); Excel (XLS, XLSX).
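To make the delimited formats above concrete, here is a small sketch of how an upload step might tell comma, tab and pipe files apart before parsing them. It uses only Python's standard csv module; the detection heuristic and function names are assumptions for illustration, not how BDD Studio handles uploads, and Excel files are out of scope.

```python
import csv
import io


def guess_delimiter(text, candidates=(",", "\t", "|")):
    """Return the first candidate that splits every non-empty line into the
    same number of fields (more than one), or None if no candidate does."""
    lines = [line for line in text.splitlines() if line.strip()]
    for delim in candidates:
        widths = {len(row) for row in csv.reader(lines, delimiter=delim)}
        if len(widths) == 1 and widths.pop() > 1:
            return delim
    return None


def read_delimited(text):
    """Parse delimited text with a header row into a list of dict records."""
    delim = guess_delimiter(text)
    if delim is None:
        raise ValueError("could not determine delimiter")
    return list(csv.DictReader(io.StringIO(text), delimiter=delim))


rows = read_delimited("id|city\n1|Budapest\n2|Lisbon\n")
```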
Deep Dive... Nahh, we don't have time. Just a command line example for the amusement of hardline professionals.
Hive Table METADATA Creation via SQL
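The SQL from this slide did not survive extraction. As a stand-in, the sketch below builds a typical HiveQL statement that registers table metadata over delimited files already sitting in HDFS, which is the general idea of creating Hive table metadata via SQL. The table name, columns and HDFS path are hypothetical, and the helper function is not part of BDD or Hive.

```python
def create_external_table_ddl(table, columns, location, delimiter=","):
    """Build a HiveQL CREATE EXTERNAL TABLE statement for delimited text files.

    columns is a list of (column_name, hive_type) pairs. An EXTERNAL table only
    records metadata; the files under `location` stay where they are.
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {cols}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table over CSV files in a hypothetical HDFS directory.
ddl = create_external_table_ddl(
    "default.sales",
    [("order_id", "INT"), ("city", "STRING"), ("amount", "DOUBLE")],
    "/user/demo/sales",
)
print(ddl)
```

The generated statement would then be run through a Hive client of your choice; once the table exists in the Hive metastore, it becomes a candidate for a BDD data set.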
BDD Command Line Data Set Creation (manual) (only needed if the BDD Listener* is not working). Run manually or via cron job. The BDD installation orchestration script defaults to a cron job. Needs configuration to run correctly (define paths to cluster, Dgraphs, etc.). Can be run on an individual table, a group (whitelist/blacklist), a Hive database, or all Hive tables. Invoking the BDD Hive Table Detector* (invoked within the DP CLI script): keeps Hive database/tables in sync with BDD data sets. Manual Data Set Creation.
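The whitelist/blacklist grouping mentioned above boils down to a simple selection rule. The sketch below shows one plausible reading of it: process the whitelisted tables (or all tables when no whitelist is given), minus anything blacklisted. The function and table names are hypothetical, and the real DP CLI options and file formats are not reproduced here.

```python
def select_tables(all_tables, whitelist=None, blacklist=None):
    """Return the Hive tables to process, preserving their original order.

    whitelist: if given, only these tables are considered
    blacklist: if given, these tables are always skipped
    """
    allowed = None if whitelist is None else set(whitelist)
    blocked = set(blacklist or ())
    return [
        table
        for table in all_tables
        if (allowed is None or table in allowed) and table not in blocked
    ]


tables = ["sales", "tweets", "web_logs", "tmp_scratch"]
everything_but_scratch = select_tables(tables, blacklist=["tmp_scratch"])
only_sales = select_tables(tables, whitelist=["sales", "tweets"], blacklist=["tweets"])
```

In this reading the blacklist wins over the whitelist, which is the safer default when both lists are maintained by different people.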
Job Monitoring & Debugging
Language Support. Studio: English, Chinese Simplified, German, Japanese, Korean, Portuguese Brazilian, Spanish. Search: Arabic, Basque, Belarusian, Bosnian, Bulgarian, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malay, Norwegian Bokmål, Norwegian Nynorsk, Persian, Polish, Portuguese, Portuguese (Brazilian), Romanian, Russian, Serbian (Cyrillic), Serbian (Latin), Slovak, Slovenian, Spanish, Swedish, Thai, Turkish, Ukrainian, Valencian, Vietnamese. Text Enrichment: Entity Extraction, Whitelist Tagger, Sentiment, Language Detection. English, French, German, Spanish, Italian, Portuguese (Brazilian); English; English, Danish, German, Spanish, French, Italian, Japanese, Korean, Chinese (Simplified), Chinese (Traditional), Portuguese; English, French, German, Spanish, Italian, Portuguese (Brazilian); All OLT 2.1 languages (50).
Authentication. By default BDD owns authentication, based on users and assigned roles. Users and roles may be imported via LDAP/Active Directory. Standard User Roles define a user's rights to both data sets and projects. Additionally, global roles are defined that control a user's read/write access to Hadoop. Single Sign-On is supported: the preferred method is via Oracle Access Manager, though other options, such as OpenSSO and SiteMinder, are also supported. Will bypass the login portlet (SSO User, SSO Portal). Support also exists for auto login hooks.
Oracle BDD 1.0. What is it for? Intuitive visual interfaces for the entire Hadoop analytics process. Data transformation and enrichment at scale. Shares data with Oracle and Hadoop ecosystems (via Big Data SQL). What is it not for? Modeling or advanced analytics. Reporting. Data engines other than Hadoop/Spark (e.g. NoSQL).
Big Data Discovery Cloud Service (coming soon)
BDD Easy Deployment Partner http://bigdatadisco.branchbird.com/bdd/web/home/index
Enjoy the rest of the conference! Thank You @luigicampos www.oracle.com/bigdatadiscovery