The Open Source Knowledge Discovery and Document Analysis Platform

Enabling Agile Intelligence through Open Analytics
The Open Source Knowledge Discovery and Document Analysis Platform
17/10/2012

Agenda
Introduction and Agenda
Problem Definition
Knowledge Discovery
Document Analysis
The Infinit.e Solution
Architecture
Use Cases
Questions

The Problem
http://techbuddha.wordpress/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/

Knowledge Discovery
Knowledge Discovery is the process of indexing and categorizing the contents of a corpus of data sources in order to identify what is contained in those sources and how to retrieve it.
What information do we have?
Where is the information located?

Document Analysis
Document Analysis is the process of analyzing the contents of a large number of documents in order to answer questions related to the content of those documents.
What kind of questions can we answer with our data?
What kind of enrichment can we apply to our data to improve our ability to answer organizational questions?

The Infinit.e Solution
Infinit.e is an Open Source Knowledge Discovery and Document Analysis platform that harvests, enriches, stores, retrieves, analyzes, and visualizes structured and unstructured documents.

The Architecture
[Architecture diagram. Labels: External Applications & GUIs; input formats RSS, XML, HTML, TXT, PDF, JDBC, etc.; REST-based API; Core Server; elasticsearch; output formats JSON, RSS, KML, GraphML, etc.; Enrichment; MongoDB; Hadoop; Linux.]

Storage
Infinit.e uses MongoDB for the following reasons:
Document-oriented storage
Horizontal and vertical scalability
The infinit.e.data_model library:
Manages connections to MongoDB
Converts JSON (BSON) to POJOs using Google's Gson library

Harvesting - Server
The infinit.e.core.server library manages the process of harvesting and cleansing documents:
service infinite-px-engine start
Configurable for timing and number of documents to harvest per cycle
Note: Migrating to the Apache UIMA framework is on our to-do list.

Harvesting - Document Types
The Infinit.e platform can harvest documents from:
URLs: RSS, HTML, etc.
File shares: Samba, Windows shares, and local files
Databases: via JDBC

Harvesting - Sources
Infinit.e harvests documents based on configuration information contained in Source documents like the following example:
{
  "_id": "4cbdb9f05ed98e7bed499270",
  "title": "Wired: Top News",
  "url": "http://feeds.wired/wired/index",
  "created": "Oct 19, 2010 11:32:00 AM",
  "description": "Top News",
  "extracttype": "Feed",
  "mediatype": "News",
  "modified": "Oct 19, 2010 11:32:00 AM",
  "tags": ["technology", "news"]
}
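The platform itself is Java and maps JSON to POJOs with Gson. As a rough Python analogue only, the sketch below parses a Source document into a typed object. The class name `Source`, the function `parse_source`, and the choice of which fields are required are assumptions made for illustration, not part of Infinit.e.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Source:
    """Typed view of a Source document (field subset chosen for illustration)."""
    id: str
    title: str
    url: str
    extracttype: str
    mediatype: str = ""
    tags: list = field(default_factory=list)


def parse_source(raw: str) -> Source:
    """Parse a JSON Source document; raises KeyError if an assumed-required field is absent."""
    data = json.loads(raw)
    return Source(
        id=data["_id"],
        title=data["title"],
        url=data["url"],
        extracttype=data["extracttype"],
        mediatype=data.get("mediatype", ""),
        tags=list(data.get("tags", [])),
    )


# Values copied from the example Source document on the slide.
example = """
{
  "_id": "4cbdb9f05ed98e7bed499270",
  "title": "Wired: Top News",
  "url": "http://feeds.wired/wired/index",
  "extracttype": "Feed",
  "mediatype": "News",
  "tags": ["technology", "news"]
}
"""

source = parse_source(example)
print(source.title, source.extracttype, source.tags)
```

A harvester could then branch on `source.extracttype` to pick the fetch strategy; how Infinit.e actually dispatches is not described in the slides.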

Harvesting - Metadata Extraction
Infinit.e does not store the original document.
Infinit.e extracts the metadata associated with the original document and creates a Document POJO.
Full text can be stored in gzip format within a MongoDB collection.
Note: The Infinit.e harvester uses the Apache Tika toolkit to extract metadata and text from a wide variety of file formats.
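To illustrate the gzip point, here is a minimal Python sketch of compressing extracted text before storage and restoring it on read. The function names are hypothetical, and the slides do not say how Infinit.e lays out the compressed text inside its MongoDB collection, so no storage call is shown.

```python
import gzip


def compress_text(text: str) -> bytes:
    """Encode text as UTF-8 and gzip it, producing bytes suitable for a binary field."""
    return gzip.compress(text.encode("utf-8"))


def decompress_text(blob: bytes) -> str:
    """Reverse compress_text: gunzip and decode back to a string."""
    return gzip.decompress(blob).decode("utf-8")


# Made-up, repetitive sample text so the size reduction is visible.
full_text = "The conference focuses on the countries of Egypt and Tunisia. " * 50
blob = compress_text(full_text)
restored = decompress_text(blob)

print(len(full_text.encode("utf-8")), "bytes before,", len(blob), "bytes after")
```

The round trip is lossless, so storing only the compressed bytes alongside the extracted metadata does not lose any of the harvested text.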

Harvesting - doc_metadata
{
  "_id" : ObjectId("4f93638e0cf212156d0559d2"),
  "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia...",
  "url" : "http://www.pressreleasebureau/mediterraneanconference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html",
  "description" : "Report by egyptlastminute CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia; the most...",
  "created" : ISODate("2012-04-22T01:49:02Z"),
  "metadata" : { },
  "associations" : [ ],
  "entities" : [ ],
  ...
}

Harvesting - metadata
{
  ...
  "metadata" : {
    "location" : [
      {
        "region" : "South Asia",
        "citystateprovince" : { "stateprovince" : "Rolpa", "city" : "Newang" },
        "country" : "Nepal"
      }
    ],
    "icn" : [ "200573487" ],
    "incidentdate" : [ "07/25/2005" ],
    "organization" : [ "Communist Party of Nepal (Maoist)/United People's Front" ],
    ...
  },
  ...
}

Enrichment - What is it?
Data enrichment is:
The extraction of entities (people, places, things) and associations (relationships, events, facts) from unstructured data using Natural Language Processing (NLP) libraries
Extracting entities and associations from structured data sources
Applying geo-tags to entities and associations

Enrichment - Libraries
The Infinit.e platform ships with several enrichment libraries including:
Structured Analysis Handler: extracts entities, creates associations, and geo-tags data from databases and other structured source documents like XML
Unstructured Analysis Handler: uses regular expressions, JavaScript, or XPath to extract entities and associations
TextRank-based keyphrase extractor: extracts entities (keywords or phrases) from text using the TextRank algorithm and OpenNLP
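The regular-expression route can be sketched in a few lines of Python. This is not the Unstructured Analysis Handler itself: the pattern, the `EmailAddress` type, and the function name are assumptions for illustration, while the `disambiguated_name`, `type`, and `dimension` keys are borrowed from the entity samples elsewhere in this deck.

```python
import re

# Hypothetical extraction rule: treat e-mail addresses as entities.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_entities(text: str) -> list:
    """Return one entity dict per distinct match, in order of first appearance."""
    seen = set()
    entities = []
    for match in EMAIL_PATTERN.finditer(text):
        name = match.group(0).lower()  # normalise case so duplicates collapse
        if name in seen:
            continue
        seen.add(name)
        entities.append({
            "disambiguated_name": name,
            "type": "EmailAddress",   # assumed type label
            "dimension": "Who",
        })
    return entities


text = "Contact Alice@example.org or bob@example.com; alice@example.org replies fastest."
entities = extract_entities(text)
for entity in entities:
    print(entity["disambiguated_name"])
```

Lower-casing before de-duplication is a simple stand-in for the alias handling that the entity feature records suggest the platform performs.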

Enrichment - Structured
Sample Structured Analysis Source
{
  "_id": "50366595e4b0bb23272794b7",
  "communityids": ["503663b1e4b0bb23272794b4"],
  "created": "Aug 23, 2012 1:17:09 PM",
  "description": "NCTC Wits Data",
  ...
  "structuredanalysis": {
    "entities": [
      {
        "dimension": "Who",
        "disambiguated_name": "$characteristic from $nationality",
        "iterateover": "perpetrator",
        "type": "PersonPerpetrator",
        "usedocgeo": false
      }
      ...
    ]
  },
  ...
}
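One plausible reading of the sample is that `iterateover` names a metadata array and that the `$field` placeholders in `disambiguated_name` are filled from each record in that array. The Python sketch below works under that assumption; the slides do not confirm the substitution rules, and the function name and the sample metadata record are invented for illustration.

```python
from string import Template


def build_entities(metadata: dict, spec: dict) -> list:
    """Create one entity per record in metadata[spec['iterateover']] (assumed semantics)."""
    name_template = Template(spec["disambiguated_name"])
    entities = []
    for record in metadata.get(spec["iterateover"], []):
        entities.append({
            # safe_substitute leaves unknown $placeholders untouched instead of raising
            "disambiguated_name": name_template.safe_substitute(record),
            "type": spec["type"],
            "dimension": spec["dimension"],
        })
    return entities


# Entity specification copied from the sample source.
spec = {
    "dimension": "Who",
    "disambiguated_name": "$characteristic from $nationality",
    "iterateover": "perpetrator",
    "type": "PersonPerpetrator",
}

# Made-up metadata record; real field values are not shown in the slides.
metadata = {"perpetrator": [{"characteristic": "Unknown", "nationality": "Nepal"}]}

entities = build_entities(metadata, spec)
print(entities[0]["disambiguated_name"])
```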

Enrichment - 3rd Party Libraries
Infinit.e comes with built-in support for several 3rd-party enrichment tools including:

Enrichment - Entities
Feature.entity
{
  "_id" : ObjectId("4f9189d48baf188282a1c9ef"),
  "alias" : [ "Zine el Abidine Ben Ali", "Zine El Abidine Ben Ali", "Zine el Abidine ben Ali" ],
  "batch_resync" : true,
  "communityid" : ObjectId("4f8f138103644ee8003bf518"),
  "db_sync_doccount" : NumberLong(143),
  "db_sync_time" : "1338751174988",
  "dimension" : "Who",
  "disambiguated_name" : "Zine El Abidine Ben Ali",
  "doccount" : 152,
  "index" : "zine el abidine ben ali/person",
  "totalfreq" : 353,
  "type" : "Person"
}

Enrichment - Associations
Feature.association
{
  "_id" : ObjectId("4f9189d48baf188282a1ca24"),
  "assoc_type" : "Fact",
  "communityid" : ObjectId("4f8f138103644ee8003bf518"),
  "db_sync_doccount" : NumberLong(70),
  "db_sync_time" : "1338491609281",
  "doccount" : NumberLong(73),
  "entity1" : [ "zine el abidine ben ali", "zine el abidine ben ali/person" ],
  "entity1_index" : "zine el abidine ben ali/person",
  "entity2" : [ "president", "president/position" ],
  "entity2_index" : "president/position",
  "index" : "5e3fff27ddb78d6873ccfc77cf05c52f",
  "verb" : [ "career", "current", "past" ],
  "verb_category" : "career"
}

Enrichment - Geolocation
Feature.geo
{
  "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"),
  "search_field" : "cairo",
  "country" : "Egypt",
  "country_code" : "EG",
  "city" : "cairo",
  "region" : "Al Qahirah",
  "region_code" : "EG11",
  "population" : 7734602,
  "latitude" : "30.05",
  "longitude" : "31.25",
  "geoindex" : { "lat" : 30.05, "lon" : 31.25 }
}
Note: MongoDB 2d Index
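A geo-tagging step could use records like this as a gazetteer keyed on `search_field`. The Python sketch below does the lookup against an in-memory dictionary; the dictionary, the `geotag` function, and the normalisation rule are assumptions for illustration, whereas the real platform queries MongoDB.

```python
# In-memory stand-in for the Feature.geo collection; values copied from the sample record.
GAZETTEER = {
    "cairo": {
        "country": "Egypt",
        "country_code": "EG",
        "geoindex": {"lat": 30.05, "lon": 31.25},
    },
}


def geotag(place_name: str):
    """Return a copy of the lat/lon pair for a place name, or None if it is unknown."""
    record = GAZETTEER.get(place_name.strip().lower())  # assumed normalisation
    if record is None:
        return None
    return dict(record["geoindex"])


print(geotag("Cairo"))
print(geotag("Atlantis"))
```

Keeping `geoindex` as numeric `lat`/`lon` values, separate from the string `latitude`/`longitude` fields, matches the shape that the note about a 2d index implies.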

Retrieval - Indexing
Infinit.e uses elasticsearch to index the document, entity, and association data stored in MongoDB.
Document, entity, and association data is searchable via Lucene queries.
The fields indexed by elasticsearch can be configured.

Retrieval - RESTful Interface
Infinit.e exposes its API via a RESTful interface.
Infinit.e.api.server uses the Restlet API framework.
Example HTTP GET API calls:
http://localhost/api/auth/login/user@ikanow/2f7nrslrbgcqozepmjclexmk5vrv
http://localhost/api/community/get/4c927585d591d31d7b37097a
http://localhost/api/person/get/user@ikanow
http://localhost/api/knowledge/document/get/4cc0ebff97622e5914a70e83
http://localhost/api/auth/logout
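A client mostly needs to assemble paths of this shape. The Python sketch below builds the same URLs from their parts without sending any request. The helper names are hypothetical, and the meaning of the last login segment is not stated in the slides, so it is simply called `credential` here.

```python
from urllib.parse import quote

BASE_URL = "http://localhost/api"  # host taken from the example calls


def api_url(*segments: str) -> str:
    """Join path segments onto the base URL, percent-encoding each one ('@' kept as-is)."""
    encoded = [quote(segment, safe="@") for segment in segments]
    return "/".join([BASE_URL, *encoded])


def login_url(user: str, credential: str) -> str:
    return api_url("auth", "login", user, credential)


def document_url(doc_id: str) -> str:
    return api_url("knowledge", "document", "get", doc_id)


def logout_url() -> str:
    return api_url("auth", "logout")


print(login_url("user@ikanow", "2f7nrslrbgcqozepmjclexmk5vrv"))
print(document_url("4cc0ebff97622e5914a70e83"))
print(logout_url())
```

Encoding each segment separately keeps a stray `/` or space in an identifier from changing the route.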

Analysis - What's Built In
The Infinit.e platform ships with built-in algorithms that calculate the following for entities:
Significance
Entity: term frequency-inverse document frequency (TF-IDF)
Document: sum of entity significance
Coverage: percentage of the documents returned by a query in which an entity appears
Frequency: number of occurrences in the dataset returned by a query
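The three measures can be sketched in Python as follows. The slides do not give Infinit.e's exact weighting, so this uses the textbook form tf × ln(N / df) as an assumption; the function names, the input layout, and the sample numbers are likewise illustrative.

```python
import math
from collections import Counter


def entity_stats(result_docs, corpus_size, corpus_doc_freq):
    """Compute frequency, coverage (%), and TF-IDF significance per entity.

    result_docs: one list of entity names per document returned by a query.
    corpus_size: total number of documents in the corpus (N).
    corpus_doc_freq: number of corpus documents containing each entity (df).
    """
    frequency = Counter()
    docs_containing = Counter()
    for entities in result_docs:
        frequency.update(entities)             # every occurrence counts
        docs_containing.update(set(entities))  # each document counts once
    stats = {}
    for entity, freq in frequency.items():
        idf = math.log(corpus_size / corpus_doc_freq[entity])
        stats[entity] = {
            "frequency": freq,
            "coverage": 100.0 * docs_containing[entity] / len(result_docs),
            "significance": freq * idf,
        }
    return stats


def document_significance(entities, stats):
    """Document significance as the sum of the significance of its distinct entities."""
    return sum(stats[entity]["significance"] for entity in set(entities))


# Made-up query result and corpus counts.
result_docs = [["cairo", "egypt", "cairo"], ["egypt"]]
stats = entity_stats(result_docs, corpus_size=100, corpus_doc_freq={"cairo": 10, "egypt": 50})

for name, values in stats.items():
    print(name, values)
print(document_significance(result_docs[0], stats))
```

In this toy data both entities occur twice, but "cairo" is rarer across the corpus, so it scores as more significant even though "egypt" has the higher coverage.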

Analysis - Hadoop MapReduce
The Infinit.e platform has built-in integration with Apache's Hadoop MapReduce framework.

Analysis - Hadoop MapReduce
Configuration Options
Job schedule
Custom MongoDB query
Mapper/combiner/reducer classes
Output key and value types
Whether or not to append results to existing data sets
Data age-out in number of days
Job dependencies
User arguments
Reuse existing MapReduce jar
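To show what the mapper and reducer classes contribute, here is an in-process Python sketch of the map, group, and reduce flow that counts documents per entity. Real Infinit.e jobs are Hadoop classes packaged in a jar; this only mimics the data flow, and the assumption that each document carries an `entities` array with an `index` field is illustrative.

```python
from collections import defaultdict


def mapper(doc):
    """Emit (entity index, 1) for every entity attached to a document."""
    for entity in doc.get("entities", []):
        yield entity["index"], 1


def reducer(key, values):
    """Sum the counts emitted for one entity index."""
    return key, sum(values)


def run_job(docs):
    """Group mapper output by key (the shuffle step), then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


# Made-up documents; index strings reuse the format shown in the entity samples.
docs = [
    {"entities": [{"index": "zine el abidine ben ali/person"},
                  {"index": "president/position"}]},
    {"entities": [{"index": "zine el abidine ben ali/person"}]},
    {},  # a document with no extracted entities
]

counts = run_job(docs)
print(counts)
```

Because summing is associative, the same `reducer` logic could also serve as the combiner listed among the configuration options.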

Visualization
Infinit.e includes an Adobe Flex-based application with a set of default visualization widgets.

Use Case
The HTS Problem:
HTS had a massive amount of unstructured data locked up in thousands of documents with no way to get at it economically.
Highly skilled analysts had to read each document and manually extract the information into an Excel spreadsheet that was used to catalog the contents by Topics.

Use Case
The Infinit.e Solution:
Harvest the documents using Infinit.e
Extract entities from the harvested documents (who, what, where)
Assign one or more Topics to each document based on the entities extracted (i.e., clustering)

Questions?
Thank you!
Craig Vitter
Professional Services Engineer
cvitter@ikanow
http://meetup/infinit-e-user-group-dc/
http://github/ikanow/infinit.e
www.ikanow
@ikanowdata
703.454.9029