Enabling Agile Intelligence through Open Analytics
The Open Source Knowledge Discovery and Document Analysis Platform
17/10/2012
Agenda
- Introduction and Agenda
- Problem Definition
- Knowledge Discovery
- Document Analysis
- The Infinit.e Solution
- Architecture
- Use Cases
- Questions
The Problem http://techbuddha.wordpress/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/
Knowledge Discovery
Knowledge Discovery is the process of indexing and categorizing the contents of a corpus of data sources in order to identify what is contained in those sources and how to retrieve it.
- What information do we have?
- Where is the information located?
Document Analysis
Document Analysis is the process of analyzing the contents of a large number of documents in order to answer questions related to the content of those documents.
- What kind of questions can we answer with our data?
- What kind of enrichment can we apply to our data to improve our ability to answer organizational questions?
The Infinit.e Solution
Infinit.e is an Open Source Knowledge Discovery and Document Analysis platform that harvests, enriches, stores, retrieves, analyzes, and visualizes structured and unstructured documents.
The Architecture
[Architecture diagram. Labels: External Applications & GUIs; RSS, XML, HTML, TXT, PDF, JDBC, etc.; REST-based API; Core Server; Enrichment; elasticsearch; MongoDB; Hadoop; Linux; JSON, RSS, KML, GraphML, etc.]
Storage
Infinit.e uses MongoDB for the following reasons:
- Document-oriented storage
- Horizontal and vertical scalability
The infinit.e.data_model library:
- Manages connections to MongoDB
- Converts JSON (BSON) to POJOs using Google's GSON library
Harvesting - Server
The infinit.e.core.server library manages the process of harvesting and cleansing documents:
    service infinite-px-engine start
- Configurable for timing and number of documents to harvest per cycle
Note: Migrating to the Apache UIMA framework is on our to-do list.
Harvesting - Document Types
The Infinit.e platform can harvest documents from:
- URLs: RSS, HTML, etc.
- File shares: Samba, Windows shares, and local files
- Databases: via JDBC
Harvesting - Sources
Infinit.e harvests documents based on configuration information contained in Source documents like the following example:
{
  "_id": "4cbdb9f05ed98e7bed499270",
  "title": "Wired: Top News",
  "url": "http://feeds.wired/wired/index",
  "created": "Oct 19, 2010 11:32:00 AM",
  "description": "Top News",
  "extracttype": "Feed",
  "mediatype": "News",
  "modified": "Oct 19, 2010 11:32:00 AM",
  "tags": ["technology", "news"]
}
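A Source document like the one above is plain JSON, so it can be loaded into a typed object before use. The following is a minimal Python sketch of that idea, not the platform's own loader: the `SourceConfig` class, the `parse_source` helper, and the choice of which fields are required are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

# Timestamp layout seen in the "created"/"modified" values on the slide.
TIMESTAMP_FORMAT = "%b %d, %Y %I:%M:%S %p"

# Assumed minimum set of fields; the real platform may require others.
REQUIRED_FIELDS = ("_id", "title", "url", "extracttype", "created")


@dataclass
class SourceConfig:
    """Hypothetical in-memory view of a Source document (not the real POJO)."""
    id: str
    title: str
    url: str
    extracttype: str
    created: datetime
    tags: List[str] = field(default_factory=list)


def parse_source(text: str) -> SourceConfig:
    """Parse a Source JSON string and fail early if a required field is absent."""
    raw = json.loads(text)
    missing = [name for name in REQUIRED_FIELDS if name not in raw]
    if missing:
        raise ValueError(f"source is missing fields: {missing}")
    return SourceConfig(
        id=raw["_id"],
        title=raw["title"],
        url=raw["url"],
        extracttype=raw["extracttype"],
        created=datetime.strptime(raw["created"], TIMESTAMP_FORMAT),
        tags=list(raw.get("tags", [])),
    )


SOURCE_JSON = """
{
  "_id": "4cbdb9f05ed98e7bed499270",
  "title": "Wired: Top News",
  "url": "http://feeds.wired/wired/index",
  "created": "Oct 19, 2010 11:32:00 AM",
  "description": "Top News",
  "extracttype": "Feed",
  "mediatype": "News",
  "modified": "Oct 19, 2010 11:32:00 AM",
  "tags": ["technology", "news"]
}
"""

source = parse_source(SOURCE_JSON)
```

Validating up front keeps a malformed Source from reaching the harvest cycle; which fields are truly mandatory would need to be checked against the platform documentation.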
Harvesting - Metadata Extraction
- Infinit.e does not store the original document.
- Infinit.e extracts the metadata associated with the original document and creates a Document POJO.
- Full text can be stored in gzip format within a MongoDB collection.
Note: The Infinit.e harvester uses the Apache Tika toolkit to extract metadata and text from a wide variety of file formats.
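Storing full text in gzip format amounts to compressing the extracted text into a binary blob and decompressing it on read. The sketch below shows that round trip with Python's standard `gzip` module. The helper names are hypothetical, and how Infinit.e actually lays out the compressed field in MongoDB is not shown in the slides.

```python
import gzip


def pack_full_text(text: str) -> bytes:
    """Compress document text into gzip bytes suitable for a binary field."""
    return gzip.compress(text.encode("utf-8"))


def unpack_full_text(blob: bytes) -> str:
    """Reverse of pack_full_text: decompress and decode back to a string."""
    return gzip.decompress(blob).decode("utf-8")


# Repetitive sample text, used only to make the size reduction visible.
sample = "On Monday, the countries of the Mediterranean opened a conference. " * 50
blob = pack_full_text(sample)
restored = unpack_full_text(blob)
```

Natural-language text usually compresses well, which is why keeping only metadata plus a gzip copy of the text can be much cheaper than keeping the original files.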
Harvesting - doc_metadata
{
  "_id" : ObjectId("4f93638e0cf212156d0559d2"),
  "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia...",
  "url" : "http://www.pressreleasebureau/mediterraneanconference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html",
  "description" : "Report by egyptlastminute CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia; the most...",
  "created" : ISODate("2012-04-22T01:49:02Z"),
  "metadata" : { },
  "associations" : [ ],
  "entities" : [ ],
  ...
}
Harvesting - metadata
{
  ...
  "metadata" : {
    "location" : [
      {
        "region" : "South Asia",
        "citystateprovince" : { "stateprovince" : "Rolpa", "city" : "Newang" },
        "country" : "Nepal"
      }
    ],
    "icn" : [ "200573487" ],
    "incidentdate" : [ "07/25/2005" ],
    "organization" : [ "Communist Party of Nepal (Maoist)/United People's Front" ],
    ...
  },
  ...
}
Enrichment - What is it?
Data enrichment is:
- The extraction of entities (people, places, things) and associations (relationships, events, facts) from unstructured data using Natural Language Processing (NLP) libraries
- Extracting entities and associations from structured data sources
- Applying geo-tags to entities and associations
Enrichment - Libraries
The Infinit.e platform ships with several enrichment libraries, including:
- Structured Analysis Handler: extracts entities, creates associations, and geo-tags data from databases and other structured source documents such as XML
- Unstructured Analysis Handler: uses regular expressions, JavaScript, or XPath to extract entities and associations
- TextRank-based keyphrase extractor: extracts entities (keywords or phrases) from text using the TextRank algorithm and OpenNLP
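To make the regular-expression style of extraction concrete, here is a small Python sketch of a rule that turns a labelled number in free text into an entity record. It is only an illustration of the technique: the pattern, the `IncidentNumber` type label, and the `extract_entities` function are invented for this example and are not the Unstructured Analysis Handler's configuration format.

```python
import re
from typing import Dict, List

# Hypothetical rule: digits that follow the label "ICN" form an entity.
ICN_PATTERN = re.compile(r"\bICN[:\s]+(\d+)\b")


def extract_entities(text: str) -> List[Dict[str, str]]:
    """Return one entity record per regex match found in the text."""
    entities = []
    for match in ICN_PATTERN.finditer(text):
        entities.append({
            "disambiguated_name": match.group(1),
            "type": "IncidentNumber",  # assumed type label
            "dimension": "What",
        })
    return entities


found = extract_entities("Report filed under ICN: 200573487 on 07/25/2005.")
```

Rules like this are precise but brittle, which is presumably why the platform also offers JavaScript, XPath, and NLP-based extractors for less regular text.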
Enrichment - Structured
Sample Structured Analysis Source:
{
  "_id": "50366595e4b0bb23272794b7",
  "communityids": ["503663b1e4b0bb23272794b4"],
  "created": "Aug 23, 2012 1:17:09 PM",
  "description": "NCTC Wits Data",
  ...
  "structuredanalysis": {
    "entities": [
      {
        "dimension": "Who",
        "disambiguated_name": "$characteristic from $nationality",
        "iterateover": "perpetrator",
        "type": "PersonPerpetrator",
        "usedocgeo": false
      }
      ...
    ]
  },
  ...
}
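One plausible reading of the entity specification above is that "iterateover" names a repeated metadata field and that "$characteristic" and "$nationality" are placeholders filled from each record of that field. The Python sketch below works under that assumption; it uses the standard library's `string.Template`, which happens to share the `$name` syntax, and the sample metadata values are invented.

```python
from string import Template
from typing import Any, Dict, List


def build_entities(spec: Dict[str, Any], metadata: Dict[str, Any]) -> List[Dict[str, str]]:
    """Create one entity per record found under the spec's 'iterateover' field."""
    template = Template(spec["disambiguated_name"])
    entities = []
    for record in metadata.get(spec["iterateover"], []):
        # safe_substitute leaves unknown $placeholders in place instead of raising.
        name = template.safe_substitute(record)
        entities.append({
            "disambiguated_name": name,
            "type": spec["type"],
            "dimension": spec["dimension"],
        })
    return entities


spec = {
    "dimension": "Who",
    "disambiguated_name": "$characteristic from $nationality",
    "iterateover": "perpetrator",
    "type": "PersonPerpetrator",
}

# Invented sample record; real field values come from the harvested source.
metadata = {"perpetrator": [{"characteristic": "Unknown", "nationality": "Nepal"}]}

entities = build_entities(spec, metadata)
```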
Enrichment - 3rd Party Libraries
Infinit.e comes with built-in support for several 3rd party enrichment tools including:
Enrichment - Entities
Feature.entity:
{
  "_id" : ObjectId("4f9189d48baf188282a1c9ef"),
  "alias" : [
    "Zine el Abidine Ben Ali",
    "Zine El Abidine Ben Ali",
    "Zine el Abidine ben Ali"
  ],
  "batch_resync" : true,
  "communityid" : ObjectId("4f8f138103644ee8003bf518"),
  "db_sync_doccount" : NumberLong(143),
  "db_sync_time" : "1338751174988",
  "dimension" : "Who",
  "disambiguated_name" : "Zine El Abidine Ben Ali",
  "doccount" : 152,
  "index" : "zine el abidine ben ali/person",
  "totalfreq" : 353,
  "type" : "Person"
}
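In the sample record, the "index" value looks like the lower-cased disambiguated name joined to the lower-cased type with a slash. The Python sketch below encodes that observation and uses it to map spelling variants to one entity. The rule is inferred from this single example and the function names are hypothetical, so treat it as an illustration and not as the platform's actual indexing logic.

```python
from typing import Any, Dict


def entity_index(disambiguated_name: str, entity_type: str) -> str:
    """Build an index key in the 'name/type' form seen in the sample record."""
    return f"{disambiguated_name.lower()}/{entity_type.lower()}"


def build_alias_lookup(feature: Dict[str, Any]) -> Dict[str, str]:
    """Map each lower-cased alias to the entity's index key."""
    return {alias.lower(): feature["index"] for alias in feature["alias"]}


feature = {
    "alias": [
        "Zine el Abidine Ben Ali",
        "Zine El Abidine Ben Ali",
        "Zine el Abidine ben Ali",
    ],
    "disambiguated_name": "Zine El Abidine Ben Ali",
    "index": "zine el abidine ben ali/person",
    "type": "Person",
}

computed = entity_index(feature["disambiguated_name"], feature["type"])
lookup = build_alias_lookup(feature)
```

Because the three aliases differ only in capitalization, they collapse to a single lookup key that resolves to the same entity.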
Enrichment - Associations
Feature.association:
{
  "_id" : ObjectId("4f9189d48baf188282a1ca24"),
  "assoc_type" : "Fact",
  "communityid" : ObjectId("4f8f138103644ee8003bf518"),
  "db_sync_doccount" : NumberLong(70),
  "db_sync_time" : "1338491609281",
  "doccount" : NumberLong(73),
  "entity1" : [ "zine el abidine ben ali", "zine el abidine ben ali/person" ],
  "entity1_index" : "zine el abidine ben ali/person",
  "entity2" : [ "president", "president/position" ],
  "entity2_index" : "president/position",
  "index" : "5e3fff27ddb78d6873ccfc77cf05c52f",
  "verb" : [ "career", "current", "past" ],
  "verb_category" : "career"
}
Enrichment - Geolocation
Feature.geo:
{
  "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"),
  "search_field" : "cairo",
  "country" : "Egypt",
  "country_code" : "EG",
  "city" : "cairo",
  "region" : "Al Qahirah",
  "region_code" : "EG11",
  "population" : 7734602,
  "latitude" : "30.05",
  "longitude" : "31.25",
  "geoindex" : { "lat" : 30.05, "lon" : 31.25 }
}
Note: MongoDB 2d index
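The record carries its coordinates twice: as strings in "latitude" and "longitude", and as numbers in "geoindex". The Python sketch below derives the numeric pair from the string fields and applies a simple bounding-box test, the kind of question a 2d index is meant to answer quickly. It is a plain in-memory illustration with hypothetical function names and invented box corners; it does not query MongoDB.

```python
from typing import Any, Dict


def to_geoindex(feature: Dict[str, Any]) -> Dict[str, float]:
    """Convert the string latitude/longitude fields into a numeric lat/lon pair."""
    return {"lat": float(feature["latitude"]), "lon": float(feature["longitude"])}


def within_box(point: Dict[str, float],
               south_west: Dict[str, float],
               north_east: Dict[str, float]) -> bool:
    """Return True if the point lies inside the box defined by its two corners."""
    return (south_west["lat"] <= point["lat"] <= north_east["lat"]
            and south_west["lon"] <= point["lon"] <= north_east["lon"])


cairo = {"city": "cairo", "latitude": "30.05", "longitude": "31.25"}
point = to_geoindex(cairo)

# Invented box corners, chosen only so that one box contains the point.
inside = within_box(point, {"lat": 29.0, "lon": 30.0}, {"lat": 31.0, "lon": 32.0})
outside = within_box(point, {"lat": 31.0, "lon": 29.0}, {"lat": 32.0, "lon": 30.0})
```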
Retrieval - Indexing
- Infinit.e uses elasticsearch to index the document, entity, and association data stored in MongoDB.
- Document, entity, and association data is searchable via Lucene queries.
- The fields indexed by elasticsearch can be configured.
Retrieval - RESTful Interface
- Infinit.e exposes its API via a RESTful interface.
- infinit.e.api.server uses the Restlet API framework.
Example HTTP GET API calls:
    http://localhost/api/auth/login/user@ikanow/2f7nrslrbgcqozepmjclexmk5vrv
    http://localhost/api/community/get/4c927585d591d31d7b37097a
    http://localhost/api/person/get/user@ikanow
    http://localhost/api/knowledge/document/get/4cc0ebff97622e5914a70e83
    http://localhost/api/auth/logout
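The example calls share one shape: a base address followed by slash-separated path segments. A client could assemble them with a small helper such as the Python sketch below. The `api_url` function is hypothetical and only builds the strings shown on the slide; it sends no requests and makes no claim about the responses or about any parameters beyond those in the examples.

```python
from urllib.parse import quote

# Base address taken from the slide's examples.
BASE = "http://localhost/api"


def api_url(*segments: str) -> str:
    """Join path segments onto the base URL, percent-encoding each one.

    '@' is kept literal because the slide's URLs show it unencoded.
    """
    return BASE + "/" + "/".join(quote(segment, safe="@") for segment in segments)


login = api_url("auth", "login", "user@ikanow", "2f7nrslrbgcqozepmjclexmk5vrv")
community = api_url("community", "get", "4c927585d591d31d7b37097a")
document = api_url("knowledge", "document", "get", "4cc0ebff97622e5914a70e83")
logout = api_url("auth", "logout")
```

Encoding each segment separately keeps an unexpected character in an id or user name from changing the path structure.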
Analysis - What's Built In
The Infinit.e platform ships with built-in algorithms that calculate the following for entities:
- Significance
  - Entity: term frequency-inverse document frequency (TF-IDF)
  - Document: sum of entity significance
- Coverage: percentage of documents in the dataset returned by a query in which an entity appears
- Frequency: number of occurrences in the dataset returned by a query
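The three measures can be illustrated with the textbook definitions. The Python sketch below treats each document as a list of entity index keys and computes frequency, coverage, and a standard TF-IDF score over a tiny invented result set. The exact weighting Infinit.e uses is not given in the slides, so the formulas and sample data here are assumptions.

```python
import math
from typing import Dict, List


def frequency(docs: List[List[str]], entity: str) -> int:
    """Total number of occurrences of the entity across the result set."""
    return sum(doc.count(entity) for doc in docs)


def coverage(docs: List[List[str]], entity: str) -> float:
    """Percentage of documents in the result set that mention the entity."""
    return 100.0 * sum(1 for doc in docs if entity in doc) / len(docs)


def doc_counts(docs: List[List[str]]) -> Dict[str, int]:
    """Number of documents each entity appears in."""
    counts: Dict[str, int] = {}
    for doc in docs:
        for entity in set(doc):
            counts[entity] = counts.get(entity, 0) + 1
    return counts


def entity_significance(doc: List[str], entity: str, counts: Dict[str, int], total: int) -> float:
    """Textbook TF-IDF: share of the document times log inverse document frequency."""
    if not doc or entity not in counts:
        return 0.0
    tf = doc.count(entity) / len(doc)
    idf = math.log(total / counts[entity])
    return tf * idf


def document_significance(doc: List[str], counts: Dict[str, int], total: int) -> float:
    """Sum of the significance of every distinct entity in the document."""
    return sum(entity_significance(doc, e, counts, total) for e in set(doc))


# Invented result set: each inner list holds the entity keys found in one document.
docs = [
    ["cairo/city", "cairo/city", "egypt/country"],
    ["egypt/country"],
    ["tunisia/country", "egypt/country"],
    ["cairo/city"],
]
counts = doc_counts(docs)
cairo_score = entity_significance(docs[0], "cairo/city", counts, len(docs))
first_doc_score = document_significance(docs[0], counts, len(docs))
```

An entity that appears in every document gets an inverse document frequency of zero under this definition, so common entities contribute nothing to a document's score.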
Analysis - Hadoop MapReduce
The Infinit.e platform has built-in integration with Apache Hadoop's MapReduce framework.
Analysis - Hadoop MapReduce
Configuration options:
- Job schedule
- Custom MongoDB query
- Mapper/combiner/reducer classes
- Output key and value types
- Whether or not to append results to existing data sets
- Data age-out in number of days
- Job dependencies
- User arguments
- Reuse existing MapReduce jar
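The mapper, combiner, and reducer roles named in the options can be illustrated without Hadoop. The Python sketch below counts how often each entity appears across a set of documents using the same three stages in a single process. It is a conceptual analogue only: real jobs are Java classes run by Hadoop, and the assumption that each document's entities carry an "index" field is inferred from the Feature.entity sample.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def mapper(doc: Dict[str, Any]) -> Iterator[Tuple[str, int]]:
    """Emit (entity index, 1) for every entity mention in a document."""
    for entity in doc.get("entities", []):
        yield entity["index"], 1


def reducer(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Sum all counts emitted for one key."""
    return key, sum(values)


def run_job(docs: List[Dict[str, Any]]) -> Dict[str, int]:
    """Run map, a per-document combine, a shuffle by key, and reduce in one process."""
    shuffled: Dict[str, List[int]] = defaultdict(list)
    for doc in docs:
        # Combiner: pre-aggregate one document's pairs before the shuffle.
        local: Dict[str, List[int]] = defaultdict(list)
        for key, value in mapper(doc):
            local[key].append(value)
        for key, values in local.items():
            shuffled[key].append(reducer(key, values)[1])
    return dict(reducer(key, values) for key, values in shuffled.items())


# Invented documents holding only the field the mapper reads.
docs = [
    {"entities": [{"index": "cairo/city"}, {"index": "egypt/country"}, {"index": "cairo/city"}]},
    {"entities": [{"index": "egypt/country"}]},
    {"entities": []},
]
totals = run_job(docs)
```

Reusing the reducer as the combiner works here because summation is associative and commutative; that is the usual condition for a combiner to be safe.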
Visualization
Infinit.e includes an Adobe Flex-based application with a set of default visualization widgets.
Use Case
The HTS Problem:
- HTS had a massive amount of unstructured data locked up in thousands of documents, with no economical way to get at it.
- Highly skilled analysts had to read each document and manually extract the information into an Excel spreadsheet that was used to catalog the contents by Topics.
Use Case
The Infinit.e Solution:
- Harvest the documents using Infinit.e
- Extract entities from the harvested documents (who, what, where)
- Assign one or more Topics to each document based on the entities extracted (i.e. clustering)
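The last step, assigning Topics from extracted entities, can be pictured with a deliberately simple stand-in. The Python sketch below uses a fixed rule table that maps entity keys to topics. This is a lookup and not a clustering algorithm, and the topic names, rules, and entity keys are invented, since the slides do not describe how the HTS topics were actually derived.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical rules: topic name -> entity keys that signal the topic.
TOPIC_RULES: Dict[str, Set[str]] = {
    "Tourism": {"tourism/keyword", "egypt/country"},
    "Politics": {"president/position"},
}


def assign_topics(entity_indexes: Iterable[str],
                  rules: Dict[str, Set[str]] = TOPIC_RULES) -> List[str]:
    """Return every topic with at least one signalling entity in the document."""
    present = set(entity_indexes)
    return sorted(topic for topic, signals in rules.items() if signals & present)


politics_only = assign_topics(["zine el abidine ben ali/person", "president/position"])
both = assign_topics(["tourism/keyword", "president/position"])
none = assign_topics(["cairo/city"])
```

A document can receive several topics, which matches the "one or more Topics" wording on the slide.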
Questions? Thank you! Craig Vitter Professional Services Engineer cvitter@ikanow http://meetup/infinit-e-user-group-dc/ http://github/ikanow/infinit.e www.ikanow @ikanowdata 703.454.9029