Information Retrieval Elasticsearch
IR
Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing. (Source: Wikipedia)
Concepts
- term (t): a noun or compound word used in a specific context
- tf (t in d): term frequency; the number of times term t appears in the currently scored document d
- idf (t): inverse document frequency; measures how common or rare the term is across the index. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.
Concepts
tf-idf (term frequency-inverse document frequency) is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by how often the word appears across the corpus, which adjusts for the fact that some words are simply more common in general.
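Concepts
Then tf-idf is calculated as
\[ \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t) \]
and
\[ \mathrm{idf}(t) = \log \frac{N}{|\{d : t \in d\}|} \]
where N is the total number of documents in the index and |{d : t ∈ d}| is the number of documents containing the term t.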
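The definitions of tf and idf given in the Concepts slides can be sketched in a few lines of Python. This is a minimal illustration of the textbook form (raw count times log of the document ratio), not the scoring formula of any particular search library; the toy `corpus` and the function names are made up for the example.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Raw term frequency: how many times `term` occurs in the document."""
    return Counter(doc_tokens)[term]


def idf(term, corpus):
    """log(N / number of documents containing `term`).

    Returns 0.0 when no document contains the term, to avoid dividing by zero
    (a choice made for this sketch only).
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        return 0.0
    return math.log(len(corpus) / n_containing)


def tf_idf(term, doc_tokens, corpus):
    """Weight of `term` in one document of the corpus."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical three-document corpus, already tokenized.
corpus = [
    ["search", "engine", "index"],
    ["search", "search", "query"],
    ["log", "event", "index"],
]

score_common = tf_idf("search", corpus[1], corpus)  # frequent here, but in 2 of 3 docs
score_rare = tf_idf("log", corpus[2], corpus)       # appears once, in 1 of 3 docs
```

A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to the score regardless of how often it appears.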
Apache Lucene
- Fast, high-performance, scalable search/IR library
- Open source
- Indexing and searching
- Inverted index of documents
- Provides advanced search options such as synonyms, stopwords, similarity-based search and proximity search
http://lucene.apache.org/
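To make the "inverted index of documents" idea concrete, here is a rough Python sketch: each term maps to the list of document ids that contain it, and a multi-term query is answered by intersecting those lists. It assumes whitespace tokenization and AND semantics, and it is not how Lucene stores its index on disk; `build_inverted_index`, `search_all` and the sample `docs` are hypothetical names for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all(index, query):
    """Return ids of documents that contain every term of the query."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical documents keyed by id.
docs = {
    1: "Lucene is a search library",
    2: "Elasticsearch builds on Lucene",
    3: "Kibana draws charts",
}
index = build_inverted_index(docs)
hits = search_all(index, "lucene search")
```

Looking up a term is a dictionary access instead of a scan over every document, which is the main reason the structure is used for full-text search.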
Lucene Internal Representation Credit: https://developer.apple.com/library/mac/documentation/userexperience/conceptual/searchkitconcepts/searchkit_basics/searchkit_basics.html
Lucene
- Based on a document model
- An index contains documents
- Each document consists of fields
- Each field has attributes:
  - field data type (FieldType)
  - how the content is handled (Analyzers, Filters)
  - whether the field is stored (stored="true") or indexed (indexed="true")
Indexing Pipeline
- Analyzer: creates tokens using a Tokenizer and/or by applying Filters (Token Filters)
- Each field can define an Analyzer at index time, at query time, or both
http://www.slideshare.net/otisg/lucene-introduction
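The analyzer described on this slide (a tokenizer followed by a chain of token filters) can be sketched as plain function composition. This is an illustration of the concept in Python, not Lucene's Java API; the stopword list, the regular expression and all function names are assumptions made for the example.

```python
import re

# Tiny illustrative stopword list; real analyzers ship much longer ones.
STOPWORDS = {"a", "an", "the", "is", "of", "and"}


def tokenize(text):
    """Tokenizer: split text into runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Token filter: normalize case so 'Lucene' and 'lucene' match."""
    return [t.lower() for t in tokens]


def stopword_filter(tokens):
    """Token filter: drop very common words that carry little meaning."""
    return [t for t in tokens if t not in STOPWORDS]


def make_analyzer(tokenizer, filters):
    """Build an analyzer that runs the tokenizer, then each filter in order."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


analyze = make_analyzer(tokenize, [lowercase_filter, stopword_filter])
tokens = analyze("The Quick index of Lucene")
```

Filter order matters: the stopword filter here compares against lowercase words, so it has to run after the lowercase filter. Using the same analyzer at index time and at query time keeps the query terms comparable with the indexed terms.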
Elasticsearch - Introduction
- Enterprise search platform built on Apache Lucene
- Open source
- Highly reliable, scalable, fault tolerant
- Supports distributed indexing, replication, and load-balanced querying
http://www.elasticsearch.org/
Elasticsearch - Features
- Distributed (multiple nodes)
- RESTful search server (GET, PUT, POST, DELETE)
- Document oriented, full text (JSON format)
- Schema free
- Easy to scale horizontally
- Real-time analytics
- Multi-tenancy (multiple indexes)
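As an illustration of the RESTful, JSON-based interface, the sketch below builds (but does not send) an index request and a search request with Python's standard library. It assumes a node at `http://localhost:9200`, a hypothetical index called `articles`, and the `_doc` endpoint used by recent Elasticsearch versions; older versions expect a custom type name in that position, so adjust the path to your version.

```python
import json
from urllib.request import Request


def index_request(base_url, index, doc_id, document):
    """Build a PUT request that stores `document` under the given id."""
    return Request(
        url=f"{base_url}/{index}/_doc/{doc_id}",  # path layout is version dependent
        data=json.dumps(document).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def search_request(base_url, index, field, text):
    """Build a full-text `match` query against one field of an index."""
    query = {"query": {"match": {field: text}}}
    return Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(query).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host, index name and document.
base = "http://localhost:9200"
put_req = index_request(base, "articles", 1, {"title": "Intro to IR"})
find_req = search_request(base, "articles", "title", "intro")
```

Passing either object to `urllib.request.urlopen` would send it to a running node. No mapping is declared before indexing, which is what "schema free" refers to: field types are inferred from the JSON document.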
APIs
- HTTP RESTful API
- Java API
- Clients for Perl, Python, PHP, Ruby, .NET and more
- All APIs automatically reroute an operation to the node that should handle it
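Rerouting works because any node can work out which shard owns a document. A rough sketch of the idea, assuming the commonly documented scheme of hashing a routing value (by default the document id) modulo the number of primary shards: Elasticsearch uses its own hash function internally, so the MD5 below is only a stand-in and will not reproduce its real shard assignments.

```python
import hashlib


def pick_shard(routing_key, num_primary_shards):
    """Map a routing key to a shard number in [0, num_primary_shards).

    MD5 is a stand-in for the engine's internal hash (assumption of this sketch).
    """
    digest = hashlib.md5(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_primary_shards


# Hypothetical document ids spread over five primary shards.
assignments = {doc_id: pick_shard(doc_id, 5) for doc_id in ["doc-1", "doc-2", "doc-3"]}
```

Because the result depends on the number of primary shards, changing that number would move existing documents to different shards, which is why it is normally fixed when an index is created.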
Cluster Architecture Source: http://www.slideshare.net/dmitribabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup
Index Request Source: http://www.slideshare.net/dmitribabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup
Search Request Source: http://www.slideshare.net/dmitribabaev1/elastic-search-moscow-bigdata-cassandra-sept-2013-meetup
Logstash - Architecture Source: http://www.infoq.com/articles/review-the-logstash-book
Logstash life of an event
- Input, then Filters, then Output
- Filters are processed in the order they appear in the config file
- Outputs are processed in the order they appear in the config file
- Input: an input stream, for example file input (tail), Log4j, Redis, Syslog and many more
http://logstash.net/docs/1.3.3/
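The life of an event can be mimicked with a small Python loop: each event read from an input passes through the filters in their configured order and is then handed to every output in order. This is a conceptual sketch only, not Logstash's configuration language or implementation; the filter names, the convention that a filter returns `None` to drop an event, and the sample events are all invented for the example.

```python
def add_source(event):
    """Filter: tag the event with where it came from (hypothetical field)."""
    event["source"] = "app.log"
    return event


def drop_debug(event):
    """Filter: discard DEBUG events by returning None."""
    return None if event.get("level") == "DEBUG" else event


def run_pipeline(inputs, filters, outputs):
    """Push every input event through the filters, then to all outputs, in order."""
    for raw in inputs:
        event = dict(raw)  # work on a copy so the input is left untouched
        for event_filter in filters:
            event = event_filter(event)
            if event is None:  # a filter dropped the event
                break
        if event is None:
            continue
        for output in outputs:
            output(event)


collected = []  # stands in for one output, e.g. a search index
messages = []   # stands in for a second output, e.g. stdout

run_pipeline(
    inputs=[
        {"level": "INFO", "message": "started"},
        {"level": "DEBUG", "message": "noise"},
    ],
    filters=[add_source, drop_debug],
    outputs=[collected.append, lambda e: messages.append(e["message"])],
)
```

Swapping the order of the two filters would not change the result here, but in general a later filter sees whatever the earlier ones produced, so the order in the config file is part of the pipeline's behavior.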
Kibana
Source: http://www.slideshare.net/amazeeag/2014-0422-loggingwithlogstashbastianwidmercampusbern
Analytics
Analytics source: Kibana.org, based on Elasticsearch and Logstash
Image source: http://semicomplete.com/presentations/logstash-monitorama-2013/#/8