EPL660: Information Retrieval and Search Engines Lab 7


Παύλος Αντωνίου (Office: B109, ΘΕΕ01)
University of Cyprus, Department of Computer Science

Apache Tika
What is Apache Tika? A content analysis toolkit.
The Apache Tika toolkit detects and extracts metadata and text content from over a thousand different file types.
It is useful for search engine indexing, content analysis, translation, and much more.

Supported Document Formats
Microsoft Excel, Word, PowerPoint, Visio, Outlook
GZIP, bzip2 compression
MP3, MIDI, Wave audio
XML
HTML
Java class files
Images
Java archive files
Plain text
OpenDocument
PDF
RTF
TAR/ZIP
You can also extend Tika with your own parsers!

Getting Started with Apache Tika
Download a source release from: https://tika.apache.org/download.html
Build Tika from sources:
Install the Maven build system:
    $ sudo apt-get install maven
Extract the Tika sources to a folder.
Run the install command from that folder:
    $ mvn install
Note: We need Java 7 or higher to build Tika.

Build Artifacts
The Tika build consists of a number of components and produces the following main binaries:
tika-core/target/tika-core-*.jar: Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 6.
tika-parsers/target/tika-parsers-*.jar: Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.
tika-app/target/tika-app-*.jar: Tika application. Combines the above components and all the external parser libraries into a single runnable jar with a GUI and a command line interface.
tika-server/target/tika-server-*.jar: Tika JAX-RS REST application. This is a Jetty web server running the Tika REST services.
tika-bundle/target/tika-bundle-*.jar: Tika bundle. An OSGi bundle that combines tika-parsers with non-OSGified parser libraries to make them easy to deploy in an OSGi environment.

Command Line Utility

Tika GUI (--gui)

The Parser Interface
    void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException;
Input: the document to be parsed and related metadata.
Output: results as XHTML SAX events, plus extra metadata.

InputStream
First argument of the parse method.
Used for reading the document to be parsed.
The parser implementation will consume this stream but will not close it.
Closing the stream is the responsibility of the client application that opened it in the first place.
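
The ownership rule above can be illustrated without Tika at all. The sketch below is only an illustration using JDK classes: `consume` is a hypothetical stand-in for a parser (it reads the stream to the end and never closes it), and `TrackingStream` is a made-up helper that records whether `close()` was called. The caller closes the stream with try-with-resources.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class StreamOwnershipSketch {
    /** Hypothetical helper: remembers whether close() has been called. */
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Stand-in for a parser: reads every byte but leaves the stream open. */
    static int consume(InputStream stream) throws IOException {
        int count = 0;
        while (stream.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        TrackingStream stream = new TrackingStream("hello".getBytes(StandardCharsets.UTF_8));
        // The client opened the stream, so the client closes it.
        try (InputStream in = stream) {
            System.out.println("bytes consumed: " + consume(in));
            System.out.println("closed by consumer? " + stream.closed);
        }
        System.out.println("closed by client? " + stream.closed);
    }
}
```

With a real file the same pattern applies: open the `FileInputStream` in a try-with-resources block and pass it to the parser inside that block.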

XHTML SAX events
The parsed content of the document stream is returned to the client application as a sequence of XHTML SAX events.
XHTML is used to express the structured content of the document (not to render documents for browsing), and SAX events enable streamed processing.

ContentHandler
Second argument of the parse method.
Receives the XHTML SAX events produced by the parser.
Parser implementations typically use the XHTMLContentHandler utility class to generate the XHTML output.
SAX events may be complex to understand, so Tika provides utility classes to process the event stream and convert it to other representations.
For example, the BodyContentHandler class can be used to extract the body of the XHTML output and feed it either as SAX events to another content handler or as characters to an output stream, a writer, or simply a string.
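
To make the idea of "XHTML SAX events" concrete, here is a minimal sketch that uses only the SAX classes shipped with the JDK. `BodyTextSketch` is a hypothetical handler written for this note, not Tika's BodyContentHandler; it only mimics the general idea of keeping the character data that appears inside the body element. The XHTML string in `main` is made-up sample input.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class BodyTextSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    // Greater than zero while the parser is inside <body>.
    private int bodyDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Character events outside <body> (for example the title) are ignored.
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    /** Feeds an XHTML string through a SAX parser and returns the body text. */
    static String extractBody(String xhtml) throws Exception {
        BodyTextSketch handler = new BodyTextSketch();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>T</title></head>"
            + "<body><p>Hello</p><p>World</p></body></html>";
        System.out.println(extractBody(xhtml));
    }
}
```

Because the handler reacts to events as they arrive, the whole document never has to be held in memory, which is the point of streamed processing.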

Document Metadata
Third argument of the parse method.
Used to pass document metadata both in and out of the parser, expressed as a Metadata object.
Some of the more interesting metadata properties:
Metadata.RESOURCE_NAME_KEY
Metadata.CONTENT_TYPE
Metadata.TITLE
Metadata.AUTHOR

Parse Context
Final argument of the parse method.
Used to inject context-specific information into the parsing process.
Example of use: dealing with locale-specific date and number formats in Microsoft Excel spreadsheets.
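
One way to picture "injecting context" is a small container whose entries are registered and looked up by type. The sketch below is a hypothetical stand-in written with JDK classes only; `ContextSketch` and its methods are assumptions made for illustration and are not Tika's ParseContext, so check the Tika API documentation for the real class.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class ContextSketch {
    // Each entry is stored under the class it was registered for.
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    /** Returns the registered value, or the default when nothing was set. */
    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value == null ? defaultValue : key.cast(value);
    }

    public static void main(String[] args) {
        ContextSketch context = new ContextSketch();
        // The client states which locale the spreadsheet's dates and numbers use.
        context.set(Locale.class, Locale.GERMANY);

        // A parser would look the locale up and fall back to a default if absent.
        Locale locale = context.get(Locale.class, Locale.getDefault());
        System.out.println("Formatting numbers for locale: " + locale);
    }
}
```

Keying by type keeps the parse method signature fixed while still letting a client pass in whatever extra information a particular parser understands.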

Write your Tika application!
Download the Java file implementing PDF parsing.
The related Tika classes must be imported.
Problem: a number of jar files (and their dependencies) must be downloaded and added to the classpath, and it is difficult to discover and specify all the dependency libraries manually.
Solution: Apache Maven, a tool for building and managing any Java-based project, with an excellent dependency management mechanism and an easy build process.

Tika application using Maven!
Installation:
    sudo apt-get install maven
Create a Maven project:
    mvn archetype:generate -DgroupId=com.csdeptucy.app -DartifactId=tikaParser -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
Get into the project folder:
    cd tikaParser
The pom.xml file is the core of the project's configuration.

POM file example
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>com.mycompany.app</groupId>
      <artifactId>my-app</artifactId>
      <version>1.0-SNAPSHOT</version>
      <packaging>jar</packaging>
      <name>Maven Quick Start Archetype</name>
      <url>http://maven.apache.org</url>
      <dependencies>
        <dependency>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
          <version>4.8.2</version>
          <scope>test</scope>
        </dependency>
      </dependencies>
    </project>

Maven phases
Most common lifecycle phases:
validate: validate that the project is correct and all necessary information is available
compile: compile the source code of the project
test: test the compiled source code using a suitable unit testing framework. These tests should not require the code to be packaged or deployed
package: take the compiled code and package it in its distributable format, such as a JAR
integration-test: process and deploy the package if necessary into an environment where integration tests can be run
verify: run any checks to verify the package is valid and meets quality criteria
install: install the package into the local repository, for use as a dependency in other projects locally
deploy: done in an integration or release environment, copies the final package to the remote repository for sharing with other developers and projects
clean: cleans up artifacts created by prior builds
site: generates site documentation for this project
Phases may be executed in sequence:
    mvn clean package
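
Invoking a phase also runs every earlier phase of the same lifecycle, which is why `mvn package` compiles and tests before packaging. The sketch below models only that ordering rule. It is a toy illustration, not Maven code: the phase list is the shortened one from the slide (the real default lifecycle contains more phases), and `clean` and `site` are left out because they belong to separate lifecycles.

```java
import java.util.Arrays;
import java.util.List;

class LifecycleSketch {
    // Shortened default lifecycle, in execution order, as listed on the slide.
    static final List<String> DEFAULT_PHASES = Arrays.asList(
        "validate", "compile", "test", "package",
        "integration-test", "verify", "install", "deploy");

    /** Returns the phases that run when the given phase is requested. */
    static List<String> phasesRunFor(String target) {
        int index = DEFAULT_PHASES.indexOf(target);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown phase: " + target);
        }
        return DEFAULT_PHASES.subList(0, index + 1);
    }

    public static void main(String[] args) {
        // Prints [validate, compile, test, package]
        System.out.println(phasesRunFor("package"));
    }
}
```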

Test initial application
Test the newly compiled and packaged JAR with the following command:
    java -cp target/tikaParser-1.0-SNAPSHOT.jar com.csdeptucy.app.App
Which will print:
    Hello World!
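
For reference, the class that the quickstart archetype generates looks roughly like the sketch below. The exact generated file may differ slightly between archetype versions, so treat this as an approximation. In the generated project the class is public and sits in the package given by the groupId; both details are left out here so the snippet compiles on its own.

```java
// In the generated project this file starts with: package com.csdeptucy.app;
// and the class is declared public.

/** Roughly what the quickstart archetype generates as the starting point. */
class App {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}
```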

Unzip LAB07.zip
Place the Java file into the tikaParser/src/main/java/com/csdeptucy/app folder.
Replace the old pom.xml file with the given one.
Clean the artifacts from the previous build and regenerate the jar file:
    mvn clean package
In case of a java.lang.OutOfMemoryError: Java heap space error, run in a terminal:
    export MAVEN_OPTS=-Xmx1024m
    mvn clean package
Run the application:
    java -cp target/tikaParser-1.0-SNAPSHOT-jar-with-dependencies.jar com.csdeptucy.app.EPL660Parser

PDF Parsing
    package com.csdeptucy.app; // matches the folder the file is placed in

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    public class EPL660Parser {
        public static void main(String[] args) {
            try {
                parsePdf();
            } catch (IOException | SAXException | TikaException e) {
                e.printStackTrace();
            }
        }

        private static void parsePdf() throws IOException, SAXException, TikaException {
            // The client opened the stream, so the client closes it (try-with-resources).
            try (InputStream input = new FileInputStream(new File("simple.pdf"))) {
                // The no-argument handler collects the body text in memory,
                // so toString() returns the extracted content after parsing.
                ContentHandler textHandler = new BodyContentHandler();
                Metadata metadata = new Metadata();
                ParseContext context = new ParseContext();

                // parsing the document using the PDF parser
                PDFParser parser = new PDFParser();
                parser.parse(input, textHandler, metadata, context);

                // getting the content of the document
                System.out.println("Contents of the PDF: " + textHandler.toString());

                // getting the metadata of the document
                System.out.println("Metadata of the PDF:");
                String[] metadataNames = metadata.names();
                for (String name : metadataNames) {
                    System.out.println(name + " : " + metadata.get(name));
                }
            }
        }
    }

[Screenshots: content and metadata of the PDF]

[Screenshots: content and metadata of the PDF as extracted by Tika]

Parse all Types of Files
Change your PDF parser. The new parser should:
Use AutoDetectParser:
    AutoDetectParser parser = new AutoDetectParser();
Read all files from a folder.
Print all the metadata for each file.
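
For the "read all files from a folder" step, a possible starting point using only JDK classes is sketched below. The class and method names are made up for this note, and the Tika call is deliberately left as a comment so that the parsing part remains the exercise.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class FolderWalkSketch {
    /** Returns the regular files directly inside the folder, sorted by path. */
    static List<Path> regularFiles(Path folder) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(folder)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) { // skip sub-folders
                    files.add(entry);
                }
            }
        }
        Collections.sort(files);
        return files;
    }

    public static void main(String[] args) throws IOException {
        // The folder is taken from the first argument, defaulting to the current one.
        Path folder = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path file : regularFiles(folder)) {
            System.out.println("File to parse: " + file);
            // Here the lab solution would open the file as an InputStream,
            // pass it to the AutoDetectParser, and print the resulting metadata.
        }
    }
}
```

Opening each file in its own try-with-resources block keeps the stream handling consistent with the rule that the client closes what it opens.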

Useful Links
http://tika.apache.org/
http://tika.apache.org/1.12/index.html
http://tika.apache.org/1.12/api/
http://www.ibm.com/developerworks/opensource/tutorials/os-apache-tika/index.html