EPL660: Information Retrieval and Search Engines Lab 7

Pavlos Antoniou
Office: B109, ΘΕΕ01
University of Cyprus, Department of Computer Science

Apache Tika
What is Apache Tika?
- A content analysis toolkit
- Detects and extracts metadata and text content from over a thousand different file types
- Useful for search engine indexing, content analysis, translation, and much more

Supported Document Formats
- Microsoft Excel, Word, PowerPoint, Visio, Outlook
- GZIP, bzip2 compression
- MP3, MIDI, Wave audio
- XML
- HTML
- Java class files
- Images
- Java Archive files
- Plain text
- OpenDocument
- PDF
- RTF
- TAR/ZIP
You can also extend Tika with your own parsers.

Getting Started with Apache Tika
- Download a source release from https://tika.apache.org/download.html
- Build Tika from sources using the Maven build system:
    - Install Maven:
        $ sudo apt-get install maven
    - Extract the Tika sources to a folder
    - Run the install command from that folder:
        $ mvn install
Note: Java 7 or higher is needed to build Tika.

Build Artifacts
The Tika build consists of a number of components and produces the following main binaries:
- tika-core/target/tika-core-*.jar
  Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 6.
- tika-parsers/target/tika-parsers-*.jar
  Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.
- tika-app/target/tika-app-*.jar
  Tika application. Combines the above components and all the external parser libraries into a single runnable JAR with a GUI and a command line interface.
- tika-server/target/tika-server-*.jar
  Tika JAX-RS REST application. This is a Jetty web server running Tika REST services.
- tika-bundle/target/tika-bundle-*.jar
  Tika bundle. An OSGi bundle that combines tika-parsers with non-OSGified parser libraries to make them easy to deploy in an OSGi environment.

Command Line Utility

Tika GUI (--gui)

The Parser Interface
    void parse(InputStream stream, ContentHandler handler,
               Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException;
Input:
- The document to be parsed
- Related metadata
Output:
- Results as XHTML SAX events
- Extra metadata

InputStream
- First argument of the parse method
- Used for reading the document to be parsed
- The parser implementation consumes this stream but does not close it
- Closing the stream is the responsibility of the client application that opened it in the first place
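
The ownership rule above can be illustrated without Tika. In the sketch below, `consumeAll` is a hypothetical stand-in for a parser: it reads the stream to the end but leaves it open, and the caller closes it with try-with-resources. `CloseTrackingStream` is likewise an illustrative helper and not a Tika class.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamOwnershipDemo {

    /** Illustrative helper: remembers whether close() has been called. */
    static class CloseTrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        CloseTrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Hypothetical stand-in for a parser: consumes the stream, never closes it. */
    static int consumeAll(InputStream stream) throws IOException {
        int count = 0;
        while (stream.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello tika".getBytes(StandardCharsets.UTF_8);
        CloseTrackingStream stream = new CloseTrackingStream(data);
        // The caller opened the stream, so the caller closes it
        try (InputStream in = stream) {
            int bytesRead = consumeAll(in);
            System.out.println("bytes read: " + bytesRead);
            System.out.println("closed after consuming? " + stream.closed);
        }
        System.out.println("closed after try block? " + stream.closed);
    }
}
```

The same shape applies to a real parse call: open the stream in a try-with-resources header, hand it to the parser inside the block, and let the block close it.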

XHTML SAX events
- The parsed content of the document stream is returned to the client application as a sequence of XHTML SAX events
- XHTML is used to express the structured content of the document, not to render it for browsing
- SAX events enable streamed processing
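
To make the idea of a streamed event sequence concrete, the sketch below records the SAX events produced for a tiny XHTML document. It uses only the JDK: the built-in SAX parser stands in as the event source (in Tika the document parser itself emits the events), and the class name `SaxEventTrace` and the sample markup are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventTrace {

    /** Parses an XHTML string and returns the SAX events in arrival order. */
    static List<String> trace(String xhtml) throws Exception {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                events.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Skip whitespace-only chunks to keep the trace short
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    events.add("text:" + text);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hi</p></body></html>";
        for (String event : trace(xhtml)) {
            System.out.println(event);
        }
    }
}
```

Each event is handled as it arrives, so the whole document never has to be held in memory at once.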

ContentHandler
- Second argument of the parse method
- Receives the XHTML SAX events produced by the parser
- Parser implementations typically use the XHTMLContentHandler utility class to generate the XHTML output
- Raw SAX events can be cumbersome to work with, so Tika provides utility classes that process the event stream and convert it to other representations
- For example, the BodyContentHandler class extracts the body of the XHTML output and feeds it either as SAX events to another content handler or as characters to an output stream, a writer, or simply a string
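
As a rough illustration of what a body-extracting handler does, the sketch below collects only the character data found inside the body element. It is written in the spirit of BodyContentHandler but is not Tika's implementation; `BodyTextCollector` is a hypothetical class, and the JDK SAX parser is used here only to generate events from a sample XHTML string.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BodyTextCollector extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    // Greater than zero while the parser is inside the body element
    private int bodyDepth = 0;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text outside the body (for example the title) is ignored
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    /** Convenience method: runs the collector over an XHTML string. */
    static String extractBody(String xhtml) throws Exception {
        BodyTextCollector collector = new BodyTextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>T</title></head>"
                + "<body><p>Hello</p> <p>World</p></body></html>";
        System.out.println(extractBody(xhtml));
    }
}
```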

Document Metadata
- Third argument of the parse method
- Used to pass document metadata both in and out of the parser, expressed as a Metadata object
- Some of the more interesting metadata properties:
    Metadata.RESOURCE_NAME_KEY
    Metadata.CONTENT_TYPE
    Metadata.TITLE
    Metadata.AUTHOR

Parse Context
- Final argument of the parse method
- Used to inject context-specific information into the parsing process
- Example of use: dealing with locale-specific date and number formats in Microsoft Excel spreadsheets
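
The locale example can be sketched with the JDK alone. Below, `MiniContext` is a hypothetical, simplified stand-in for a parse context (objects keyed by their type), and `parseCell` is an assumed helper that reads the locale from the context before interpreting a number. Neither is part of Tika; the point is only to show why the same cell text needs a locale to be read correctly.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class LocaleContextDemo {

    /** Hypothetical, simplified stand-in for a parse context: objects keyed by type. */
    static class MiniContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    /** Parses a cell value using the locale found in the context (US if none is set). */
    static double parseCell(String raw, MiniContext context) throws ParseException {
        Locale locale = context.get(Locale.class, Locale.US);
        return NumberFormat.getInstance(locale).parse(raw).doubleValue();
    }

    public static void main(String[] args) throws ParseException {
        String raw = "1.234,56"; // German-style number formatting

        MiniContext german = new MiniContext();
        german.set(Locale.class, Locale.GERMANY);
        System.out.println("With German locale: " + parseCell(raw, german));

        // No locale injected: the US default misreads the same text
        MiniContext empty = new MiniContext();
        System.out.println("With default locale: " + parseCell(raw, empty));
    }
}
```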

Write your Tika application!
- Download the Java file implementing PDF parsing
- The related Tika classes must be imported
- Problem:
    - A number of JAR files (and their dependencies) must be downloaded and added to the classpath
    - It is difficult to discover and specify all dependency libraries manually
- Solution: Maven, an Apache tool for building and managing any Java-based project
    - Excellent dependency management mechanism
    - Easy build process

Tika application using Maven!
- Installation:
    sudo apt-get install maven
- Create a Maven project:
    mvn archetype:generate -DgroupId=com.csdeptucy.app -DartifactId=tikaParser -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
- Change into the project folder:
    cd tikaParser
- Inspect the generated project structure
- The pom.xml file is the core of the project's configuration

POM file example
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>com.mycompany.app</groupId>
      <artifactId>my-app</artifactId>
      <version>1.0-SNAPSHOT</version>
      <packaging>jar</packaging>
      <name>Maven Quick Start Archetype</name>
      <url>http://maven.apache.org</url>
      <dependencies>
        <dependency>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
          <version>4.8.2</version>
          <scope>test</scope>
        </dependency>
      </dependencies>
    </project>

Maven phases
Most common phases of the default lifecycle:
- validate: validate that the project is correct and all necessary information is available
- compile: compile the source code of the project
- test: test the compiled source code using a suitable unit testing framework; these tests should not require the code to be packaged or deployed
- package: take the compiled code and package it in its distributable format, such as a JAR
- integration-test: process and deploy the package, if necessary, into an environment where integration tests can be run
- verify: run any checks to verify that the package is valid and meets quality criteria
- install: install the package into the local repository, for use as a dependency in other projects locally
- deploy: done in an integration or release environment; copies the final package to the remote repository for sharing with other developers and projects
Two phases that belong to other lifecycles:
- clean: cleans up artifacts created by prior builds
- site: generates site documentation for this project
Phases may be executed in sequence:
    mvn clean package
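
Requesting a phase of the default lifecycle also runs every phase that precedes it, so `mvn package` validates, compiles, and tests before packaging. The sketch below models only that ordering rule for the common phases listed above. It is a simplification (the real default lifecycle contains additional phases), and `PhaseOrderDemo` and `phasesRunFor` are illustrative names, not part of Maven.

```java
import java.util.Arrays;
import java.util.List;

public class PhaseOrderDemo {

    // The common default-lifecycle phases from the slide, in execution order.
    // Simplification: the real default lifecycle contains additional phases.
    static final List<String> PHASES = Arrays.asList(
            "validate", "compile", "test", "package",
            "integration-test", "verify", "install", "deploy");

    /** Returns every listed phase that runs when the given phase is requested. */
    static List<String> phasesRunFor(String phase) {
        int index = PHASES.indexOf(phase);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown phase: " + phase);
        }
        return PHASES.subList(0, index + 1);
    }

    public static void main(String[] args) {
        System.out.println("mvn package runs: " + phasesRunFor("package"));
        System.out.println("mvn install runs: " + phasesRunFor("install"));
    }
}
```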

Test initial application
Test the newly compiled and packaged JAR with the following command:
    java -cp target/tikaParser-1.0-SNAPSHOT.jar com.csdeptucy.app.App
which will print:
    Hello World!

Unzip LAB07.zip
- Place the Java file into the tikaParser/src/main/java/com/csdeptucy/app folder
- Replace the old pom.xml file with the given one
- Clean the artifacts from the previous build and regenerate the JAR file:
    mvn clean package
- In case of a java.lang.OutOfMemoryError: Java heap space error, run in a terminal:
    export MAVEN_OPTS=-Xmx1024m
    mvn clean package
- Run the application:
    java -cp target/tikaParser-1.0-SNAPSHOT-jar-with-dependencies.jar com.csdeptucy.app.EPL660Parser

PDF Parsing
    // Package assumed from the folder and run command used in this lab
    package com.csdeptucy.app;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    public class EPL660Parser {

        public static void main(String[] args) {
            try {
                parsePdf();
            } catch (IOException | SAXException | TikaException e) {
                e.printStackTrace();
            }
        }

        private static void parsePdf() throws IOException, SAXException, TikaException {
            // Collect the extracted body text in memory so that toString() returns it
            ContentHandler textHandler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();

            // The parser does not close the stream; try-with-resources closes it here
            try (InputStream input = new FileInputStream(new File("simple.pdf"))) {
                // Parse the document using the PDF parser
                PDFParser parser = new PDFParser();
                parser.parse(input, textHandler, metadata, context);
            }

            // Print the content of the document
            System.out.println("Contents of the PDF: " + textHandler.toString());

            // Print the metadata of the document
            System.out.println("Metadata of the PDF:");
            String[] metadataNames = metadata.names();
            for (String name : metadataNames) {
                System.out.println(name + " : " + metadata.get(name));
            }
        }
    }

PDF screenshots: Content and Metadata [screenshots not reproduced]

PDF using Tika: Content and Metadata [screenshots not reproduced]

Parse all Types of Files
Change your PDF parser. The new parser should:
- Use AutoDetectParser:
    AutoDetectParser parser = new AutoDetectParser();
- Read all files from a folder
- Print all the metadata for each file
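
As a starting point for the "read all files from a folder" step, the sketch below lists the regular files directly inside a folder using only the JDK. The Tika calls are deliberately left as a comment, and the class name `FolderWalkSketch` and the choice to skip subfolders are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FolderWalkSketch {

    /** Returns the regular files directly inside the folder, sorted by path. */
    static List<Path> listRegularFiles(Path folder) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(folder)) {
            for (Path entry : entries) {
                // Subfolders and other non-regular entries are skipped
                if (Files.isRegularFile(entry)) {
                    files.add(entry);
                }
            }
        }
        Collections.sort(files);
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Folder to scan: first argument, or the current directory by default
        Path folder = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path file : listRegularFiles(folder)) {
            // In the exercise, each file would be opened here and passed to the
            // parser together with a fresh Metadata object
            System.out.println(file.getFileName());
        }
    }
}
```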

Useful Links
- http://tika.apache.org/
- http://tika.apache.org/1.12/index.html
- http://tika.apache.org/1.12/api/
- http://www.ibm.com/developerworks/opensource/tutorials/os-apache-tika/index.html