High Performance Indexing of Large Heterogeneous Data Sets using GPU

1 High Performance Indexing of Large Heterogeneous Data Sets using GPU. Massimo Bernaschi, IAC, National Research Council of Italy. Funded by the ISEC programme under GA n

2 Why a new indexer? Law Enforcement Agencies need an easy and fast tool to index and search seized disk images GTC

3 How it works: extract raw files and metadata from (seized) disk images; distribute them over multiple systems; extract plain text and metadata from every file, including deleted files; create distributed indexes; provide a friendly user interface to query results; organize query results in an intuitive visual representation.

4 Architecture Overview [Diagram. Components: Admin Web GUI, Search, Database, Searcher, Mediator, DBMS, Index Repo, Coordinator, Job Scheduler, Status Manager, Worker Agent, Worker Nodes, HPC Cluster. Connections legend: DB input/output, HPC cluster.] GTC 2015

5 Architecture Overview (cont.) Coordinator: manages, coordinates and monitors the whole system. DBMS: provides the interface to the database. Mediator: mediates among all components to ease message communication. Admin Web UI: used to manage the infrastructure, create investigation cases and add disk images for indexing. Worker Agent: runs on all worker nodes and provides services for monitoring, starting, stopping and configuring local components. Index Repository: stores the results of all indexing jobs.

6 Architecture Overview (cont.) Each worker node can run one or more of the following: Image-Extractor, to extract files from seized disk images; Docu-Parser, to transform extracted documents into plain text and metadata; Docu-Indexer, to create searchable indexes from the transformed text and metadata. These components are managed by worker agents and are connected to form an Extraction > Parse > Indexing pipeline.

7 Extract Parse Indexing Pipeline [Diagram: 1: EXTRACT (one Image-Extractor), 2: PARSE (four Docu-Parsers), 3: INDEXING (four Docu-Indexers)]
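
The three-stage pipeline in the diagram can be approximated with threads and in-memory queues. The following is a minimal sketch, not the authors' implementation: the function names, the sentinel protocol and the toy "parsing" (UTF-8 decoding) and "indexing" (term to file paths) steps are all assumptions made for illustration. Each indexer builds its own index, and the indexes are then collected into a shared repository.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a stream (illustrative protocol)

def extractor(files, parse_q, n_parsers):
    """Stage 1: push (path, raw bytes) pairs, then one sentinel per parser."""
    for path, raw in files:
        parse_q.put((path, raw))
    for _ in range(n_parsers):
        parse_q.put(DONE)

def parser(parse_q, index_q):
    """Stage 2: turn raw bytes into plain text (stand-in for a real parser)."""
    while (item := parse_q.get()) is not DONE:
        path, raw = item
        index_q.put((path, raw.decode("utf-8", errors="replace")))
    index_q.put(DONE)  # each parser signals exactly one indexer to stop

def indexer(index_q, repo):
    """Stage 3: build a private term -> paths index, then publish it."""
    local = {}
    while (item := index_q.get()) is not DONE:
        path, text = item
        for term in text.lower().split():
            local.setdefault(term, set()).add(path)
    repo.append(local)  # list.append is thread-safe in CPython

def run_pipeline(files, n_workers=4):
    """Run 1 extractor, n_workers parsers and n_workers indexers."""
    parse_q, index_q, repo = queue.Queue(), queue.Queue(), []
    threads = [threading.Thread(target=extractor, args=(files, parse_q, n_workers))]
    threads += [threading.Thread(target=parser, args=(parse_q, index_q))
                for _ in range(n_workers)]
    threads += [threading.Thread(target=indexer, args=(index_q, repo))
                for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return repo  # one index per indexer
```

Nothing is written to disk between stages; data only moves through the two queues, which mirrors the streamed, in-memory design described in the slides.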

9 Disk Image Extraction: performed by the Image Extractor component; based on The Sleuth Kit library; supports Unix, Linux, OS X and Windows volumes and file systems; extracts raw files and file system metadata (CREATION_DATE, FILENAME, SIZE, PATH, LAST_MODIFICATION_DATE).

10 Document Parsing: performed by the Docu-Parser component; based on the Apache Tika library; detects and extracts document metadata and structured text; supports about 1400 file types. Document metadata: AUTHOR, TITLE, KEYWORDS, SUMMARY, LANGUAGE, TOOL, RIGHTS, FORMAT.

11 Document Indexing: performed by the Docu-Indexer component; based on the Apache Lucene libraries, which provide indexing and search capabilities; index size is roughly 20-30% of the size of the indexed text; indexes are collected into the Index Repository.

12 Document Searching: based on the Apache Lucene libraries. Provides searching capabilities: ranked searching; multiple-index searching with merged results; many powerful query types; fielded searching (e.g. title, author, contents). Work is in progress on presenting results through an efficient and interactive interface.
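
Multiple-index searching with merged results can be illustrated with a toy model. This sketch does not use the Lucene API: the index layout (term to `{doc_id: term frequency}`), the raw-frequency score and the function names are assumptions chosen only to show how several ranked hit lists are merged into one.

```python
import heapq
from itertools import islice

def search(index, term):
    """Return (score, doc_id) hits from one index, best first.

    `index` maps term -> {doc_id: term frequency}; the score used here is
    simply the raw term frequency (a stand-in for a real ranking function).
    """
    postings = index.get(term, {})
    return sorted(((tf, doc) for doc, tf in postings.items()), reverse=True)

def search_all(indexes, term, k=10):
    """Query every index and merge the ranked lists into one top-k list."""
    ranked_lists = [search(ix, term) for ix in indexes]
    # heapq.merge lazily merges already-sorted inputs; reverse=True means
    # each input must be sorted from largest to smallest, as search() does.
    return list(islice(heapq.merge(*ranked_lists, reverse=True), k))
```

For example, with two small indexes `{"gpu": {"d1": 3, "d2": 1}}` and `{"gpu": {"d3": 2}}`, a query for `"gpu"` yields the hits in the order d1, d3, d2.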

13 HPC Document Indexing: text analysis requires tokenization, filtering and stop-word removal; GPU cards offer huge computing power; combine CLucene indexing with GPU power to accelerate these steps (CLucene libraries).

14 GPU CUDA Text Analysis. Input: "My name is Bob.\0". One CUDA thread per character: each thread applies the LowerCase filter, giving "my name is bob.\0". Each CUDA thread then performs tokenization by locating delimiter positions; vector processing creates two vectors holding the start and end token indexes respectively (both relative to the input text), giving the tokens "my", "name", "is", "bob". One CUDA thread per token: each thread applies the StopWords filter, leaving "my", "name", "bob".
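
The three data-parallel steps on this slide can be emulated on the CPU to make the per-thread logic explicit. This is a sketch under stated assumptions, not the CUDA kernels from the talk: each list comprehension stands in for "one thread per element", and the delimiter set, the stop-word list and the use of exclusive end indexes are illustrative choices.

```python
STOP_WORDS = {"is", "the", "a", "an", "of"}  # illustrative list, not from the slides
DELIMITERS = set(" \t\n.,;:!?\0")            # assumed delimiter characters

def analyze(text):
    """Return (start indexes, end indexes, kept tokens) for `text`."""
    n = len(text)
    # Step 1: one "thread" per character applies the lowercase filter.
    lower = [c.lower() for c in text]
    # Step 2: one "thread" per character checks whether it is a delimiter...
    is_delim = [c in DELIMITERS for c in lower]
    # ...a token starts where a non-delimiter follows a delimiter (or the
    # beginning of the text) and ends where a non-delimiter precedes a
    # delimiter (or the end of the text). End indexes are exclusive.
    starts = [i for i in range(n)
              if not is_delim[i] and (i == 0 or is_delim[i - 1])]
    ends = [i + 1 for i in range(n)
            if not is_delim[i] and (i == n - 1 or is_delim[i + 1])]
    # Step 3: one "thread" per token applies the stop-word filter.
    tokens = ["".join(lower[s:e]) for s, e in zip(starts, ends)]
    kept = [t for t in tokens if t not in STOP_WORDS]
    return starts, ends, kept

starts, ends, kept = analyze("My name is Bob.\0")
print(starts, ends, kept)
```

For the slide's input, the start indexes are 0, 3, 8, 11, the exclusive end indexes are 2, 7, 10, 14, and the tokens left after stop-word removal are "my", "name", "bob". On a GPU, gathering the flagged positions into compact vectors would typically be done with a parallel prefix sum; that detail is an assumption and is not stated on the slide.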

15 GPU CUDA Results (2070 Fermi) [Chart: time in seconds versus plain-text size ( MB, 32MB, 128MB) for CLucene and GPU+CLucene; speed-up labels 7x, 9x, x]

16 CUDA and (Java)Lucene 1/2 How do they cooperate?

17 CUDA and (Java)Lucene 2/2 How do they cooperate efficiently? Smart and efficient memory transfer using the Java Unsafe API

18 Test Environment: 4 worker nodes (4 CPUs / 24 cores, 2.67 GHz, 48 GB RAM, GPU per node) running Worker Agents and Extract Parse Indexing pipelines; 1 management node running all other components; 1G Ethernet.

19 Disk Images for Test: disk images built using the Govdocs1 document set; the Govdocs1 digital corpus includes nearly 1 million freely redistributable files. [Chart: Govdocs1 file types (gz, ps, ppt, doc, image, pdf), percentage axis from 0% to 25%]

20 Results [Chart: "Extract-Parse-Index Time" (axis from 0:00:00 to 1:12:00) versus seized disk image size (in GB); size ticks 32, 80] Table columns: Disk Image Size (GB), Extract-Parse-Indexing Time, DD Time, # Files, Index Size (GB). Surviving values: 32 00:09:15 00:07: :17:43 00:14: :37:16 00:20: :02:22 00:33:
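
As a rough reading of the first row of results (a 32 GB image with an Extract-Parse-Indexing time of 00:09:15), the average end-to-end rate can be computed as below. This is only arithmetic on the reported figures, assuming 1 GB = 1024 MB; the function name is illustrative.

```python
def throughput_mb_per_s(size_gb, hms):
    """Average rate in MB/s for an image of size_gb processed in hh:mm:ss."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    elapsed = hours * 3600 + minutes * 60 + seconds
    return size_gb * 1024 / elapsed

rate = throughput_mb_per_s(32, "00:09:15")
print(f"{rate:.1f} MB/s")
```

Under that assumption the 32 GB image is extracted, parsed and indexed at about 59 MB/s.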

21 64 GB Disk Image Indexing [Chart (ISODAC): size (GB) of the disk image, the extracted text and the index; file types: text, pdf, xls, others, html, doc, csv, xml, ps, ppt, gz, image]

22 Highlights. Streamed in-memory Extraction+Parse+Indexing: only indexes are written to disk; much faster than a Map-Reduce based solution. File indexing failure recovery: files are processed again in case of failure. Selectable file extraction and indexing. Exportable indexes: generated indexes can be exported and handed back to investigators.
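
The failure-recovery highlight ("files are processed again in case of failure") can be sketched as a bounded retry loop. The slides do not describe the actual recovery mechanism, so the retry limit, the function names and the choice to return the files that never succeeded are all assumptions.

```python
def index_with_recovery(files, index_one, max_attempts=3):
    """Index every file, re-processing failures up to max_attempts times.

    `index_one` is a caller-supplied (hypothetical) function that indexes a
    single file and raises an exception on failure. Returns the list of
    files that still failed after the last attempt.
    """
    pending = list(files)
    for _ in range(max_attempts):
        failed = []
        for f in pending:
            try:
                index_one(f)
            except Exception:
                failed.append(f)  # queue the file to be processed again
        pending = failed
        if not pending:
            break
    return pending
```

A file that fails on every attempt is reported back to the caller rather than retried forever, so one corrupt file cannot stall the rest of the job.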

23 Future Works: distribute workload based on file type; enhance the scheduling algorithm; support file extraction filtering; alternative ad-hoc parsers based on file type; CUDA version of Tesseract (for fast OCR); enhanced and interactive results visualization.

24 Tesseract OCR. Profiling with Valgrind's Callgrind tool reveals that 3 functions account for approximately 50% of self execution time. Go parallel: try to parallelize these in CUDA. In multi-page documents, the ProcessPages function takes nearly 98% of total execution time. Go parallel: OpenMP with 1 page per thread, or get the total number of pages and launch a process per page.
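
The "1 page per thread" idea can be sketched with a thread pool. This is not Tesseract code and does not use OpenMP: `ocr_page` is a hypothetical caller-supplied function that recognizes a single page, and the worker count is an arbitrary default.

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_document(pages, ocr_page, max_workers=4):
    """Run ocr_page on every page concurrently, one page per task.

    Results are returned in page order, because Executor.map preserves
    the order of its input.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_page, pages))
```

In Python, threads only speed up this kind of work if `ocr_page` spends its time in native code that releases the interpreter lock; otherwise the second option on the slide, one process per page, is the closer analogue.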

26 Why Not Hadoop? Performance: MapReduce performance [Jiang et al. (2010)] [Lin et al. (2012)]; HDFS performance [Dong et al. (2014)]. Seized disk images are neither stored on the cluster nor available on a distributed infrastructure. As fast as possible: in-memory streaming pipeline; only indexes are written to disk; ad-hoc recovery process.
