High Performance Indexing of Large Heterogeneous Data Sets using GPU
|
|
- Matilda Nelson
- 7 years ago
- Views:
Transcription
1 High Performance Indexing of Large Heterogeneous Data Sets using GPU Massimo Bernaschi IAC National Research Council of Italy funded by the ISEC programme under GA n
2 Why a new indexer? Law Enforcement Agencies need an easy and fast tool to index and search seized disk images GTC
3 How it works Extract raw files and metadata from (seized) disk images Distribute them over multiple systems Extract plain text and metadata from every file including deleted files Create distributed indexes Provide a friendly user interface to query results Organize query results in an intuitive visual representation GTC
4 Architecture Overview CONNECTIONS LEGEND DB input/ouput HPC cluster Admin Web GUI Search DATABASE SEARCHER MEDIATOR DBMS GTC 2015 INDEX REPO COORDINATOR Job Scheduler Status Manager WORKER AGENT Worker Nodes HPC Cluster 4
5 Architecture Overview (cont.) Coordinator Manage, coordinate and monitor the whole system DBMS Provides the interface to the Database Mediator Mediates among all components to ease message communication Admin Web UI Used to manage the infrastruture, create investigation cases and add disk images for indexing Worker Agent Runs all worker nodes and provides services for monitoring, starting, stopping, configuring local components Index Repository Repository used to store results of all indexing jobs GTC
6 Architecture Overview (cont.) Each worker node can run one or more Image-Extractor to extract files from seized disk images Docu-Parser to trasform extracted documents into plain text and metadata Docu-Indexer to create searchable indexes from transformed text and metadata Managed by worker agents They are connected to form an Extraction > Parse > Indexing Pipeline GTC
7 Extract Parse Indexing Pipeline 1: EXTRACT 2: PARSE 3: INDEXING Image - Extractor Docu - Parser Docu - Parser Docu - Parser Docu - Parser Docu - Indexer Docu - Indexer Docu - Indexer Docu - Indexer GTC
8 Extract Parse Indexing Pipeline 1: EXTRACT 2: PARSE 3: INDEXING Image - Extractor Docu - Parser Docu - Parser Docu - Parser Docu - Parser Docu - Indexer Docu - Indexer Docu - Indexer Docu - Indexer GTC
9 Disk Image Extraction Performed by the Image Extractor component Based on The Sleuth Kit Library Supports Unix, Linux, OSx and Windows volumes and file systems Extracts raw files and file system metadata The Sleuth Kit Library SYSTEM METADATA CREATION_DATE FILENAME SIZE PATH LAST_MODIFICATION_DATE GTC
10 Document Parsing Performed by Docu-Parser component Based on Apache Tika Library Detects and extracts document metadata and structured text Supports about 1400 file types Tika Library DOCUMENT METADATA AUTHOR TITLE KEYWORDS SUMMARY LANGUAGE TOOL RIGHTS FORMAT GTC
11 Document Indexing Perfomed by Docu-Indexer component Based on Apache Lucene Libraries Provides indexing and search capabilities Index size roughly 20-30% the size of text indexed Indexes are collected into Index Repository Apache Lucene Libraries GTC
12 Document Searching Based on Apache Lucene Libraries Provides searching capabilities: ranked searching multiple-index searching with merged results many powerful query types fielded searching (e.g. title, author, contents) Working on presenting results through an efficient and interactive interface GTC
13 HPC Document Indexing Text analysis requires tokenization, filtering and stop words removal GPU cards offer huge computing power Combine CLucene indexing with GPU power to accelerate these steps Clucene Libraries GTC
14 GPU CUDA Text Analysis M y n a m e i s B o b. \0 One CUDA Thread per character. Each thread applies LowerCase Filter m y n a m e i s b o b. \ my name is bob Each CUDA Thread performs Tokenization by locating delimiter positions Vector processing in order to create two vectors representing start and end token indexes respectively. Start Indexes (related to input text) End Indexes (related to input text) One CUDA Thread per token. Each thread applies StopWords Filter. my name bob GTC
15 Time (Seconds) (2070 Fermi) GPU CUDA Results CLucene GPU+CLucene Speed-Up 7x 9x x MB 32MB 128MB Plain-Text Size GTC
16 CUDA and (Java)Lucene 1/2 How do they cooperate?
17 CUDA and (Java)Lucene 2/2 How do they cooperate efficiently? o smart and efficient memory transfer using Java Unsafe API
18 Test Environment 4 Worker Nodes 4 CPUs / 24 Cores 2.67GHz 48 GB RAM GPU per node Running Worker Agents and Extract Parse Indexing Pipelines 1 Management Node Running all other components 1G Ethernet GTC
19 Disk Images for Test Disk images built using the Govdocs1 document set Govdocs1 digital corpora includes nearly 1 milion freelyredistributable files gz ps ppt Govdocs1 File Types doc image pdf 0% 5% 10% 15% 20% 25% Govdocs1 GTC
20 Results Time 1:12:00 1:04:48 0:57:36 0:50:24 0:43:12 0:36:00 0:28:48 0:21:36 0:14:24 0:07:12 0:00:00 "Extract-Parse-Index Time" 32 80Seized disk image size (in GB) Disk Image Size (GB) Extract-Parse-Indexing Time DD Time # Files Index Size (GB) 32 00:09:15 00:07: :17:43 00:14: :37:16 00:20: :02:22 00:33: /09/14 21
21 64 GB Disk Image Indexing text pdf xls others html doc csv xml ps ppt gz image ISODAC Index Extracted Text Disk Image SIZE (GB) GTC
22 Highlights Streamed In-Memory Extraction+Parse+Indexing Only indexes written on disks Much faster than a Map-Reduce based solution File indexing failure recovery Files are processed again in case of failure Selectable files extraction and indexing Exportable indexes Generated indexes can be exported and handled to back to investigators GTC
23 Future Works Distribute workload based on file type Enhance scheduling algorithm Support file extraction filtering Alternative ad-hoc parser based on file type CUDA version of Tesseract (for fast OCR) Enhanced and interactive results visualization GTC
24 Tesseract OCR Profiling with valgrind s tool callgrind reveals how 3 functions collect approximately 50% self time execution go parallel try to parallelize these in CUDA In multi-paged documents, ProcessPages function takes near 98% of total execution time go parallel openmp: 1 page per thread or get total number of pages and launch a process per page
25 Please complete the Presenter Evaluation sent to you by or through the GTC Mobile App. Your feedback is important! GTC
26 Why Not? Hadoop performance MapReduce performance [Jiang et al. (2010)] [Lin et al. (2012)] HDFS performance [Dong et al. (2014)] Seized disk images are neither stored on cluster nor available on a distributed infrastructure As fast as possible In-Memory Streaming Pipeline Only indexes are written to disk Ad-hoc Recovery process GTC
Hadoop-based Open Source ediscovery: FreeEed. (Easy as popcorn)
+ Hadoop-based Open Source ediscovery: FreeEed (Easy as popcorn) + Hello! 2 Sujee Maniyam & Mark Kerzner Founders @ Elephant Scale consulting and training around Hadoop, Big Data technologies Enterprise
More informationSearch and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
More informationInformation Retrieval Elasticsearch
Information Retrieval Elasticsearch IR Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches
More informationBig Data for Investment Research Management
IDT Partners www.idtpartners.com Big Data for Investment Research Management Discover how IDT Partners helps Financial Services, Market Research, and Investment Management firms turn big data into actionable
More informationData-Intensive Programming. Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti
More informationXpoLog Competitive Comparison Sheet
XpoLog Competitive Comparison Sheet New frontier in big log data analysis and application intelligence Technical white paper May 2015 XpoLog, a data analysis and management platform for applications' IT
More informationCS54100: Database Systems
CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics
More informationCSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
More informationA Performance Evaluation of Open Source Graph Databases. Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader
A Performance Evaluation of Open Source Graph Databases Robert McColl David Ediger Jason Poovey Dan Campbell David A. Bader Overview Motivation Options Evaluation Results Lessons Learned Moving Forward
More informationUse case: Merging heterogeneous network measurement data
Use case: Merging heterogeneous network measurement data Jorge E. López de Vergara and Javier Aracil Jorge.lopez_vergara@uam.es Credits to Rubén García-Valcárcel, Iván González, Rafael Leira, Víctor Moreno,
More informationCopyright 2012, Oracle and/or its affiliates. All rights reserved.
1 Oracle Big Data Appliance Releases 2.5 and 3.0 Ralf Lange Global ISV & OEM Sales Agenda Quick Overview on BDA and its Positioning Product Details and Updates Security and Encryption New Hadoop Versions
More informationTopics in basic DBMS course
Topics in basic DBMS course Database design Transaction processing Relational query languages (SQL), calculus, and algebra DBMS APIs Database tuning (physical database design) Basic query processing (ch
More informationOutline: Operating Systems
Outline: Operating Systems What is an OS OS Functions Multitasking Virtual Memory File Systems Window systems PC Operating System Wars: Windows vs. Linux 1 Operating System provides a way to boot (start)
More informationData processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
More informationHadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
More informationOutline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
More informationActian SQL in Hadoop Buyer s Guide
Actian SQL in Hadoop Buyer s Guide Contents Introduction: Big Data and Hadoop... 3 SQL on Hadoop Benefits... 4 Approaches to SQL on Hadoop... 4 The Top 10 SQL in Hadoop Capabilities... 5 SQL in Hadoop
More informationBRINGING INFORMATION RETRIEVAL BACK TO DATABASE MANAGEMENT SYSTEMS
BRINGING INFORMATION RETRIEVAL BACK TO DATABASE MANAGEMENT SYSTEMS Khaled Nagi Dept. of Computer and Systems Engineering, Faculty of Engineering, Alexandria University, Egypt. khaled.nagi@eng.alex.edu.eg
More informationSystem Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks
System Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks OnurSoft Onur Tolga Şehitoğlu November 10, 2012 v1.0 Contents 1 Introduction 3 1.1 Purpose..............................
More informationOracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
More informationScalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
More informationSearch and Real-Time Analytics on Big Data
Search and Real-Time Analytics on Big Data Sewook Wee, Ryan Tabora, Jason Rutherglen Accenture & Think Big Analytics Strata New York October, 2012 Big Data: data becomes your core asset. It realizes its
More informationTowards Smart and Intelligent SDN Controller
Towards Smart and Intelligent SDN Controller - Through the Generic, Extensible, and Elastic Time Series Data Repository (TSDR) YuLing Chen, Dell Inc. Rajesh Narayanan, Dell Inc. Sharon Aicler, Cisco Systems
More informationA Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
More informationPetascale Software Challenges. Piyush Chaudhary piyushc@us.ibm.com High Performance Computing
Petascale Software Challenges Piyush Chaudhary piyushc@us.ibm.com High Performance Computing Fundamental Observations Applications are struggling to realize growth in sustained performance at scale Reasons
More informationEXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics
BIG DATA WITH HADOOP EXPERIMENTATION HARRISON CARRANZA Marist College APARICIO CARRANZA NYC College of Technology CUNY ECC Conference 2016 Poughkeepsie, NY, June 12-14, 2016 Marist College AGENDA Contents
More informationIndexing big data with Tika, Solr, and map-reduce
Indexing big data with Tika, Solr, and map-reduce Scott Fisher, Erik Hetzner California Digital Library 8 February 2012 Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February 2012 1 / 19 Outline
More informationThe Top Six Advantages of CUDA-Ready Clusters. Ian Lumb Bright Evangelist
The Top Six Advantages of CUDA-Ready Clusters Ian Lumb Bright Evangelist GTC Express Webinar January 21, 2015 We scientists are time-constrained, said Dr. Yamanaka. Our priority is our research, not managing
More informationLog Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com
More informationDocument Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model
Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model Condro Wibawa, Irwan Bastian, Metty Mustikasari Department of Information Systems, Faculty of Computer Science and
More informationCLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES
CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,
More informationHow To Make Sense Of Data With Altilia
HOW TO MAKE SENSE OF BIG DATA TO BETTER DRIVE BUSINESS PROCESSES, IMPROVE DECISION-MAKING, AND SUCCESSFULLY COMPETE IN TODAY S MARKETS. ALTILIA turns Big Data into Smart Data and enables businesses to
More informationIntroduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.
Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in
More informationData Masking Secure Sensitive Data Improve Application Quality. Becky Albin Chief IT Architect Becky.Albin@softwareag.com
Data Masking Secure Sensitive Data Improve Application Quality Becky Albin Chief IT Architect Becky.Albin@softwareag.com Data Masking for Adabas The information provided in this PPT is entirely subject
More informationBig Data for Investment Research Management
IDT Partners www.idtpartners.com Big Data for Investment Research Management Discover how IDT Partners helps Financial Services, Market Research, and Investment firms turn big data into actionable research
More informationXpoLog Center Suite Data Sheet
XpoLog Center Suite Data Sheet General XpoLog is a data analysis and management platform for Applications IT data. Business applications rely on a dynamic heterogeneous applications infrastructure, such
More informationScaling Objectivity Database Performance with Panasas Scale-Out NAS Storage
White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage
More informationUsing MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com
Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam sastry.vedantam@oracle.com Agenda The rise of Big Data & Hadoop MySQL in the Big Data Lifecycle MySQL Solutions for Big Data Q&A
More informationPart I Courses Syllabus
Part I Courses Syllabus This document provides detailed information about the basic courses of the MHPC first part activities. The list of courses is the following 1.1 Scientific Programming Environment
More informationFinding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics
Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Dharmendra Agawane 1, Rohit Pawar 2, Pavankumar Purohit 3, Gangadhar Agre 4 Guide: Prof. P B Jawade 2
More informationEfficiency of Web Based SAX XML Distributed Processing
Efficiency of Web Based SAX XML Distributed Processing R. Eggen Computer and Information Sciences Department University of North Florida Jacksonville, FL, USA A. Basic Computer and Information Sciences
More informationPerformance Management Platform
Open EMS Suite by Nokia Performance Management Platform Functional Overview Version 1.4 Nokia Siemens Networks 1 (16) Performance Management Platform The information in this document is subject to change
More informationApache HBase. Crazy dances on the elephant back
Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage
More informationAn Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database
An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct
More informationSpark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
More informationAnalysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
More informationHadoop on a Low-Budget General Purpose HPC Cluster in Academia
Hadoop on a Low-Budget General Purpose HPC Cluster in Academia Paolo Garza, Paolo Margara, Nicolò Nepote, Luigi Grimaudo, and Elio Piccolo Dipartimento di Automatica e Informatica, Politecnico di Torino,
More informationAccelerating Hadoop MapReduce Using an In-Memory Data Grid
Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for
More informationMapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu
1 MapReduce on GPUs Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 2 MapReduce MAP Shuffle Reduce 3 Hadoop Open-source MapReduce framework from Apache, written in Java Used by Yahoo!, Facebook, Ebay,
More informationIntroduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
More informationCiteSeer x in the Cloud
Published in the 2nd USENIX Workshop on Hot Topics in Cloud Computing 2010 CiteSeer x in the Cloud Pradeep B. Teregowda Pennsylvania State University C. Lee Giles Pennsylvania State University Bhuvan Urgaonkar
More informationHadoop: Embracing future hardware
Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop
More informationBuilding Multilingual Search Index using open source framework
Building Multilingual Search Index using open source framework ABSTRACT Arjun Atreya V 1 Swapnil Chaudhari 1 Pushpak Bhattacharyya 1 Ganesh Ramakrishnan 1 (1) Deptartment of CSE, IIT Bombay {arjun, swapnil,
More informationOn a Hadoop-based Analytics Service System
Int. J. Advance Soft Compu. Appl, Vol. 7, No. 1, March 2015 ISSN 2074-8523 On a Hadoop-based Analytics Service System Mikyoung Lee, Hanmin Jung, and Minhee Cho Korea Institute of Science and Technology
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationManifest for Big Data Pig, Hive & Jaql
Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,
More informationUnstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012
Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012 1 Market Trends Big Data Growing technology deployments are creating an exponential increase in the volume
More informationWHAT S NEW IN SAS 9.4
WHAT S NEW IN SAS 9.4 PLATFORM, HPA & SAS GRID COMPUTING MICHAEL GODDARD CHIEF ARCHITECT SAS INSTITUTE, NEW ZEALAND SAS 9.4 WHAT S NEW IN THE PLATFORM Platform update SAS Grid Computing update Hadoop support
More informationLarge Scale Text Analysis Using the Map/Reduce
Large Scale Text Analysis Using the Map/Reduce Hierarchy David Buttler This work is performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract
More informationAgenda. HPC Software Stack. HPC Post-Processing Visualization. Case Study National Scientific Center. European HPC Benchmark Center Montpellier PSSC
HPC Architecture End to End Alexandre Chauvin Agenda HPC Software Stack Visualization National Scientific Center 2 Agenda HPC Software Stack Alexandre Chauvin Typical HPC Software Stack Externes LAN Typical
More informationQuantcast Petabyte Storage at Half Price with QFS!
9-131 Quantcast Petabyte Storage at Half Price with QFS Presented by Silvius Rus, Director, Big Data Platforms September 2013 Quantcast File System (QFS) A high performance alternative to the Hadoop Distributed
More informationClient Overview. Engagement Situation. Key Requirements
Client Overview Our client is one of the leading providers of business intelligence systems for customers especially in BFSI space that needs intensive data analysis of huge amounts of data for their decision
More informationEvoluzione dell Infrastruttura di Calcolo e Data Analytics per la ricerca
Evoluzione dell Infrastruttura di Calcolo e Data Analytics per la ricerca Carlo Cavazzoni CINECA Supercomputing Application & Innovation www.cineca.it 21 Aprile 2015 FERMI Name: Fermi Architecture: BlueGene/Q
More informationFinding the Needle in a Big Data Haystack. Wolfgang Hoschek (@whoschek) JAX 2014
Finding the Needle in a Big Data Haystack Wolfgang Hoschek (@whoschek) JAX 2014 1 About Wolfgang Software Engineer @ Cloudera Search Platform Team Previously CERN, Lawrence Berkeley National Laboratory,
More informationUnified Batch & Stream Processing Platform
Unified Batch & Stream Processing Platform Himanshu Bari Director Product Management Most Big Data Use Cases Are About Improving/Re-write EXISTING solutions To KNOWN problems Current Solutions Were Built
More informationEnabling the Big Data Commons through indexing of data and their interactions
biomedical and healthcare Data Discovery Index Ecosystem Enabling the Big Data Commons through indexing of and their interactions 2 nd BD2K all-hands meeting Bethesda 11/12/15 Aims 1. Help users find accessible
More informationContent Based Search Add-on API Implemented for Hadoop Ecosystem
International Journal of Research in Engineering and Science (IJRES) ISSN (Online): 2320-9364, ISSN (Print): 2320-9356 Volume 4 Issue 5 ǁ May. 2016 ǁ PP. 23-28 Content Based Search Add-on API Implemented
More informationlocuz.com HPC App Portal V2.0 DATASHEET
locuz.com HPC App Portal V2.0 DATASHEET Ganana HPC App Portal makes it easier for users to run HPC applications without programming and for administrators to better manage their clusters. The web-based
More informationApache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack
Apache Spark Document Analysis Course (Fall 2015 - Scott Sanner) Zahra Iman Some slides from (Matei Zaharia, UC Berkeley / MIT& Harold Liu) Reminder SparkConf JavaSpark RDD: Resilient Distributed Datasets
More informationPerformance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems
Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File
More informationultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
More informationApache Lucene. Searching the Web and Everything Else. Daniel Naber Mindquarry GmbH ID 380
Apache Lucene Searching the Web and Everything Else Daniel Naber Mindquarry GmbH ID 380 AGENDA 2 > What's a search engine > Lucene Java Features Code example > Solr Features Integration > Nutch Features
More informationOverview Motivation MapReduce/Hadoop in a nutshell Experimental cluster hardware example Application areas at the Austrian National Library
Overview Motivation MapReduce/Hadoop in a nutshell Experimental cluster hardware example Application areas at the Austrian National Library Web Archiving Austrian Books Online SCAPE at the Austrian National
More informationTechnical Data Sheet SCADE R17 Solutions for ARINC 661 Compliant Systems Design Environment for Aircraft Manufacturers, CDS and UA Suppliers
661 Solutions for ARINC 661 Compliant Systems SCADE R17 Solutions for ARINC 661 Compliant Systems Design Environment for Aircraft Manufacturers, CDS and UA Suppliers SCADE Solutions for ARINC 661 Compliant
More informationApache Hadoop FileSystem Internals
Apache Hadoop FileSystem Internals Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Storage Developer Conference, San Jose September 22, 2010 http://www.facebook.com/hadoopfs
More informationHadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013
Hadoop Scalable Distributed Computing Claire Jaja, Julian Chan October 8, 2013 What is Hadoop? A general-purpose storage and data-analysis platform Open source Apache software, implemented in Java Enables
More informationBig Data Management and Security
Big Data Management and Security Audit Concerns and Business Risks Tami Frankenfield Sr. Director, Analytics and Enterprise Data Mercury Insurance What is Big Data? Velocity + Volume + Variety = Value
More informationCloud Infrastructure Management - IBM VMControl
Cloud Infrastructure Management - IBM VMControl IBM Systems Director 6.3 VMControl 2.4 Thierry Huche IBM France - Montpellier thierry.huche@fr.ibm.com 2010 IBM Corporation Topics IBM Systems Director /
More informationIntroducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
More informationBenchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment. Sanjay Kulhari, Jian Wen UC Riverside
Benchmark Study on Distributed XML Filtering Using Hadoop Distribution Environment Sanjay Kulhari, Jian Wen UC Riverside Team Sanjay Kulhari M.S. student, CS U C Riverside Jian Wen Ph.D. student, CS U
More informationMobile Storage and Search Engine of Information Oriented to Food Cloud
Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:
More informationHow To Install Linux Titan
Linux Titan Distribution Presented By: Adham Helal Amgad Madkour Ayman El Sayed Emad Zakaria What Is a Linux Distribution? What is a Linux Distribution? The distribution contains groups of packages and
More informationLeveraging the Power of SOLR with SPARK. Johannes Weigend QAware GmbH Germany pache Big Data Europe September 2015
Leveraging the Power of SOLR with SPARK Johannes Weigend QAware GmbH Germany pache Big Data Europe September 2015 Welcome Johannes Weigend - CTO QAware GmbH - Software architect / developer - 25 years
More informationSisense. Product Highlights. www.sisense.com
Sisense Product Highlights Introduction Sisense is a business intelligence solution that simplifies analytics for complex data by offering an end-to-end platform that lets users easily prepare and analyze
More informationDataDirect XQuery Technical Overview
DataDirect XQuery Technical Overview Table of Contents 1. Feature Overview... 2 2. Relational Database Support... 3 3. Performance and Scalability for Relational Data... 3 4. XML Input and Output... 4
More informationTHE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
More informationNoSQL Roadshow Berlin Kai Spichale
Full-text Search with NoSQL Technologies NoSQL Roadshow Berlin Kai Spichale 25.04.2013 About me Kai Spichale Software Engineer at adesso AG Author in professional journals, conference speaker adesso is
More informationIssues in Big Data: Analytics
Session 11413 Issues in Big Data: Analytics Tom Deutsch, tdeutsch@us.ibm.com Program Director, Big Data Bob Foyle, bfoyle@us.ibm.com Sr. Product Manager IBM Content Analytics with Enterprise Search August
More informationThe Open Source Knowledge Discovery and Document Analysis Platform
Enabling Agile Intelligence through Open Analytics The Open Source Knowledge Discovery and Document Analysis Platform 17/10/2012 1 Agenda Introduction and Agenda Problem Definition Knowledge Discovery
More informationAn Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.
An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationProgramming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga
Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.
More informationPerformance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications
Performance Comparison of Intel Enterprise Edition for Lustre software and HDFS for MapReduce Applications Rekha Singhal, Gabriele Pacciucci and Mukesh Gangadhar 2 Hadoop Introduc-on Open source MapReduce
More informationTechnology Partners. Acceleratio Ltd. is a software development company based in Zagreb, Croatia, founded in 2009.
Acceleratio Ltd. is a software development company based in Zagreb, Croatia, founded in 2009. We create innovative software solutions for SharePoint, Office 365, MS Windows Remote Desktop Services, and
More informationWhat s New in MATLAB and Simulink
What s New in MATLAB and Simulink Kevin Cohan Product Marketing, MATLAB Michael Carone Product Marketing, Simulink 2015 The MathWorks, Inc. 1 What was new for Simulink in R2012b? 2 What Was New for MATLAB
More informationAccelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
More informationEnabling High performance Big Data platform with RDMA
Enabling High performance Big Data platform with RDMA Tong Liu HPC Advisory Council Oct 7 th, 2014 Shortcomings of Hadoop Administration tooling Performance Reliability SQL support Backup and recovery
More informationUse of Hadoop File System for Nuclear Physics Analyses in STAR
1 Use of Hadoop File System for Nuclear Physics Analyses in STAR EVAN SANGALINE UC DAVIS Motivations 2 Data storage a key component of analysis requirements Transmission and storage across diverse resources
More informationBig Fast Data Hadoop acceleration with Flash. June 2013
Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional
More informationSpark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1
Spark ΕΡΓΑΣΤΗΡΙΟ 10 Prepared by George Nikolaides 4/19/2015 1 Introduction to Apache Spark Another cluster computing framework Developed in the AMPLab at UC Berkeley Started in 2009 Open-sourced in 2010
More information