Indexing big data with Tika, Solr, and map-reduce
|
|
- Clyde Lucas Leonard
- 8 years ago
- Views:
Transcription
1 Indexing big data with Tika, Solr, and map-reduce Scott Fisher, Erik Hetzner California Digital Library 8 February 2012 Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
2 Outline Introduction Introduction Tika Pig Solr Done! Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
3 Introduction Web Archiving Service Service provided by the California Digital Library Fee-based Archiving web sites, as selected by curators Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
4 Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
5 Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
6 Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
7 Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
8 Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
9 Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
10 Tools Introduction Open source and rails UI for crawl management and display of many focused web crawls. Heritrix - NutchWAX - Wayback Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
11 Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
12 Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
13 Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
14 Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
15 Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
16 Parsing Tika The web can contain anything. Mostly HTML, but PDFs are very important. Not to mention Office Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
17 Parsing Tika The web can contain anything. Mostly HTML, but PDFs are very important. Not to mention Office Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
18 Parsing Tika The web can contain anything. Mostly HTML, but PDFs are very important. Not to mention Office Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
19 Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
20 Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
21 Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
22 Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
23 Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
24 Tika difficulties Tika Some files are slow to parse. Some files blow up your memory. Some file parses never return. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
25 Tika difficulties Tika Some files are slow to parse. Some files blow up your memory. Some file parses never return. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
26 Tika difficulties Tika Some files are slow to parse. Some files blow up your memory. Some file parses never return. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
27 Tika solutions Tika Don t parse files that are too big (e.g. > 2 MB) Fork and monitor process from the outside (Hadoop comes in handy) Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
28 Tika solutions Tika Don t parse files that are too big (e.g. > 2 MB) Fork and monitor process from the outside (Hadoop comes in handy) Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
29 What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
30 What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
31 What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
32 What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
33 What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
34 What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
35 Pig example Pig Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
36 Parse once Lessons learned Parse once! Parsing takes forever. Do it once, store the results. Storing raw text is cheap, compared to all those PDFs, HTML, etc. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
37 Parse once Lessons learned Parse once! Parsing takes forever. Do it once, store the results. Storing raw text is cheap, compared to all those PDFs, HTML, etc. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
38 Parse once Lessons learned Parse once! Parsing takes forever. Do it once, store the results. Storing raw text is cheap, compared to all those PDFs, HTML, etc. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
39 Lessons learned Distribute from the start Use hadoop, pig, or another system to distribute your computing. Don t use an ad-hoc solution. Take the time up front to distribute things. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
40 Lessons learned Distribute from the start Use hadoop, pig, or another system to distribute your computing. Don t use an ad-hoc solution. Take the time up front to distribute things. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
41 Faceting Solr Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
42 Faceting 2 Solr Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
43 Mime types Solr Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
44 Solr XML Solr Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
45 Finale Introduction Be careful when you try to parse at a bunch of files you downloaded from the web. Parse and store. Distribute up front. Build a test index first. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
Tools for Web Archiving: The Java/Open Source Tools to Crawl, Access & Search the Web. NLA Gordon Mohr March 28, 2012
Tools for Web Archiving: The Java/Open Source Tools to Crawl, Access & Search the Web NLA Gordon Mohr March 28, 2012 Overview The tools: Heritrix crawler Wayback browse access Lucene/Hadoop utilities:
More informationCSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
More informationPASIG San Diego March 13, 2015
Archive-It Archiving & Preserving Web Content PASIG San Diego March 13, 2015 We are a Digital Library Founded in 1996 by Brewster Kahle In 2007 of@icially designated a library by the state of California
More informationSector vs. Hadoop. A Brief Comparison Between the Two Systems
Sector vs. Hadoop A Brief Comparison Between the Two Systems Background Sector is a relatively new system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector
More informationAnalysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
More informationArchive-IT Services Andrea Mills Booksgroup Collections Specialist
Getting Started with Archive-IT Services Andrea Mills Booksgroup Collections Specialist Internet Archive Micro History Text Archive Update Archive-IT Services 1996 The Internet Archive is created, with
More informationFull Text Search of Web Archive Collections
Full Text Search of Web Archive Collections IWAW2005 Michael Stack Internet Archive stack@ archive.org Internet Archive IA is a 501(c)(3)non-profit Mission is to build a public Internet digital library.
More informationHadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN
Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current
More informationCreating a billion-scale searchable web archive. Daniel Gomes, Miguel Costa, David Cruz, João Miranda and Simão Fontes
Creating a billion-scale searchable web archive Daniel Gomes, Miguel Costa, David Cruz, João Miranda and Simão Fontes Web archiving initiatives are spreading around the world At least 6.6 PB were archived
More informationBig Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing
Big Data Rethink Algos and Architecture Scott Marsh Manager R&D Personal Lines Auto Pricing Agenda History Map Reduce Algorithms History Google talks about their solutions to their problems Map Reduce:
More informationOpen source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
More informationWEB ARCHIVING AT SCALE
WEB ARCHIVING AT SCALE (updated 12/19/14) by Rosalie Lack, Stephen Abrams, Trisha Cruse THE VALUE OF WEB CONTENT Web content holds a critical place in modern library collections. Web archiving is an essential
More informationApache Lucene. Searching the Web and Everything Else. Daniel Naber Mindquarry GmbH ID 380
Apache Lucene Searching the Web and Everything Else Daniel Naber Mindquarry GmbH ID 380 AGENDA 2 > What's a search engine > Lucene Java Features Code example > Solr Features Integration > Nutch Features
More informationApache Tika for Enabling Metadata Interoperability
Apache Tika for Enabling Metadata Interoperability Apache: Big Data Europe September 28 30, 2015 Budapest, Hungary Presented by Michael Starch (NASA JPL) and Nick Burch (Quan,cate) Proposed by Giuseppe
More informationScalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
More informationBuilding Multilingual Search Index using open source framework
Building Multilingual Search Index using open source framework ABSTRACT Arjun Atreya V 1 Swapnil Chaudhari 1 Pushpak Bhattacharyya 1 Ganesh Ramakrishnan 1 (1) Deptartment of CSE, IIT Bombay {arjun, swapnil,
More informationBig Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
More informationHow To Understand Web Archiving Metadata
Web Archiving Metadata Prepared for RLG Working Group The following document attempts to clarify what metadata is involved in / required for web archiving. It examines: The source of metadata The object
More informationA Service for Data-Intensive Computations on Virtual Clusters
A Service for Data-Intensive Computations on Virtual Clusters Executing Preservation Strategies at Scale Rainer Schmidt, Christian Sadilek, and Ross King rainer.schmidt@arcs.ac.at Planets Project Permanent
More informationMapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues
More informationOpen source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
More informationLegal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND II. PROBLEM AND SOLUTION
Brian Lao - bjlao Karthik Jagadeesh - kjag Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND There is a large need for improved access to legal help. For example,
More informationCloudSearch: A Custom Search Engine based on Apache Hadoop, Apache Nutch and Apache Solr
CloudSearch: A Custom Search Engine based on Apache Hadoop, Apache Nutch and Apache Solr Lambros Charissis, Wahaj Ali University of Crete {lcharisis, ali}@csd.uoc.gr Abstract. Implementing a performant
More informationNoSQL Roadshow Berlin Kai Spichale
Full-text Search with NoSQL Technologies NoSQL Roadshow Berlin Kai Spichale 25.04.2013 About me Kai Spichale Software Engineer at adesso AG Author in professional journals, conference speaker adesso is
More informationApplying Apache Hadoop to NASA s Big Climate Data!
National Aeronautics and Space Administration Applying Apache Hadoop to NASA s Big Climate Data! Use Cases and Lessons Learned! Glenn Tamkin (NASA/CSC)!! Team: John Schnase (NASA/PI), Dan Duffy (NASA/CO),!
More informationIIPC Metadata Workshop. Brad Tofel Vinay Goel Aaron Binns Internet Archive
IIPC Metadata Workshop Brad Tofel Vinay Goel Aaron Binns Internet Archive IIPC Data Analysis Workshop, The Hague, May 13, 2011 How can I implement analysis? IIPC Data Analysis Workshop, The Hague, May
More informationCrisis, Tragedy, and Recovery Network Digital Library (CTRnet) + Web Archiving in Qatar and VT
Crisis, Tragedy, and Recovery Network Digital Library (CTRnet) + Web Archiving in Qatar and VT Edward A. Fox, Seungwon Yang, & CTRnet Team Department of Computer Science, Virginia Tech Workshop at WADL
More informationLeveraging the Power of SOLR with SPARK. Johannes Weigend QAware GmbH Germany pache Big Data Europe September 2015
Leveraging the Power of SOLR with SPARK Johannes Weigend QAware GmbH Germany pache Big Data Europe September 2015 Welcome Johannes Weigend - CTO QAware GmbH - Software architect / developer - 25 years
More informationSearch and Real-Time Analytics on Big Data
Search and Real-Time Analytics on Big Data Sewook Wee, Ryan Tabora, Jason Rutherglen Accenture & Think Big Analytics Strata New York October, 2012 Big Data: data becomes your core asset. It realizes its
More informationMatchmaking in the Cloud: Amazon EC2 and Apache Hadoop at eharmony
Matchmaking in the Cloud: Amazon EC2 and Apache Hadoop at eharmony Speaker logo centered below image Steve Kuo, Software Architect Joshua Tuberville, Software Architect Goal > Leverage EC2 and Hadoop to
More informationVaidya Guide. Table of contents
Table of contents 1 Purpose... 2 2 Prerequisites...2 3 Overview... 2 4 Terminology... 2 5 How to Execute the Hadoop Vaidya Tool...4 6 How to Write and Execute your own Tests... 4 1 Purpose This document
More informationBuilding a master s degree on digital archiving and web archiving. Sara Aubry (IT department, BnF) Clément Oury (Legal Deposit department, BnF)
Building a master s degree on digital archiving and web archiving Sara Aubry (IT department, BnF) Clément Oury (Legal Deposit department, BnF) Objectives of the presentation > Present and discuss an experiment
More informationCase Study : 3 different hadoop cluster deployments
Case Study : 3 different hadoop cluster deployments Lee moon soo moon@nflabs.com HDFS as a Storage Last 4 years, our HDFS clusters, stored Customer 1500 TB+ data safely served 375,000 TB+ data to customer
More informationSPM: Large Scale Performance Monitoring for ElasticSearch HBase Solr & Friends. Otis Gospodnetić Sematext International @otisg @sematext sematext.
SPM: Large Scale Performance Monitoring for ElasticSearch HBase Solr & Friends #spmbuzz #bbuzz Otis Gospodnetić Sematext International @otisg @sematext sematext.com Agenda Introductions SPM Architecture
More informationUnified Batch & Stream Processing Platform
Unified Batch & Stream Processing Platform Himanshu Bari Director Product Management Most Big Data Use Cases Are About Improving/Re-write EXISTING solutions To KNOWN problems Current Solutions Were Built
More informationInvestigating Hadoop for Large Spatiotemporal Processing Tasks
Investigating Hadoop for Large Spatiotemporal Processing Tasks David Strohschein dstrohschein@cga.harvard.edu Stephen Mcdonald stephenmcdonald@cga.harvard.edu Benjamin Lewis blewis@cga.harvard.edu Weihe
More informationHadoop-based Open Source ediscovery: FreeEed. (Easy as popcorn)
+ Hadoop-based Open Source ediscovery: FreeEed (Easy as popcorn) + Hello! 2 Sujee Maniyam & Mark Kerzner Founders @ Elephant Scale consulting and training around Hadoop, Big Data technologies Enterprise
More informationLog Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationFinal Project Proposal. CSCI.6500 Distributed Computing over the Internet
Final Project Proposal CSCI.6500 Distributed Computing over the Internet Qingling Wang 660795696 1. Purpose Implement an application layer on Hybrid Grid Cloud Infrastructure to automatically or at least
More informationWeb and Twitter Archiving at the Library of Congress
Web and Twitter Archiving at the Library of Congress Web Archive Globalization Workshop June 16, 2011 Nicholas Taylor (@nullhandle) Web Archiving Team Library of Congress why archive the web? preserve
More informationTHE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
More informationIntroduction to Hadoop
Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples
More informationKEYSTONE - Short scientific report for STSM visit in TU Delft
KEYSTONE - Short scientific report for STSM visit in TU Delft Visitor: Georgia Kapitsaki Host: TU Delft 1. Introduction This report covers the activities performed during my STSM at Delft University of
More informationOverview Motivation MapReduce/Hadoop in a nutshell Experimental cluster hardware example Application areas at the Austrian National Library
Overview Motivation MapReduce/Hadoop in a nutshell Experimental cluster hardware example Application areas at the Austrian National Library Web Archiving Austrian Books Online SCAPE at the Austrian National
More informationJeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
More informationCollaborative Open Source Software Production & APIs. Tom Cramer Stanford University @tcramer
Collaborative Open Source Software Production & APIs Tom Cramer Stanford University @tcramer Agenda Flavors of Open Source Software 3 Case Studies in Collaborative OSS The Power of APIs What about Web
More informationScalable Cloud Computing
Keijo Heljanko Department of Information and Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 1/44 Business Drivers of Cloud Computing Large data centers allow for economics
More informationThe Trials and Tribulations and ultimate success of parallelisation using Hadoop within the SCAPE project
The Trials and Tribulations and ultimate success of parallelisation using Hadoop within the SCAPE project Alastair Duncan STFC Pre Coffee talk STFC July 2014 SCAPE Scalable Preservation Environments The
More informationHomework: Visual Search and Interaction with NSF and NASA Polar Datasets Due: May 2nd, 2015, 12pm PT
Homework: Visual Search and Interaction with NSF and NASA Polar Datasets Due: May 2nd, 2015, 12pm PT 1. Overview In this assignment you will take your Apache Solr index constructed from Polar data that
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationBig Data for Investment Research Management
IDT Partners www.idtpartners.com Big Data for Investment Research Management Discover how IDT Partners helps Financial Services, Market Research, and Investment Management firms turn big data into actionable
More informationFile S1: Supplementary Information of CloudDOE
File S1: Supplementary Information of CloudDOE Table of Contents 1. Prerequisites of CloudDOE... 2 2. An In-depth Discussion of Deploying a Hadoop Cloud... 2 Prerequisites of deployment... 2 Table S1.
More informationReal Time Analytics for Big Data. NtiSh Nati Shalom @natishalom
Real Time Analytics for Big Data A Twitter Inspired Case Study NtiSh Nati Shalom @natishalom Big Data Predictions Overthe next few years we'll see the adoption of scalable frameworks and platforms for
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationPractical Options for Archiving Social Media
Practical Options for Archiving Social Media Content Summary for ALGIM Web-Symposium Presentation 03/05/11 Euan Cochrane, Senior Advisor, Digital Continuity, Archives New Zealand, The Department of Internal
More informationOptimization of Distributed Crawler under Hadoop
MATEC Web of Conferences 22, 0202 9 ( 2015) DOI: 10.1051/ matecconf/ 2015220202 9 C Owned by the authors, published by EDP Sciences, 2015 Optimization of Distributed Crawler under Hadoop Xiaochen Zhang*
More informationLarge scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
More informationHow To Run Apa Tika On A Microsoft Macbook Or Ipa.Net (For Linux) Or Ipad (For Windows) (For Macbook) (Or Ipa) (On Linux) (Minor) (Large
What's with all the 1s and 0s? Making sense of binary data at scale with Apache Tika Nick Burch CTO, Quanticate Those 1s and 0s Apache Tika the basics Detection Binary formats Text formats Extending Tika
More informationOpen Source Server Product Description
Open Source Server Product Description Disclaimer: The information in this document shall not be disclosed outside System Associates Limited and shall not be duplicated, used, or disclosed in whole or
More informationChapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
More informationFinding the Needle in a Big Data Haystack. Wolfgang Hoschek (@whoschek) JAX 2014
Finding the Needle in a Big Data Haystack Wolfgang Hoschek (@whoschek) JAX 2014 1 About Wolfgang Software Engineer @ Cloudera Search Platform Team Previously CERN, Lawrence Berkeley National Laboratory,
More informationBig Data Infrastructure at Spotify
Big Data Infrastructure at Spotify Wouter de Bie Team Lead Data Infrastructure June 12, 2013 2 Agenda Let s talk about Data Infrastructure, how we did it, what we learned and how we ve failed Some Context
More informationIntroduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
More informationEXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics
BIG DATA WITH HADOOP EXPERIMENTATION HARRISON CARRANZA Marist College APARICIO CARRANZA NYC College of Technology CUNY ECC Conference 2016 Poughkeepsie, NY, June 12-14, 2016 Marist College AGENDA Contents
More informationFrom Wikipedia, the free encyclopedia
Page 1 sur 5 Hadoop From Wikipedia, the free encyclopedia Apache Hadoop is a free Java software framework that supports data intensive distributed applications. [1] It enables applications to work with
More informationBIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
More informationHow to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
More informationScaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant
More informationClick Stream Data Analysis Using Hadoop
Governors State University OPUS Open Portal to University Scholarship Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors State University Sivakrishna
More informationHadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
More informationScalable NetFlow Analysis with Hadoop Yeonhee Lee and Youngseok Lee
Scalable NetFlow Analysis with Hadoop Yeonhee Lee and Youngseok Lee {yhlee06, lee}@cnu.ac.kr http://networks.cnu.ac.kr/~yhlee Chungnam National University, Korea January 8, 2013 FloCon 2013 Contents Introduction
More informationOracle Big Data Fundamentals Ed 1 NEW
Oracle University Contact Us: +90 212 329 6779 Oracle Big Data Fundamentals Ed 1 NEW Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, learn to use Oracle's Integrated Big
More informationApache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
More informationBig Data for Investment Research Management
IDT Partners www.idtpartners.com Big Data for Investment Research Management Discover how IDT Partners helps Financial Services, Market Research, and Investment firms turn big data into actionable research
More informationHadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
More informationWeb Crawling with Apache Nutch
Web Crawling with Apache Nutch Sebastian Nagel snagel@apache.org ApacheCon EU 2014 2014-11-18 About Me computational linguist software developer at Exorbyte (Konstanz, Germany) search and data matching
More informationCloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationJournal of Environmental Science, Computer Science and Engineering & Technology
JECET; March 2015-May 2015; Sec. B; Vol.4.No.2, 202-209. E-ISSN: 2278 179X Journal of Environmental Science, Computer Science and Engineering & Technology An International Peer Review E-3 Journal of Sciences
More informationHPC ABDS: The Case for an Integrating Apache Big Data Stack
HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox gcf@indiana.edu http://www.infomall.org
More informationReal-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH
Real-time Data Analytics mit Elasticsearch Bernhard Pflugfelder inovex GmbH Bernhard Pflugfelder Big Data Engineer @ inovex Fields of interest: search analytics big data bi Working with: Lucene Solr Elasticsearch
More informationthe missing log collector Treasure Data, Inc. Muga Nishizawa
the missing log collector Treasure Data, Inc. Muga Nishizawa Muga Nishizawa (@muga_nishizawa) Chief Software Architect, Treasure Data Treasure Data Overview Founded to deliver big data analytics in days
More informationExtreme Computing. Hadoop. Stratis Viglas. School of Informatics University of Edinburgh sviglas@inf.ed.ac.uk. Stratis Viglas Extreme Computing 1
Extreme Computing Hadoop Stratis Viglas School of Informatics University of Edinburgh sviglas@inf.ed.ac.uk Stratis Viglas Extreme Computing 1 Hadoop Overview Examples Environment Stratis Viglas Extreme
More informationImproving MapReduce Performance in Heterogeneous Environments
UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University of California at Berkeley Motivation 1. MapReduce
More informationInternational Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
More informationScaling Big Data Mining Infrastructure: The Smart Protection Network Experience
Scaling Big Data Mining Infrastructure: The Smart Protection Network Experience 黃 振 修 (Chris Huang) SPN 主 動 式 雲 端 截 毒 技 術 架 構 師 About Me SPN 主 動 式 雲 端 截 毒 技 術 架 構 師 SPN Hadoop 基 礎 運 算 架 構 師 Hadoop in Taiwan
More informationSession: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW
Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop
More informationBigg-Data LLC, Data Scientists Hadoop Developers/Administrators
Bigg-Data LLC, is a Software Solutions Technical training and resources firm. We are the first Professional Solutions Company in the country that specializes in providing big data training and resources.
More informationScalable Computing with Hadoop
Scalable Computing with Hadoop Doug Cutting cutting@apache.org dcutting@yahoo-inc.com 5/4/06 Seek versus Transfer B-Tree requires seek per access unless to recent, cached page so can buffer & pre-sort
More informationA Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
More informationHadoop Architecture and its Usage at Facebook
Hadoop Architecture and its Usage at Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Microsoft Research, Seattle October 16, 2009 Outline Introduction
More informationCS242 PROJECT. Presented by Moloud Shahbazi Spring 2015
CS242 PROJECT Presented by Moloud Shahbazi Spring 2015 AGENDA Project Overview Data Collection Indexing Big Data Processing PROJECT- PART1 1.1 Data Collection: 5G < data size < 10G Deliverables: Document
More informationHadoop Big Data for Processing Data and Performing Workload
Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer
More informationMoving From Hadoop to Spark
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
More informationDeveloping a MapReduce Application
TIE 12206 - Apache Hadoop Tampere University of Technology, Finland November, 2014 Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3 Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3 MapReduce
More informationMap Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
More informationPrepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
More informationShared service components infrastructure for enriching electronic publications with online reading and full-text search
Shared service components infrastructure for enriching electronic publications with online reading and full-text search Nikos HOUSSOS, Panagiotis STATHOPOULOS, Ioanna-Ourania STATHOPOULOU, Andreas KALAITZIS,
More information