Indexing big data with Tika, Solr, and map-reduce
|
|
|
- Clyde Lucas Leonard
- 10 years ago
- Views:
Transcription
1 Indexing big data with Tika, Solr, and map-reduce Scott Fisher, Erik Hetzner California Digital Library 8 February 2012 Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
2 Outline Introduction Introduction Tika Pig Solr Done! Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
3 Introduction Web Archiving Service Service provided by the California Digital Library Fee-based Archiving web sites, as selected by curators Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
4 Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
5 Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
6 Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
7 Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
8 Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
9 Vital statistics Introduction 43 public archives 18 partners 58k crawls, 35k viewable by public 7535 sites 600 million URLs 40+ TB Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
10 Tools Introduction Open source and rails UI for crawl management and display of many focused web crawls. Heritrix - NutchWAX - Wayback Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
11 Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
12 Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
13 Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
14 Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
15 Nutch search Introduction Using Nutch for full text indexing Nutch is slowing down... Nutchwax (nutch + web archiving) is no longer supported Nutch search is no longer default with Nutch itself Deduplicating content requires a more sophisticated index. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
16 Parsing Tika The web can contain anything. Mostly HTML, but PDFs are very important. Not to mention Office Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
17 Parsing Tika The web can contain anything. Mostly HTML, but PDFs are very important. Not to mention Office Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
18 Parsing Tika The web can contain anything. Mostly HTML, but PDFs are very important. Not to mention Office Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
19 Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
20 Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
21 Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
22 Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
23 Tika Tika Apache software project Java Wraps parsers for different file types in a uniform interface. Parses most common file types. Use the same code to parse different types. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
24 Tika difficulties Tika Some files are slow to parse. Some files blow up your memory. Some file parses never return. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
25 Tika difficulties Tika Some files are slow to parse. Some files blow up your memory. Some file parses never return. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
26 Tika difficulties Tika Some files are slow to parse. Some files blow up your memory. Some file parses never return. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
27 Tika solutions Tika Don t parse files that are too big (e.g. > 2 MB) Fork and monitor process from the outside (Hadoop comes in handy) Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
28 Tika solutions Tika Don t parse files that are too big (e.g. > 2 MB) Fork and monitor process from the outside (Hadoop comes in handy) Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
29 What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
30 What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
31 What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
32 What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
33 What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
34 What is Pig? Pig Platform for data analysis from Apache. Based on Hadoop. fault tolerant distributed processing Can be used for ad-hoc analysis, without writing Java code. Embraced by the Internet Archive. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
35 Pig example Pig Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
36 Parse once Lessons learned Parse once! Parsing takes forever. Do it once, store the results. Storing raw text is cheap, compared to all those PDFs, HTML, etc. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
37 Parse once Lessons learned Parse once! Parsing takes forever. Do it once, store the results. Storing raw text is cheap, compared to all those PDFs, HTML, etc. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
38 Parse once Lessons learned Parse once! Parsing takes forever. Do it once, store the results. Storing raw text is cheap, compared to all those PDFs, HTML, etc. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
39 Lessons learned Distribute from the start Use hadoop, pig, or another system to distribute your computing. Don t use an ad-hoc solution. Take the time up front to distribute things. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
40 Lessons learned Distribute from the start Use hadoop, pig, or another system to distribute your computing. Don t use an ad-hoc solution. Take the time up front to distribute things. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
41 Faceting Solr Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
42 Faceting 2 Solr Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
43 Mime types Solr Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
44 Solr XML Solr Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
45 Finale Introduction Be careful when you try to parse at a bunch of files you downloaded from the web. Parse and store. Distribute up front. Build a test index first. Scott Fisher, Erik Hetzner (CDL) Indexing big data 8 February / 19
Tools for Web Archiving: The Java/Open Source Tools to Crawl, Access & Search the Web. NLA Gordon Mohr March 28, 2012
Tools for Web Archiving: The Java/Open Source Tools to Crawl, Access & Search the Web NLA Gordon Mohr March 28, 2012 Overview The tools: Heritrix crawler Wayback browse access Lucene/Hadoop utilities:
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Sector vs. Hadoop. A Brief Comparison Between the Two Systems
Sector vs. Hadoop A Brief Comparison Between the Two Systems Background Sector is a relatively new system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector
Analysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
Archive-IT Services Andrea Mills Booksgroup Collections Specialist
Getting Started with Archive-IT Services Andrea Mills Booksgroup Collections Specialist Internet Archive Micro History Text Archive Update Archive-IT Services 1996 The Internet Archive is created, with
Full Text Search of Web Archive Collections
Full Text Search of Web Archive Collections IWAW2005 Michael Stack Internet Archive stack@ archive.org Internet Archive IA is a 501(c)(3)non-profit Mission is to build a public Internet digital library.
Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN
Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
WEB ARCHIVING AT SCALE
WEB ARCHIVING AT SCALE (updated 12/19/14) by Rosalie Lack, Stephen Abrams, Trisha Cruse THE VALUE OF WEB CONTENT Web content holds a critical place in modern library collections. Web archiving is an essential
Apache Lucene. Searching the Web and Everything Else. Daniel Naber Mindquarry GmbH ID 380
Apache Lucene Searching the Web and Everything Else Daniel Naber Mindquarry GmbH ID 380 AGENDA 2 > What's a search engine > Lucene Java Features Code example > Solr Features Integration > Nutch Features
Apache Tika for Enabling Metadata Interoperability
Apache Tika for Enabling Metadata Interoperability Apache: Big Data Europe September 28 30, 2015 Budapest, Hungary Presented by Michael Starch (NASA JPL) and Nick Burch (Quan,cate) Proposed by Giuseppe
Scalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
Building Multilingual Search Index using open source framework
Building Multilingual Search Index using open source framework ABSTRACT Arjun Atreya V 1 Swapnil Chaudhari 1 Pushpak Bhattacharyya 1 Ganesh Ramakrishnan 1 (1) Deptartment of CSE, IIT Bombay {arjun, swapnil,
Big Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
How To Understand Web Archiving Metadata
Web Archiving Metadata Prepared for RLG Working Group The following document attempts to clarify what metadata is involved in / required for web archiving. It examines: The source of metadata The object
A Service for Data-Intensive Computations on Virtual Clusters
A Service for Data-Intensive Computations on Virtual Clusters Executing Preservation Strategies at Scale Rainer Schmidt, Christian Sadilek, and Ross King [email protected] Planets Project Permanent
MapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside [email protected] About Journey Dynamics Founded in 2006 to develop software technology to address the issues
Open source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND II. PROBLEM AND SOLUTION
Brian Lao - bjlao Karthik Jagadeesh - kjag Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND There is a large need for improved access to legal help. For example,
CloudSearch: A Custom Search Engine based on Apache Hadoop, Apache Nutch and Apache Solr
CloudSearch: A Custom Search Engine based on Apache Hadoop, Apache Nutch and Apache Solr Lambros Charissis, Wahaj Ali University of Crete {lcharisis, ali}@csd.uoc.gr Abstract. Implementing a performant
NoSQL Roadshow Berlin Kai Spichale
Full-text Search with NoSQL Technologies NoSQL Roadshow Berlin Kai Spichale 25.04.2013 About me Kai Spichale Software Engineer at adesso AG Author in professional journals, conference speaker adesso is
Leveraging the Power of SOLR with SPARK. Johannes Weigend QAware GmbH Germany pache Big Data Europe September 2015
Leveraging the Power of SOLR with SPARK Johannes Weigend QAware GmbH Germany pache Big Data Europe September 2015 Welcome Johannes Weigend - CTO QAware GmbH - Software architect / developer - 25 years
Search and Real-Time Analytics on Big Data
Search and Real-Time Analytics on Big Data Sewook Wee, Ryan Tabora, Jason Rutherglen Accenture & Think Big Analytics Strata New York October, 2012 Big Data: data becomes your core asset. It realizes its
Vaidya Guide. Table of contents
Table of contents 1 Purpose... 2 2 Prerequisites...2 3 Overview... 2 4 Terminology... 2 5 How to Execute the Hadoop Vaidya Tool...4 6 How to Write and Execute your own Tests... 4 1 Purpose This document
Building a master s degree on digital archiving and web archiving. Sara Aubry (IT department, BnF) Clément Oury (Legal Deposit department, BnF)
Building a master s degree on digital archiving and web archiving Sara Aubry (IT department, BnF) Clément Oury (Legal Deposit department, BnF) Objectives of the presentation > Present and discuss an experiment
Case Study : 3 different hadoop cluster deployments
Case Study : 3 different hadoop cluster deployments Lee moon soo [email protected] HDFS as a Storage Last 4 years, our HDFS clusters, stored Customer 1500 TB+ data safely served 375,000 TB+ data to customer
SPM: Large Scale Performance Monitoring for ElasticSearch HBase Solr & Friends. Otis Gospodnetić Sematext International @otisg @sematext sematext.
SPM: Large Scale Performance Monitoring for ElasticSearch HBase Solr & Friends #spmbuzz #bbuzz Otis Gospodnetić Sematext International @otisg @sematext sematext.com Agenda Introductions SPM Architecture
Unified Batch & Stream Processing Platform
Unified Batch & Stream Processing Platform Himanshu Bari Director Product Management Most Big Data Use Cases Are About Improving/Re-write EXISTING solutions To KNOWN problems Current Solutions Were Built
Investigating Hadoop for Large Spatiotemporal Processing Tasks
Investigating Hadoop for Large Spatiotemporal Processing Tasks David Strohschein [email protected] Stephen Mcdonald [email protected] Benjamin Lewis [email protected] Weihe
Hadoop-based Open Source ediscovery: FreeEed. (Easy as popcorn)
+ Hadoop-based Open Source ediscovery: FreeEed (Easy as popcorn) + Hello! 2 Sujee Maniyam & Mark Kerzner Founders @ Elephant Scale consulting and training around Hadoop, Big Data technologies Enterprise
Log Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Final Project Proposal. CSCI.6500 Distributed Computing over the Internet
Final Project Proposal CSCI.6500 Distributed Computing over the Internet Qingling Wang 660795696 1. Purpose Implement an application layer on Hybrid Grid Cloud Infrastructure to automatically or at least
THE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
Introduction to Hadoop
Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh [email protected] October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples
Overview Motivation MapReduce/Hadoop in a nutshell Experimental cluster hardware example Application areas at the Austrian National Library
Overview Motivation MapReduce/Hadoop in a nutshell Experimental cluster hardware example Application areas at the Austrian National Library Web Archiving Austrian Books Online SCAPE at the Austrian National
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
Scalable Cloud Computing
Keijo Heljanko Department of Information and Computer Science School of Science Aalto University [email protected] 1/44 Business Drivers of Cloud Computing Large data centers allow for economics
Homework: Visual Search and Interaction with NSF and NASA Polar Datasets Due: May 2nd, 2015, 12pm PT
Homework: Visual Search and Interaction with NSF and NASA Polar Datasets Due: May 2nd, 2015, 12pm PT 1. Overview In this assignment you will take your Apache Solr index constructed from Polar data that
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Big Data for Investment Research Management
IDT Partners www.idtpartners.com Big Data for Investment Research Management Discover how IDT Partners helps Financial Services, Market Research, and Investment Management firms turn big data into actionable
File S1: Supplementary Information of CloudDOE
File S1: Supplementary Information of CloudDOE Table of Contents 1. Prerequisites of CloudDOE... 2 2. An In-depth Discussion of Deploying a Hadoop Cloud... 2 Prerequisites of deployment... 2 Table S1.
Real Time Analytics for Big Data. NtiSh Nati Shalom @natishalom
Real Time Analytics for Big Data A Twitter Inspired Case Study NtiSh Nati Shalom @natishalom Big Data Predictions Overthe next few years we'll see the adoption of scalable frameworks and platforms for
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
Practical Options for Archiving Social Media
Practical Options for Archiving Social Media Content Summary for ALGIM Web-Symposium Presentation 03/05/11 Euan Cochrane, Senior Advisor, Digital Continuity, Archives New Zealand, The Department of Internal
Optimization of Distributed Crawler under Hadoop
MATEC Web of Conferences 22, 0202 9 ( 2015) DOI: 10.1051/ matecconf/ 2015220202 9 C Owned by the authors, published by EDP Sciences, 2015 Optimization of Distributed Crawler under Hadoop Xiaochen Zhang*
Large scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
How To Run Apa Tika On A Microsoft Macbook Or Ipa.Net (For Linux) Or Ipad (For Windows) (For Macbook) (Or Ipa) (On Linux) (Minor) (Large
What's with all the 1s and 0s? Making sense of binary data at scale with Apache Tika Nick Burch CTO, Quanticate Those 1s and 0s Apache Tika the basics Detection Binary formats Text formats Extending Tika
Open Source Server Product Description
Open Source Server Product Description Disclaimer: The information in this document shall not be disclosed outside System Associates Limited and shall not be duplicated, used, or disclosed in whole or
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
Finding the Needle in a Big Data Haystack. Wolfgang Hoschek (@whoschek) JAX 2014
Finding the Needle in a Big Data Haystack Wolfgang Hoschek (@whoschek) JAX 2014 1 About Wolfgang Software Engineer @ Cloudera Search Platform Team Previously CERN, Lawrence Berkeley National Laboratory,
Big Data Infrastructure at Spotify
Big Data Infrastructure at Spotify Wouter de Bie Team Lead Data Infrastructure June 12, 2013 2 Agenda Let s talk about Data Infrastructure, how we did it, what we learned and how we ve failed Some Context
Introduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics
BIG DATA WITH HADOOP EXPERIMENTATION HARRISON CARRANZA Marist College APARICIO CARRANZA NYC College of Technology CUNY ECC Conference 2016 Poughkeepsie, NY, June 12-14, 2016 Marist College AGENDA Contents
BIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
Scaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf
Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant
Click Stream Data Analysis Using Hadoop
Governors State University OPUS Open Portal to University Scholarship Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors State University Sivakrishna
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected]
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
Scalable NetFlow Analysis with Hadoop Yeonhee Lee and Youngseok Lee
Scalable NetFlow Analysis with Hadoop Yeonhee Lee and Youngseok Lee {yhlee06, lee}@cnu.ac.kr http://networks.cnu.ac.kr/~yhlee Chungnam National University, Korea January 8, 2013 FloCon 2013 Contents Introduction
Oracle Big Data Fundamentals Ed 1 NEW
Oracle University Contact Us: +90 212 329 6779 Oracle Big Data Fundamentals Ed 1 NEW Duration: 5 Days What you will learn In the Oracle Big Data Fundamentals course, learn to use Oracle's Integrated Big
Apache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
Big Data for Investment Research Management
IDT Partners www.idtpartners.com Big Data for Investment Research Management Discover how IDT Partners helps Financial Services, Market Research, and Investment firms turn big data into actionable research
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
Web Crawling with Apache Nutch
Web Crawling with Apache Nutch Sebastian Nagel [email protected] ApacheCon EU 2014 2014-11-18 About Me computational linguist software developer at Exorbyte (Konstanz, Germany) search and data matching
Cloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Journal of Environmental Science, Computer Science and Engineering & Technology
JECET; March 2015-May 2015; Sec. B; Vol.4.No.2, 202-209. E-ISSN: 2278 179X Journal of Environmental Science, Computer Science and Engineering & Technology An International Peer Review E-3 Journal of Sciences
HPC ABDS: The Case for an Integrating Apache Big Data Stack
HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox [email protected] http://www.infomall.org
Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH
Real-time Data Analytics mit Elasticsearch Bernhard Pflugfelder inovex GmbH Bernhard Pflugfelder Big Data Engineer @ inovex Fields of interest: search analytics big data bi Working with: Lucene Solr Elasticsearch
the missing log collector Treasure Data, Inc. Muga Nishizawa
the missing log collector Treasure Data, Inc. Muga Nishizawa Muga Nishizawa (@muga_nishizawa) Chief Software Architect, Treasure Data Treasure Data Overview Founded to deliver big data analytics in days
Extreme Computing. Hadoop. Stratis Viglas. School of Informatics University of Edinburgh [email protected]. Stratis Viglas Extreme Computing 1
Extreme Computing Hadoop Stratis Viglas School of Informatics University of Edinburgh [email protected] Stratis Viglas Extreme Computing 1 Hadoop Overview Examples Environment Stratis Viglas Extreme
Improving MapReduce Performance in Heterogeneous Environments
UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University of California at Berkeley Motivation 1. MapReduce
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
Scaling Big Data Mining Infrastructure: The Smart Protection Network Experience
Scaling Big Data Mining Infrastructure: The Smart Protection Network Experience 黃 振 修 (Chris Huang) SPN 主 動 式 雲 端 截 毒 技 術 架 構 師 About Me SPN 主 動 式 雲 端 截 毒 技 術 架 構 師 SPN Hadoop 基 礎 運 算 架 構 師 Hadoop in Taiwan
Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW
Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop
Bigg-Data LLC, Data Scientists Hadoop Developers/Administrators
Bigg-Data LLC, is a Software Solutions Technical training and resources firm. We are the first Professional Solutions Company in the country that specializes in providing big data training and resources.
Scalable Computing with Hadoop
Scalable Computing with Hadoop Doug Cutting [email protected] [email protected] 5/4/06 Seek versus Transfer B-Tree requires seek per access unless to recent, cached page so can buffer & pre-sort
A Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
Hadoop Architecture and its Usage at Facebook
Hadoop Architecture and its Usage at Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System [email protected] Presented at Microsoft Research, Seattle October 16, 2009 Outline Introduction
CS242 PROJECT. Presented by Moloud Shahbazi Spring 2015
CS242 PROJECT Presented by Moloud Shahbazi Spring 2015 AGENDA Project Overview Data Collection Indexing Big Data Processing PROJECT- PART1 1.1 Data Collection: 5G < data size < 10G Deliverables: Document
Hadoop Big Data for Processing Data and Performing Workload
Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer
Moving From Hadoop to Spark
+ Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com [email protected] Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee
Developing a MapReduce Application
TIE 12206 - Apache Hadoop Tampere University of Technology, Finland November, 2014 Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3 Outline 1 MapReduce Paradigm 2 Hadoop Default Ports 3 MapReduce
Map Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
Shared service components infrastructure for enriching electronic publications with online reading and full-text search
Shared service components infrastructure for enriching electronic publications with online reading and full-text search Nikos HOUSSOS, Panagiotis STATHOPOULOS, Ioanna-Ourania STATHOPOULOU, Andreas KALAITZIS,
