The Case for a Portuguese Web Search Engine


1 The Case for a Portuguese Web Search Engine
Mário J. Gaspar da Silva, FCUL/DI and LASIGE/XLDB

Community Web
Community webs have distinct social patterns (Gibson, 1998).
Community webs can be identified through the Web linkage (Kumar, 1999).

2 Portuguese Web
There is an identifiable community web, which we call the Portuguese Web: the web of the people directly related to Portugal.
This is NOT a small community web: 10M population (PT), 3+ M users.
No identifiable topic.

Community Identification
Flake, Lawrence & Giles: a community web is the subset of all web pages that are referenced mostly from within that community web.
Seeds + Expectation-Maximization iterative procedure.
A max-flow/min-cut algorithm performs the E step.
A focused crawler performs the M step.
National web: some tweaking required.
Some sites may be too close to the web graph center.
Many may not have incoming links at all.
More identification features (language, ...).

3 Portuguese Web
Seeds: all sites registered under the .pt TLD; sites hinted by users and verified.
M Step: Web pages hosted linked from a .pt site.
E Step: written in Portuguese, under .COM, .NET, .ORG, .TV and .tk.

Tumba! (Temos um Motor de Busca Alternativo!, "We Have an Alternative Search Engine!")
Public service
Community web search engine
Web archive
Research infrastructure
See it in action at
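The membership criteria on the Portuguese Web slide can be read as a simple predicate. The following is a minimal sketch under that reading, not the Tumba! implementation: the function names, the `linked_from_pt` flag and the `language` code are hypothetical inputs that a crawler and a language detector would have to supply, and user-hinted sites are left out.

```python
from urllib.parse import urlparse

# TLDs listed on the slide for pages hosted outside .pt.
OTHER_TLDS = {"com", "net", "org", "tv", "tk"}


def top_level_domain(url: str) -> str:
    """Return the last label of the URL's host name, lowercased."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def in_portuguese_web(url: str, linked_from_pt: bool, language: str) -> bool:
    """Hypothetical membership test for a crawled page.

    Assumption: a page under .pt always belongs; a page under one of the
    other listed TLDs belongs only if it is linked from a .pt site and is
    written in Portuguese (language code "pt").
    """
    tld = top_level_domain(url)
    if tld == "pt":
        return True
    return tld in OTHER_TLDS and linked_from_pt and language == "pt"
```

A page such as `http://example.com/a` would be accepted only when both extra conditions hold, which mirrors the "linked from a .pt site" and "written in Portuguese" criteria on the slide.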

4 Motivations for a Portuguese Web Search Engine
Sociological: what kind of stuff do Portuguese people look for on the web?
Cultural: information preservation.
Linguistic: what language do these people use to communicate? What vocabulary? An easy way to build corpora.
Security & protection.

Tumba!
Modest effort: 1 professor, 4-5 graduate students, 4-5 servers for 2 years.
Still beta!
Fault tolerance will require substantially more hardware (replication).
Periodic updates will demand more storage.
Full-time operators?
Encouraging feedback.

5 Statistics
Up to 20,000 queries/day.
3.5 million documents under .pt: the deepest crawl!
95% of responses in under 0.5 sec.

Tumba!
[Architecture diagram: Web Crawlers, Repository, Indexing Engine, Ranking Engine, Presentation Engine]

6 Tumba!
[Architecture diagram: Web Crawlers, Repository, Indexing Engine, Ranking Engine, Presentation Engine]

Crawling + archiving
[Diagram labels: .PT DNS Authority, Seed URLs, User Input, Versus (Meta-data Repository), Web, ViúvaNegra (Crawling Engine), WebStore (Contents Repository)]

7 Versus
Basic idea: combine the Versions & Workspaces model for engineering data management with parallel processing techniques.
Designed for web data warehousing applications in general.
Web data repository with a time dimension: the ability to see a web as it was at some point in the past.

Versus - Class Model
[Class diagram with classes <<abstract>> Source, PartitionKey, Layer, Version, VersionProperty and Facet; attribute labels externid, name, value, externaltime, content; multiplicities 1 to * and * to *]
A source is a reference to a Web document.
A version is a snapshot of a source at a given instant.
A layer represents a time unit in the repository.
A partitionkey is a property associated with a source, and therefore with every version of it, used for partitioning.
A versionproperty is a property associated with a certain version.
A facet holds a reference to a content associated with a version.
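The class model described above can be sketched with plain data classes. This is only an illustration of the definitions on the slide: which attribute belongs to which class is an assumption read off the flattened diagram, and `Facet.name`, `Facet.content_key` and `Version.source_id` are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Facet:
    """Holds a reference to a content associated with a version."""
    name: str          # hypothetical label, e.g. "text" or "links"
    content_key: str   # hypothetical reference to content stored elsewhere


@dataclass
class Version:
    """A snapshot of a source at a given instant."""
    source_id: str
    properties: dict[str, str] = field(default_factory=dict)  # versionproperties
    facets: list[Facet] = field(default_factory=list)


@dataclass
class Source:
    """A reference to a Web document."""
    externid: str
    partition_keys: dict[str, str] = field(default_factory=dict)


@dataclass
class Layer:
    """A time unit in the repository, grouping the versions created in it."""
    externaltime: datetime
    versions: list[Version] = field(default_factory=list)
```

Keeping partition keys on the source rather than on each version follows the slide's remark that a partitionkey applies to every version of its source.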

8 Layers & Resolvers
Layer: aggregates a set of versions created within a crawling period.
Resolver: a function that defines the layers upon which a client operates.
Access to a specific version of a content is resolved based on the layers (and search order) to be considered.
Incremental crawls are merged with stored data through this mechanism.

Facets
Alternative views of crawled and archived contents: text view, links view, converted views.
Serves both search engine and preservation requirements.
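The resolver idea on the Layers & Resolvers slide can be illustrated with a small lookup function. This is a sketch under assumptions: layers are modelled as a dictionary from a hypothetical layer name to a dictionary of source id to version, and `resolve` is an illustrative name, not the Versus API.

```python
def resolve(layers, search_order, source_id):
    """Return the first version of source_id found in the selected layers.

    layers: dict mapping layer name -> dict of source id -> version
    search_order: layer names to consult, highest priority first
    """
    for layer_name in search_order:
        version = layers.get(layer_name, {}).get(source_id)
        if version is not None:
            return version
    return None
```

Searching an incremental layer before an older full-crawl layer is one way to obtain the merge described on the slide: pages recrawled recently are served from the new layer, and everything else falls through to the stored data.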

9 WebStore - Architecture
[Diagram labels: Client, VCR API (Java Library), NFS, READ_ONLY and WRITABLE volumes on host1, host2, host3, ..., hostn, RAID disks]
Contents are spread across multiple volumes.
When the storage capacity of a volume is exhausted, it becomes read-only.
New volumes are added as needed to augment capacity or improve performance.

WebStore Software Architecture
[Layer diagram: Clients; WebStore API; WebStore Volume Management (Content Keys, Duplicates Management, Compression); Network File System, Lustre, IBM SAN File System, ...]

10 Content Addressing
WebStore receives content and returns a key.
Duplicates receive the already assigned key.
About 30% of URLs are duplicates.
Key encodes: volume + location of contents in volume + checksum.

Preservation Issues
No backups!
Volumes are implemented with low-cost hard disks as network appliances.
Self-describing volumes accessible through standard protocols.
Assumes human infrastructure. Historical archives also do!
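The content-addressing behaviour described on the Content Addressing slide can be sketched with a toy in-memory store. Everything here is an assumption made for illustration: the class name `ContentStore`, the `volume:location:checksum` key layout, the use of SHA-256 as the checksum, and the list standing in for a disk volume are not taken from WebStore.

```python
import hashlib


class ContentStore:
    """Toy in-memory store; a real volume would live on disk."""

    def __init__(self, volume: str = "vol0"):
        self.volume = volume
        self._blobs = []           # contents in arrival order
        self._key_by_digest = {}   # checksum -> key already assigned

    def put(self, content: bytes) -> str:
        """Store content and return its key; duplicates get the existing key."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._key_by_digest:
            return self._key_by_digest[digest]
        location = len(self._blobs)
        self._blobs.append(content)
        # Key combines volume, location within the volume, and checksum.
        key = f"{self.volume}:{location}:{digest}"
        self._key_by_digest[digest] = key
        return key

    def get(self, key: str) -> bytes:
        """Fetch content by key and verify it against the embedded checksum."""
        _volume, location, digest = key.split(":")
        content = self._blobs[int(location)]
        if hashlib.sha256(content).hexdigest() != digest:
            raise ValueError("checksum mismatch")
        return content
```

Because the checksum is part of the key, a reader can detect corrupted content without consulting any separate index, which fits the slide's emphasis on self-describing volumes.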

11 Tumba!
[Architecture diagram: Web Crawlers, Repository, Indexing Engine, Ranking Engine, Presentation Engine]

SIDRA
Query processing architecture (indexing phase)
[Diagram labels: Versus (Meta-data Repository), Index DataStructs Generator, WebStore (Contents Repository), Page Attributes (Authority), Word Index]

12 SIDRA - Word Index Data Structure
2 files:
Term {docid}
<Term,docID> {hit}
Hit = position + attrib
DocIDs are assigned in static rank order.

SIDRA Index Range Partitioning
[Diagram labels: docids index, hits index]
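The two-file word index described on the SIDRA slide can be mimicked with two dictionaries. This is a minimal sketch, assuming documents arrive already sorted by static rank so that the list position can serve as the docid; `build_index` and the `(term, attrib)` token format are hypothetical, not SIDRA's on-disk layout.

```python
from collections import defaultdict


def build_index(docs_by_rank):
    """Build the two structures from documents sorted best static rank first.

    docs_by_rank: list of documents, each a list of (term, attrib) tokens.
    Returns (docids, hits):
      docids: term -> list of docids, ascending (so in static rank order)
      hits:   (term, docid) -> list of (position, attrib)
    """
    docids = defaultdict(list)
    hits = defaultdict(list)
    for docid, tokens in enumerate(docs_by_rank):
        for position, (term, attrib) in enumerate(tokens):
            # Record each docid once per term, keeping the list ascending.
            if not docids[term] or docids[term][-1] != docid:
                docids[term].append(docid)
            hits[(term, docid)].append((position, attrib))
    return docids, hits
```

Since docids follow static rank, any prefix of a term's docid list already contains that term's best-ranked documents, which is what lets matching stop early.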

13 SIDRA - Ranking Engine
[Diagram labels: Word Index (x3), Query Server, Query Broker, Page Attributes]

Addressing Multi-dimensionality
Generalization: page-rank (a page importance measure) is but one of the possible ranking contexts.
Query Servers may index data according to other dimensions: time, location, ...
Query Brokers perform the results fusion.

14 Tumba!
[Architecture diagram: Web Crawlers, Repository, Indexing Engine, Ranking Engine, Presentation Engine]

Query processing architecture (run-time phase)
[Diagram labels: Page Attributes, Word Index, Query Processing & Ranking Engine, Presentation Engine, Page Attributes, WebStore (Contents Repository)]

15 Tumba! User Interface

Time Navigation
See previous versions of any page. Internet Archive: yes. Tumba! (www.tumba.pt): yes, a query to Versus gives the previous versions stored.
Navigate on a previous date. Internet Archive: experimental. Tumba!: requires definition of a resolve function for each period.
Search at a past time. Internet Archive: not supported. Tumba!: enabled by an appropriate choice of indexes.

16 Conclusion
Tumba! demonstrates a running search engine for a relatively large (national, 10M people) community.
Integrates tools specific to this community:
Term tools (spell checking, pronunciation, dictionary, ...)
Integration with deep webs
Supports archiving for preservation.
Scalable software architecture that operates on low-cost hardware.
Testbed for research activities in Computational Processing of Portuguese.
Q?

17 XMLBASE
João Campos, Mário J. Silva, Versus: A Model for a Web Repository, CRC'01 - 4ª Conferência de Redes de Computadores, Covilhã, Novembro de
Daniel Gomes, Mário J. Silva, Tarântula - Sistema de Recolha de Documentos da Web, CRC'01 - 4ª Conferência de Redes de Computadores, Covilhã, Novembro de
Daniel Gomes, João Campos, Mário J. Silva, Versus: A Web Repository, WDAS-2002: Workshop on Distributed Data & Structures, Paris,

References Tumba!
Bruno Martins, Mário J. Silva, Is it Portuguese? Language detection in large document collections, CRC'01 - 4ª Conferência de Redes de Computadores, Covilhã, Novembro de
Miguel Costa, Mário J. Silva, Ranking no Motor de Busca TUMBA, CRC'01 - 4ª Conferência de Redes de Computadores, Covilhã, Novembro de
Mário J. Silva, Tumba!: Relatório de Actividades de 2003 e Plano para 2003, Relatório Técnico TUMBA TR-02-1, Grupo XLDB da Faculdade de Ciências da Universidade de Lisboa, Dezembro de
Mário J. Silva, The Case for a Portuguese Web Search Engine, Relatório Técnico DI/FCUL TR-03-3, Departamento de Informática da Faculdade de Ciências da Universidade de Lisboa, Março de
Rachel Aires, Sandra Aluísio, Paulo Quaresma, Diana Santos, Mário J. Silva, An initial proposal for cooperative evaluation on information retrieval in Portuguese, Propor' VI Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada, Faro, Junho de (accepted)
José Borbinha, Nuno Freire, Mário J. Silva, Bruno Martins, Internet Search Engines and OPACs: Getting the best of two worlds, ElPub ICCC/IFIP 7th International Conference on Electronic Publishing (accepted)
Mário J. Silva, Tumba! A Web Search and Archive Combination.
Mário J. Silva, The Case for a Portuguese Web Search Engine,

Language Identification
Problem: identify pages in domains other than .pt that are written in Portuguese (during the crawl).
Approach: use a categorization technique based on n-gram analysis (Dunning 1994, Adams 1997).
Open problem: how to identify Brazilian Portuguese pages?
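To make the n-gram idea on the Language Identification slide concrete, here is a deliberately simplified sketch. It scores a text by how many of its character trigrams also occur in a per-language profile; this plain overlap count is a stand-in chosen for brevity, not the statistical method of the cited papers, and the two one-sentence samples are invented here purely for illustration.

```python
def trigrams(text: str) -> set[str]:
    """Return the set of character trigrams of the lowercased text."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


# Tiny illustrative samples (assumption); real profiles need a large corpus.
SAMPLES = {
    "pt": "a web portuguesa é a web das pessoas relacionadas com portugal e escrita em português",
    "en": "the portuguese web is the web of the people related to portugal and written in english",
}
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def guess_language(text: str) -> str:
    """Pick the language whose profile shares the most trigrams with the text."""
    grams = trigrams(text)
    return max(PROFILES, key=lambda lang: len(grams & PROFILES[lang]))
```

The open problem noted on the slide shows up directly in a scheme like this: two closely related varieties share most of their trigrams, so their profiles would score almost identically.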

18 Crawling Policy
Search engine requirements: home page and all important pages; up to X pages; anti-spam protection; frequent updating.
Archival system requirements: anything that is worth preserving; images.
We can support both within our community, with incremental crawling and indexing, multiple indexes and index selection.

Matching & Ranking Algorithm
Phase 1: Query Matching
QueryServers fetch matching docids (pre-sorted in static ranking order).
QueryBrokers merge results using a distributed merge-sort algorithm (preserves ranking order).
Phase 2: Ranking
Pick the first N (1000) results from phase 1.
Compute the final rank using hits data: Are terms also in the title? What is the distance among query terms in the page? Are terms in bold or italic?
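The two phases of the matching and ranking algorithm can be sketched in a few lines. This is an illustration under assumptions, not SIDRA code: each query server is represented by an ascending list of matching docids (lower docid meaning better static rank), `hit_score` is a hypothetical callable standing in for the title, proximity and bold/italic signals, and multi-term matching is left out.

```python
import heapq
from itertools import islice


def broker_merge(server_results):
    """Phase 1: merge per-server docid lists, each already ascending.

    The merge is lazy and keeps the static ranking order of the inputs.
    """
    return heapq.merge(*server_results)


def rerank(candidates, hit_score, n=1000):
    """Phase 2: keep the first n candidates, then order them by hit score.

    Ties fall back to docid, that is, to static rank.
    """
    top = list(islice(candidates, n))
    return sorted(top, key=lambda docid: (-hit_score(docid), docid))
```

Because the merge is lazy, taking only the first N results means the broker never has to materialise the full match list, which is the point of pre-sorting docids by static rank.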

19 Scalability Analysis
[Diagram labels: Word Index (x3), Query Server, Query Broker, Presentation Engine, Page Attributes, VCR (Contents Repository)]
User requests may be balanced among multiple Presentation Engines.
Contents may be replicated.
Requests may be balanced among multiple Query Brokers.
Page Attributes may be replicated.
Query Brokers may balance requests to multiple Query Servers.
Multiple Query Servers for a Word Index.
Word indexes may be replicated.


More information

In Memory Accelerator for MongoDB

In Memory Accelerator for MongoDB In Memory Accelerator for MongoDB Yakov Zhdanov, Director R&D GridGain Systems GridGain: In Memory Computing Leader 5 years in production 100s of customers & users Starts every 10 secs worldwide Over 15,000,000

More information

Scholarly Use of Web Archives

Scholarly Use of Web Archives Scholarly Use of Web Archives Helen Hockx-Yu Head of Web Archiving British Library 15 February 2013 Web Archiving initiatives worldwide http://en.wikipedia.org/wiki/file:map_of_web_archiving_initiatives_worldwide.png

More information

IBM Software Information Management. Scaling strategies for mission-critical discovery and navigation applications

IBM Software Information Management. Scaling strategies for mission-critical discovery and navigation applications IBM Software Information Management Scaling strategies for mission-critical discovery and navigation applications Scaling strategies for mission-critical discovery and navigation applications Contents

More information

IBM Tivoli Storage Manager Version 7.1.4. Introduction to Data Protection Solutions IBM

IBM Tivoli Storage Manager Version 7.1.4. Introduction to Data Protection Solutions IBM IBM Tivoli Storage Manager Version 7.1.4 Introduction to Data Protection Solutions IBM IBM Tivoli Storage Manager Version 7.1.4 Introduction to Data Protection Solutions IBM Note: Before you use this

More information

IAF Business Intelligence Solutions Make the Most of Your Business Intelligence. White Paper November 2002

IAF Business Intelligence Solutions Make the Most of Your Business Intelligence. White Paper November 2002 IAF Business Intelligence Solutions Make the Most of Your Business Intelligence White Paper INTRODUCTION In recent years, the amount of data in companies has increased dramatically as enterprise resource

More information

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

More information

LOAD BALANCING IN WEB SERVER

LOAD BALANCING IN WEB SERVER LOAD BALANCING IN WEB SERVER Renu Tyagi 1, Shaily Chaudhary 2, Sweta Payala 3 UG, 1,2,3 Department of Information & Technology, Raj Kumar Goel Institute of Technology for Women, Gautam Buddh Technical

More information

VMware vsphere Data Protection 6.0

VMware vsphere Data Protection 6.0 VMware vsphere Data Protection 6.0 TECHNICAL OVERVIEW REVISED FEBRUARY 2015 Table of Contents Introduction.... 3 Architectural Overview... 4 Deployment and Configuration.... 5 Backup.... 6 Application

More information

Oracle Warehouse Builder 10g

Oracle Warehouse Builder 10g Oracle Warehouse Builder 10g Architectural White paper February 2004 Table of contents INTRODUCTION... 3 OVERVIEW... 4 THE DESIGN COMPONENT... 4 THE RUNTIME COMPONENT... 5 THE DESIGN ARCHITECTURE... 6

More information

TSM for Advanced Copy Services: Today and Tomorrow

TSM for Advanced Copy Services: Today and Tomorrow TSM for Copy Services and TSM for Advanced Copy Services: Today and Tomorrow Del Hoobler Oxford University TSM Symposium 2007 September 2007 Disclaimer This presentation describes potential future enhancements

More information

CIGRE 2014: Udaljena zaštita podataka

CIGRE 2014: Udaljena zaštita podataka CIGRE 2014: Udaljena zaštita podataka Žarko Stupar Product Manager zstupar@mds.rs "" 1 Agenda Udaljena zaštita podataka - pristup Replikacija podataka između data centara Napredna backup rešenja Replikacija

More information

Moving Data and Distributing Data

Moving Data and Distributing Data Moving Data and Distributing Data Raymond A. Clarke Sr. Enterprise Storage Solutions Specialist, Sun Microsystems - Archive & Backup Solutions SNIA Data Management Forum, Board of Directors 1 Sun s Enterprise

More information

In: Proceedings of RECPAD 2002-12th Portuguese Conference on Pattern Recognition June 27th- 28th, 2002 Aveiro, Portugal

In: Proceedings of RECPAD 2002-12th Portuguese Conference on Pattern Recognition June 27th- 28th, 2002 Aveiro, Portugal Paper Title: Generic Framework for Video Analysis Authors: Luís Filipe Tavares INESC Porto lft@inescporto.pt Luís Teixeira INESC Porto, Universidade Católica Portuguesa lmt@inescporto.pt Luís Corte-Real

More information

Virtual server management: Top tips on managing storage in virtual server environments

Virtual server management: Top tips on managing storage in virtual server environments Tutorial Virtual server management: Top tips on managing storage in virtual server environments Sponsored By: Top five tips for managing storage in a virtual server environment By Eric Siebert, Contributor

More information

Drobo How-To Guide. Use a Drobo iscsi Array as a Target for Veeam Backups

Drobo How-To Guide. Use a Drobo iscsi Array as a Target for Veeam Backups This document shows you how to use a Drobo iscsi SAN Storage array with Veeam Backup & Replication version 5 in a VMware environment. Veeam provides fast disk-based backup and recovery of virtual machines

More information

Ultimate Guide to Oracle Storage

Ultimate Guide to Oracle Storage Ultimate Guide to Oracle Storage Presented by George Trujillo George.Trujillo@trubix.com George Trujillo Twenty two years IT experience with 19 years Oracle experience. Advanced database solutions such

More information

Scalable Internet Services and Load Balancing

Scalable Internet Services and Load Balancing Scalable Services and Load Balancing Kai Shen Services brings ubiquitous connection based applications/services accessible to online users through Applications can be designed and launched quickly and

More information

Apache Lucene. Searching the Web and Everything Else. Daniel Naber Mindquarry GmbH ID 380

Apache Lucene. Searching the Web and Everything Else. Daniel Naber Mindquarry GmbH ID 380 Apache Lucene Searching the Web and Everything Else Daniel Naber Mindquarry GmbH ID 380 AGENDA 2 > What's a search engine > Lucene Java Features Code example > Solr Features Integration > Nutch Features

More information

Search Engines. Stephen Shaw <stesh@netsoc.tcd.ie> 18th of February, 2014. Netsoc

Search Engines. Stephen Shaw <stesh@netsoc.tcd.ie> 18th of February, 2014. Netsoc Search Engines Stephen Shaw Netsoc 18th of February, 2014 Me M.Sc. Artificial Intelligence, University of Edinburgh Would recommend B.A. (Mod.) Computer Science, Linguistics, French,

More information

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc. Oracle BI EE Implementation on Netezza Prepared by SureShot Strategies, Inc. The goal of this paper is to give an insight to Netezza architecture and implementation experience to strategize Oracle BI EE

More information

Qlik Sense scalability

Qlik Sense scalability Qlik Sense scalability Visual analytics platform Qlik Sense is a visual analytics platform powered by an associative, in-memory data indexing engine. Based on users selections, calculations are computed

More information

Data Warehouses in the Path from Databases to Archives

Data Warehouses in the Path from Databases to Archives Data Warehouses in the Path from Databases to Archives Gabriel David FEUP / INESC-Porto This position paper describes a research idea submitted for funding at the Portuguese Research Agency. Introduction

More information

I/O Considerations in Big Data Analytics

I/O Considerations in Big Data Analytics Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very

More information

Philadelphia Area SharePoint User Group June 27th, 2012. Ken Choyce RJB Technical Consulting, Inc. www.rjbtech.com

Philadelphia Area SharePoint User Group June 27th, 2012. Ken Choyce RJB Technical Consulting, Inc. www.rjbtech.com Philadelphia Area SharePoint User Group June 27th, 2012 Ken Choyce RJB Technical Consulting, Inc. www.rjbtech.com Agenda Overview of SharePoint Backups Purpose of backing up SharePoint Levels of SharePoint

More information

References. Introduction to Database Systems CSE 444. Motivation. Basic Features. Outline: Database in the Cloud. Outline

References. Introduction to Database Systems CSE 444. Motivation. Basic Features. Outline: Database in the Cloud. Outline References Introduction to Database Systems CSE 444 Lecture 24: Databases as a Service YongChul Kwon Amazon SimpleDB Website Part of the Amazon Web services Google App Engine Datastore Website Part of

More information

Introduction to Database Systems CSE 444

Introduction to Database Systems CSE 444 Introduction to Database Systems CSE 444 Lecture 24: Databases as a Service YongChul Kwon References Amazon SimpleDB Website Part of the Amazon Web services Google App Engine Datastore Website Part of

More information

IDENTIFYING AND OPTIMIZING DATA DUPLICATION BY EFFICIENT MEMORY ALLOCATION IN REPOSITORY BY SINGLE INSTANCE STORAGE

IDENTIFYING AND OPTIMIZING DATA DUPLICATION BY EFFICIENT MEMORY ALLOCATION IN REPOSITORY BY SINGLE INSTANCE STORAGE IDENTIFYING AND OPTIMIZING DATA DUPLICATION BY EFFICIENT MEMORY ALLOCATION IN REPOSITORY BY SINGLE INSTANCE STORAGE 1 M.PRADEEP RAJA, 2 R.C SANTHOSH KUMAR, 3 P.KIRUTHIGA, 4 V. LOGESHWARI 1,2,3 Student,

More information

PUBMED: an efficient biomedical based hierarchical search engine ABSTRACT:

PUBMED: an efficient biomedical based hierarchical search engine ABSTRACT: PUBMED: an efficient biomedical based hierarchical search engine ABSTRACT: Search queries on biomedical databases, such as PubMed, often return a large number of results, only a small subset of which is

More information

Open Source IR Tools and Libraries

Open Source IR Tools and Libraries Open Source IR Tools and Libraries Giorgos Vasiliadis, gvasil@csd.uoc.gr CS-463 Information Retrieval Models Computer Science Department University of Crete 1 Outline Google Search API Lucene Terrier Lemur

More information

Actifio Big Data Director. Virtual Data Pipeline for Unstructured Data

Actifio Big Data Director. Virtual Data Pipeline for Unstructured Data Actifio Big Data Director Virtual Data Pipeline for Unstructured Data Contact Actifio Support As an Actifio customer, you can get support for all Actifio products through the Support Portal at http://support.actifio.com/.

More information

Knowledge-Based Systems IS430. Mostafa Z. Ali

Knowledge-Based Systems IS430. Mostafa Z. Ali Winter 2009 Knowledge-Based Systems IS430 Data Warehousing Lesson 6 Mostafa Z. Ali mzali@just.edu.jo Lecture 2: Slide 1 Learning Objectives Understand the basic definitions and concepts of data warehouses

More information

Nutanix Tech Note. Data Protection and Disaster Recovery

Nutanix Tech Note. Data Protection and Disaster Recovery Nutanix Tech Note Data Protection and Disaster Recovery Nutanix Virtual Computing Platform is engineered from the ground-up to provide enterprise-grade availability for critical virtual machines and data.

More information