Mark Bennett. Search and the Virtual Machine

Similar documents
Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies

VMware Virtual Infrastucture From the Virtualized to the Automated Data Center

Intro to Virtualization

IOS110. Virtualization 5/27/2014 1

Rackspace Cloud Databases and Container-based Virtualization

BI4.x Architecture SAP CEG & GTM BI

Data Centers and Cloud Computing

Architecting ColdFusion For Scalability And High Availability. Ryan Stewart Platform Evangelist

Virtualization. Dr. Yingwu Zhu

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

Technology Insight Series

A Middleware Strategy to Survive Compute Peak Loads in Cloud

JBoss Seam Performance and Scalability on Dell PowerEdge 1855 Blade Servers

2009 AAMGA Automation Conference

HP OO 10.X - SiteScope Monitoring Templates

Best Practices for Virtualised SharePoint

Lecture 2 Cloud Computing & Virtualization. Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Migration and Building of Data Centers in IBM SoftLayer with the RackWare Management Module

Migration and Building of Data Centers in IBM SoftLayer with the RackWare Management Module

In Memory Accelerator for MongoDB

Cloud Computing. Chapter 8 Virtualization

Cloud computing is a marketing term that means different things to different people. In this presentation, we look at the pros and cons of using

Assignment # 1 (Cloud Computing Security)

Scaling Analysis Services in the Cloud

High Availability Solutions for the MariaDB and MySQL Database

CloudCenter Full Lifecycle Management. An application-defined approach to deploying and managing applications in any datacenter or cloud environment

Data Centers and Cloud Computing. Data Centers

How To Choose Between A Relational Database Service From Aws.Com


Database Scalability and Oracle 12c

Avoid Paying The Virtualization Tax: Deploying Virtualized BI 4.0 The Right Way. Ashish C. Morzaria, SAP

Virtualization for Cloud Computing

White Paper. Recording Server Virtualization

Cloud Computing. Adam Barker

GeoCloud Project Report USGS/EROS Spatial Data Warehouse Project

Microkernels, virtualization, exokernels. Tutorial 1 CSC469

Data Centers and Cloud Computing. Data Centers. MGHPCC Data Center. Inside a Data Center

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Automating Big Data Benchmarking for Different Architectures with ALOJA

Cloud Computing: Making the right choices

Virtualization and the U2 Databases

Basics of Virtualisation

Maintaining Non-Stop Services with Multi Layer Monitoring

Capacity planning with Microsoft System Center

HP Client Automation Standard Fast Track guide

TechTarget Windows Media

Basics in Energy Information (& Communication) Systems Virtualization / Virtual Machines

Building a Scalable News Feed Web Service in Clojure

IN DETAIL. Smart & Dedicated Servers

C a r l G o e t h a l s T e r r e m a r k E u r o p e. C a r l. g o e t h a l t e r r e m a r k. c o m

Technical Paper. Moving SAS Applications from a Physical to a Virtual VMware Environment

Cloud Based Application Architectures using Smart Computing

Ø Teaching Evaluations. q Open March 3 through 16. Ø Final Exam. q Thursday, March 19, 4-7PM. Ø 2 flavors: q Public Cloud, available to public

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

HRG Assessment: Stratus everrun Enterprise

ZEN LOAD BALANCER EE v3.04 DATASHEET The Load Balancing made easy

NoSQL and Hadoop Technologies On Oracle Cloud

The Virtualization Practice

SAP Private Cloud Case Studies Real World Customer Case Study

Mobile Cloud Computing T Open Source IaaS

Client Hardware and Infrastructure Suggested Best Practices

COS 318: Operating Systems. Virtual Machine Monitors

ArcGIS for Server in the Amazon Cloud. Michele Lundeen Esri

Microsoft Exchange Solutions on VMware

2) Xen Hypervisor 3) UEC

Virtual Machine Monitors. Dr. Marc E. Fiuczynski Research Scholar Princeton University

Virtualization: Know your options on Ubuntu. Nick Barcet. Ubuntu Server Product Manager

WebLogic on Oracle Database Appliance: Combining High Availability and Simplicity

Energy Efficient MapReduce

Citrix Training. Course: Citrix Training. Duration: 40 hours. Mode of Training: Classroom (Instructor-Led)

Cloud Hosting. Quick Guide 7/30/ EarthLink. Trademarks are property of their respective owners. All rights reserved.

Prepared for: How to Become Cloud Backup Provider

CSE 501 Monday, September 09, 2013 Kevin Cleary

Availability for your modern datacenter

Geoff Raines Cloud Engineer

Managing your Red Hat Enterprise Linux guests with RHN Satellite

Managing Traditional Workloads Together with Cloud Computing Workloads

An Oracle White Paper July Oracle Primavera Contract Management, Business Intelligence Publisher Edition-Sizing Guide

VegaStream Tutorial - The Advantages & Disadvantages of Using Virtual Machines

What s New with VMware Virtual Infrastructure

Introduction What is the cloud

IOmark- VDI. Nimbus Data Gemini Test Report: VDI a Test Report Date: 6, September

RingStor User Manual. Version 2.1 Last Update on September 17th, RingStor, Inc. 197 Route 18 South, Ste 3000 East Brunswick, NJ

STeP-IN SUMMIT June 18 21, 2013 at Bangalore, INDIA. Performance Testing of an IAAS Cloud Software (A CloudStack Use Case)

Big data blue print for cloud architecture

VMware vcenter Server 5.5 Deployment Guide TECHNICAL MARKETING DOCUMENTATION V 1.0/NOVEMBER 2013/JUSTIN KING

Integrated Application and Data Protection. NEC ExpressCluster White Paper

The future is in the management tools. Profoss 22/01/2008

Introduction to Cloud Computing

Configuring and Managing a Private Cloud with Enterprise Manager 12c

Who is SharePoint Joel?

Virtualization. Jukka K. Nurminen

Networking in the Hadoop Cluster

Virtual Computing and VMWare. Module 4

Cloud Optimize Your IT

Networking for Caribbean Development

Microsoft Hyper-V chose a Primary Server Virtualization Platform

Server Virtualization Cloud Partner Training Series

Cost effective methods of test environment management. Prabhu Meruga Director - Solution Engineering 16 th July SCQAA Irvine, CA

MySQL Cluster New Features. Johan Andersson MySQL Cluster Consulting johan.andersson@sun.com

Transcription:

Mark Bennett Search and the Virtual Machine

Agenda Intro / Business Drivers What to do with Search + Virtual What Makes Search Fast (or Slow!) Virtual Platforms Test Results Trends / Wrap Up / Q & A

Business Drivers? Virtual = Flexible Virtual machines give you flexibility in testing and deployment. But perhaps not short term cost savings

Draws on Many Areas General Virtualization Advantages Search Engine Architecture Search Engine Performance Hardware, OS and Networking Distributed and Parallel Computing Vendor Licensing and Pricing Policies

Key Issues for Virtual + Search: Vendor support Search engine architecture Automation of common tasks Predicting performance Training and buy in from IT and Dev Licensing ( tech & legal )

Search Tasks that Virtualization Can Help With

Search on Virtual Tasks/Benefits General Benefits Standardized machine setups/images Snapshot of complex Restore Points In Sync on OS, Java and Vendor patches Search Engine specific benefits Multi node Dev environment & Frame of Mind SAME topology in Dev, Stage & Prod farms Failover and rollout flexibility

Validation Tasks System bootstrapping Have we completely documented bare machine to a customized and running search farm? Multi Node Installs ssh & keys, search engine config & licensing Testing Failover and Recovery scenarios Network firewall and security issues Distributed search uses LOTS of ports! Security issues SSO, LDAP, search engine passwords Spider Über Login

Simulating Hardware Failures Kill a machine, validate failover, recovery Kill a disk, validate recovery, failover VMWare hardware abstraction layer

What About Search Engine Performance?

Fundamental Performance Questions When does it really matter Prod vs. testing procedures, topology Virtual *Might* Be Fast Enough for Prod Can you reliably scale virtual benchmarks to physical benchmarks?? Can you ever just say: "multiply by 125% for virtual to physical"?

Search Engine Subsystems Data Processing Spider (FETCHER) and connectors Indexer / search indicies Distribution, Replication, optimization Handling Search Requests Inbound application stack Query preprocessing, expansion, early binding and federation CORE search engine > search indices Results list formatting, facets, late binding Admin, IT, Monitoring, Scheduling & Security

Search Engine Diagram

Sizing and Scaling of Search Scalability Monolithic vs. Distributed Software Data / Documents vs. Queries Documents vs. Terabytes, Text vs. Graphics Indexing Overall average, Latency and Failover # of Documents vs. # of Transactions Queries Peak vs. Average Average Throughput vs. Individual Query Time Compound Queries, Federated Search

How Search Engines Scale Monolithic vs. Farmed vs. Distributed Rows and Columns (meaning machines) It s mostly about where the Index lives

How Search Engines Scale: Spider Spider breakups Fetching data Parsing data Writing indexing data to disk

How Search Engines Scale: Search Search breakups App server Query preprocessing Query Parsing, query Cooking Stemming, Thesaurus, Search Filters Formatting Results Low level search the core engine Relevance and Ranking Post Filter, Sorting and Dedupe Monolithic vs. Federated

Bottlenecks for Search Engine Performance Usual Suspects: High Latency Ops Network Fetches (fixed with threading) Disk I/O Round Trip A few others in some cases: Huge binary files Extreme document indexing Atypical usage patterns Starved machines (memory, cpu)

Other Scaling Techniques: Google Map/Reduce and Hadoop General: Multi core driving new parallel ideas Google popularized Map / Reduce Open Source responded with Hadoop and Nutch 0.8 Maybe Clouds

Virtual Platforms for Search

Types of Virtual Machines Older Emulators (ex: Parellelz, etc.) VMWare, Xen, Microsoft, Baynote, others Full HW software layer Special Client OS layer Special Host OS layer Virtual Hardware components Virtual Networks Cloud Computing Some overlap with Blades

Operating System Issues: OS Selection Host OS Client OS 32 vs. 64 bit OS 32 vs. 64 bit Search Engine Version General Networking on clients Disk storage strategy File Handles local vs. network OS Licensing and EULA issues with Microsoft

The Cloud Not quite primetime yet for search, but some good choices are evolving: Amazon Google Microsoft in 2009 For now: use RackSpace or Joyent

Clouds can Add Temp Capacity If you only need to Spider 100,000,000 pages once a month You ve just won a contract to index the.gov top level domain You need to do a one time global reindex

Amazon Cloud: Pros Extreme Flexibility Low entry cost, pay as you grow Based on familiar Linux environment Prebuilt template virtual machine images Web based Administration

Amazon Cloud: Cons Micromanagement of machines Can be mitigated by third party software Assignment and visibility of IP addresses Dynamic DNS not good for busy sites Can pay extra to fix No Windows based servers Lacks traditional Tech Support channel Complaints about other channels during the Beta

Google Cloud Pros No micromanagement of machines Free to start, pays as you grow Uses Python (so far) Cons Unlike any of your existing platforms No way to port existing apps No relational database, Google: joins = bad! New Data Store programming model flawed? Prone to reference leaks and cycle explosion Google architects seem to be in denial mode

Test Results

Expected Results 3% to 20% overhead Leaning towards 20% because of non core CPU operations

Actual Results In range and pretty good: Average: approx 10% Range: max 0% * to 17% Had predicted 3% to 20% Overall results: Slightly better than expected

Test Platform: HW & OS HP mid tier workstation CPU: Dual Core Memory: 8 Gigs MS Windows Server 2008 64 bit For both Host and Guest With Hypervisor Sun Java SE SDK v6 u10 64 bit JVM set to 1 Gig of RAM

Test Platform: Spider and Data Spider: Nutch 0.9 Lucene based Includes Hadoop engine Set to 50 threads Data: Enron public emails 1.5 Gigs, approx ½ M messages Served by IIS on Separate local machine

Spider vs. Index CPU

Host, Client & Web Server

Wrapping Up

What was Learned JVM and Memory Not strictly limited by Xmx switch Nutch and CPU usage Spidering is distributed, indexing is not Virtualization and Search Hypervisor or bare metal is viable option!

Is Your Search Vendor Virtualize able? Google, no way (at this time ) FAST, yes mostly. and try ESPDeploy Others TBD

The Near Future Search and Virtualization Providing Flexibility and Validation vs. huge Cost savings Scalability across many commodity machines IT and Developers need to think distributed Ask Your Vendor Official Support Benchmark metrics Clouds: Search is not there Yet

Resources Newsletter & Whitepapers: www.ideaeng.com/current Search Dev Newsgroup: www.searchdev.org Blogs: EnterpriseSearchBlog.com SearchComponentsOnline.com

Finish Line Review & Questions General Info info@ideaeng.com Mark Bennett mbennett@ideaeng.com main 408 446 3460 cell 408 829 6513