Search Engine Architecture




Search Engine Architecture 1. Introduction This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details. Noted slides are adapted, with cosmetic changes, from Lin et al.'s Data-Intensive Computing with MapReduce, UMD Spring 2013.

What is this course about? Information retrieval on big data Broad survey of system architectures that enable big data applications In-depth focus on key insights behind each Without getting too bogged down

IR is Everywhere Web search (Google, Bing, ...) Finance (Bloomberg) Advertising (AdSense, ...) Fraud detection Medical diagnostics What are the major architectural problems?

Big Data Storage KV: Dynamo, Cassandra, Riak, Redis Document: MongoDB, CouchDB, Elasticsearch Column: VoltDB, Vertica Graph: Neo4j, OrientDB BigTable: HBase, HyperTable

Big Data Processing MapReduce: Hadoop, Hive, Pig DAG: Storm, Dryad, Spark Vertex-centric: Pregel, GraphLab

Course Administrivia

Prerequisites CSCI-GA 1170 Fundamental Algorithms CSCI-GA 2110 Programming Languages CSCI-GA 2250 Operating Systems Working knowledge of Python Ability to Google solutions to problems as they come up

Details Lectures Wed 5:10-7pm Office hours after class 7-9pm Mailing list See assignment 1 for the link if you're not sure Grading 10% Class Participation 50% Assignments 40% Final project

Details Readings All available online; see course website Introduction to Information Retrieval by Manning et al. Data-Intensive Text Processing with MapReduce by Lin et al. Relevant academic articles

Assignments Designed to introduce you to the material in a structured way Feel free to work together, but everything submitted must be prepared (typed) individually Due by start of class

Assignments Late policy Up to 24 hours late: 0.75 multiplier 24 to 48 hours late: 0.5 multiplier And so on Resubmit policy Graded assignments may be resubmitted up to a week later for (up to) full credit Must write detailed notes on what went wrong with your first solution

Assignments In the history of CS, these topics are extremely new New things tend to break Bugs, missing/wrong documentation, incompatible versions Be patient We will inevitably encounter problems Be flexible We will find workarounds Be constructive Tell me how I can improve everyone's experience

Big Ideas

Scale out, not up Prefer solutions that use many low-end machines instead of a few high-end machines Typically, it costs more than twice as much to double the number of cores Commodity machines see better economies of scale What are reasons to scale up instead? Source: Lin et al. Data-Intensive Computing with MapReduce, UMD Spring 2013.

Assume failures will happen Example: with an MTBF of 1,000 days (about 3 years) per machine, a 1,000-node cluster will experience 1 failure per day System architectures must be designed with fault tolerance in mind Automatically take failed nodes offline Resubmit failed jobs Watch for stragglers Source: Lin et al. Data-Intensive Computing with MapReduce, UMD Spring 2013.
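The failure arithmetic above can be checked directly. This sketch uses the slide's MTBF and cluster size; the assumption that failures are independent and uncorrelated across machines is mine, not the slide's:

```python
# Expected failures per day in a large cluster, assuming independent failures.
mtbf_days = 1000          # per-machine MTBF of ~1,000 days (about 3 years)
nodes = 1000              # cluster size

# Each machine fails on average once per `mtbf_days` days, so the cluster
# as a whole sees roughly nodes / mtbf_days failures per day.
failures_per_day = nodes / mtbf_days
print(failures_per_day)   # 1.0
```

The same arithmetic shows why single-machine intuitions break down: a failure rate that is negligible per node becomes a daily event at cluster scale.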

Good APIs hide system details Keeping track of details makes programming hard Counterexample: threading Traditional means of parallelization Many hazards: race conditions, deadlock, livelock Difficult to reason about Requires higher-level design patterns (e.g. producer-consumer queues) to avoid pitfalls Source: Lin et al. Data-Intensive Computing with MapReduce, UMD Spring 2013.

Aim for ideal scalability N machines should handle (nearly) N times the load Avoid serial computation (dependencies) Avoid shared memory (side effects) Example: MapReduce Source: Lin et al. Data-Intensive Computing with MapReduce, UMD Spring 2013.
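As a toy illustration of the shared-nothing style this slide advocates, word count can be written as independent map calls followed by a keyed reduce. This is a single-process Python sketch of the idea, not Hadoop or any real MapReduce API:

```python
from collections import defaultdict

def map_phase(doc):
    # Emit (word, 1) pairs; each call depends only on its own input,
    # so mappers could run on different machines with no coordination.
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    # Group by key and sum; no shared mutable state between mappers.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data", "big clusters"]
pairs = [p for d in docs for p in map_phase(d)]
print(reduce_phase(pairs))  # {'big': 2, 'data': 1, 'clusters': 1}
```

Because no map call reads another's output and no reducer key overlaps another's, adding machines divides the work with (nearly) no serial bottleneck.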

Move code to the data In recent years, CPUs have become much faster Storage has become much faster and more dense But network speeds haven't changed much, so network is often the bottleneck for large-scale batch computation Solution: rather than fetching data from remote storage and processing it, move the processing code to the nodes that are storing your data Source: Lin et al. Data-Intensive Computing with MapReduce, UMD Spring 2013.

Avoid random disk access Random disk access is slow Example: 1 TB database containing 10^10 100-byte records Updating 1% of records will take one month on one machine But rewriting the entire database will take less than one day Solutions: Process data sequentially Don't go to disk at all Source: Lin et al. Data-Intensive Computing with MapReduce, UMD Spring 2013.
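A back-of-envelope calculation supports the slide's comparison. The 10 ms seek time and 100 MB/s sequential throughput used here are my assumptions for a spinning disk, not figures from the slide:

```python
# Assumed hardware numbers (not from the slide): ~10 ms per random seek,
# ~100 MB/s sequential throughput for a commodity hard disk.
records = 10**10            # 100-byte records => 1 TB total
seek_s = 0.010              # ~10 ms per random access
seq_mb_per_s = 100          # sequential throughput

random_updates = 0.01 * records                  # update 1% of records
random_days = random_updates * seek_s / 86400    # seek-dominated time

seq_hours = (records * 100 / 1e6) / seq_mb_per_s / 3600  # rewrite everything

print(round(random_days, 1))   # ~11.6 days spent on seeks alone
print(round(seq_hours, 1))     # ~2.8 hours to stream the whole 1 TB
```

Seeks alone already cost on the order of weeks (the slide's one-month figure presumably counts a read plus a write per record), while a full sequential rewrite finishes in hours: random access loses even though it touches 100x less data.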

Single-Threaded Asynchronous Execution

The Problem User makes a request to a server We need to spend a little time coming up with a response In the meantime, we still need to be able to accept new connections! What are some solutions to this problem?

Traditional Solutions Prefork processes Spawn a worker thread Disadvantages: Thread and process overhead Reasoning about multithreaded code

Blocking vs. Non-blocking Sockets Blocking sockets: API calls will block until action (send, recv, connect, accept) has finished Non-blocking sockets: these calls will return immediately without doing anything In Python, use socket.setblocking(0) to make a socket non-blocking
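A minimal demonstration of the difference, using a local socket pair in place of a network connection (the socketpair setup is mine, purely for illustration): a recv on a non-blocking socket with no data waiting returns immediately by raising BlockingIOError, instead of stalling the thread:

```python
import socket

# A connected pair of sockets lets us demonstrate non-blocking recv
# without any network round trip.
a, b = socket.socketpair()
a.setblocking(False)        # equivalent to a.setblocking(0)

try:
    a.recv(1024)            # no data yet: raises instead of blocking
except BlockingIOError:
    print("recv would have blocked")

b.sendall(b"hi")
print(a.recv(1024))         # data is waiting now, so recv succeeds: b'hi'
a.close()
b.close()
```

An event loop builds on exactly this behavior: it only calls recv on sockets the OS has reported as readable, so the calls never block.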

Event Loop Single thread, single process Uses non-blocking I/O to wait for data to come back Allows us to service new connections while we wait All non-I/O activity within the process will still block

Tornado Event loop-based web framework Started at FriendFeed Bought by Facebook Lots in common with Twisted Includes a web server designed to handle the C10k problem

select vs. poll vs. epoll select system call that allows a program to monitor multiple file descriptors (sockets) Builds a bitmap of all fds, turns on bits for fds of interest Each call is O(highest file descriptor)

select vs. poll vs. epoll poll requires registering file descriptors of interest Each call is O(number of registered file descriptors)

select vs. poll vs. epoll epoll better event notification Same API as poll Each call is O(number of active file descriptors)
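One iteration of such an event loop can be sketched with Python's selectors module, which uses epoll where available and falls back to poll or select on other platforms. The socket pair here stands in for a real client connection; this is an illustration of the readiness-notification pattern, not Tornado's internals:

```python
import selectors
import socket

sel = selectors.DefaultSelector()   # epoll on Linux; else kqueue/poll/select

# A local socket pair stands in for a real client connection.
a, b = socket.socketpair()
a.setblocking(False)
sel.register(a, selectors.EVENT_READ)

b.sendall(b"ping")

# One iteration of an event loop: block only in select(), then service
# every file descriptor the OS reports as readable.
for key, events in sel.select(timeout=1.0):
    data = key.fileobj.recv(1024)
    print(data)               # b'ping'

sel.unregister(a)
a.close()
b.close()
```

A real server registers its listening socket too, accepting new connections from the same loop, which is how a single thread services thousands of clients.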

Web Server Example

import tornado.ioloop
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("Hello, world")

if __name__ == "__main__":
    application = tornado.web.Application([
        (r"/", MainHandler),
    ])
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()

Source: Structure of a Tornado web application. Tornado User's Guide.

Tornado Coroutines

class AsyncHandler(RequestHandler):
    @asynchronous
    def get(self):
        http_client = AsyncHTTPClient()
        http_client.fetch("http://example.com",
                          callback=self.on_fetch)

    def on_fetch(self, response):
        do_something_with_response(response)
        self.render("template.html")

Source: tornado.gen - Simplify asynchronous code. Tornado User's Guide.

Tornado Coroutines

class GenAsyncHandler(RequestHandler):
    @gen.coroutine
    def get(self):
        http_client = AsyncHTTPClient()
        response = yield http_client.fetch("http://example.com")
        do_something_with_response(response)
        self.render("template.html")

Source: tornado.gen - Simplify asynchronous code. Tornado User's Guide.

Until next time Assignment 1 has been posted Due before class next week Questions?