Postings Lists - Reminder

Introduction to Search Engine Technology: Index Compression
Ronny Lempel, Yahoo! Labs, Haifa
(Some of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa Research Lab)

Postings Lists - Reminder

The lexicon entry corresponding to term t points to t's postings list and also holds t's DF. Logically, t's postings list is a list of posting elements corresponding to t's occurrences. Each posting element contains a document identifier along with the offsets of t's occurrences within the document. The list is sorted by increasing document identifiers.

Formally, for a term appearing in $n_t$ documents $x_1, x_2, \ldots, x_{n_t}$:

$[(x_1, f_1, \langle o_1, \ldots, o_{f_1} \rangle), (x_2, f_2, \langle o_1, \ldots, o_{f_2} \rangle), \ldots, (x_{n_t}, f_{n_t}, \langle o_1, \ldots, o_{f_{n_t}} \rangle)]$

where $x_i < x_{i+1}$ and $o_j < o_{j+1}$.

Efficient skipping mechanisms exist that enable reaching a position in a postings list without streaming through its prefix.

23 November Search Engine Technology

Compression of Postings Lists

Smaller, more compact postings lists mean less I/O, or that larger indices can fit in RAM. Key idea: since the doc-ids associated with each term t are in ascending order, encode each doc-id by its gap from the previous identifier. This encoding is called d-gap encoding.

Example: transform the list

$[(x_1, f_1, \langle o_1, \ldots, o_{f_1} \rangle), (x_2, f_2, \langle o_1, \ldots, o_{f_2} \rangle), \ldots, (x_{n_t}, f_{n_t}, \langle o_1, \ldots, o_{f_{n_t}} \rangle)]$

into

$[(x_1, f_1, \langle o_1, \ldots, o_{f_1} \rangle), (x_2 - x_1, f_2, \langle o_1, \ldots, o_{f_2} \rangle), \ldots, (x_{n_t} - x_{n_t - 1}, f_{n_t}, \langle o_1, \ldots, o_{f_{n_t}} \rangle)]$

Note: the sequence of occurrence offsets within documents can also be encoded in a similar fashion.

Why Use d-gaps?

No information is lost, but where is the saving? The largest d-gap in the second representation is potentially of the same order of magnitude as the largest document id in the first representation. If the index holds N documents and a fixed binary encoding is used, both methods require $\log_2 N$ bits per doc-id/d-gap. However, frequent terms have d-gaps that are significantly smaller than the document identifiers in which they occur. Consequently: use variable-length encoding schemes, in which small and/or frequent d-gap values are encoded in fewer than $\log_2 N$ bits. The optimal choice of encoding scheme depends on the probability distribution of the d-gaps and on decoding speeds.
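The d-gap transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' implementation: the function names `to_dgaps` and `from_dgaps` are hypothetical, doc-ids are assumed to be positive integers, and the first doc-id is stored as its gap from 0.

```python
from itertools import accumulate


def to_dgaps(doc_ids):
    """Replace every doc-id by its gap from the previous doc-id (the first by its gap from 0)."""
    gaps, prev = [], 0
    for i, x in enumerate(doc_ids):
        if i > 0 and x <= prev:
            raise ValueError("doc-ids must be strictly increasing")
        gaps.append(x - prev)
        prev = x
    return gaps


def from_dgaps(gaps):
    """Recover the absolute doc-ids by taking running sums of the gaps."""
    return list(accumulate(gaps))


doc_ids = [3, 7, 8, 20, 21]
gaps = to_dgaps(doc_ids)   # [3, 4, 1, 12, 1]
assert from_dgaps(gaps) == doc_ids
```

No information is lost, and every gap is at most as large as the doc-id it replaces, which is what the variable-length codes discussed next exploit.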

Integer Representations: Fixed Length vs. Variable Length

Fixed length: $\log_2 N$ bits per integer, where N is the maximal possible integer. Imposes a limit on the number of bits used per integer. Does not compress: it does not exploit the differences in the relative frequencies of the integers.

Variable-length, prefix-free encoding allows unbounded number representation with significant space savings.
Single-gap variable-length representations: Vint, Huffman, unary, γ, δ, Golomb encodings.
Multiple-gap variable-length representations: Group Varint, Simple9, PForDelta. Simple9 doesn't support unbounded numbers, but is practical enough.

How much space saving is possible?

First Example - Vint

A byte-aligned family of schemes, with each integer encoded by a variable number of bytes. Simplest form, chaining: the leading bit of each byte indicates whether the number continues in an additional byte, leaving 7 payload bits per byte. E.g. a 10-bit number is encoded in two bytes.

Alternatively, if the maximal integer is bounded, one can encode in the leading bits of the leading byte the number of additional bytes needed to encode the given number. Example for integers of up to 22 bits:

    Integers up to $2^6 - 1$:     00xxxxxx
    Integers up to $2^{14} - 1$:  01xxxxxx xxxxxxxx
    Integers up to $2^{22} - 1$:  10xxxxxx xxxxxxxx xxxxxxxx

Other variants exist, all easily decodable.
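The chaining form of Vint can be sketched as follows. The slides do not fix the details, so two conventions here are assumptions: a leading bit of 1 means "another byte follows", and the most significant 7-bit group is written first. The names `vint_encode` and `vint_decode` are hypothetical.

```python
def vint_encode(n):
    """Encode a non-negative integer with 7 payload bits per byte."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    groups = [n & 0x7F]            # lowest 7 bits
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()               # most significant group first (assumed order)
    # Leading bit 1 marks "continues in an additional byte"; the last byte has leading bit 0.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def vint_decode(data, pos=0):
    """Decode one integer starting at data[pos]; return (value, next position)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:        # leading bit 0: this was the last byte
            return value, pos


# The 10-bit number 1000 takes two bytes.
assert vint_encode(1000) == bytes([0x87, 0x68])
```

Because the code is self-delimiting, encoded d-gaps can simply be concatenated and decoded one after the other by feeding the returned position back into `vint_decode`.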

Group Varint

Used by Google [Jeff Dean, keynote at WSDM 2009]. Encodes 4 integers in blocks of size 5-17 bytes. First byte: four 2-bit binary length fields $L_1 L_2 L_3 L_4$, $L_j \in \{1,2,3,4\}$. Then, $L_1 + L_2 + L_3 + L_4$ bytes (between 4 and 16) holding the 4 numbers. Each number can use 8/16/24/32 bits. Reported to be about twice as fast to decode as (single) Vint schemes.

Prefix-Free Coding of Integers

Let $\mathbb{N}$ be the set of natural numbers and $\Sigma$ the alphabet of the code. In our case, $\Sigma = \{0,1\}$. $C: \mathbb{N} \to \Sigma^+$ is a prefix-free code if for any two distinct natural numbers i, j, $C(i)$ is not a prefix of $C(j)$.

Significance of prefix-free coding: codewords can be concatenated to each other without the need for any delimiters, and the resulting sequence remains uniquely decipherable. Sometimes called comma-free codes.
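A Group Varint block as described above can be sketched like this. The slides give only the block layout, so the order of the length fields within the header byte (first number in the two highest bits) and the little-endian byte order of each number are assumptions, and `gv_encode` / `gv_decode` are hypothetical names.

```python
def gv_encode(four):
    """Encode exactly four integers below 2**32 into one block of 5 to 17 bytes."""
    if len(four) != 4:
        raise ValueError("a block holds exactly four integers")
    header = 0
    body = bytearray()
    for n in four:
        if not 0 <= n < 2 ** 32:
            raise ValueError("each integer must fit in 32 bits")
        length = max(1, (n.bit_length() + 7) // 8)   # 1..4 bytes
        header = (header << 2) | (length - 1)        # 2-bit field stores length - 1
        body += n.to_bytes(length, "little")
    return bytes([header]) + bytes(body)


def gv_decode(data, pos=0):
    """Decode one block starting at data[pos]; return (four integers, next position)."""
    header = data[pos]
    pos += 1
    out = []
    for shift in (6, 4, 2, 0):                       # L1, L2, L3, L4
        length = ((header >> shift) & 0b11) + 1
        out.append(int.from_bytes(data[pos:pos + length], "little"))
        pos += length
    return out, pos


block = gv_encode([1, 300, 70000, 2 ** 31])
assert len(block) == 1 + (1 + 2 + 3 + 4)
assert gv_decode(block)[0] == [1, 300, 70000, 2 ** 31]
```

The decoder reads all four lengths from a single byte before touching the payload, so it avoids the per-byte continuation test of chained Vint. That is one plausible reason for the reported decoding speed.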

Shannon's Lossless Source Coding Theorem (simplified version)

Let $C: \mathbb{N} \to \{0,1\}^+$ be a binary encoding of the set of natural numbers, and let $P: \mathbb{N} \to [0,1]$ be a probability distribution on the set of natural numbers. Denote by $b_i$ the number of bits in $C(i)$, i.e. the length of i's encoding. The expected length (in bits) of a codeword of C is thus $R(C) = \sum_{i>0} p(i)\, b_i$ ($R(C)$ is also called the rate of the code C).

Shannon:
1. For any prefix-free code C, $R(C) \ge -\sum_{i>0} p(i) \log_2 p(i)$.
2. An optimal code $C^*$ will achieve $R(C^*) < 1 - \sum_{i>0} p(i) \log_2 p(i)$.

The quantity $-\sum_{i>0} p(i) \log_2 p(i)$ is called the entropy of P and is denoted by $H(P)$.

Unary Representation

Perhaps the simplest prefix-free variable-length representation of positive integers: x is represented by x-1 1's followed by a single 0. E.g. $1 \to 0$, $3 \to 110$.

By Shannon, unary is optimal for the distribution $\Pr(x) = 2^{-x}$, since its rate equals the entropy of that distribution.

The total length of the gap representations in a postings list equals the ordinal number of the last document that includes the term. Unary therefore beats fixed-length encodings for terms that appear in more than $N / \log_2 N$ documents: the whole list costs at most N bits in unary, versus $\log_2 N$ bits per posting in fixed length.
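The two claims about unary coding can be checked numerically. The sketch below is illustrative only: `unary` and `entropy_and_rate` are hypothetical names, and the infinite sums are truncated at 60 terms, which is an assumption that makes the tail negligible in double precision.

```python
import math


def unary(x):
    """Unary code of a positive integer: x-1 ones followed by a single zero."""
    if x < 1:
        raise ValueError("unary coding is defined for positive integers only")
    return "1" * (x - 1) + "0"


def entropy_and_rate(max_x=60):
    """Truncated H(P) and R(unary) for the distribution Pr(x) = 2**-x."""
    h = r = 0.0
    for x in range(1, max_x + 1):
        p = 2.0 ** -x
        h += -p * math.log2(p)      # entropy term
        r += p * len(unary(x))      # expected code length term
    return h, r


h, r = entropy_and_rate()
assert abs(h - r) < 1e-12           # the rate meets the entropy bound

# The unary lengths of the d-gaps sum to the last doc-id in the list.
gaps = [3, 4, 1, 12, 1]             # doc-ids 3, 7, 8, 20, 21
assert sum(len(unary(g)) for g in gaps) == 21
```

Both sums come out at 2 bits, so for this particular distribution no prefix-free code can do better than unary.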

γ (Gamma) Coding

Factor any x > 0 into $2^e + d$, where $e = \lfloor \log_2 x \rfloor$ and $0 \le d < 2^e$. Represent e+1 in unary. Represent d in binary, using e bits. E.g. $9 = 2^3 + 1 \to$ 1110:001.

Representation length: $2 \lfloor \log_2 x \rfloor + 1$.

Optimal for $1/(2x^2) < \Pr(x) \le 1/x^2$. OK for a probability distribution?

    Integers   Gamma code     Bits
    1          0              1
    2-3        10x            3
    4-7        110xx          5
    8-15       1110xxx        7
    16-31      11110xxxx      9
    32-63      111110xxxxx    11

Generalization: the δ (Delta) Code

Factor x > 0 into $2^e + d$, where $e = \lfloor \log_2 x \rfloor$ and $0 \le d < 2^e$. Represent e+1 in γ. Represent d in binary, using e bits.

Detailed example: δ encoding of 9.
$9 = 2^3 + 1$, i.e. e = 3 and d = 1.
3+1 (e+1) in gamma is 110:00.
1 (d) in 3-bit representation is 001.
Altogether: 110:00:001.

Length: $1 + \lfloor \log_2 x \rfloor + 2 \lfloor \log_2 \log_2 2x \rfloor$.

Optimal for $P(x) \approx 1 / (2x (\log_2 2x)^2)$.

    Integers   δ code         Bits
    1          0              1
    2-3        100x           4
    4-7        101xx          5
    8-15       11000xxx       8
    16-31      11001xxxx      9
    32-63      11010xxxxx     10
    64-127     11011xxxxxx    11
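The γ and δ constructions above translate almost literally into code. In this sketch codewords are Python strings of '0' and '1' for readability rather than packed bits, and the names `gamma`, `delta` and `gamma_decode_stream` are hypothetical.

```python
def _binary(d, width):
    """d written in exactly `width` bits (empty string when width is 0)."""
    return format(d, "b").zfill(width) if width else ""


def gamma(x):
    """Gamma code: e+1 in unary, then d in e bits, where x = 2**e + d."""
    if x < 1:
        raise ValueError("gamma coding is defined for positive integers only")
    e = x.bit_length() - 1                 # floor(log2 x)
    d = x - (1 << e)
    return "1" * e + "0" + _binary(d, e)   # unary(e+1) is e ones and a zero


def delta(x):
    """Delta code: e+1 in gamma, then d in e bits, where x = 2**e + d."""
    if x < 1:
        raise ValueError("delta coding is defined for positive integers only")
    e = x.bit_length() - 1
    d = x - (1 << e)
    return gamma(e + 1) + _binary(d, e)


def gamma_decode_stream(bits):
    """Decode a concatenation of gamma codewords; no delimiters are needed."""
    out, pos = [], 0
    while pos < len(bits):
        e = 0
        while bits[pos] == "1":            # count the unary prefix
            e += 1
            pos += 1
        pos += 1                           # skip the terminating zero
        d = int(bits[pos:pos + e], 2) if e else 0
        pos += e
        out.append((1 << e) + d)
    return out


assert gamma(9) == "1110001"               # 1110:001
assert delta(9) == "11000001"              # 110:00:001
```

Decoding a concatenated stream, for example `gamma_decode_stream(gamma(9) + gamma(1) + gamma(5))`, recovers the original integers, which illustrates the prefix-free property.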

Encoding Lengths Comparison

[Table: code lengths in bits of selected integers up to 1,000,000 under the unary, gamma and delta codes.]

A fixed-length representation would require at least 20 bits per integer to encode a range of 1M.

Golomb-Rice Codes

Golomb codes are a parametric family of prefix codes that are very easy to implement. They are distinguished from each other by a single parameter m. The optimal choice of m depends on the probability distribution of the input sequence.

Rice coding is a special case of Golomb coding with m being a power of 2. Operations can then be done by masking and shifting bits.
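The encoding-length comparison above can be recomputed from the length formulas of the three codes. The sample integers below are an arbitrary choice for illustration and are not necessarily the rows of the original table; the function names are hypothetical.

```python
def unary_len(x):
    """Unary uses x-1 ones plus one zero."""
    return x


def gamma_len(x):
    """2 * floor(log2 x) + 1 bits."""
    return 2 * (x.bit_length() - 1) + 1


def delta_len(x):
    """e bits for the remainder plus a gamma code for e+1, where e = floor(log2 x)."""
    e = x.bit_length() - 1
    return e + gamma_len(e + 1)


print(f"{'Number':>9} {'Unary':>9} {'Gamma':>6} {'Delta':>6}")
for x in (1, 2, 9, 100, 1000, 1_000_000):   # sample values chosen for illustration
    print(f"{x:>9} {unary_len(x):>9} {gamma_len(x):>6} {delta_len(x):>6}")

# At 1,000,000 gamma needs 39 bits and delta 28, against 20 for fixed length.
assert (gamma_len(1_000_000), delta_len(1_000_000)) == (39, 28)
```

Gamma and delta only pay off when most of the encoded values are far below the maximum, which is exactly the situation for the d-gaps of frequent terms.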

Golomb-Rice Coding (cont.)

To encode an integer n using the Golomb code with parameter $m = 2^k$:
Write n as $r \cdot m + d$, where $r = \lfloor n/m \rfloor$ (the quotient) and $0 \le d < m$ is the remainder.
Represent r+1 in unary (since 0 doesn't have a unary encoding).
Represent the remainder (n mod m) in binary using k bits.

    Integer   m=4       # bits   m=8      # bits
    0-3       0xx       3        0xxx     4
    4-7       10xx      4        0xxx     4
    8-11      110xx     5        10xxx    5
    12-15     1110xx    6        10xxx    5
    16-19     11110xx   7        110xxx   6

Matching Code to Distribution

Unary coding is optimal when $\Pr(x) = 2^{-x}$.
Gamma is optimal when $\Pr(x) \approx 1/(2x^2)$.
Delta is optimal when $\Pr(x) \approx 1 / (2x (\log_2 2x)^2)$.
Golomb-Rice is optimal when $\Pr(x) = (1-p)^{x-1} p$, i.e. for geometric distributions, provided that m is chosen such that $(1-p)^m + (1-p)^{m+1} \le 1 < (1-p)^m + (1-p)^{m-1}$.
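The encoding steps and the parameter condition above can be sketched as follows. Codewords are strings of '0' and '1' for readability, and `rice_encode`, `rice_decode` and `golomb_parameter` are hypothetical names. `golomb_parameter` simply searches for the smallest m that satisfies the stated inequality; the result is generally not a power of 2, so a Rice coder would still have to round it to one.

```python
def rice_encode(n, k):
    """Rice code of n >= 0 with m = 2**k: r+1 in unary, then n mod m in k bits."""
    if n < 0 or k < 0:
        raise ValueError("n and k must be non-negative")
    r = n >> k                       # quotient, by shifting
    d = n & ((1 << k) - 1)           # remainder, by masking
    remainder = format(d, "b").zfill(k) if k else ""
    return "1" * r + "0" + remainder # unary(r+1) is r ones and a zero


def rice_decode(bits, k, pos=0):
    """Decode one codeword starting at bits[pos]; return (n, next position)."""
    r = 0
    while bits[pos] == "1":
        r += 1
        pos += 1
    pos += 1                         # skip the terminating zero
    d = int(bits[pos:pos + k], 2) if k else 0
    return (r << k) | d, pos + k


def golomb_parameter(p):
    """Smallest m with (1-p)**m + (1-p)**(m+1) <= 1, for 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    q = 1.0 - p
    m = 1
    while q ** m + q ** (m + 1) > 1.0:
        m += 1
    return m


assert rice_encode(9, 2) == "11001"    # row 8-11 of the m=4 column: 110xx
assert rice_encode(17, 3) == "110001"  # row 16-19 of the m=8 column: 110xxx
assert golomb_parameter(0.1) == 7
```

For a term occurring in a fraction p of the documents, a per-list parameter could be derived this way from the DF stored in the lexicon.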

Golomb-Rice vs. the Rest

For a word that appears in a fraction p of the documents, let's consider that each document received the word independently with probability p (i.e. a Bernoulli process). This is an approximation, but a reasonable one in most cases. Consequently, the d-gaps are distributed geometrically with parameter p, and Golomb-Rice encoding is optimal. One can use different parameters for different postings lists, based on the DF of each term that is stored in the lexicon.

Simple9 Encoding Scheme [Anh & Moffat, 2004]

A word-aligned, multiple-number encoding scheme. Encoding block: 4 bytes (32 bits). The most significant nibble (4 bits) describes the layout of the other 28 bits, which hold n numbers of b bits each with $n \cdot b \le 28$, as follows:

    0: a single 28-bit number
    1: two 14-bit numbers
    2: three 9-bit numbers (and one spare bit)
    3: four 7-bit numbers
    4: five 5-bit numbers (and three spare bits)
    5: seven 4-bit numbers
    6: nine 3-bit numbers (and one spare bit)
    7: fourteen 2-bit numbers
    8: twenty-eight 1-bit numbers

Simple16 is a variant that defines 5 additional (uneven) configurations. Can be decoded efficiently using bit masks.
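A Simple9 coder following the nine layouts listed above can be sketched as below. Several details are assumptions not stated in the slides: the encoder is greedy (it tries the densest layout first), the first number of a word occupies the lowest-order bits, and a layout is used only when enough numbers remain to fill it. The names `CONFIGS`, `simple9_encode` and `simple9_decode` are hypothetical.

```python
# Selector value -> (how many numbers, bits per number), as in the layout list.
CONFIGS = [(1, 28), (2, 14), (3, 9), (4, 7), (5, 5), (7, 4), (9, 3), (14, 2), (28, 1)]


def simple9_encode(values):
    """Greedily pack non-negative integers below 2**28 into 32-bit words."""
    words, pos = [], 0
    while pos < len(values):
        for selector in range(8, -1, -1):            # densest layout first
            n, b = CONFIGS[selector]
            chunk = values[pos:pos + n]
            if len(chunk) == n and all(0 <= v < (1 << b) for v in chunk):
                word = selector << 28                # layout nibble in the top 4 bits
                for i, v in enumerate(chunk):
                    word |= v << (i * b)             # first number in the lowest bits
                words.append(word)
                pos += n
                break
        else:
            raise ValueError("value is negative or does not fit in 28 bits")
    return words


def simple9_decode(words):
    """Unpack every word with a shift and a mask chosen by its selector nibble."""
    out = []
    for word in words:
        n, b = CONFIGS[word >> 28]
        mask = (1 << b) - 1
        out.extend((word >> (i * b)) & mask for i in range(n))
    return out


gaps = [1] * 28 + [3, 2, 1, 0] * 3 + [3, 2] + [100000]
words = simple9_encode(gaps)
assert len(words) == 3                               # layouts 8, 7 and 0
assert simple9_decode(words) == gaps
```

Here 43 d-gaps fit in three 32-bit words; the single large gap costs a full word on its own, which is the price of the fixed word alignment.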

PForDelta [S. Heman, 2005]

Encode a block of B integers together (e.g. B=128). Determine a percentage threshold x, such that x% of the B integers fit in k bits (e.g. x=90). Allocate an array of $kB$ bits, and write any integer that fits in k bits in its corresponding slot; the minority of integers that don't fit in k bits are called exceptions.

Encode the locations of the exceptions by chaining, using their unused k-bit slots in the array. The index of the first exception is encoded before the array in $\log_2 B$ bits. The gap to the next exception is encoded in k bits; if it doesn't fit in k bits, force an additional exception.

Encode the exceptions somehow after the $\log_2 B + kB$ bits.

Practical Considerations

Most search engines are believed to be using byte-aligned compression schemes. While this favors Vint/Group Varint and Simple9, one can also byte-align any of the other methods by adding padding zeros.

When using d-gap compression on postings lists that support efficient skipping, each possible landing point of a skip (e.g. each block in a B+ tree) must start with an absolute docid rather than a d-gap from the previous postings element.

Offsets (locations) within documents are also encoded.
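The exception chaining of PForDelta can be sketched at the logical level, without the final bit packing. Everything below is a simplified illustration under stated assumptions: slots are kept as a Python list of small integers instead of a packed array of kB bits, an exception's slot stores the distance to the next exception minus one, the exceptions are kept as a plain list, and the names `pfd_encode` and `pfd_decode` are hypothetical.

```python
def pfd_encode(block, percent=90):
    """Return (k, index of first exception, k-bit slot values, exception values)."""
    need = -(-len(block) * percent // 100)             # ceil(B * percent / 100)
    k = max(1, sorted(v.bit_length() for v in block)[need - 1])
    limit = 1 << k                                     # slots hold values below 2**k
    real = [i for i, v in enumerate(block) if v >= limit]
    exc_pos = []
    for i in real:
        # If the gap to the next exception does not fit in k bits, force extra exceptions.
        while exc_pos and i - exc_pos[-1] > limit:
            exc_pos.append(exc_pos[-1] + limit)
        exc_pos.append(i)
    slots = list(block)
    for cur, nxt in zip(exc_pos, exc_pos[1:]):
        slots[cur] = nxt - cur - 1                     # chain link stored in the unused slot
    if exc_pos:
        slots[exc_pos[-1]] = 0                         # last link is never followed
    first = exc_pos[0] if exc_pos else len(block)      # len(block) means "no exceptions"
    return k, first, slots, [block[i] for i in exc_pos]


def pfd_decode(k, first, slots, exceptions):
    """Copy the slots, then walk the chain and patch in the exceptions."""
    out = list(slots)
    pos = first
    for value in exceptions:
        gap = slots[pos]
        out[pos] = value
        pos += gap + 1
    return out


block = [3, 1, 2, 500, 0, 1, 2, 3, 1, 0, 700, 2]
k, first, slots, exceptions = pfd_encode(block, percent=80)
assert k == 2 and first == 3
assert exceptions == [500, 3, 700]                     # index 7 is a forced exception
assert pfd_decode(k, first, slots, exceptions) == block
```

In the example the two real exceptions sit 7 positions apart, which does not fit in the 2-bit link, so the value at index 7 is promoted to an exception even though it would have fitted in its slot.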

DocID Assignment Problem

The previous methods all compressed d-gaps; in all cases, small d-gaps are encoded in fewer bits than large ones. Can documents be ordered (i.e. can document identifiers be assigned) such that the implied d-gaps are smaller and will thus compress better?

[Figure: eight documents c1 to c8 shown in their original order and in a permuted order beginning c6, c3, c8, c1, c2, c5, c7.]

DocID Assignment Problem (cont.)

When framed as an optimization (minimization) problem of finding the best permutation of a set of documents, docid assignment is NP-hard.

For small d-gaps (smaller than the expected N/df(t)), documents with similar terms should be assigned close docids. Techniques applied: clustering, TSP approximations.

Observation: if a d-gap cannot be made smaller than N/df(t), try making it as large as possible (why?)

N: number of documents; df(t): document frequency of term t.
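The effect of docid assignment on compressed size can be illustrated with a toy collection. The eight documents, their terms and the use of gamma coding as the cost measure are all hypothetical choices made for this sketch; `index_cost` is a hypothetical helper.

```python
def gamma_len(x):
    """Length in bits of the gamma code of a positive integer."""
    return 2 * (x.bit_length() - 1) + 1


def index_cost(docs, order):
    """Total gamma-coded d-gap bits when documents are numbered 1..N in `order`."""
    new_id = {doc: i + 1 for i, doc in enumerate(order)}
    postings = {}
    for doc, terms in docs.items():
        for term in terms:
            postings.setdefault(term, []).append(new_id[doc])
    total = 0
    for ids in postings.values():
        prev = 0
        for x in sorted(ids):
            total += gamma_len(x - prev)     # cost of one d-gap
            prev = x
    return total


# Toy collection: odd-numbered documents share one vocabulary, even-numbered another.
docs = {f"c{i}": ({"a", "b"} if i % 2 else {"c", "d"}) for i in range(1, 9)}
interleaved = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
grouped = ["c1", "c3", "c5", "c7", "c2", "c4", "c6", "c8"]

assert index_cost(docs, interleaved) == 44
assert index_cost(docs, grouped) == 24
```

Assigning adjacent docids to documents with similar terms nearly halves the cost in this example, because most gaps become 1.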

DocID Assignment by URL Sorting

State of the art on Web collections is surprisingly simple: ordering URLs by lexicographic order results in good compression (small d-gaps), as same-host pages that use similar vocabulary are grouped [Silvestri, ECIR 2007]. Same topic, same page template, navigation bars, etc.

Lexicographic URL sorting further preserves finer-grain site structure. However, most of the benefit of this method is gained from simply grouping same-host documents together.

Lexicographic URL sorting is also key to compressing the Web graph [Boldi & Vigna, WWW 2004].

Further Research

The following areas have been researched:
Exploiting redundancy when indexing multiple documents with highly overlapping content: near-duplicate Web pages, versioned documents (code, Wikipedia), threads (back-and-forth messages).
Document assignment problems on partitioned indexes.
Achieving compact representations of the Web graph. In particular, the adjacency lists can also be d-gap encoded.
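Assigning docids by lexicographic URL order needs nothing more than a sort. The sketch below uses made-up URLs on hypothetical hosts, and `assign_docids` is a hypothetical name; a real system would presumably normalize URLs first.

```python
def assign_docids(urls):
    """Number the documents 1..N in lexicographic order of their URLs."""
    return {url: docid for docid, url in enumerate(sorted(urls), start=1)}


urls = [
    "http://b.example/news/2",
    "http://a.example/index",
    "http://b.example/news/1",
    "http://a.example/about",
]
ids = assign_docids(urls)

# Pages of the same host receive consecutive docids.
assert sorted(ids[u] for u in urls if "a.example" in u) == [1, 2]
assert sorted(ids[u] for u in urls if "b.example" in u) == [3, 4]
```

Terms that are specific to one host then occur in runs of consecutive docids, so their d-gaps are mostly 1.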


More information
