Postings Lists - Reminder
Introduction to Search Engine Technology: Index Compression
Ronny Lempel, Yahoo! Labs, Haifa
(Some of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa Research Lab)
23 November, Search Engine Technology

Postings Lists - Reminder
- The lexicon entry corresponding to term t points to t's postings list and also holds t's DF
- Logically, t's postings list is a list of posting elements corresponding to t's occurrences
- Each posting element contains a document identifier along with the offsets of t's occurrences within the document
- Posting elements are sorted by increasing document identifier
- Formally, for a term appearing in $n_t$ documents $x_1, x_2, \ldots, x_{n_t}$, the postings list is
  $[(x_1, f_1, \langle o_1, \ldots, o_{f_1} \rangle), (x_2, f_2, \langle o_1, \ldots, o_{f_2} \rangle), \ldots, (x_{n_t}, f_{n_t}, \langle o_1, \ldots, o_{f_{n_t}} \rangle)]$
  where $x_i < x_{i+1}$ and $o_j < o_{j+1}$
- Efficient skipping mechanisms exist that enable reaching a position in a postings list without streaming through its prefix
Compression of Postings Lists
- Smaller, more compact postings lists mean less I/O, or that larger indices can fit in RAM
- Key idea: since the doc-ids associated with each term t are in ascending order, encode each doc-id by its gap from the previous identifier
- This encoding is called d-gap encoding
- Example: transform the list
  $[(x_1, f_1, \langle o_1, \ldots, o_{f_1} \rangle), (x_2, f_2, \langle o_1, \ldots, o_{f_2} \rangle), \ldots, (x_{n_t}, f_{n_t}, \langle o_1, \ldots, o_{f_{n_t}} \rangle)]$
  into
  $[(x_1, f_1, \langle o_1, \ldots, o_{f_1} \rangle), (x_2 - x_1, f_2, \langle o_1, \ldots, o_{f_2} \rangle), \ldots, (x_{n_t} - x_{n_t - 1}, f_{n_t}, \langle o_1, \ldots, o_{f_{n_t}} \rangle)]$
- Note: the sequence of occurrence offsets within documents can also be encoded in a similar fashion

Why Use d-gaps?
- No information is lost, but where is the saving?
- The largest d-gap in the second representation is potentially of the same order of magnitude as the largest document id in the first representation
- If the index holds N documents and a fixed binary encoding is used, both methods require $\lceil \log_2 N \rceil$ bits per doc-id/d-gap
- However, frequent terms have d-gaps that are significantly smaller than the document identifiers in which they occur
- Consequently: use variable-length encoding schemes, in which small and/or frequent d-gap values are encoded in fewer than $\log_2 N$ bits
- The optimal choice of encoding scheme depends on the probability distribution of the d-gaps and on decoding speeds
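The d-gap transformation itself can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the function names `to_dgaps` and `from_dgaps` are hypothetical, and doc-ids are assumed to be positive integers so that the first gap is simply the first doc-id.

```python
from itertools import accumulate


def to_dgaps(doc_ids):
    """Replace each doc-id by its gap from the previous doc-id.

    Assumes doc-ids are positive and strictly increasing; the first
    "gap" is the first doc-id itself (gap from an implicit 0).
    """
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        if doc_id <= previous:
            raise ValueError("doc-ids must be positive and strictly increasing")
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_dgaps(gaps):
    """Recover the doc-ids as running sums of the gaps."""
    return list(accumulate(gaps))


doc_ids = [3, 7, 8, 15, 40]
gaps = to_dgaps(doc_ids)          # [3, 4, 1, 7, 25]
restored = from_dgaps(gaps)       # [3, 7, 8, 15, 40]
```

The gaps carry the same information as the doc-ids but are smaller numbers, which is what the variable-length codes in the following slides exploit.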
Integer Representations: Fixed Length vs. Variable Length
- Fixed length: $\lceil \log_2 N \rceil$ bits per integer, where N is the maximal possible integer
  - Imposes a limit on the number of bits used per integer
  - Does not compress: does not exploit the differences in the relative frequencies of the integers
- Variable-length, prefix-free encoding allows unbounded number representation with significant space savings
  - Single-gap variable-length representations: Vint, Huffman, unary, γ, δ, Golomb encodings
  - Multiple-gap variable-length representations: Group Varint, Simple9, PForDelta
  - Simple9 doesn't support unbounded numbers, but is practical enough
- How much space savings is possible?

First Example - Vint
- A byte-aligned family of schemes, with each integer encoded by a variable number of bytes
- Simplest form - chaining: the leading bit of each byte indicates whether the number continues in an additional byte
  - E.g. a 10-bit number needs two bytes, since each byte carries 7 payload bits
- Alternatively, if the maximal integer is bounded, the leading bits of the leading byte can encode the number of additional bytes needed to encode the given number
  - Example for integers below $2^{22}$:
    Integers up to $2^6 - 1$:     00xxxxxx
    Integers up to $2^{14} - 1$:  01xxxxxx xxxxxxxx
    Integers up to $2^{22} - 1$:  10xxxxxx xxxxxxxx xxxxxxxx
- Other variants exist, all easily decodable
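The chaining form of Vint can be sketched as follows. The slides do not fix the bit conventions, so this sketch assumes that a set leading bit means "another byte follows" and that the most significant 7-bit group comes first; `vint_encode` and `vint_decode` are illustrative names.

```python
def vint_encode(n):
    """Encode a non-negative integer with 7 payload bits per byte.

    Assumed convention: leading bit 1 = more bytes follow, 0 = last byte;
    the most significant 7-bit group is written first.
    """
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()
    # Set the continuation bit on every byte except the last one.
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])


def vint_decode(data, pos=0):
    """Decode one integer starting at data[pos]; return (value, next_pos)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos


encoded = vint_encode(1000)            # a 10-bit number -> 2 bytes
value, next_pos = vint_decode(encoded)
```

Because each codeword announces its own end, encoded integers can be concatenated into one byte stream and decoded by repeatedly calling `vint_decode` with the returned position.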
Group Varint
- Used by Google [Jeff Dean, Keynote at WSDM 2009]
- Encodes 4 integers in blocks of 5 to 17 bytes
- First byte: four 2-bit binary length fields $L_1 L_2 L_3 L_4$, with $L_j \in \{1,2,3,4\}$
- Then, $L_1 + L_2 + L_3 + L_4$ bytes (between 4 and 16) holding the 4 numbers
  - Each number can use 8/16/24/32 bits
- Reported to be about twice as fast to decode as (single) Vint schemes

Prefix-Free Coding of Integers
- Let $\mathbb{N}$ be the set of natural numbers and Σ the alphabet of the code
  - In our case, Σ = {0,1}
- $C: \mathbb{N} \to \Sigma^+$ is a prefix-free code if for any two distinct natural numbers i, j, C(i) is not a prefix of C(j)
- Significance of prefix-free coding: codewords can be concatenated to each other without the need for any delimiters, and the resulting sequence remains uniquely decipherable
- Sometimes called comma-free codes
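The Group Varint block layout described above can be sketched as follows. Several details are assumptions not stated on the slide: each 2-bit field stores $L_j - 1$, the first number's field occupies the two highest bits of the header byte, and each number is stored little-endian. The function names are illustrative.

```python
def group_varint_encode(four):
    """Pack exactly four integers in [0, 2**32) into one 5-17 byte block."""
    if len(four) != 4:
        raise ValueError("Group Varint packs exactly four integers per block")
    header = 0
    body = bytearray()
    for n in four:
        if not 0 <= n < 2**32:
            raise ValueError("each integer must fit in 32 bits")
        length = max(1, (n.bit_length() + 7) // 8)   # 1..4 bytes
        header = (header << 2) | (length - 1)        # 2-bit length field
        body += n.to_bytes(length, "little")
    return bytes([header]) + bytes(body)


def group_varint_decode(data, pos=0):
    """Decode one block starting at data[pos]; return (numbers, next_pos)."""
    header = data[pos]
    pos += 1
    numbers = []
    for shift in (6, 4, 2, 0):
        length = ((header >> shift) & 0b11) + 1
        numbers.append(int.from_bytes(data[pos:pos + length], "little"))
        pos += length
    return numbers, pos


block = group_varint_encode([1, 300, 70000, 2**31])   # 1 + (1+2+3+4) bytes
numbers, next_pos = group_varint_decode(block)
```

A likely reason for the reported decoding speed is that all four lengths are known after reading a single byte, so the decoder avoids the per-byte continuation test that chained Vint requires.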
Shannon's Lossless Source Coding Theorem (simplified version)
- Let $C: \mathbb{N} \to \{0,1\}^+$ be a binary encoding of the set of natural numbers, and let $P: \mathbb{N} \to [0,1]$ be a probability distribution on the set of natural numbers
- Denote by $b_i$ the number of bits in C(i), i.e. the length of i's encoding
- The expected length (in bits) of a codeword of C is thus $R(C) = \sum_{i>0} p(i)\, b_i$ (R(C) is also called the rate of the code C)
- Shannon:
  1. For any prefix-free code C, $R(C) \ge -\sum_{i>0} p(i) \log_2 p(i)$
  2. An optimal code C* will achieve $R(C^*) < 1 - \sum_{i>0} p(i) \log_2 p(i)$
- The quantity $-\sum_{i>0} p(i) \log_2 p(i)$ is called the entropy of P and is denoted by H(P)

Unary Representation
- Perhaps the simplest prefix-free variable-length representation of positive integers
- x is represented by x-1 1s followed by a single 0
  - E.g. 1 → 0, 3 → 110
- By Shannon, optimal for the distribution $\Pr(x) = 2^{-x}$, since R(Unary Representation) = H($\Pr(x) = 2^{-x}$)
- The total length of the gap representations in a postings list equals the ordinal number of the last document that includes the term
- Beats fixed-length encodings for terms that appear in more than N/log(N) documents
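A small sketch of the unary code, with bits kept as a Python string for readability rather than packed into machine words. The names `unary_encode` and `unary_decode` are illustrative. The last lines check the slide's observation that the unary-coded gaps of a postings list add up to the last doc-id.

```python
def unary_encode(x):
    """x >= 1 is written as (x - 1) ones followed by a single zero."""
    if x < 1:
        raise ValueError("unary coding is defined for positive integers")
    return "1" * (x - 1) + "0"


def unary_decode(bits, pos=0):
    """Decode one integer starting at bits[pos]; return (value, next_pos)."""
    end = bits.index("0", pos)        # position of the terminating zero
    return end - pos + 1, end + 1


gaps = [3, 4, 1, 7, 25]               # d-gaps of the doc-ids 3, 7, 8, 15, 40
stream = "".join(unary_encode(g) for g in gaps)
total_bits = len(stream)              # equals the last doc-id, 40
```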
γ (Gamma) Coding
- Factor any x>0 into $2^e + d$, where $e = \lfloor \log_2 x \rfloor$ and $0 \le d < 2^e$
- Represent e+1 in unary
- Represent d in binary, using e bits
  - E.g. $9 = 2^3 + 1$ → 1110:001
- Representation length: $2 \lfloor \log_2 x \rfloor + 1$
- Optimal for $1/(2x^2) < \Pr(x) \le 1/x^2$
  - OK for a probability distribution?

  Gamma Code     Integers   Bits
  0              1          1
  10x            2-3        3
  110xx          4-7        5
  1110xxx        8-15       7
  11110xxxx      16-31      9
  111110xxxxx    32-63      11

Generalization: the δ (Delta) Code
- Factor x>0 into $2^e + d$, where $e = \lfloor \log_2 x \rfloor$ and $0 \le d < 2^e$
- Represent e+1 in γ
- Represent d in binary, using e bits
- Detailed example: δ encoding of 9
  - $9 = 2^3 + 1$, i.e. e=3 and d=1
  - 3+1 (e+1) in gamma is 110:00
  - 1 (d) in 3-bit representation is 001
  - Altogether: 110:00:001
- Length = $1 + \lfloor \log_2 x \rfloor + 2 \lfloor \log_2 \lfloor \log_2 2x \rfloor \rfloor$
- Optimal for $P(x) \approx 1/(2x (\log_2 2x)^2)$

  δ Code         Integers   Bits
  0              1          1
  100x           2-3        4
  101xx          4-7        5
  11000xxx       8-15       8
  11001xxxx      16-31      9
  11010xxxxx     32-63      10
  11011xxxxxx    64-127     11
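The γ and δ constructions can be sketched directly from the factorization $x = 2^e + d$. As a simplifying assumption, bits are kept as Python strings; a real index would pack them into machine words. All function names are illustrative.

```python
def _binary(d, width):
    """d written in exactly `width` bits (empty string when width is 0)."""
    return format(d, f"0{width}b") if width else ""


def gamma_encode(x):
    """e + 1 in unary, followed by d in e bits, where x = 2**e + d."""
    if x < 1:
        raise ValueError("gamma coding is defined for positive integers")
    e = x.bit_length() - 1            # floor(log2 x)
    d = x - (1 << e)
    return "1" * e + "0" + _binary(d, e)


def gamma_decode(bits, pos=0):
    """Decode one integer starting at bits[pos]; return (value, next_pos)."""
    e = 0
    while bits[pos] == "1":           # read e + 1 in unary
        e += 1
        pos += 1
    pos += 1                          # skip the terminating zero
    d = int(bits[pos:pos + e], 2) if e else 0
    return (1 << e) + d, pos + e


def delta_encode(x):
    """e + 1 in gamma, followed by d in e bits, where x = 2**e + d."""
    if x < 1:
        raise ValueError("delta coding is defined for positive integers")
    e = x.bit_length() - 1
    d = x - (1 << e)
    return gamma_encode(e + 1) + _binary(d, e)


def delta_decode(bits, pos=0):
    """Decode one integer starting at bits[pos]; return (value, next_pos)."""
    e_plus_1, pos = gamma_decode(bits, pos)
    e = e_plus_1 - 1
    d = int(bits[pos:pos + e], 2) if e else 0
    return (1 << e) + d, pos + e


gamma_of_9 = gamma_encode(9)          # "1110001"
delta_of_9 = delta_encode(9)          # "11000001"
```

For x = 1,000,000 these functions produce 39 bits (γ) and 28 bits (δ), which agrees with the length formulas on the slides.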
Encoding Lengths Comparison

  Number       Unary        Gamma   Delta
  1,000,000    1,000,000    39      28

- A fixed-length representation would require at least 20 bits per integer to encode a range of 1M

Golomb-Rice Codes
- Golomb codes are a parametric family of prefix codes that are very easy to implement
- They are distinguished from each other by a single parameter m
- The optimal choice of m depends on the probability distribution of the input sequence
- Rice coding is a special case of Golomb coding with m being a power of 2
  - Operations can then be done by masking and shifting bits
Golomb-Rice Coding (cont.)
- To encode an integer n using the Golomb code with parameter $m = 2^k$:
  - Write n as $r \cdot m + d$, where $r = \lfloor n/m \rfloor$ (the quotient) and $0 \le d < m$ is the remainder
  - Represent r+1 in unary (since 0 doesn't have a unary encoding)
  - Represent the remainder (n mod m) in binary using k bits

  Integer   m=4       # bits   m=8      # bits
  0-3       0xx       3        0xxx     4
  4-7       10xx      4        0xxx     4
  8-11      110xx     5        10xxx    5
  12-15     1110xx    6        10xxx    5
  16-19     11110xx   7        110xxx   6

Matching Code to Distribution
- Unary coding is optimal when $\Pr(x) = 2^{-x}$
- Gamma is optimal when $\Pr(x) \approx 1/(2x^2)$
- Delta is optimal when $\Pr(x) \approx 1/(2x (\log_2 2x)^2)$
- Golomb-Rice is optimal when $\Pr(x) = (1-p)^{x-1} p$, i.e. for geometric distributions
  - Provided that m is chosen such that $(1-p)^m + (1-p)^{m+1} \le 1 < (1-p)^m + (1-p)^{m-1}$
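The Rice case ($m = 2^k$) and the parameter condition above can be sketched as follows. Bits are kept as Python strings for readability, and the function names are illustrative. `golomb_parameter` simply searches for the smallest m satisfying the left-hand inequality; the result is a general Golomb parameter and need not be a power of two, so a Rice coder would have to pick a nearby power of two instead.

```python
def rice_encode(n, k):
    """Golomb code with m = 2**k: quotient + 1 in unary, remainder in k bits."""
    if n < 0 or k < 0:
        raise ValueError("n and k must be non-negative")
    r = n >> k                         # quotient, floor(n / m)
    d = n & ((1 << k) - 1)             # remainder, n mod m
    remainder_bits = format(d, f"0{k}b") if k else ""
    return "1" * r + "0" + remainder_bits


def rice_decode(bits, k, pos=0):
    """Decode one integer starting at bits[pos]; return (value, next_pos)."""
    r = 0
    while bits[pos] == "1":
        r += 1
        pos += 1
    pos += 1                           # skip the terminating zero
    d = int(bits[pos:pos + k], 2) if k else 0
    return (r << k) + d, pos + k


def golomb_parameter(p):
    """Smallest m with (1-p)**m + (1-p)**(m+1) <= 1, for 0 < p < 1."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    q = 1 - p
    m = 1
    while q ** m + q ** (m + 1) > 1:
        m += 1
    return m


code = rice_encode(9, 2)               # "11001", matching the 110xx row for m=4
m_for_rare_term = golomb_parameter(0.1)
```

Since d-gaps are at least 1 while this code starts at 0, one option is to encode gap - 1 rather than the gap itself.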
Golomb-Rice vs. the Rest
- For a word that appears in a fraction p of the documents, let's assume that each document received the word independently with probability p (i.e. a Bernoulli process)
  - This is an approximation, but a reasonable one in most cases
- Consequently, the d-gaps are distributed geometrically with parameter p, and Golomb-Rice encoding is optimal
- Different parameters can be used for different postings lists, based on the DF of each term, which is stored in the lexicon

Simple9 Encoding Scheme [Anh & Moffat, 2004]
- A word-aligned, multiple-number encoding scheme
- Encoding block: 4 bytes (32 bits)
- The most significant nibble (4 bits) describes the layout of the 28 other bits, as n numbers of b bits each with $n \cdot b \le 28$:
  0: a single 28-bit number
  1: two 14-bit numbers
  2: three 9-bit numbers (and one spare bit)
  3: four 7-bit numbers
  4: five 5-bit numbers (and three spare bits)
  5: seven 4-bit numbers
  6: nine 3-bit numbers (and one spare bit)
  7: fourteen 2-bit numbers
  8: twenty-eight 1-bit numbers
- Simple16 is a variant that defines 5 additional (uneven) configurations
- Can be decoded efficiently using bit masks
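The Simple9 layouts listed above can be sketched as a greedy packer. Two choices here are assumptions of this sketch rather than part of the slide: the first number of a word is placed in its lowest bits, and only completely filled layouts are emitted, so a short tail falls back to layouts holding fewer numbers. The names are illustrative.

```python
# Selector -> (how many numbers, bits per number), as listed on the slide.
LAYOUTS = [(1, 28), (2, 14), (3, 9), (4, 7), (5, 5), (7, 4), (9, 3), (14, 2), (28, 1)]


def simple9_encode(numbers):
    """Greedily pack integers in [0, 2**28) into 32-bit words."""
    words = []
    pos = 0
    while pos < len(numbers):
        # Try the densest layout first; use it only if it can be filled.
        for selector in range(8, -1, -1):
            n, b = LAYOUTS[selector]
            chunk = numbers[pos:pos + n]
            if len(chunk) == n and all(0 <= v < (1 << b) for v in chunk):
                break
        else:
            raise ValueError("value does not fit in 28 bits")
        word = selector << 28          # layout id in the top nibble
        for i, v in enumerate(chunk):
            word |= v << (i * b)
        words.append(word)
        pos += n
    return words


def simple9_decode(words):
    """Unpack words produced by simple9_encode using shifts and bit masks."""
    out = []
    for word in words:
        n, b = LAYOUTS[word >> 28]
        mask = (1 << b) - 1
        for i in range(n):
            out.append((word >> (i * b)) & mask)
    return out


numbers = [1] * 28 + [3, 2, 100, 5, 7, 200000]
words = simple9_encode(numbers)        # 4 words for 34 numbers
```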
PForDelta [S. Heman, 2005]
- Encode a block of B integers together (e.g. B=128)
- Determine a percentage threshold x, such that x% of the B integers fit in k bits (e.g. x=90)
- Allocate an array of $k \cdot B$ bits, and write any integer that fits in k bits in its corresponding slot; the minority of integers that don't fit in k bits are called exceptions
- Encode the locations of the exceptions by chaining, using their unused k-bit slots in the array
  - The index of the first exception is encoded before the array in $\log_2 B$ bits
  - The gap to the next exception is encoded in k bits; if it doesn't fit in k bits, force an additional exception
- Encode the exception values themselves, by any suitable method, after the first $\log_2 B + k \cdot B$ bits

Practical Considerations
- Most search engines are believed to be using byte-aligned compression schemes
  - While this favors Vint/Group Varint and Simple9, any of the other methods can also be byte-aligned by adding padding zeros
- When using d-gap compression on postings lists that support efficient skipping, each possible landing point of a skip (e.g. each block in a B+ tree) must start with an absolute docid rather than a d-gap from the previous postings element
- Offsets (locations) within documents are also encoded
DocID Assignment Problem
- The previous methods all compressed d-gaps; in all cases, small d-gaps are encoded by fewer bits than large ones
- Can documents be ordered (i.e. can document identifiers be assigned) such that the implied d-gaps are smaller and will thus compress better?
[Figure: documents c1-c8 shown in their original order and in a reordered sequence beginning c6, c3, c8, c1, c2, c5, c7]

DocID Assignment Problem (cont.)
- When framed as an optimization (minimization) problem of finding the best permutation of a set of documents, docid assignment is NP-hard
- For small d-gaps (smaller than the expected N/df(t)), documents with similar terms should be assigned close docids
  - Techniques applied: clustering, TSP approximations
- Observation: if a d-gap cannot be made smaller than N/df(t), try making it as large as possible (why?)
- N: number of documents; df(t): document frequency of term t
DocID Assignment by URL Sorting
- State of the art on Web collections is surprisingly simple: ordering URLs by lexicographic order results in good compression (small d-gaps), as same-host pages that use similar vocabulary are grouped [Silvestri, ECIR 2007]
  - Same topic, same page template, navigation bars, etc.
- Lexicographic URL sorting further preserves finer-grain site structure
  - However, most of the benefit of this method is gained from simply grouping same-host documents together
- Lexicographic URL sorting is also key to compressing the Web graph [Boldi & Vigna, WWW 2004]

Further Research
- The following areas have been researched:
  - Exploiting redundancy when indexing multiple documents with highly overlapping content:
    - Near-duplicate Web pages
    - Versioned documents (code, Wikipedia)
    - Threads (back-and-forth messages)
  - Document assignment problems on partitioned indexes
  - Achieving compact representations of the Web graph
    - In particular, the adjacency lists can also be d-gap encoded
More informationCatch Me If You Can: A Practical Framework to Evade Censorship in Information-Centric Networks
Catch Me If You Can: A Practical Framework to Evade Censorship in Information-Centric Networks Reza Tourani, Satyajayant (Jay) Misra, Joerg Kliewer, Scott Ortegel, Travis Mick Computer Science Department
More informationCopyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 13-1
Slide 13-1 Chapter 13 Disk Storage, Basic File Structures, and Hashing Chapter Outline Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files Hashed Files Dynamic and Extendible
More informationExtend Table Lens for High-Dimensional Data Visualization and Classification Mining
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia
More informationCMPE 150 Winter 2009
CMPE 150 Winter 2009 Lecture 6 January 22, 2009 P.E. Mantey CMPE 150 -- Introduction to Computer Networks Instructor: Patrick Mantey mantey@soe.ucsc.edu http://www.soe.ucsc.edu/~mantey/ / t / Office: Engr.
More informationTHE SECURITY AND PRIVACY ISSUES OF RFID SYSTEM
THE SECURITY AND PRIVACY ISSUES OF RFID SYSTEM Iuon Chang Lin Department of Management Information Systems, National Chung Hsing University, Taiwan, Department of Photonics and Communication Engineering,
More informationPolynomial Degree and Lower Bounds in Quantum Complexity: Collision and Element Distinctness with Small Range
THEORY OF COMPUTING, Volume 1 (2005), pp. 37 46 http://theoryofcomputing.org Polynomial Degree and Lower Bounds in Quantum Complexity: Collision and Element Distinctness with Small Range Andris Ambainis
More informationZeros of Polynomial Functions
Zeros of Polynomial Functions Objectives: 1.Use the Fundamental Theorem of Algebra to determine the number of zeros of polynomial functions 2.Find rational zeros of polynomial functions 3.Find conjugate
More informationChapter Objectives. Chapter 9. Sequential Search. Search Algorithms. Search Algorithms. Binary Search
Chapter Objectives Chapter 9 Search Algorithms Data Structures Using C++ 1 Learn the various search algorithms Explore how to implement the sequential and binary search algorithms Discover how the sequential
More informationBM307 File Organization
BM307 File Organization Gazi University Computer Engineering Department 9/24/2014 1 Index Sequential File Organization Binary Search Interpolation Search Self-Organizing Sequential Search Direct File Organization
More informationSelf-Indexing Inverted Files for Fast Text Retrieval
Self-Indexing Inverted Files for Fast Text Retrieval Alistair Moffat Justin Zobel February 1994 Abstract Query processing costs on large text databases are dominated by the need to retrieve and scan the
More informationRN-coding of Numbers: New Insights and Some Applications
RN-coding of Numbers: New Insights and Some Applications Peter Kornerup Dept. of Mathematics and Computer Science SDU, Odense, Denmark & Jean-Michel Muller LIP/Arénaire (CRNS-ENS Lyon-INRIA-UCBL) Lyon,
More informationDiscrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 13. Random Variables: Distribution and Expectation
CS 70 Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 3 Random Variables: Distribution and Expectation Random Variables Question: The homeworks of 20 students are collected
More informationUseful Number Systems
Useful Number Systems Decimal Base = 10 Digit Set = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} Binary Base = 2 Digit Set = {0, 1} Octal Base = 8 = 2 3 Digit Set = {0, 1, 2, 3, 4, 5, 6, 7} Hexadecimal Base = 16 = 2
More informationBinary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
More informationIBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
More informationFast Arithmetic Coding (FastAC) Implementations
Fast Arithmetic Coding (FastAC) Implementations Amir Said 1 Introduction This document describes our fast implementations of arithmetic coding, which achieve optimal compression and higher throughput by
More informationLanguage Modeling. Chapter 1. 1.1 Introduction
Chapter 1 Language Modeling (Course notes for NLP by Michael Collins, Columbia University) 1.1 Introduction In this chapter we will consider the the problem of constructing a language model from a set
More informationTwo Parts. Filesystem Interface. Filesystem design. Interface the user sees. Implementing the interface
File Management Two Parts Filesystem Interface Interface the user sees Organization of the files as seen by the user Operations defined on files Properties that can be read/modified Filesystem design Implementing
More informationSolutions to Problem Set 1
YALE UNIVERSITY DEPARTMENT OF COMPUTER SCIENCE CPSC 467b: Cryptography and Computer Security Handout #8 Zheng Ma February 21, 2005 Solutions to Problem Set 1 Problem 1: Cracking the Hill cipher Suppose
More informationKrishna Institute of Engineering & Technology, Ghaziabad Department of Computer Application MCA-213 : DATA STRUCTURES USING C
Tutorial#1 Q 1:- Explain the terms data, elementary item, entity, primary key, domain, attribute and information? Also give examples in support of your answer? Q 2:- What is a Data Type? Differentiate
More informationSecondary Storage. Any modern computer system will incorporate (at least) two levels of storage: magnetic disk/optical devices/tape systems
1 Any modern computer system will incorporate (at least) two levels of storage: primary storage: typical capacity cost per MB $3. typical access time burst transfer rate?? secondary storage: typical capacity
More information