6. Strings. Using Strings in Java. String Implementation in Java.

Strings

String. Ordered list of characters.
Ex. Natural languages, Java programs, genomic sequences, ...

"The digital information that underlies biochemistry, cell biology, and development can be represented by a simple string of G's, A's, T's and C's. This string is the root data structure of an organism's biology." -M. V. Olson

Robert Sedgewick and Kevin Wayne Copyright http://www.princeton.edu/~cos

Using Strings in Java

String concatenation. Append one string to end of another string.
Substring. Extract a contiguous list of characters from a string.

    String s = "strings";          // s = "strings"
    char c = s.charAt(2);          // c = 'r'
    String t = s.substring(2, 7);  // t = "rings"
    String u = s + t;              // u = "stringsrings"

String Implementation in Java

Memory. A fixed overhead plus a number of bytes proportional to N (each char takes 2 bytes) for a virgin string of N characters! Could use a byte array instead of String to save space.

    public final class String implements Comparable<String> {
        private char[] value;  // characters
        private int offset;    // index of first char into array
        private int count;     // length of string
        private int hash;      // cache of hashCode

        private String(int offset, int count, char[] value) {
            this.offset = offset;
            this.count = count;
            this.value = value;
        }

        public String substring(int from, int to) {
            return new String(offset + from, to - from, value);
        }
        ...
    }
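The offset/count representation above is what makes substring cheap: the new object shares the character array instead of copying it, while concatenation has to fill a new array. The following is a minimal sketch of that idea only. The class name SharedString and its methods are hypothetical and are not the source of java.lang.String.

```java
// Hypothetical sketch: a string view that shares one char[] between objects.
final class SharedString {
    private final char[] value;  // characters, shared between views
    private final int offset;    // index of first char in value
    private final int count;     // length of this view

    SharedString(String s) {
        this(0, s.length(), s.toCharArray());
    }

    private SharedString(int offset, int count, char[] value) {
        this.offset = offset;
        this.count = count;
        this.value = value;
    }

    int length() { return count; }

    char charAt(int i) {
        if (i < 0 || i >= count) throw new IndexOutOfBoundsException("index " + i);
        return value[offset + i];
    }

    // Constant time: no characters are copied, only three fields are set.
    SharedString substring(int from, int to) {
        if (from < 0 || to > count || from > to) throw new IndexOutOfBoundsException();
        return new SharedString(offset + from, to - from, value);
    }

    // Linear time: a new array of length() + other.length() chars is filled.
    SharedString concat(SharedString other) {
        char[] c = new char[count + other.count];
        System.arraycopy(value, offset, c, 0, count);
        System.arraycopy(other.value, other.offset, c, count, other.count);
        return new SharedString(0, c.length, c);
    }

    @Override
    public String toString() { return new String(value, offset, count); }
}
```

For example, new SharedString("strings").substring(2, 7) is a view of "rings" that points into the original array.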

String vs. StringBuilder

String. [immutable] Fast substring, slow concatenation.
StringBuilder. [mutable] Slow substring, fast append.

    // quadratic time
    public static String reverse(String s) {
        String r = "";
        for (int i = s.length() - 1; i >= 0; i--)
            r += s.charAt(i);
        return r;
    }

    // linear time
    public static String reverse(String s) {
        StringBuilder r = new StringBuilder("");
        for (int i = s.length() - 1; i >= 0; i--)
            r.append(s.charAt(i));
        return r.toString();
    }

Radix Sorting

Reference: Algorithms in Java, 3rd Edition, Robert Sedgewick.

Radix sorting.
- Specialized sorting solution for strings.
- Same ideas for bits, digits, etc.

Applications.
- Sorting strings.
- Full text indexing.
- Plagiarism detection.
- Burrows-Wheeler transform. [see data compression]
- Computational molecular biology.

An Application: Redundancy Detector

Longest repeated substring.
- Given a string of N characters, find the longest repeated substring.
- Ex: a a c a a g t t t a c a a g c
- Application: computational molecular biology.

Dumb brute force.
- Try all indices i and j, and all match lengths k, and check.
- O(W N^3) time, where W is length of longest match.

[Figure: the example string a a c a a g t t t a c a a g c with start indices i and j and match length k marked.]
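The dumb brute-force method above can be sketched as three nested loops. This is only an illustration under stated assumptions: the class and method names are hypothetical, and overlapping occurrences are allowed to count as a repeat.

```java
// Hypothetical sketch of the "dumb brute force" longest repeated substring.
final class DumbLRS {
    // Try every pair of start indices i < j and every match length k.
    static String longestRepeatedSubstring(String s) {
        int n = s.length();
        String best = "";
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                for (int k = 1; j + k <= n; k++)
                    // check: do the k characters at i equal the k characters at j?
                    if (k > best.length() && s.regionMatches(i, s, j, k))
                        best = s.substring(i, i + k);
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestRepeatedSubstring("aacaagtttacaagc"));  // prints acaag
    }
}
```

It is only usable on short inputs, which is the point of the slide.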

A Sorting Solution

Brute force.
- Try all indices i and j for start of possible match, and check.
- O(W N^2) time, where W is length of longest match.

Suffix sort.
- Form N suffixes of original string.
- Sort to bring longest repeated substrings together.

    suffixes              sorted suffixes
    aacaagtttacaagc       aacaagtttacaagc
    acaagtttacaagc        aagc
    caagtttacaagc         aagtttacaagc
    aagtttacaagc          acaagc
    agtttacaagc           acaagtttacaagc
    gtttacaagc            agc
    tttacaagc             agtttacaagc
    ttacaagc              c
    tacaagc               caagc
    acaagc                caagtttacaagc
    caagc                 gc
    aagc                  gtttacaagc
    agc                   tacaagc
    gc                    ttacaagc
    c                     tttacaagc

After sorting, the adjacent suffixes acaagc and acaagtttacaagc share the longest common prefix, acaag, which is the longest repeated substring.

String Sorting

Notation.
- String = variable length sequence of characters.
- W = max # characters per string.
- N = # input strings.
- R = radix: 256 for extended ASCII, 65,536 for original UNICODE.

Java syntax.
- Array of strings: String[] a
- Number of strings: N = a.length
- The i th string: a[i]
- The d th character of the i th string: a[i].charAt(d)
- Strings to be sorted: a[0], ..., a[N-1]

Suffix Sorting: Java Implementation

Java implementation.

    import java.util.Arrays;

    public class LRS {
        public static void main(String[] args) {
            String s = StdIn.readAll();           // read input (StdIn is the course's input library)
            int N = s.length();

            String[] suffixes = new String[N];    // create suffixes (linear time)
            for (int i = 0; i < N; i++)
                suffixes[i] = s.substring(i, N);

            Arrays.sort(suffixes);                // sort (bottleneck)
            System.out.println(lcp(suffixes));    // find longest match: longest common prefix of adjacent strings
        }
    }

    % java LRS < mobydick.txt
    ,- Such a funny, sporty, gamy, jesty, joky, hoky-poky lad, is the Ocean, oh! Th
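The implementation above calls lcp(suffixes) without showing it. A minimal sketch of the whole suffix-sort approach, with a hypothetical two-argument lcp helper and without the StdIn dependency, might look like this. One caveat worth stating: in Java releases where substring copies its characters, forming the suffixes takes quadratic rather than linear time and space, so this is for illustration on small inputs.

```java
import java.util.Arrays;

// Hypothetical sketch: longest repeated substring by sorting suffixes.
final class SuffixSortLRS {
    // Longest common prefix of two strings.
    static String lcp(String s, String t) {
        int n = Math.min(s.length(), t.length());
        int k = 0;
        while (k < n && s.charAt(k) == t.charAt(k)) k++;
        return s.substring(0, k);
    }

    static String longestRepeatedSubstring(String s) {
        int n = s.length();
        String[] suffixes = new String[n];
        for (int i = 0; i < n; i++)
            suffixes[i] = s.substring(i);      // form the N suffixes
        Arrays.sort(suffixes);                 // repeated substrings become neighbors

        String best = "";
        for (int i = 1; i < n; i++) {          // compare each adjacent pair
            String p = lcp(suffixes[i - 1], suffixes[i]);
            if (p.length() > best.length()) best = p;
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestRepeatedSubstring("aacaagtttacaagc"));  // prints acaag
    }
}
```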

String Sorting Performance

    String sort      Worst case
    Brute            W N^2
    Quicksort        W N log N (probabilistic guarantee)

W = max length of string. N = number of strings (on the order of a million for Moby Dick).
[The timing column (suffix sort of Moby Dick, in seconds; the brute-force entry is an estimate) is missing from this transcription.]

Key Indexed Counting

Key indexed counting.
- Count frequencies of each letter. [d th character; d = 0 in the figure]
- Compute cumulative frequencies.
- Use cumulative frequencies to rearrange strings.

    // count frequencies
    int N = a.length;
    int[] count = new int[256+1];         // R = 256 (extended ASCII)
    for (int i = 0; i < N; i++) {
        char c = a[i].charAt(d);
        count[c+1]++;
    }

    // cumulative counts
    for (int i = 1; i < 256; i++)
        count[i] += count[i-1];

    // rearrange
    for (int i = 0; i < N; i++) {
        char c = a[i].charAt(d);
        temp[count[c]++] = a[i];
    }

[Figure: the arrays a, count, and temp for the letters a through g at each step.]
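The three steps above, plus the copy back into the original array, can be put together as one method. This is a sketch assuming radix R = 256, so every character at position d must be below 256, and every string must have a character at position d. The class and method names are hypothetical.

```java
// Hypothetical sketch: stable key-indexed counting on one character position.
final class KeyIndexedCounting {
    static final int R = 256;  // assumed radix: extended ASCII

    // Stable sort of a[] by the character at position d.
    static void sortByChar(String[] a, int d) {
        int n = a.length;
        int[] count = new int[R + 1];
        String[] temp = new String[n];

        for (int i = 0; i < n; i++)            // count frequencies
            count[a[i].charAt(d) + 1]++;
        for (int r = 1; r < R; r++)            // cumulative counts: count[c] = start of bucket c
            count[r] += count[r - 1];
        for (int i = 0; i < n; i++)            // rearrange, preserving input order within a bucket
            temp[count[a[i].charAt(d)]++] = a[i];
        System.arraycopy(temp, 0, a, 0, n);    // copy back
    }
}
```

Sorting {"ba", "ab", "bb", "aa"} on position 0 gives ab, aa, ba, bb: strings with the same first letter keep their original relative order, which is the stability that LSD radix sort relies on.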

Key Indexed Counting

Key indexed counting.
- Count frequencies of each letter. [d th character]
- Compute cumulative frequencies.
- Use cumulative frequencies to rearrange strings.

    // copy back
    for (int i = 0; i < N; i++)
        a[i] = temp[i];

[Figure: the arrays a, count, and temp for the letters a through g after the copy back.]

LSD Radix Sort

Least significant digit radix sort. Ancient method used for card-sorting.

[Figures: Lysergic Acid Diethylamide; card sorter.]

Least significant digit radix sort.
- Consider digits from right to left: use key-indexed counting to stable sort by character.

    public static void lsd(String[] a) {
        int W = a[0].length();
        for (int d = W-1; d >= 0; d--) {
            // do key-indexed counting sort on digit d
            ...
        }
    }

Assumes fixed length strings (length = W).
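Filling in the elided loop body with the key-indexed counting steps gives a complete LSD radix sort. The following is a sketch under the same assumptions as the slide (all strings have the same length W) plus an assumed radix of 256. The class name is hypothetical.

```java
// Hypothetical sketch: LSD radix sort for fixed-length strings.
final class LSD {
    static final int R = 256;  // assumed radix: extended ASCII

    static void sort(String[] a) {
        if (a.length == 0) return;
        int n = a.length;
        int w = a[0].length();             // assumes every string has length w
        String[] temp = new String[n];

        for (int d = w - 1; d >= 0; d--) { // digits from right to left
            int[] count = new int[R + 1];
            for (int i = 0; i < n; i++)    // count frequencies
                count[a[i].charAt(d) + 1]++;
            for (int r = 1; r < R; r++)    // cumulative counts
                count[r] += count[r - 1];
            for (int i = 0; i < n; i++)    // stable rearrange
                temp[count[a[i].charAt(d)]++] = a[i];
            System.arraycopy(temp, 0, a, 0, n);  // copy back
        }
    }
}
```

Each pass costs time proportional to N + R, and there are W passes.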

LSD Radix Sort: Correctness

Pf. [left-to-right]
- If two strings differ on first character, key-indexed sort puts them in proper relative order.
- If two strings agree on first character, stability keeps them in proper relative order.

Pf. [right-to-left]
- If the characters not yet examined differ, it doesn't matter what we do now.
- If the characters not yet examined agree, later pass won't affect order.

LSD Radix Sort

Running time. Θ(W(N + R)). Why doesn't it violate the N log N lower bound? (That bound applies only to sorts that work by comparing keys; key-indexed counting does not compare keys.)

Advantage. Fastest sorting method for random fixed length strings.

Disadvantages.
- Accesses memory "randomly."
- Inner loop has a lot of instructions.
- Wastes time on low-order characters.
- Doesn't work for variable-length strings.
- Not much semblance of order until very last pass.

Goal. Find fast algorithm for variable length strings.

MSD Radix Sort

Most significant digit radix sort.
- Partition file into pieces according to first character.
- Recursively sort all strings that start with the same character, etc.

Q. How to sort on d th character?
A. Use key-indexed counting.

MSD Radix Sort Implementation

    public static void msd(String[] a) {
        int N = a.length;
        msd(a, 0, N-1, 0);               // bounds are inclusive
    }

    private static void msd(String[] a, int l, int r, int d) {
        if (r <= l) return;

        // key-indexed counting sort on digit d of a[l] to a[r]
        int[] count = new int[256+1];
        ...

        // recursively sort subfiles (assumes '\0' terminated strings)
        for (int i = 0; i < 255; i++)
            msd(a, l + count[i], l + count[i+1] - 1, d+1);
    }
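Java strings are not '\0' terminated, so a runnable version needs another way to handle strings that end before position d. The sketch below is an adaptation, not the slide's code: a hypothetical charAt helper returns -1 past the end of a string, and the count array gets one extra slot for that value. The radix of 256 is an assumption.

```java
// Hypothetical sketch: MSD radix sort for variable-length strings.
final class MSD {
    static final int R = 256;  // assumed radix: extended ASCII

    // Character d of s, or -1 if s has no such character (stands in for the terminator).
    private static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : -1;
    }

    static void sort(String[] a) {
        sort(a, new String[a.length], 0, a.length - 1, 0);
    }

    // Sort a[lo..hi] (inclusive); all of these strings agree on their first d characters.
    private static void sort(String[] a, String[] temp, int lo, int hi, int d) {
        if (hi <= lo) return;

        int[] count = new int[R + 2];
        for (int i = lo; i <= hi; i++)         // count frequencies
            count[charAt(a[i], d) + 2]++;
        for (int r = 0; r < R + 1; r++)        // cumulative counts
            count[r + 1] += count[r];
        for (int i = lo; i <= hi; i++)         // rearrange
            temp[count[charAt(a[i], d) + 1]++] = a[i];
        for (int i = lo; i <= hi; i++)         // copy back
            a[i] = temp[i - lo];

        // Recursively sort the subfile for each character value;
        // strings that ended at position d are already in place at the front.
        for (int r = 0; r < R; r++)
            sort(a, temp, lo + count[r], lo + count[r + 1] - 1, d + 1);
    }
}
```

Note that every call allocates and scans a count array of size R + 2, even for a two-element subfile.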

String Sorting Performance

    String sort        Worst case
    Brute              W N^2
    Quicksort          W N log N (probabilistic guarantee)
    LSD *              W (N + R)
    MSD                W (N + R)
    MSD with cutoff    W (N + R)

R = radix. W = max length of string. N = number of strings (on the order of a million for Moby Dick).
* fixed length strings only
[The timing column (suffix sort of Moby Dick, in seconds; the brute-force entry is an estimate) is missing from this transcription.]

MSD Radix Sort: Small Files

Disadvantages.
- Too slow for small files: each call sets up a count array of size R, so MSD is many times slower than insertion sort on tiny ASCII files and thousands of times slower with UNICODE.
- Huge number of recursive calls on small files.

Solution. Cutoff to insertion sort for small N.

Consequence. Competitive with quicksort for string keys.

Recursive Structure of MSD Radix Sort

Trie structure. Describe recursive calls in MSD radix sort.

[Figure: trie of the recursive calls; null links not shown.]

Problem. Algorithm touches lots of empty nodes ala R-way tries.
- Tree can be many times bigger than it appears!
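The cutoff to insertion sort can be sketched as a helper that sorts a small subfile while skipping the first d characters, which are already known to be equal. The class name, the helper names, and the CUTOFF value of 15 are all hypothetical. The slides do not give a threshold.

```java
// Hypothetical sketch: insertion sort used as the small-file cutoff in MSD radix sort.
final class InsertionCutoff {
    static final int CUTOFF = 15;  // assumed threshold; a real value would be tuned by experiment

    // Is v less than w, given that they agree on their first d characters?
    private static boolean less(String v, String w, int d) {
        int n = Math.min(v.length(), w.length());
        for (int i = d; i < n; i++)
            if (v.charAt(i) != w.charAt(i)) return v.charAt(i) < w.charAt(i);
        return v.length() < w.length();
    }

    // Insertion sort a[lo..hi] (inclusive), comparing from character d onward.
    static void insertion(String[] a, int lo, int hi, int d) {
        for (int i = lo + 1; i <= hi; i++)
            for (int j = i; j > lo && less(a[j], a[j - 1], d); j--) {
                String t = a[j];
                a[j] = a[j - 1];
                a[j - 1] = t;
            }
    }
}
```

In an MSD sort, the base case "if (r <= l) return;" would then become a test such as "if (r <= l + CUTOFF) { insertion(a, l, r, d); return; }", so that no count array is ever created for a tiny subfile.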

Correspondence With Sorting Algorithms

Correspondence between trees and sorting algorithms.
- BSTs correspond to quicksort recursive partitioning structure.
- R-way tries correspond to MSD radix sort.
- What corresponds to ternary search tries?

3-Way Radix Quicksort

Idea 1. Use d th character to "sort" into 3 pieces instead of R, and sort each piece recursively.
Idea 2. Keep all duplicates together in partitioning step.

[Figure: 3-way partition and 3-way radix quicksort on the keys she, sells, sea, shells, by, the, sea, shore.]

Recursive Structure of MSD Radix Sort vs. 3-Way Quicksort

3-way radix quicksort collapses empty links in MSD tree.

[Figures: MSD radix sort recursion tree (null links not shown) and 3-way radix quicksort recursion tree; the null-link counts in the captions are missing.]

3-Way Radix Quicksort

    // assumes each string ends with '\0'
    private static void quicksortX(String a[], int lo, int hi, int d) {
        if (hi - lo <= 0) return;
        int i = lo-1, j = hi;
        int p = lo-1, q = hi;
        char v = a[hi].charAt(d);

        while (i < j) {                                        // repeat until pointers cross
            while (a[++i].charAt(d) < v) if (i == hi) break;   // find i on left and
            while (v < a[--j].charAt(d)) if (j == lo) break;   // j on right to swap
            if (i > j) break;
            exch(a, i, j);
            if (a[i].charAt(d) == v) exch(a, ++p, i);          // swap equal chars
            if (a[j].charAt(d) == v) exch(a, j, --q);          // to left or right
        }

        if (p == q) {                                          // special case for
            if (v != '\0') quicksortX(a, lo, hi, d+1);         // all equal chars
            return;
        }

        if (a[i].charAt(d) < v) i++;
        for (int k = lo; k <= p; k++) exch(a, k, j--);         // swap equal ones
        for (int k = hi; k >= q; k--) exch(a, k, i++);         // back to middle

        quicksortX(a, lo, j, d);                               // sort pieces recursively
        if ((i == hi) && (a[i].charAt(d) == v)) i++;
        if (v != '\0') quicksortX(a, j+1, i-1, d+1);
        quicksortX(a, i, hi, d);
    }
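For comparison, here is a shorter sketch of the same idea. It is not the code above: it swaps in the simpler Dijkstra-style 3-way partition (pointers lt, i, gt) in place of the scheme that parks equal keys at both ends, and it uses a hypothetical charAt helper that returns -1 past the end of a string instead of relying on a '\0' terminator.

```java
// Hypothetical sketch: 3-way radix quicksort with a Dijkstra-style partition.
final class Quick3String {
    // Character d of s, or -1 if s has no such character.
    private static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : -1;
    }

    private static void exch(String[] a, int i, int j) {
        String t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    static void sort(String[] a) {
        sort(a, 0, a.length - 1, 0);
    }

    // Sort a[lo..hi] (inclusive); all of these strings agree on their first d characters.
    private static void sort(String[] a, int lo, int hi, int d) {
        if (hi <= lo) return;
        int lt = lo, gt = hi, i = lo + 1;
        int v = charAt(a[lo], d);              // partitioning character
        while (i <= gt) {
            int c = charAt(a[i], d);
            if (c < v) exch(a, lt++, i++);     // smaller: move to the left piece
            else if (c > v) exch(a, i, gt--);  // larger: move to the right piece
            else i++;                          // equal: stays in the middle piece
        }
        sort(a, lo, lt - 1, d);                // less than v: same character position
        if (v >= 0) sort(a, lt, gt, d + 1);    // equal to v: move on to the next character
        sort(a, gt + 1, hi, d);                // greater than v: same character position
    }
}
```

Only the middle piece advances to character d + 1, which is how the method avoids re-examining characters already known to be equal.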

Quicksort vs. 3-Way Radix Quicksort

Quicksort.
- 2N ln N string comparisons on average.
- Long keys are costly to compare if they differ only at the end, and this is the common case!
- absolutism, absolut, absolutely, absolute.

3-way radix quicksort.
- Avoids re-comparing initial parts of the string.
- Uses just "enough" characters to resolve order.
- 2N ln N character comparisons on average for random strings.
- Sub-linear sort for large W since input is of size NW.

Theorem. Quicksort with 3-way partitioning is OPTIMAL.
Pf. Ties cost to entropy. Beyond the scope of this course.

String Sorting Performance

    String sort               Worst case
    Brute                     W N^2
    Quicksort                 W N log N (probabilistic guarantee)
    LSD *                     W (N + R)
    MSD                       W (N + R)
    MSD with cutoff           W (N + R)
    3-way radix quicksort     W N log N (probabilistic guarantee)

R = radix. W = max length of string. N = number of strings (on the order of a million for Moby Dick).
* fixed length strings only
[The timing column (suffix sort of Moby Dick, in seconds; the brute-force entry is an estimate) is missing from this transcription.]

Suffix Sorting: Worst Case Input

Length of longest match small.
- Hard to beat 3-way radix quicksort.

Length of longest match very long.
- 3-way radix quicksort is quadratic.
- Ex: two copies of Moby Dick.

Can we do better? Θ(N log N)? Θ(N)?

Suffix Sorting in Linearithmic Time: Key Idea

Observation. Must find longest repeated substring while suffix sorting to beat N^2.
[Figure: the sorted suffixes of the worst-case input "abcdefghiabcdefghi" (two copies of "abcdefghi"); each suffix of the second copy, such as "bcdefghi", sits next to the matching suffix of the first copy, such as "bcdefghiabcdefghi".]
[Figure: the cyclic shifts of the input "babaaaabcbabaaaaa", unsorted and then sorted, with index arithmetic illustrating the key idea.]

Suffix Sorting in Subquadratic Time

Manber's MSD algorithm.
- Phase 0: sort on first character using key-indexed sorting.
- Phase i: given list of suffixes sorted on first 2^(i-1) characters, create list of suffixes sorted on first 2^i characters.
- Finishes after lg N phases.

Manber's LSD algorithm.
- Same idea but go from right to left.
- O(N log N) guaranteed running time.
- O(N) extra space (but need several auxiliary arrays).

Best in theory. O(N) but more complicated to implement.

String Sorting Performance

    String sort               Worst case
    Brute                     W N^2
    Quicksort                 W N log N (probabilistic guarantee)
    LSD *                     W (N + R)
    MSD                       W (N + R)
    MSD with cutoff           W (N + R)
    3-way radix quicksort     W N log N (probabilistic guarantee)
    Manber (suffix sorting only)   N log N

R = radix. W = max length of string. N = number of strings (on the order of a million for Moby Dick, thousands for Aesop's Fables).
* fixed length strings only
[The timing columns (suffix sort in seconds for Moby Dick and for AesopAesop, two copies of Aesop's Fables) are missing from this transcription; the brute-force entry is an estimate and at least one entry reads "memory".]
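The doubling idea in Manber's MSD algorithm can be sketched as follows. This is a simplification, not Manber's algorithm as published: each phase uses a general comparison sort on pairs of ranks instead of key-indexed passes, so it runs in O(N log^2 N) rather than O(N log N). The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: suffix sorting by prefix doubling (simplified, O(N log^2 N)).
final class PrefixDoubling {
    // Returns the suffix array: result[k] is the start index of the k-th smallest suffix.
    static int[] suffixArray(String s) {
        final int n = s.length();
        int[] result = new int[n];
        if (n == 0) return result;

        Integer[] sa = new Integer[n];
        int[] rank = new int[n];   // rank[i] orders suffix i by its first h characters
        int[] next = new int[n];
        for (int i = 0; i < n; i++) { sa[i] = i; rank[i] = s.charAt(i); }

        for (int h = 1; ; h *= 2) {
            final int len = h;
            final int[] r = rank;
            // Order by the first h characters, then by the following h characters;
            // a suffix that runs out of characters gets -1 and sorts first.
            Comparator<Integer> cmp = (x, y) -> {
                if (r[x] != r[y]) return Integer.compare(r[x], r[y]);
                int rx = x + len < n ? r[x + len] : -1;
                int ry = y + len < n ? r[y + len] : -1;
                return Integer.compare(rx, ry);
            };
            Arrays.sort(sa, cmp);

            // New ranks now reflect the first 2h characters; ties share a rank.
            next[sa[0]] = 0;
            for (int k = 1; k < n; k++)
                next[sa[k]] = next[sa[k - 1]] + (cmp.compare(sa[k - 1], sa[k]) < 0 ? 1 : 0);
            int[] t = rank; rank = next; next = t;

            if (rank[sa[n - 1]] == n - 1) break;  // all ranks distinct: fully sorted
        }
        for (int k = 0; k < n; k++) result[k] = sa[k];
        return result;
    }
}
```

For "banana" this returns {5, 3, 1, 0, 4, 2}, that is, the suffixes a, ana, anana, banana, na, nana. The number of phases is at most about lg N regardless of how long the repeated substrings are, which is why an input like two copies of Moby Dick does not cause quadratic behavior.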