Strings

String. Ordered list of characters.

Ex. Natural languages, Java programs, genomic sequences, ...

"The digital information that underlies biochemistry, cell biology, and development can be represented by a simple string of G's, A's, T's and C's. This string is the root data structure of an organism's biology." -M. V. Olson

Robert Sedgewick and Kevin Wayne. Copyright http://www.princeton.edu/~cos

Using Strings in Java

String concatenation. Append one string to end of another string.
Substring. Extract a contiguous list of characters from a string.

    String s = "STRINGS";           // s = "STRINGS"
    char c = s.charAt(2);           // c = 'R'
    String t = s.substring(2, 7);   // t = "RINGS"
    String u = s + t;               // u = "STRINGSRINGS"

String Implementation in Java

Memory. Fixed overhead + 2N bytes for a virgin string of N characters (a Java char occupies 2 bytes). Could use a byte array instead of String to save space.

    public final class String implements Comparable<String> {
        private char[] value;   // characters
        private int offset;     // index of first char into array
        private int count;      // length of string
        private int hash;       // cache of hashCode()

        private String(int offset, int count, char[] value) {
            this.offset = offset;
            this.count  = count;
            this.value  = value;
        }

        // shares the underlying char array, so no characters are copied
        public String substring(int from, int to) {
            return new String(offset + from, to - from, value);
        }
        ...
String vs. StringBuilder

String. [immutable] Fast substring, slow concatenation.
StringBuilder. [mutable] Slow substring, fast append.

Quadratic time (each += copies the whole string built so far):

    public static String reverse(String s) {
        String r = "";
        for (int i = s.length() - 1; i >= 0; i--)
            r += s.charAt(i);
        return r;
    }

Linear time:

    public static String reverse(String s) {
        StringBuilder r = new StringBuilder("");
        for (int i = s.length() - 1; i >= 0; i--)
            r.append(s.charAt(i));
        return r.toString();
    }

Radix Sorting

Reference: the radix sorting chapter of Algorithms in Java, 3rd Edition, Robert Sedgewick.

Radix sorting.
- Specialized sorting solution for strings.
- Same ideas for bits, digits, etc.

Applications.
- Sorting strings.
- Full text indexing.
- Plagiarism detection.
- Burrows-Wheeler transform. [see data compression]
- Computational molecular biology.

An Application: Redundancy Detector

Longest repeated substring.
- Given a string of N characters, find the longest repeated substring.
- Ex: a a c a a g t t t a c a a g c
- Application: computational molecular biology.

Dumb brute force.
- Try all indices i and j, and all match lengths k, and check.
- O(W N^3) time, where W is length of longest match: N^2 pairs (i, j), N candidate lengths k, and each check stops after at most W + 1 character compares.

[Figure: the example string a a c a a g t t t a c a a g c with start indices i and j marked and a candidate match of length k at each.]
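The brute-force idea can be tightened slightly by extending each (i, j) match as far as it goes instead of trying every length k separately. The sketch below does that; the class and method names (`BruteLRS`, `lcpLength`, `longestRepeatedSubstring`) are made up for illustration and are not from the slides.

```java
// A minimal sketch of brute-force longest repeated substring.
// All names here are hypothetical, chosen only for this example.
class BruteLRS {
    // Number of leading characters shared by the suffixes starting at i and j (i < j).
    static int lcpLength(String s, int i, int j) {
        int k = 0;
        while (j + k < s.length() && s.charAt(i + k) == s.charAt(j + k)) k++;
        return k;
    }

    // Try every pair of start positions i < j and remember the longest match.
    static String longestRepeatedSubstring(String s) {
        int bestStart = 0, bestLen = 0;
        for (int i = 0; i < s.length(); i++) {
            for (int j = i + 1; j < s.length(); j++) {
                int len = lcpLength(s, i, j);
                if (len > bestLen) { bestLen = len; bestStart = i; }
            }
        }
        return s.substring(bestStart, bestStart + bestLen);
    }

    public static void main(String[] args) {
        // prints "acaag"
        System.out.println(longestRepeatedSubstring("aacaagtttacaagc"));
    }
}
```

Each pair costs at most W + 1 character compares, so this version runs in time proportional to W N^2. Overlapping occurrences are allowed, which is why "aaaa" yields "aaa".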
An Application: Redundancy Detector

Brute force.
- Try all indices i and j for start of possible match, and check.
- O(W N^2) time, where W is length of longest match.

[Figure: the example string a a c a a g t t t a c a a g c with start indices i and j marked.]

A Sorting Solution

Suffix sort.
- Form N suffixes of original string.
- Sort to bring longest repeated substrings together.

    Suffixes                          Sorted suffixes
    a a c a a g t t t a c a a g c     a a c a a g t t t a c a a g c
    a c a a g t t t a c a a g c       a a g c
    c a a g t t t a c a a g c         a a g t t t a c a a g c
    a a g t t t a c a a g c           a c a a g c
    a g t t t a c a a g c             a c a a g t t t a c a a g c
    g t t t a c a a g c               a g c
    t t t a c a a g c                 a g t t t a c a a g c
    t t a c a a g c                   c
    t a c a a g c                     c a a g c
    a c a a g c                       c a a g t t t a c a a g c
    c a a g c                         g c
    a a g c                           g t t t a c a a g c
    a g c                             t a c a a g c
    g c                               t t a c a a g c
    c                                 t t t a c a a g c

The longest repeated substring, a c a a g, is the longest common prefix of two adjacent sorted suffixes (the 4th and 5th).

String Sorting

Notation.
- String = variable length sequence of characters.
- W = max # characters per string.
- N = # input strings.
- R = radix: 256 for extended ASCII, 65,536 for original UNICODE.

Java syntax.
- Array of strings: String[] a
- Number of strings: N = a.length
- The i th string: a[i]
- The d th character of the i th string: a[i].charAt(d)
- Strings to be sorted: a[0], ..., a[N-1]

Suffix Sorting: Java Implementation

Java implementation.

    import java.util.Arrays;

    public class LRS {
        public static void main(String[] args) {
            String s = StdIn.readAll();              // read input
            int N = s.length();

            String[] suffixes = new String[N];       // create suffixes (linear time)
            for (int i = 0; i < N; i++)
                suffixes[i] = s.substring(i, N);

            Arrays.sort(suffixes);                   // sort and find longest match (bottleneck)
            System.out.println(lcp(suffixes));       // longest common prefix of adjacent strings
        }
    }

    % java LRS < mobydick.txt
    ,- Such a funny, sporty, gamy, jesty, joky, hoky-poky lad, is the Ocean, oh! Th
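The LRS program above calls an `lcp` helper that the slides do not show. The following self-contained sketch fills that gap under stated assumptions: it takes the input as a method argument instead of reading it with `StdIn`, and its `lcp` compares two strings rather than taking the whole array. The names `SuffixLRS` and `longestRepeatedSubstring` are hypothetical.

```java
import java.util.Arrays;

// A sketch of longest repeated substring via suffix sorting.
// Assumption: lcp here compares two strings; the slides' lcp(suffixes) is not shown.
class SuffixLRS {
    // Longest common prefix of s and t.
    static String lcp(String s, String t) {
        int n = Math.min(s.length(), t.length());
        int k = 0;
        while (k < n && s.charAt(k) == t.charAt(k)) k++;
        return s.substring(0, k);
    }

    static String longestRepeatedSubstring(String s) {
        int n = s.length();
        String[] suffixes = new String[n];
        for (int i = 0; i < n; i++) suffixes[i] = s.substring(i);   // form N suffixes
        Arrays.sort(suffixes);                                      // bring repeats together

        // The answer is the longest prefix shared by two adjacent sorted suffixes.
        String best = "";
        for (int i = 0; i + 1 < n; i++) {
            String x = lcp(suffixes[i], suffixes[i + 1]);
            if (x.length() > best.length()) best = x;
        }
        return best;
    }

    public static void main(String[] args) {
        // prints "acaag"
        System.out.println(longestRepeatedSubstring("aacaagtttacaagc"));
    }
}
```

One caveat: the "linear time" suffix creation relies on `substring` sharing the character array, as in the String implementation shown earlier. Since Java 7 update 6, `substring` copies its characters, so on current JDKs this sketch needs quadratic time and space just to build the suffixes.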
String Sorting Performance

[Table: string sort method, worst-case cost, and suffix-sort time in seconds on Moby Dick. Recoverable entries: Brute, worst case W N^2; Quicksort, worst case W N log N.]

W = max length of string. N = number of strings (on the order of a million for Moby Dick). Footnote marks: estimate; probabilistic guarantee.

Key Indexed Counting

Key indexed counting.
- Count frequencies of each letter, using the d th character as the key.
- Compute cumulative frequencies.
- Use cumulative frequencies to rearrange strings.

Count frequencies (256 = R for extended ASCII; the count for character c is stored at c+1):

    int N = a.length;
    int[] count = new int[256+1];
    for (int i = 0; i < N; i++) {
        char c = a[i].charAt(d);
        count[c+1]++;
    }

Cumulative counts (afterwards count[c] is the number of keys smaller than c):

    for (int i = 1; i < 256; i++)
        count[i] += count[i-1];

Rearrange:

    for (int i = 0; i < N; i++) {
        char c = a[i].charAt(d);
        temp[count[c]++] = a[i];
    }

[Figure: the input array a, the count array for the letters a through g, and the output array temp at each step.]
Key Indexed Counting

Key indexed counting.
- Count frequencies of each letter, using the d th character as the key.
- Compute cumulative frequencies.
- Use cumulative frequencies to rearrange strings.

Copy back:

    for (int i = 0; i < N; i++)
        a[i] = temp[i];

[Figure: the arrays a, count, and temp after the copy back.]

LSD Radix Sort

Least significant digit radix sort. Ancient method used for card-sorting.

[Figures: "Lysergic Acid Diethylamide" and "Card Sorter".]

Least significant digit radix sort.
- Consider digits from right to left: use key-indexed counting to stable sort by character.

    public static void lsd(String[] a) {
        int W = a[0].length();
        for (int d = W-1; d >= 0; d--) {
            // do key-indexed counting sort on digit d
            ...
        }
    }

Assumes fixed length strings (length = W).
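Putting the four key-indexed counting steps inside the right-to-left loop gives a complete LSD sort. This is a minimal sketch under two assumptions that are stated in the code as well: every string has the same length, and every character is below the radix R = 256. The class name `LSDSketch` is made up for the example.

```java
import java.util.Arrays;

// A sketch of LSD radix sort for fixed-length strings.
// Assumptions: all strings have equal length; all characters are < R (extended ASCII).
class LSDSketch {
    static final int R = 256;   // radix

    static void lsd(String[] a) {
        int n = a.length;
        if (n == 0) return;
        int w = a[0].length();
        String[] temp = new String[n];

        for (int d = w - 1; d >= 0; d--) {              // digits from right to left
            int[] count = new int[R + 1];
            for (int i = 0; i < n; i++)                 // count frequencies
                count[a[i].charAt(d) + 1]++;
            for (int r = 1; r <= R; r++)                // cumulative counts
                count[r] += count[r - 1];
            for (int i = 0; i < n; i++)                 // rearrange (stable)
                temp[count[a[i].charAt(d)]++] = a[i];
            System.arraycopy(temp, 0, a, 0, n);         // copy back
        }
    }

    public static void main(String[] args) {
        String[] a = { "dab", "add", "cab", "fad", "fee", "bad",
                       "dad", "bee", "fed", "bed", "ebb", "ace" };
        lsd(a);
        System.out.println(Arrays.toString(a));
    }
}
```

The rearrange step scans the input left to right and places equal keys in the order it meets them, which is what makes each pass stable and the whole sort correct.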
LSD Radix Sort: Correctness

Pf. [left-to-right]
- If two strings differ on first character, key-indexed sort puts them in proper relative order.
- If two strings agree on first character, stability keeps them in proper relative order.

Pf. [right-to-left]
- If the characters not yet examined differ, it doesn't matter what we do now.
- If the characters not yet examined agree, later pass won't affect order.

LSD Radix Sort

Running time. Θ(W(N + R)): W passes of key-indexed counting, each taking time proportional to N + R. Why doesn't it violate the N log N lower bound?

Advantage. Fastest sorting method for random fixed length strings.

Disadvantages.
- Accesses memory "randomly."
- Inner loop has a lot of instructions.
- Wastes time on low-order characters.
- Doesn't work for variable-length strings.
- Not much semblance of order until very last pass.

Goal. Find fast algorithm for variable length strings.

MSD Radix Sort

Most significant digit radix sort.
- Partition file into pieces according to first character.
- Recursively sort all strings that start with the same character, etc.

Q. How to sort on d th character?
A. Use key-indexed counting.

MSD Radix Sort Implementation

    public static void msd(String[] a) {
        int N = a.length;
        msd(a, 0, N-1, 0);   // bounds are inclusive
    }

    private static void msd(String[] a, int l, int r, int d) {
        if (r <= l) return;

        // key-indexed counting sort on digit d of a[l] to a[r]
        int[] count = new int[256+1];
        ...

        // recursively sort subfiles; assumes '\0' terminated.
        // After the counting sort, count[i] is the start of the subfile for character i+1.
        for (int i = 0; i < 255; i++)
            msd(a, l + count[i], l + count[i+1] - 1, d+1);
    }
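The MSD outline above elides the counting step and assumes '\0'-terminated strings. Java strings carry no terminator, so the sketch below swaps in a different convention: a hypothetical `charAt(s, d)` helper returns -1 past the end of the string, and the count array gets one extra slot for it. The class name `MSDSketch` and the shared `temp` array are also choices made for this example, not taken from the slides.

```java
import java.util.Arrays;

// A sketch of MSD radix sort for variable-length strings.
// Assumptions: characters are < R; end of string is modeled as -1 instead of '\0'.
class MSDSketch {
    static final int R = 256;   // radix

    // Character d of s, or -1 if s has no such character.
    static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : -1;
    }

    static void msd(String[] a) {
        msd(a, new String[a.length], 0, a.length - 1, 0);
    }

    private static void msd(String[] a, String[] temp, int l, int r, int d) {
        if (r <= l) return;

        // key-indexed counting sort on digit d of a[l] to a[r]
        int[] count = new int[R + 2];
        for (int i = l; i <= r; i++) count[charAt(a[i], d) + 2]++;
        for (int c = 0; c < R + 1; c++) count[c + 1] += count[c];
        for (int i = l; i <= r; i++) temp[count[charAt(a[i], d) + 1]++] = a[i];
        for (int i = l; i <= r; i++) a[i] = temp[i - l];

        // After distribution, count[c] is the start of the subfile for character c.
        // Strings that ended at position d sit before count[0] and need no more work.
        for (int c = 0; c < R; c++)
            msd(a, temp, l + count[c], l + count[c + 1] - 1, d + 1);
    }

    public static void main(String[] args) {
        String[] a = { "she", "sells", "seashells", "by", "the", "sea", "shore" };
        msd(a);
        System.out.println(Arrays.toString(a));
    }
}
```

Every call allocates and scans a count array of size R + 2, even for a two-element subfile. That fixed cost is what the small-file discussion is about.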
String Sorting Performance

[Table: string sort method, worst-case cost, and suffix-sort time in seconds on Moby Dick. Rows: Brute, Quicksort, LSD *, MSD. Recoverable entries: Brute, worst case W N^2; Quicksort, worst case W N log N; LSD has no suffix-sort time.]

R = radix. W = max length of string. N = number of strings (on the order of a million for Moby Dick). * fixed length strings only. Other footnote marks: estimate; probabilistic guarantee.

MSD Radix Sort: Small Files

Disadvantages.
- Too slow for small files: each call sets up a count array of size R, so for tiny N the sort is many times slower than insertion sort with ASCII and slower by a factor in the thousands with UNICODE.
- Huge number of recursive calls on small files.

Solution. Cutoff to insertion sort for small N.

Consequence. Competitive with quicksort for string keys.

String Sorting Performance

[Table: the same comparison with a row added for MSD with cutoff.]

Recursive Structure of MSD Radix Sort

Trie structure. Describe recursive calls in MSD radix sort.

[Figure: recursion tree of MSD radix sort; null links not shown.]

Problem. Algorithm touches lots of empty nodes ala R-way tries.
- Tree can be as much as R times bigger than it appears, since each node has R links.
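The cutoff can be sketched as an insertion sort that compares strings only from position d onward, since MSD radix sort has already established that the strings in a subfile agree on their first d characters. The names `CutoffSketch`, `less`, and `insertion` are hypothetical, and so is the cutoff threshold mentioned afterwards.

```java
import java.util.Arrays;

// A sketch of the insertion-sort cutoff for small subfiles in MSD radix sort.
// Assumption: every string in a[lo..hi] has at least d characters and they agree on the first d.
class CutoffSketch {
    // Is v less than w, looking only at characters d, d+1, ...?
    static boolean less(String v, String w, int d) {
        int n = Math.min(v.length(), w.length());
        for (int i = d; i < n; i++)
            if (v.charAt(i) != w.charAt(i)) return v.charAt(i) < w.charAt(i);
        return v.length() < w.length();   // a proper prefix sorts first
    }

    // Insertion sort a[lo..hi] (inclusive), skipping the first d characters.
    static void insertion(String[] a, int lo, int hi, int d) {
        for (int i = lo; i <= hi; i++)
            for (int j = i; j > lo && less(a[j], a[j - 1], d); j--) {
                String t = a[j]; a[j] = a[j - 1]; a[j - 1] = t;
            }
    }

    public static void main(String[] args) {
        String[] a = { "shore", "she", "shells", "sh" };   // all agree on "sh"
        insertion(a, 0, a.length - 1, 2);
        System.out.println(Arrays.toString(a));
    }
}
```

In an MSD implementation, the first line of the recursive method would become a test such as `if (r - l < CUTOFF) { insertion(a, l, r, d); return; }`, where `CUTOFF` is a small tuning constant chosen by experiment.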
Correspondence With Sorting Algorithms

Correspondence between trees and sorting algorithms.
- BSTs correspond to quicksort recursive partitioning structure.
- R-way tries correspond to MSD radix sort.
- What corresponds to ternary search tries?

3-Way Radix Quicksort

Idea. Use d th character to "sort" into 3 pieces instead of R, and sort each piece recursively.

Idea. Keep all duplicates together in partitioning step.

[Figure: 3-way partition and 3-way radix quicksort on the words she, sells, sea, shells, by, the, sea, shore.]

Recursive Structure of MSD Radix Sort vs. 3-Way Quicksort

3-way radix quicksort collapses empty links in MSD tree.

[Figures: MSD radix sort recursion tree (null links not shown) and 3-way radix quicksort recursion tree, which has far fewer null links.]

3-Way Radix Quicksort

    private static void quicksortX(String a[], int lo, int hi, int d) {
        if (hi - lo <= 0) return;
        int i = lo-1, j = hi;
        int p = lo-1, q = hi;
        char v = a[hi].charAt(d);

        while (i < j) {                                          // repeat until pointers cross
            while (a[++i].charAt(d) < v) if (i == hi) break;     // find i on left and
            while (v < a[--j].charAt(d)) if (j == lo) break;     // j on right to swap
            if (i > j) break;
            exch(a, i, j);
            if (a[i].charAt(d) == v) exch(a, ++p, i);            // swap equal chars
            if (a[j].charAt(d) == v) exch(a, j, --q);            // to left or right
        }

        if (p == q) {                                            // special case for
            if (v != '\0') quicksortX(a, lo, hi, d+1);           // all equal chars
            return;
        }

        if (a[i].charAt(d) < v) i++;
        for (int k = lo; k <= p; k++) exch(a, k, j--);           // swap equal ones
        for (int k = hi; k >= q; k--) exch(a, k, i++);           // back to middle

        quicksortX(a, lo, j, d);                                 // sort pieces recursively
        if ((i == hi) && (a[i].charAt(d) == v)) i++;
        if (v != '\0') quicksortX(a, j+1, i-1, d+1);
        quicksortX(a, i, hi, d);
    }
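The code above moves equal keys to the two ends and swaps them back to the middle. The same 3-way idea can be shown more compactly with a single left-to-right partitioning scan (Dijkstra-style), which is a different partitioning scheme from the one on the slide. The sketch also models end of string as -1 through a hypothetical `charAt` helper, so the strings need no '\0' terminator; the class name `Quick3StringSketch` is made up for the example.

```java
import java.util.Arrays;

// A sketch of 3-way radix quicksort using a single-scan 3-way partition.
// Assumption: end of string is modeled as -1 instead of a '\0' terminator.
class Quick3StringSketch {
    static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : -1;
    }

    static void sort(String[] a) {
        sort(a, 0, a.length - 1, 0);
    }

    // Sort a[lo..hi], whose strings agree on their first d characters.
    private static void sort(String[] a, int lo, int hi, int d) {
        if (hi <= lo) return;
        int lt = lo, gt = hi;
        int v = charAt(a[lo], d);              // partitioning character
        int i = lo + 1;
        while (i <= gt) {
            int t = charAt(a[i], d);
            if      (t < v) exch(a, lt++, i++);
            else if (t > v) exch(a, i, gt--);
            else            i++;
        }
        // Now a[lo..lt-1] < v, a[lt..gt] == v, a[gt+1..hi] > v on character d.
        sort(a, lo, lt - 1, d);
        if (v >= 0) sort(a, lt, gt, d + 1);    // equal piece moves on to the next character
        sort(a, gt + 1, hi, d);
    }

    private static void exch(String[] a, int i, int j) {
        String t = a[i]; a[i] = a[j]; a[j] = t;
    }

    public static void main(String[] args) {
        String[] a = { "she", "sells", "seashells", "by", "the", "sea", "shore" };
        sort(a);
        System.out.println(Arrays.toString(a));
    }
}
```

Only the middle piece advances to character d+1, so characters already known to be equal are never compared again.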
Quicksort vs. 3-Way Radix Quicksort

Quicksort.
- 2N ln N string comparisons on average.
- Long keys are costly to compare if they differ only at the end, and this is the common case!
- absolutism, absolut, absolutely, absolute.

3-way radix quicksort.
- Avoids re-comparing initial parts of the string.
- Uses just "enough" characters to resolve order.
- 2N ln N character comparisons on average for random strings.
- Sub-linear sort for large W since input is of size NW.

Theorem. Quicksort with 3-way partitioning is OPTIMAL.
Pf. Ties cost to entropy. Beyond the scope of this course.

String Sorting Performance

[Table: string sort method, worst-case cost, and suffix-sort time in seconds on Moby Dick. Rows: Brute, Quicksort, LSD *, MSD, MSD with cutoff, 3-way radix quicksort. Recoverable entries: Brute, worst case W N^2; Quicksort, worst case W N log N; 3-way radix quicksort, worst case W N log N; LSD has no suffix-sort time.]

R = radix. W = max length of string. N = number of strings (on the order of a million for Moby Dick). * fixed length strings only. Other footnote marks: estimate; probabilistic guarantee.

Suffix Sorting: Worst Case Input

Length of longest match small.
- Hard to beat 3-way radix quicksort.

Length of longest match very long.
- 3-way radix quicksort is quadratic.
- Ex: two copies of Moby Dick.

Can we do better? Θ(N log N)? Θ(N)?

Observation. Must find longest repeated substring while suffix sorting to beat N^2.

Suffix Sorting in Linearithmic Time: Key Idea
[Figure, worst case input: sorted suffixes of "abcdefghiabcdefghi". Adjacent suffixes share long prefixes.]

    abcdefghi
    abcdefghiabcdefghi
    bcdefghi
    bcdefghiabcdefghi
    cdefghi
    cdefghiabcdefghi
    defghi
    defghiabcdefghi
    efghi
    efghiabcdefghi
    fghi
    fghiabcdefghi
    ghi
    ghiabcdefghi
    hi
    hiabcdefghi
    i
    iabcdefghi

Input: "abcdefghiabcdefghi"

[Figure, key idea: the cyclic shifts of "babaaaabcbabaaaaa", listed in index order. A second panel shows the same strings after sorting, with index arithmetic annotations.]

    babaaaabcbabaaaaa
    abaaaabcbabaaaaab
    baaaabcbabaaaaaba
    aaaabcbabaaaaabab
    aaabcbabaaaaababa
    aabcbabaaaaababaa
    abcbabaaaaababaaa
    bcbabaaaaababaaaa
    cbabaaaaababaaaab
    babaaaaababaaaabc
    abaaaaababaaaabcb
    baaaaababaaaabcba
    aaaaababaaaabcbab
    aaaababaaaabcbaba
    aaababaaaabcbabaa
    aababaaaabcbabaaa
    ababaaaabcbabaaaa

Input: "babaaaabcbabaaaaa"
Suffix Sorting in Subquadratic Time

Manber's MSD algorithm.
- Phase 0: sort on first character using key-indexed sorting.
- Phase i: given list of suffixes sorted on first 2^(i-1) characters, create list of suffixes sorted on first 2^i characters.
- Finishes after lg N phases, because the number of sorted characters doubles in each phase.

Manber's LSD algorithm.
- Same idea but go from right to left.
- O(N log N) guaranteed running time.
- O(N) extra space (but need several auxiliary arrays).

Best in theory. O(N) but more complicated to implement.

String Sorting Performance

[Table: string sort method, worst-case cost, and suffix-sort time in seconds on Moby Dick and on AesopAesop. Rows: Brute, Quicksort, LSD *, MSD, MSD with cutoff, 3-way radix quicksort, Manber. Recoverable entries: Brute, worst case W N^2; Quicksort, worst case W N log N; 3-way radix quicksort, worst case W N log N; Manber, worst case N log N; LSD has no suffix-sort time; at least one AesopAesop entry is "memory" instead of a time.]

R = radix. W = max length of string. N = number of strings (on the order of a million for Moby Dick, some thousands for Aesop's Fables). * fixed length strings only. Other footnote marks: estimate; probabilistic guarantee; suffix sorting only.
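The doubling idea can be sketched as follows. If every suffix has a rank based on its first h characters, then suffix i can be compared with suffix j on the first 2h characters by comparing the pairs (rank[i], rank[i+h]) and (rank[j], rank[j+h]). This sketch is a simplification and not Manber's algorithm as such: it re-sorts with `Arrays.sort` in each phase, so it runs in O(N log^2 N) time instead of O(N log N). The names `DoublingSuffixSort` and `suffixArray` are hypothetical.

```java
import java.util.Arrays;

// A sketch of suffix sorting by prefix doubling.
// Simplification: each phase uses a comparison sort, giving O(N log^2 N) overall.
class DoublingSuffixSort {
    // Compare suffixes x and y on their first 2h characters, given ranks on the first h.
    private static int compare(int[] rank, int h, int x, int y) {
        if (rank[x] != rank[y]) return Integer.compare(rank[x], rank[y]);
        int rx = x + h < rank.length ? rank[x + h] : -1;   // -1: nothing left, sorts first
        int ry = y + h < rank.length ? rank[y + h] : -1;
        return Integer.compare(rx, ry);
    }

    // Returns sa, where sa[k] is the start index of the k-th smallest suffix of s.
    static int[] suffixArray(String s) {
        int n = s.length();
        Integer[] order = new Integer[n];
        int[] rank = new int[n];
        for (int i = 0; i < n; i++) { order[i] = i; rank[i] = s.charAt(i); }

        for (int h = 1; n > 0; h *= 2) {
            final int[] r = rank;
            final int step = h;
            Arrays.sort(order, (x, y) -> compare(r, step, x, y));

            // Re-rank: suffixes that tie on the first 2h characters share a rank.
            int[] next = new int[n];
            for (int k = 1; k < n; k++)
                next[order[k]] = next[order[k - 1]]
                        + (compare(r, step, order[k - 1], order[k]) < 0 ? 1 : 0);
            rank = next;
            if (rank[order[n - 1]] == n - 1) break;        // all ranks distinct: done
        }

        int[] sa = new int[n];
        for (int k = 0; k < n; k++) sa[k] = order[k];
        return sa;
    }

    public static void main(String[] args) {
        // prints [5, 3, 1, 0, 4, 2]
        System.out.println(Arrays.toString(suffixArray("banana")));
    }
}
```

Because each phase compares suffixes through two integer ranks, the cost per comparison no longer depends on the length of the longest match. That is the property that helps on inputs such as two concatenated copies of a text.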