Using Edit-Distance Functions to Identify Similar Addresses Howard Schreier, U.S. Dept. of Commerce, Washington DC
|
|
|
- Martina Johnston
- 9 years ago
- Views:
Transcription
1 Paper Using Edit-Distance Functions to Identify Similar Addresses Howard Schreier, U.S. Dept. of Commerce, Washington DC ABSTRACT Version 9 of SAS software has added functions which can efficiently help to assess the degree to which different character strings are similar or have elements in common. These do not necessarily supersede the simpler functions provided in earlier versions. In some cases it makes sense to use old and new together. INTRODUCTION We want to summarize the archives of the SAS-L list by enumerating the message authors and the number of messages written by each. So we extract the addresses and use PROC FREQ or the equivalent to come up with the counts. THE PROBLEM IN MINIATURE The output, in part, looks something like this: SendingAddress Count [email protected] 101 [email protected] 3 [email protected] 55 [email protected] 44 [email protected] 14 [email protected] 17 [email protected] 14 The problem is that the observations represent distinct addresses rather than distinct individuals. People s addresses change and vary, for a number of reasons: new jobs; new internet service providers; transparent changes introduced by systems administrators. It s also common for individuals to have multiple addresses in simultaneous use. So, accurate counts cannot be developed unless we can aggregate over multiple addresses which actually belong to the same individual. This is a variant of the oft-discussed fuzzy merge problem. In this example, what we really want is something like: SendingAddress Count [email protected] 132 [email protected] 72 [email protected] 44 where we have arbitrarily kept the address with the highest count to identify each individual. Since addresses can be cryptic, even better is a list identifying people by name: LastName FirstName Count Sugigoer Susan 132 Sasmaniac Samuel 72 Sowhat Sarah 44 1
2 THE ACTUAL PROBLEM The SAS-L list has been around since 1986, and more than twenty thousand distinct addresses have been used to post questions, answers, and comments over the years. The total number of postings is in the hundreds of thousands. It is only realistic to accept that the aggregation process (1) cannot be completely automated; (2) cannot be done with complete accuracy; and (3) cannot, as a practical matter, be completed for all individuals represented in the data. Since the primary purpose is to identify the leading contributors in approximate rank order, we can accept (1) relatively small errors in the end results and (2) omission of any individual who posted only a small number of times. Our goal is to develop a computer-supported manual research system. The computer will be expected to identify relatively small subsets of observations which seem like they might be aliases for various individuals. THE FUNCTIONS SAS now offers several functions (SPEDIS, COMPLEV, and COMPGED) which can aid in solving such problems by quantifying the degree to which two character values are similar. These supplement simpler functions (such as INDEX and SCAN) which can reveal whether or not substrings match. COMPGED EXAMPLE Here s an example: data _null_; ged = compged('baby','baboon'); put ged=; ged = compged('balloon','baboon'); put ged=; run; When this code is run, the log displays: ged=120 ged=200 In the first case, SAS transformed baboon to babyon and assessed a cost of 100 for the replacement of one character with another; then it truncated the string, first to babyo and finally to baby, with a cost of 10 for each truncation. The total cost, 120, is the result returned by the function. In the second case, a replacement changed baboon to baloon, again at a cost of 100; then insertion of a letter ( l ) at the cost of 100 completed the process and brought the result to 200. The implication is that baby is more similar to baboon than is balloon to baboon. MORE ABOUT COMPGED The COMPGED function has in its repertoire a number of other operations (for example, transposition of adjacent characters), each of which has a cost coefficient. The details, and more, are presented in the documentation. COMPGED also has an algorithm to guide it to the least-cost sequence of operations for any transformation. Associated with the COMPGED function is the CALL COMPCOST routine, which allows you to change cost coefficients and even selectively disable operations. Use of this routine is optional. It does permit you to tune the COMPGED function for particular purposes. 2
3 COMPLEV The COMPLEV function computes the so-called Levenshtein edit distance to compare two character strings. It counts the minimum number of single-character insert, delete, or replace operations needed to accomplish the transformation, and thus is essentially a streamlined special case of the COMPGED function. It is less versatile than COMPGED, but it executes faster. SPEDIS The SPEDIS (spelling distance) function has been part of SAS longer than the other two functions in this category. It is similar in nature to COMPGED, but less efficient in execution. THE PROCESS IN GENERAL We ll now turn to the process which I developed to aggregate, by author, message counts from the archives of SAS-L. This method for identifying groups of addresses belonging to individual SAS-L contributors involves a combination of computer processing and human research and analysis. There are two elements to the human research and analysis. The first is to look at possible aliases as identified by the computer and select those which appear most plausible. The other is to use the Advanced Search screen of the Google Groups web site to find the real names (or in some cases partial names or screen names) corresponding to these selected addresses. Triage is essential, to reduce both the human and computer processing burdens. Addresses used very few times are completely excluded, and among the remainder, priority is given to those addresses used the most. This vastly reduces the scale of the problem with little sacrifice in the scope or accuracy of the final results. The process is iterative. The results of one cycle provide input for the next. These techniques are hardly perfect. The mechanical matching generates many false positives and, presumably, false negatives. Even without such errors, the strategy would miss a person who posted a large number of messages by using many addresses a relatively few times each. But that seems an unlikely scenario, so barring major errors in the manual analysis, the results should be quite satisfactory. THE PROCESS IN DETAIL With the exception of the step which performs the fuzzy merge, the SAS code is very straightforward and does not merit any special explanation. So we will begin with a schematic view of the entire process, then examine the code which is of particular interest. FLOW CHART Below is a flow chart depicting the process. The red, round-cornered boxes with labels identified by numerals are SAS datasets. The blue, square-cornered boxes with labels identified by letters of the alphabet are processing steps. 3
4 D. Accrete 5. Identified 1. All A. Filter 3. High Frequency B. Sort 2. Not Yet Identified C. Research 4. Newly Identified 6. Likely Aliases E. Fuzzy Merge Several terms will be used repeatedly in the notes which follow: address: an address used to send messages to SAS-L between 1986 and 2003 identification: the process of associating a person with an address, usually in terms of first name and last name, but occasionally in terms of just one or the other or a screen name alias: one of multiple addresses belonging to the same person frequency: number of SAS-L messages originated from an address. Notes: (1) This file (derived from the LISTSERV archives at Marist University) contains one observation for each distinct originating address (of which there are around 23 thousand), along with a count indicating the number of SAS-L messages from that address (November 1986 through December 2003). As might be expected, a small number of addresses are associated with a large number of postings. More than 10 thousand (or nearly half) of the addresses have been used exactly once, and almost 18 thousand have been used five or fewer times. On the other hand, there are perhaps 250 addresses associated with 100 or more postings each. (A) This process removes the already-identified addresses (5) from the file of all addresses (1) to yield a file of not-yet-identified addresses. For practical reasons, it also removes addresses with very low frequencies. This greatly expedites other processes (C, E) at little cost in terms of the end results. On the first pass, there 4
5 are no identified addresses, so (5) is empty and only low-frequency addresses are removed. (2) This file contains all addresses not yet identified except for those with very low frequencies, which theoretically should be included, but are excluded for practical reasons. (B) This process sorts the list of not-yet-identified addresses by frequency, to isolate those with the highest frequencies. These higher-frequency addresses are generally more worth researching. (3) This file exists to present addresses which are not yet identified but which have relatively high frequencies. (C) This process entails manual research to identify the person to which a particular address belongs or belonged. Usually the identification is in terms of a first name and a last name, but in some cases it is just one or the other, or a screen name. In some cases, identification may be obvious (eg, samuel.sasmaniac@ ). In general however, it is not (eg, suzy1342@ ), and the research is done using the Advanced Search screen of the Google Groups site. It is not practical to research all addresses. Since the objective is to get reasonably accurate counts for the most active posters, two selection criteria are used to pick addresses to research high frequency similarity to previously identified addresses Two inputs are available to guide the process. The High Frequency list (3) shows the not yet identified addresses with the highest frequencies; this is the only guidance available during the initial iteration. The Likely Aliases file (6) identifies addresses which seem (to the computer) similar to those already identified. It also includes frequencies for those addresses. The research task includes weeding out obvious false positives. Frequencies of the likely aliases may also be considered, but with a different threshold (since, for the ultimate purpose of the exercise, identifying an address used 30 times by somebody who has made 500 other postings using other addresses is more important than identifying an address used 60 times by somebody who has not used other addresses). (4) This file lists only those addresses identified in the most recent cycle. This shrinks both the join in (E), saving computer processing time, and the results of the join (6), sparing the researcher the trouble of reviewing the same possible matches again and again. (D) This process adds the most recent set of identified addresses to the list of those identified in previous cycles. (5) This file contains the cumulative list of identified addresses. It is the end result of the whole exercise. (E) This process performs a cartesian join pairing each newly identified address (4) with each unidentified one (2), then uses a formula employing the COMPGED function to quantify similarity. Thresholds, based on three scenarios, are applied to keep pairs with low distances in a list of likely aliases (6). 5
6 In making the comparisons, each address is broken into two components: the box (characters preceding ) and the system (characters following ), after removing most non-alphabetic characters (on the theory that they reveal little about similarity among addresses). (6) This file is the list of likely aliases, enumerating addresses which seem like they might belong to people who are already identified with other addresses. To the extent that it is complete (few false negatives), the ultimate results should be rather accurate. CODE These are the tables and columns mentioned in the SQL statements which perform the fuzzy merge. Tables: NEWNAMES: newly identified addresses (4) NOTYETID: not yet identified addresses (2) POSSIBLE: likely aliases for the newly identified addresses, drawn from the not yet identified addresses; some repetition DISTINCT: likely aliases (6) for the newly identified addresses, drawn from the not yet identified addresses; no repetition Columns: LAST: last name FIRST: first name address BOX: left portion of address, with digits and most special characters removed SYSTEM: right portion of address, with digits and most special characters removed BIG5: flag indicating that address is in one of five widely used domains (AOL, MSN, Compuserve, Deja, Yahoo) MANY: frequency BOX_BOX: COMPGED score for BOX values LAST_BOX: COMPGED score for a LAST value (from NEWNAMES) and a BOX value (from NOTYETID) SYSTEM_SYSTEM: COMPGED score for SYSTEM values 6
7 Here is the PROC SQL code used. First, the cartesian join is formed, and filtered: create table possible as select last, first, notyetid. , notyetid.box, notyetid.system, big5, notyetid.many, compged(newnames.box,notyetid.box,400) as box_box, compged(newnames.last,notyetid.box,300) as last_box, compged(newnames.system,notyetid.system,300) as system_system from notyetid, newnames where newnames. ne notyetid. and ( /* Scenario: corporate change */ calculated system_system < 300 and calculated box_box < 400 and newnames.box ne '_' and notyetid.box ne '_' and not big5 or /* Scenario: address resembles surname */ calculated last_box < 300 and substr(newnames.last,1,3) ne '[?]' and notyetid.box ne '_' or /* Scenario: new job or ISP */ calculated box_box < 200 and newnames.box ne '_' and notyetid.box ne '_' ) ; These are the three scenarios underlying the filters which appear in the WHERE clause. A company (or university, or government agency) refines or upgrades its system. The result is that an address like [email protected] is replaced with one like [email protected]. Since both components (left and right of ) ought to be similar, both BOX_BOX and SYSTEM_SYSTEM are in the filter. An individual at some times uses an address which does not reflect his or her surname, but at other times uses one which does include at least an approximation of the surname. Example: [email protected] belongs to Susan Sugigoer, as does [email protected]. The test for this scenario is to compare NEWNAMES.LAST with NOTYETID.BOX. This approach is not symmetric (identifying [email protected] with Susan Sugigoer will not help with identification of [email protected]). 7
8 An individual gets a job, or a new job, or a new internet service provider. Example: [email protected] becomes [email protected]. To model this, only the left component of the address is considered, so BOX_BOX is in the filter. The underscore character and the bracketed question mark are conventions used for missing values. Testing for them in the WHERE clause helps to reduce false positives, as does including the BIG5 flag for widely used domains such as yahoo.com. The (numeric and optional) third argument to the COMPGED function is a cutoff which can avoid needless computation by terminating evaluation once the cost reaches the specified value. It is explained in the documentation. The code as shown is fairly arbitrary. It was developed through trial and error. Additional tuning and elaboration could undoubtedly improve its efficiency and accuracy. The table POSSIBLE includes duplicate address pairs. These are eliminated by means of the DISTINCT keyword: create table distinct as select distinct last, first, , many from possible order by last, first, many descending; The table DISTINCT is the list of likely aliases (6). CONCLUSION COMPGED and similar functions for quantifying the degree of similarity between character values can be quite useful in semi-automated fuzzy merge processes. REFERENCES Google, Advanced Groups Search, Marist University, Listserv logs for SAS-L, SAS Institute Inc., SAS OnlineDoc 9, Schreier, Howard, Ask and Ye Shall Receive: Getting the Most from SAS-L, CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: Howard Schreier U.S. Dept. of Commerce Mail Stop H-2015 Washington DC [email protected] SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 8
Paper 109-25 Merges and Joins Timothy J Harrington, Trilogy Consulting Corporation
Paper 109-25 Merges and Joins Timothy J Harrington, Trilogy Consulting Corporation Abstract This paper discusses methods of joining SAS data sets. The different methods and the reasons for choosing a particular
Performing Queries Using PROC SQL (1)
SAS SQL Contents Performing queries using PROC SQL Performing advanced queries using PROC SQL Combining tables horizontally using PROC SQL Combining tables vertically using PROC SQL 2 Performing Queries
IBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
IBM SPSS Direct Marketing 22
IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release
Relational Database: Additional Operations on Relations; SQL
Relational Database: Additional Operations on Relations; SQL Greg Plaxton Theory in Programming Practice, Fall 2005 Department of Computer Science University of Texas at Austin Overview The course packet
Paper 70-27 An Introduction to SAS PROC SQL Timothy J Harrington, Venturi Partners Consulting, Waukegan, Illinois
Paper 70-27 An Introduction to SAS PROC SQL Timothy J Harrington, Venturi Partners Consulting, Waukegan, Illinois Abstract This paper introduces SAS users with at least a basic understanding of SAS data
Innovative Techniques and Tools to Detect Data Quality Problems
Paper DM05 Innovative Techniques and Tools to Detect Data Quality Problems Hong Qi and Allan Glaser Merck & Co., Inc., Upper Gwynnedd, PA ABSTRACT High quality data are essential for accurate and meaningful
White Paper. Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices.
White Paper Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices. Contents Data Management: Why It s So Essential... 1 The Basics of Data Preparation... 1 1: Simplify Access
Handling Missing Values in the SQL Procedure
Handling Missing Values in the SQL Procedure Danbo Yi, Abt Associates Inc., Cambridge, MA Lei Zhang, Domain Solutions Corp., Cambridge, MA ABSTRACT PROC SQL as a powerful database management tool provides
Using SQL Queries in Crystal Reports
PPENDIX Using SQL Queries in Crystal Reports In this appendix Review of SQL Commands PDF 924 n Introduction to SQL PDF 924 PDF 924 ppendix Using SQL Queries in Crystal Reports The SQL Commands feature
Chapter 1 Overview of the SQL Procedure
Chapter 1 Overview of the SQL Procedure 1.1 Features of PROC SQL...1-3 1.2 Selecting Columns and Rows...1-6 1.3 Presenting and Summarizing Data...1-17 1.4 Joining Tables...1-27 1-2 Chapter 1 Overview of
Foundations & Fundamentals. A PROC SQL Primer. Matt Taylor, Carolina Analytical Consulting, LLC, Charlotte, NC
A PROC SQL Primer Matt Taylor, Carolina Analytical Consulting, LLC, Charlotte, NC ABSTRACT Most SAS programmers utilize the power of the DATA step to manipulate their datasets. However, unless they pull
SQL SUBQUERIES: Usage in Clinical Programming. Pavan Vemuri, PPD, Morrisville, NC
PharmaSUG 2013 Poster # P015 SQL SUBQUERIES: Usage in Clinical Programming Pavan Vemuri, PPD, Morrisville, NC ABSTRACT A feature of PROC SQL which provides flexibility to SAS users is that of a SUBQUERY.
A Method for Cleaning Clinical Trial Analysis Data Sets
A Method for Cleaning Clinical Trial Analysis Data Sets Carol R. Vaughn, Bridgewater Crossings, NJ ABSTRACT This paper presents a method for using SAS software to search SAS programs in selected directories
BUSINESS RULES AND GAP ANALYSIS
Leading the Evolution WHITE PAPER BUSINESS RULES AND GAP ANALYSIS Discovery and management of business rules avoids business disruptions WHITE PAPER BUSINESS RULES AND GAP ANALYSIS Business Situation More
Building and Customizing a CDISC Compliance and Data Quality Application Wayne Zhong, Accretion Softworks, Chester Springs, PA
WUSS2015 Paper 84 Building and Customizing a CDISC Compliance and Data Quality Application Wayne Zhong, Accretion Softworks, Chester Springs, PA ABSTRACT Creating your own SAS application to perform CDISC
Christianna S. Williams, University of North Carolina at Chapel Hill, Chapel Hill, NC
Christianna S. Williams, University of North Carolina at Chapel Hill, Chapel Hill, NC ABSTRACT Have you used PROC MEANS or PROC SUMMARY and wished there was something intermediate between the NWAY option
Create Beautiful Reports with AWR Cloud and Prove the Value of Your SEO Efforts
Create Beautiful Reports with AWR Cloud and Prove the Value of Your SEO Efforts It can be difficult sometimes to show your clients the value that they get from your service. Your job, as an SEO, is to
Access Queries (Office 2003)
Access Queries (Office 2003) Technical Support Services Office of Information Technology, West Virginia University OIT Help Desk 293-4444 x 1 oit.wvu.edu/support/training/classmat/db/ Instructor: Kathy
Top Ten Reasons to Use PROC SQL
Paper 042-29 Top Ten Reasons to Use PROC SQL Weiming Hu, Center for Health Research Kaiser Permanente, Portland, Oregon, USA ABSTRACT Among SAS users, it seems there are two groups of people, those who
SAS PROGRAM EFFICIENCY FOR BEGINNERS. Bruce Gilsen, Federal Reserve Board
SAS PROGRAM EFFICIENCY FOR BEGINNERS Bruce Gilsen, Federal Reserve Board INTRODUCTION This paper presents simple efficiency techniques that can benefit inexperienced SAS software users on all platforms.
Dup, Dedup, DUPOUT - New in PROC SORT Heidi Markovitz, Federal Reserve Board of Governors, Washington, DC
CC14 Dup, Dedup, DUPOUT - New in PROC SORT Heidi Markovitz, Federal Reserve Board of Governors, Washington, DC ABSTRACT This paper presents the new DUPOUT option of PROC SORT and discusses the art of identifying
Guide to Performance and Tuning: Query Performance and Sampled Selectivity
Guide to Performance and Tuning: Query Performance and Sampled Selectivity A feature of Oracle Rdb By Claude Proteau Oracle Rdb Relational Technology Group Oracle Corporation 1 Oracle Rdb Journal Sampled
Can SAS Enterprise Guide do all of that, with no programming required? Yes, it can.
SAS Enterprise Guide for Educational Researchers: Data Import to Publication without Programming AnnMaria De Mars, University of Southern California, Los Angeles, CA ABSTRACT In this workshop, participants
IBM SPSS Direct Marketing 19
IBM SPSS Direct Marketing 19 Note: Before using this information and the product it supports, read the general information under Notices on p. 105. This document contains proprietary information of SPSS
Performance Tuning for the Teradata Database
Performance Tuning for the Teradata Database Matthew W Froemsdorf Teradata Partner Engineering and Technical Consulting - i - Document Changes Rev. Date Section Comment 1.0 2010-10-26 All Initial document
Data Warehousing. Jens Teubner, TU Dortmund [email protected]. Winter 2014/15. Jens Teubner Data Warehousing Winter 2014/15 1
Jens Teubner Data Warehousing Winter 2014/15 1 Data Warehousing Jens Teubner, TU Dortmund [email protected] Winter 2014/15 Jens Teubner Data Warehousing Winter 2014/15 152 Part VI ETL Process
PeopleSoft Query Training
PeopleSoft Query Training Overview Guide Tanya Harris & Alfred Karam Publish Date - 3/16/2011 Chapter: Introduction Table of Contents Introduction... 4 Navigation of Queries... 4 Query Manager... 6 Query
Data-driven Validation Rules: Custom Data Validation Without Custom Programming Don Hopkins, Ursa Logic Corporation, Durham, NC
Data-driven Validation Rules: Custom Data Validation Without Custom Programming Don Hopkins, Ursa Logic Corporation, Durham, NC ABSTRACT One of the most expensive and time-consuming aspects of data management
One problem > Multiple solutions; various ways of removing duplicates from dataset using SAS Jaya Dhillon, Louisiana State University
One problem > Multiple solutions; various ways of removing duplicates from dataset using SAS Jaya Dhillon, Louisiana State University ABSTRACT In real world, analysts seldom come across data which is in
Reporting MDM Data Attribute Inconsistencies for the Enterprise Using DataFlux
Reporting MDM Data Attribute Inconsistencies for the Enterprise Using DataFlux Ernesto Roco, Hyundai Capital America (HCA), Irvine, CA ABSTRACT The purpose of this paper is to demonstrate how we use DataFlux
Business Insight Report Authoring Getting Started Guide
Business Insight Report Authoring Getting Started Guide Version: 6.6 Written by: Product Documentation, R&D Date: February 2011 ImageNow and CaptureNow are registered trademarks of Perceptive Software,
Access 2003 Introduction to Queries
Access 2003 Introduction to Queries COPYRIGHT Copyright 1999 by EZ-REF Courseware, Laguna Beach, CA http://www.ezref.com/ All rights reserved. This publication, including the student manual, instructor's
Programming Tricks For Reducing Storage And Work Space Curtis A. Smith, Defense Contract Audit Agency, La Mirada, CA.
Paper 23-27 Programming Tricks For Reducing Storage And Work Space Curtis A. Smith, Defense Contract Audit Agency, La Mirada, CA. ABSTRACT Have you ever had trouble getting a SAS job to complete, although
020112 2008 Blackbaud, Inc. This publication, or any part thereof, may not be reproduced or transmitted in any form or by any means, electronic, or
Point of Sale Guide 020112 2008 Blackbaud, Inc. This publication, or any part thereof, may not be reproduced or transmitted in any form or by any means, electronic, or mechanical, including photocopying,
USERV Auto Insurance Rule Model in Corticon
USERV Auto Insurance Rule Model in Corticon Mike Parish Progress Software Contents Introduction... 3 Vocabulary... 4 Database Connectivity... 4 Overall Structure of the Decision... 6 Preferred Clients...
Query Builder. The New JMP 12 Tool for Getting Your SQL Data Into JMP WHITE PAPER. By Eric Hill, Software Developer, JMP
Query Builder The New JMP 12 Tool for Getting Your SQL Data Into JMP By Eric Hill, Software Developer, JMP WHITE PAPER SAS White Paper Table of Contents Introduction.... 1 Section 1: Making the Connection....
Normalization. Reduces the liklihood of anomolies
Normalization Normalization Tables are important, but properly designing them is even more important so the DBMS can do its job Normalization the process for evaluating and correcting table structures
Populating a Data Quality Scorecard with Relevant Metrics WHITE PAPER
Populating a Data Quality Scorecard with Relevant Metrics WHITE PAPER SAS White Paper Table of Contents Introduction.... 1 Useful vs. So-What Metrics... 2 The So-What Metric.... 2 Defining Relevant Metrics...
Preparing your data for analysis using SAS. Landon Sego 24 April 2003 Department of Statistics UW-Madison
Preparing your data for analysis using SAS Landon Sego 24 April 2003 Department of Statistics UW-Madison Assumptions That you have used SAS at least a few times. It doesn t matter whether you run SAS in
Intro to Longitudinal Data: A Grad Student How-To Paper Elisa L. Priest 1,2, Ashley W. Collinsworth 1,3 1
Intro to Longitudinal Data: A Grad Student How-To Paper Elisa L. Priest 1,2, Ashley W. Collinsworth 1,3 1 Institute for Health Care Research and Improvement, Baylor Health Care System 2 University of North
Regular Expressions and Automata using Haskell
Regular Expressions and Automata using Haskell Simon Thompson Computing Laboratory University of Kent at Canterbury January 2000 Contents 1 Introduction 2 2 Regular Expressions 2 3 Matching regular expressions
Taming the PROC TRANSPOSE
Taming the PROC TRANSPOSE Matt Taylor, Carolina Analytical Consulting, LLC ABSTRACT The PROC TRANSPOSE is often misunderstood and seldom used. SAS users are unsure of the results it will give and curious
Data Warehouse and Business Intelligence Testing: Challenges, Best Practices & the Solution
Warehouse and Business Intelligence : Challenges, Best Practices & the Solution Prepared by datagaps http://www.datagaps.com http://www.youtube.com/datagaps http://www.twitter.com/datagaps Contact [email protected]
Identifying Invalid Social Security Numbers
ABSTRACT Identifying Invalid Social Security Numbers Paulette Staum, Paul Waldron Consulting, West Nyack, NY Sally Dai, MDRC, New York, NY Do you need to check whether Social Security numbers (SSNs) are
Search and Replace in SAS Data Sets thru GUI
Search and Replace in SAS Data Sets thru GUI Edmond Cheng, Bureau of Labor Statistics, Washington, DC ABSTRACT In managing data with SAS /BASE software, performing a search and replace is not a straight
Oracle Database 10g: Introduction to SQL
Oracle University Contact Us: 1.800.529.0165 Oracle Database 10g: Introduction to SQL Duration: 5 Days What you will learn This course offers students an introduction to Oracle Database 10g database technology.
SAS Enterprise Guide in Pharmaceutical Applications: Automated Analysis and Reporting Alex Dmitrienko, Ph.D., Eli Lilly and Company, Indianapolis, IN
Paper PH200 SAS Enterprise Guide in Pharmaceutical Applications: Automated Analysis and Reporting Alex Dmitrienko, Ph.D., Eli Lilly and Company, Indianapolis, IN ABSTRACT SAS Enterprise Guide is a member
Software-assisted document review: An ROI your GC can appreciate. kpmg.com
Software-assisted document review: An ROI your GC can appreciate kpmg.com b Section or Brochure name Contents Introduction 4 Approach 6 Metrics to compare quality and effectiveness 7 Results 8 Matter 1
Information Systems SQL. Nikolaj Popov
Information Systems SQL Nikolaj Popov Research Institute for Symbolic Computation Johannes Kepler University of Linz, Austria [email protected] Outline SQL Table Creation Populating and Modifying
Fastnet SA 2008. www.mailcleaner.net
U S E R M A N U A L Fastnet SA 2008. Reproduction of this manual in whole or in part without the permission of Fastnet SA, St-Sulpice, Switzerland is strictly forbidden. MailCleaner is a trademark of Fastnet
SAP InfiniteInsight Explorer Analytical Data Management v7.0
End User Documentation Document Version: 1.0-2014-11 SAP InfiniteInsight Explorer Analytical Data Management v7.0 User Guide CUSTOMER Table of Contents 1 Welcome to this Guide... 3 1.1 What this Document
How To Create A Text Mining Model For An Auto Analyst
Paper 003-29 Text Mining Warranty and Call Center Data: Early Warning for Product Quality Awareness John Wallace, Business Researchers, Inc., San Francisco, CA Tracy Cermack, American Honda Motor Co.,
Salary. Cumulative Frequency
HW01 Answering the Right Question with the Right PROC Carrie Mariner, Afton-Royal Training & Consulting, Richmond, VA ABSTRACT When your boss comes to you and says "I need this report by tomorrow!" do
Tips and Tricks SAGE ACCPAC INTELLIGENCE
Tips and Tricks SAGE ACCPAC INTELLIGENCE 1 Table of Contents Auto e-mailing reports... 4 Automatically Running Macros... 7 Creating new Macros from Excel... 8 Compact Metadata Functionality... 9 Copying,
Parameter Fields and Prompts. chapter
Parameter Fields and Prompts chapter 23 Parameter Fields and Prompts Parameter and prompt overview Parameter and prompt overview Parameters are Crystal Reports fields that you can use in a Crystal Reports
Calc Guide Chapter 9 Data Analysis
Calc Guide Chapter 9 Data Analysis Using Scenarios, Goal Seek, Solver, others Copyright This document is Copyright 2007 2011 by its contributors as listed below. You may distribute it and/or modify it
Quick Start to Data Analysis with SAS Table of Contents. Chapter 1 Introduction 1. Chapter 2 SAS Programming Concepts 7
Chapter 1 Introduction 1 SAS: The Complete Research Tool 1 Objectives 2 A Note About Syntax and Examples 2 Syntax 2 Examples 3 Organization 4 Chapter by Chapter 4 What This Book Is Not 5 Chapter 2 SAS
The Basics of Querying in FoxPro with SQL SELECT. (Lesson I Single Table Queries)
The Basics of Querying in FoxPro with SQL SELECT (Lesson I Single Table Queries) The idea behind a query is to take a large database, possibly consisting of more than one table, and producing a result
A Brief Introduction to MySQL
A Brief Introduction to MySQL by Derek Schuurman Introduction to Databases A database is a structured collection of logically related data. One common type of database is the relational database, a term
A basic create statement for a simple student table would look like the following.
Creating Tables A basic create statement for a simple student table would look like the following. create table Student (SID varchar(10), FirstName varchar(30), LastName varchar(30), EmailAddress varchar(30));
How To Use Excel With A Calculator
Functions & Data Analysis Tools Academic Computing Services www.ku.edu/acs Abstract: This workshop focuses on the functions and data analysis tools of Microsoft Excel. Topics included are the function
Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER
Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER JMP Genomics Step-by-Step Guide to Bi-Parental Linkage Mapping Introduction JMP Genomics offers several tools for the creation of linkage maps
QOS Based Web Service Ranking Using Fuzzy C-means Clusters
Research Journal of Applied Sciences, Engineering and Technology 10(9): 1045-1050, 2015 ISSN: 2040-7459; e-issn: 2040-7467 Maxwell Scientific Organization, 2015 Submitted: March 19, 2015 Accepted: April
Chapter-1 : Introduction 1 CHAPTER - 1. Introduction
Chapter-1 : Introduction 1 CHAPTER - 1 Introduction This thesis presents design of a new Model of the Meta-Search Engine for getting optimized search results. The focus is on new dimension of internet
AN INTRODUCTION TO THE SQL PROCEDURE Chris Yindra, C. Y. Associates
AN INTRODUCTION TO THE SQL PROCEDURE Chris Yindra, C Y Associates Abstract This tutorial will introduce the SQL (Structured Query Language) procedure through a series of simple examples We will initially
Managing Tables in Microsoft SQL Server using SAS
Managing Tables in Microsoft SQL Server using SAS Jason Chen, Kaiser Permanente, San Diego, CA Jon Javines, Kaiser Permanente, San Diego, CA Alan L Schepps, M.S., Kaiser Permanente, San Diego, CA Yuexin
Excel Working with Data Lists
Excel Working with Data Lists Excel Working with Data Lists Princeton University COPYRIGHT Copyright 2001 by EZ-REF Courseware, Laguna Beach, CA http://www.ezref.com/ All rights reserved. This publication,
SQL Server An Overview
SQL Server An Overview SQL Server Microsoft SQL Server is designed to work effectively in a number of environments: As a two-tier or multi-tier client/server database system As a desktop database system
Clinical Trial Data Integration: The Strategy, Benefits, and Logistics of Integrating Across a Compound
PharmaSUG 2014 - Paper AD21 Clinical Trial Data Integration: The Strategy, Benefits, and Logistics of Integrating Across a Compound ABSTRACT Natalie Reynolds, Eli Lilly and Company, Indianapolis, IN Keith
Data Mining Extensions (DMX) Reference
Data Mining Extensions (DMX) Reference SQL Server 2012 Books Online Summary: Data Mining Extensions (DMX) is a language that you can use to create and work with data mining models in Microsoft SQL Server
IBM SPSS Data Preparation 22
IBM SPSS Data Preparation 22 Note Before using this information and the product it supports, read the information in Notices on page 33. Product Information This edition applies to version 22, release
CHAPTER 1 Overview of SAS/ACCESS Interface to Relational Databases
3 CHAPTER 1 Overview of SAS/ACCESS Interface to Relational Databases About This Document 3 Methods for Accessing Relational Database Data 4 Selecting a SAS/ACCESS Method 4 Methods for Accessing DBMS Tables
IBM SPSS Direct Marketing 20
IBM SPSS Direct Marketing 20 Note: Before using this information and the product it supports, read the general information under Notices on p. 105. This edition applies to IBM SPSS Statistics 20 and to
Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications
Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications White Paper Table of Contents Overview...3 Replication Types Supported...3 Set-up &
Guide to SQL Programming: SQL:1999 and Oracle Rdb V7.1
Guide to SQL Programming: SQL:1999 and Oracle Rdb V7.1 A feature of Oracle Rdb By Ian Smith Oracle Rdb Relational Technology Group Oracle Corporation 1 Oracle Rdb Journal SQL:1999 and Oracle Rdb V7.1 The
Effective Use of SQL in SAS Programming
INTRODUCTION Effective Use of SQL in SAS Programming Yi Zhao Merck & Co. Inc., Upper Gwynedd, Pennsylvania Structured Query Language (SQL) is a data manipulation tool of which many SAS programmers are
Teamstudio USER GUIDE
Teamstudio Software Engineering Tools for IBM Lotus Notes and Domino USER GUIDE Edition 30 Copyright Notice This User Guide documents the entire Teamstudio product suite, including: Teamstudio Analyzer
EXST SAS Lab Lab #4: Data input and dataset modifications
EXST SAS Lab Lab #4: Data input and dataset modifications Objectives 1. Import an EXCEL dataset. 2. Infile an external dataset (CSV file) 3. Concatenate two datasets into one 4. The PLOT statement will
Working with Spreadsheets
osborne books Working with Spreadsheets UPDATE SUPPLEMENT 2015 The AAT has recently updated its Study and Assessment Guide for the Spreadsheet Software Unit with some minor additions and clarifications.
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Tales from the Help Desk 3: More Solutions for Simple SAS Mistakes Bruce Gilsen, Federal Reserve Board
Tales from the Help Desk 3: More Solutions for Simple SAS Mistakes Bruce Gilsen, Federal Reserve Board INTRODUCTION In 20 years as a SAS consultant at the Federal Reserve Board, I have seen SAS users make
# or ## - how to reference SQL server temporary tables? Xiaoqiang Wang, CHERP, Pittsburgh, PA
# or ## - how to reference SQL server temporary tables? Xiaoqiang Wang, CHERP, Pittsburgh, PA ABSTRACT This paper introduces the ways of creating temporary tables in SQL Server, also uses some examples
Postgres Plus xdb Replication Server with Multi-Master User s Guide
Postgres Plus xdb Replication Server with Multi-Master User s Guide Postgres Plus xdb Replication Server with Multi-Master build 57 August 22, 2012 , Version 5.0 by EnterpriseDB Corporation Copyright 2012
Data Mining in the Swamp
WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all
Manage Workflows. Workflows and Workflow Actions
On the Workflows tab of the Cisco Finesse administration console, you can create and manage workflows and workflow actions. Workflows and Workflow Actions, page 1 Add Browser Pop Workflow Action, page
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios
Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios By: Michael Banasiak & By: Daniel Tantum, Ph.D. What Are Statistical Based Behavior Scoring Models And How Are
SAS BI Dashboard 3.1. User s Guide
SAS BI Dashboard 3.1 User s Guide The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2007. SAS BI Dashboard 3.1: User s Guide. Cary, NC: SAS Institute Inc. SAS BI Dashboard
Instant SQL Programming
Instant SQL Programming Joe Celko Wrox Press Ltd. INSTANT Table of Contents Introduction 1 What Can SQL Do for Me? 2 Who Should Use This Book? 2 How To Use This Book 3 What You Should Know 3 Conventions
Order from Chaos: Using the Power of SAS to Transform Audit Trail Data Yun Mai, Susan Myers, Nanthini Ganapathi, Vorapranee Wickelgren
Paper CC-027 Order from Chaos: Using the Power of SAS to Transform Audit Trail Data Yun Mai, Susan Myers, Nanthini Ganapathi, Vorapranee Wickelgren ABSTRACT As survey science has turned to computer-assisted
Comparing Methods to Identify Defect Reports in a Change Management Database
Comparing Methods to Identify Defect Reports in a Change Management Database Elaine J. Weyuker, Thomas J. Ostrand AT&T Labs - Research 180 Park Avenue Florham Park, NJ 07932 (weyuker,ostrand)@research.att.com
How To Write A Clinical Trial In Sas
PharmaSUG2013 Paper AD11 Let SAS Set Up and Track Your Project Tom Santopoli, Octagon, now part of Accenture Wayne Zhong, Octagon, now part of Accenture ABSTRACT When managing the programming activities
MOC 20461C: Querying Microsoft SQL Server. Course Overview
MOC 20461C: Querying Microsoft SQL Server Course Overview This course provides students with the knowledge and skills to query Microsoft SQL Server. Students will learn about T-SQL querying, SQL Server
Mail Merges, Labels and Email Message Merges in Word 2007 Contents
Mail Merges, Labels and Email Message Merges in Word 2007 Contents Introduction to Mail Merges... 2 Mail Merges Using the Mail Merge Wizard... 3 Creating the Main Document... 3 Selecting the Data Source...
