Identity Resolution in Criminal Justice Data: An Application of NORA

Queen E. Booker

Minnesota State University, Mankato, 150 Morris Hall, Mankato, Minnesota
Queen.booker@mnsu.edu

Abstract. Identifying aliases is an important component of the criminal justice system. Accurately identifying a person of interest or someone who has been arrested can significantly reduce costs across the entire criminal justice system. This paper examines the problem domain of matching and relating identities, reviews traditional approaches to the problem, and applies the identity resolution approach described by Jeff Jonas [1], together with relationship awareness, to the specific case of client identification for the indigent defense office. The combination of identity resolution and relationship awareness offered improved accuracy in matching identities.

Keywords: Pattern Analysis, Identity Resolution, Text Mining

1 Introduction

Appointing counsel for indigent clients is a complex task with many constraints and variables. The manager responsible for assigning the attorney is limited by the number of attorneys at his or her disposal. If the manager assigns an attorney to a case in which the attorney has a conflict of interest, the office loses the funds already invested in the case by the representing attorney, and additional resources are needed to bring the next attorney up to speed. Thus, it is in the best interest of the manager to accurately identify the client, the victim, and any potential witnesses to minimize any conflict of interest. As the number of cases grows, the manager often simply selects the next person on the list when assigning the case. This type of assignment can lead to a high number of withdrawals due to a late-identified conflict of interest. Costs to the office increase due to additional incarceration expenses while the client is held in custody, as well as the sunk costs of prior and repeated attorney representation regardless of whether the client is in or out of custody.

These problems are further exacerbated when insufficient systems are in place to manage the data that could make assignments easier. The data on the defendant is maintained separately by the various criminal justice agencies, including the indigent defense service agency itself. This presents a challenge as the number of cases increases without a concomitant increase in the staff available to make the assignments. Thus, the individuals responsible for assigning attorneys want not only the ability to assign attorneys better, but also to do so more quickly. Aggregating data from all the information systems in the criminal justice process has been shown to improve the attorney assignment process [2].

Criminal justice systems have many disparate information systems, each with its own data sets. These include systems concerned with arrests, court case scheduling, and the prosecuting attorney's office, to name a few. In many cases, relationships are non-obvious. It is not unusual for a repeat offender to provide an alternative name that is not validated before the arrest data is sent to the indigent defense office. Likewise, it is not unusual for potential witnesses to provide alternative names in an attempt to protect their identities. Further, it is not unusual for a victim to provide yet another name in an attempt to hide a previous interaction with the criminal justice process. Detecting aliases becomes harder as the indigent defense problem grows in complexity.

2 Problems with matching

Matching identities or finding aliases is a difficult process to perform manually. The process relies on institutional knowledge, visual cues, or both. For example, if an arrest report is accompanied by a picture, the manager or attorney can easily ascertain the person's identity. But that is generally not the case.
Arrest reports are generally textual, containing the defendant's name, demographic information, arrest charges, victim, and any witness information. With institutional knowledge, the manager or an attorney can review the information on the report and identify the person through a previously used alias or other pertinent information on the report. In other words, humans can identify many aliases, and so can an information system, because the enterprise contains all the necessary knowledge. But the knowledge and the process are trapped across isolated operational systems within the criminal justice agencies.

One approach to the indigent defense agency problem is to amass information from as many available data sources as possible, clean the data, and find matches to improve the defense process. Traditional algorithms are not well suited to this process. Matching is further encumbered by the poor quality of the underlying data. Lists containing subjects of interest commonly contain typographical errors, deliberate misspellings by defendants who want to frustrate data matching efforts, and legitimate natural variability (Mike versus Michael, and 123 Main Street versus 123 S. Maine Street). Dates are often a problem as well. Months and days are sometimes transposed, especially in international settings. Numbers often have transposition errors or might have been entered with a different number of leading zeros.
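The kinds of variability listed above can be absorbed by a few small normalization helpers before any matching is attempted. The following is a minimal sketch in Python, not the method used in this study: the nickname table, the 0.8 similarity threshold, and all function names are assumptions made for illustration.

```python
from datetime import date
from difflib import SequenceMatcher

# Hypothetical nickname table; a production system would need a far larger one.
NICKNAMES = {"mike": "michael", "rob": "robert", "rick": "richard"}


def normalize_given_name(name):
    """Lower-case a given name and map known nicknames to a canonical form."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def normalize_number(value):
    """Keep digits only and drop leading zeros, so '000123' equals '123'."""
    digits = "".join(ch for ch in value if ch.isdigit())
    return digits.lstrip("0") or "0"


def date_candidates(year, first, second):
    """Return every valid date when the month/day order is uncertain."""
    candidates = set()
    for month, day in ((first, second), (second, first)):
        try:
            candidates.add(date(year, month, day))
        except ValueError:
            pass  # e.g. month 25 does not exist, so that reading is dropped
    return candidates


def roughly_equal(a, b, threshold=0.8):
    """Fuzzy text comparison; the threshold is an assumed tuning value."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
```

Under these assumptions, "Mike" and "Michael" normalize to the same key, an ambiguous date such as 3/12 is kept as two candidate dates rather than forced into one reading, and the two spellings of the Main Street address compare as roughly equal.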

2.1 Current Identity Matching Approaches

Organizations typically employ three general types of identity matching systems: merge/purge and match/merge, binary matching engines, and centralized identity catalogues.

Merge/purge and match/merge is the process of combining two or more lists or files while simultaneously identifying and eliminating duplicate records. This process was developed by direct marketing organizations to eliminate duplicate customer records in mailing lists.

Binary matching engines test an identity in one data set for its presence in a second data set. These matching engines are also sometimes used to compare one identity with another single identity (rather than a list of possibilities), with the output often expected to be a confidence value for the likelihood that the two identity records are the same. These systems were designed to help organizations recognize individuals with whom they had previously done business or, alternatively, recognize that the identity under evaluation is a known subject of interest (that is, on a watch list), thus warranting special handling. [1]

Centralized identity catalogues are systems that collect identity data from disparate and heterogeneous data sources and assemble it into unique identities, while retaining pointers to the original data source and record, with the purpose of creating an index.

Each of the three types of identity matching systems uses either probabilistic or deterministic matching algorithms. Probabilistic techniques rely on training data sets to compute attribute distributions and frequencies, looking for both common and uncommon patterns. These statistics are stored and used later to determine confidence levels in record matching. As a result, any record containing similar but uncommon data might be considered a record of the same person with a high degree of probability.
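The frequency-based idea behind probabilistic matching can be illustrated with a short sketch. It is a simplified illustration under stated assumptions, not the algorithm of any particular product: the weight of an agreeing value is taken to be the negative base-2 logarithm of its frequency in the training set, the `unseen` fallback frequency is arbitrary, and the function names are hypothetical.

```python
import math
from collections import Counter


def train_frequencies(records, fields):
    """Estimate, per field, how often each value occurs in the training set."""
    freqs = {}
    for field in fields:
        counts = Counter(r[field] for r in records if r.get(field))
        total = sum(counts.values())
        freqs[field] = {value: n / total for value, n in counts.items()}
    return freqs


def match_score(a, b, freqs, unseen=1e-6):
    """Sum agreement weights; a rarer shared value contributes more evidence."""
    score = 0.0
    for field, table in freqs.items():
        va, vb = a.get(field), b.get(field)
        if va and va == vb:
            # Values never seen in training get an assumed tiny frequency.
            score += -math.log2(table.get(va, unseen))
    return score
```

With a training set in which "smith" accounts for 80% of surnames and "zabrocki" for 20%, two records that agree on "zabrocki" score higher than two that agree on "smith". The stored frequency tables are exactly the statistics that go stale when the live data drifts away from the training set.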
Probabilistic systems lose accuracy when the underlying data's statistics deviate from the original training set, and they must frequently be retrained to maintain their level of accuracy. Deterministic techniques rely on pre-coded expert rules to define when records should be matched. One rule might be that if the names are close (Robert versus Rob) and the social security numbers are the same, the system should consider the records as matching identities. These systems often have complex rules based on itemsets such as name, birthdate, zip code, telephone number, and gender. However, these systems fail as data becomes more complex.

3 NORA

Jeff Jonas introduced a system called NORA, which stands for non-obvious relationship awareness. He developed the system specifically to solve Las Vegas casinos' identity matching problems. NORA accepted data feeds from numerous enterprise information systems and built a model of identities and relationships between identities (such as shared addresses or phone numbers) in real time. If a new identity matched or related to another identity in a manner that warranted human scrutiny (based on basic rules, such as good guy connected to very bad guy), the system would immediately generate an intelligence alert. The system approach for the Las Vegas casinos is very similar to the needs of the criminal justice system. The data needed to identify aliases and relationships for conflict of interest concerns comes from multiple data sources (the arresting agency, probation offices, court systems, the prosecuting attorney's office, and the defense agency itself), and the ability to successfully identify a client is needed in real time to reduce costs to the defense office.

The NORA system requirements were:

Sequence neutrality. The system needed to react to new data in real time.

Relationship awareness. Relationship awareness was designed into the identity resolution process so that newly discovered relationships could generate real-time intelligence. Discovered relationships also persisted in the database, which is essential for generating alerts beyond one degree of separation.

Perpetual analytics. When the system discovered something of relevance during the identity matching process, it had to publish an alert in real time to secondary systems or users before the opportunity to act was lost.

Context accumulation. Identity resolution algorithms evaluate incoming records against fully constructed identities, which are made up of the accumulated attributes of all prior records. This technique enabled new records to match to known identities in toto, rather than relying on binary matching that could only match records in pairs. Context accumulation improved accuracy and greatly improved the handling of low-fidelity data that might otherwise have been left as a large collection of unmatched orphan records.

Extensible. The system needed to accept new data sources and new attributes through the modification of configuration files, without requiring that the system be taken offline.

Knowledge-based name evaluations. The system needed detailed name evaluation algorithms for high-accuracy name matching.
Ideally, the algorithms would be based on actual names taken from all over the world and developed into statistical models to determine how, and how often, each name occurred in its variant forms. This empirical approach required that the system be able to automatically determine the culture that the name most likely came from, because names vary in predictable ways depending on their cultural origin.

Real time. The system had to handle additions, changes, and deletions from real-time operational business systems. Processing needed to be fast enough that matching results and accompanying intelligence (such as whether the person is on a watch list, or whether the address is missing an apartment number based on prior observations) could be returned to the operational systems in under a second.

Scalable. The system had to be able to process records on a standard transaction server, adding information to a repository that holds hundreds of millions of identities. [1]

Like the gaming industry, the defense attorney's office has relatively low daily transactional volumes. Although it receives booking reports on an ongoing basis, initial court appearances are handled by a specific attorney, and the assignments are made daily, usually the day after the initial court appearance. The attorney at the initial court appearance is not the officially assigned attorney, which gives the manager a window of opportunity between booking and case assignment to accurately identify the client. But the analytical component of accurate identification involves numerous records with accurate linkages, including aliases as well as past relationships and networks related to the case. The legal profession has rules and regulations that define what constitutes a conflict of interest. Lawyers must follow these rules to maintain their license to practice, which makes the assignment process even more critical. [3]

NORA's identity resolution engine is capable of performing in real time against extraordinary data volumes. The gaming industry's requirement of less than 1 million affected records a day means that a typical installation might involve a single Intel-based server and any one of several leading SQL database engines. This performance establishes an excellent baseline for application to the defense attorney data, since NORA demonstrated that it could handle multibillion-row databases consisting of hundreds of millions of constructed identities and ingest new identities at a rate of more than 2,000 identity resolutions per second. Such ultra-large deployments require 64 or more CPUs and multiple terabytes of storage, and they move the performance bottleneck from the analytic engine to the database engine itself. While the defense attorney dataset is not as large, the processing time on the casino data suggests that NORA would be able to accurately and easily handle the defense attorney's needs in real time.
4 Identity resolution

Identity resolution is an operational intelligence process, typically powered by an identity resolution engine, whereby organizations can connect disparate data sources with a view to understanding possible identity matches and non-obvious relationships across multiple data sources. It analyzes all of the information relating to individuals and/or entities from multiple sources of data, and then applies likelihood and probability scoring to determine which identities are a match and what, if any, non-obvious relationships exist between those identities. These engines are used to uncover risk, fraud, and conflicts of interest.

Identity resolution is designed to assemble i identity records from j data sources into k constructed, persistent identities. The term "persistent" indicates that matching outcomes are physically stored in a database at the moment a match is computed.

Accurately evaluating the similarity of proper names is undoubtedly one of the most complex (and most important) elements of any identity matching system. Dictionary-based approaches fail to handle the complexities of names, including common names such as Robert Johnson. They fail even more often when cultural influences in naming are involved. Soundex is an improvement over traditional dictionary approaches. It uses a phonetic algorithm for indexing names by their sound when pronounced in English. The basic aim is for names with the same pronunciation to be encoded to the same string so that matching can occur despite minor differences in spelling. Such systems attempt to neutralize slight variations in name spelling by assigning some form of reduced "key" to a name (by eliminating vowels or eliminating double consonants). These attempts frequently fail because of external factors; for example, different fuzzy matching rules are needed for names from different cultures.

Jonas found that the deterministic method is essential for eliminating dependence on training data sets. As such, the system no longer needed periodic reloads to account for statistical changes to the underlying universe of data. However, he also notes many common conditions in which deterministic techniques fail. Specifically, certain attributes were so overused that it made more sense to ignore them than to use them for identity matching and detecting relationships. For example, two people with the first name of "Rick" who share the same social security number are probably the same person, unless the number is an overused value. Two people who have the same phone number probably live at the same address, unless that phone number is a travel agency's phone number. He refers to such values as generic because the overuse diminishes the usefulness of the value itself. It is impossible to know all of these generic values a priori (for one reason, they keep changing), so probabilistic-like techniques are used to automatically detect and remember them.

His identity resolution system uses a hybrid matching approach that combines deterministic expert rules with a probabilistic-like component to detect generics in real time (to avoid the drawback of training data sets). The result is expert rules that look something like this:

    If the name is similar
    AND there is a matching unique identifier
    THEN match
    UNLESS this unique identifier is generic

In his system, a unique identifier might include social security or credit-card numbers, or a passport number, but would not include such values as phone number or date of birth.
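The hybrid rule above can be sketched in a few lines of Python. This is an illustrative reading of the rule, not Jonas's implementation and not the rule set used for the defense data: the class and method names are hypothetical, name similarity is approximated with the standard library's `difflib`, and the cutoff of three distinct identities for declaring an identifier generic is an assumed value.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def names_similar(a, b, threshold=0.8):
    """Crude stand-in for real name evaluation; the threshold is assumed."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


class HybridMatcher:
    def __init__(self, generic_threshold=3):
        self.generic_threshold = generic_threshold
        self.names = {}                    # identity id -> every name seen for it
        self.id_owners = defaultdict(set)  # identifier -> identity ids using it

    def is_generic(self, identifier):
        """An identifier shared by too many distinct identities is generic."""
        owners = self.id_owners.get(identifier, ())
        return len(owners) >= self.generic_threshold

    def resolve(self, name, identifier):
        """Return the identity id for a record, creating one if nothing matches."""
        if not self.is_generic(identifier):           # UNLESS generic
            for ident in sorted(self.id_owners.get(identifier, ())):
                # Compare against all accumulated names, not a single record.
                if any(names_similar(name, known) for known in self.names[ident]):
                    self.names[ident].add(name)       # THEN match
                    return ident
        ident = len(self.names) + 1                   # no match: new identity
        self.names[ident] = {name}
        self.id_owners[identifier].add(ident)
        return ident
```

In this sketch, "Rick Smith" and "Rick Smyth" with the same identifier resolve to one identity. Once three dissimilar names have arrived with the same placeholder identifier, that identifier is treated as generic and no longer triggers a match on its own; a fuller system would fall back on other attributes at that point.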
The term "generic" here means the value has become so widely used (across a predefined number of discrete identities) that one can no longer use this same value to disambiguate one identity from another. [1]

However, the approach for the study of the defense data used a merged itemset that combined date of birth, gender, and ethnicity code, because the social security number either was unavailable or could not legally be used for identification. Thus, an identifier was developed from a merged itemset after using the SUDA algorithm to identify infrequent itemsets based on data mining [4]. The actual deterministic matching rules for NORA, as well as for the defense attorney system, are much more elaborate in practice because they must explicitly address fuzzy matching to scrub and clean the data, as well as transposition errors in numbers, malformed addresses, and other typographical errors. The current defense attorney agency model has thirty-six rules.

Once the data is cleansed, it is stored and indexed to provide user-friendly views of the data that make it easy for the user to find specific information when performing queries and ad hoc reporting. Then, a data-mining algorithm using a combination of binary regression and logit models is run to update patterns for assigning attorneys based on the day's outcomes [5]. The algorithm identifies patterns for the outcomes and a tree structure for attorney and defendant combinations where the attorney completed the case. [6]

Although matching accuracy is highly dependent on the available data, the techniques described here achieve the goals of identity resolution, which essentially boil down to accuracy, scalability, and sustainability, even in extremely large transactional environments.

5 Relationship awareness

According to Jonas, detecting relationships is vastly simplified when a mechanism for doing so is physically embedded into the identity matching algorithm. Before meaningful relationships can be analyzed, the system must be able to resolve unique identities, so identity resolution must occur first. Jonas argued that it is computationally efficient to observe relationships at the moment the identity record is resolved, because in-memory residual artifacts (which are required to match an identity) comprise a significant portion of what is needed to determine relevant relationships. Relevant relationships, much like matched identities, were then persisted in the same database.

Notably, some relationships are stronger than others; a relationship score assigned to each relationship pair captures this strength. For example, living at the same address three times over 10 years should yield a higher score than living at the same address once for three months.

As identities are matched and relationships detected, NORA evaluates user-configurable rules to determine whether any new insight warrants publishing an intelligence alert to a specific system or user. One simplistic way to do this is via conflicting roles.
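The idea of a relationship score can be made concrete with a small sketch. The formula below is purely an assumption chosen so that repeated, long-running co-residence outranks a single short stay; the source does not give NORA's scoring function, and the weights and names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedStay:
    """One period during which two identities shared an address."""
    start_month: int  # months counted from an arbitrary reference point
    end_month: int


def relationship_score(stays, per_stay=10.0, per_month=0.5, per_span_month=0.1):
    """Score a pair of identities; all three weights are assumed values."""
    if not stays:
        return 0.0
    months_together = sum(s.end_month - s.start_month for s in stays)
    span = max(s.end_month for s in stays) - min(s.start_month for s in stays)
    return (per_stay * len(stays)
            + per_month * months_together
            + per_span_month * span)


def record_relationship(store, a, b, stays):
    """Persist the score under an unordered pair so (a, b) equals (b, a)."""
    store[frozenset((a, b))] = relationship_score(stays)
```

Under these weights, three shared stays spread over ten years score well above a single three-month stay, and the stored pair can be looked up regardless of which identity is named first.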
A typical rule for the defense attorney might be a notification any time a client role is associated with a role of victim, witness, co-defendant, or previously represented relative, for example. In this case, "associated" might mean zero degrees of separation (they are the same person) or one degree of separation (they are roommates). Relationships are maintained in the database to one degree of separation; higher degrees are determined by walking the tree. Although the technology supports searching for any degree of separation between identities, higher orders include many insignificant leads and are thus less useful.

6 Comparative Results

This research is an ongoing effort to improve the attorney assignment process in defense attorney offices. As economic times get harder, crime increases, and as crimes increase, so does the number of people who require representation by the public defense offices. The ability to quickly identify conflicts of interest reduces the amount of time a person stays in the system and also reduces the time needed to process the case.

The original system built to work with alias and identity matching was called the Court Appointed Counsel System, or CACS. CACS identified 83% more conflicts of interest than the indigent defense managers during the initial assignments [2]. Using the merged itemset and an algorithm based on NORA's underlying technology, this figure improved from 83% to 87%. But the real improvement came in the processing time. The key to the success of these systems is the ability to update and provide accurate data at a moment's notice. Utilizing NORA's underlying algorithms improved the updating and matching process significantly, allowing new data to be entered and analyzed within a couple of hours, as opposed to the days it took using the CACS algorithms. Further, the merged itemset approach helped to provide a unique identifier in 90% of the cases, significantly increasing automated relationship identifications.

The ability to handle real-time transactional data with sustained accuracy will continue to be of "front and center" importance as organizations seek competitive advantage. The identity resolution technology applied here provides evidence that such technologies can be applied not only to simple fraud detection but also to improve business decision making and intelligence support to entities whose purpose are to.

References

1. Jonas, J., "Threat and Fraud Intelligence, Las Vegas Style," IEEE Security & Privacy, Vol. 4, No. 6, pp. 28-34 (2006)
2. Booker, Q., Kitchens, F. K., and Rebman, C., "A Rule Based Decision Support System Prototype for Assigning Felony Court Appointed Counsel," Proceedings of the 2004 Decision Sciences Annual Meeting, Boston, MA (2004)
3. Gross, L., "Are Differences Among the Attorney Conflict of Interest Rules Consistent with Principles of Behavioral Economics," Georgetown Journal of Legal Ethics, Vol. 19, p. 111 (2006)
4. Manning, A. M., Haglin, D. J., and Keane, J. A., "A Recursive Search Algorithm for Statistical Disclosure Assessment," Data Mining and Knowledge Discovery (2007), conditionally accepted
5. Kitchens, Fred L., Sharma, S. K., and Harris, T., "Cluster Computers for e-business Applications," Asian Journal of Information Systems (AJIS), 3 (10) (2004)
6. Forgy, C., "Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem," Artificial Intelligence 19 (1982)