Fuzzy Association Rule Mining for Community Crime Pattern Discovery
Anna L. Buczak, Christopher M. Gifford
ACM SIGKDD Workshop on Intelligence and Security Informatics, held in conjunction with KDD-2010
July 25, 2010

Outline
- Objective
- Fuzzy Association Rule Mining (FARM) methodology
- Rule interestingness
- Rule pruning
- Data set
- Results:
  - All state rules
  - Post-pruning of all state rules
  - Regional rules
- Conclusions

Crime Pattern Discovery
- Currently: manual inspection of crime data by analysts.
  - Limited by the amount of data that can be processed in an acceptable time frame.
  - Complex relationships between crime attributes can be overlooked.
- Increased interest among state and city law enforcement in discovering patterns in crime data sets.
- Goal: automatic discovery of community crime patterns.
  - Extract rules governing crime patterns.

Fuzzy Association Rule Mining (FARM)
- Goal: find novel relationships in the data.
- For numerical and categorical attributes, crisp rules are unsatisfactory and fuzzy rules provide significantly higher quality results.
- Fuzzy logic assigns a degree of membership between 0 and 1 (e.g., 0.4) to each element of a set.
- Fuzzy association rules are of the form (X is A) -> (Y is B), where X and Y are attributes and A and B are fuzzy sets that characterize X and Y respectively.
- Example fuzzy association rule for a banking application: (Customer-age is young) and (Account-balance is small) -> (Loan-balance is moderate)
[Figure: fuzzy membership functions* for the variable Customer-age]
Fuzzy rules are well suited to problems with numerical and categorical variables.
* Au, W-H and Chan, KCC, Mining fuzzy association rules in a bank-account database, IEEE Transactions on Fuzzy Systems, 11:238-248, April 2003.

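To make the idea of a degree of membership concrete, here is a minimal Python sketch of trapezoidal membership functions for Customer-age. The set names other than "young" and all breakpoints (25, 35, 45, 55) are hypothetical values chosen for illustration; they are not taken from the slides or from the cited paper.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 1 on [b, c], 0 outside [a, d], linear on the slopes.

    Setting a == b or c == d gives a left or right "shoulder" function.
    """
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)   # rising edge between a and b
    return (d - x) / (d - c)       # falling edge between c and d


# Hypothetical fuzzy sets for Customer-age (breakpoints are assumptions).
AGE_SETS = {
    "young": (0, 0, 25, 35),
    "middle-aged": (25, 35, 45, 55),
    "old": (45, 55, 120, 120),
}


def memberships(age):
    """Degree of membership of `age` in each hypothetical fuzzy set."""
    return {name: trapezoid(age, *params) for name, params in AGE_SETS.items()}


if __name__ == "__main__":
    print(memberships(29))
```

With these assumed breakpoints, a 29-year-old customer belongs to "young" with degree 0.6 and to "middle-aged" with degree 0.4, so one record can partially support rules about both sets.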
Rule Interestingness
- Rules of interest usually have high support and high confidence.
- In certain domains, rules of interest don't have high support (i.e., they describe rare events).
  - This holds for crime, equipment failure, and rare disease applications.
- When rare events are of interest, even rules with low support need to be generated.
  - Result: a very large number of rules.
Novel post-pruning methods are necessary.

Rule Pruning (1)
- Pruning based on Support, Confidence, Lift
- Consequent-constraint rule pruning*
  - An item constraint is used that requires rule consequents to satisfy a given constraint.
  - Requires prior knowledge of which consequents should be interesting.
- Antecedent-constraint rule pruning
  - Remove rules that are subsets of other rules and have similar confidence:
    R1: (A1 & A2) -> C1, conf = 0.7
    R2: (A1 & A2 & A3) -> C1, conf = 0.7
    R3: (A1 & A2 & A4) -> C1, conf = 0.88
  - R2 is a subset of R1 (it adds A3 to R1's antecedent, so it covers a subset of R1's cases) with the same confidence: it should be removed.
  - R3 is a subset of R1 with a different confidence: it should stay.
* R.J. Bayardo, R. Agrawal, and D. Gunopulos, Constraint-Based Rule Mining in Large, Dense Databases, Data Mining and Knowledge Discovery, 4(2/3), pp. 217-240, 2000.

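The antecedent-constraint step can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the `Rule` type, the function name `prune_redundant`, and the tolerance `tol` used to decide that two confidences are "similar" are all assumptions, since the slide gives no numeric threshold.

```python
from typing import FrozenSet, List, NamedTuple


class Rule(NamedTuple):
    name: str
    antecedent: FrozenSet[str]
    consequent: str
    confidence: float


def prune_redundant(rules: List[Rule], tol: float = 0.01) -> List[Rule]:
    """Drop a rule when a more general rule (same consequent, antecedent a
    proper subset of this rule's antecedent) has a similar confidence.

    `tol` is an assumed similarity threshold, not a value from the slides.
    """
    kept = []
    for r in rules:
        redundant = any(
            g.consequent == r.consequent
            and g.antecedent < r.antecedent          # g is strictly more general
            and abs(g.confidence - r.confidence) <= tol
            for g in rules
        )
        if not redundant:
            kept.append(r)
    return kept


# The three rules from the slide.
RULES = [
    Rule("R1", frozenset({"A1", "A2"}), "C1", 0.70),
    Rule("R2", frozenset({"A1", "A2", "A3"}), "C1", 0.70),
    Rule("R3", frozenset({"A1", "A2", "A4"}), "C1", 0.88),
]

if __name__ == "__main__":
    print([r.name for r in prune_redundant(RULES)])
```

On the slide's example this keeps R1 and R3 and drops R2, because adding A3 to the antecedent does not change the confidence.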
Rule Pruning (2)
- Defined the Relative Fuzzy Support (RFS) measure:
  - Allows reduction of the support threshold for consequents that have low frequency and an increase of the support threshold for consequents that have high frequency.
  - The reduction or increase of support is significant because of the square in the denominator.
- RFS is well suited for applications in which the user knows the consequents of interest.
  - This is the case in the crime application, as the user is most interested in Violent Crimes, Murders, Robberies, and Assaults being High.

Crime Association Rule Mining
- Communities and Crime Data Set* (UCI Machine Learning Repository):
  - Total of 128 variables
    - Census data (1990)
    - Crime data (1995)
    - Law enforcement data (1990)
      - For many communities these attributes are missing (e.g., police officers per 100k population, police requests per officer, officers assigned to drug units, police operating budget)
  - Data from 2215 communities
* Obtained from Dr. Michael Redmond of La Salle University

Antecedents and Consequents
Antecedents:
- Mean People Per Household
- Race: African American (%)
- Race: Caucasian (%)
- Race: Hispanic (%)
- Race: Asian (%)
- Age: 12-21 (%)
- Age: 12-29 (%)
- Age: 16-24 (%)
- Age: 65+ (%)
- Unemployed (%)
- Employed (%)
- Divorced (%)
- Houses with Salary Income (%)
- Houses with Retirement Income (%)
- Houses with Social Security Income (%)
- Houses with Public Assistance Income (%)
- Per Capita Income
- Median Household Income
- People in Dense Housing (%)
- People in Urban Area (%)
- People Speaking No English (%)
- Education: Less than 9th Grade (%)
- People in Homeless Shelters
- Foreign Born (%)
- People Speaking English Only (%)
- Education: No High School Diploma (%)
- Homeless People Counted in Street
- Population Density (Persons Per Square Mile)
- Median Gross Rent
- Education: Bachelor's or Higher (%)
- Houses with Kids Living with Two Parents (%)
- People Commute Using Public Transit (%)
- People in Owner Occupied Households (%)
- Occupied Housing Units Without Phone (%)
- Kids Born to Never Married (%)
- People Under Poverty Level (%)
Consequents:
- Violent Crimes
- Robberies
- Assaults
- Murders
40 variables, 122 membership functions

Example Variables and Membership Functions
[Figures: membership functions for Houses with Public Assistance Income (%); membership functions for Violent Crimes Per 100K Population]

Post-Pruning of All State Rules (1)
- Confidence = 60%, Support = 0.135%
  - Total of 13,657 rules generated
- Post-pruning using RFS = 1.0
  - Rules left: 657
  - 95.2% reduction in the number of rules
- Average support increased the most for rules with membership functions Low and No.
[Figures: number of rules; average support]
A large portion of the rules of no interest was automatically removed.

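The reported reduction follows directly from the two rule counts on this slide (13,657 rules before post-pruning, 657 after):

```latex
\[
\text{reduction} \;=\; 1 - \frac{657}{13657} \;=\; \frac{13000}{13657} \;\approx\; 0.952 \;=\; 95.2\%
\]
```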
Post-Pruning of All State Rules (2)
- Rules with consequents Murders (High) and Robberies (High) have the highest lift, exceeding several times the average lift of the other consequents.
- Average lift of rules with consequent Violent Crimes (High) remaining after pruning increased more than three times.
- Average lift of rules with membership functions No, Low, and Medium remaining after pruning is unchanged.
- Rules with consequents Murders (High) and Robberies (High) have the highest relative support.
  - Pruning does not increase the average relative support of this class of rules.
- Average relative support of rules with membership functions No, Low, and Medium remaining after pruning increased.
[Figures: average lift; average relative support]

All State Rules Producing Highest Value of a Metric
- People Speaking No English (Low) & People in Dense Housing (Low) -> Robberies (Low), conf=85.0, lift=1.0, rel sup=1.1, sup=75.3
- Kids Born to Never Married (Low) & People in Dense Housing (Low) -> Robberies (Low), conf=88.0, lift=1.1, rel sup=1.1, sup=73.9
- People in Urban Area (High) & Kids Born to Never Married (High) -> Robberies (High), conf=63.0, lift=34.7, rel sup=11.9, sup=0.4
- Race Caucasian (Minority) & Kids Born to Never Married (High) -> Robberies (High), conf=61.0, lift=33.3, rel sup=10.9, sup=0.4
- Kids Born to Never Married (High) & People Commute Using Public Transit (High) -> Robberies (High), conf=96.0, lift=52.9, rel sup=4.5, sup=0.1
- Race African American (Minority) & People Speaking No English (Low) -> Robberies (Low), conf=91.0, lift=1.1, rel sup=1.0, sup=65.9
- Kids Born to Never Married (High) & People Commute Using Public Transit (High) -> Robberies (High), conf=96.0, lift=52.9, rel sup=4.5, sup=0.1
- Houses with Kids Living with Two Parents (Low) & People Commute Using Public Transit (High) -> Robberies (High), conf=86.0, lift=47.4, rel sup=5.6, sup=0.2
Prominent variables: Kids Born to Never Married and Houses with Kids Living with Two Parents are present in 6 out of 8 rules.

Surprising All State Rules Identified by Subject Matter Expert
- Employed (High) & Kids Born to Never Married (High) -> Violent Crimes (High), conf=0.67, lift=15.2, rel sup=1.0, sup=0.002
  - Employed (High) -> Violent Crimes (Low), conf=0.67, lift=1.2, rel sup=1.0, sup=0.297
  - Kids Born to Never Married (High) -> Violent Crimes (High), conf=0.58, lift=13, rel sup=2.1, sup=0.004
- People Under Poverty Level (Low) & Kids Born to Never Married (High) -> Violent Crimes (High), conf=0.65, lift=14.6, rel sup=1.1, sup=0.002
  - People Under Poverty Level (Low) -> Violent Crimes (Low), conf=0.67, lift=1.2, rel sup=1.4, sup=0.416
  - Kids Born to Never Married (High) -> Violent Crimes (High), conf=0.58, lift=13, rel sup=2.1, sup=0.004
- Age: 16-24 (Low) & Kids Born to Never Married (High) -> Murders (High), conf=0.6, lift=13.5, rel sup=2.1, sup=0.004
  - Age: 16-24 (Low) -> Murders (Low), conf=0.6, lift=1.1, rel sup=1.3, sup=0.365
  - Kids Born to Never Married (High) -> Murders (High), conf=0.52, lift=20.5, rel sup=5.9, sup=0.004
- Houses with Salary Income (High) & Kids Born to Never Married (High) -> Violent Crimes (High), conf=0.64, lift=14.5, rel sup=1.4, sup=0.003
  - Houses with Salary Income (High) -> Violent Crimes (Low), conf=0.66, lift=1.2, rel sup=0.9, sup=0.257
  - Kids Born to Never Married (High) -> Violent Crimes (High), conf=0.58, lift=13, rel sup=2.1, sup=0.004

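One quick consistency check on these figures: by the lift definition in the backup slides, lift equals confidence divided by the support of the consequent, so confidence / lift should give roughly the same consequent support for every rule that shares a consequent. The sketch below applies this to the four distinct rules above with consequent Violent Crimes (High). It assumes the fuzzy lift reported here follows that same definition, which the slides do not state explicitly, and the reported values are rounded, so only approximate agreement can be expected.

```python
# (confidence, lift) pairs copied from the rules above whose consequent
# is Violent Crimes (High).
VIOLENT_HIGH_RULES = [
    (0.67, 15.2),  # Employed (High) & Kids Born to Never Married (High)
    (0.58, 13.0),  # Kids Born to Never Married (High)
    (0.65, 14.6),  # People Under Poverty Level (Low) & Kids Born to Never Married (High)
    (0.64, 14.5),  # Houses with Salary Income (High) & Kids Born to Never Married (High)
]


def implied_consequent_support(confidence, lift):
    """Sup(Y) implied by lift = confidence / Sup(Y)."""
    return confidence / lift


IMPLIED = [implied_consequent_support(c, l) for c, l in VIOLENT_HIGH_RULES]

if __name__ == "__main__":
    for value in IMPLIED:
        print(round(value, 4))
```

All four values land close to 0.044, which is what one would expect if Violent Crimes (High) has a single underlying support of about 4.4% and the per-rule differences come from rounding.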
Regional Rules
- Rules were generated separately for 5 US regions*: MW, NE, SE, SW, and W.
- 3188 rules were consistent through all regions:
  - Rule with highest avg Support: People in Dense Housing (Low) -> Robberies (Low)
  - Rule with highest avg Confidence: Houses with Retirement Income (High) & People in Homeless Shelters (None) -> Robberies (Low)
  - Rule with highest avg Lift: Race African American (Middle) & Houses with Public Assistance Income (Medium) -> Assaults (Medium)
[Figure: rules per region]
* J. Dembsky, United States Regions, http://www.dembsky.net/regions/.

Conclusions
- Fuzzy association rule mining has proven useful for this crime application.
  - First experimental study of applying fuzzy association rule mining to a crime data set.
  - Both frequent and rare rules are of interest.
- New Relative Fuzzy Support (RFS) metric defined for rule post-pruning:
  - Achieves a 95.2% reduction in the final number of rules.
- Rules discovered represent patterns of interest to law enforcement officials.
- Subject Matter Expert recommendation:
  - Law enforcement personnel and analysts should further analyze the identified set of surprising rules and the corresponding underlying data in an attempt to better understand crime patterns and develop more effective approaches to combat crime.

Questions?
Contact info: Dr. Anna L. Buczak, anna.buczak@jhuapl.edu, Tel. 443-778-9350

BACKUP SLIDES
Advantages of FARM
- Fuzzy association rules work well with numerical and categorical data.
- Fuzzy rules are easy for a human to understand.
- FARM does not make any assumptions about the rules that are to be extracted, removing a bias that humans might have.
- Use of fuzzy techniques makes fuzzy association rule mining resilient to noise and missing values.
- Fuzzy rules have been shown to provide superior performance to crisp rules in many applications (e.g., fuzzy temperature controller).
FARM methods are well suited to our data.

Rule Support and Confidence
Rule: X -> Y
$$\mathrm{Support}(X \rightarrow Y) = \frac{\#(X \cap Y)}{\#D}$$
$$\mathrm{Confidence}(X \rightarrow Y) = \frac{\#(X \cap Y)}{\#X}$$
[Figure: Venn diagram of X, Y, and their intersection within the database D]
- Support (coverage) is the number of instances the rule predicts correctly, expressed as a proportion of all items in the data set.
  - Support = number of transactions that contain both X and Y divided by the number of all transactions in the database (D).
- Confidence (accuracy) is the number of instances that the rule predicts correctly, expressed as a proportion of all instances to which it applies.
  - Confidence = number of transactions that contain both X and Y divided by the number of transactions that contain X.
  - Confidence can be treated as the conditional probability of a transaction containing X also containing Y, $P(Y \mid X)$.

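The two definitions translate directly into counting code. The sketch below uses the crisp (non-fuzzy) form given on this slide; how the fuzzy variant aggregates membership degrees is not stated here, so it is left out. The toy transactions and the item labels are hypothetical.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)


def support(transactions, x, y):
    """#(X and Y) / #D: share of all transactions containing both X and Y."""
    return count_containing(transactions, x | y) / len(transactions)


def confidence(transactions, x, y):
    """#(X and Y) / #X: share of the transactions containing X that also contain Y."""
    n_x = count_containing(transactions, x)
    if n_x == 0:
        return 0.0  # rule never applies; returning 0 is a convention chosen here
    return count_containing(transactions, x | y) / n_x


# Hypothetical toy database of five transactions.
TOY = [
    frozenset({"DenseHousing=Low", "Robberies=Low"}),
    frozenset({"DenseHousing=Low", "Robberies=Low"}),
    frozenset({"DenseHousing=Low"}),
    frozenset({"Robberies=Low"}),
    frozenset(),
]
X = frozenset({"DenseHousing=Low"})
Y = frozenset({"Robberies=Low"})

if __name__ == "__main__":
    print(support(TOY, X, Y))     # 2 of 5 transactions contain both X and Y
    print(confidence(TOY, X, Y))  # 2 of the 3 transactions containing X
```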
Rule Lift
Rule: X -> Y
$$\mathrm{Lift}(X \rightarrow Y) = \frac{\mathrm{Conf}(X \rightarrow Y)}{\mathrm{Expected\_Conf}(X \rightarrow Y)}$$
$$\mathrm{Expected\_Conf}(X \rightarrow Y) = \mathrm{Sup}(Y)$$
$$\mathrm{Lift}(X \rightarrow Y) = \frac{\#(X \cap Y) / \#X}{\#Y / \#D} = \frac{\#D \cdot \#(X \cap Y)}{\#X \cdot \#Y}$$
[Figure: Venn diagram of X, Y, and their intersection within the database D]
- Lift measures the deviation from independence of X and Y.
- Lift is the ratio of the number of instances in which X and Y appear together to the product of the number of instances in which X appears and the number of instances in which Y appears, scaled by the database size #D.
- Lift values larger than 1.0 indicate that transactions containing the antecedent (X) tend to contain the consequent (Y) more often than transactions that do not contain the antecedent (X).
- The higher the lift, the more likely it is that the co-occurrence of X and Y is not just a random occurrence but reflects a relationship between them.

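As a small illustration, lift can be computed from four counts, either as confidence divided by the support of the consequent or in the closed form with #D in the numerator; both give the same number. The function names and the example counts below are hypothetical.

```python
def lift_from_ratio(n_xy, n_x, n_y, n_d):
    """Lift as Conf(X -> Y) / Sup(Y)."""
    conf = n_xy / n_x        # confidence of X -> Y
    sup_y = n_y / n_d        # expected confidence = support of Y
    return conf / sup_y


def lift_closed_form(n_xy, n_x, n_y, n_d):
    """Lift as #D * #(X and Y) / (#X * #Y)."""
    return (n_d * n_xy) / (n_x * n_y)


# Hypothetical counts: 5 transactions, X in 3, Y in 3, both in 2.
EXAMPLE = dict(n_xy=2, n_x=3, n_y=3, n_d=5)

if __name__ == "__main__":
    print(lift_from_ratio(**EXAMPLE))
    print(lift_closed_form(**EXAMPLE))
```

For these counts the lift is 10/9, slightly above 1.0, meaning X and Y co-occur a little more often than independence would predict.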
US Regions
Community data were grouped into five regions:
- Northeast: CT, DE, ME, MD, MA, NH, NJ, NY, PA, RI, and VT. This subset covers 632 communities.
- Southeast: AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, and WV. This subset covers 420 communities.
- Midwest: IL, IN, IA, KS, MI, MN, MO, NE (no data), ND, OH, SD, and WI. This subset covers 513 communities.
- Southwest: AZ, NM, OK, and TX. This subset covers 228 communities.
- West: CA, CO, ID, MT (no data), NV, OR, UT, WA, and WY. This subset covers 418 communities.

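The grouping above can be expressed as a lookup table, which is one way to split the community records by region before mining rules for each region separately. The state lists and community counts are copied from this slide; the names `REGIONS`, `COMMUNITY_COUNTS`, and `region_of` are illustrative.

```python
# State-to-region grouping and community counts as listed on the slide.
REGIONS = {
    "Northeast": ["CT", "DE", "ME", "MD", "MA", "NH", "NJ", "NY", "PA", "RI", "VT"],
    "Southeast": ["AL", "AR", "FL", "GA", "KY", "LA", "MS", "NC", "SC", "TN", "VA", "WV"],
    "Midwest": ["IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND", "OH", "SD", "WI"],
    "Southwest": ["AZ", "NM", "OK", "TX"],
    "West": ["CA", "CO", "ID", "MT", "NV", "OR", "UT", "WA", "WY"],
}

COMMUNITY_COUNTS = {
    "Northeast": 632,
    "Southeast": 420,
    "Midwest": 513,
    "Southwest": 228,
    "West": 418,
}

# Inverted index: state abbreviation -> region name.
STATE_TO_REGION = {
    state: region for region, states in REGIONS.items() for state in states
}


def region_of(state):
    """Region for a two-letter state code, or None if the state is not listed."""
    return STATE_TO_REGION.get(state.upper())


if __name__ == "__main__":
    print(region_of("tx"))
    print(sum(COMMUNITY_COUNTS.values()))
```

The five listed counts total 2211 communities across 48 states, slightly fewer than the 2215 communities quoted for the full data set; the slides do not say which communities fall outside the five regions.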
Experimental Setup
- All attributes with a large number of missing values were removed.
- Odds ratios between each remaining attribute and Violent Crimes, Murders, Robberies, and Assaults were computed.
  - Attributes exhibiting small odds ratios were removed.
- Similar attributes were omitted (e.g., from the attributes Divorced (%), Male Divorced (%), and Female Divorced (%), only Divorced (%) was kept).

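For reference, an odds ratio is computed from a 2x2 contingency table. The sketch below shows the standard formula only. How the authors turned continuous attributes and crime rates into two categories, and what cutoff counted as a "small" odds ratio, are not stated in the slides, so the example counts and the binarization they imply are purely hypothetical.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a: attribute present, outcome present
    b: attribute present, outcome absent
    c: attribute absent,  outcome present
    d: attribute absent,  outcome absent
    """
    if b * c == 0:
        return float("inf")  # undefined in the strict sense; inf is a convention chosen here
    return (a * d) / (b * c)


# Hypothetical counts: 100 communities with a high attribute value
# (30 of them high-crime) and 200 without (10 of them high-crime).
EXAMPLE_TABLE = (30, 70, 10, 190)

if __name__ == "__main__":
    print(odds_ratio(*EXAMPLE_TABLE))
```

An odds ratio near 1 suggests the attribute carries little information about the outcome, which is presumably why attributes with small odds ratios were dropped.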
Examples of Membership Functions
[Figures: membership functions for Race: Hispanic (%); Homeless People in Shelters Per 100K Population; Violent Crimes Per 100K Population]

Examples of Unsurprising All State Rules
Frequent rules:
- Houses with Kids Living with Two Parents (High) & People Speaking No English (Low) -> Murders (No), conf=0.61, lift=1.3, sup=0.221
- Houses with Public Assistance Income (Low) & Houses with Kids Living with Two Parents (High) -> Murders (No), conf=0.6, lift=1.3, sup=0.219
Rare rules:
- People in Homeless Shelters (High) & People Commute Using Public Transit (High) -> Robberies (High), conf=0.76, lift=42.1, sup=0.002
- Houses with Public Assistance Income (High) & Kids Born to Never Married (High) -> Robberies (High), conf=0.7, lift=38.7, sup=0.002
- Houses with Public Assistance Income (High) & Kids Born to Never Married (High) -> Murders (High), conf=0.74, lift=28.9, sup=0.002