ACM SIGKDD Workshop on Intelligence and Security Informatics Held in conjunction with KDD-2010




Fuzzy Association Rule Mining for Community Crime Pattern Discovery
Anna L. Buczak, Christopher M. Gifford
ACM SIGKDD Workshop on Intelligence and Security Informatics, held in conjunction with KDD-2010
July 25, 2010

Outline
- Objective
- Fuzzy Association Rule Mining (FARM) methodology
- Rule interestingness
- Rule pruning
- Data set
- Results:
  - All state rules
  - Post-pruning of all state rules
  - Regional rules
- Conclusions

Crime Pattern Discovery
- Currently: manual inspection of crime data by analysts.
  - Limited by the amount of data that can be processed in an acceptable time frame.
  - Complex relationships between crime attributes can be overlooked.
- State and city law enforcement agencies are increasingly interested in discovering patterns in crime data sets.
- Goal: automatic discovery of community crime patterns.
  - Extract rules governing crime patterns.

Fuzzy Association Rule Mining (FARM)
- Goal: find novel relationships in the data.
- For numerical and categorical attributes, crisp rules are unsatisfactory; fuzzy rules provide significantly higher-quality results.
- Fuzzy logic assigns a degree of membership between 0 and 1 (e.g., 0.4) to each element of a set.
- Fuzzy association rules are of the form (X is A) -> (Y is B), where X and Y are attributes and A and B are fuzzy sets that characterize X and Y respectively.
- Example fuzzy association rule for a banking application: (Customer-age is young) and (Account-balance is small) -> (Loan-balance is moderate)
[Figure: fuzzy membership functions* for the variable Customer-age]
Fuzzy rules are well suited to problems with numerical and categorical variables.
* W.-H. Au and K.C.C. Chan, Mining fuzzy association rules in a bank-account database, IEEE Transactions on Fuzzy Systems, 11:238-248, April 2003.
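The idea of a membership degree between 0 and 1 can be illustrated with a minimal Python sketch. The breakpoints (25, 40, 50, 65), the function names, and the use of the minimum to combine the two antecedent conditions are assumptions made for illustration; the slides do not give the actual membership functions or the combination operator.

```python
def falling(x, full, zero):
    """Membership that is 1 up to `full` and drops linearly to 0 at `zero`."""
    if x <= full:
        return 1.0
    if x >= zero:
        return 0.0
    return (zero - x) / (zero - full)


def rising(x, zero, full):
    """Membership that is 0 up to `zero` and rises linearly to 1 at `full`."""
    if x <= zero:
        return 0.0
    if x >= full:
        return 1.0
    return (x - zero) / (full - zero)


# Hypothetical fuzzy sets for Customer-age (breakpoints are illustrative only).
def young(age):
    return falling(age, 25, 40)


def middle_aged(age):
    return min(rising(age, 25, 40), falling(age, 50, 65))


def old(age):
    return rising(age, 50, 65)


def antecedent_degree(memberships):
    """Degree to which a record satisfies a conjunction of fuzzy conditions.

    Uses the minimum as the fuzzy AND; this is one common choice and an
    assumption here, not necessarily the operator used in the talk.
    """
    return min(memberships)
```

With these assumed breakpoints, a 31-year-old customer belongs to "young" with degree 0.6 and to "middle-aged" with degree 0.4, rather than falling entirely into one crisp bin.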

Rule Interestingness
- Rules of interest usually have high support and high confidence.
- In certain domains, rules of interest do not have high support (i.e., they describe rare events).
  - This holds for crime, equipment failure, and rare disease applications.
- When rare events are of interest, even rules with low support need to be generated.
  - Result: a very large number of rules.
Novel post-pruning methods are necessary.

Rule Pruning (1)
- Pruning based on support, confidence, and lift.
- Consequent-constraint rule pruning*:
  - An item constraint is used that requires rule consequents to satisfy a given constraint.
  - Requires prior knowledge of which consequents should be interesting.
- Antecedent-constraint rule pruning:
  - Remove rules that are specializations of other rules (they cover a subset of the same instances) and have similar confidence:
    R1: (A1 & A2) -> C1, conf = 0.7
    R2: (A1 & A2 & A3) -> C1, conf = 0.7
    R3: (A1 & A2 & A4) -> C1, conf = 0.88
  - R2 is a specialization of R1 (its antecedent adds A3) with the same confidence, so it should be removed.
  - R3 is also a specialization of R1 but has a different confidence, so it should stay.
* R.J. Bayardo, R. Agrawal, and D. Gunopulos, Constraint-Based Rule Mining in Large, Dense Databases, Data Mining and Knowledge Discovery, 4(2/3), pp. 217-240, 2000.
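The antecedent-constraint step can be sketched in Python using the R1, R2, R3 example from the slide. The `Rule` type, the function name, and the tolerance `tol` that defines "similar confidence" are hypothetical; the slides do not say how close two confidences must be to count as similar.

```python
from typing import FrozenSet, List, NamedTuple


class Rule(NamedTuple):
    name: str
    antecedent: FrozenSet[str]
    consequent: str
    confidence: float


def prune_redundant(rules: List[Rule], tol: float = 0.01) -> List[Rule]:
    """Drop a rule when a more general rule with the same consequent
    (a strict subset of its antecedent items) has similar confidence.

    `tol` is an assumed threshold for "similar confidence".
    """
    kept = []
    for r in rules:
        redundant = any(
            g.consequent == r.consequent
            and g.antecedent < r.antecedent  # g is strictly more general than r
            and abs(g.confidence - r.confidence) <= tol
            for g in rules
        )
        if not redundant:
            kept.append(r)
    return kept


rules = [
    Rule("R1", frozenset({"A1", "A2"}), "C1", 0.70),
    Rule("R2", frozenset({"A1", "A2", "A3"}), "C1", 0.70),
    Rule("R3", frozenset({"A1", "A2", "A4"}), "C1", 0.88),
]
kept_names = [r.name for r in prune_redundant(rules)]
```

On the slide's example this keeps R1 and R3 and removes R2, since adding A3 to the antecedent does not change the confidence.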

Rule Pruning (2)
- Defined the Relative Fuzzy Support (RFS) measure:
  [Equation: RFS definition, not preserved in the transcript]
- Allows reduction of the support threshold for consequents that have low frequency and increase of the support threshold for consequents that have high frequency.
  - The reduction or increase of support is significant because of the square in the denominator.
- RFS is well suited for applications in which the user knows the consequents of interest.
  - This is the case in the crime application, as the user is most interested in Violent Crimes, Murders, Robberies, and Assaults being High.

Crime Association Rule Mining
- Communities and Crime Data Set* (UCI Machine Learning Repository):
  - Total of 128 variables
  - Census data (1990)
  - Crime data (1995)
  - Law enforcement data (1990)
    - For many communities these attributes are missing (e.g., police officers per 100k population, police requests per officer, officers assigned to drug units, police operating budget)
- Data from 2215 communities
* Obtained from Dr. Michael Redmond of La Salle University.

Antecedents and Consequents
Antecedent variables:
- Mean People Per Household
- Race: African American (%)
- Race: Caucasian (%)
- Race: Hispanic (%)
- Race: Asian (%)
- Age: 12-21 (%)
- Age: 12-29 (%)
- Age: 16-24 (%)
- Age: 65+ (%)
- Unemployed (%)
- Employed (%)
- Divorced (%)
- Houses with Salary Income (%)
- Houses with Retirement Income (%)
- Houses with Social Security Income (%)
- Houses with Public Assistance Income (%)
- Per Capita Income
- Median Household Income
- People in Dense Housing (%)
- People in Urban Area (%)
- People Speaking No English (%)
- Education: Less than 9th Grade (%)
- People in Homeless Shelters
- Foreign Born (%)
- People Speaking English Only (%)
- Education: No High School Diploma (%)
- Homeless People Counted in Street
- Population Density (Persons Per Square Mile)
- Median Gross Rent
- Education: Bachelor's or Higher (%)
- Houses with Kids Living with Two Parents (%)
- People Commute Using Public Transit (%)
- People in Owner Occupied Households (%)
- Occupied Housing Units Without Phone (%)
- Kids Born to Never Married (%)
- People Under Poverty Level (%)
Consequent variables:
- Violent Crimes
- Robberies
- Assaults
- Murders
40 variables, 122 membership functions

Example Variables and Membership Functions
[Figures: membership functions for Houses with Public Assistance Income (%) and for Violent Crimes Per 100K Population]

Post-Pruning of All State Rules (1)
- Confidence = 60%, Support = 0.135%
  - Total of 13,657 rules generated
- Post-pruning using RFS = 1.0
  - Rules left: 657
  - 95.2% reduction in the number of rules
- The average support of rules with membership functions Low and No increased the most.
[Figures: number of rules; average support]
A large portion of the uninteresting rules was automatically removed.

Post-Pruning of All State Rules (2)
- Rules with consequents Murders (High) and Robberies (High) have the highest lift, several times the average lift of the other consequents.
- The average lift of rules with consequent Violent Crimes (High) remaining after pruning increased more than three times.
- The average lift of rules with membership functions No, Low, and Medium remaining after pruning is unchanged.
- Rules with consequents Murders (High) and Robberies (High) have the highest relative support.
  - Pruning does not increase the average relative support of this class of rules.
- The average relative support of rules with membership functions No, Low, and Medium remaining after pruning increased.
[Figures: average lift; average relative support]

All State Rules Producing Highest Value of a Metric
- People Speaking No English (Low) & People in Dense Housing (Low) -> Robberies (Low), conf=85.0, lift=1.0, rel sup=1.1, sup=75.3
- Kids Born to Never Married (Low) & People in Dense Housing (Low) -> Robberies (Low), conf=88.0, lift=1.1, rel sup=1.1, sup=73.9
- People in Urban Area (High) & Kids Born to Never Married (High) -> Robberies (High), conf=63.0, lift=34.7, rel sup=11.9, sup=0.4
- Race Caucasian (Minority) & Kids Born to Never Married (High) -> Robberies (High), conf=61.0, lift=33.3, rel sup=10.9, sup=0.4
- Kids Born to Never Married (High) & People Commute Using Public Transit (High) -> Robberies (High), conf=96.0, lift=52.9, rel sup=4.5, sup=0.1
- Race African American (Minority) & People Speaking No English (Low) -> Robberies (Low), conf=91.0, lift=1.1, rel sup=1.0, sup=65.9
- Kids Born to Never Married (High) & People Commute Using Public Transit (High) -> Robberies (High), conf=96.0, lift=52.9, rel sup=4.5, sup=0.1
- Houses with Kids Living with Two Parents (Low) & People Commute Using Public Transit (High) -> Robberies (High), conf=86.0, lift=47.4, rel sup=5.6, sup=0.2
Prominent variables: Kids Born to Never Married and Houses with Kids Living with Two Parents are present in 6 out of 8 rules.

Surprising All State Rules Identified by Subject Matter Expert
- Employed (High) & Kids Born to Never Married (High) -> Violent Crimes (High), conf=0.67, lift=15.2, rel sup=1.0, sup=0.002
  - Employed (High) -> Violent Crimes (Low), conf=0.67, lift=1.2, rel sup=1.0, sup=0.297
  - Kids Born to Never Married (High) -> Violent Crimes (High), conf=0.58, lift=13, rel sup=2.1, sup=0.004
- People Under Poverty Level (Low) & Kids Born to Never Married (High) -> Violent Crimes (High), conf=0.65, lift=14.6, rel sup=1.1, sup=0.002
  - People Under Poverty Level (Low) -> Violent Crimes (Low), conf=0.67, lift=1.2, rel sup=1.4, sup=0.416
  - Kids Born to Never Married (High) -> Violent Crimes (High), conf=0.58, lift=13, rel sup=2.1, sup=0.004
- Age: 16-24 (Low) & Kids Born to Never Married (High) -> Murders (High), conf=0.6, lift=13.5, rel sup=2.1, sup=0.004
  - Age: 16-24 (Low) -> Murders (Low), conf=0.6, lift=1.1, rel sup=1.3, sup=0.365
  - Kids Born to Never Married (High) -> Murders (High), conf=0.52, lift=20.5, rel sup=5.9, sup=0.004
- Houses with Salary Income (High) & Kids Born to Never Married (High) -> Violent Crimes (High), conf=0.64, lift=14.5, rel sup=1.4, sup=0.003
  - Houses with Salary Income (High) -> Violent Crimes (Low), conf=0.66, lift=1.2, rel sup=0.9, sup=0.257
  - Kids Born to Never Married (High) -> Violent Crimes (High), conf=0.58, lift=13, rel sup=2.1, sup=0.004

Regional Rules
- Rules were generated separately for 5 US regions*: MW, NE, SE, SW, and W.
- 3188 rules were consistent through all regions:
  - Rule with highest average support: People in Dense Housing (Low) -> Robberies (Low)
  - Rule with highest average confidence: Houses with Retirement Income (High) & People in Homeless Shelters (None) -> Robberies (Low)
  - Rule with highest average lift: Race African American (Middle) & Houses with Public Assistance Income (Medium) -> Assaults (Medium)
[Figure: rules per region]
* J. Dembsky, United States Regions, http://www.dembsky.net/regions/.

Conclusions
- Fuzzy association rule mining has proven useful for this crime application.
  - First experimental study applying fuzzy association rule mining to a crime data set.
- Both frequent and rare rules are of interest.
- New Relative Fuzzy Support (RFS) metric defined for rule post-pruning:
  - Achieves a 95.2% reduction in the final number of rules.
- The rules discovered represent patterns of interest to law enforcement officials.
- Subject Matter Expert recommendation:
  - Law enforcement personnel and analysts should further analyze the identified set of surprising rules and the corresponding underlying data to better understand crime patterns and develop more effective approaches to combat crime.

Questions?
Contact info: Dr. Anna L. Buczak, anna.buczak@jhuapl.edu, Tel. 443-778-9350

BACKUP SLIDES

Advantages of FARM
- Fuzzy association rules work well with numerical and categorical data.
- Fuzzy rules are easy for a human to understand.
- FARM does not make any assumptions about the rules that are to be extracted, removing a bias that humans might have.
- The use of fuzzy techniques makes fuzzy association rule mining resilient to noise and missing values.
- Fuzzy rules have been shown to outperform crisp rules in many applications (e.g., a fuzzy temperature controller).
FARM methods are well suited to our data.

Rule Support and Confidence
Rule: X -> Y
\[ \mathrm{Support}(X \rightarrow Y) = \frac{\#(X \cap Y)}{\#D} \]
\[ \mathrm{Confidence}(X \rightarrow Y) = \frac{\#(X \cap Y)}{\#X} \]
[Figure: Venn diagram of X, Y, and their intersection within the data set D]
- Support (coverage) is the number of instances the rule predicts correctly, expressed as a proportion of all items in the data set.
  - Support = number of transactions that contain both X and Y divided by the number of all transactions in the database (D).
- Confidence (accuracy) is the number of instances that the rule predicts correctly, expressed as a proportion of all instances to which it applies.
  - Confidence = number of transactions that contain both X and Y divided by the number of transactions that contain X.
  - Confidence can be treated as the conditional probability that a transaction containing X also contains Y, i.e., P(Y | X).
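The two definitions can be checked on a toy example. The following Python sketch computes crisp (non-fuzzy) support and confidence by counting; the five transactions and the item names are made up for illustration and are not taken from the crime data set.

```python
def support(transactions, x, y):
    """#(transactions containing both X and Y) / #D."""
    both = sum(1 for t in transactions if x <= t and y <= t)
    return both / len(transactions)


def confidence(transactions, x, y):
    """#(transactions containing both X and Y) / #(transactions containing X)."""
    n_x = sum(1 for t in transactions if x <= t)
    both = sum(1 for t in transactions if x <= t and y <= t)
    return both / n_x


# Hypothetical database D of five transactions (each a set of items).
D = [{"A", "B"}, {"A", "B"}, {"A"}, {"B"}, {"C"}]

sup_ab = support(D, {"A"}, {"B"})      # 2 of 5 transactions contain A and B
conf_ab = confidence(D, {"A"}, {"B"})  # 2 of the 3 transactions with A contain B
```

Here the rule A -> B has support 2/5 = 0.4 and confidence 2/3, which is the empirical estimate of P(B | A).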

Rule Lift
Rule: X -> Y
\[ \mathrm{Lift}(X \rightarrow Y) = \frac{\mathrm{Conf}(X \rightarrow Y)}{\mathrm{Expected\_Conf}(X \rightarrow Y)} \]
\[ \mathrm{Expected\_Conf}(X \rightarrow Y) = \mathrm{Sup}(Y) = \frac{\#Y}{\#D} \]
\[ \mathrm{Lift}(X \rightarrow Y) = \frac{\#(X \cap Y)\,/\,\#X}{\#Y\,/\,\#D} = \frac{\#(X \cap Y)\,\#D}{\#X\,\#Y} \]
[Figure: Venn diagram of X, Y, and their intersection within the data set D]
- Lift measures the deviation from independence of X and Y.
- Lift is the ratio of the number of instances in which X and Y appear together to the product of the number of instances in which X appears and the number in which Y appears, scaled by the size of the data set.
- Lift values larger than 1.0 indicate that transactions containing the antecedent (X) tend to contain the consequent (Y) more often than transactions that do not contain the antecedent (X).
- The higher the lift, the more likely it is that X and Y occur together because of a relationship between them rather than by chance.
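As a small worked check of the lift formula, the Python sketch below computes lift directly from counts and confirms that it equals confidence divided by the support of the consequent. The counts are hypothetical (a database of 5 transactions in which X appears 3 times, Y appears 3 times, and both appear together twice).

```python
def lift(n_xy, n_x, n_y, n_d):
    """Lift(X -> Y) = #(X and Y) * #D / (#X * #Y)."""
    return (n_xy * n_d) / (n_x * n_y)


# Hypothetical counts, for illustration only.
n_xy, n_x, n_y, n_d = 2, 3, 3, 5

conf_xy = n_xy / n_x   # Conf(X -> Y)
sup_y = n_y / n_d      # Expected_Conf(X -> Y) = Sup(Y)
lift_xy = lift(n_xy, n_x, n_y, n_d)
```

With these counts the lift is 10/9, slightly above 1.0, so X and Y co-occur a little more often than independence would predict. When X and Y are exactly independent, for example `lift(1, 2, 2, 4)`, the value is 1.0.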

US Regions
Community data were grouped into five regions:
- Northeast: CT, DE, ME, MD, MA, NH, NJ, NY, PA, RI, and VT. This subset covers 632 communities.
- Southeast: AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, and WV. This subset covers 420 communities.
- Midwest: IL, IN, IA, KS, MI, MN, MO, NE (no data), ND, OH, SD, and WI. This subset covers 513 communities.
- Southwest: AZ, NM, OK, and TX. This subset covers 228 communities.
- West: CA, CO, ID, MT (no data), NV, OR, UT, WA, and WY. This subset covers 418 communities.
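The regional grouping above can be expressed as a simple lookup, sketched here in Python. The state lists come from the slide; the function names, the (name, state) record layout, and the choice to skip communities whose state is not in any listed region are assumptions made for illustration.

```python
REGIONS = {
    "Northeast": ["CT", "DE", "ME", "MD", "MA", "NH", "NJ", "NY", "PA", "RI", "VT"],
    "Southeast": ["AL", "AR", "FL", "GA", "KY", "LA", "MS", "NC", "SC", "TN", "VA", "WV"],
    "Midwest": ["IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND", "OH", "SD", "WI"],
    "Southwest": ["AZ", "NM", "OK", "TX"],
    "West": ["CA", "CO", "ID", "MT", "NV", "OR", "UT", "WA", "WY"],
}

# Invert the table: state abbreviation -> region name.
STATE_TO_REGION = {
    state: region for region, states in REGIONS.items() for state in states
}


def region_of(state):
    """Return the region for a two-letter state code (KeyError if unlisted)."""
    return STATE_TO_REGION[state.upper()]


def split_by_region(communities):
    """Group (name, state) pairs by region, skipping unlisted states."""
    groups = {}
    for name, state in communities:
        region = STATE_TO_REGION.get(state.upper())
        if region is not None:
            groups.setdefault(region, []).append(name)
    return groups
```

Rules can then be mined separately on each group. Note that the five community counts quoted on the slide sum to 2,211, slightly fewer than the 2,215 communities in the full data set; the slides do not explain the difference.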

Experimental Setup
- All attributes with a large number of missing values were removed.
- Odds ratios between each remaining attribute and Violent Crimes, Murders, Robberies, and Assaults were computed.
- Attributes exhibiting small odds ratios were removed.
- Similar attributes were omitted (e.g., from the attributes Divorced (%), Male Divorced (%), and Female Divorced (%), only Divorced (%) was kept).

Examples of Membership Functions
[Figures: membership functions for Race: Hispanic (%), Homeless People in Shelters Per 100K Population, and Violent Crimes Per 100K Population]

Examples of Unsurprising All State Rules
Frequent rules:
- Houses with Kids Living with Two Parents (High) & People Speaking No English (Low) -> Murders (No), conf=0.61, lift=1.3, sup=0.221
- Houses with Public Assistance Income (Low) & Houses with Kids Living with Two Parents (High) -> Murders (No), conf=0.6, lift=1.3, sup=0.219
Rare rules:
- People in Homeless Shelters (High) & People Commute Using Public Transit (High) -> Robberies (High), conf=0.76, lift=42.1, sup=0.002
- Houses with Public Assistance Income (High) & Kids Born to Never Married (High) -> Robberies (High), conf=0.7, lift=38.7, sup=0.002
- Houses with Public Assistance Income (High) & Kids Born to Never Married (High) -> Murders (High), conf=0.74, lift=28.9, sup=0.002