Advanced record linkage methods: scalability, classification, and privacy
|
|
|
- Jessie Murphy
- 10 years ago
- Views:
Transcription
1 Advanced record linkage methods: scalability, classification, and privacy Peter Christen Research School of Computer Science, ANU College of Engineering and Computer Science, The Australian National University Contact: July 2015 p. 1/34
2 Outline A short introduction to record linkage Challenges of record linkage for data science Techniques for scalable record linkage Advanced classification techniques Privacy aspects in record linkage Research directions July 2015 p. 2/34
3 What is record linkage? The process of linking records that represent the same entity in one or more databases (patient, customer, business name, etc.) Also known as data matching, entity resolution, object identification, duplicate detection, identity uncertainty, merge-purge, etc. Major challenge is that unique entity identifiers are not available in the databases to be linked (or if available, they are not consistent or change over time) E.g., which of these records represent the same person? Dr Smith, Peter Pete Smith 42 Miller Street 2602 O Connor 42 Miller St 2600 Canberra A.C.T. P. Smithers 24 Mill Rd 2600 Canberra ACT July 2015 p. 3/34
4 Applications of record linkage Remove duplicates in one data set (deduplication) Merge new records into a larger master data set Create patient or customer oriented statistics (for example for longitudinal studies) Clean and enrich data for analysis and mining Geocode matching (with reference address data) Widespread use of record linkage Immigration, taxation, social security, census Fraud, crime, and terrorism intelligence Business mailing lists, exchange of customer data Health, social science, and data science research July 2015 p. 4/34
5 Record linkage challenges No unique entity identifiers are available (use approximate (string) comparison functions) Real world data are dirty (typographical errors and variations, missing and out-of-date values, different coding schemes, etc.) Scalability to very large databases (naïve comparison of all record pairs is quadratic; some form of blocking, indexing or filtering is needed) No training data in many record linkage applications (true match status not known) Privacy and confidentiality (because personal information is commonly required for linking) July 2015 p. 5/34
6 The record linkage process Database A Database B Data pre processing Data pre processing Indexing / Searching Matches Comparison Classif ication Non matches Evaluation Clerical Review Potential Matches July 2015 p. 6/34
7 Types of record linkage techniques Deterministic matching Exact matching (if a unique identifier of high quality is available: precise, robust, stable over time) Examples: Medicare or NHS numbers Rule-based matching (complex to build and maintain) Probabilistic record linkage (Fellegi and Sunter, 1969) Use available attributes for linking (often personal information, like names, addresses, dates of birth, etc.) Calculate match weights for attributes Computer science approaches (based on machine learning, data mining, database, or information retrieval techniques) July 2015 p. 7/34
8 Challenges for data science Often the aim is to create social genomes for individuals by linking large population databases (Population Informatics, Kum et al. IEEE Computer, 2013) Knowing how individuals and families change over time allows for a diverse range of studies (fertility, employment, education, health, crimes, etc.) Different challenges for historical data compared to contemporary data, but some are common Database sizes (computational aspects) Accurate match classification Sources and coverage of population databases July 2015 p. 8/34
9 Challenges for historical data Low literacy (recording errors and unknown exact values), no address or occupation standards Large percentage of a population had one of just a few common names ( John or Mary ) Households and families change over time Immigration and emigration, birth and death Scanning, OCR, and transcription errors July 2015 p. 9/34
10 Challenges for present-day data These data are about living people, and so privacy is of concern when data are linked between organisations Linked data allow analysis not possible on individual databases (potentially revealing highly sensitive information) Modern databases contain more details and more complex types of data (free-format text or multimedia) Data are available from different sources (governments, businesses, social network sites, the Web) Major questions: Which data are suitable? Which can we get access to? July 2015 p. 10/34
11 Techniques for scalable record linkage Number of record pair comparisons equals the product of the sizes of the two databases (matching two databases containing 1 and 5 million records will result in trillion record pairs) Number of true matches is generally less than the number of records in the smaller of the two databases (assuming no duplicate records) Performance bottleneck is usually the (expensive) detailed comparison of attribute values between records (using approximate string comparison functions) Aim of indexing: Cheaply remove record pairs that are obviously not matches July 2015 p. 11/34
12 Traditional blocking Traditional blocking works by only comparing record pairs that have the same value for a blocking variable (for example, only compare records that have the same postcode value) Problems with traditional blocking An erroneous value in a blocking variable results in a record being inserted into the wrong block (several passes with different blocking variables can solve this) Values of blocking variable should have uniform frequencies (as the most frequent values determine the size of the largest blocks) Example: Frequency of Smith in NSW: 25,425 Frequency of Dijkstra in NSW: 4 July 2015 p. 12/34
13 Recent indexing approaches Sorted neighbourhood approach Sliding window over sorted databases Use several passes with different blocking variables Q-gram based blocking (e.g. 2-grams / bigrams) Convert values into q-gram lists, then generate sub-lists peter [ pe, et, te, er ], [ pe, et, te ], [ pe, et, er ],.. pete [ pe, et, te ], [ pe, et ], [ pe, te ], [ et, te ],... Each record will be inserted into several blocks Overlapping canopy clustering Based on computationally cheap similarity measures, such as Jaccard (set intersection) based on q-grams Records will be inserted into several clusters / blocks July 2015 p. 13/34
14 Controlling block sizes Important for real-time and privacy-preserving record linkage, and with certain machine learning algorithms We have developed an iterative split-merge clustering approach (Fisher et al. ACM KDD, 2015) Original data set from Table 1 Split using < FN, F2> < Jo > Merge Split using <SN, Sdx> Merge Final Blocks < Jo > < S530 > < S530, S253 > < Jo > < S530, S253 > John, Smith, 2000 John, Smith, 2000 John, Smith, 2000 John, Smith, 2000 John, Smith, 2000 John, Smith, 2000 Johnathon, Smith, 2009 Johnathon, Smith, 2009 Johnathon, Smith, 2009 Johnathon, Smith, 2009 Johnathon, Smith, 2009 Johnathon, Smith, 2009 Joey, Schmidt, 2009 Joey, Schmidt, 2009 Joey, Schmidt, 2009 < S253 > Joey, Schmidt, 2009 Joey, Schmidt, 2009 Joe, Miller, 2902 Joe, Miller, 2902 Joe, Miller, 2902 Joey, Schmidt, 2009 < M460, M450 > < Jo >< M460, M450 > Joseph, Milne, 2902 Joseph, Milne, 2902 Joseph, Milne, 2902 < M460 > Joe, Miller, 2902 Joe, Miller, 2902 Paul,, 3000 < Pa > Joe, Miller, 2902 Joseph, Milne, 2902 Joseph, Milne, 2902 Peter, Jones, 3000 Paul,, 3000 < Pe > Peter, Jones, 3000 < Pa, Pe > Paul,, 3000 Peter, Jones, 3000 < M450 > Joseph, Milne, 2902 Blocking Keys = <FN, F2>, <SN, Sdx> S = 2, S = 3 min max < Pa, Pe > Paul,, 3000 Peter, Jones, 3000 July 2015 p. 14/34
15 Advanced classification techniques View record pair classification as a multidimensional binary classification problem (use attribute similarities to classify record pairs as matches or non-matches) Many machine learning techniques can be used Supervised: Requires training data (record pairs with known true match status) Un-supervised: Clustering Recently, collective classification techniques have been investigated (build graph of database and conduct overall classification, and also take relational similarities into account) July 2015 p. 15/34
16 Collective classification example Paper 2? w1=? Dave White w3=?? Paper 6 Susan Grey Paper 3 Intel Liz Pink MIT w2=? Don White w4=? Paper 5 John Black Paper 1 CMU Paper 4 Joe Brown (A1, Dave White, Intel) (A2, Don White, CMU) (A3, Susan Grey, MIT) (A4, John Black, MIT) (A5, Joe Brown, unknown) (A6, Liz Pink, unknown) (P1, John Black / Don White) (P2, Sue Grey / D. White) (P3, Dave White) (P4, Don White / Joe Brown) (P5, Joe Brown / Liz Pink) (P6, Liz Pink / D. White) Adapted from Kalashnikov and Mehrotra, ACM TODS, 31(2), 2006 July 2015 p. 16/34
17 Classification challenges In many cases there are no training data available (no data set with known true match status) Possible to use results of earlier record linkage projects? Or from manual clerical review process? How confident can we be about correct manual classification of potential matches? No large test data collections available (unlike in information retrieval or machine learning) Many record linkage researchers use synthetic or bibliographic data (which have very different characteristics to personal data) July 2015 p. 17/34
18 Group matching using household information (Fu et al. 2011, 2012) Conduct pair-wise linking of individual records Calculate household similarities using Jaccard or weighted similarities (based on pair-wise links) Promising results on UK Census data from 1851 to 1901 (Rawtenstall, around 17,000 to 31,000 records) July 2015 p. 18/34
19 AttrSim = 0.42 Graph-matching based on household structure (Fu et al. 2014) 1 1 r11 r12 r21 r22 AttrSim = H1 30 H H r13 AttrSim = 0.56 r23 H ID r11 r12 r13 Address goodshaw goodshaw goodshaw SN smith smith smith FN john mary anton Age ID r21 r22 r23 Address goodshaw goodshaw goodshaw SN smith smith smith FN jack marie toni Age One graph per household, find best matching graphs using both record attribute and structural similarities Edge attributes are information that does not change over time (like age differences) July 2015 p. 19/34
20 Privacy aspects in record linkage Objective is to link data across organisations such that besides the linked records (the ones classified to refer to the same entities) no information about the sensitive source data can be learned by any party involved in the linking, or any external party. Main challenges Allow for approximate linking of values Have techniques that are not vulnerable to any kind of attack (frequency, dictionary, crypt-analysis, etc.) Have techniques that are scalable to linking large databases July 2015 p. 20/34
21 Privacy and record linkage: An example scenario A demographer who aims to investigate how mortgage stress is affecting different people with regard to their mental and physical health She will need data from financial institutions, government agencies (social security, health, and education), and private sector providers (such as health insurers) It is unlikely she will get access to all these databases (for commercial or legal reasons) She only requires access to some attributes of the records that are linked, but not the actual identities of the linked individuals (but personal details are needed to conduct the actual linkage) July 2015 p. 21/34
22 The PPRL process Database A Database B Data pre processing Data pre processing Privacy preserving context Indexing / Searching Matches Comparison Classif ication Non matches Evaluation Encoded data Clerical Review Potential Matches July 2015 p. 22/34
23 Hash-encoding for PPRL A basic building block of many PPRL protocols Idea: Use a one-way hash function (like SHA) to encode values, then compare hash-codes Having only access to hash-codes will make it nearly impossible to learn their original input values But dictionary and frequency attacks are possible Single character difference in input values results in completely different hash codes For example: peter or 4R#x+Y4i9!e@t4o] pete or Z5%o-(7Tq1@?7iE/ Only exact matching is possible July 2015 p. 23/34
24 Research directions (1) For historical data, the main challenge is data quality (develop (semi-)automatic data cleaning and standardisation techniques) How to employ collective classification techniques for data with personal information? No training data in most applications Employ active learning approaches Visualisation for improved manual clerical review Linking data from many sources (significant challenge in PPRL, due to issue of collusion) Frameworks for record linkage that allow comparative experimental studies July 2015 p. 24/34
25 Research directions (2) Collections of test data sets which can be used by researchers Challenging (impossible?) to have true match status Challenging as most data are either proprietary or sensitive Develop practical PPRL techniques Standard measures for privacy Improved advanced classification techniques for PPRL Methods to assess accuracy and completeness Pragmatic challenge: Collaborations across multiple research disciplines July 2015 p. 25/34
26 Advertisement: Book Population Reconstruction (2015) The book details the possibilities and limitations of information technology with respect to reasoning for population reconstruction. Follows the three main processing phases from handwritten registers to a reconstructed digitized population. Combines research from historians, social scientists, linguists, and computer scientists. July 2015 p. 26/34
27 Advertisement: Book Data Matching (2012) The book is very well organized and exceptionally well written. Because of the depth, amount, and quality of the material that is covered, I would expect this book to be one of the standard references in future years. William E. Winkler, U.S. Bureau of the Census. July 2015 p. 27/34
28 Managing transitive closure a1 a2 a3 a4 If record a1 is classified as matching with record a2, and record a2 as matching with record a3, then records a1 and a3 must also be matching. Possibility of record chains occurring Various algorithms have been developed to find optimal solutions (special clustering algorithms) Collective classification and clustering approaches deal with this problem by default July 2015 p. 28/34
29 Current best practice approach used in health domain (1) Mortgage database Names, addresses, DoB, etc. Financial details Mental health database Names, addresses, DoB, etc. Health details Education database Names, addresses, DoB, etc. Education details Linkage unit Researchers Step 1: Database owners send partially identifying data to linkage unit Step 2: Linkage unit sends linked record identifiers back Step 3: Database owners send payload data to researchers Details given in: Chris Kelman, John Bass, and D Arcy Holman: Research use of Linked Health Data A Best Practice Protocol, Aust NZ Journal of Public Health, vol. 26, July 2015 p. 29/34
30 Current best practice approach used in health domain (2) Problem with this approach is that the linkage unit needs access to personal details (metadata might also reveal sensitive information) Collusion between parties, and internal and external attacks, make these data vulnerable Privacy-preserving record linkage (PPRL) aims to overcome these drawbacks No unencoded data ever leave a data source Only details about matched records are revealed Provable security against different attacks PPRL is challenging (employs techniques from cryptography, machine learning, databases, etc.) July 2015 p. 30/34
31 Advanced PPRL techniques First generation (mid 1990s): exact matching only using simple hash encoding Second generation (early 2000s): approximate matching but not scalable (PP versions of edit distance and other string comparison functions) Third generation (mid 2000s): take scalability into account (often a compromise between PP and scalability, some information leakage accepted) Different approaches have been developed for PPRL, so far no clear best technique (for example based on Bloom filters, phonetic encodings, generalisation, randomly added values, or secure multi-party computation) July 2015 p. 31/34
32 A taxonomy for PPRL PPRL Taxonomy Privacy aspects Linkage techniques Theoretical analysis Evaluation Practical aspects Number of parties Indexing Scalability Scalability Implementation Aversary model Comparison Linkage quality Linkage quality Data sets Privacy techniques Classification Privacy vulnerabilities Privacy Application area July 2015 p. 32/34
33 Basic PPRL protocols Alice (2) (1) (2) Bob Alice (1) (2) (2) Bob (3) (3) (3) Carol (3) Two basic types of protocols Two-party protocol: Only the two database owners who wish to link their data Three-party protocols: Use a (trusted) third party (linkage unit) to conduct the linkage (this party will never see any unencoded values, but collusion is possible) July 2015 p. 33/34
34 Secure multi-party computation Compute a function across several parties, such that no party learns the information from the other parties, but all receive the final results [Yao 1982; Goldreich 1998/2002] Simple example: Secure summation s = i x i. Step 0: Z=999 Step 4: s = 1169 Z = 170 Party 1 x1=55 Step 1: Z+x1= 1054 Party 2 Step 2: (Z+x1)+x2 = 1127 x2=73 Step 3: ((Z+x1)+x2)+x3=1169 Party 3 x3=42 July 2015 p. 34/34
Privacy Aspects in Big Data Integration: Challenges and Opportunities
Privacy Aspects in Big Data Integration: Challenges and Opportunities Peter Christen Research School of Computer Science, The Australian National University, Canberra, Australia Contact: [email protected]
Private Record Linkage with Bloom Filters
To appear in: Proceedings of Statistics Canada Symposium 2010 Social Statistics: The Interplay among Censuses, Surveys and Administrative Data Private Record Linkage with Bloom Filters Rainer Schnell,
Quality and Complexity Measures for Data Linkage and Deduplication
Quality and Complexity Measures for Data Linkage and Deduplication Peter Christen and Karl Goiser Department of Computer Science, Australian National University, Canberra ACT 0200, Australia {peter.christen,karl.goiser}@anu.edu.au
Development and User Experiences of an Open Source Data Cleaning, Deduplication and Record Linkage System
Development and User Experiences of an Open Source Data Cleaning, Deduplication and Record Linkage System Peter Christen School of Computer Science The Australian National University Canberra ACT 0200,
Febrl A Freely Available Record Linkage System with a Graphical User Interface
Febrl A Freely Available Record Linkage System with a Graphical User Interface Peter Christen Department of Computer Science, The Australian National University Canberra ACT 0200, Australia Email: [email protected]
Identity Management. An overview of the CareEvolution RHIO Technology Platform s Identity Management (record linking) service
Identity Management An overview of the CareEvolution RHIO Technology Platform s Identity Management (record linking) service CareEvolution, Inc. All Rights Reserved. All logos, brand and product names
Online Credit Card Application and Identity Crime Detection
Online Credit Card Application and Identity Crime Detection Ramkumar.E & Mrs Kavitha.P School of Computing Science, Hindustan University, Chennai ABSTRACT The credit cards have found widespread usage due
An Efficient way of Record Linkage System and Deduplication using Indexing techniques, Classification and FEBRL Framework
An Efficient way of Record Linkage System and Deduplication using Indexing techniques, Classification and FEBRL Framework Nishand.K, Ramasami.S, T.Rajendran Abstract Record linkage is an important process
A Tutorial on Population Informatics using Big Data
A Tutorial on Population Informatics using Big Data Peter Christen 1, Hye-Chung Kum 2, Qing Wang 1, and Dinusha Vatsalan 1 1 Research School of Computer Science, The Australian National University, Canberra,
Record Deduplication By Evolutionary Means
Record Deduplication By Evolutionary Means Marco Modesto, Moisés G. de Carvalho, Walter dos Santos Departamento de Ciência da Computação Universidade Federal de Minas Gerais 31270-901 - Belo Horizonte
Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining
Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,
Technical Approaches for Protecting Privacy in the PCORnet Distributed Research Network V1.0
Technical Approaches for Protecting Privacy in the PCORnet Distributed Research Network V1.0 Guidance Document Prepared by: PCORnet Data Privacy Task Force Submitted to the PMO Approved by the PMO Submitted
Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health
Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining
Information Security in Big Data using Encryption and Decryption
International Research Journal of Computer Science (IRJCS) ISSN: 2393-9842 Information Security in Big Data using Encryption and Decryption SHASHANK -PG Student II year MCA S.K.Saravanan, Assistant Professor
Efficient and Effective Duplicate Detection Evaluating Multiple Data using Genetic Algorithm
Efficient and Effective Duplicate Detection Evaluating Multiple Data using Genetic Algorithm Dr.M.Mayilvaganan, M.Saipriyanka Associate Professor, Dept. of Computer Science, PSG College of Arts and Science,
Protein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
Issues in Identification and Linkage of Patient Records Across an Integrated Delivery System
Issues in Identification and Linkage of Patient Records Across an Integrated Delivery System Max G. Arellano, MA; Gerald I. Weber, PhD To develop successfully an integrated delivery system (IDS), it is
How. Matching Technology Improves. White Paper
How Matching Technology Improves Data Quality White Paper Table of Contents How... 3 What is Matching?... 3 Benefits of Matching... 5 Matching Use Cases... 6 What is Matched?... 7 Standardization before
Secure Computation Martin Beck
Institute of Systems Architecture, Chair of Privacy and Data Security Secure Computation Martin Beck Dresden, 05.02.2015 Index Homomorphic Encryption The Cloud problem (overview & example) System properties
Web Document Clustering
Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,
Big Data & Scripting Part II Streaming Algorithms
Big Data & Scripting Part II Streaming Algorithms 1, Counting Distinct Elements 2, 3, counting distinct elements problem formalization input: stream of elements o from some universe U e.g. ids from a set
2011 UK Census Coverage Assessment and Adjustment Methodology. Owen Abbott, Office for National Statistics, UK 1
Proceedings of Q2008 European Conference on Quality in Official Statistics 2011 UK Census Coverage Assessment and Adjustment Methodology Owen Abbott, Office for National Statistics, UK 1 1. Introduction
Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2
Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data
Hybrid Technique for Data Cleaning
Hybrid Technique for Data Cleaning Ashwini M. Save P.G. Student, Department of Computer Engineering, Thadomal Shahani Engineering College, Bandra, Mumbai, India Seema Kolkur Assistant Professor, Department
Introduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
The Role of Size Normalization on the Recognition Rate of Handwritten Numerals
The Role of Size Normalization on the Recognition Rate of Handwritten Numerals Chun Lei He, Ping Zhang, Jianxiong Dong, Ching Y. Suen, Tien D. Bui Centre for Pattern Recognition and Machine Intelligence,
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Load-Balancing the Distance Computations in Record Linkage
Load-Balancing the Distance Computations in Record Linkage Dimitrios Karapiperis Vassilios S. Verykios Hellenic Open University School of Science and Technology Patras, Greece {dkarapiperis, verykios}@eap.gr
Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical
A Review of Anomaly Detection Techniques in Network Intrusion Detection System
A Review of Anomaly Detection Techniques in Network Intrusion Detection System Dr.D.V.S.S.Subrahmanyam Professor, Dept. of CSE, Sreyas Institute of Engineering & Technology, Hyderabad, India ABSTRACT:In
Efficient Similarity Search over Encrypted Data
UT DALLAS Erik Jonsson School of Engineering & Computer Science Efficient Similarity Search over Encrypted Data Mehmet Kuzu, Saiful Islam, Murat Kantarcioglu Introduction Client Untrusted Server Similarity
Susan J Hyatt President and CEO HYATTDIO, Inc. Lorraine Fernandes, RHIA Global Healthcare Ambassador IBM Information Management
Accurate and Trusted Data- The Foundation for EHR Programs Susan J Hyatt President and CEO HYATTDIO, Inc. Lorraine Fernandes, RHIA Global Healthcare Ambassador IBM Information Management Healthcare priorities
CHAPTER 1 INTRODUCTION
CHAPTER 1 INTRODUCTION 1. Introduction 1.1 Data Warehouse In the 1990's as organizations of scale began to need more timely data for their business, they found that traditional information systems technology
Improving data integrity on cloud storage services
International Journal of Engineering Science Invention ISSN (Online): 2319 6734, ISSN (Print): 2319 6726 Volume 2 Issue 2 ǁ February. 2013 ǁ PP.49-55 Improving data integrity on cloud storage services
Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.
White Paper Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. Using LSI for Implementing Document Management Systems By Mike Harrison, Director,
Theoretical Aspects of Storage Systems Autumn 2009
Theoretical Aspects of Storage Systems Autumn 2009 Chapter 3: Data Deduplication André Brinkmann News Outline Data Deduplication Compare-by-hash strategies Delta-encoding based strategies Measurements
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search
Project for Michael Pitts Course TCSS 702A University of Washington Tacoma Institute of Technology ALIAS: A Tool for Disambiguating Authors in Microsoft Academic Search Under supervision of : Dr. Senjuti
Collecting Polish German Parallel Corpora in the Internet
Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska
AUTOMATICALLY DECEPTIVE CRIMINAL IDENTITIES
by GANG WANG, HSINCHUN CHEN, and HOMA ATABAKHSH AUTOMATICALLY DETECTING DECEPTIVE CRIMINAL IDENTITIES Fear about identity verification reached new heights since the terrorist attacks on Sept. 11, 2001,
NETWORK SECURITY: How do servers store passwords?
NETWORK SECURITY: How do servers store passwords? Servers avoid storing the passwords in plaintext on their servers to avoid possible intruders to gain all their users passwords. A hash of each password
131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10
1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom
REFLECTIONS ON THE USE OF BIG DATA FOR STATISTICAL PRODUCTION
REFLECTIONS ON THE USE OF BIG DATA FOR STATISTICAL PRODUCTION Pilar Rey del Castillo May 2013 Introduction The exploitation of the vast amount of data originated from ICT tools and referring to a big variety
Data Outsourcing based on Secure Association Rule Mining Processes
, pp. 41-48 http://dx.doi.org/10.14257/ijsia.2015.9.3.05 Data Outsourcing based on Secure Association Rule Mining Processes V. Sujatha 1, Debnath Bhattacharyya 2, P. Silpa Chaitanya 3 and Tai-hoon Kim
Top Ten Security and Privacy Challenges for Big Data and Smartgrids. Arnab Roy Fujitsu Laboratories of America
1 Top Ten Security and Privacy Challenges for Big Data and Smartgrids Arnab Roy Fujitsu Laboratories of America 2 User Roles and Security Concerns [SKCP11] Users and Security Concerns [SKCP10] Utilities:
Keywords Cloud Storage, Error Identification, Partitioning, Cloud Storage Integrity Checking, Digital Signature Extraction, Encryption, Decryption
Partitioning Data and Domain Integrity Checking for Storage - Improving Cloud Storage Security Using Data Partitioning Technique Santosh Jogade *, Ravi Sharma, Prof. Rajani Kadam Department Of Computer
SoSe 2014: M-TANI: Big Data Analytics
SoSe 2014: M-TANI: Big Data Analytics Lecture 4 21/05/2014 Sead Izberovic Dr. Nikolaos Korfiatis Agenda Recap from the previous session Clustering Introduction Distance mesures Hierarchical Clustering
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati [email protected], [email protected]
Classification and Prediction
Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser
Data Mining Project Report. Document Clustering. Meryem Uzun-Per
Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...
Formal Methods for Preserving Privacy for Big Data Extraction Software
Formal Methods for Preserving Privacy for Big Data Extraction Software M. Brian Blake and Iman Saleh Abstract University of Miami, Coral Gables, FL Given the inexpensive nature and increasing availability
COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction
COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised
Beyond 2011: Population Coverage Survey Field Test 2013 December 2013
Beyond 2011 Beyond 2011: Population Coverage Survey Field Test 2013 December 2013 Background The Beyond 2011 Programme in the Office for National Statistics (ONS) is currently reviewing options for taking
IMPROVED MASK ALGORITHM FOR MINING PRIVACY PRESERVING ASSOCIATION RULES IN BIG DATA
International Conference on Computer Science, Electronics & Electrical Engineering-0 IMPROVED MASK ALGORITHM FOR MINING PRIVACY PRESERVING ASSOCIATION RULES IN BIG DATA Pavan M N, Manjula G Dept Of ISE,
The Value of Advanced Data Integration in a Big Data Services Company. Presenter: Flavio Villanustre, VP Technology September 2014
The Value of Advanced Data Integration in a Big Data Services Company Presenter: Flavio Villanustre, VP Technology September 2014 About LexisNexis We are among the largest providers of risk solutions in
Lluis Belanche + Alfredo Vellido. Intelligent Data Analysis and Data Mining
Lluis Belanche + Alfredo Vellido Intelligent Data Analysis and Data Mining a.k.a. Data Mining II Office 319, Omega, BCN EET, office 107, TR 2, Terrassa [email protected] skype, gtalk: avellido Tels.:
HIGH PRECISION MATCHING AT THE HEART OF MASTER DATA MANAGEMENT
HIGH PRECISION MATCHING AT THE HEART OF MASTER DATA MANAGEMENT Author: Holger Wandt Management Summary This whitepaper explains why the need for High Precision Matching should be at the heart of every
Real-time customer information data quality and location based service determination implementation best practices.
White paper Location Intelligence Location and Business Data Real-time customer information data quality and location based service determination implementation best practices. Page 2 Real-time customer
Content Teaching Academy at James Madison University
Content Teaching Academy at James Madison University 1 2 The Battle Field: Computers, LANs & Internetworks 3 Definitions Computer Security - generic name for the collection of tools designed to protect
Web Forensic Evidence of SQL Injection Analysis
International Journal of Science and Engineering Vol.5 No.1(2015):157-162 157 Web Forensic Evidence of SQL Injection Analysis 針 對 SQL Injection 攻 擊 鑑 識 之 分 析 Chinyang Henry Tseng 1 National Taipei University
Course Bachelor of Information Technology majoring in Network Security or Data Infrastructure Engineering
Course Bachelor of Information Technology majoring in Network Security or Data Infrastructure Engineering Course Number HE20524 Location Meadowbank OVERVIEW OF SUBJECT REQUIREMENTS Note: This document
COURSE RECOMMENDER SYSTEM IN E-LEARNING
International Journal of Computer Science and Communication Vol. 3, No. 1, January-June 2012, pp. 159-164 COURSE RECOMMENDER SYSTEM IN E-LEARNING Sunita B Aher 1, Lobo L.M.R.J. 2 1 M.E. (CSE)-II, Walchand
Research of Postal Data mining system based on big data
3rd International Conference on Mechatronics, Robotics and Automation (ICMRA 2015) Research of Postal Data mining system based on big data Xia Hu 1, Yanfeng Jin 1, Fan Wang 1 1 Shi Jiazhuang Post & Telecommunication
New Hash Function Construction for Textual and Geometric Data Retrieval
Latest Trends on Computers, Vol., pp.483-489, ISBN 978-96-474-3-4, ISSN 79-45, CSCC conference, Corfu, Greece, New Hash Function Construction for Textual and Geometric Data Retrieval Václav Skala, Jan
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
HASH CODE BASED SECURITY IN CLOUD COMPUTING
ABSTRACT HASH CODE BASED SECURITY IN CLOUD COMPUTING Kaleem Ur Rehman M.Tech student (CSE), College of Engineering, TMU Moradabad (India) The Hash functions describe as a phenomenon of information security
Decision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment
A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed
ก ก ก ก ก 460-104 3(3-0-6) ก ก ก (Introduction to Business) (Principles of Marketing)
ก ก ก 460-101 3(3-0-6) ก ก ก (Introduction to Business) ก ก ก ก ก ก ก ก ก ก ก ก ก ก ก Types of business; business concepts of human resource management, production, marketing, accounting, and finance;
Arnab Roy Fujitsu Laboratories of America and CSA Big Data WG
Arnab Roy Fujitsu Laboratories of America and CSA Big Data WG 1 The Big Data Working Group (BDWG) will be identifying scalable techniques for data-centric security and privacy problems. BDWG s investigation
Data Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
A Survey on Deduplication Strategies and Storage Systems
A Survey on Deduplication Strategies and Storage Systems Guljar Shaikh ((Information Technology,B.V.C.O.E.P/ B.V.C.O.E.P, INDIA) Abstract : Now a day there is raising demands for systems which provide
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic
Strengthen RFID Tags Security Using New Data Structure
International Journal of Control and Automation 51 Strengthen RFID Tags Security Using New Data Structure Yan Liang and Chunming Rong Department of Electrical Engineering and Computer Science, University
Integrated Data Infrastructure and prototype
Integrated Data Infrastructure and prototype Crown copyright This work is licensed under the Creative Commons Attribution 3.0 New Zealand licence. You are free to copy, distribute, and adapt the work,
