Data Warehousing and Data Mining for Improvement of Customs Administration in India: Lessons learnt overseas for implementation in India
Participants Shailesh Kumar (Group Leader) Sameer Chitkara (Asst. Group Leader) Ajazuddin M. S. Chandrakant Rao Prashant S Kaduskar
What Is Data Mining? Data mining is the process of discovering and interpreting previously unknown patterns in large datasets, leading to actionable knowledge, through one or more data mining techniques (such as market basket analysis / association rules, clustering, neural networks etc.).
Different nomenclatures of Data Mining: Knowledge Discovery, Pattern Discovery, Knowledge Mining, Data Dredging, Data Archaeology
Foundations of Data Mining: Data mining is the process of using raw data to infer important business relationships. Despite a consensus on the value of data mining, a great deal of confusion exists about what it is. Data mining is a collection of powerful techniques intended for analyzing large amounts of data. There is no single data mining approach, but rather a set of techniques that can be used on their own or in combination with each other.
Disciplines involved in Data Mining: Statistics, Decision Support, Data Management and Warehousing, Machine Learning, Parallel Processing, Visualization
Steps in Data Mining: (1) Understanding of the problem domain; (2) Understanding of the data; (3) Preparation of data; (4) Data mining; (5) Evaluation of discovered knowledge; (6) Use of the discovered knowledge. Input: data (databases, images, videos etc.). Output: knowledge (patterns, values, clusters, associations etc.). Extend knowledge to other domains.
Knowledge Discovery Process: Raw Data -> Integration -> Data Warehouse -> Target Data -> Transformed Data -> Patterns and Rules -> Interpretation & Evaluation -> Knowledge -> Understanding
Data Mining vs. KDD: Knowledge Discovery in Databases (KDD): the process of finding useful information and patterns in data. Data Mining: the use of algorithms to extract the information and patterns derived by the KDD process. Data mining (knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
Multidisciplinary: Data Mining draws on Statistics, Pattern Recognition, Neurocomputing, Machine Learning, AI, Databases and KDD
Data Mining Models and Tasks
Major Application Areas for Data Mining Solutions: Advertising, Bioinformatics, Customer Relationship Management (CRM), Database Marketing, Fraud Detection, E-Commerce, Health Care, Banking/Financial Services/Insurance, Investment/Securities, Manufacturing, Process Control, Sports and Entertainment, Telecommunications, Web
Data Mining Applications in CRM: Typical Applications
Customer Segmentation: What are my market segments and who are my customers?
Propensity to Buy: Which customers are good candidates for our new long distance calling plans?
Profitability Modeling & Profiling: What is the lifetime profitability of my customers?
Customer Attrition: Which of my most valuable customers are at risk of leaving?
Channel Optimization: What is the best channel to reach my customers in each market segment?
Fraud Detection: How can I tell which transactions are likely to be fraudulent?
Benefits: personalizing customer relationships, targeting customers based on their needs, preventing the loss of high-value customers (based on their current and future value) while letting go of lower-value customers, interacting with customers based on their preference, and detecting and preventing fraud to minimize loss. The outcomes are more product sales, higher satisfaction, loyalty and retention, and greater profitability.
Source: Teradata, 2001
DM and its applications in Finance: Stock market prediction, Portfolio optimization, Foreign exchange forecasting, Bankruptcy prediction, Fraud detection, Credit scoring, Options pricing
Usage of Data Mining in Indian Customs: (1) To analyse transaction risks: risk rule fine-tuning, transaction value referencing. (2) For revenue forecasting: based on past time periods, import elasticity mapping.
Elasticity Mapping in CBEC for Taxation Purpose. [Bar chart: Commodity Distribution by elasticity bands (2009-10); x-axis: Elasticity Bands; y-axis: Number of Commodities, 0 to 1000. Bands: <-10 (Highly Elastic), -1 to -10, -0.5 to -1, -0.01 to -0.5, 0 to 0.5, 0.5 to 1, 1 to 10, >10 (Highly Inelastic). Bar values as extracted: 553, 876, 310, 600, 351, 202, 843, 270.] Knowing the import elasticity of various commodities can be a very useful tool in tax planning and policy-making.
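The banding behind such a chart can be sketched in a few lines of Python. This is only an illustration, not the procedure CBEC uses. The band labels are taken from the slide. The handling of boundary values (each boundary falls into the higher band, and values between -0.01 and 0 are folded into the "-0.01 to -0.5" band) and the sample commodities are assumptions made for the example.

```python
from bisect import bisect_right
from collections import Counter

# Band edges in ascending order. ASSUMPTION: a value equal to an edge falls
# into the higher band, and the gap between -0.01 and 0 on the slide is
# folded into the "-0.01 to -0.5" band.
EDGES = [-10, -1, -0.5, 0, 0.5, 1, 10]
LABELS = [
    "<-10 (Highly Elastic)",
    "-1 to -10",
    "-0.5 to -1",
    "-0.01 to -0.5",
    "0 to 0.5",
    "0.5 to 1",
    "1 to 10",
    ">10 (Highly Inelastic)",
]


def elasticity_band(e: float) -> str:
    """Return the label of the band that contains elasticity value e."""
    return LABELS[bisect_right(EDGES, e)]


def band_distribution(elasticities: dict) -> dict:
    """Count commodities per band, keeping the bands in chart order."""
    counts = Counter(elasticity_band(e) for e in elasticities.values())
    return {label: counts.get(label, 0) for label in LABELS}


# Hypothetical commodities and elasticities, for illustration only.
sample = {
    "commodity_A": -12.0,
    "commodity_B": -0.7,
    "commodity_C": 0.3,
    "commodity_D": 25.0,
    "commodity_E": -0.2,
}
distribution = band_distribution(sample)
for label, count in distribution.items():
    print(f"{label:>24}: {count}")
```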
Use of Data Mining in Valuation Process: Ensuring that clusters are created using Valuation Rules (comparing apples with apples). Steps (Step-1 to Step-4): CTH selection; data regrouping based on item description, Country of Origin (COO) and UQC; currency standardization; variable selection for pricing. MEDIAN AS MEASURE OF CENTRAL TENDENCY: unaffected by extreme values or outliers; non-parametric measure; does not assume a normal distribution of data; effective for highly skewed or asymmetrical distributions of data; minimises absolute errors; represents at least half the observations.
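A minimal Python sketch of the grouping-and-median idea follows. It is not the valuation system's actual implementation. The record fields (`cth`, `item`, `coo`, `uqc`, `currency`, `unit_price`), the exchange rates and the sample data are all hypothetical. It groups declarations by CTH, item description, COO and UQC, converts prices to a single currency, and takes the median unit price of each group.

```python
from collections import defaultdict
from statistics import median

# Hypothetical exchange rates to INR, for illustration only.
RATES_TO_INR = {"USD": 80.0, "EUR": 90.0, "INR": 1.0}


def median_unit_prices(records, rates=RATES_TO_INR):
    """Group comparable declarations and return the median unit price (INR)."""
    groups = defaultdict(list)
    for r in records:
        # Currency standardization: express every unit price in INR.
        price_inr = r["unit_price"] * rates[r["currency"]]
        # Regrouping: same CTH, item description, country of origin and UQC.
        key = (r["cth"], r["item"].strip().lower(), r["coo"], r["uqc"])
        groups[key].append(price_inr)
    return {key: median(prices) for key, prices in groups.items()}


# Hypothetical declarations; the third one is an outlier.
records = [
    {"cth": "84713010", "item": "Laptop", "coo": "CN", "uqc": "NOS",
     "currency": "USD", "unit_price": 10.0},
    {"cth": "84713010", "item": "laptop ", "coo": "CN", "uqc": "NOS",
     "currency": "USD", "unit_price": 12.0},
    {"cth": "84713010", "item": "Laptop", "coo": "CN", "uqc": "NOS",
     "currency": "USD", "unit_price": 100.0},
    {"cth": "84713010", "item": "Laptop", "coo": "DE", "uqc": "NOS",
     "currency": "EUR", "unit_price": 20.0},
]

reference_prices = median_unit_prices(records)
for key, price in reference_prices.items():
    print(key, price)
```

In the first group the outlier of 100 USD would pull a mean far above the typical price, while the median stays at the middle observation. This is the robustness property the slide lists.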
Strategic Picture of Risk (SPR) used by Her Majesty's Revenue and Customs, UK: Strategy Tier: senior decision makers; Risk, Response, Result. Tactical Tier: senior officials; risk, behaviour segment. Operational Tier: executive officials; behaviour segment.
Analytical Techniques used by HMRC: High impact low probability event analysis. Creative: Mind Mapping, Concept Mapping. Diagnostic indicator: Argument mapping, Key assumption check, Diagnostic reasoning, Structural Brainstorm, Impact analysis, Scenario analysis.
RISK MAP / RISK MATRIX (x-axis: Likelihood; y-axis: Impact; origin (0,0)). Low likelihood, high impact: Transfer the risk. High likelihood, high impact: Terminate the risk. Low likelihood, low impact: Tolerate. High likelihood, low impact (area of caution): Treat the risk. Source: Fundamentals of Risk Management by Paul Hopkin
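The four quadrants of the matrix can be expressed as a small Python lookup. This is only a sketch: scoring likelihood and impact on a 0 to 1 scale and splitting "low" from "high" at 0.5 are assumptions made for the example and are not taken from the source.

```python
def risk_response(likelihood: float, impact: float, threshold: float = 0.5) -> str:
    """Map a (likelihood, impact) pair to one of the four quadrant responses.

    ASSUMPTION: both scores lie in [0, 1] and `threshold` separates low
    from high; the cut-off is illustrative.
    """
    high_likelihood = likelihood >= threshold
    high_impact = impact >= threshold
    if high_likelihood and high_impact:
        return "Terminate"
    if high_impact:
        return "Transfer"   # low likelihood, high impact
    if high_likelihood:
        return "Treat"      # high likelihood, low impact
    return "Tolerate"       # low likelihood, low impact


# Hypothetical risks, for illustration only.
for name, likelihood, impact in [
    ("risk_1", 0.9, 0.8),
    ("risk_2", 0.1, 0.9),
    ("risk_3", 0.7, 0.2),
    ("risk_4", 0.1, 0.1),
]:
    print(name, risk_response(likelihood, impact))
```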
CBEC DATA WAREHOUSE (basic schematic diagram): Source systems (ICES, SERMON, SAPS, ACES) feed a central data repository of cleaned and consistent data, which is linked to TRU, DGRI / DGCEI, ICENET / ICEGATE, Valuation, Commissionerates and others.
Usage of Data Mining in CBEC: CBEC currently uses data mining tools for revenue prediction, risk analysis, policy-making etc. Two models are used for this purpose: an extended model and a parsimonious model. The parsimonious model has the higher accuracy rate. It depicts the price elasticity response of a commodity, that is, how the quantity demanded changes with price. Price elasticity responses are used to decide on raising or cutting tax rates and to identify highly elastic and inelastic commodities. Knowledge of the import elasticity of various commodities is used for planning and policy-making on commodity taxation. The data mining facility of the EDW is currently used by TRU, field formations and enforcement agencies.
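As an illustration of what a price elasticity response means, the sketch below estimates elasticity as the slope of a log-log regression of quantity on price. This is a common textbook approach, not a description of the extended or parsimonious model mentioned above, whose specifications are not given here. The price and quantity series are hypothetical.

```python
import math


def price_elasticity(prices, quantities):
    """Estimate elasticity as the OLS slope of ln(quantity) on ln(price)."""
    if len(prices) != len(quantities) or len(prices) < 2:
        raise ValueError("need two or more paired observations")
    x = [math.log(p) for p in prices]
    y = [math.log(q) for q in quantities]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("prices must vary to estimate elasticity")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical series in which quantity falls as price**-2.
prices = [1.0, 2.0, 4.0, 8.0]
quantities = [1000.0 * p ** -2 for p in prices]

elasticity = price_elasticity(prices, quantities)
print(round(elasticity, 3))
```

An estimate well below -1 means that the quantity imported falls more than proportionally when the price, including tax, rises. That is the kind of signal the slide describes as input to decisions on raising or cutting tax rates.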