- Clustering Taiwan's Real Estate Data for Market Structure Analysis




Unlock the Value of Open Data - Clustering Taiwan's Real Estate Data for Market Structure Analysis

Sheng-Chi Chen (1), Chien-hung Liu (2)
(1, 2) Department of Management Information Systems, National Chengchi University, Taipei, Taiwan
(1) 102356503@nccu.edu.tw, (2) claude.liu@gmail.com

Abstract

In recent years, data mining has been a rapidly growing area that is used for knowledge discovery in databases. With information technology and data mining techniques, large amounts of data can be explored, analyzed and converted into useful information and knowledge. Despite this upward trend, the use of cluster analysis for market structure analysis in Taiwan's real estate sector has received limited attention in past research. This paper aims to fill this gap by applying cluster analysis to actual-sold real estate data acquired from the government open database. The findings provide important insights into real estate market structures for Taipei City and New Taipei City. This paper suggests that future scholars consider open market information and expert domain knowledge to add more commercial value to analytical results.

Keywords: Data Mining, Cluster Analysis, Ward's Method, Real Estate Transaction

1. Introduction

In the era of the knowledge economy, it is critical for enterprises to know how to make better use of information technology and business data in order to formulate the best business and customer strategies and create more business value. Traditionally, businesses pursuing an e-business strategy tend to focus on data aggregation and integration, or solely on automating labor-intensive processes. However, as information technology evolves, the power of software and hardware increases, and the volume of data grows rapidly, most enterprises today have started to realize the urgency and importance of leveraging the data they already hold for competitive advantage.

By nature, real estate is a long-lasting durable good that also serves as an investment. These characteristics make it very different from general merchandise. Traditionally, the real estate industry values any information related to real estate products, such as the current status of the buyer or seller (i.e. the reason to buy or sell a particular property), ownership, and any transaction-related information. Real estate agents make a profit by brokering transactions between sellers and buyers. Equipped with critical information, real estate agents can act as a powerful intermediary between sellers and buyers and can expedite the transaction process. However, buyers and sellers do not hold the same information as their agents do; the agents' advantage is mainly due to information asymmetry. As a result, many disputes arise among buyers, sellers and real estate agents during the transaction process. This information asymmetry and lack of transparency not only reduce trust but also contribute to rising house prices. The conflict reflects the importance of the openness and transparency in real estate transaction information that the general public demands.

The Taiwanese government aims to curb real estate speculation through policy measures such as a luxury tax and real estate price disclosure. For example, to reduce the lack of transparency and the asymmetry of real estate price information, the Real Estate Value Laws were promulgated in 2011, and the general public can now access timely and trustworthy information from the government. The recent real estate policy reform in Taiwan, along with the worldwide trend toward open government data, has opened up many opportunities for researchers and industry practitioners to leverage open government data to create more business value. This research applies a data mining approach, specifically cluster analysis, to real estate market data in order to better understand the real estate market structure in Taiwan.
In addition, this research identified several segments for Taipei City and New Taipei City that have similarities as well as dissimilarities, which provide more insight into the real estate market structures of the two cities and the possible family types behind different product segments. The research results not only provide insight into the market structure of real estate in Taiwan but also reveal a product-customer relationship that may help consumers select suitable products, or help agencies recommend suitable products to consumers.

International Journal of Digital Content Technology and its Applications (JDCTA), Volume 8, Number 5, October 2014

2. Literature review

2.1. Real estate transactions

In the real estate marketplace, buyers often lack sufficient information to locate the properties that interest them, while sellers have difficulty obtaining timely price information. As a result, real estate agencies can often manipulate market information without being detected at first. However, more and more customers (buyers and sellers) have disputes with real estate agencies during and after transactions. As time has gone by, calls from the general public for real estate reform have intensified. Consequently, the Real Estate Broking Management Act was enacted by congress in 1999 in order to raise the professionalism of the real estate agency industry and safeguard consumer interests in Taiwan. The ideal functions that real estate agents can provide to buyers and sellers are to create an efficient marketplace, reduce information search costs, and reduce moral hazard costs before and after a transaction. Many scholars have addressed the relationship between the design of the real estate regulatory framework and market efficiency [7]. By nature, real estate products tend to be non-standardized, so prices are determined through case-by-case negotiation. In addition, transaction data is often difficult to access, inaccurate, or not available in real time. Today, the majority of transactions in Taiwan are completed through real estate agents [3].
What troubles real estate buyers is that they can neither get reliable information from real estate agents nor acquire accurate information elsewhere. The lack of transparency and the asymmetry of information intensify the rise in real estate prices, and potential buyers suffer from a high Misery Index. Therefore, the Taiwan government started to pay attention to these transparency and asymmetry issues. For example, Taiwan's Legislative Yuan passed the Real Estate Value Law, which requires real estate buyers, real estate agents and land administration agents to register the actual transaction prices of properties within 30 days of a deal being closed or face a fine. This Act has been enforced since August 1, 2012. The government and the general public expect the Act to make real estate transactions transparent and to foster a sound trading environment [10].

2.2. Data mining

Data mining has grown quickly into an important topic in database applications. The objective of data mining is to discover knowledge hidden in large volumes of data. Data mining helps analyze large amounts of data through automatic or semi-automatic approaches and builds effective models and rules [6]. Some scholars consider data mining a process of searching and analyzing data to find the useful information hidden in it [8]. Other scholars use the term to refer to knowledge discovery from databases, data warehouses, or other forms of large data storage, which extracts meaningful knowledge, including patterns, relationships or changes. From a technical perspective, it refers to different forms and approaches of extracting information and knowledge from large volumes of data, which may include data visualization, machine learning and statistical techniques [4]. From a business perspective, data mining is expected to extract potential, hidden and useful knowledge, patterns or trends from large volumes of daily transaction data.

Today many governments have started to promote open data policies in the hope that businesses and the general public will use their innovation and capabilities to identify meaningful and valuable information. However, little empirical research applying data mining techniques to real estate transaction data can be found in the literature.

2.3. Cluster analysis

Cluster analysis is a statistical classification technique used to reveal patterns, relationships, and structures in large volumes of data. The data are divided into groups based on similarity, such that data within a cluster are homogeneous while data in different clusters are heterogeneous. Cluster analysis can identify classification rules in seemingly messy data. It is useful for market structure analysis: identifying groups of similar products according to competitive measures of similarity [11]. The advantage of cluster analysis is that users do not need a full understanding of the target being analyzed; in that sense it is purely data driven. In other words, an anonymous data set can be split into groups based on the data alone, without the user understanding the data. The disadvantage is that users cannot predict what kinds of clusters will be produced and consequently must interpret the results themselves [1].

Among clustering techniques, K-means is a popular algorithm for cluster analysis in data mining, which was used by James MacQueen in 1967 [2]. Given D, a data set of n objects, and k, the number of clusters to form, K-means clustering starts by randomly selecting k observations as the initial centroids (also called centers of mass) of the k clusters. The algorithm then assigns each of the n data points to its closest centroid and recomputes each centroid from the points assigned to it. This process is repeated until the assignments no longer change, at which point the k clusters have low variance within clusters and high variance between clusters. The advantage of K-means is that it is quick and easy. However, K-means is not appropriate when the data set is too big or its density is too diverse. In addition, K-means clustering requires users to specify the number of clusters to be formed.
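The K-means iteration described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study (the paper does not name its software); the function names, the fixed random seed, the squared Euclidean distance and the toy (age, floor area) points are all assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means: returns (labels, centroids) for a list of numeric tuples."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    # Pick k distinct observations as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical toy data: (age of building in years, floor area in ping).
points = [(35.0, 25.0), (37.0, 26.0), (36.0, 24.0),
          (12.0, 15.0), (13.0, 16.0), (11.0, 14.0)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

In practice the variables would normally be put on a comparable scale before distances are computed, since a price in NT dollars would otherwise dominate age and floor area; that scaling step is left out of the sketch.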
For example, given D, a data set of n objects, when a user specifies k groups, the K-means algorithm randomly selects k data points as initial centroids, and the iterative process does not end until all k clusters satisfy the conditions mentioned earlier. In other words, K-means is an iterative search for the center of mass of each cluster.

Ward's method, also called Ward's minimum variance method, is another popular algorithm for cluster analysis, applied especially in hierarchical cluster analysis; it was proposed by Joe H. Ward Jr. in 1963 [5]. Initially, Ward's method treats every single data point as a cluster. At each step it merges the pair of clusters whose merger gives the smallest increase in within-cluster variance. The key difference between the K-means algorithm and Ward's method is that the former requires users to specify k, the number of clusters to form, in advance, while Ward's method builds the grouping from its objective function, the minimized total within-cluster variance, so that the number of groups can be read from the resulting hierarchy.

3. Research method

Data mining is the analysis step of the knowledge discovery process in large volumes of data [8], such as databases or data warehouses. The process often contains a series of steps such as data preparation, data mining and modeling, analysis, and application according to a predetermined objective. Outputs from data mining models assist the decision-making process. A typical data mining process requires five steps [7], as shown in Figure 1. The goal of problem definition is to define the project objectives from a business perspective and then decide the data mining problem to be solved. Data preparation defines the scope of data to be included and carries out the tasks that make the final data set ready for data mining, including cleaning, normalization, transformation, feature extraction and selection, etc. [9]. The data mining step includes data sampling, feature selection, analysis and processing, attribute change, model building and evaluation. The result analysis step mainly validates the outputs from data mining. Accuracy can be measured by comparing the patterns learned from the training set with those of the test set. Finally, the learned and validated knowledge is applied in business as planned. The impact on the business is collected and evaluated, which completes the data mining cycle.

Figure 1. Research Framework

3.1. Problem identification

Property means a fixed object on the land, or a house, together with the right to transfer its ownership. House refers to a readily available house or a presale house and the right to transfer its ownership, while Broking Agency refers to the company or corporation dealing with real estate broking or sales. Real estate transaction data were held by real estate agencies and were not opened to the general public until August 1, 2012, as required by the government. The objective of data mining is to identify hidden valuable information in large amounts of transaction data. This research aims to understand the real estate market structure in the Greater Taipei region (Taipei City and New Taipei City). Cluster analysis is applied to identify groups of similar products according to competitive measures of similarity from government open data. The results provide insights into market structures in Taipei City and New Taipei City.

The data used for this research were acquired from Taiwan's government open data platform (DATA.GOV.TW), which contains more than 1,890 data sets on wide and diverse subjects such as transportation and distribution, real estate transactions, prices and consumption, and environmental monitoring. This research uses the real estate transaction data, described in Table 1, for data mining.

Table 1. Real Estate Transaction Data Content

| Type | Description |
|---|---|
| Source | Ministry of the Interior |
| Key field | Total amount of rental; Area of rental land (square feet); Area of rental building (square feet); Land use zoning |
| Format | XML, CSV, TXT |
| Cities | 21 |
| Record | 14,000 |
| Period | 98.04-103.04 |
| Renewal date | 1st and 16th of each month |

Owing to the regional nature of real estate transaction data, an analysis of all cities together is believed to be hard to interpret. Thus, this research uses only the transaction data of Taipei City and New Taipei City as the data mining target. There are 3,982 records available for data mining analysis, including 1,415 records for Taipei City and 2,567 records for New Taipei City.

3.2. Data preparation

Raw data are designed mainly for operations rather than analysis, so it is not always easy to identify relationships directly from them. Therefore, to make data mining a more insightful tool, this research puts extra effort into the pre-processing step, including variable selection, data cleaning and data transformation, which are explained in the following sections.

3.2.1. Variable selection

The data set includes 25 variables, but not all of them are used in this research. To select the candidate variables for cluster analysis, this research first excludes the irrelevant and redundant variables. The candidate variables for analysis are OBJECT OF TRADE, TOTAL FLOOR AREA TRANSFERRED (square meters), TYPE OF BUILDING, COMPLETION DATE OF BUILDING, TOTAL PRICE, and UNIT PRICE OF BUILDING (per square meter). The candidate variables are further reduced based on their level of importance after each cluster analysis is conducted.

3.2.2. Data cleaning

Data cleaning removes incomplete, inaccurate or irrelevant records and irregular outliers from the database and keeps only the necessary data. In the data set, the object of transaction can be land, house, or car park. Land use zoning includes industrial land, farming land, cemetery and residential land. This research keeps only house as the object of transaction and excludes land and car park, as they are irrelevant to the objective of this research.

3.2.3. Data transformation

To make the raw data more readable and easier to analyze, outliers are first removed in the data cleaning step, and several variables are then transformed into new variables. For example, the completion date of a building is shown as 1020609 in the data set. We first extract 102 from the raw value to represent the Taiwan year and subtract it from the current year to create AGE OF BUILDING. The current year is 103 in the Taiwan calendar, so the age is 1 in this case. The unit of FLOOR AREA TRANSFERRED is converted from square meters to ping, a unit of building size used in Japan, Korea and Taiwan. One ping is equivalent to 3.3058 square meters. This unit is more suitable for communication in Taiwan, so an extended variable is created to serve this purpose. After the data preparation step, there are 1,260 records available for data mining analysis, including 346 records for Taipei City and 824 records for New Taipei City.
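The two transformations in Section 3.2.3 can be sketched in a few lines of Python. The function names, the dictionary keys and the sample floor area are hypothetical; only the date format (a Taiwan-calendar year followed by a two-digit month and a two-digit day), the current year 103 and the factor 3.3058 come from the text.

```python
SQM_PER_PING = 3.3058  # conversion factor stated in the text


def building_age(completion_date, current_year=103):
    """Age in years from a Taiwan-calendar date string such as '1020609'.

    The last four characters are month and day; whatever precedes them
    is the Taiwan year (two or three digits).
    """
    taiwan_year = int(completion_date[:-4])
    return current_year - taiwan_year


def sqm_to_ping(area_sqm):
    """Convert a floor area from square meters to ping."""
    return area_sqm / SQM_PER_PING


# Hypothetical record; the field names are illustrative, not the data set's.
record = {"completion_date": "1020609", "floor_area_sqm": 83.07}
age = building_age(record["completion_date"])  # 103 - 102 = 1, as in the text
ping = sqm_to_ping(record["floor_area_sqm"])
print(age, round(ping, 2))
```

Slicing off the last four characters, rather than taking the first three, is a design choice that also handles two-digit years such as the 98 at the start of the data period.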
3.3. Data mining

This research aims to identify how real estate products and prices group together, and therefore applies cluster analysis to the actual-price registration data of Taipei City and New Taipei City separately. After the data preparation step, the processed data sets differ from the original ones. The two data sets are loaded, the relationships among variables are explored, the number of clusters is determined, and the data sets are clustered. The first step of cluster analysis is to determine the objective and the variables to be included in the analysis. This research follows a two-stage cluster analysis. First, Ward's method is used to determine the number of clusters. Second, three algorithms (Ward, Average and Centroid) are used to build the cluster model. In addition, starting from the number of clusters suggested by Ward's method, this research takes a further step to fine-tune the number of clusters. Besides the two-stage approach, this research selects variables based on their level of importance. Many candidate models are created and compared, and the candidate that is meaningful and interpretable is chosen as the final model. This search process is iterative and often time consuming.

4. Result analysis

This research applies cluster analysis techniques to the pre-processed data of Taipei City and New Taipei City and produces clusters for each city separately. The variables used in the first data mining model are OBJECT OF TRANSACTION, FLOOR AREAS, BUILDING TYPE, AGE OF BUILDING, TOTAL PRICE, and UNIT PRICE (unit: ping).
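As a rough illustration of the first stage in Section 3.3, the sketch below runs a naive Ward agglomeration and records the cost of every merge, where the cost is the increase in total within-cluster sum of squares. The paper does not say how the number of clusters is read from Ward's method; stopping just before the merge with the largest jump in cost is a common heuristic and is assumed here. The function names and the toy data are likewise hypothetical.

```python
def ward_linkage(points):
    """Naive Ward agglomeration over a list of numeric tuples.

    Returns a list of (clusters_remaining_after_merge, merge_cost) pairs.
    """
    # Each cluster is stored as (size, centroid); start from singletons.
    clusters = [(1, tuple(float(x) for x in p)) for p in points]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (na, ca), (nb, cb) = clusters[i], clusters[j]
                dist2 = sum((a - b) ** 2 for a, b in zip(ca, cb))
                cost = na * nb / (na + nb) * dist2  # Ward's merge cost
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        (na, ca), (nb, cb) = clusters[i], clusters[j]
        # The merged centroid is the size-weighted mean of the two centroids.
        merged = (na + nb,
                  tuple((na * a + nb * b) / (na + nb) for a, b in zip(ca, cb)))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        history.append((len(clusters), cost))
    return history


def suggest_k(history):
    """Assumed heuristic: stop just before the merge with the largest cost jump."""
    jumps = [(history[t][1] - history[t - 1][1], history[t - 1][0])
             for t in range(1, len(history))]
    return max(jumps)[1]


# Hypothetical toy data: (age of building in years, floor area in ping).
points = [(35.0, 25.0), (37.0, 26.0), (36.0, 24.0),
          (12.0, 15.0), (13.0, 16.0), (11.0, 14.0)]
history = ward_linkage(points)
print(history, suggest_k(history))
```

The suggested number is only a starting point; as the text notes, the study fine-tunes the number of clusters further and compares several candidate models before choosing one.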

In each round of model building, variables are reduced and a new model is developed. After many rounds, the final model shows more meaningful results. The three variables used in the final model are AGE OF BUILDING, FLOOR AREAS, and TOTAL PRICE. All of them have a high level of importance, reflecting that these are the most important factors considered by customers (buyers or sellers) of a house. The results of cluster analysis for Taipei City and New Taipei City are shown in the respective sections below.

4.1. Cluster analysis of Taipei City

The characteristics of the clusters of Taipei City are shown in Figure 2 and Table 2. AGE OF BUILDING, FLOOR AREAS (unit: ping) and TOTAL PRICE came out as the most important criteria. The final model for Taipei City suggests 5 clusters.

Figure 2. Cluster scatter of Taipei City

City Big/Extended Family: Cluster 1 contains the fewest records, only 5, equal to about 1% of the total sample size. Its FLOOR AREAS (79.16 ping) implies a five-bedroom product. This product is suitable for a big or extended family, that is, a family that extends beyond the immediate family and includes grandparents or other relatives all living in the same household.

City Three-generation Family: Cluster 2, with a sample size of 16 records, groups products with an average area of 52.71 ping (roughly equal to a four-bedroom product), an average age of 15.85 years, and a total price of 41,226,875 NT dollars.

City Couple with Dependents: Cluster 5, with a sample size of 100 records, groups products that are suitable for families with two dependents.

City Young Family: Cluster 4 contains more records than any other cluster, with a total of 154 records, or around 45% of the total sample size. It shows a lower price with mid-range FLOOR AREAS (25.13 ping), roughly an apartment with two small bedrooms. This product is very popular because of its lower TOTAL PRICE and adequate housing space, even though the houses are older (AGE OF BUILDING: 37.41 years). This product cluster is suitable for a young couple with a young dependent.

City Single or Newly Married: Cluster 3, with a sample size of 71 records, has the lowest TOTAL PRICE (10,596,030 NT dollars), the youngest AGE OF BUILDING (12.3 years) and the smallest FLOOR AREAS among all clusters. This distinguishing characteristic reflects a uniquely popular product in Taiwan called the small luxury apartment, which is suitable for single or newly married customers.

Table 2. Cluster analysis of Taipei City

| Segment ID | Segment naming | AGE OF BUILDING | FLOOR AREAS (ping) | TOTAL PRICE (NTD) |
|---|---|---|---|---|
| 1 | City Big/Extended Family | 34.60 | 79.16 | 57,953,240 |
| 2 | City Three-generation Family | 15.85 | 52.71 | 41,226,875 |
| 5 | City Couple with Dependents | 31.09 | 37.87 | 24,721,804 |
| 4 | City Young Family | 37.41 | 25.13 | 11,691,419 |
| 3 | City Single or Newly Married | 12.30 | 15.72 | 10,596,030 |

4.2. Cluster analysis of New Taipei City

The characteristics of the clusters of New Taipei City are shown in Figure 3 and Table 3. AGE OF BUILDING, FLOOR AREAS (unit: ping) and TOTAL PRICE came out as the most important criteria. The final model for New Taipei City suggests 6 clusters.

Figure 3. Cluster scatter of New Taipei City

Metro Young Family: Buildings in Cluster 1 are on average 35.69 years old, with 26.11 ping and a total price of 8,201,718 NT dollars. This cluster contains more records than any other, with a total of 323 records, or around 40% of the total sample size. This result shows that this type of housing product is the most popular in New Taipei City. Besides, 26 ping implies a two-bedroom layout suitable for a young couple with one child or none. From a marketing perspective, real estate companies can use this insight to prepare products in advance to respond to customers who are interested in this housing option. This product's features are similar to those of Cluster 4 of Taipei City, except for the total price. Thus, we name it Metro Young Family to distinguish it from the City Young Family segment.

Metro Big/Extended Family: Cluster 2, with a sample size of 12 records, has similar characteristics to Cluster 1 of Taipei City. This product is also suitable for a big or extended family, that is, a family that extends beyond the immediate family and includes grandparents or other relatives all living in the same household. Due to the wide difference in TOTAL PRICE, we name this cluster Metro Big/Extended Family to show both the similarity and the dissimilarity between the two clusters.

Metro Couple with Dependents: Cluster 4, with a sample size of 223 records, has similar product characteristics to Cluster 5 of Taipei City, except that the Taipei City price is almost double.

Metro Single or Newly Married: Cluster 5, with a sample size of 252 records, has similar product characteristics to Cluster 3 of Taipei City. Not only is the Taipei City price about double, but this product is also not as luxurious as the small luxury apartment mentioned earlier.

Cluster 3 and Cluster 6 contain extreme values compared with the average, and each has a sample size of only 1 record. The single record in Cluster 3 represents a unique housing type: an AGE OF BUILDING of about 3 years, 95.81 ping of FLOOR AREAS and a TOTAL PRICE of 60,360,000 NT dollars. This cluster implies that there are local rich buyers in price-friendly New Taipei City, which differs from the general public's understanding, as this price is high enough to buy many housing products in Taipei City.
While Cluster 3 shows similar results, Cluster 6 contains only 1 record of building aged 23 years old with 173.38 ping and with a TOTAL PRICE of 3,780,000 NT dollars. Cluster 3 and Cluster 6 identify two extreme which could the insights to identify local rich ; however, the size of cluster 1 and 3 are too small to have value for marketing strategies. They are named as Metro Local Rich I and Metro Local Rich II respectively. Table 3. Cluster analysis of New Taipei City Segment ID SEGMENT NAMING AGE OF BUILDING FLOOR AREAS (ping) TOTAL PRICE (NTD) 1 Metro Young Family 35.69 26.11 8,201,718 2 Metro Big/ Extended Family 14.36 79.34 27,319,167 3 Metro Local Rich I 3.00 95.81 60,360,000 4 5 Metro Couple with Dependents Metro Single or Newly Married 15.80 36.06 12,619,642 16.14 19.30 5,345,196 6 Metro Local Rich II 23.00 173.38 3,780,000 To sum up, except 2 extremes, overall result shows that most TOTAL PRICE of buildings in New Taipei City is less than 30 million NT dollars (equal to 988,084 US dollars) and majority is ranging from 5 million to 12 million NT dollars (equal to 164,681 to 395,233 US dollars) while most FLOOR AREAS is less than 79 ping (equal to 261.49 square meters) and majority lies between 19 ping to 36 ping (equal to 62.89 to 119.16 square meters). Results of Taipei City shows all housing products costs more than 10 million NT dollars while with relatively small FLOOR AREAS if compared to New Taipei City. Interestingly, Cluster 3 of Taipei City (segment: City Single or Newly Married) shows a unique life style-small luxury rich suite, which costs more than 1 million NT per ping dollars and total Area Floor is only 15 ping (equal to 49.65 square meters), roughly equally to a studio 55

or small one-bedroom setting. Notably, this study did not include township in the cluster analysis, as doing so might dilute the differences among clusters and impair the interpretability of the results.

5. Contribution

It is important to illustrate what data mining discovers from data in a way that everyone can understand, to interpret the implications behind the data efficiently, and to help end users make better judgments and timely decisions. Results from analysis can be classified into four types: (1) results from common sense, (2) possible results (many related but not validated results), (3) results that are not clear and hard to evaluate (possible results but hard to explain), and (4) impossible results (results that cannot occur). In the evaluation stage of data mining, data miners should verify (1) results from common sense and (2) possible results, and should make an effort to interpret (3) results that are not clear and hard to evaluate, or leave them to end users for further exploration.

In the process of analysis, this study attempts to include different variables to explore the possible cluster structure within the transaction data, identify cluster relationships, and finally interpret the findings. Cluster analysis is an unsupervised learning technique that requires careful variable selection and multiple trials in order to identify cluster relationships that make sense. In the experiment design, this study first consolidates the real estate open data of Taipei City and New Taipei City, selects seven variables as candidates for cluster analysis, runs a cluster analysis of the Greater Taipei Region (Taipei City plus New Taipei City), and then compares its results with separate cluster analyses of Taipei City and New Taipei City so as to test the nature of the data. Meanwhile, in the variable selection process, each model adjusts its input variables for cluster analysis according to their level of importance.
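The experiment design described above can be sketched in a few lines of Python. The sketch clusters each city's records separately and reports per-cluster means in the layout of a cluster summary table. It assumes k-means with farthest-point initialization on z-score standardized variables; these settings, the helper names (zscore, kmeans, cluster_profiles), the city labels, and the sample records are illustrative assumptions and are not taken from the study's data or software.

```python
import math

def zscore(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def kmeans(points, k, max_iter=100):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing center.
        seed = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(seed))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_profiles(records, k):
    """Cluster standardized records, then average the original units per cluster."""
    labels = kmeans(zscore(records), k)
    profiles = []
    for j in range(k):
        members = [r for r, lab in zip(records, labels) if lab == j]
        if not members:  # skip a cluster that ended up empty
            continue
        means = [sum(col) / len(members) for col in zip(*members)]
        profiles.append({"size": len(members), "age": means[0],
                         "ping": means[1], "price": means[2]})
    return profiles

# Hypothetical records, not from the study's dataset:
# (AGE OF BUILDING in years, FLOOR AREAS in ping, TOTAL PRICE in NTD).
cities = {
    "City A": [(35, 26, 8_000_000), (36, 25, 8_300_000), (34, 27, 8_100_000),
               (14, 80, 27_000_000), (15, 78, 27_500_000), (13, 81, 26_800_000)],
    "City B": [(30, 24, 21_000_000), (31, 26, 22_000_000), (29, 25, 20_500_000),
               (8, 15, 16_000_000), (9, 14, 15_500_000), (7, 16, 16_500_000)],
}

# One clustering run per city, mirroring the separate-city analysis.
results = {name: cluster_profiles(records, k=2) for name, records in cities.items()}
for name, profiles in results.items():
    for seg_id, p in enumerate(profiles, start=1):
        print(name, seg_id, p["size"], round(p["age"], 2),
              round(p["ping"], 2), round(p["price"]))
```

Standardizing first keeps TOTAL PRICE, which is measured in millions, from dominating the Euclidean distance. The profiles are then averaged in the original units so that they can be read in the same way as Table 3.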
The results show that clustering the two cities separately is better than clustering them as a whole. We believe this study has demonstrated good use of the real estate open data provided by the Taiwanese Government in the area of data mining, especially cluster analysis, and has provided insightful market structure results. The contribution this study makes to practice is threefold:

(1) It offers house buyers, sellers, and real estate agents a holistic view of market structure, breaking down the information barrier that arises when real estate agencies keep information on their side.
(2) It provides industry with a business direction and a data mining approach to leverage its own data. Real estate companies can compare their own cluster analysis results with those of this study so that they can quickly understand how they perform, identify sales gaps, and discover new opportunities in different segments of the market structure.
(3) It sets a new model for using the government open data platform to mine real estate open data, and provides insights into how to leverage the big data opportunity created by government for business value.

6. Conclusion

Data mining is a rapidly growing area that produces many research reports, new systems, and prototype developments. For example, many new applications based on early data mining research and a wide array of algorithms have gradually unlocked the value of data residing in databases. The diversification of data mining approaches, including machine learning and statistical research, multiplies the efficacy of advanced knowledge discovery. Cluster analysis is widely used across different fields; however, there are no universally agreed criteria for justifying the results it produces. Thus, selecting proper criteria to evaluate cluster analysis is critical. Moreover, the use of cluster analysis for market structure analysis in Taiwan has received limited attention in real estate research.
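Because no single criterion is universally accepted, the choice of criterion has to be made explicitly. As one illustration, the following Python sketch computes the mean silhouette coefficient, a commonly used internal criterion, for two candidate labelings of the same points. The criterion, the function name silhouette, and the sample points are assumptions made for illustration; this section does not state which criterion the study applied.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient of a labeling (requires at least 2 clusters).

    Values near 1 indicate compact, well-separated clusters; values near 0
    or below indicate overlapping or misassigned points.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-record clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical, already standardized feature vectors forming two tight groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]   # follows the two groups
mixed_labels = [0, 1, 0, 1, 0, 1]  # ignores the two groups

print("good :", round(silhouette(points, good_labels), 3))
print("mixed:", round(silhouette(points, mixed_labels), 3))
```

A labeling that follows the natural groups scores higher than one that mixes them. Single-record clusters, such as the two extreme clusters of New Taipei City, contribute a score of 0 under this convention, so a numeric criterion of this kind still needs to be read alongside domain judgment.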
This research acquires the real estate actual price registered data from the government open data platform, applies cluster analysis to explore the data of Taipei City and New Taipei City, and takes the experience of real estate agencies into consideration. The results show that AGE OF BUILDING, FLOOR AREAS (unit: ping) and TOTAL PRICE reflect different natures and meanings in different geographic areas. For example, there is a unique

cluster identified in this research, which has small average FLOOR AREAS with a very high UNIT PRICE and is located in the prime area of Taipei City. This result reflects a unique product segment matching a special type of lifestyle. Many clusters of Taipei City and New Taipei City have similar FLOOR AREAS, which shows similar needs for space and similar family types behind these clusters. Interestingly, however, the TOTAL PRICE of these similar clusters differs so much that it shows that, even when the needs for space are similar, the needs for geographic location are not. These findings indicate that the product choices of households in the city area and in the metropolitan area are driven by different needs and wealth levels. They also support our decision to conduct cluster analysis separately for Taipei City and New Taipei City rather than analyzing them as a whole. Including opinions from experienced real estate professionals helps define the evaluation criteria and interpret the results. Future researchers may consider including more market information to better adjust the evaluation criteria so that the research results will be more valuable. In addition, we suggest using different data mining techniques to analyze real estate transaction data (total market data), which can enhance the efficacy of government policy, create more value for businesses and the general public, and eventually reach a win-win situation for all stakeholders.

7. Acknowledgment

Data source: Government Open Data Platform - Real Estate Actual Price Registered Data (http://data.gov.tw/node/6213)

8. References

[1] A. K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognition Letters, vol. 31, no. 8, pp. 651-666, 2010.
[2] J. B.
MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, vol. 1, pp. 281-297, 1967.
[3] J. D. Benjamin, G. D. Jud, and G. S. Sirmans, What do we know about real estate brokerage? Journal of Real Estate Research, vol. 20, no. 1, pp. 5-30, 2000.
[4] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[5] J. H. Ward Jr., Hierarchical Grouping to Optimize an Objective Function, Journal of the American Statistical Association, vol. 58, no. 301, pp. 236-244, 1963.
[6] M. J. A. Berry and G. Linoff, Data Mining: For Marketing, Sales, and Customer Support, Wiley Computer Publishing, 1997.
[7] T. J. Miceli, Information costs and the organization of the real estate brokerage industry in the U.S. and Great Britain, AREUEA Journal, vol. 16, no. 2, pp. 173-188, 1988.
[8] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, The KDD Process for Extracting Useful Knowledge from Volumes of Data, Communications of the ACM, vol. 39, no. 11, pp. 27-34, 1996.
[9] S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, Data Preprocessing for Supervised Learning, International Journal of Computer Science, vol. 1, no. 2, pp. 111-117, 2006.
[10] Real Estate Broking Management Act, Laws and Regulations Retrieving System, Ministry of Interior, Taiwan, available at http://glrs.moi.gov.tw/englawcontent.aspx?type=e&id=210&keyWord=%E4%B8%8D%E5%8B%95%E7%94%A2
[11] G. Shmueli, N. R. Patel, and P. Bruce, Data Mining for Business Intelligence, John Wiley & Sons Inc, 2007.