Data Mining for Technical Operation of Telecommunications Companies: a Case Study

Wiktor Daszczuk *, Piotr Gawrysiak *, Tomasz Gerszberg +, Marzena Kryszkiewicz *, Jerzy Mieścicki *, Mieczysław Muraszkiewicz *, Michał Okoniewski *, Henryk Rybiński *, Tomasz Traczyk, Zbigniew Walczak *

* {wbd, gawrysia, mkr, jms, mrm, okoniews, hrb, walczakz}@ii.pw.edu.pl Institute of Computer Science, Warsaw University of Technology
T.Traczyk@ia.pw.edu.pl Institute of Control and Computation Engineering, Warsaw University of Technology, ul. Nowowiejska 15/19, Warsaw, Poland
+ tgerszberg@eragsm.com.pl Polska Telefonia Cyfrowa Sp. z o.o., Al. Jerozolimskie 181, Warsaw, Poland

Abstract: This paper is an overview of a Data Mining project carried out by the Warsaw University of Technology in the Network Planning and Maintenance Department of a Polish cellular telecom provider. The project provided an excellent opportunity to test various Data Mining methods on real, non-classic data from the technology area (i.e. data mostly unrelated to purely marketing problems). The paper presents the results of the Data Mining experiments together with a short description of the applied methods and algorithms. Some remarks on managerial problems that emerged while implementing Data Mining techniques in a large corporation are also included.

Keywords: data mining, automatic knowledge discovery, cellular telecommunication systems, business process analysis

1. Introduction. The Data Mining methodology has evolved rapidly over the last five years and, despite being quite a new concept in Information Technology applications, it has gained widespread market acceptance. This mainly applies to applications in which the analyzed data is easily interpretable by humans and can be discretized relatively easily.
This includes marketing, sales analysis, company strategy building and so on - in short, the areas where other data analysis methods (such as statistics) have been used successfully for years [1]. We should realize, however, that the growing amount of "intelligence" (in the form of microprocessor-based controllers) embedded in machinery and tools increases the volume of automatically generated diagnostic information, which cannot be analyzed efficiently by humans. The cellular telephone network is a very good example of this phenomenon. An average GSM network consists of several thousand so-called base stations, each incorporating several controllers that may report their status to the network monitoring center every couple of seconds. This activity generates an enormous amount of data, which is purely technological in nature and usually difficult to interpret. In most situations this information is simply discarded - in fact, according to W. Schmidt [2], several companies adopt the "switch off" methodology and simply disable most of the network telemetry equipment. Hence, the demand for automated knowledge discovery in such environments seems obvious, yet relatively few research projects have been undertaken towards this end. The chance to perform a Data Mining analysis in a department of ERA GSM (one of three Polish cellular telecom providers) was therefore a very stimulating and promising research opportunity for the Warsaw University of Technology, and especially for the Data Mining Team set up within the Information Systems Division. We present an overview of our research in this paper.

The paper is organized as follows. Section 2 describes several experiments performed by the Data Mining Team, together with an evaluation of their results. In Section 3 the managerial aspects of Data Mining projects are discussed. Concluding remarks in Section 4 complete the paper.

2. Case study. As the business process analysis showed (see Section 3), data mining solutions may enhance the value chain of a telecommunications company at many different stages. However, because the idea of the data mining project came from the technology department managers, the research was focused on, but not limited to, the problems of this particular area. The research team mainly analyzed the processes of two sections of the company: the Network Planning section and the Network Quality section. The main responsibility of Network Planning covers tasks related to network expansion and increasing coverage, either by building new base stations or by reconfiguring existing ones. Network Quality Management, on the other hand, ensures the quality of services, i.e. attempts to minimize the number of dropped calls, unsuccessful handover¹ attempts etc.

2.1 Data mining models for cellular network planning. One of the first problems to which local engineers thought data mining would be applicable was support for cellular radio network planning. At the beginning of its market activity, a cellular telecom company has to establish a network of base stations and the cells related to them. The users of cellular phones in a given area contact the nearest base station, which collects calls and initiates further transmission. This way the so-called "cellular traffic" is generated. Some cells and areas generate more traffic and need more transmitters in the base station (to increase the number of concurrently available communication channels), which are obviously more expensive.
The goal of the network planning team is to establish an optimal network of base stations in the area of activity. The transmitters should have enough capacity to handle local cellular traffic without problems, but installing too many transmitters is undesirable because of the cost. The planning process should smoothly anticipate the growth of the network, and it should also reflect the network expansion strategy of the company. The crucial success factors here are the proper location of a new base station and the prediction of traffic, which determines transmitter power. The Data Mining team was supplied with information about the existing cellular network. The data had the form of a simple relational table with the following attributes: unique cell identification number; cell size in pixels; amount of each landuse (terrain type) in this cell, in pixels; average traffic in this cell, in Erlangs. A pixel is a unit of land area of 5 by 5 arc seconds, which gives on average 100 by 150 meters in Poland. There were 9 landuse classes: forests, agricultural, water, swamps, concrete, residential, dense residential, city, industrial. On the basis of this data, the Data Mining team was expected to predict the cellular traffic for new cells. The main research question was how much traffic is generated by a single pixel of a particular landuse. This may be estimated using multiple regression, which is a good method for finding the coefficients of the equation:

$T = a_1 l_1 + a_2 l_2 + \dots + a_9 l_9$

where $T$ is the traffic (in Erlangs) in the cell, $l_1, \dots, l_9$ are the numbers of pixels of each landuse, and $a_1, \dots, a_9$ are the traffic coefficients for each landuse (in Erlangs per pixel). Applying this method to data from the entire network gave quite poor accuracy. However, we expected the approximation to improve when the regression is calculated for a subset of cells with similar characteristics.
For example, typically urban, rural or industrialized cells were expected to have similar traffic coefficients. This led to the following research methodology: classification of cells into subsets with similar characteristics using clustering and decision trees; building a traffic model for every subset using multiple regression or neural networks. The problem turned out to be a good example of an approach that combines different classic data mining techniques, some of them competing and some complementary. In addition to the above methodology, the mining team developed a concept of a method intended to find the best possible solution to this problem. This method is described in [3] and [4] as regressional clustering. In brief, it is an algorithm based on k-means clustering (or a genetic algorithm) that divides the whole population into clusters, using regression quality estimators as the measure of cluster quality. Such an algorithm, once successfully implemented, should produce the best possible classification of cells.

¹ Handover: switching a moving user between the GSM cells he traverses.
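
The per-subset traffic model described in this section (one set of landuse coefficients per group of similar cells) can be illustrated with a minimal Python sketch. It fits the no-intercept model T = a_1*l_1 + ... + a_k*l_k by ordinary least squares, separately for each subset label. The function names, the subset labels and the use of the normal equations are assumptions made for illustration; this is not the project's implementation, nor the regressional clustering algorithm of [3] and [4].

```python
from typing import Dict, List, Sequence, Tuple

# One cell: (pixels of each landuse, average traffic in Erlangs)
Cell = Tuple[Sequence[float], float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: landuse columns are dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_traffic_coefficients(cells: Sequence[Cell]) -> List[float]:
    """Least-squares estimate of a_1..a_k via the normal equations."""
    k = len(cells[0][0])
    ltl = [[sum(l[i] * l[j] for l, _ in cells) for j in range(k)]
           for i in range(k)]
    ltt = [sum(l[i] * t for l, t in cells) for i in range(k)]
    return solve_linear(ltl, ltt)


def fit_per_subset(cells: Sequence[Cell],
                   labels: Sequence[str]) -> Dict[str, List[float]]:
    """Fit one coefficient vector per subset label (hypothetical labels)."""
    models = {}
    for label in sorted(set(labels)):
        subset = [c for c, lab in zip(cells, labels) if lab == label]
        models[label] = fit_traffic_coefficients(subset)
    return models


def predict_traffic(coefficients: Sequence[float],
                    landuse_pixels: Sequence[float]) -> float:
    """Predicted traffic (Erlangs) for a new cell."""
    return sum(a * l for a, l in zip(coefficients, landuse_pixels))
```

In practice the subset labels would come from the clustering or decision-tree step, and each subset needs at least as many cells as there are landuse classes for the coefficients to be determined.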

Another approach to the problem involved training a neural network (a 3-layer perceptron). The landuse values were the network inputs, while the network output was the traffic value. This method gave results similar to those of multiple regression with clustering, but was substantially slower. The research was supported by strong feedback from the Network Planning engineers. Already in the initial phase, the Data Mining team discovered some strange rules in the classification of landuses. Our experts interpreted them as misclassification of roads and of some kinds of residential areas, which were quite important for traffic prediction. The major difficulty in this problem was the poor quality of the data. The information about landuses proved to be neither very accurate nor up to date. The purity of decision trees based on landuse data was below 70%, and this determined the final outcome of the regression approach. The mean square error in both models was approximately 40% of the average traffic. Therefore, we decided to include other attributes not normally used in GSM network planning. For example, adding information about average population and income in the area improved the accuracy to as much as 80%.

2.2 Time oriented data in a GSM network. In our work we found that a lot of the data gathered in different places of the technical department had the following format:

Time 1 Event 1
Time 2 Event 2
...
Time n Event n

The aim of data mining on such data is to discover interesting rules of the following form: if a message of type A is generated, a message of type B will also be generated within a very short time. The main idea is to identify types of events that typically occur sequentially. We should note that this approach could be used to create efficient rules for expert systems, which might then reduce the number of alarms in GSM networks.
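
As a minimal sketch of rules of this form, the following Python fragment counts, for each ordered pair of event types, how often the second type follows the first within a short time window, and keeps the pairs whose confidence reaches a threshold. The function name, the window length and the confidence threshold are illustrative assumptions and are not taken from the project's software.

```python
from collections import Counter
from typing import Dict, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, event type)


def follow_rules(events: List[Event], window: float,
                 min_confidence: float) -> Dict[Tuple[str, str], float]:
    """Return {(A, B): confidence} for 'A is followed by B within window'."""
    ordered = sorted(events)
    occurrences = Counter()  # number of occurrences of each event type
    followed_by = Counter()  # occurrences of A followed by at least one B
    for i, (t_a, a) in enumerate(ordered):
        occurrences[a] += 1
        seen = set()
        for t_b, b in ordered[i + 1:]:
            if t_b - t_a > window:
                break  # events are sorted, so later ones are too far away
            if b != a and b not in seen:
                seen.add(b)
                followed_by[(a, b)] += 1
    rules = {}
    for (a, b), count in followed_by.items():
        confidence = count / occurrences[a]
        if confidence >= min_confidence:
            rules[(a, b)] = confidence
    return rules
```

For example, `follow_rules(log, window=5.0, min_confidence=0.6)` on a time-stamped alarm log `log` would list the pairs of message types for which at least 60% of the occurrences of the first type are followed by the second type within 5 seconds.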
Usually a single fault in such a network can generate a lot (let's say thousands) of different messages. Similarly, this approach can be used to analyze the network's behavior on the basis of the SS7 messages stored in a large database. Association rules seem to be a good tool for this purpose. Time can be divided into intervals of a fixed length (one hour, half an hour, etc.). Each interval forms a transaction in terms of association rules discovery. For instance, a generated rule [A, B, C] => [D, F] has the following meaning: if events A, B, C occur in an interval of one hour, then events D, F also occur in this hour. Each rule is associated with two coefficients: support and confidence. Support is the percentage of transactions in the database in which this particular set of events occurs. We are interested only in rules with support above a certain (user-specified) level. Confidence, on the other hand, measures how good (or "strong") the rule is. Obviously, a much better approach is to use so-called sliding time window analysis (see e.g. [5, 6]), since it avoids the errors introduced by fixed time intervals. To the best of our knowledge, there is no commercially available system that can generate such rules. This is particularly inconvenient, because the experiments proved that using time intervals of a fixed length could result in omitting many important rules and, on the other hand, in generating some rules that are not significant. A separate project inspired by these observations has been started at the Warsaw University of Technology in order to create specialized software addressing this problem; a first prototype is expected in summer'

2.3 Network anomalies. The analysis of the behavior of different cells in the network is another problem that we encountered. Each cell in the network is associated with different parameters. These parameters are divided into a few subgroups.
The first subgroup consists of the configuration parameters of every cell; the other attributes are parameters gathered over some period of time, which describe the behavior of certain network elements assigned to this cell. For example, the parameter Attempts gives the total number of requests to allocate a channel, and the parameter Blocks gives the number of unsuccessful requests. If Blocks/Attempts exceeds 2%, the quality of service is not good enough and network optimization is needed. The ERA experts defined several types of errors that indicate cell anomalies, such as channel congestion, blocking, call drops etc., at levels higher than acceptable for a well-designed network. The team applied association rules discovery to find relationships between values of such parameters, but the results were not very interesting, as experts in the field already knew all the discovered rules. Because the analysis of standard data was not satisfactory, additional information about cell neighborhood was used. Two cells are defined to be neighbors if it is possible to hand over a call between them. The task was redefined as finding association rules of the form:

(Cellid 1, Error_or_<attribute,valuerange>) ... (Cellid n-1, Error_or_<attribute,valuerange>) => (Cellid n, Error_or_<attribute,valuerange>)

where:
Cellid 1, Cellid 2, ..., Cellid n-1 are identification numbers of cells in the network that are neighbors of some cell identified by Cellid n;
Error is an expression from the set of errors predefined by the ERA experts; errors of different types were allowed to occur in one rule;
<attribute,valuerange> is a value range to which the attribute value of the respective cell belongs; different attributes were allowed to occur in one rule.

The discovered association rules, representing knowledge about the mutual influence of different network elements, were much more significant to the experts. Especially interesting were "one-way" rules, describing situations in which a cell located at one site influenced the behavior of a second cell located at another site, but not the other way round. This potentially makes it possible to identify cells that are the source of faults in the network. Association rules with parameters belonging to different subgroups were evaluated as more promising than those with parameters belonging to the same subgroup. In particular, they were found to be useful for specialists working on optimization of the radio network. In addition to rules that confirm the experts' knowledge, previously unknown dependencies were identified. In the opinion of the ERA experts, they can be applied directly to generate intelligent trouble lists for use by radio network optimization groups. It was interesting to observe that many rules with only one condition had high confidence (greater than 90%). When looking for useful rules, the experts found it helpful to apply an additional rule parameter, namely lift. Lift measures how much greater the computed confidence of a rule is than its expected confidence (i.e. the confidence the rule would have if the occurrences of the condition and decision values were statistically independent). Further on, the DM team processed the discovered set of rules in order to identify essential neighboring cells that influence the undesired behavior of a faulty cell.
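
The three rule measures used in this section (support, confidence and lift) can be written down as a short Python sketch. It assumes that each transaction is the set of (cell id, error or attribute range) items observed together, for instance in one measurement period; this representation and all names below are assumptions made for illustration and are not taken from the tools used in the project.

```python
from typing import FrozenSet, Iterable, List, Tuple

Item = Tuple[str, str]  # (cell id, error name or attribute value range)
Transaction = FrozenSet[Item]


def support(transactions: List[Transaction], items: Iterable[Item]) -> float:
    """Fraction of transactions that contain all the given items."""
    wanted = frozenset(items)
    return sum(1 for t in transactions if wanted <= t) / len(transactions)


def confidence(transactions: List[Transaction],
               body: Iterable[Item], head: Iterable[Item]) -> float:
    """support(body and head) / support(body); body must occur at least once."""
    body, head = list(body), list(head)
    return support(transactions, body + head) / support(transactions, body)


def lift(transactions: List[Transaction],
         body: Iterable[Item], head: Iterable[Item]) -> float:
    """Confidence divided by the confidence expected under independence,
    which is simply support(head)."""
    body, head = list(body), list(head)
    return confidence(transactions, body, head) / support(transactions, head)
```

A lift above 1 means that the head occurs together with the body more often than independence would predict, which is why it is useful for filtering out rules whose high confidence is due only to a very frequent head.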
A neighbor of a faulty cell, say FC, was treated as essential if its error or attribute occurred in the body of a rule that indicated an error for FC with sufficiently high support, confidence and lift. The results were different for different faulty cells and different rule threshold values, but in general the set of essential neighbors found was a proper subset of the originally defined set of neighbors. In particular, for one faulty cell with 9 predefined neighboring cells, 7 essential neighbors were extracted from the set of association rules with minsup > 70%, minconf > 20%, and lift > 1.2. On the other hand, only 2 essential neighboring cells were extracted when minsup was increased to 25%. After checking the robustness of this method on a small subset of cells, we are now extending it to the entire network. Unfortunately, the existing data mining tools are not well suited to the needs of radio network optimization specialists. The necessary data pre-processing is tedious, and the post-processing functionality provided by existing data mining tools is not sufficient to filter out the knowledge of interest easily and quickly. In practice, the necessary post-processing had to be done by means of classical query languages (e.g. SQL), which makes mining around rules a relatively slow and unfriendly activity. The rule languages proposed in [7,8,9,10] do not address several pre- and post-processing issues the DM team faced when working on the problem of network anomalies.

2.4 In search for too thrifty consumers. In the final stage of the project the Data Mining team also performed a quick analysis for the marketing department. The goal was defined as follows: to find the customer profile of subscribers that make calls shorter than 5 seconds. According to the billing scheme, such calls are treated as mistakes and are therefore free of charge.
However, thrifty subscribers used quite a lot of such calls to send short messages (like "call me back") instead of, for example, SMS texts. Some of them were suspected of using 5-second calls for automatic communication purposes, generating hundreds of them each day. Our preliminary analysis showed that those 5-second calls, while not generating any profit for the company, were responsible for 40% of the total network load. The team therefore tried to find a "thrifty customer profile" using attributes such as type of tariff, age, and so on. This problem of customer segmentation [1] is close to classic examples of data mining analyses. Therefore, the team adopted the classic approach, utilizing the full scope of DM tools and methodologies: association rules, clustering, decision trees, statistics and neural methods. Data about subscribers and their calls was extracted from the company's data warehouse. It contained the following attributes: age; gender; county of residence; tariff plan; contract duration; total number of calls (during a 2-week period, which gives about 0.6 million subscriber records) in four classes:
<0-5> second calls
(5-30> second calls
(30-60> second calls
(60...+inf.) second calls
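
The four call-duration classes listed above, and the share of short calls per subscriber group, can be sketched in Python as follows. The class labels, the record layout (a dictionary of subscriber attributes plus a call duration) and the function names are assumptions made for illustration; the actual data warehouse schema is not described in this paper.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Upper bounds (inclusive) of the first three classes, in seconds
CLASS_BOUNDS = [(5.0, "0-5"), (30.0, "5-30"), (60.0, "30-60")]


def duration_class(seconds: float) -> str:
    """Map a call duration to <0-5>, (5-30>, (30-60> or (60, +inf)."""
    for upper, label in CLASS_BOUNDS:
        if seconds <= upper:
            return label
    return "60+"


def short_call_share(calls: Iterable[Tuple[Dict[str, str], float]],
                     attribute: str) -> Dict[str, float]:
    """Share of <0-5> second calls for each value of a subscriber attribute."""
    totals = Counter()
    shorts = Counter()
    for subscriber, seconds in calls:
        group = subscriber[attribute]
        totals[group] += 1
        if duration_class(seconds) == "0-5":
            shorts[group] += 1
    return {group: shorts[group] / totals[group] for group in totals}
```

Comparing the values returned by `short_call_share` across tariff plans, age bands or counties is one simple way to check whether any subscriber group stands out as a heavy user of free short calls.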

The final outcome showed that there is no such thing as a typical too thrifty consumer, because all groups of subscribers generated a similar number of those short calls, which was quite an unexpected result. The marketing department learned from this that switching off the feature of free 5-second calls might have a significant impact on the behavior of all users. This experiment is still in progress. The analyses are now being applied to data sets gathered over a longer period (6 months). Some interesting relationships in the data have already been observed, such as an apparently smaller tendency to use 5-second calls among women of age between 40 and

3. Management perspective. In the world of management, Data Mining is a concept often discussed and "referred to", yet rarely implemented. Data Mining is now as much of a buzzword as Data Warehouse was only five years ago. The reason many companies "go DM" is that their competitors use Data Mining, or claim to do so. In this project we were not driven by the Data Mining fashion. The commissioner wanted the Data Mining team to be creative and pro-active, both in identifying the areas that might be subject to data mining and in the data mining itself. The Data Mining team had an entire technical department of a company to work with and, surprisingly enough, no specific and urgent business goal had been defined. Therefore, the first and vital step was to identify the business processes and information sources available throughout the department that were amenable to Data Mining. Such an analysis differs significantly from a conventional business process analysis (used, for example, for reengineering), in which analysts describe an entire organization as a system of cooperating processes, pinpoint their objectives, and then redesign the structure in order to maximize performance. Our approach was more oriented towards the data sources, keeping in mind that they are parts of certain business processes.
The team tried to find places in the company structure where large amounts of possibly important data were generated and dumped without further analysis, because traditional manual or OLAP methods simply could not cope with the complexity or size of the data [11]. These "hot spots" did not necessarily have to be crucial from the point of view of the entire company's efficiency (which is the biggest difference from a conventional business process analysis). Several such processes have already been described in Section 2. During the analysis it turned out that it is practically impossible to complete it without the help of a company insider. Whereas in most Data Mining projects the data is pretty straightforward and intuitive (such as sales figures or customer data), this was not the case in our experiments. We were dealing mostly with output from telemetry equipment, so even to decide on its potential usefulness for Data Mining we badly needed the assistance of a telecommunications specialist. The need for a multidisciplinary team therefore became evident (for other practical purposes it is of course also good to have a dedicated person in the company's structure who deals with security and interpersonal contact problems for the project). During the investigations another problem emerged. Technology specialists from the company, having no previous knowledge of automatic knowledge discovery methodology, had problems imagining what was possible with Data Mining and how these techniques might help them. Some of them even expressed fear that automatic (and intelligent) data processing tools might make their jobs redundant. We decided that it was easier to show the potential of Data Mining to telecommunications specialists than to learn the GSM technology ourselves.
In a series of five seminars, open to all of the company's employees, the most important Data Mining methods, such as association rules discovery, clustering, classification and statistics, were presented. An additional benefit of this activity was building awareness of the project throughout the company. This resulted, among other things, in interest from other departments (such as marketing), which eventually created an opportunity to study customer-behavior problems at the end of the first phase of the project. The final seminars included a presentation of commercial tools, but before we were able to exhibit them, we had to perform an extensive evaluation of the Data Mining packages available on the market. During this analysis it turned out that not all software surveys can be trusted, as several well-respected software tools turned out to be less useful than promised [12]. The problems analyzed by our team have already been described in Section 2. However, we would like to recall two very important aspects that seemed to influence our findings significantly. The first is the apparent lack of methods that can cope with raw numerical data. Such data is quite rare in traditional Data Mining, but in technological problems it represents the majority of the information available. One of the most popular ways of dealing with this is discretization, but it turns out that without good insight into the very nature of the data it is difficult to perform, even with quite complex methods [13]. Manual discretization with thresholds proposed by domain specialists proved to be most effective, but for certain data types this may be a very laborious process. Numerical mining (such as quantitative association rules) therefore seems to be a very interesting and mostly unexplored research area. The presentation of results also proved to be a very important step of the whole Data Mining process.
First of all, it is very difficult to evaluate the quality of the mined knowledge. In several of our experiments the extracted rules, which at first seemed quite interesting, proved to be well known to domain specialists and therefore of no great importance (although their presence in the extracted knowledge confirmed the reliability of the Data Mining methods used). Frequent evaluation of results is therefore necessary, and before the final report is presented to the company's management an in-house expert should proofread it. Because the technical problems are not very intuitive, the final report should also contain a problem description, and the presentation definitely must be understandable to non-specialists. It turned out that even in a field as narrow as GSM network planning and maintenance, the management staff do not necessarily have the expertise needed to evaluate results from departments other than their own. Finally, we would like to give some words of warning to all those just starting their Data Mining projects. It is very difficult to evaluate the results of a whole project and to determine whether the project has been a success or not. The second thing, and one that is rarely remembered, is that Data Mining may serve as an analysis tool for external data as well as internal data. Concentrating only on internal data (i.e. data generated within the company) may be very dangerous, especially for organizations operating in dynamically changing environments. We must also remember that Data Mining does not contribute directly to the company's value chain; it only provides information. This information may be used wisely, and thereby increase the company's competitive advantage, but it may also be discarded and therefore wasted. In short, successful Data Mining gives companies an opportunity to act more wisely on the market, but it is up to the managerial staff how to make use of this opportunity.

4. Concluding remarks. This project, while still not finished, has already shown that mining technological data creates several new problems not experienced by "conventional" data miners.
The nature of the data and business processes makes the whole analysis much more demanding, and also a delicate task. Fortunately, the solutions that can be found by Data Mining in technical departments seem to be quite effective, as they contribute directly to the efficiency of the existing processes and systems. This is not always true in the case of marketing or strategy building.

References
[1] Michael J. A. Berry, Gordon Linoff, "Data Mining Techniques: For Marketing, Sales, and Customer Support", John Wiley & Sons, 1997
[2] W. Schmidt, private communication, 21/01/2000
[3] Piotr Gawrysiak, Michał Okoniewski, Henryk Rybiński, "Regression - yet another clustering method", submitted to DEXA'2000 conference
[4] Piotr Gawrysiak, Michał Okoniewski, "Applying Data Mining Methods for Cellular Radio Network Planning", submitted to IIS'2000 conference, 2000
[5] H. Mannila, H. Toivonen, A. I. Verkamo, "Discovering Frequent Episodes in Sequences", First International Conference on Knowledge Discovery and Data Mining (KDD'95), Montreal, Canada, August, AAAI Press
[6] H. Mannila, H. Toivonen, "Discovering generalized episodes using minimal occurrences", Second International Conference on Knowledge Discovery and Data Mining (KDD'96), Portland, Oregon, August, AAAI Press
[7] T. Imielinski, A. Virmani, A. Abdulghani, "Discover Board Application Programming Interface and Query Language for Database Mining", Proc. of KDD'96, Portland, Ore., August 1996
[8] T. Imielinski, H. Mannila, "A Database Perspective on Knowledge Discovery", Communications of the ACM, November, Vol. 39, No 11
[9] R. Meo, G. Psaila, S. Ceri, "A New SQL-like Operator for Mining Association Rules", Proc. of the 22nd VLDB Conference, Mumbai (Bombay), India
[10] T. Morzy, M. Zakrzewicz, "SQL-like Language for Data Mining", 1st International Conference on Advances in Databases and Information Systems, St. Petersburg, 1997
[11] Rob Mattison, "Data Warehousing and Data Mining for Telecommunications", Artech House Computer Science Library, 1997
[12] DM tools manuals and reference materials: IBM Intelligent Miner, SGI Mine Set, SAS Enterprise Miner, Rosetta, RD2, Oracle Darwin
[13] Andrzej Skowron, Son H. Nguyen, "Quantization of Real Value Attributes", Warsaw University of Technology Report