Data Mining Analysis of a Complex Multistage Polymer Process


Rolf Burghaus, Daniel Leineweber, Jörg Lippert

1 Problem Statement

Especially in the highly competitive commodities market, the chemical process industries (CPI) are forced to continually improve the efficiency and quality of their processes. To stay in this market, a producer has to deliver the quality demanded by the customer in a very cost-effective manner. Many companies have therefore invested in new systems for process and quality data acquisition in recent years. The large amounts of data collected are, however, often not adequately used in practice, because suitable automated analysis methods are only rarely available in the plants.

By the successful analysis of a complex multistage polymer process, we show that data mining can be used to extract valuable information from existing process and quality data. Data mining comprises various computational methods which are used to automatically identify interesting patterns and possible relationships in large data sets. Hypotheses are generated from the data in the form of explicit rules which can then be directly interpreted in the application context. It is thereby often possible to discover hidden information in the data even in cases where classical (hypothesis-based) analysis methods normally fail.

The polymer process considered here consists of the two main process stages, polymerization and processing, plus a number of auxiliary facilities (see Fig. 1). The polymerization stage itself is composed of several sequentially coupled units (batch/continuous) which need not be described in further detail here. In the processing stage, we consider two processing machines (MA1, MA2) which are operated in parallel. The auxiliary units are used, e.g., for solvent recovery and additive preparation.
A central process information management system (PIMS) continuously collects and archives data for about 300 process parameters from polymerization, processing, and auxiliary facilities; the sampling interval is on the order of seconds. In addition, 8 quality parameters for the final polymer product are determined in the lab; only one measurement per day is available.

The starting point of our data mining analysis was the need to quickly identify the cause of the large fluctuations in product quality observed over a six-month period, shown here for one of the quality parameters (see upper part of Fig. 2). Interestingly, there was a restricted time period within the analysis horizon in which the processing machines MA1 and MA2 showed significant differences in the quality produced. This was another fact which had to be explained. The overall objective of the analysis was, therefore, to identify the key process parameters determining quality and the corresponding cause-and-effect relationships, thus providing the basis for improved process control.

2 Methods Used

Data mining has its roots in the fields of databases and artificial intelligence and has developed rapidly in recent years [1, 2]. Classical areas of application are in marketing (customer relationship management) and in the finance and insurance industries (e.g., assessment of credit-worthiness, fraud detection). More recently, a number of interesting new areas of application specifically related to the CPI have emerged: tailored data mining technologies in combination with statistical methods and neural networks are being used successfully, e.g., at the Bayer group, for process data analysis and also in catalyst and life-sciences research. The data mining approach to process data analysis is particularly interesting in the context of troubleshooting activities, because it allows possible causes of process upsets or quality problems to be identified quickly from available process data.
Our newly developed data mining-driven process analysis strategy is built around the so-called subgroup discovery method [3]. This method automatically identifies those subsets of data records (subgroups) that show interesting deviations from the whole set of data records with respect to some target attribute. At the same time, the relevant influence factors which characterize the subgroups are determined. The target attribute (dependent variable, here a quality parameter) and all attributes to be considered as potential influence factors (independent variables, here all process parameters monitored by the PIMS) must be specified in advance. In addition, continuous influence factors must be suitably discretized, because the method can handle only discrete attribute values. The discretization is done by simply splitting the continuous ranges into a (usually small) number of subintervals, e.g., for high, medium, and low attribute values.

Authors' affiliation: Dr. R. Burghaus, Dr. D. Leineweber, Dr. J. Lippert, Bayer Technology Services GmbH, Process Technology Division, D Leverkusen.

The analysis of the polymer process was complicated by the fact that the process and quality data could not be merged directly because of the different sampling rates involved. Hence the high-resolution process data had to be reduced during data preprocessing (DPP). To this end, the analysis horizon was divided into suitable time intervals, and each process parameter was characterized on each time interval by a number of quasi-stationary descriptor parameters (e.g., average, standard deviation, Fourier modes, event counts). The resulting quasi-stationary process data could then be combined with the corresponding quality data.

For each data mining analysis, all generated descriptor parameters were considered as potential influence factors and one of the quality parameters was chosen as the target attribute. All continuous input parameters were discretized using eight subintervals; the interval boundaries were determined such that the same number of data records was assigned to each subinterval. For the target attribute, an equidistant discretization with ten subintervals was employed.

Based on the preprocessed data, the subgroup discovery method generated a large number of rules, i.e., possible relationships between process and quality parameters. An automatic rule stability analysis ensured that only rules which are stable with respect to certain shifts of the discretization boundaries were considered, thereby eliminating meaningless dummy rules to a large extent. For instance, the following interesting rule was found (here slightly simplified for illustration):

IF Average(MA Parameter) = "High" AND [...] THEN Quality Parameter = "High"

This rule can be nicely visualized with the underlying data (see lower part of Fig. 2); it explains the quality differences between the processing machines MA1 and MA2 observed during a restricted time period (see upper part of Fig. 2). In principle, it would have been possible to find a rule like this by direct visual inspection of the preprocessed data, but this would have been very cumbersome because of the large number of potential influence factors. For more complex rules which describe the combined influence of several parameters, the visual approach quickly becomes hopeless.

In the course of our data mining analysis, it turned out that the subgroup discovery method alone did not lead to the desired result. This was basically because there were strong correlations among the input parameters (a quite common situation in process data analysis). Hence an extremely large number of rules was generated, most of which were redundant. To cope with this problem, we had to find a way of filtering out a minimal set of independent influence factors which, on the one hand, allow a sufficiently accurate description of the target attribute and, on the other hand, can be plausibly interpreted and provide some means for improved process control. To this end, the subgroup discovery method was combined with validation through a neural network in an iterative analysis workflow as follows:

1. Subgroup discovery to identify dominant influence factors ("best descriptors")
2. Manual selection of the most plausible best descriptor in case of correlated rules
3. Neural network training using the best descriptors found so far as input parameters
4. Comparison of neural network prediction and target measurement: in case of sufficient agreement stop, otherwise continue with step 1.

With moderate manual effort, this iterative analysis workflow leads to a meaningful list of relevant influence factors which are largely uncorrelated. Furthermore, the neural network provides a data-based process model which can later be used for quality prediction.
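The discretization and the subgroup search that form step 1 of the workflow can be illustrated with a minimal Python sketch. It is not the DAMINEX implementation. The quality function used to rank subgroups (weighted relative accuracy) is an assumption, since the text does not state which measure was used. The target is simplified to a binary "High" class, only single-condition rules are scored, and all data and parameter names are hypothetical.

```python
from collections import defaultdict


def equal_frequency_bins(values, n_bins):
    """Return a bin index (0 .. n_bins-1) per value so that all bins
    receive the same number of records (up to rounding).

    Simplification: ties are split by rank, so equal values can end up
    in neighbouring bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def score_subgroups(discretized, target_high):
    """Score every single-condition subgroup "attribute = bin".

    discretized: dict attribute name -> list of bin indices (one per record)
    target_high: list of bools, True where the target is in its "High" class
    Quality function (an assumption, not taken from the paper):
    coverage * (hit rate in subgroup - overall hit rate).
    """
    n_total = len(target_high)
    p_all = sum(target_high) / n_total
    stats = defaultdict(lambda: [0, 0])  # (attribute, bin) -> [size, hits]
    for attr, bins in discretized.items():
        for b, hit in zip(bins, target_high):
            stats[(attr, b)][0] += 1
            stats[(attr, b)][1] += int(hit)
    scored = [
        ((size / n_total) * (hits / size - p_all), attr, b)
        for (attr, b), (size, hits) in stats.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical data: 16 records, 4 bins instead of the paper's 8 for brevity.
ma_avg = [float(i + 1) for i in range(16)]        # drives the target
other = [float((7 * i) % 16) for i in range(16)]  # unrelated to the target
quality_high = [v > 12 for v in ma_avg]

discretized = {
    "Average(MA Parameter)": equal_frequency_bins(ma_avg, 4),
    "Average(Other Parameter)": equal_frequency_bins(other, 4),
}
for score, attr, b in score_subgroups(discretized, quality_high)[:3]:
    print(f"IF {attr} in bin {b} THEN quality High  (score {score:.3f})")
```

In this toy data set the highest-ranked subgroup is the top bin of the MA descriptor. With strongly correlated inputs, many near-identical subgroups would score equally well, which is the redundancy problem that the manual selection in step 2 addresses.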
Our complete data mining workflow for process data analysis is summarized again in Fig. 3. The DPP methods employed and the subgroup discovery method are part of our Data Mining Expert Toolbox (DAMINEX), an in-house development based on MATLAB (The MathWorks, Inc.). For the neural network, we used the commercial software NN-Tool [4].

3 Results

The above data mining workflow ultimately led to only four relevant influence factors for the quality parameter shown in Fig. 2: the averages of two relative feed streams and the standard deviation of a third relative feed stream (polymerization), as well as the average of a machine parameter (processing). A neural network trained with these four input parameters already provided an excellent description of the quality parameter (see upper part of Fig. 4; only four hidden neurons were used). It turned out that most of the observed variation of the quality parameter could be explained by the variation of the four identified key influence parameters (model prediction and measured data were correlated with a correlation coefficient of 0.88). In addition, the neural network made it possible to rank the influence factors. The effect of the MA parameter has already been discussed in the context of Fig. 2; the corresponding influence was confirmed as highly plausible by the plant personnel.
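The four influence factors are interval descriptors (averages and a standard deviation) of the kind produced during data preprocessing. The following Python sketch shows one way such descriptors could be computed per day and joined with the single daily lab value. The choice of one day as the time interval, the definition of an event as an upward threshold crossing, and all names and numbers are assumptions made for illustration; Fourier modes are omitted.

```python
from statistics import mean, pstdev


def daily_descriptors(samples, threshold):
    """Reduce high-resolution samples to quasi-stationary descriptors per day.

    samples: iterable of (day, value) pairs in time order
    threshold: level used for the (assumed) event definition
    Returns {day: {"avg": ..., "std": ..., "events": ...}}.
    """
    by_day = {}
    for day, value in samples:
        by_day.setdefault(day, []).append(value)

    descriptors = {}
    for day, values in by_day.items():
        # Event count: number of upward crossings of the threshold.
        events = sum(
            1 for a, b in zip(values, values[1:]) if a <= threshold < b
        )
        descriptors[day] = {
            "avg": mean(values),
            "std": pstdev(values),  # population std; defined for one sample
            "events": events,
        }
    return descriptors


def merge_with_quality(descriptors, quality):
    """Join descriptors with the daily lab value; keep days that have both."""
    return [
        {"day": day, **descriptors[day], "quality": quality[day]}
        for day in sorted(descriptors)
        if day in quality
    ]


# Hypothetical example: one process parameter, three days, two lab values.
samples = [(1, 2.0), (1, 4.0), (1, 2.0), (1, 4.0), (2, 3.0), (2, 3.0), (3, 5.0)]
quality = {1: 10.0, 2: 12.0}

table = merge_with_quality(daily_descriptors(samples, threshold=3.0), quality)
for row in table:
    print(row)
```

Each row of the merged table then holds one record with all descriptors as candidate influence factors and the quality value as target, so that data sampled every few seconds and data sampled once per day share a common time base.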

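The agreement between model prediction and measured data is reported as a correlation coefficient R. Assuming that R denotes the usual Pearson coefficient (the text does not say so explicitly), the comparison in step 4 of the workflow could be computed as in the following sketch; the function name and the example values are hypothetical.

```python
from math import sqrt


def pearson_r(predicted, measured):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(predicted)
    if n != len(measured) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_p * var_m)


# Hypothetical model output versus lab measurements.
predicted = [1.0, 2.0, 3.0, 4.0]
measured = [1.0, 3.0, 2.0, 4.0]
print(round(pearson_r(predicted, measured), 3))
```

For a linear relationship, R squared gives the fraction of variance explained, so R = 0.88 corresponds to roughly three quarters of the observed variation.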
Similarly, the influence of the relative feed rates on quality was perfectly in line with the existing process knowledge. However, the question immediately arose why the relative feed rates, and hence the recipe of the polymer, showed such large variations. To answer this question, one of the fluctuating relative feed rates was taken as a new target attribute, and another data mining analysis was performed according to the scheme above (see Fig. 3). It should be noted here that the relative feed rate chosen serves as the manipulated variable for viscosity control of the polymer solution produced. Relevant influence factors for this recipe parameter were the viscosity of the polymer solution (a trivial result) and, more interestingly, the binary operation mode of the auxiliary unit "solvent recycling": under high load, a second distillation train is used in parallel to the normal train. Hence this operation mode can be characterized by a discrete parameter with only two possible values (on/off). The relative feed rate could be predicted well by a neural network trained with just the two input parameters mentioned (see lower part of Fig. 4); the corresponding correlation coefficient between model and data was [...].

Identifying the influence of the solvent recycling unit turned out to be crucial for finding the root cause of the quality problems. Lab analyses showed that the solvent coming from the second parallel distillation train had a higher level of impurities than the solvent coming from the normal train. Evidently, the impurities contained in the solvent led to a poisoning of the polymerization reaction and thus, via the viscosity control, to the observed recipe and quality variations.

4 Summary and Conclusions

By successfully analyzing a complex polymer process, we have shown that detailed data mining analyses can be done not only for single isolated process stages, but also for large multistage processes.
Data mining allows the quick identification of hidden non-local effects which are difficult to discover with other methods. A particular advantage of data mining methods like subgroup discovery is that the generated explicit rules can be interpreted directly in the application context. A specifically tailored preprocessing strategy (quasi-stationary description on suitable time intervals) makes it possible to combine process and quality data with different time resolutions for the analysis. The combination of subgroup discovery and neural network validation within an iterative analysis workflow helps to quickly narrow down a meaningful set of independent rules describing the target attribute of interest. Our data mining methodology can be directly applied to other processes. As demonstrated by a growing number of successful applications, data mining provides a very powerful new tool for process data analysis, especially for troubleshooting. Hence data mining is the method of choice for the automated identification of special causes in the context of statistical process control [5]. By quickly identifying the quality-relevant key influences in the process and the corresponding cause-and-effect relationships, data mining can significantly contribute to process insight.

References

[1] Advances in Knowledge Discovery and Data Mining (Eds: U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy), MIT Press, Cambridge.
[2] S. Wrobel, Künstliche Intelligenz 1998, 12(1).
[3] S. Wrobel, in Principles of Data Mining and Knowledge Discovery (Eds: H. J. Komorowski, J. M. Zytkow), Lecture Notes in Computer Science 1263, Springer, Berlin 1997.
[4] F. Bärmann, F. Biegler-König, Neural Networks 1992, 5(1).
[5] G. Box, A. Luceno, Statistical Control by Monitoring and Feedback Adjustment, Wiley, New York 1997.

Figure 1: Schematic depiction of the polymer process and characterization of the available process and quality data (number of measured parameters, sampling rates).

Figure 2: Time profile of one of the quality parameters, classified according to the processing machines MA1 and MA2 (upper part); daily averages of the quality-relevant MA parameter identified by the analysis (lower part). In the marked time period between day 35 and day 50, the machines MA1 and MA2 produce significantly different quality.

Figure 3: Iterative data mining workflow for process data analysis, comprising the two phases data preprocessing (DPP) and analysis. The neural network generated in the analysis phase can later be used for prediction of the quality parameters.

Figure 4: Prediction quality for two of the neural network models established as part of the analysis: description of the quality parameter as a function of the MA parameter and three polymer recipe parameters (upper part); description of the main recipe parameter as a function of the binary operation mode of the auxiliary unit solvent recovery and the viscosity of the polymer solution (lower part). The legend shows the correlation coefficient R between model prediction and measured data.
