SPE MS. Copyright 2014, Society of Petroleum Engineers
|
|
|
- Tiffany Anthony
- 10 years ago
- Views:
Transcription
1 SPE MS Predicting Failures from Oilfield Sensor Data using Time Series Shapelets Om Prasad Patri, Anand V. Panangadan, Charalampos Chelmis, University of Southern California, Randall G. McKee, Chevron U.S.A. Inc., and Viktor K. Prasanna, University of Southern California Copyright 2014, Society of Petroleum Engineers This paper was prepared for presentation at the SPE Annual Technical Conference and Exhibition held in Amsterdam, The Netherlands, October This paper was selected for presentation by an SPE program committee following review of information contained in an abstract submitted by the author(s). Contents of the paper have not been reviewed by the Society of Petroleum Engineers and are subject to correction by the author(s). The material does not necessarily reflect any position of the Society of Petroleum Engineers, its officers, or members. Electronic reproduction, distribution, or storage of any part of this paper without the written consent of the Society of Petroleum Engineers is prohibited. Permission to reproduce in print is restricted to an abstract of not more than 300 words; illustrations may not be copied. The abstract must contain conspicuous acknowledgment of SPE copyright. Abstract Increasing instrumentation of the modern digital oilfield produces streams of data from sensors that monitor the functioning of different components in the field. This data should be converted to actionable information rapidly in order to respond to events as they happen or are predicted. The challenge is therefore to develop technologies that can process these large sensor datasets rapidly and with minimal manual supervision to ensure a data processing system that can scale with the increasing instrumentation. We consider as a use-case an oilfield with several Electrical Submersible Pumps (ESPs), each instrumented with sensors that continually measure electrical properties of the pump (the streams of sensor data), which are then relayed to a central location. In this paper, we demonstrate how a time-series analysis approach can be applied to failure detection and failure prediction from the streams of sensor data. The method involves identifying shapelets short instances that are particularly distinct in the streams of sensor data. The shapelets approach is particularly applicable to large oil and gas enterprise datasets because the algorithm does not need access to the entire historical data. This greatly reduces the amount of data that needs to be stored for data analysis. Moreover, unlike model-based approaches, shapelet-based analysis does not make any assumptions about the underlying nature of the data, making it practical for applications where a detailed physical model of the pump is not available. We validate our proposed method by analysis on a representative set of instrumented ESPs. We describe the preprocessing steps that were applied in our analysis. We report the results of experiments to study the effects of varying the data processing parameters on the accuracy of fault detection and prediction. These results indicate that shapelet-based approaches are promising for analysis of time-series data in the oil and gas industry. 1. Introduction The evolving nature of the modern digital oilfield requires large-scale instrumentation and monitoring. Data mining and machine learning approaches have become vital to digital oilfield operations [1, 2] as they move into the age of Big Data. A massive portion of oilfield data today is in the form of sensor streams which necessitates the use of rapid real-time data analysis techniques [3]. Due to the high frequency of sensor data in the oil and gas industry [4], it is inevitable for oil and gas enterprises to leverage efficient data mining techniques, especially those dealing with time series data. In this paper, we adapt a time-series mining approach that was recently developed in the computer science community, known as time-series shapelets [5], for application to sensor measurements typically collected in the oil and gas industry, in particular to component failure detection and prediction. The streaming time series nature of such data is especially suited to this approach. The shapelets method identifies those time segments ( shapelets ) within the available sensor data which are most discriminative for differentiating between two classes, for instance those arising from failed pump components in contrast to those that are working normally. Using discovered shapelets from historical data, predictions about future failures or anomalies can be made such that proactive steps can be taken to mitigate their effects. As an example, a shapelet that was discovered from intake pressure measurements from an electrical submersible pump is overlaid over the full time series Figure 1; the short segment shown in red was found to be the most discriminative from this time series and others in the labeled data record for distinguishing failed pumps from normal ones. This shapelet, along with other shapelets, can be used for detecting failures by comparing thsem with real-time sensor data (further details about finding shapelets are provided in Section 4). The shapelets approach is particularly effective for oil and gas enterprise data because it does not need access to the entire historical record of sensor data while making decisions only the shapelet time segments, identified in an offline step from
2 2 SPE MS the historical record, are needed for real-time analysis. This greatly reduces the amount of data needed to be stored for further data mining. Moreover, this approach does not make any assumptions about the nature of the data, making it practical for real world scenarios. The shapelets found are visually interpretable, making deeper root cause analysis possible. Such time series mining methods are directly related to events in the oilfield, and as described in [2], there are several value propositions for efficient event processing [6]. Figure 1: A discriminative segment (shown in red), or shapelet, from a time series as computed by the shapelet mining algorithm. The x-axis shows time (in days) while the y-axis represents normalized values of the intake pressure for a specified ESP. To summarize, the primary purpose of performing time series mining on oilfield sensor data are: To detect anomalies in equipment operation automatically To identify time segments that are indicators of failures Reduce the data storage and computational power requirements for real-time sensor data analysis Figure 2: Sensor measurements as time series for failure detection and prediction 1.1 Use-case Electric submersible pumps (ESPs) are one of the main artificial lift methods for extraction of fluid. The failure of an ESP increases operating costs and affects production. Statistical analysis of ESP failures shows that the failure rates of ESPs vary considerably [7]. There are several factors affecting ESP failures, including reservoir type, sand control, and whether the ESP is a replacement [7]. In an effort to understand the causes of failures, the oil and gas industry has attempted to collect data on all potential factors and exchange this information for better data analysis [8]. However, data aggregation is a difficult task considering the myriad ways of collecting and recording the diverse types of data [9]. Instead of attempting to build a complete model of ESP failures from these disparate data sets, we apply machine learning and time-series data processing techniques to automatically learn failure models from only the sensor measurements (with relatively high temporal frequency) collected directly from the ESP. The goal of our research is therefore to understand the limits of failure detection and prediction using only the most accessible sensor measurements. It has been estimated that just a 1% improvement in ESP performance world-wide would provide over a half-million additional barrels of oil per day [10]. Considering the high cost incurred by an ESP failure, early detection and prediction of pump failures of even a subset of all operational assets from readily available data can reduce OPEX (Operational expenditure) and can reduce average maintenance cost. The approach proposed in this paper can be generalized for other smart oilfield areas where time series data is plentiful. This use case is illustrated in Figure 2. The rest of this paper is organized as follows. In Section 2, we describe related work in ESP failure analysis. In Section 3, we provide a brief introduction to the shapelets approach to time series data processing. In Section 4, we describe our approach and Section 5 presents the results from our experiments. We conclude in Section 6.
3 SPE MS 3 2. Related Work on ESP failure prediction Considering the large number of ESPs deployed in an oilfield, the bulk of ESP failure prediction research has focused on computing population-level estimates and causes of failure [11]. Statistical analysis of ESP failures across an oilfield has shown that the failure rates vary considerably [7]. Sawaryn has derived analytical expressions describing failure patterns at a population level [12]. The amount of sand in the extracted fluid is one of the important causal factors affecting failure and Kalu-Ulu et al. have modeled ESP failures in sand producing wells [13]. Furthermore, Liu et al. [14] has adapted data mining classification algorithms and Liu et al. [15] presents a Bayesian network-based machine learning algorithm to predict rod pump failures. In our work, we explore methods to identify and predict failures of individual ESPs using machine learning techniques. We have also used a technique (similar to this work) based on time series shapelets for electricity disaggregation in [16]. Specifically, in this work, we have used the concept of events as temporal periods of operation (either failure or normal) of ESP pumps as described in Section 4. However, it is possible to instead add more semantics and structure to define and represent events for oil field processes. The Process-oriented Event Model [17] describes the semantics of such an event model and also provides a case study for pump failure event detection. 3. Background on Time Series Shapelets Time series shapelets, introduced by Ye et al. [5], are discriminative subsequences in time series which can differentiate instances of the positive class (e.g. normal operation of pump) from those of the negative class (e.g. failed or anomalous operation of pump). Time series shapelets are identified based on information gain criteria [5]. Essentially, the subsequence with the highest information gain is identified to be the most discriminative between the classes and thus is the shapelet. To perform classification of time series using shapelets, a decision tree classifier is built based on distance from the shapelet. When a new time series instance is encountered, the distance to the shapelet is calculated, and depending on which branch of the tree it falls in (distance lesser or greater than distance measure of shapelet), it is assigned a predicted class label. For the case of multiple shapelets from one time series, the same decision tree can be extended to have multiple nodes. To summarize, our primary motivations for applying the time series shapelets method to oil and gas sensor data is because they have the following advantages over other conventional approaches: Shapelet-based methods provide very fast classification times They are visually interpretable They do not need to use the complete training data for classification (after the shapelet has been discovered from the time series), only the discovered shapelet segments are required They make no assumptions about structure of time series (in contrast to conventional autoregressive or ARIMA time series models). The number of parameters involved is also minimal. Various extensions to time series shapelets have been proposed in the literature, most notable of them being Logical Shapelets [18], Local Shapelets [19], and Fast Shapelets [20]. In this paper we use the Fast Shapelets algorithm, which is currently the fastest supervised shapelet finding method [19]. Though they were originally intended for use with supervised datasets, extending shapelets to include the idea of unsupervised shapelets (U-shapelets) was presented in Zakaria et al. [21]. Instead of maximizing the information gain in the case of (supervised) shapelets, the U-shapelets method maximizes the separation gap, a new metric described in [21]. This method allows unsupervised shapelets to be used for clustering. A stopping criteria based on Rand index is used to decide the number of clusters. 4. Approach This section describes the dataset used for our experiments and then explains our proposed shapelet mining approach. We developed a pre-processing method to address the irregular nature of real world sensor data and transform the sensor measurements into a format suitable for shapelet-based time series analysis. This method can handle invalid data points, irregularity in sampling intervals and reduce redundant data segments. These are three frequent problems with industrial sensor datasets and are particularly applicable to datasets from the highly instrumented oil and gas industry. 4.1 Dataset description We used data from electrical submersible pumps (ESPs) from a single onshore oilfield within Chevron North America. We consider a supervised classification scenario. The classification task is to detect and predict the behavior of a new ESP instance over a specified period of time, given sensor readings of its physical attributes (current, voltage and intake pressure). We are provided with labeled data from ESPs defining periods of normal and failure operation. The data ranges from 1 Aug 2011 to 29 July 2013, and measurements are nominally recorded at hourly intervals. We use data from 11 normal operation periods (from 11 ESPs) and 10 failure operation periods (from 8 ESPs, with two ESPs providing two failures each). For these periods, we have sensor data containing time-stamped data points for various attributes or physical properties of the pump. In this paper, we focus on three of these attributes current, voltage, and intake pressure, which are the most readily available sensor measurements. We also have timestamps for all of the data with separate timestamps for insertion date of the data points as well as the last good scan dates (which we use later for ensuring that the data points are valid). Instances of sensor
4 4 SPE MS measurements from selected ESPs during both normal and failed operation periods are shown in Figure 3 (intake pressure), Figure 4 (current), and Figure 5 (voltage). These plots show measurements from two normally functioning ESPs and two failed ESPs. These do not represent all the patterns of normal or failure periods since there is a lot of diversity in the data. Our prediction methods need to incorporate such diverse instances to be able to make decisions even for previously unseen instances without overfitting. Shapelets are suitable for this purpose because they are designed to identify the most discriminative segments in the time series instances, irrespective of the nature of the data. Figure 3: Examples of intake pressure measurements during normal operation (blue) and failed operation (red). Figure 4: Examples of current measurements during normal operation (blue) and failed operation (red).
5 SPE MS 5 Figure 5: Examples of voltage measurements during normal operation (blue) and failed operation (red). 4.2 Issues with using the raw dataset Since real oilfield data is irregularly recorded and contains large missing portions, we have developed an appropriate preprocessing method for use before the described shapelets technique can be used for classification. In particular, we have developed a pre-processing method to obtain clean data segments. The proposed method divides each sensor data stream into windows of a specified segment size (preferably, the time duration over which predictions need to be made) and ensures that the elements within each segment are regularly sampled. Only a specified degree of overlap is allowed between consecutive segments so as to avoid overly similar time segments. A check is performed on whether the sensor value read was new or not, providing only regularly sampled data segments in the end, suitable for input to shapelet based classification approaches. Several factors make the raw dataset, collected directly from sensors in an oilfield, unsuitable for direct analysis via the shapelet mining algorithm (or any generic data mining method). We desire a regularly sampled version that can directly be input to a state-of-the-art shapelet finding method (such as Fast Shapelets [20]). To process the dataset, a logical choice would be obtaining one time series per ESP (per failure/normal operation period). However, this is not possible to implement here because of the following three issues. The first issue is that of validity. Since it is possible for sensors to fail in their functioning, some of the data points recorded may not be valid and need to be eliminated. Keeping track of the last good scan timestamp in the data can help us perform this elimination. The next concern is of regularity in the frequency of sampling the data points. In cases where the sensor was not operational, or for other reasons, we found that the data was sampled for some periods at a 1-day frequency instead of the standard 1-hour frequency imposed on the rest of the data. Figure 6 shows plots of insertion dates and last good scan dates for a single pump. As seen in the plot, it is not regular for the complete duration of the data stream. Regular intervals between data points are essential for the shapelet finding algorithm. One method to handle this is to interpolate for the missing data points, but this can result in localized artifacts, which are then spuriously interpreted as distinctive patterns in the shapeletbased algorithm. Instead, as described later, we find clean segments where the data is regularly sampled for further use. The third issue is that of redundancy. There may be several data segments in our input training dataset that are very similar to each other with only a few data points being different at the beginning and end of the segments. However, if we do not allow any overlap between the segments at all, then we are left with an extremely limited number of segments and the possibility that the most distinctive patterns are across two segments. To handle this tradeoff, we introduce an overlap parameter, which controls the amount of overlap between consecutive segments.
6 6 SPE MS Figure 6: Irregularity in sampling of oilfield sensor data. The period between consecutive data points varies between 1-hour and 1-day. The X-axis shows time (in days) and the Y-axis represents the insertion date of the data point (left) and the last good scan date (right). 4.3 Pre-processing Since large portions of the data can be missing, we need an effective pre-processing technique to filter useful data. The following filtering step does not modify any of the data or interpolate any values. Our pre-processing method is summarized in Figure 7. We extract segments out of each time series for each pump. The first criterion for keeping a segment is to ensure that it is valid. We ensure this by requiring that the insertion date of the data entry is within 1 day of the last good scan timestamp. The second condition is to fix a regular time interval for sampling of the values. We only consider those values, which are recorded 1 hour apart. These two pre-processing conditions give us regularly sampled segments of data. However, each segment is very similar to the previous segment with the exception of just one or two data points, which have changed. To eliminate many of these close segments, we add a third condition, which restricts the degree of overlap between adjacent or consecutive segments. This parameter can be changed to reflect various requirements in specific use cases. The more the allowed overlap, the more the number of segments we get after pre-processing. In our experiments, we allow a 75% overlap between adjacent segments, thus forcing at least ¼ of consecutive segments to be different from each other. Valid data segment if and only if all the following hold Regular time interval between data points (1 hour) Insert date within 1 day of last good scan date No overlap with previous segment for a speci>ied duration (atleast 1/4th different) Figure 7: Proposed pre-processing workflow 4.4 Failure event detection and failure event prediction We describe the two major scenarios of data mining tasks performed via the shapelet mining approach. The first task is failure detection. The failure detection experiment aims to detect whether a given data segment represents a failure or not. It is reactive rather than being proactive. For failure event detection, we label the failure event durations as negative class instances while all other segments are treated to be of the positive class. For example, as described earlier in the preprocessing steps, given a segment length parameter, all possible clean data segments (of the specified segment length), which satisfy all the three processing checks, may be extracted. Failure event prediction aims to predict failures before they occur, so that maintenance personnel may be notified in advance, taking the proactive anomaly detection approach mentioned in [6, 22]. The failure prediction experiment aims to classify whether a given data segment will eventually lead to a failure. We define a user-customizable lookback parameter, which is to be used for finding precursor segments for building the training and testing data sets in the shapelet-based classification approach. The intuition behind this is that there is a failure event precursor at some point before the failure is
7 SPE MS 7 detected and we want to make predictions based on including this data point in our training and testing data. We label segments that are precursors to the failure or normal event durations (within the lookback period) as negative or positive class instances respectively. Note that we do not extract or consider any data from the actual failure or normal duration while making a prediction. Typically, the lookback should be larger than the segment length parameter so as to have multiple data segments for training the system. However, if it is too large, then it would cover a lot of redundant data points which may not have the real precursors for the failure event that is about to happen. Since we do not know the point of time when an exact precursor for a failure event occurs, we fix a moderately long lookback period, typically a few times the segment length. However, the shapelet-based algorithm provides good results (as shown in the next section) for different lookback periods irrespective of the fact that we may not have included the real failure precursor data point or added too many redundant data points. Figure 8 illustrates the extraction of time segments from each sensor dataset for failure detection and prediction. Figure 8: Failure Event Detection and Failure Event Prediction Scenarios. The red shaded portion denotes a failure operation and the green operation period denotes a normal operation period. All segments extracted are of the specified segment length. In the detection scenario, the segments are extracted from the actual failure or normal operation duration, while, for prediction, segments are extracted only from the lookback period without utilizing any data in the actual labeled failure or normal periods. 5. Evaluation This section describes our experimental framework, experiments performed and evaluation results. We have two major categories of experiments at the segment-level and at the pump-level, and for each category, there are two scenarios failure detection and failure prediction, as described earlier. We used the supervised shapelet finding algorithm, Fast Shapelets [20], for finding shapelets in all the experiments. A discussion of the experimental results is given in Section Segment-level Failure Detection The first category of experiments is that of segment-level classification either for failure detection or failure prediction. In this category, we extract data segments from all pumps during pre-processing. All extracted segments (from all pumps) are collected to form the full segment-level dataset. This dataset is split randomly into training and testing sets, which are independent of each other. Note that in this method of evaluation, though a given segment only appears in either the training or test portion, a particular pump can contribute sensor data segments to both the training and test portions. Two shapelets, extracted from one such training experiment, are shown in Figure 9. According to the shapelet framework, these segments are the most discriminative for distinguishing failures from working instances. We used various parameters in our experiments. We used segment lengths of 1-day, 2-day and 1-week. The allowed overlap was set such that at least one-fourths of the data is different between consecutive segments. We experimented for 25%-75% and 50%-50% random splits for the training and testing sets respectively. We compared our results to a baseline classifier, denoted by ZeroR, and to other common classifiers as well in Section 5.3. The ZeroR classifier simply assigns the label of the majority class in the training data to a new class instance. Thus, if there are more positive class instances than
8 8 SPE MS negative in the training data, it will assign the positive class to every test data instance, and vice-versa. This baseline classifier provides the minimum expected accuracy estimates for this dataset. The accuracy results are shown in Table 1. Figure 9: Two shapelets (shown in red) extracted for failure detection based on intake pressure sensor data (normalized). The segment length was 1 week. The first shapelet (left) is of 35 hours duration and the second one (right) is of 5 hours duration. The data was split into a 50%-50% train-test split (half of the data was used for training). Table 1: Accuracy of detecting failures at the pump level. Segment length 1 day 2 days 1 week % Data in Training 25% 50% 25% 50% 25% 50% Current Voltage Intake Pressure ZeroR (baseline) Segment-level failure prediction For our failure prediction experiment, we considered data segments that appear before the failure or normal events (i.e., data segments are collected from the lookback period). The lookback periods were set as follows: a lookback of 1-week for 1-day segment length, and a lookback of 4-weeks for both the 2-day and the 1-week segment lengths. The accuracy results are shown in Table 2. The shapelets extracted from training a prediction classifier are shown in Figure 10. Table 2: Accuracy of predicting failures at the pump level. Segment length 1 day 2 days 1 week % Data in Training 25% 50% 25% 50% 25% 50% Current Voltage Intake Pressure ZeroR (baseline)
9 SPE MS 9 Figure 10: Three shapelets (shown in red) extracted for failure prediction based on intake pressure sensor data (normalized). The segment length was 1 week and lookback period was 4 weeks. The shapelet durations are 50 hours, 36 hours and 5 hours. The data was divided into a train-test split (half of the data was used for training). 5.3 Comparison with other machine learning techniques We compared our results with those of several state-of-the-art machine learning approaches, widely used for pump failure detection. The results are shown in Table 3. We observe that shapelet-based methods are comparable to other classifiers. In these experiments, the segment length is 1 week and the split for training and testing is 50%. The lookback window for failure prediction is 4 weeks. Table 3: Comparison of accuracy (%) of shapelet-based segment-level failure detection with other machine learning techniques. The following classifiers are compared to our proposed Fast Shapelets-based method (FS): Logistic Regression, Multi Layer Perceptron (MLP), Support Vector Machines (SVM) using either Gaussian or radial basis function (rbf) kernel, polynomial kernel, or linear kernel, AdaBoost, Decision Tree (J48), Random Forests (RF) and our baseline classifier (ZeroR). FS Logistic MLP SVM(rbf) SVM(poly) SVM(linear) AdaBoost J48 RF ZeroR Failure detection Failure prediction Pump-level failure detection The segment-level classification experiments attempt to detect and predict failures given a single segment. However, in a typical oilfield monitoring application, it is sufficient to detect and predict failures at a given time using all the sensor measurements available up to that time. We can therefore combine the predictions of all the time segments from a given ESP to get a prediction of the failure for that pump. We call this analysis pump-level failure detection and prediction. For pump-level analysis, the train-test split is performed differently from the segment-level experiments. We use leaveone-out (LOO) cross validation for evaluating pump-level failure detection and prediction. In LOO, only instances (the data segments extracted after pre-processing) from one of the pumps are set aside for testing the learned classifier while all other segments (along with their failure/normal labels) are used for training using the Fast Shapelets algorithm as described earlier. This evaluation is repeated for each pump and the classification accuracy results are aggregated. Note that unlike segmentlevel evaluation, none of the data segments from the well set aside for testing are used for training the shapelet-based failure detector or predictor. (In segment-level evaluation, the segments from all the wells were randomly split into training and test instances.) Table 4 presents the accuracy of pump-level failure detection. There are 8 ESPs which failed and 11 which functioned normally in the available data (column Failed/Normal? ). The number of data segments extracted after pre-processing is shown (column #test segments ) for each of these 19 cases. The accuracy of applying the shapelet-based classifier for each of the segments from an ESP (the classifier is trained on all the segments from the remaining ESPs) is shown in columns #detected failure segments and equivalently as a percentage in column Segment-level accuracy (%). In order to combine the segment-level labels into a single label for the pump, we apply a threshold. Setting this threshold enables the precisionrecall performance of the classifier to be traded-off against each other. (Precision is defined as the fraction of all positives labeled by the classifier that are true positives; recall is defined as the fraction of all positives in the dataset that are identified as true positives by the classifier.) Increasing the threshold increases the precision (certainty that identified failures are genuine faults) at the cost of recall (missing some genuine faults). In our evaluation, we set the failure detection rule to classify a pump as a failure if 40% or more of the segments were classified as failure segments. This approach is then able to detect ESP failures with a precision of 89% 8 out of the 9 pumps labeled as failures had actually failed and a recall of 100%
10 10 SPE MS 8 out of the 8 failed pumps were correctly identified. In Table 4, setting this threshold enables Pump #5, #9, and #16 to be labeled correctly even though their segment-level accuracy is less than 100%. Table 4: Results of detecting failures at the pump level. Rows in bold indicate pumps that were labeled correctly as failures/normal though their segment-level accuracy is less than 100% Pump number Failed/ Normal? #test segments #detected segments failure Segment-level accuracy (%) Detection label after threshold of 40% 1 F F 3 F F 4 F F 5 F F 6 F F 7 F F 8 F F 9 F F 10 N N 11 N F 12 N N 13 N N 14 N N 15 N N 16 N N 17 N N 18 N N 19 N N 20 N N 5.5 Pump-level failure prediction The evaluation described in Section 5.4 is repeated for pump failure prediction. In this case, all the segments for training and testing are taken from the lookback period. The accuracy results are shown in Table 5. In this case, we apply a threshold of 20%. The failure prediction rule is that a pump is predicted to fail if 20% or more of the segments were classified as failure segments. The pump-level shaplets-based predictor is able to predict ESP failures with a precision of 78% (7 out of the 9 pumps labeled as failures will actually fail) and a recall of 78% (7 out of the 9 pumps that would fail were correctly identified). 5.6 Discussion The computational cost of the Fast Shapelets method that was presented is very different for training and classification. Classification is typically fast since it depends only on the number and length of shapelets identified during training and these are few in number (Figure 9 and Figure 10). The shapelets identified during training can be viewed by domain experts (Figure 9 and Figure 10) and an interpretation with regards to failure causes can be attempted. As expected, prediction of failures is harder than detection (the accuracies of prediction experiments are lower than that of the corresponding detection experiments). Intake pressure is generally a better indicator and predictor of failures compared to current and voltage given the accuracy results in (Table 1 and Table 2). As expected, increasing the amount of data used in training gives higher accuracies (25% versus 50%, Table 1) but the accuracies are comparable in magnitude. Moreover, the accuracy is comparable with other ML techniques even with low amounts of training (Table 3). Our novel threshold-based method of detecting pump-level failures is able to achieve relatively high precision of 89% and recall of 100% when the classifier is trained on all data segments available from other pumps (Table 4). Note that in a fault monitoring application with low
11 SPE MS 11 failure rates, the operational cost of implementing the monitoring workflow is largely determined by the false alarm rates (100 precision). Pump-level failure prediction has lower precision and recall (both 78%, Table 5). The pre-processing procedure and the shapelets algorithm have a few parameters that need to be determined before it can be applied for classification. In our case, a segment length of 1 week, and a lookback period of 4 weeks was found to give the best results. The optimal parameters may have to be recomputed for a different dataset. Pump number Will fail/ #test Normal? segments Table 5: Results of predicting failures at the pump level # predicted to fail segments Segment-level accuracy (%) Prediction label after threshold of 20% 1 F F 2 F N 3 F F 4 F F 5 F F 6 F F 7 F F 8 F F 9 F N 10 N N 11 N N 14 N N 15 N N 16 N F 17 N N 18 N N 19 N F 20 N N 6. Conclusions Shapelets-based time series classification is attractive for oil and gas applications due to their fast classification time and the possibility of interpreting the underlying shapelets by domain experts. We adapted the shapelets approach for detecting and predicting ESP failures only from readily available sensor data. This required developing an appropriate pre-processing procedure. In our experiments to evaluate the accuracy of this method, we found that the accuracy is comparable to other machine learning-based classifiers even with relatively low amounts of training data. The method is able to detect failures only from intake pressure measurements with a precision of 89% and 100% recall. The accuracy of predicting failures is lower with a precision of 78% at 78% recall. This approach has a few customizable parameters that may have to be adapted to different datasets. In future work, multiple measurement attributes (intake pressure, current, and voltage) can be considered together in the shapelets-based classifier to improve the detection and prediction accuracy. 7. Acknowledgments This work is supported by Chevron U.S.A. Inc. under the joint project, Center for Interactive Smart Oilfield Technologies (CiSoft), at the University of Southern California. 8. References [1] B. Burda, J. Crompton, H. M. Sardoff, and J. Falconer, "Information architecture strategy for the digital oil field," in Digital Energy Conference and Exhibition, [2] J. Crompton and C. G. Upstream, "Putting the FOCUS on Data," in Keynote Address at the 2008 W3C Workshop on Semantic Web in Oil & Gas Industry, Houston, Texas, 2008.
12 12 SPE MS [3] M. R. Brule, "Big Data in Exploration and Production: Real-Time Adaptive Analytics and Data-Flow Architecture," in SPE Digital Energy Conference, [4] A. Abou-Sayed, "Data mining applications in the oil and gas industry," Journal of Petroleum Technology, vol. 64, pp , [5] L. Ye and E. Keogh, "Time series shapelets: a new primitive for data mining," in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 2009, pp [6] O. P. Patri, V. Sorathia, and V. K. Prasanna, "Event-driven information integration for the digital oilfield," in SPE Annual Technical Conference and Exhibition, [7] S. Sawaryn, K. Grames, and O. Whelehan, "The Analysis and Prediction of Electric Submersible Pump Failures in the Milne Point Field Alaska," SPE production & facilities, vol. 17, pp , [8] Electric submersible pump Reliability information and failure tracking system Joint Industry Project. Available: [9] F. Alhanati, S. Solanki, and T. Zahacy, "ESP Failures: Can We Talk the Same Language?," in Gulf Coast Section ESP Workshop held in Houston, TX, [10] J. Presley. (2013) ESP for ESPs? Exploration & Production. [11] S. Sawaryn and E. Ziegel, "Statistical assessment and management of uncertainty in the number of electrical submersible pump failures in a field," SPE production & facilities, vol. 18, pp , [12] S. Sawaryn, "The dynamics of electrical-submersible-pump populations and the implication for dual-esp systems," SPE production & facilities, vol. 18, pp , [13] T. Kalu-Ulu, J. Andrawus, and I. George, "Modelling System Failures of Electric Submersible Pumps in Sand Producing Wells," in Nigeria Annual International Conference and Exhibition, [14] Y. Liu, K.-T. Yao, S. Liu, C. S. Raghavendra, T. L. Lenz, L. Olabinjo, F. B. Seren, S. Seddighrad, and C. Dinesh Babu, "Failure Prediction for Artificial Lift Systems," in SPE Western Regional Meeting, [15] S. Liu, C. S. Raghavendra, Y. Liu, K.-T. Yao, O. Balogun, L. Olabinjo, R. Soma, J. Ivanhoe, B. Smith, and B. B. Seren, "Automatic Early Fault Detection for Rod Pump Systems," in SPE Annual Technical Conference and Exhibition, [16] O. P. Patri, A. Panangadan, C. Chelmis, and V. K. Prasanna, "Extracting discriminative features for event-based electricity disaggregation," presented at the IEEE Conference on Technologies for Sustainability, Portland, Oregon, USA, [17] O. P. Patri, V. S. Sorathia, A. V. Panangadan, and V. K. Prasanna, "The process-oriented event model (PoEM): a conceptual model for industrial events," in Proceedings of the 8th ACM International Conference on Distributed Event-Based Systems, 2014, pp [18] A. Mueen, E. Keogh, and N. Young, "Logical-shapelets: an expressive primitive for time series classification," in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, 2011, pp [19] Z. Xing, J. Pei, S. Y. Philip, and K. Wang, "Extracting Interpretable Features for Early Classification on Time Series," in SDM, 2011, pp [20] T. Rakthanmanon and E. Keogh, "Fast shapelets: A scalable algorithm for discovering time series shapelets," in Proceedings of the thirteenth SIAM conference on data mining (SDM), [21] J. Zakaria, A. Mueen, and E. Keogh, "Clustering Time Series Using Unsupervised-Shapelets," in Proceedings of the 2012 IEEE 12th International Conference on Data Mining, 2012, pp [22] Y. Engel and O. Etzion, "Towards proactive event-driven computing," in Proceedings of the 5th ACM international conference on Distributed event-based system, 2011, pp
Knowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
Pattern Recognition and Data-Driven Analytics for Fast and Accurate Replication of Complex Numerical Reservoir Models at the Grid Block Level
SPE-167897-MS Pattern Recognition and Data-Driven Analytics for Fast and Accurate Replication of Complex Numerical Reservoir Models at the Grid Block Level S. Amini, S.D. Mohaghegh, West Virginia University;
DATA MINING TECHNIQUES AND APPLICATIONS
DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
Anomaly Detection in Predictive Maintenance
Anomaly Detection in Predictive Maintenance Anomaly Detection with Time Series Analysis Phil Winters Iris Adae Rosaria Silipo [email protected] [email protected] [email protected] Copyright
Using Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
Blog Post Extraction Using Title Finding
Blog Post Extraction Using Title Finding Linhai Song 1, 2, Xueqi Cheng 1, Yan Guo 1, Bo Wu 1, 2, Yu Wang 1, 2 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 2 Graduate School
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches
Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic
FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS
FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS Breno C. Costa, Bruno. L. A. Alberto, André M. Portela, W. Maduro, Esdras O. Eler PDITec, Belo Horizonte,
Final Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
Data Cleansing for Remote Battery System Monitoring
Data Cleansing for Remote Battery System Monitoring Gregory W. Ratcliff Randall Wald Taghi M. Khoshgoftaar Director, Life Cycle Management Senior Research Associate Director, Data Mining and Emerson Network
Microsoft Azure Machine learning Algorithms
Microsoft Azure Machine learning Algorithms Tomaž KAŠTRUN @tomaz_tsql [email protected] http://tomaztsql.wordpress.com Our Sponsors Speaker info https://tomaztsql.wordpress.com Agenda Focus on explanation
Pentaho Data Mining Last Modified on January 22, 2007
Pentaho Data Mining Copyright 2007 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at www.pentaho.org
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
Introduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015
RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering
E-commerce Transaction Anomaly Classification
E-commerce Transaction Anomaly Classification Minyong Lee [email protected] Seunghee Ham [email protected] Qiyi Jiang [email protected] I. INTRODUCTION Due to the increasing popularity of e-commerce
Hong Kong Stock Index Forecasting
Hong Kong Stock Index Forecasting Tong Fu Shuo Chen Chuanqi Wei [email protected] [email protected] [email protected] Abstract Prediction of the movement of stock market is a long-time attractive topic
Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing
www.ijcsi.org 198 Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing Lilian Sing oei 1 and Jiayang Wang 2 1 School of Information Science and Engineering, Central South University
REMOTE MONITORING SYSTEM MATURE FIELDS UNCONVENTIONALS MONITORING SYSTEM. Solving challenges.
REMOTE MONITORING SYSTEM InteLift REMOTE MONITORING SYSTEM Solving challenges. UNCONVENTIONALS MATURE FIELDS InteLift Remote Monitoring System InteLift REMOTE MONITORING SYSTEM YOUR RESERVES AND YOUR
Distributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
Mimicking human fake review detection on Trustpilot
Mimicking human fake review detection on Trustpilot [DTU Compute, special course, 2015] Ulf Aslak Jensen Master student, DTU Copenhagen, Denmark Ole Winther Associate professor, DTU Copenhagen, Denmark
The Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
1. Classification problems
Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification
Ensembles and PMML in KNIME
Ensembles and PMML in KNIME Alexander Fillbrunn 1, Iris Adä 1, Thomas R. Gabriel 2 and Michael R. Berthold 1,2 1 Department of Computer and Information Science Universität Konstanz Konstanz, Germany [email protected]
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
STATISTICA. Financial Institutions. Case Study: Credit Scoring. and
Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT
Big Data Text Mining and Visualization. Anton Heijs
Copyright 2007 by Treparel Information Solutions BV. This report nor any part of it may be copied, circulated, quoted without prior written approval from Treparel7 Treparel Information Solutions BV Delftechpark
Decision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
Conclusions and Future Directions
Chapter 9 This chapter summarizes the thesis with discussion of (a) the findings and the contributions to the state-of-the-art in the disciplines covered by this work, and (b) future work, those directions
KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics
ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
Concept and Project Objectives
3.1 Publishable summary Concept and Project Objectives Proactive and dynamic QoS management, network intrusion detection and early detection of network congestion problems among other applications in the
Applying Data Science to Sales Pipelines for Fun and Profit
Applying Data Science to Sales Pipelines for Fun and Profit Andy Twigg, CTO, C9 @lambdatwigg Abstract Machine learning is now routinely applied to many areas of industry. At C9, we apply machine learning
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Meeting the challenges of today s oil and gas exploration and production industry.
Meeting the challenges of today s oil and gas exploration and production industry. Leveraging innovative technology to improve production and lower costs Executive Brief Executive overview The deep waters
Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis
Towards SoMEST Combining Social Media Monitoring with Event Extraction and Timeline Analysis Yue Dai, Ernest Arendarenko, Tuomo Kakkonen, Ding Liao School of Computing University of Eastern Finland {yvedai,
Data Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
Florida International University - University of Miami TRECVID 2014
Florida International University - University of Miami TRECVID 2014 Miguel Gavidia 3, Tarek Sayed 1, Yilin Yan 1, Quisha Zhu 1, Mei-Ling Shyu 1, Shu-Ching Chen 2, Hsin-Yu Ha 2, Ming Ma 1, Winnie Chen 4,
A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML
www.bsc.es A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML Josep Ll. Berral, Nicolas Poggi, David Carrera Workshop on Big Data Benchmarks Toronto, Canada 2015 1 Context ALOJA: framework
ADVANCED MACHINE LEARNING. Introduction
1 1 Introduction Lecturer: Prof. Aude Billard ([email protected]) Teaching Assistants: Guillaume de Chambrier, Nadia Figueroa, Denys Lamotte, Nicola Sommer 2 2 Course Format Alternate between: Lectures
KEITH LEHNERT AND ERIC FRIEDRICH
MACHINE LEARNING CLASSIFICATION OF MALICIOUS NETWORK TRAFFIC KEITH LEHNERT AND ERIC FRIEDRICH 1. Introduction 1.1. Intrusion Detection Systems. In our society, information systems are everywhere. They
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Data Mining Yelp Data - Predicting rating stars from review text
Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University [email protected] Chetan Naik Stony Brook University [email protected] ABSTRACT The majority
A Review of Anomaly Detection Techniques in Network Intrusion Detection System
A Review of Anomaly Detection Techniques in Network Intrusion Detection System Dr.D.V.S.S.Subrahmanyam Professor, Dept. of CSE, Sreyas Institute of Engineering & Technology, Hyderabad, India ABSTRACT:In
Beating the NCAA Football Point Spread
Beating the NCAA Football Point Spread Brian Liu Mathematical & Computational Sciences Stanford University Patrick Lai Computer Science Department Stanford University December 10, 2010 1 Introduction Over
An Anomaly-Based Method for DDoS Attacks Detection using RBF Neural Networks
2011 International Conference on Network and Electronics Engineering IPCSIT vol.11 (2011) (2011) IACSIT Press, Singapore An Anomaly-Based Method for DDoS Attacks Detection using RBF Neural Networks Reyhaneh
Advanced In-Database Analytics
Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, [email protected] Abstract: Independent
A new Approach for Intrusion Detection in Computer Networks Using Data Mining Technique
A new Approach for Intrusion Detection in Computer Networks Using Data Mining Technique Aida Parbaleh 1, Dr. Heirsh Soltanpanah 2* 1 Department of Computer Engineering, Islamic Azad University, Sanandaj
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
Fast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
Grid Density Clustering Algorithm
Grid Density Clustering Algorithm Amandeep Kaur Mann 1, Navneet Kaur 2, Scholar, M.Tech (CSE), RIMT, Mandi Gobindgarh, Punjab, India 1 Assistant Professor (CSE), RIMT, Mandi Gobindgarh, Punjab, India 2
Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal
Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether
Active Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
Environmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
Information Management course
Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli ([email protected])
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
Design and Implementation of a Semantic Web Solution for Real-time Reservoir Management
Design and Implementation of a Semantic Web Solution for Real-time Reservoir Management Ram Soma 2, Amol Bakshi 1, Kanwal Gupta 3, Will Da Sie 2, Viktor Prasanna 1 1 University of Southern California,
An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
Business Analytics using Data Mining Project Report. Optimizing Operation Room Utilization by Predicting Surgery Duration
Business Analytics using Data Mining Project Report Optimizing Operation Room Utilization by Predicting Surgery Duration Project Team 4 102034606 WU, CHOU-CHUN 103078508 CHEN, LI-CHAN 102077503 LI, DAI-SIN
BIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
Advanced Ensemble Strategies for Polynomial Models
Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer
Investigating Clinical Care Pathways Correlated with Outcomes
Investigating Clinical Care Pathways Correlated with Outcomes Geetika T. Lakshmanan, Szabolcs Rozsnyai, Fei Wang IBM T. J. Watson Research Center, NY, USA August 2013 Outline Care Pathways Typical Challenges
Dan French Founder & CEO, Consider Solutions
Dan French Founder & CEO, Consider Solutions CONSIDER SOLUTIONS Mission Solutions for World Class Finance Footprint Financial Control & Compliance Risk Assurance Process Optimization CLIENTS CONTEXT The
Chapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
Novelty Detection in image recognition using IRF Neural Networks properties
Novelty Detection in image recognition using IRF Neural Networks properties Philippe Smagghe, Jean-Luc Buessler, Jean-Philippe Urban Université de Haute-Alsace MIPS 4, rue des Frères Lumière, 68093 Mulhouse,
II. RELATED WORK. Sentiment Mining
Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract
Analytics on Big Data
Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis
Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems
2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore Impact of Feature Selection on the Performance of ireless Intrusion Detection Systems
Hadoop s Advantages for! Machine! Learning and. Predictive! Analytics. Webinar will begin shortly. Presented by Hortonworks & Zementis
Webinar will begin shortly Hadoop s Advantages for Machine Learning and Predictive Analytics Presented by Hortonworks & Zementis September 10, 2014 Copyright 2014 Zementis, Inc. All rights reserved. 2
How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
Using Predictive Maintenance to Approach Zero Downtime
SAP Thought Leadership Paper Predictive Maintenance Using Predictive Maintenance to Approach Zero Downtime How Predictive Analytics Makes This Possible Table of Contents 4 Optimizing Machine Maintenance
Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup
Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor
Email Spam Detection Using Customized SimHash Function
International Journal of Research Studies in Computer Science and Engineering (IJRSCSE) Volume 1, Issue 8, December 2014, PP 35-40 ISSN 2349-4840 (Print) & ISSN 2349-4859 (Online) www.arcjournals.org Email
Scalable Developments for Big Data Analytics in Remote Sensing
Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,
W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set
http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer
CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health
Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining
Measurement Information Model
mcgarry02.qxd 9/7/01 1:27 PM Page 13 2 Information Model This chapter describes one of the fundamental measurement concepts of Practical Software, the Information Model. The Information Model provides
Galaxy Morphological Classification
Galaxy Morphological Classification Jordan Duprey and James Kolano Abstract To solve the issue of galaxy morphological classification according to a classification scheme modelled off of the Hubble Sequence,
Data Mining Part 5. Prediction
Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification
Divide-n-Discover Discretization based Data Exploration Framework for Healthcare Analytics
for Healthcare Analytics Si-Chi Chin,KiyanaZolfaghar,SenjutiBasuRoy,AnkurTeredesai,andPaulAmoroso Institute of Technology, The University of Washington -Tacoma,900CommerceStreet,Tacoma,WA980-00,U.S.A.
Data Mining Analytics for Business Intelligence and Decision Support
Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing
CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA
CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA Professor Yang Xiang Network Security and Computing Laboratory (NSCLab) School of Information Technology Deakin University, Melbourne, Australia http://anss.org.au/nsclab
Predicting Flight Delays
Predicting Flight Delays Dieterich Lawson [email protected] William Castillo [email protected] Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing
Big Data: Rethinking Text Visualization
Big Data: Rethinking Text Visualization Dr. Anton Heijs [email protected] Treparel April 8, 2013 Abstract In this white paper we discuss text visualization approaches and how these are important
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,
An Introduction to Data Mining
An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
Data Mining & Data Stream Mining Open Source Tools
Data Mining & Data Stream Mining Open Source Tools Darshana Parikh, Priyanka Tirkha Student M.Tech, Dept. of CSE, Sri Balaji College Of Engg. & Tech, Jaipur, Rajasthan, India Assistant Professor, Dept.
BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES
CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES Claus Gwiggner, Ecole Polytechnique, LIX, Palaiseau, France Gert Lanckriet, University of Berkeley, EECS,
UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee
UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee 1. Introduction There are two main approaches for companies to promote their products / services: through mass
Data Mining for Knowledge Management. Classification
1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh
Recognizing Informed Option Trading
Recognizing Informed Option Trading Alex Bain, Prabal Tiwaree, Kari Okamoto 1 Abstract While equity (stock) markets are generally efficient in discounting public information into stock prices, we believe
