Identifying relationships between drugs and medical conditions: winning experience in the Challenge 2 of the OMOP 2010 Cup


Vladimir Nikulin (vnikulin.uq@gmail.com)
Department of Mathematical Methods in Economics, Vyatka State University, Kirov, Russia

Abstract

There is a growing interest in using longitudinal observational databases to detect drug safety signals. In this paper we present a novel method, which we used online during the OMOP Cup. We consider homogeneous ensembling, which is based on random re-sampling (also known as bagging), as the main innovation compared to previous publications in the related field. This study is based on a very large simulated database of 10 million patient records, which was created by the Observational Medical Outcomes Partnership (OMOP). Compared to the traditional classification problem, the given data are unlabelled. The objective of this study is to discover hidden associations between drugs and conditions. The main idea of the approach, which we used during the OMOP Cup, is to compare the numbers of observed and expected patterns. This comparison may be organised in several different ways, and the outcomes (base learners) may be quite different as well. It is proposed to construct the final decision function as an ensemble of the base learners. Our method was recognised formally by the Organisers of the OMOP Cup as a top performing method for Challenge 2.

Keywords: longitudinal observational data, signal detection, temporal pattern discovery, unsupervised learning, electronic health records

INTRODUCTION

An improvement of drug safety and the identification of adverse drug events remain a very important problem. Several recent drug safety events have highlighted the need for new data sources and algorithms to assist in identifying adverse drug events in a more timely, effective, and efficient manner. The methods and statistical tools used on large healthcare data sources (e.g., administrative claims and electronic health records) have been lacking and are not yet systematized to look at disparate databases.

The Observational Medical Outcomes Partnership (OMOP) conducted a Cup Competition as a catalyst for new methods development to identify relationships in data between drugs and adverse events or conditions (OMOP Newsletter, 2010).

To provide an objective basis for monitoring and assessing the safety of marketed products, pharmaceutical companies and regulatory agencies have implemented post-marketing surveillance activities based in large measure on the collection of spontaneously generated adverse reaction reports. Report initiation (by health professionals and consumers) is generally voluntary; by contrast, the pharmaceutical companies are generally under legal obligation to follow up on reports that they receive and to pass them along to various regulatory authorities (Fram et al., 2003).

Every drug has undergone extensive testing before being released to the market, but even pre-marketing clinical trials involving thousands of people cannot uncover all adverse events that may occur in a much larger and more diverse population. Traditionally, post-marketing safety signal detection has relied on voluntary, spontaneous reporting of suspected adverse drug reactions by health care professionals, patients, and consumers (Schuemie, 2010). There is a global interest in using electronic health records for active drug safety surveillance. Many methods have been developed and exploited for quantitative signal detection in spontaneous reporting databases; most of these are based on disproportionality methods of case reports.

A full safety profile of a new drug can never be known at the time that it is introduced to the general public. Whereas premarketing clinical trials do consider safety endpoints, they are limited in the types and numbers of patients exposed. Actual clinical practice often differs from the controlled setting of a clinical trial with respect to the indication for treatment, concomitant medication, and dosage at which a drug is prescribed. Also, it may differ over time. As a consequence, safety monitoring and evaluation must continue throughout a drug's life-cycle (Norén et al., 2009).

In this paper we would like to share our successful experience, which was obtained online during the OMOP 2010 Cup. Also, we would like to direct readers to some selected publications (Nikulin, 2008), (Nikulin and McLachlan, 2010) and (Nikulin et al., 2011), where we reported our successful models and methods, which were used during different data mining Challenges.

OMOP CHALLENGE

OMOP is a public-private partnership designed to improve the monitoring of drugs for safety and effectiveness. The partnership is conducting a two-year research initiative to determine whether it is feasible and useful to use automated healthcare data to identify and evaluate safety issues of drugs on the market. The Partnership's methodological research is conducted across multiple disparate observational databases (administrative claims and electronic health records).
The series of studies being conducted includes assessing different types of automated healthcare data, developing tools and methods to analyze the databases, and evaluating how analyses can contribute to decision-making. OMOP relies on the expertise and resources of the U.S. Food and Drug Administration, other federal agencies, academic institutions, the pharmaceutical and health insurance industries and non-profit organizations. A network of institutions, managed by the Foundation for the National Institutes of Health, carries out specific OMOP tasks, and all together, more than 100 partners are collaborating. Throughout the work phases of OMOP all work products are made publicly available to promote transparency and consistency in research.

The competition started in September 2009. OMOP provided the participants with a large simulated data set resembling healthcare data that was spiked with adverse events. The competitors had to find the signals by generating methods to identify relationships in the data between drugs and medical outcomes (adverse events). The goal was to develop methods that correctly identified true drug-event associations while minimizing false positive findings. Methods were evaluated by how accurately they predicted the known relationships that existed in the data. At the end of the competition, which was closed on March 31, 2010, there were over sixty competitors from many fields and entities.

OMOP database

The given database includes records of 10 million patients with the dates when observation was started and ended. The overall observation period is 10 years. For any particular patient we have 2 sequences: 1) drugs with starting and ending dates; 2) conditions with starting date. The total numbers of drugs and conditions are 5,000 and 4,519, respectively. Accordingly, the total number of possible associations is 22,595,000. Some demographic information, such as age and sex, is also available. We shall denote by $D$ and $C$ the sets of all drugs and conditions.

As an illustration, the organisers made available a small subset of pairs {drug, condition} with true labels (4000 positive and 3920 negative), but we did not use this information in the training process. More details regarding the database and the Challenge may be found on the OMOP website (http://omop.fnih.org/omopcup).

Most of the pre-processing was conducted using special software written in Perl; the main algorithms were implemented in C. In addition, we used special codes written in Matlab.

DISPROPORTIONALITY ANALYSIS (DPA)

In studying the temporal association between two events, it is convenient to let one event set the relative time frame in which the incidence of the other event is examined. In the context of this paper we shall let drug prescriptions define the relative time frame in which the incidence of other medical events is examined. The other medical events considered include notes of clinical symptoms, signs, and diagnoses, and prescriptions of other drugs. Our objective is to identify interesting temporal patterns relating the occurrence of a medical event to first prescriptions of a specific drug (Norén et al., 2009).

Let us denote by $\Delta$ a threshold temporal parameter (for example, it may be in the range of 30-60 days). Then, we can consider the observation period $T$ (for example, it may be the whole period of 10 years). We shall consider all the listed patients and shall compute $n_{dc}$, the number of associations/cases where

$$0 \le t_c - t_d \le \Delta, \qquad (1)$$

and $t_c$ and $t_d$ are the dates when the condition and the drug were started. As a next step, we count $n_d$ and $n_c$, the numbers of times the drug and the condition were found within the time interval $T$. Note that $n_d$ and $n_c$ were computed independently.

Assuming that events are independent, the expected number of associations may be calculated according to the following formula

$$b_{dc} = \lambda_d \, n_c, \qquad (2)$$

where

$$\lambda_d = \frac{n_d}{N}, \qquad N = \sum_{d \in D} n_d.$$

Finally, we shall compute the required ratings

$$r_{dc} = f\!\left(\frac{n_{dc} + \alpha}{b_{dc} + \alpha}\right), \qquad (3)$$

where $f$ is a logarithmic or power function and $\alpha$ is a smoothing or shrinkage parameter. In our experiments we used $0.1 \le \alpha \le 0.5$.

Mean average precision

The performance of the solutions was measured using the Mean Average Precision (MAP), a metric often used in the field of information retrieval. It measures how well a system ranks items, and emphasizes ranking true positive items higher. It is the average of the precisions computed at the point of each of the true positives in the ranked list returned by the method (Schuemie, 2010). With the approach presented in this section we achieved MAP = 0.12 for Challenge 1.

TEMPORAL ANALYSIS

Most likely, the ratings (3) will be too rough if they are calculated over the whole time-interval $T$ of 10 years. Therefore, it is proposed to split the whole interval $T$ into several consecutive subintervals $T_i$, $i = 1, \ldots, m$, and to calculate $r_{dc}^{(i)}$, $i = 1, \ldots, m$, accordingly. The most suitable value is $m = 10$, which corresponds to the number of years within the whole observation period. As an outcome, we can produce a solution for Challenge 2 using the ratings $r_{dc}^{(i)}$, $i = 1, \ldots, m$.

Challenge 2: identifying drug-condition associations as data accumulates over time

Timely detection of drug-related adverse events as part of an active surveillance system would allow patients and health care providers to minimise potential risks and inform decision-making authorities as quickly as possible. Challenge 2 seeks to evaluate a method's performance in identifying true drug-condition associations and discerning them from false associations as data accumulates over time. It is important to mention that an association is defined as a drug that increases the likelihood of a condition occurring. A condition that is less likely to occur after receiving a drug, possibly as an intentional result of a treatment, is not counted as an association. For the second challenge, it was necessary to examine the first 500 drugs more closely.
As requested, submissions should contain one entry for each such drug-condition combination at the end of each of 10 calendar years, resulting in 10 × 500 × 4519 (22,595,000) total records. That means the size of all possible combinations for Challenge 2 was exactly the same as for Challenge 1. We calculated the solution for Challenge 2 according to the formula

$$s_{dc}^{(year)} = \frac{1}{year} \sum_{i=1}^{year} r_{dc}^{(i)}, \qquad year = 1, \ldots, 10, \qquad (4)$$

where $r_{dc}^{(i)}$ is the rating (3) computed on the $i$-th yearly subinterval, and observed MAP = 0.13. Also, we considered $s^{(10)}$ in application to Challenge 1, with MAP = 0.14. Figure 1 illustrates the behaviour of the 16 selected (strongest relations) pairs $\{d, c\}$, which are presented in Table 1.
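To make the steps in (1)-(4) concrete, the following minimal Python sketch computes observed counts, expected counts, smoothed ratings and cumulative yearly averages on toy data. The record layout, the function names `dpa_ratings` and `cumulative_average`, and the choice f = log are assumptions made for illustration only; the original implementation was written in C and is not reproduced here. The sketch also counts every drug start, which is a simplification.

```python
import math
from collections import defaultdict


def dpa_ratings(drug_starts, cond_starts, delta=30, alpha=0.5):
    """Ratings r_dc = log((n_dc + alpha) / (b_dc + alpha)), following (1)-(3).

    drug_starts: list of (patient, drug, start_day)       -- assumed layout
    cond_starts: list of (patient, condition, start_day)  -- assumed layout
    """
    n_d = defaultdict(int)    # number of times each drug was found
    n_c = defaultdict(int)    # number of times each condition was found
    n_dc = defaultdict(int)   # observed pairs satisfying condition (1)
    conds_by_patient = defaultdict(list)

    for patient, cond, day in cond_starts:
        n_c[cond] += 1
        conds_by_patient[patient].append((cond, day))

    for patient, drug, t_d in drug_starts:
        n_d[drug] += 1
        for cond, t_c in conds_by_patient[patient]:
            if 0 <= t_c - t_d <= delta:          # condition (1)
                n_dc[(drug, cond)] += 1

    total = sum(n_d.values())                    # N in (2)
    ratings = {}
    for d, count_d in n_d.items():
        for c, count_c in n_c.items():
            b = count_d / total * count_c        # expected count (2)
            observed = n_dc.get((d, c), 0)
            ratings[(d, c)] = math.log((observed + alpha) / (b + alpha))  # (3), f = log
    return ratings


def cumulative_average(yearly_ratings):
    """s^(year) = average of r^(1), ..., r^(year), following (4)."""
    averages, running = [], 0.0
    for year, r in enumerate(yearly_ratings, start=1):
        running += r
        averages.append(running / year)
    return averages


# Toy data: drug d1 is followed by condition c1 within 30 days for two patients.
drugs = [(1, "d1", 0), (2, "d1", 10), (3, "d2", 5)]
conds = [(1, "c1", 5), (2, "c1", 20), (3, "c2", 300)]
toy = dpa_ratings(drugs, conds)
```

In the toy data the pair (d1, c1) is observed more often than expected under independence and receives a positive rating, while (d2, c2) falls outside the window and receives a negative one.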

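The MAP metric used for all scores reported in this paper can be sketched in a few lines of Python. The function names are hypothetical, and how the organisers grouped pairs into separate ranked lists before averaging is not stated here, so `mean_average_precision` simply averages over whatever lists it is given.

```python
def average_precision(ranked_labels):
    """Average of precision@k taken at the rank k of every true positive.

    ranked_labels: booleans ordered from the highest to the lowest rating.
    """
    hits = 0
    precisions = []
    for k, is_true in enumerate(ranked_labels, start=1):
        if is_true:
            hits += 1
            precisions.append(hits / k)
    # A list without any true positive contributes 0 by convention (assumption).
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings):
    """Mean of the average precisions over several ranked lists."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two toy rankings: true positives at ranks 1 and 3, and at rank 2.
example = mean_average_precision([[True, False, True], [False, True]])
```

Because precision is sampled only at the positions of true positives, placing a true association higher in the list always increases the score, which is the property emphasised above.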
RANDOM RESAMPLING (BAGGING OR HOMOGENEOUS ENSEMBLING)

Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. The aggregation averages over the versions when predicting a numerical outcome and does a plurality vote when predicting a class (Breiman, 1996). In this section we consider the method of random resampling: it is supposed that by using hundreds of predictors (base learners), based on randomly selected subsets of the whole training set, we shall reduce the random factors. According to the principles of homogeneous ensembling, the final predictor represents an average of the base predictors. As a reference, we mention that random forests (Breiman, 2001) are a well-known example of a successful homogeneous ensemble. However, the construction of random forests is based on another method, which is linked to the features but not to the samples.

With the method of random resampling we were able to achieve a dramatic improvement in performance: MAP = 0.21 for Challenge 1 and MAP = 0.18 for Challenge 2. The ratings for Challenge 2 were calculated according to the following formula

$$z_{dc}^{(year)} = \frac{1}{k} \sum_{j=1}^{k} s_{dc}^{(year)}(j), \qquad year = 1, \ldots, 10, \qquad (5)$$

where $j$ is a sequential index of the randomly selected subset $\Omega_j$ of patients and, by definition, it is assumed that the computation of $s_{dc}^{(year)}(j)$ was based on $\Omega_j$.

Table 1: List of the 16 strongest (according to our evaluation) relations between drugs and conditions, where ratings were computed according to (4). Column "Figure 1" indicates the label of the window in Figure 1 where the time-series of the corresponding relationship is presented.
N    Figure 1   Drug   Condition   Rating
1    a1         198    4017        6214.97
2    a2         199    4018        6105.93
3    a3         80     4011        5843.94
4    a4         3      4002        5802.4
5    b1         314    4025        5700.24
6    b2         137    4013        5623.87
7    b3         362    3509        5613.11
8    b4         437    4039        5585.65
9    c1         2      4002        5543.93
10   c2         471    1996        5311.32
11   c3         318    4027        5302.99
12   c4         251    4020        5289.66
13   d1         79     4011        5256.8
14   d2         339    4032        5233.38
15   d3         198    1280        5208.57
16   d4         3      1377        5179.46
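The resampling scheme behind (5) can be sketched as follows. The sketch assumes a caller-supplied function `rate(patients, delta)` (a hypothetical interface) that returns a dictionary of ratings computed on a given subset of patients. The acceptance level 0.65, the range 40-60 for the threshold parameter and k = 100 are the values reported in this section; drawing the threshold as a continuous uniform variable is an assumption.

```python
import random


def bagged_ratings(patients, rate, k=100, keep=0.65, delta_range=(40, 60), seed=0):
    """Average the ratings of k base learners built on random patient subsets, as in (5).

    rate(subset, delta) -> {pair: rating} is a hypothetical user-supplied base learner.
    """
    rng = random.Random(seed)
    totals = {}
    for _ in range(k):
        # A patient enters the subset when a uniform draw gamma satisfies gamma <= keep.
        subset = [p for p in patients if rng.random() <= keep]
        # The temporal threshold in (1) is itself drawn at random for each base learner.
        delta = rng.uniform(*delta_range)
        for pair, value in rate(subset, delta).items():
            totals[pair] = totals.get(pair, 0.0) + value
    # Pairs missing from a base learner contribute 0 to the average (a simplification).
    return {pair: total / k for pair, total in totals.items()}


# Toy base learner: the "rating" is just the size of the sampled subset.
demo = bagged_ratings(list(range(1000)), lambda subset, delta: {"x": float(len(subset))})
```

With the toy base learner the averaged value is close to 650, that is, about 65% of the 1000 patients, which illustrates how averaging over subsets reduces the random variation of a single sample.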

Selection of the patients was conducted according to the condition $\gamma \le 0.65$, where $\gamma$ is a standard uniformly distributed random variable. Based on our experiments, the number of random samples $k = 100$ is quite sufficient. In addition, we decided to extend the random sampling further, and used as the threshold parameter $\Delta$ in (1) a uniformly distributed random variable: $40 \le \Delta \le 60$.

Figure 1. Temporal dependences (ratings as a function of the years) for the selected pairs {drugs, conditions}, which are presented in Table 1.

Note that we used the solution $z^{(10)}$ for Challenge 1. Figure 2(a) illustrates the structure/histogram of the solution $z^{(10)}$, which was reduced to the logarithmic scale, where we used only drugs with indexes from 1 to 500 (this corresponds to Challenge 2). In accordance with Figure 2(a), an absolute majority of the pairs $\{d, c\}$ has no expected relations. Figure 2(b) shows the histogram of the right part of the solution presented in Figure 2(a), with some potential links.

DPA: A SECOND APPROACH BASED ON THE DRUG ERAS

Compared to the first approach of DPA, we shall use here not a counter of the number of times the drug was used, but the total duration in days for which the drug was used. Let us denote by $h_d$ the total duration of the time when drug $d$ was used during the observational period $T$. Then, we can rewrite (2) in this way

$$b_{dc} = \theta_d \, n_c, \qquad (6)$$

where

$$\theta_d = \frac{h_d}{H}, \qquad H = \sum_{d \in D} h_d.$$

Remark 1. Based on our experimental evaluations, there is a significant difference between formulas (2) and (6) in terms of the related outcomes. The formulas (2) and (6) are similar in the structural sense, and represent the most important initial steps. The following steps to construct the solution for this particular method are the same: we can apply (6) to (3). Then, we can repeat the temporal analysis (4) and resampling (5). As an outcome of this modified procedure we observed the scores MAP = 0.225 for Challenge 1 and MAP = 0.205 for Challenge 2.

Figure 2. (a) histogram of the solution $z^{(10)}$, which is defined in (5); (b) histogram of the right part of the solution $z^{(10)}$ (with potential links between drugs and conditions); (c) function $w$ for the temporal weighting.

HETEROGENEOUS ENSEMBLING

Definition. An ensemble is defined as heterogeneous if the base models in the ensemble are generated by methodologically different learning algorithms. On the other hand, an ensemble is defined as homogeneous if the base models are of the same type (for example, resampling or bagging as discussed above).

Since the solutions DPA1 and DPA2, which are based on the expected numbers of associations (2) and (6), are very different in a structural sense, they cannot be linked together directly. At the same time we know that the qualities of both solutions DPA1 and DPA2 are high. The latter observation represents a very positive factor, which indicates that the solutions DPA1 and DPA2 contain different information, which may lead to further improvement if linked in a proper way. Using an ensemble constructor (Nikulin and McLachlan, 2009), we can adjust one solution to the scale of another solution. After that, we can compute an ensemble solution as a linear combination:

$$\mathrm{ENS} = \tau \cdot \widetilde{\mathrm{DPA1}} + (1 - \tau) \cdot \mathrm{DPA2}, \qquad (7)$$

where $\widetilde{\mathrm{DPA1}}$ is the DPA1 solution adjusted to the scale of the DPA2 solution, and $0 < \tau < 1$ is a positive weight coefficient. Clearly, the stronger the performance of the solution DPA2 compared to DPA1, the smaller the value of the coefficient $\tau$ will be. With the ensemble constructor (7), we observed MAP = 0.23 for Challenge 1 and MAP = 0.22 for Challenge 2, where we used $\tau = 0.3$.

TEMPORAL WEIGHTING FOR COMPUTATION OF THE NUMBERS OF ASSOCIATIONS

According to (3), the value of $n_{dc}$ is very important. Clearly, the strength of the signal depends essentially on the difference $t_c - t_d$, subject to the condition (1). Based on our statistical analysis (and also on some qualitative considerations), we decided to implement the following formula

$$n_{dc} = \sum_{t_d \le t_c \le t_d + \Delta} w(t_c - t_d), \qquad (8)$$

where the structure of the weight function $w$ is illustrated in Figure 2(c): it is logical to assume that the reaction of the patient's organism to a drug is not immediate, and that the likelihood of the possible association will decline over time after some point (6-10 days).

COMPUTATION TIME

A Linux multiprocessor computer with speed 3.2GHz and 16GB of RAM was used for most of the computations. All the algorithms were implemented in C. The running time for 100 random samplings according to (5) was about 10 hours.

CONCLUDING REMARKS

As a main outcome of our study, we can report a very strong improvement with homogeneous ensembling (bagging). Also, we tried to differentiate the matrices (5) for the particular age/sex groups, and then to create a submission assuming that the different age/sex groups are equally important. However, we did not observe any significant improvements with this approach.
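As a compact recap of the drug-era variant (6) and the linear combination (7), the following Python sketch may be helpful. The ensemble constructor of (Nikulin and McLachlan, 2009) is not described in this paper, so the sketch substitutes a simple min-max rescaling of DPA1 onto the range of DPA2; this substitution, like all function names, is an assumption made for illustration.

```python
def expected_from_eras(durations, n_c):
    """b_dc = theta_d * n_c with theta_d = h_d / H, following (6).

    durations: {drug: total days of use h_d}; n_c: {condition: count}.
    """
    total = sum(durations.values())  # H in (6)
    return {(d, c): h / total * n
            for d, h in durations.items()
            for c, n in n_c.items()}


def rescale(source, target):
    """Min-max map of `source` scores onto the range of `target`.

    This is an assumed stand-in for the ensemble constructor cited in the text.
    """
    s_lo, s_hi = min(source.values()), max(source.values())
    t_lo, t_hi = min(target.values()), max(target.values())
    span = (s_hi - s_lo) or 1.0  # avoid division by zero for constant scores
    return {key: t_lo + (value - s_lo) / span * (t_hi - t_lo)
            for key, value in source.items()}


def ensemble(dpa1, dpa2, tau=0.3):
    """ENS = tau * adjusted DPA1 + (1 - tau) * DPA2, following (7).

    Both solutions are assumed to be rated on the same set of pairs.
    """
    adjusted = rescale(dpa1, dpa2)
    return {key: tau * adjusted[key] + (1 - tau) * dpa2[key] for key in dpa2}


# Toy scores on three pairs; DPA1 lives on a different scale than DPA2.
ens = ensemble({"a": 0.0, "b": 5.0, "c": 10.0}, {"a": 1.0, "b": 3.0, "c": 2.0})
```

The rescaling step matters because the two solutions are built from counts and durations respectively, so their raw values are not directly comparable before they are combined.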
During the Challenge we conducted experiments with many different methods and approaches, which were not mentioned in the above Sections. For example, we tried 2D k-means clustering (Nikulin and McLachlan, 2009) and gradient-based matrix factorisation (Nikulin et al., 2011) in application to the matrix (5) in order to smooth the noise, and can report some modest progress in this direction.

There may be several consecutive eras of one drug for the same patient. We achieved good improvements when we used only the first drug era and ignored all the other eras. As to the prospective work: assuming that there are true relations for any drug/condition, it may be a good idea to calibrate the matrix (5) so that the most shiny drugs/conditions will not outshine the other drugs/conditions.

According to (Jelizarow et al., 2010), the superiority of new algorithms should always be demonstrated on independent validation data. In this sense, the importance of the data mining contests is unquestionable. The rapid growth in popularity of the data mining challenges demonstrates with confidence that they are the best-known way to evaluate different models and systems.

ACKNOWLEDGMENTS

We are grateful to the Organisers of the OMOP 2010 data mining Contest for this stimulating opportunity.

REFERENCES

Breiman L. (1996) Bagging Predictors. Machine Learning, 24, 123-140.

Breiman L. (2001) Random Forests. Machine Learning, 45, 5-32.

Cho H. and Dhillon I. (2008) Coclustering of Human Cancer Microarrays Using Minimum Sum-Squared Residue Coclustering. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5(3), 385-400.

Fram D., Almenoff J. and DuMouchel W. (2003) Empirical Bayesian Data Mining for Discovering Patterns in Post-Marketing Drug Safety. SIGKDD 2003, August 24-27, Washington, DC, USA, 359-368.

Jelizarow M., Guillemot V., Tenenhaus A., Strimmer K. and Boulesteix A.-L. (2010) Over-optimism in bioinformatics: an illustration. Bioinformatics, 26(16), 1990-1998.

Nikulin V. (2008) Classification of Imbalanced Data with Random Sets and Mean-Variance Filtering. International Journal of Data Warehousing and Mining, 4(2), 63-78.

Nikulin V. and McLachlan G. J. (2009) Classification of imbalanced marketing data with balanced random sets. JMLR: Workshop and Conference Proceedings, 7, 89-100.

Nikulin V. and McLachlan G. J. (2010) Identifying fiber bundles with regularised k-means clustering applied to the grid-based data. In WCCI 2010 IEEE World Congress on Computational Intelligence, July 18-23, 2010, CCIB, Barcelona, Spain, 2281-2288.

Nikulin V., Huang T.-H., Ng S.-K., Rathnayake S. and McLachlan G. J. (2011) A Very Fast Algorithm for Matrix Factorisation. Statistics and Probability Letters, 81, 773-782.

Nikulin V., Huang T.-H. and McLachlan G. J. (2011a) Classification of high-dimensional microarray data with a two steps procedure via a Wilcoxon criterion and multilayer perceptron. International Journal of Computational Intelligence and Applications, 10(1), 1-14.

Norén G., Hopstadius J., Bate A., Star K. and Edwards I. (2009) Temporal pattern discovery in longitudinal electronic patient records. Data Mining and Knowledge Discovery.

OMOP Newsletter, June 2010, 2(2). Available at: http://omop.fnih.org/omopcup/ [Accessed on 28 July 2011].

Schuemie M. (2010) Methods for drug safety signal detection in longitudinal observational databases: LGPS and LEOPARD. Pharmacoepidemiology and Drug Safety.