Big Data and Text Mining
  Dr. Ian Lewin, Senior NLP Resource Specialist
  Ian.lewin@linguamatics.com
  www.linguamatics.com

About Linguamatics
  Boston, USA
  Cambridge, UK
  Software, consulting, hosted content
  Agile, scalable, real-time NLP-based text mining
  Fact extraction and knowledge synthesis
  Pharma/Biotech: including 17 of the top 20
  Healthcare: including Kaiser Permanente
  Government: including FDA
Linguamatics 2015

Solutions & Applications in Life Sciences
  Advanced text analytics delivers value along the pipeline
    Gene-disease mapping
    Target ID/selection
    Trial site selection and study design
    Regulatory Submission QC
    HEOR
    Toxicity analysis and prediction
    Safety
    Pharmacovigilance
    Mutation/expression analysis
    SAR
    Biomarker discovery
    Competitive intelligence
    Comparative Effectiveness
    Drug repurposing
    Patent analysis
    KOL identification
    Opportunity scouting
    Social media analysis
Linguamatics 2015 - Confidential

Solutions & Applications in Healthcare
  [Diagram. Labels: Structured data; Patient characteristics; FDA drug labels; Pathology, radiology, initial assessment, discharge, check-up; Electronic Health Record; Enterprise Data Warehouse; Potential adverse drug reactions; Scientific literature; Clinical case histories and/or genomic interpretation; Care gap models; Patient lists; Matching; Clinical trials; ClinicalTrials.gov]

Structured Data & its Evidential Basis...
  I2E can mine and extract with precision at scale
    Scientific literature
    Patents
    News feeds
    EHRs
    Internal reports
    Drug labels
    Clinical trials...
    Social media

Text Mining: a precursor to Big Data?
  Unstructured data is just huge
  We can't wait for those human db curators...
  Besides, those curators ignore my parameter...
  And all that text is just out there! (see Google for details)
  Only it isn't
Copyright Linguamatics 2014 - Confidential

Multisource data Big data
  Lots of different types of data
    Scientific literature
    Medical records
    Patents
    Regulatory publications (clinical trials, drug labels, adverse event reporting)
    Internal reports
  Lots of different types of text
  In lots of different silos & lots of different licences

Connected Data Technology
  Single query across multiple data sources and network locations
Copyright Linguamatics 2014-2015 - Confidential

Connected Data Technology
  Query across multiple data sources simultaneously

Connected Data Technology
  Unified results for fast review and discovery of relationships across multiple data sources

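As a rough illustration of the idea on these slides (one query, several sources, one unified result table), here is a minimal Python sketch. It is not I2E's Connected Data Technology or its API. The source names, document ids and texts are hypothetical placeholders, and a plain substring test stands in for a real text-mining query.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "silos". In practice each would sit behind its own
# index or network location. All names and texts are placeholders.
SOURCES = {
    "literature": {
        "lit-1": "Compound A inhibits kinase B.",
        "lit-2": "No relevant findings.",
    },
    "patents": {"pat-1": "A formulation comprising compound A."},
    "internal_reports": {"rep-1": "Kinase B assay results."},
}


def search_source(name, documents, term):
    """Return one row per document in a single source that mentions the term."""
    needle = term.lower()
    return [
        {"source": name, "doc_id": doc_id, "text": text}
        for doc_id, text in documents.items()
        if needle in text.lower()
    ]


def federated_search(sources, term):
    """Send the same query to every source concurrently and merge the rows."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(search_source, name, docs, term)
            for name, docs in sources.items()
        ]
        rows = [row for future in futures for row in future.result()]
    # One unified table, ordered by source and document id for stable review.
    return sorted(rows, key=lambda r: (r["source"], r["doc_id"]))


if __name__ == "__main__":
    for row in federated_search(SOURCES, "compound a"):
        print(row["source"], row["doc_id"], row["text"], sep="\t")
```

Each row carries its source name, so results from different silos can be reviewed side by side without losing provenance.
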
Huge (Textual) Data Big Data
  We (i.e. text-miners) are often joining data
    Unstructured
    And structured
    Across silos
  Before the tabular results go to analysis

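To make "joining unstructured and structured data before analysis" concrete, the following Python sketch joins rows extracted from text with a structured lookup table on a shared key. Everything in it is an assumption for illustration: the field names, the keys and the placeholder values are invented, and a real join would usually key on a normalised ontology identifier.

```python
# Rows as they might come out of a text-mining query (placeholder values).
EXTRACTED = [
    {"doc_id": "lit-1", "compound": "compound-a", "finding": "finding-1"},
    {"doc_id": "rep-7", "compound": "compound-c", "finding": "finding-2"},
]

# A structured table from another silo, keyed on the same identifier
# (placeholder values).
STRUCTURED = {
    "compound-a": {"project": "project-1"},
    "compound-b": {"project": "project-2"},
}


def join_extracted(extracted, structured, key):
    """Inner join: keep extracted rows whose key also appears in the table."""
    joined = []
    for row in extracted:
        match = structured.get(row[key])
        if match is not None:
            # Merge text-derived and structured fields into one tabular row.
            joined.append({**row, **match})
    return joined


if __name__ == "__main__":
    for row in join_extracted(EXTRACTED, STRUCTURED, "compound"):
        print(row)
```

Only the rows present on both sides survive, so the table handed to analysis already combines the extracted fact with its structured context.
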
The How of Text Mining
  Text Mining isn't completely shrink-wrapped
  There is, usually, some customization
    To find the parameter value that you're interested in
    To find the value that everyone's interested in, but only in circumstances c
    To find it in datasource X
    To find it in X but only in circumstances c
    To map to ontology A rather than B
  It often makes sense to express these constraints at time of text-mining (not analysis)

Toolbox of Methods for Powerful Querying
  NLP
    Precise linguistic relationships, sentence co-occurrence
    Precise negation, e.g. pressure but not blood pressure
  Terminologies
    Search for concepts and classes, not just keywords
    e.g. cancer and get synonyms and children: Malignant neoplasms, Malignant tumor
  Regular Expressions
    Rule-based pattern matching for e.g. measurements, lab codes, mutations
    e.g. microrna: let-?\d+.* mirn?a?-?\d+.*
  Chemistry
  Fielded Search
    Restrict within particular regions of a document, including nested
    e.g. table cell in table in Description
  High Throughput
    Simultaneous processing of large numbers of items
    e.g. 500 compounds, 500 genes from microarray experiment, etc.

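The two regular-expression ideas on this slide can be tried out in a few lines of Python. This is a sketch under assumptions, not I2E query syntax: it assumes the microRNA patterns are matched case-insensitively against whole whitespace-delimited tokens, and it approximates "pressure but not blood pressure" with a negative lookbehind. The function names and sample sentences are invented for illustration.

```python
import re

# The two microRNA patterns from the slide, combined into one alternation.
# Assumption: case-insensitive, matched against a whole token.
MICRORNA = re.compile(r"let-?\d+.*|mirn?a?-?\d+.*", re.IGNORECASE)

# "pressure" unless immediately preceded by "blood " (regex approximation
# of the slide's negation example).
PRESSURE_NOT_BLOOD = re.compile(r"(?<!blood )\bpressure\b", re.IGNORECASE)


def find_micrornas(text):
    """Return tokens that fully match one of the microRNA patterns."""
    tokens = (tok.strip(".,;:()") for tok in text.split())
    return [tok for tok in tokens if MICRORNA.fullmatch(tok)]


def find_pressure(text):
    """Return start offsets of 'pressure' mentions that are not 'blood pressure'."""
    return [m.start() for m in PRESSURE_NOT_BLOOD.finditer(text)]


if __name__ == "__main__":
    rna_text = "Levels of let-7a, miR-21 and MIRN155 changed, but mirror and letter did not."
    print(find_micrornas(rna_text))  # ['let-7a', 'miR-21', 'MIRN155']

    bp_text = "Blood pressure was normal but intraocular pressure was raised."
    for start in find_pressure(bp_text):
        print(start, bp_text[start:start + 8])
```

Requiring a digit after the prefix is what keeps ordinary words such as "mirror" and "letter" out of the results.
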
Linguistic Processing Using NLP
  Interprets meaning of the text
  Groups words into meaningful units
  Search for different forms of words
    sentences
    noun groups: match entities
    verb groups: match actions
    morphology: different forms
  We find that p42mapk phosphorylates c-myb on serine and threonine.
  Purified recombinant p42 MAPK was found to phosphorylate Wee1.

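The two example sentences show why matching "different forms" matters: the same relation appears with "phosphorylates" and "phosphorylate", and the kinase is written both "p42mapk" and "p42 MAPK". The Python sketch below imitates this with a surface regular expression. It is a deliberately crude stand-in for the noun-group and verb-group analysis the slide describes, and the pattern and function names are assumptions made for illustration.

```python
import re

# Entity variants: "p42mapk", "p42 MAPK", etc.
AGENT = r"(?P<agent>p42\s?mapk)"
# Morphological variants of the verb.
VERB = r"phosphorylat(?:es|ed|ing|e)"
# Agent, then any intervening words, then the verb, then the next token as target.
RELATION = re.compile(
    AGENT + r"\b.*?\b" + VERB + r"\s+(?P<target>[\w-]+)",
    re.IGNORECASE,
)


def extract_phosphorylation(sentence):
    """Return (normalised agent, target) or None if the pattern is absent.

    Limitation: this is word-order matching, not parsing. A passive such as
    "p42mapk is phosphorylated by X" would wrongly yield "by" as the target.
    """
    match = RELATION.search(sentence)
    if match is None:
        return None
    agent = re.sub(r"\s+", "", match.group("agent")).lower()
    return (agent, match.group("target"))


if __name__ == "__main__":
    sentences = [
        "We find that p42mapk phosphorylates c-myb on serine and threonine.",
        "Purified recombinant p42 MAPK was found to phosphorylate Wee1.",
    ]
    for sentence in sentences:
        print(extract_phosphorylation(sentence))
    # ('p42mapk', 'c-myb')
    # ('p42mapk', 'Wee1')
```

Both sentences reduce to the same agent with different targets, which is the kind of tabular output that linguistic processing aims for. The passive-voice limitation noted in the docstring is the reason real systems rely on grammatical structure rather than word order.
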
Discovering extraction patterns...
  We often need to look at the data first (the "huge data") to find the extraction patterns
  Linguistic patterns of expression vary
    Over data sets
    Over time
  This pre-extraction exploration itself needs informing
    By the ontologies and KBs that are already out there
    By the re-use of generally successful strategies

Innovative tools to enable exploration of complex and specialised data sets
  Grant funded by InnovateUK (Dept of BIS and EPSRC)
  Sponsored Partners: Univ. of Essex & Linguamatics
  Project End-date: mid 2016
  Easier discovery and extraction of key facts
    by sharing search strategies rather than sharing just search results
    by using novel algorithms for semantic information extraction
  Linking information from multiple resources to help users find similar and relevant information

Summary
  Text Mining: the extraction of structured information from unstructured text
  It's a natural precursor to large-scale analytics
  It's also a big data task itself
    Voluminous source data
    Distributed over many silos
    Expressed in different ways
  It's not just a precursor
    We're (already) joining data at extraction time
    We're researching exploiting and joining more data at the earliest phases of data exploration, prior to extraction

Thank You
  For more information
    Visit: www.linguamatics.com
    Contact: Ian Lewin, ian.lewin@linguamatics.com