Teach Document-to-Structure to be Trilingual: Extract, Display and Search Chemical Information within English, Chinese and Japanese Patents. David Deng, Daniel Bonniot. ACS San Francisco, Aug 10th, 2014
I recently travelled to Washington D.C. for a client visit
After that, I flew to Shanghai
Then a short flight to Tokyo
ChemAxon's Chemistry Text Mining Suite: Naming, Structure to Name, Name to Structure, Document to Structure, Document to Database, JChem for SharePoint
Text Chemical Information in Documents: D2S
Structure Images in Documents: D2S supports CLiDE (Keymodule), OSRA (NIH), Imago (GGA)
Non-Text Documents, a.k.a. Patents: OCR, Full Text
Automatic OCR Error Correction
(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate → (2R)-2-methylsulfanyl-3-hydroxybutanedioate
Λr-benzyl-Λr-[3-(lH-tetrazol-5-yl)phenyl]propanamide → N-benzyl-N-[3-(1H-tetrazol-5-yl)phenyl]propanamide
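To make the idea concrete, here is a minimal sketch of confusion-table correction for single name fragments. It is not ChemAxon's implementation: the `CONFUSIONS` table, the toy `VOCAB`, and the function names are all assumptions for illustration, and a real system would validate candidates with a full name-to-structure parser rather than a word list.

```python
from itertools import product

# Hypothetical table of common OCR confusions: text as read -> possible readings.
CONFUSIONS = {
    "rn": ["rn", "m"],
    "1": ["1", "l"],
    "0": ["0", "o"],
    "l": ["l", "1"],
}
# Try longer keys first so "rn" is considered before single characters.
KEYS = sorted(CONFUSIONS, key=len, reverse=True)

# Toy vocabulary of valid name fragments (an assumption for this sketch).
VOCAB = {"methyl", "sulfanyl", "hydroxy", "butanedioate"}


def candidates(token):
    """Yield every reading of token allowed by the confusion table."""
    options = []
    i = 0
    while i < len(token):
        for key in KEYS:
            if token.startswith(key, i):
                options.append(CONFUSIONS[key])
                i += len(key)
                break
        else:
            options.append([token[i]])
            i += 1
    for combo in product(*options):
        yield "".join(combo)


def correct(token):
    """Return the first candidate found in VOCAB, or the token unchanged."""
    if token in VOCAB:
        return token
    for cand in candidates(token):
        if cand in VOCAB:
            return cand
    return token


for fragment in ["rnethyl", "sulfany1", "hydr0xy", "butanedi0ate"]:
    print(fragment, "->", correct(fragment))
```

The candidate set grows exponentially with the number of ambiguous characters, so a practical version would need to prune or score candidates instead of enumerating them all.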
Document Annotation
ChiKEL Project: ChiKEL (Chemically Informed Knowledge Extraction from Literature). Linguamatics / ChemAxon collaboration. Two-year project supported by the European Union EUREKA's Eurostars Programme. Combines Natural Language Processing-based text mining with structure search and visualization.
[Chart: F score, precision and recall on USPTO documents for versions 5.8, 5.9, 5.10, 5.11 and dev]
Patent Filings at IP5 Offices 1980-2012
Chinese Name to Structure (N2S): 2-(乙酰氧基)苯甲酸, 阿司匹林, 2-(acetyloxy)benzoic acid, Aspirin, Acetylsalicylate, Easprin, 50-78-2
The Real Chinese N2S: 2-(乙酰氧基)苯甲酸 = 2-(acetyl oxy) benzoic acid
The Challenges (Tip of the Iceberg): Ester & Salt. 乙酸乙酯 = Ethyl Acetate
The Challenges (Tip of the Iceberg). English: name alterations, e.g. buta + ane = butane (丁烷). Chinese: same characters, different meanings: 盐 = salt, 酸 = acid, but 盐酸 = hydrochloric acid
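One common way to handle characters whose meaning depends on their neighbours is greedy longest-match tokenization against a lexicon. The sketch below illustrates that general technique only; the tiny `LEXICON` and the `tokenize` function are assumptions and do not describe how the actual Chinese N2S parser works.

```python
# Toy lexicon (an assumption for this sketch): multi-character entries
# must win over their single-character parts.
LEXICON = {
    "盐酸": "hydrochloric acid",
    "盐": "salt",
    "酸": "acid",
    "乙酸": "acetic acid",
    "乙酯": "ethyl ester",
}
MAX_LEN = max(len(entry) for entry in LEXICON)


def tokenize(name):
    """Split name into the longest lexicon entries, scanning left to right."""
    tokens = []
    i = 0
    while i < len(name):
        for size in range(min(MAX_LEN, len(name) - i), 0, -1):
            piece = name[i:i + size]
            if piece in LEXICON:
                tokens.append(piece)
                i += size
                break
        else:
            tokens.append(name[i])  # unknown character passes through
            i += 1
    return tokens


for name in ["盐酸", "乙酸乙酯", "盐酸盐"]:
    print(name, [(t, LEXICON.get(t, "?")) for t in tokenize(name)])
```

Here 盐酸 is read as one token (hydrochloric acid) rather than salt + acid, and 乙酸乙酯 splits into acetic acid + ethyl ester, which an English name generator would still have to reorder into "ethyl acetate".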
Validation 1: Chinese Name to Structure. Test set: 26,017 English + Chinese names. As of July 2014: conversion rate = 82.7%, accuracy = 92.5%
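The slide does not define the two metrics, so the following sketch states one plausible reading as an assumption: conversion rate is the fraction of names that yield any structure, and accuracy is the fraction of converted names whose structure matches the reference. The function name and the toy data are hypothetical.

```python
def evaluate(results):
    """results: list of (predicted, expected) pairs; predicted is None
    when the name could not be converted."""
    total = len(results)
    converted = [(p, e) for p, e in results if p is not None]
    correct = sum(1 for p, e in converted if p == e)
    conversion_rate = len(converted) / total if total else 0.0
    # Assumption: accuracy is measured over converted names only.
    accuracy = correct / len(converted) if converted else 0.0
    return conversion_rate, accuracy


# Toy data: SMILES strings compared literally. A real evaluation would
# compare canonical structures, not raw strings.
toy = [
    ("CC(=O)Oc1ccccc1C(=O)O", "CC(=O)Oc1ccccc1C(=O)O"),  # converted, correct
    ("CCOC(C)=O", "CCOC(C)=O"),                          # converted, correct
    ("CCC", "CCCC"),                                     # converted, wrong
    (None, "c1ccccc1"),                                  # not converted
]
rate, acc = evaluate(toy)
print(f"conversion rate = {rate:.1%}, accuracy = {acc:.1%}")
```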
Validation 2: Chinese Patents. 54K Chinese patents with automated English translation. Filter: structures with at least 20 heavy atoms, and patents with at least 20 such structures. Remaining: 2,108 patents
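The two-stage filter can be sketched as follows. The data layout (a dict of patent ID to `(structure_id, heavy_atom_count)` pairs), the IDs, and the function name are assumptions, and the sketch assumes the patent-level threshold is applied after the structure-level one, which the slide does not state explicitly.

```python
def filter_patents(patents, min_heavy_atoms=20, min_structures=20):
    """Keep patents that still have enough unique structures after
    small structures are dropped."""
    kept = {}
    for patent_id, structures in patents.items():
        # Unique structures that meet the heavy-atom threshold.
        large = {sid for sid, heavy_atoms in structures
                 if heavy_atoms >= min_heavy_atoms}
        if len(large) >= min_structures:
            kept[patent_id] = large
    return kept


# Hypothetical toy input; heavy-atom counts would normally come from a
# cheminformatics toolkit.
patents = {
    "CN-A": [("s1", 25), ("s2", 30), ("s3", 5), ("s1", 25)],
    "CN-B": [("s4", 22), ("s5", 8)],
}
# A lower patent threshold is used here only to keep the example small.
kept = filter_patents(patents, min_structures=2)
print(sorted(kept))
```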
Chinese OCR Error: 3-(笨基)丙酸
Chinese OCR Error. Correct: 3-(苯基)丙酸, where 苯 = benzene, 苯基 = phenyl, 丙酸 = propionic acid. OCR output: 3-(笨基)丙酸, where 笨 = stupid, 笨基 = "stupid-yl"
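A minimal sketch of this kind of correction replaces a visually confusable character only when the replacement completes a known chemical term. The `CONFUSABLE` table, the `TERMS` set, and both function names are assumptions for illustration, not the product's actual logic.

```python
# Hypothetical look-alike table: character as read -> likely intended character.
CONFUSABLE = {"笨": "苯"}
# Toy set of known chemical terms (an assumption for this sketch).
TERMS = {"苯基", "丙酸"}


def covers_term(text, i):
    """True if a known term occurs in text over a span that includes index i."""
    for term in TERMS:
        first = max(0, i - len(term) + 1)
        for start in range(first, i + 1):
            if text.startswith(term, start):
                return True
    return False


def correct(text):
    """Swap in a look-alike only where it yields a known term."""
    chars = list(text)
    for i in range(len(chars)):
        fixed = CONFUSABLE.get(chars[i])
        if fixed is None:
            continue
        trial = chars[:i] + [fixed] + chars[i + 1:]
        if covers_term("".join(trial), i):
            chars = trial
    return "".join(chars)


print(correct("3-(笨基)丙酸"))
```

Requiring a term match keeps the character untouched in ordinary prose where 笨 is the intended word.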
Automatic OCR Error Correction
(2R)-2-rnethylsulfany1-3-hydr0xybutanedi0ate → (2R)-2-methylsulfanyl-3-hydroxybutanedioate
Λr-benzyl-Λr-[3-(lH-tetrazol-5-yl)phenyl]propanamide → N-benzyl-N-[3-(1H-tetrazol-5-yl)phenyl]propanamide
我们日前巳经开友了中文化字名称的 OCR 白动纠错工力能 → 我们目前已经开发了中文化学名称的 OCR 自动纠错功能 ("We have now developed automatic OCR error correction for Chinese chemical names")
Validation 2: Chinese patents Average number of unique chemical structures in a patent
Patent Filings at IP5 Offices 1980-2012
How Different Can Japanese and Chinese Names Be? In general, when translating English words, Chinese translates by meaning, whereas Japanese imitates the sound.
methyl: Chinese 甲基 (meth-yl); Japanese メチル (me-chi-ru, from me-thy-l)
methane: Chinese 甲烷 (meth-ane); Japanese メタン (me-ta-n, from me-thane)
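The contrast suggests two different lookup strategies, sketched below under stated assumptions: Chinese names can be composed from meaning-bearing characters, while Japanese katakana names are matched as whole transliterated words. The tables and function names are hypothetical and cover only a handful of simple cases.

```python
# Chinese: each character carries meaning, so names compose from parts.
STEMS = {"甲": "meth", "乙": "eth", "丙": "prop", "丁": "but"}
SUFFIXES = {"基": "yl", "烷": "ane"}

# Japanese: katakana spells out the sound, so whole words are looked up.
KATAKANA = {
    "メチル": "methyl",
    "メタン": "methane",
    "エチル": "ethyl",
    "エタン": "ethane",
}


def chinese_to_english(name):
    """Compose stem + suffix for two-character names; None if unknown."""
    if len(name) == 2 and name[0] in STEMS and name[1] in SUFFIXES:
        return STEMS[name[0]] + SUFFIXES[name[1]]
    return None


def japanese_to_english(name):
    """Whole-word lookup of a transliterated name; None if unknown."""
    return KATAKANA.get(name)


print(chinese_to_english("甲基"), japanese_to_english("メチル"))
print(chinese_to_english("甲烷"), japanese_to_english("メタン"))
```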
Japanese Validation Set. Test set: 45,000 English + Japanese names. As of July 2014: conversion rate = 85%, accuracy = 97%. Note: IUPAC/systematic names only; common names are a different beast
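Extract Chemical Information from Asian Patents. Additional challenge: no spaces, e.g. 如式I所示的{5-[2-(4-正辛基苯基)乙基]-2,2-二甲基-1,3-二氧六环-5-基}氨基甲酸叔丁酯是合成芬戈莫德及其衍生物的重要中间体 ("The tert-butyl {5-[2-(4-n-octylphenyl)ethyl]-2,2-dimethyl-1,3-dioxan-5-yl}carbamate shown in formula I is an important intermediate in the synthesis of fingolimod and its derivatives"). XML markup: patent metadata, encoding of characters, tags (e.g. <p>). Document annotation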
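Without spaces, name boundaries have to be found in running text. One simple heuristic, shown here as an assumption rather than the method D2S uses, is to collect maximal runs of characters that can appear in a systematic name. The character set is hand-picked for this one sentence; `NAME_CHARS`, `find_names`, and `min_length` are hypothetical.

```python
import re

# Characters allowed inside a candidate name: digits, locant punctuation,
# brackets, and a small hand-picked set of chemical morphemes (assumption).
NAME_CHARS = re.compile(r"[0-9\-,\[\](){}正辛基苯乙二甲氧六环氨酸叔丁酯]+")


def find_names(text, min_length=4):
    """Return maximal runs of name characters at least min_length long."""
    return [m.group(0) for m in NAME_CHARS.finditer(text)
            if len(m.group(0)) >= min_length]


sentence = ("如式I所示的{5-[2-(4-正辛基苯基)乙基]-2,2-二甲基-1,3-二氧六环-5-基}"
            "氨基甲酸叔丁酯是合成芬戈莫德及其衍生物的重要中间体")
for name in find_names(sentence):
    print(name)
```

A character-class scan like this over-matches in real text, so candidates would still need to be confirmed by attempting a name-to-structure conversion.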
Asian Language Document Annotation
Document to Database: chemical information, document metadata, searchable, customizable, document annotation
D2DB video
In Summary. Asian-language patent mining: essential to IP protection; general-purpose machine translation is not enough. ChemAxon's Chemistry Text Mining Suite: fast and reliable conversion; extracts chemical information from any document in English, Chinese and Japanese; Chinese OCR error correction is a unique and important feature; flexible usage
Flexible Usage
Automated Markush Extraction?
Computer-assisted Markush Structures Curation from Patent Documents. David Deng, Arpad Figyelmesi. Monday, August 11, 10:10 AM. Location: Palace Hotel, Presidio Room
Acknowledgements: Daniel Bonniot (N2S, D2S, D2DB, DA, CN2S, JN2S); 邓巍 David Deng (CN2S, JN2S)