METHODS IN MEDICAL INFORMATICS





Chapman & Hall/CRC Mathematical and Computational Biology Series
METHODS IN MEDICAL INFORMATICS
Fundamentals of Healthcare Programming in Perl, Python, and Ruby
Jules J. Berman
CRC Press, Taylor & Francis Group
Boca Raton London New York
CRC Press is an imprint of the Taylor & Francis Group, an Informa business
A CHAPMAN & HALL BOOK

Contents

Preface
Nota Bene
About the Author

Part I Fundamental Algorithms and Methods of Medical Informatics

Chapter 1 Parsing and Transforming Text Files
  1.1 Peeking into Large Files
    1.1.1
    1.1.2 Analysis
  1.2 Paging through Large Text Files
    1.2.1
    1.2.2 Analysis
  1.3 Extracting Lines that Match a Regular Expression
    1.3.1
    1.3.2 Analysis
  1.4 Changing Every File in a Subdirectory
    1.4.1
    1.4.2 Analysis
  1.5 Counting the Words in a File
    1.5.1
    1.5.2 Analysis
  1.6 Making a Word List with Occurrence Tally
    1.6.1
    1.6.2 Analysis
  1.7 Using Printf Formatting Style
    1.7.1
    1.7.2 Analysis

Chapter 2 Utility Scripts
  2.1 Random Numbers
    2.1.1
    2.1.2 Analysis
  2.2 Converting Non-ASCII to Base64 ASCII
    2.2.1
    2.2.2 Analysis
  2.3 Creating a Universally Unique Identifier
    2.3.1
    2.3.2 Analysis
  2.4 Splitting Text into Sentences
    2.4.1
    2.4.2 Analysis
  2.5 One-Way Hash on a Name
    2.5.1
    2.5.2 Analysis
  2.6 One-Way Hash on a File
    2.6.1
    2.6.2 Analysis
  2.7 A Prime Number Generator
    2.7.1
    2.7.2 Analysis

Chapter 3 Viewing and Modifying Images
  3.1 Viewing a JPEG Image
    3.1.1
    3.1.2 Analysis
  3.2 Converting between Image Formats
    3.2.1
    3.2.2 Analysis
  3.3 Batch Conversions
    3.3.1
    3.3.2 Analysis
  3.4 Drawing a Graph from List Data
    3.4.1
    3.4.2 Analysis
  3.5 Drawing an Image Mashup
    3.5.1
    3.5.2 Analysis

Chapter 4 Indexing Text
  4.1 Zipf Distribution of a Text File
    4.1.1
    4.1.2 Analysis
  4.2 Preparing a Concordance
    4.2.1
    4.2.2 Analysis
  4.3 Extracting Phrases
    4.3.1
    4.3.2 Analysis
  4.4 Preparing an Index
    4.4.1
    4.4.2 Analysis
  4.5 Comparing Texts Using Similarity Scores
    4.5.1
    4.5.2 Analysis

Part II Medical Data Resources

Chapter 5 The National Library of Medicine's Medical Subject Headings (MeSH)
  5.1 Determining the Hierarchical Lineage for MeSH Terms
    5.1.1
    5.1.2 Analysis
  5.2 Creating a MeSH Database
    5.2.1
    5.2.2 Analysis
  5.3 Reading the MeSH Database
    5.3.1
    5.3.2 Analysis
  5.4 Creating an SQLite Database for MeSH
    5.4.1
    5.4.2 Analysis
  5.5 Reading the SQLite MeSH Database
    5.5.1
    5.5.2 Analysis

Chapter 6 The International Classification of Diseases
  6.1 Creating the ICD Dictionary
    6.1.1
    6.1.2 Analysis
  6.2 Building the ICD-O (Oncology) Dictionary
    6.2.1
    6.2.2 Analysis

Chapter 7 SEER: The Cancer Surveillance, Epidemiology, and End Results Program
  7.1 Parsing the SEER Data Files
    7.1.1
    7.1.2 Analysis
  7.2 Finding the Occurrences of All Cancers in the SEER Data Files
    7.2.1
    7.2.2 Analysis
  7.3 Finding the Age Distributions of the Cancers in the SEER Data Files
    7.3.1
    7.3.2 Analysis

Chapter 8 OMIM: The Online Mendelian Inheritance in Man
  8.1 Collecting the OMIM Entry Terms
    8.1.1
    8.1.2 Analysis
  8.2 Finding Inherited Cancer Conditions
    8.2.1
    8.2.2 Analysis

Chapter 9 PubMed
  9.1 Building a Large Text Corpus of Biomedical Information
    9.1.1
    9.1.2 Analysis
  9.2 Creating a List of Doublets from a PubMed Corpus
    9.2.1
    9.2.2 Analysis
  9.3 Downloading Gene Synonyms from PubMed
  9.4 Downloading Protein Synonyms from PubMed

Chapter 10 Taxonomy
  10.1 Finding a Taxonomic Hierarchy
    10.1.1
    10.1.2 Analysis
  10.2 Finding the Restricted Classes of Human Infectious Pathogens
    10.2.1
    10.2.2 Analysis

Chapter 11 Developmental Lineage Classification and Taxonomy of Neoplasms
  11.1 Building the Doublet Hash
    11.1.1
    11.1.2 Analysis
  11.2 Scanning the Literature for Candidate Terms
    11.2.1
    11.2.2 Analysis
  11.3 Adding Terms to the Neoplasm Classification
    11.3.1
    11.3.2 Analysis
  11.4 Determining the Lineage of Every Neoplasm Concept
    11.4.1
    11.4.2 Analysis

Chapter 12 U.S. Census Files
  12.1 Total Population of the United States
    12.1.1
    12.1.2 Analysis
  12.2 Stratified Distribution for the U.S. Census
    12.2.1
    12.2.2 Analysis
  12.3 Adjusting for Age
    12.3.1
    12.3.2 Analysis

Chapter 13 Centers for Disease Control and Prevention Mortality Files
  13.1 Death Certificate Data
  13.2 Obtaining the CDC Data Files
  13.3 How Death Certificates Are Represented in Data Records
  13.4 Ranking, by Number of Occurrences, Every Condition in the CDC Mortality Files
    13.4.1
    13.4.2 Analysis

Part III Primary Tasks of Medical Informatics

Chapter 14 Autocoding
  14.1 A Neoplasm Autocoder
    14.1.1
    14.1.2 Analysis
  14.2 Recoding

Chapter 15 Text Scrubber for Deidentifying Confidential Text
  15.1
  15.2 Analysis

Chapter 16 Web Pages and CGI Scripts
  16.1 Grabbing Web Pages
    16.1.1
    16.1.2 Analysis
  16.2 CGI Script for Searching the Neoplasm Classification
    16.2.1
    16.2.2 Analysis

Chapter 17 Image Annotation
  17.1 Inserting a Header Comment
    17.1.1
    17.1.2 Analysis
  17.2 Extracting the Header Comment in a JPEG Image File
    17.2.1
    17.2.2 Analysis
  17.3 Inserting IPTC Annotations
  17.4 Extracting Comment, EXIF, and IPTC Annotations
    17.4.1
    17.4.2 Analysis
  17.5 Dealing with DICOM
  17.6 Finding DICOM Images
  17.7 DICOM-to-JPEG Conversion
    17.7.1
    17.7.2 Analysis

Chapter 18 Describing Data with Data, Using XML
  18.1 Parsing XML
    18.1.1
    18.1.2 Analysis
    18.1.3 Resource Description Framework (RDF)
  18.2 Dublin Core Metadata
  18.3 Insert an RDF Document into an Image File
    18.3.1
    18.3.2 Analysis
  18.4 Insert an Image File into an RDF Document
    18.4.1
    18.4.2 Analysis
  18.5 RDF Schema
  18.6 Visualizing an RDF Schema with GraphViz
  18.7 Obtaining GraphViz
  18.8 Converting a Data Structure to GraphViz
    18.8.1
    18.8.2 Analysis

Part IV Medical Discovery

Chapter 19 Case Study: Emphysema Rates
  19.1
  19.2 Analysis

Chapter 20 Case Study: Cancer Occurrence Rates
  20.1
  20.2 Analysis

Chapter 21 Case Study: Germ Cell Tumor Rates across Ethnicities
  21.1
  21.2 Analysis

Chapter 22 Case Study: Ranking the Death-Certifying Process, by State
  22.1
  22.2 Analysis

Chapter 23 Case Study: Data Mashups for Epidemics
  23.1 Tally of Coccidioidomycosis Cases by State
    23.1.1
    23.1.2 Analysis
  23.2 Creating the Map Mashup
    23.2.1
    23.2.2 Analysis

Chapter 24 Case Study: Sickle Cell Rates
  24.1
  24.2 Analysis

Chapter 25 Case Study: Site-Specific Tumor Biology
  25.1 Anatomic Origins of Mesotheliomas
  25.2 Mesothelioma Records in the SEER Data Sets
    25.2.1
    25.2.2 Analysis
  25.3 Graphic Representation
    25.3.1
    25.3.2 Analysis

Chapter 26 Case Study: Bimodal Tumors
  26.1
  26.2 Analysis

Chapter 27 Case Study: The Age of Occurrence of Precancers
  27.1
  27.2 Analysis

Epilogue for Healthcare Professionals and Medical Scientists
  Learn One or More Open Source Programming Languages
  Don't Agonize Over Which Language You Should Choose
  Learn Algorithms
  Unless You Are a Professional Programmer, Relax and Enjoy Being a Newbie
  Do Not Delegate Simple Programming Tasks to Others
  Break Complex Tasks into Simple Methods and Algorithms
  Write Fast Scripts
  Concentrate on the Questions, Not the Answers

Appendix
  How to Acquire Ruby
  How to Acquire Perl
  How to Acquire Python
  How to Acquire RMagick
  How to Acquire SQLite
  How to Acquire the Public Data Files Used in This Book
  Other Publicly Available Files, Data Sets, and Utilities

Index