Syllabus
HMI 7437: Data Warehousing and Data/Text Mining for Healthcare
University of Missouri-Columbia

1. Instructor
Illhoi Yoo, Ph.D.
Office: 404 Clark Hall
Email: muteaching@gmail.com
Office hours: TBA
Classroom: TBA
Class hours: TBA

2. Prerequisites
Students are expected to have taken the following courses before registering for this course:
• Algorithm Design/Programming I and II (2 semesters, or CS 1050 and 2050, or equivalent)
• Database Management Systems I (CS 4380/7380 or equivalent)

3. Course Background
A huge amount of clinical, biomedical, genomic, and healthcare data has been produced and collected in disparate repository systems. Most of these data have piled up on an unprecedented scale without any analysis. Using data warehouse and data/text mining technologies, we can systematically integrate and reorganize multiple heterogeneous data sources and extract meaningful hidden patterns from them. Ultimately, these technologies enable us to transform raw healthcare data into healthcare enterprise decisions. Thus, one of the ultimate goals of data warehouse and data/text mining in our field could be to enable physicians to make clinical decisions based on each patient's genome, or even to facilitate personalized medicine based on that genome.

4. Course Description
This course introduces the basic concepts of data warehouse and data/text mining, explains why we need these technologies, and shows how they can be applied to healthcare problems.

5. Course Objectives
• Understanding the major concepts of data warehouse (DW) and data/text mining (DM/TM)
• Understanding why we need DW and DM/TM for healthcare
• Understanding how DW, DM, and TM are related to or different from information retrieval (IR), information extraction (IE), database query, statistics, and machine learning (ML)
• Introducing major DM algorithms such as decision trees, clustering, classification, association, etc.
• Understanding how these algorithms can be applied to biomedical/healthcare problems. For example, you could:
  o Forecast a patient's disease(s) based on the patient's current record
    - More efficient disease prevention
    - Cost-effective medical examination
  o Determine which medical examinations are more accurate for diagnosing a specific disease
    - Multiple cheap medical exams could be more accurate for diagnosing diseases than one expensive medical exam.
    - Cost-effective medical examination
• Demonstrating major DW and DM/TM solutions available in the market or on the Internet
  o Commercial DM/TM solutions and research-oriented DM/TM solutions
• Introducing the importance of ontologies in DM/TM
• Introducing how life science, healthcare, and health insurance companies have used DM, through their DM success stories

6. Course Projects
The course project must be done individually. The written project report must be submitted in electronic form only (e.g., MS Word). There are four kinds of course projects, since students' needs for the course could be very different.
• Survey projects
  o Conducting a comprehensive survey of DM/TM applications
    - Searching for articles in the qualified journals or proceedings (see Appendix) that discuss how DM/TM has been applied to healthcare systems, and summarizing them
• Research-oriented (theoretical) projects
  o Must incorporate new/novel ideas in DW, DM, or TM
  o The report should be written in the style of a scientific paper, including related work.
• Implementation (programming) projects
  o You can implement existing or novel DM/TM algorithms.
  o You should include the source code, screen captures, test data sets, the limitations of the programs, etc.
• Students' own projects
  o Must be deliverable
Alternatively, on request, the instructor can recommend a course project based on a student's background and current occupation.

Course Project Procedure and Deadline

Deadline            Submission
------------------  ------------------------------------------------------------
9/24/07             Proposal submission; you will be notified of its approval within a few days
Week 5-13           Working on your course project
Week 14 (11/26/07)  Course project report submission
Week 15             Revising project report based on instructor's review
12/3/07             Final course project report submission

7. Evaluation
20% Assignments
• HW1: Business Intelligence success stories in healthcare
  o 10% of course grade
  o Given in the 4th week
• Lab1: Data/Text Mining Lab
  o 10% of course grade
  o Given in the 11th week
25% Mid-term exam
25% Final-term exam
30% Project: project report (15%) and final revised submission (15%)

8. Course Schedule

Dates               Topics                                                    Reading
------------------  --------------------------------------------------------  --------------------
Week 1 (8/27/07)    Database I
Week 2 (9/4/07)     Database II
Week 3 (9/10/07)    Introduction to data mining                               [HK]Ch1, [RG]Ch1,2,5
Week 4 (9/17/07)    Business Intelligence (BI) success stories in healthcare
Week 5 (9/24/07)    Data Preprocessing                                        [HK]Ch2, [RG]Ch5
Week 6 (10/1/07)    Data Warehouse (DW) and OLAP                              [HK]Ch3, [RG]Ch6
Week 7 (10/8/07)    Mid-term exam
Week 8 (10/15/07)   Data Mining Algorithm: Association Rules                  [HK]Ch5, [RG]Ch2,3
Week 9 (10/22/07)   Data Mining Algorithm: Classification                     [HK]Ch6, [RG]Ch2,3
Week 10 (10/29/07)  Data Mining Algorithm: Clustering                         [HK]Ch7, [RG]Ch2,3
Week 11 (11/5/07)   Data Mining Lab                                           [RG]Ch4,A,B
Week 12 (11/12/07)  Text Mining for MEDLINE                                   [HK]Ch10
Week 13 (11/19/07)  Thanksgiving
Week 14 (11/26/07)  Course Project Report Submission;
                    Revising Project Report based on peer-reviews
Week 15 (12/3/07)   Final Course Project Report Submission
Week 16 (12/10/07)  Final-term exam

The instructor will provide lecture notes for each topic shown in the course schedule above. The lecture notes draw on the required and recommended textbooks (listed in Section 10 below) and, to keep the notes self-contained and for students' convenience, include tables and figures from those references; you should note the citations (e.g., [HK], [HMS]) in the lecture notes. The lecture notes will be based primarily on [HK].

9. Useful Resources

9.1 Data/Text Mining Tools
• iData Analyzer (MS Excel-based data mining tool)
  o How to get it: Data Mining: A Tutorial-based Primer
  o Note: The CD includes two real medical data sets as well as other real data sets: a cardiology patient dataset from a VA Medical Center in CA and a spine clinic dataset.
• MineSet
  o How to get it: http://www.purpleinsight.com
• Oracle
  o How to get it: http://www.oracle.com/solutions/business_intelligence/index.html
• Microsoft
  o How to get it: http://www.microsoft.com/sql/technologies/dm/default.mspx
• SAS
  o How to get it: http://www.sas.com/technologies/bi/
• WEKA
  o How to get it: http://www.cs.waikato.ac.nz/ml/weka/
  o Note: Research-oriented data mining tool (open source SW)

9.2 Data/Text Mining Tutorials
• SQL Server 2005 Data Mining Tutorial
  o http://msdn2.microsoft.com/en-us/library/ms167167.aspx
• SQL Server 2005 Data Mining Concepts
  o http://msdn2.microsoft.com/en-us/library/ms174949.aspx
• Solving Business Problems with Oracle Data Mining
  o http://www.oracle.com/technology/obe/obe10gdb/bidw/odm/odm.htm

9.3 Success Stories
Study how companies have used BI.
• SAS Customer Success in Healthcare and Health Insurance
  o http://www.sas.com/success/industry.html#healthcare
• Oracle Business Intelligence (BI) Customers
  o http://www.oracle.com/customers/solutions/bi.html
• Success Stories for Microsoft products
  o http://www.microsoft.com/casestudies/

9.4 Training
• National Center for Biotechnology Information (NCBI)'s PowerScripting: FREE 4-day course. You will learn how to take advantage of NCBI databases using programming languages.
  o http://www.ncbi.nlm.nih.gov/class/powertools/eutils/course.html

10. References

Required Textbooks:
[HK] Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Second Edition, Morgan Kaufmann, 2006, ISBN 1-55860-901-6
[RG] Data Mining: A Tutorial-based Primer, Richard J. Roiger and Michael W. Geatz, Addison Wesley, 2003, ISBN 0-201-74128-8

Recommended Textbooks:
[HMS] Principles of Data Mining, D. Hand, H. Mannila, and P. Smyth, MIT Press, 2001, ISBN 0-262-08290-X
[WF] Data Mining: Practical Machine Learning Tools and Techniques, Ian H. Witten and Eibe Frank, Second Edition, Morgan Kaufmann, 2005, ISBN 0-12-088407-0
[TSK] Introduction to Data Mining, P. Tan, M. Steinbach, and V. Kumar, Pearson Education, 2006, ISBN 0-321-32136-7
[Dunham] Data Mining: Introductory and Advanced Topics, Margaret H. Dunham, Prentice Hall, 2003, ISBN 0-13-088892-3
[KR] The Data Warehouse Toolkit, R. Kimball and M. Ross, Wiley, 2002, ISBN 0-471-20024-7
[AM] Text Mining for Biology and Biomedicine, S. Ananiadou and J. McNaught (editors), Artech House, ISBN 1-58053-984-X

11. Academic Dishonesty
Academic integrity is fundamental to the activities and principles of a university. All members of the academic community must be confident that each person's work has been responsibly and honorably acquired, developed, and presented. Any effort to gain an advantage not given to all students is dishonest whether or not the effort is successful. The academic community regards breaches of the academic integrity rules as extremely serious matters. Sanctions for such a breach may include academic sanctions from the instructor, including failing the course for any violation, to disciplinary sanctions ranging from probation to expulsion. When in doubt about plagiarism, paraphrasing, quoting, collaboration, or any other form of cheating, consult the course instructor.

12. Statement for ADA
If you need accommodations because of a disability, if you have emergency medical information to share with me, or if you need special arrangements in case the building must be evacuated, please inform me immediately. Please see me privately after class, or at my office.
Office location: 404 Clark Hall
Office hours: by appointment
To request academic accommodations (for example, a notetaker), students must also register with the Office of Disability Services (http://disabilityservices.missouri.edu), S5 Memorial Union, 882-4696. It is the campus office responsible for reviewing documentation provided by students requesting academic accommodations, and for accommodations planning in cooperation with students and instructors, as needed and consistent with course requirements. For other MU resources for students with disabilities, click on "Disability Resources" on the MU homepage.

13. Statement for Intellectual Pluralism
The University community welcomes intellectual diversity and respects student rights. Students who have questions concerning the quality of instruction in this class may address concerns to either the Departmental Chair or Divisional leader or Director of the Office of Students Rights and Responsibilities (http://osrr.missouri.edu/). All students will have the opportunity to submit an anonymous evaluation of the instructor(s) at the end of the course.

Appendix: Qualified Journals and Proceedings
MU Libraries: http://mulibraries.1cate.com/

• ACM TRANSACTIONS ON INFORMATION SYSTEMS (4.529¹) ISSN: 1046-8188
• ARTIFICIAL INTELLIGENCE IN MEDICINE (1.882) ISSN: 0933-3657
• COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE (0.788) ISSN: 0169-2607
• COMPUTERS IN BIOLOGY AND MEDICINE (1.358) ISSN: 0010-4825
• DATA MINING AND KNOWLEDGE DISCOVERY (2.105) ISSN: 1384-5810
• IEEE INTELLIGENT SYSTEMS (2.56) ISSN: 1541-1672
• IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE (1.376) ISSN: 1089-7771
• IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (1.758) ISSN: 1041-4347
• INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS (1.374) ISSN: 1386-5056
• JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION (4.339) ISSN: 1067-5027
• MEDICAL & BIOLOGICAL ENGINEERING & COMPUTING (1.028) ISSN: 0140-0118
• MEDICAL DECISION MAKING (1.822) ISSN: 0272-989X
• Journal of Biomedical Informatics (2.388) ISSN: 1532-0464
• BMC Bioinformatics (4.96) ISSN: 1471-2105
• AMIA Proceedings
• Conference on Knowledge Discovery in Data (KDD) Proceedings (visit the ACM Digital Library)
• Etc.

¹ This number is the journal's impact factor (IF), which has been used as a measure of a journal's importance.