Development of a Data Mining Course for Undergraduate Students



Similar documents
Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Subject Description Form

Introduction to Data Mining

Data Mining and Business Intelligence CIT-6-DMB. Faculty of Business 2011/2012. Level 6

Data Mining Solutions for the Business Environment

Information Management course

Computer Science. B.S. in Computer & Information Science. B.S. in Computer Information Systems

Database Marketing, Business Intelligence and Knowledge Discovery

The University of Jordan

Learning outcomes. Knowledge and understanding. Competence and skills

DATA MINING TECHNIQUES AND APPLICATIONS

Principles of Data Mining by Hand&Mannila&Smyth

Data Mining Analytics for Business Intelligence and Decision Support

Data Mining Part 5. Prediction

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Concept Paper: MS in Business Analytics

Data Science and Business Analytics Certificate Data Science and Business Intelligence Certificate

Introduction. A. Bellaachia Page: 1

Master of Science in Health Information Technology Degree Curriculum

Introduction to Data Mining

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI

A Business Intelligence Training Document Using the Walton College Enterprise Systems Platform and Teradata University Network Tools Abstract

An Introduction to Data Mining

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

How To Use Neural Networks In Data Mining

Bachelor of Information Technology

Information and Decision Sciences (IDS)

(b) How data mining is different from knowledge discovery in databases (KDD)? Explain.

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Proposal for New Program: BS in Data Science: Computational Analytics

Sunnie Chung. Cleveland State University

Accelerated Undergraduate/Graduate (BS/MS) Dual Degree Program in Computer Science

Computer Science. General Education Students must complete the requirements shown in the General Education Requirements section of this catalog.

College of Health and Human Services. Fall Syllabus

Prediction of Heart Disease Using Naïve Bayes Algorithm

The University of Connecticut. School of Engineering COMPUTER SCIENCE GUIDE TO COURSE SELECTION AY Revised July 27, 2015.

Bachelor of Bachelor of Computer Science

Proposal for New Program: Minor in Data Science: Computational Analytics

Requirements Fulfilled This course is required for all students majoring in Information Technology in the College of Information Technology.

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Federico Rajola. Customer Relationship. Management in the. Financial Industry. Organizational Processes and. Technology Innovation.

Customer Classification And Prediction Based On Data Mining Technique

AMIS 7640 Data Mining for Business Intelligence

These regulations apply to students admitted to the BBA(IS) degree in the academic year and thereafter.

Quality Control of National Genetic Evaluation Results Using Data-Mining Techniques; A Progress Report

Introduction to Data Mining Techniques

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

ANALYTICS CENTER LEARNING PROGRAM

CORE CLASSES: IS 6410 Information Systems Analysis and Design IS 6420 Database Theory and Design IS 6440 Networking & Servers (3)

Introduction to Data Mining

King Saud University

Chapter 12 Discovering New Knowledge Data Mining

Department of Computer Science

Graduate Co-op Students Information Manual. Department of Computer Science. Faculty of Science. University of Regina

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

City University of Hong Kong. Information on a Course offered by Department of Information Systems with effect from Semester B in 2013 / 2014

Cost Drivers of a Parametric Cost Estimation Model for Data Mining Projects (DMCOMO)

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: X DATA MINING TECHNIQUES AND STOCK MARKET

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

A Review of Data Mining Techniques

Integration of Mathematical Concepts in the Computer Science, Information Technology and Management Information Science Curriculum

Data Mining Applications in Higher Education

Course Descriptions: Undergraduate/Graduate Certificate Program in Data Visualization and Analysis

SAS JOINT DATA MINING CERTIFICATION AT BRYANT UNIVERSITY

2015 Workshops for Professors

A Brief Tutorial on Database Queries, Data Mining, and OLAP

Data Warehousing and Data Mining in Business Applications

Predicting Students Final GPA Using Decision Trees: A Case Study

Proposal for Undergraduate Certificate in Large Data Analysis

Random forest algorithm in big data environment

Data Mining + Business Intelligence. Integration, Design and Implementation

Erik Jonsson School of Engineering and Computer Science Interdisciplinary Programs

Teaching Big Data and Analytics to Undergraduate and Graduate Students

Management Information Systems

Doctor of Philosophy in Computer Science

Proposal for a BA in Applied Computing

Graduate Student Orientation

Information Systems. Administered by the Department of Mathematical and Computing Sciences within the College of Arts and Sciences.

Enhanced Boosted Trees Technique for Customer Churn Prediction Model

DATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM

Faculty of Science School of Mathematics and Statistics

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

College information system research based on data mining

Erik Jonsson School of Engineering and Computer Science

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

5.5 Copyright 2011 Pearson Education, Inc. publishing as Prentice Hall. Figure 5-2

Master of Science in Business Analytics

New Course Proposal OSC 4820, Business Analytics and Data Mining

Transcription:

Development of a Data Mining Course for Undergraduate Students Terri L. Lenox and Carolyn Cuff Department of Mathematics and Computer Science, Westminster College New Wilmington, PA 16172 USA ABSTRACT The goal of this project is to develop an introductory course for junior/senior computer science and mathematics undergraduate students on the topic of Data Mining. The course will be team developed and team taught at Westminster College by faculty in the Mathematics and Computer Science department. A detailed course description and day-by-day outline including textbooks, supplementary texts used as background, laboratory exercises, homework assignments and summary of lectures will be developed. This paper discusses some of the necessary steps to develop this course. Keywords: data mining, computer science curriculum 1. INTRODUCTION Current collegiate courses in data mining primarily appear at the graduate level. A carefully designed undergraduate course in data mining would introduce students to concepts in data mining by extending the existing database course and applying statistical concepts within the computer science curriculum. This data-mining course will be designed to integrate current research into the curriculum and promote the student s critical thinking and problem-solving skills. This course would provide an upper-level elective, offering a glimpse into innovative research. Research in computer science is often portrayed as very theoretical or hardware-related. Students become reluctant to pursue graduate-level research due to this stereotype. The authors hope that this course would demonstrate a more applied type of research and make graduate study seem more approachable to graduates of liberal arts colleges. This paper discusses some of the necessary steps to develop an undergraduate data mining course including : 1) development of a detailed course description and dayby-day outline; 2) refinement of course outcomes and expectations; 3) selection of textbooks and supplementary texts used for background; 4) selection of appropriate software tools; 5) development of laboratory exercises and homework assignments; and 6) creation of lectures. 1.1 Introduction to Data Mining Data mining consists of a set of quantitative techniques that help to extract patterns of information or rules from large amounts of data. It is used in business to profile customers with greater accuracy, and to detect fraud. In scientific disciplines, it can be used to analyze the large set sets produced by applications such as medical imaging or remote sensing (Elmasri & Navathe, 2000). Data mining requires access to large amounts of quality data, the ability to use commercial data mining tools, and the sufficient computational power. Large data sets are now being produced and are readily available. Students have access to powerful and fast computers that could be used to mine these data seeking to extract knowledge. However, few courses have been developed at the undergraduate level to lead prepared students to understand the purpose and general methodology of data mining 1.2 Rationale Westminster College is a liberal arts residential college with an enrollment of approximately 1,500 full-time undergraduates. The College offers majors in Mathematics, Computer Science and Computer Information Systems. These majors are housed in a combined Mathematics and Computer Science department which has five full-time mathematicians and three full-time computer scientists. All eight of these faculty hold Ph.D.s in respective fields. The total number of majors in the department has been steady for the last ten years averaging approximately 25 graduates per year. On average two are double majors in

mathematics and computer science, two mathematics majors have minors in computer science and four computer science majors have mathematics minors. Although the course offerings within each major are strong, few upper-level courses integrate mathematics and computer science. Westminster College s curriculum has a track record of innovation. In Fall, 1998, a four-credit hour discrete analysis course replaced Calculus I as the gateway course to all majors within the department. This course, depending on the major, is followed by at least two additional required credit hours in discrete analysis. Mathematics, Computer Science (CS) and Computer Information Systems (CIS) majors are required to take the introductory statistics course and single variable calculus. Mathematics majors are required to take the introductory computer science course and a two semester sequence mathematics intensive course. Often mathematics majors elect to fulfill this requirement by taking the second introductory computer science course and Data Structures. All majors at Westminster are required to take a Capstone course. Within the Mathematics and Computer Science department, the Capstone course has the form of a major individual or group project. The graduating class of 2001 was the first class required to complete the Capstone course. The faculty believes that the integration of mathematics within computer science and the option of integrating the computer science within mathematics is strong at the freshman and sophomore level. The authors seek to strengthen these ties at the upper level by the introduction of the Data Mining course. The authors believe that an undergraduate Data Mining course is accessible to strong computer science majors, will incorporate mathematics in the upper level computer science courses, provides students with exposure to state-of-the-art applications and is suitable in a liberal arts environment. A strength of the computer science discipline is in the application of foundational concepts to particular domains. A data mining course demonstrates how these foundational concepts can be built upon in new and innovative ways. An additional consideration in offering a data mining course is that data mining jobs are becoming more common. A quick search of several on-line job banks indicates entry level positions for programmers, analysts, and engineers (jobsearch.monster.com). These positions appear to require a mathematics or computer science degree with some data mining and statistics exposure. Finally as the number of students who obtain master s degrees in computer science and/or information sciences slows (U.S. Department of Education, 2001), the authors hope to encourage students to consider graduate studies and believe that innovative topics such as data mining may motivate students. 1.3 Data Mining Courses at Other Schools Although data mining is of increasing interest to industry, it does not appear to be a common undergraduate course at this time. However, it is likely that some institutions offer data mining and/or data warehousing subjects under a special or advanced topics listing which makes it difficult to survey. The list below from a brief Internet survey of over 125 institutes found the following undergraduate courses in data mining: 1. Data Warehousing and Mining (IFMG 455); Indiana University of Pennsylvania (Pierce, 1999). 2. Data Mining (ECMM 420); Christopher Newport University; e-commerce program. 3. Data Mining (INFO 329); Xavier University; crosslisted as a marketing course (MKTG 329). 4. Advanced Quantitative Methods in Business (BUS 491); Cal Poly San Luis Obispo; business course. 5. Emerging Database Technologies and Applications (CS 4440); Georgia Institute of Technology. 6. Data Mining (CS 373); LaSalle University. 7. Elementary Data Mining (Stat 321); Central Connecticut State University; on-line statistics course. The following lists some graduate courses in data mining and/or data warehousing: 1. Knowledge Discovery and Data Mining Masters program; Carnegie Mellon University 2. Data Warehousing and Data Mining (CS 753); Bentley College. 3. Data Warehousing and Data Mining (IS 549); DePaul University; data warehousing concentration. 4. Data Mining and Knowledge Discovery (INFSY 566); Penn State University. 5. Information Systems Data Warehousing and Decision Support ISC 571); University of South Alabama. 6. Data Warehousing and Data Mining (ISM 6930); University of South Florida; MBA program 7. Data Mining (COSC 7397); University of Houston. 8. Advanced Topics in Spatial Knowledge Discovery and Data Mining Systems (CS 783); North Dakota State University. 9. Advanced Topics in Data Mining Architectures (CS 785); North Dakota State University. 10. Introduction to Data Mining (Stat 521, CS 580); Central Connecticut State University on-line. 11. Data Mining (INFO 780-08); Drexel University. 12. Database Systems Planning (IS 304); Claremont Graduate University. In addition to traditional courses, several institutions offer certificate programs in data mining and/or data warehousing:

1. Data Mining Techniques and Applications; UCLA s Engineering, computer science and technical management short courses. 2. Data Mining; SAS Institute. 3. Data Mining; Central Connecticut State University; on-line. 4. Data Mining : Level I; The Modeling Agency 5. Penn State Data Mining Certificate Program. Because Westminster College is a traditional liberal arts college, our students are not currently interested in webbased courses. 2. DEVELOPING A DATA MINING COURSE 2.1 Prerequisite Knowledge The prerequisite database course at Westminster College provides the student with an introduction to the theory and practice of applying database technology to the solution of business- and information-related problems. Database terminology and concepts, data and file structures, and a comparison of the relational database with other models (hierarchical, network and objectoriented) are addressed. Experience is provided by database design and implementation based on a thorough requirements analysis and information modeling. An introduction to relational database technology is provided, highlighting the use of structured query language (SQL). The focus of traditional database courses is to provide students with a solid foundation on which to build and support operational databases. Data warehousing focuses on the creation of large consolidated databases to aid in the strategic decision making functions of an organization. Data mining typically focuses on analysis of these databases in an attempt to find previously unknown associations and patterns in the data. Since a purpose of this proposed data mining class is to strengthen the relationship between computer science and mathematics concepts, the statistical and analytical aspects of the subject are emphasized. 2.2 Development Plan The planned tasks for this project include development of the data mining course, field testing materials and post-hoc surveys. The authors will also team teach this course twice with an evaluation and dissemination period between sessions. Two institutions will be selected to field test this data mining course. Their evaluations will be analyzed and the course modified where necessary. The data mining course will have the following six prerequisites: Principles of Computer Science I and II, Theoretical Foundations of Computer Science, Data Structures, Database Theory and Design, and Statistics. This course will to complement the existing Neural Networks and Artificial Intelligence courses, and provide a basis for additional projects for the Capstone course. A brief outline of the material that will be covered in this data mining course follows. Pruning this list is expected to take the most effort during the development of this data mining course. 1. Additional background in statistics: likelihood functions, analysis of variance, clustering, and Bayesian networks. 2. Background in machine learning: concept learning, version spaces, decision trees, overfitting, Occam's razor, neural networks, single-pass learning algorithms and reinforcement learning. 3. Review of text-based database management systems. 4. Extension of traditional database topics: Multimedia databases including: time sequences, photographs and medical images, video clips, feature extraction, continuous media storage and delivery. Database methods including: massive datasets and association rules, frequent sets, sampling, visualization of large data sets. Data warehousing and on-line analytical processing (OLAP). 5. Data mining topics may include (Han & Kamber, 2000): Data preparation. Association rule mining (market basket analysis, basic concepts, and association rule mining). Classification by decision tree induction (decision tree induction, tree pruning, extracting classification rules from decision trees, enhancements to basic decision tree induction, scalability and decision tree induction, integrating data warehousing techniques and decision tree induction). Bayesian classification (Bayes theorem, naïve Bayesian classification, and belief networks). Classification by backpropagation (multiplayer feed-forward neural network, defining a network topology, backpropagation interpretability, classification based on concepts from association rule mining, k- nearest neighbor classifiers, and case-based reasoning). Prediction (linear and multiple regression, nonlinear regression, and other regression models). Cluster analysis (types of data in clustering analysis, interval-scaled variables, binary variables, nominal, ordinal, and ratio-scaled variables). Model-based clustering methods (statistical approach and neural network approach). Outlier analysis (statistical-based outlier detection, distance-based outlier detection, and deviation-based outlier detection).

2.3 Software Tools At this time, the first two software packages from the list below are being evaluated. Westminster College students have experience with either S-Plus of SPSS, which makes both Clementine and S-Plus logical choices for this data mining course. 1. Clementine from SPSS. 2. S-Plus (desktop) or Insightful Miner from Insightful Corporation. 3. PowerPlay and related tools from Cognos. 4. VisualMine and Daisy from AI Softw@re. 5. Data Mining Suite or Darwin from Oracle Corporation. 6. Intelligent Miner from IBM. * 7. Weka from the University of Waikato. * 8. TMiner Personal Edition from Department of Computer Sciences and Artificial Intelligence of Granada University (Spain). * A more in-depth list of data mining software tools can be found on the Knowledge Discovery and Data Mining web page (www.kdnuggets.com/software/suites.html). 2.4 Data Mining Text Books The number of data mining and data warehousing textbooks has increased substantially over the past few years. The seminal textbook on this subject is Inmon s Building the Data Warehouse from 1996. The Han and Kamber Data Mining: Concepts and Techniques textbook is the likely choice for this data mining course. The following lists a few of the many available books: 1. Han, J. and Kamber, M. (2000). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. 2. Inmon, W. H. (1996). Building the Data Warehouse, (2 nd Edition). New York, NY : John Wiley & Sons, Inc. 3. Berry, M. and Linoff, G. (1997). Data Mining Techniques for Marketing, Sales and Customer Support, New York, NY : John Wiley & Sons. 4. Berry, M. and Linoff, G. (1999). Mastering Data Mining, New York, NY : John Wiley & Sons. 5. Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., and Zanasi, A. (1998). Discovering Data Mining. Prentice Hall PTR. 6. Weiss, S. and Indurkhya, N. (1997). Predictive Data Mining: A Practical Guide. Morgan Kaufmann Publishers. 7. Westphal, C. and Blaxton, T. (1998). Data Mining Solutions, New York, NY : John Wiley & Sons. 3. OUTCOMES EXPECTED FROM THIS PROJECT Outcomes from this project are based on measures of student achievement, transferability of this course to another similar institution, and whether this course can be adapted to meet the needs of other undergraduate institutions. Student outcomes are as follows (ACM Computing Curricula 2001 Ironman Report, 2001): Given several data sets, students will be able to understand and discuss ways to prepare (clean) and warehouse data. Given a prepared set of data, students will be able to classify data based on cluster analysis, decision trees, Bayesian networks or backpropagation algorithms. Given a prepared set of data, students will be able to analyze the data using one of the techniques discussed. Given an analysis of a particular data set, students will be able to make appropriate predictions. Students will be able to understand, adapt and apply algorithms. Students will be able to outline the applications of data mining techniques. Some students could be motivated to continue computer science research or graduate studies. 3.1 Evaluation Plan Student outcomes will be evaluated by some of the traditional methods (projects, written examinations, laboratory exercises and homework assignments) during the two semesters that this course is initially taught. Once the course has been taught, it will be fine tuned and the authors hope to disseminate our findings at both mathematical and information science forms such as American Statistical Association (ASA) Joint Statistical Meetings and Information Systems Education conference (ISECON). 4. REFERENCES ACM Computing Curricula 2001 Ironman Report February 6, 2001. http://www.computer.org /education/cc2001/ironman/cc2001/index.html). Elmasri, R. and Navathe, S. B. (2000). Fundamentals of Database Systems. Addison-Wesley Longman, Inc. Han, J. and Kamber, M. (2000). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. Knowledge and Data Discovery Organization, http://www.kdnuggets.com /software/suites.html Pierce, E. M. (1999). Developing and delivering a data warehousing and mining course, Communications of the Association for Information Systems, Vol. 2 (16). * Shareware or freeware.

U.S. Department of Education, National Center for Education Statistics, Higher Education (HEGIS) (2001). Table 2.86 Earned degrees in computer and information sciences by degree-granting institutions.