Using Students as Experiment Subjects - An Analysis on Graduate and Freshmen Student Data




Per Runeson
Lund University, Dept. of Communication Systems, Box 118, SE-221 00 Lund, Sweden
per.runeson@telecom.lth.se

ABSTRACT

The question whether students can be used as subjects in software engineering experiments is debated. In order to investigate the feasibility of using students as subjects, a study was conducted in the context of the Personal Software Process (PSP), in which the performance of freshmen students and graduate students is compared and also related to another study in an industrial setting. The hypothesis is that graduate students perform similarly to industry personnel, while freshmen students' performance differs. A quantitative analysis compares the freshmen and graduate students. The improvement trends are also compared to industry data, although limited data access does not allow a full comparison. It can be concluded that very much the same improvement trends can be identified for all three groups. However, the dispersion is larger in the freshmen group. The absolute levels of the measured characteristics differ significantly between the student groups, primarily with respect to time, i.e. graduate students complete the tasks in less time. The data does not give a sufficient answer to the hypothesis, but it is a basis for further studies on the issue.

1 INTRODUCTION

People, process and technology are aspects that affect the capabilities of software development organizations. The three aspects interact, but it is not clear to what extent each aspect contributes to success or failure in software engineering. It is important to know which aspects contribute to, for example, increased productivity when introducing a new process. The issue can be analyzed by conducting empirical studies [22]. Many experiments are conducted using students as subjects, and it is equally often questioned whether these studies give valid results applicable to a population of software engineering professionals.

The Personal Software Process (PSP) [7, 8, 9] is presented as a contributor to the process part, and to some extent a contributor to the technologies in the area of project management. The PSP defines an approach to personalized software development processes with continuous improvement, packaged in process descriptions and course material. The PSP consists of a set of processes, ranging from the PSP0 Baseline Process, via the PSP1 Planning Process and the PSP2 Quality Management Process, to the PSP3 Cyclic Process. Each step adds more features to the previous step in terms of planning, measurement and quality control.

New technologies are presented continuously in research and industry, and are to some extent also evaluated; examples include different techniques for inspections [1, 18]. In empirical studies, people with different backgrounds and experience have contributed as subjects. However, it is not clear how people interact with the process and technology issues. In most studies, the experiment design blocks the people factor in order to evaluate the process or technology part, i.e. the study is intended to be independent of the people. It is nevertheless important to try to clarify the impact of and interaction with the people issue in empirical software engineering, in order to validate studies, in particular those with students as subjects, and the generalizability of such studies.
Empirical studies on the effect of using the PSP have addressed the question of the interaction between people and process by comparing the improvements made by graduate students using the PSP to the improvements made by industry people [19]. The improvements achieved are almost the same in both cases, i.e. the graduate students behave similarly to the industry people when taking the PSP course. To investigate this further, this paper presents a study of the performance of freshmen students taking the PSP course compared to graduate students and, in a second step, to industry people. Our hypothesis is that there are small differences between graduate students and industry people on the one hand, while there are significant differences between graduate students and freshmen students on the other hand. The differences investigated are of two types. First, it is investigated whether the same improvements are achieved in the steps between PSP levels 0, 1 and 2, i.e. whether estimation accuracy, defect density and productivity improve. Second, it is analyzed whether there are differences in performance, i.e. time consumption, productivity and number of defects. The outline of the study is shown in Figure 1, where the improvement comparisons are marked with solid lines and the performance comparisons with dashed lines. Limited access to industry data does not allow the performance analysis to be carried out on the industry data.

The paper is structured as follows. In Section 2 the context of the study is presented. In Section 3 the hypotheses are formally defined and the analysis is reported. Section 4 contains a discussion on the interpretation of the results, and finally a summary is given in Section 5.

2 STUDY CONTEXT

Since Humphrey presented the Personal Software Process in his book [7], different studies related to the PSP have been conducted. There are reports of a descriptive nature which present positive results in general, for example experience reports [4, 8]. Other studies are related to the quality of the data collected in the use of the PSP [3, 11, 12]. Further, studies that investigate within-course effects of the PSP methods have been presented [5, 6, 19], as well as attempts to assess post-course impact [14]. Reports on the use of the PSP in industry exist [13], and reports on the use of the PSP for teaching are numerous, e.g. [2, 16]. It has also been proposed to use the PSP as a context for software engineering experiments [21].

This study is conducted on data primarily from students at Lund University, Sweden, taking the PSP course as defined by Humphrey [7]. The course settings are almost identical to the settings in the Wesslén study [19]. In this study we have one group of freshmen students at undergraduate level [16] in addition to the graduate students. The graduate students studied in Masters programs which are scheduled as a 4.5-year sequence, including both undergraduate and graduate studies. Hence, most students study their topics without industrial experience between their undergraduate and graduate studies.

The PSP course was given at Lund University for the first time during the fall semester of 1996. It was then given to graduate students during their fourth year of studies. The course attendants are students in the Computer Science and Engineering program (CSE) and the Electrical Engineering program (EE). During the spring semester of 1999, the PSP course was given to undergraduate students in their first year of study in a Bachelors program in Software Engineering (SE). In addition, the course was given to Ph.D. students at Linköping University, Sweden, in 1997. In this section, the context of the course occasions contributing to the study is presented. The students were informed that the data collected might be used in future empirical research under guaranteed anonymity [17]. The grading in the courses was partly based on how well the students adhered to the process, but not on the collected metrics as such.

The industry data was collected at the Software Engineering Institute (SEI) and reported by Hayes and Over [5]. The data comes from courses given by the SEI on 23 different occasions, comprising 298 students. Half of the courses were given in an academic setting and half in an industrial setting.

2.1 General for all students

All the university courses used the original PSP book by Humphrey as the key source of information [7].
In addition to the book, all students from 1996 onwards were given a booklet that guided them through each task by giving pointers to relevant parts of the PSP book and by clarifying the use of, for example, the estimation method proposed in the book. The programming tasks performed are presented in Table 1.

[FIGURE 1. Outline of the study: improvement comparisons (solid lines) and performance comparisons (dashed lines) among the freshmen, graduate and industry groups.]

TABLE 1. Programming tasks in the PSP course

#  | Description
1A | Calculate standard deviation of a data set
2A | Count lines of code in a source file
3A | Extend 2A to count length of methods or functions
4A | Calculate linear regression of a data set
5A | Integrate a function numerically
6A | Calculate a prediction interval based on 4A and 5A
7A | Calculate the correlation between two data sets
8A | Sort elements of a linked list
9A | Calculate normality using a Chi-2 test

In order to ease the data collection and thereby improve the quality of the data, electronic support was given to the students. In the 1996 course setting, an ASCII-based solution was used, while from 1997 onwards a spreadsheet-based tool for data collection was used. The students filled out a spreadsheet for each task and submitted it electronically for examination.
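To give a concrete feel for the scale of the tasks in Table 1, here is a minimal sketch of what task 1A might look like. Python is used for illustration only; the students used C, Java or a language of their choice, and the input data below is made up, not taken from the study.

    import math

    def standard_deviation(values):
        # Sample standard deviation over n - 1, as task 1A requires.
        n = len(values)
        if n < 2:
            raise ValueError("need at least two data points")
        mean = sum(values) / n
        variance = sum((x - mean) ** 2 for x in values) / (n - 1)
        return math.sqrt(variance)

    # Illustrative data, not from the study:
    print(standard_deviation([186, 699, 132, 272, 291, 331, 199, 1890]))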

The spreadsheets of the individual students were then linked together for analysis. The code counting data was collected using the code counting program developed in exercises 2A and 3A, which was based on a common code counting standard. Based on experience from the initial courses, the order of tasks 6A and 7A was swapped in the courses from 1998 onwards. The reason is that the complexity of the tasks grows more smoothly when they are taken in this order. The design method presented by Humphrey was not prescribed in any of the university courses; it was left to the students to use any method they wanted.

2.2 PSP for graduate students

The graduate students attending the PSP courses had taken programming courses in various languages. The CSE students had taken more courses than the EE students, but they had all taken at least one programming course. At the first course occasion, C was the mandatory programming language. At the other occasions, the students were free to choose a programming language, as long as they were familiar with the language they decided to use. Wesslén reports analyses of the outcome of the courses [19].

2.3 PSP for freshmen students

At Lund University a new Bachelors program in Software Engineering (SE) was launched in 1998 [15]. The program is designed to make the students software engineers not merely through a final add-on, but by providing them from the very beginning with means for quantifying, analyzing and managing their software development tasks. The intention is thus that software engineering attitudes are established from the start. In the first run of the SE program, an introductory course in Java was given during the first semester. In addition, a brief introduction to the PSP concepts was given based on the PSP introductory book [9]. During the PSP introduction the basic forms were used, i.e. project plan summary, time reporting log and defect reporting log. During the second semester, the full PSP course was given according to Humphrey's book [7]. In parallel with the PSP course, a statistics course was given to teach the statistics needed to implement the PSP programs and to analyze the data. Experiences from teaching this course are reported by Runeson [16]. The undergraduate students used Java as a mandatory language. In contrast to the graduate students, they were allowed to use a list package as support for the programs, which affects tasks 1A, 4A and 6A.

The different groups of students are summarized in Table 2. The data reported by Hayes and Over is characterized in Table 3.

TABLE 2. Overview of student subjects in the study

Year  | University | Level         | Language | # stud
96/97 | Lund       | Grad          | C        | 42
96/97 | Linköping  | Ph.D.         | mixed    | 30
97/98 | Lund       | Grad          | mixed    | 59
99    | Lund       | Undergraduate | Java     | 31
Sum   |            |               |          | 162

TABLE 3. Subjects in the Hayes and Over study (298 students in total)

Type                 | Number of classes
Instructor training  | 4
Industry setting     | 8
Academic setting     | 11
Sum                  | 23

Class size category  | Number of classes
4 to 10              | 6
11 to 15             | 11
16 to 21             | 6
Sum                  | 23

3 ANALYSIS

3.1 Hypotheses

The informal hypothesis presented in the introduction is formally defined below. The hypotheses are of two types: improvement hypotheses and performance hypotheses. The improvement hypotheses are summarized in Table 4. The primary hypotheses are tested using the freshmen, graduate and industry data. The additional hypotheses are tested only for the student data, due to limited access to raw industry data. The improvement hypotheses are the same as in the studies by Hayes [5] and Wesslén [19]. They are formulated non-directionally, to allow comparison to the original studies.
Directional hypotheses would have allowed one-sided statistical tests, which are more powerful than two-sided tests. The performance hypotheses investigate differences in the measurements between the groups. Due to limited access to industry data, these hypotheses are only tested on the freshmen and graduate student data. The hypotheses are summarized in Table 5.

3.2 Data validation

In the graduate student group, individuals who had not finished the course, had received more help than the other individuals, or had not reported trustworthy data are removed [19]. The data validation reduces the data set from the original 131 data points to at most 113 data points for the different analyses, i.e. at most 18 out of 131 are removed.

TABLE 4. Improvement hypotheses

Area                       | Primary hypothesis                                                                            | Additional hypothesis
Size estimation accuracy   | Estimation gets better for each PSP level                                                     | Dispersion in estimation reduced for each PSP level
Effort estimation accuracy | Estimation gets better for each PSP level                                                     | Dispersion in estimation reduced for each PSP level
Defect density             | Defect density gets lower for each PSP level, overall, and for compile and test respectively | Dispersion in defect density reduced for each PSP level
Pre-compile defect yield   | Yield gets higher for each PSP level                                                          | -
Productivity               | Productivity gets higher for each PSP level                                                   | Dispersion reduced for each PSP level

TABLE 5. Performance hypotheses

Area             | Hypothesis
Size             | Freshmen students write programs of different size compared to graduate students
Effort           | Freshmen students spend a different amount of time compared to graduate students
Productivity     | Freshmen students have different productivity compared to graduate students
Defects          | Freshmen students have a different number of defects in their programs compared to graduate students
Defect density   | Freshmen students have a different number of defects per size unit compared to graduate students
Defect intensity | Freshmen students make a different number of defects per time unit compared to graduate students

Applying the same validation procedure to the freshmen student data set involves several risks. The data set is smaller, and thereby each subject contributes more to the total. If the individuals who did not perform very well were removed, the results would look better than the sample actually indicates. Hence, two alternative validation procedures are applied and the analysis results are reported for both. The first approach is to follow the same procedure as in the graduate student group, below referred to as the reduction approach. The data set is then reduced from the original 31 data points to between 17 and 25 data points for the analyses, i.e. between 6 and 14 out of 31 are removed. The second approach is to fill in missing data values (below referred to as the fill-in approach), according to the following procedure (a code sketch of rules 2 and 3 follows below):

1. If the data value is available, but not in the correct data sheet, it is filled in. For example, actual size is reported in the Project Plan Summary for the previous task, but not moved into the sheet for the current task.
2. If data is available for other tasks at the same PSP level, this data is used. For example, if the yield is missing for task 8A but filled in for 7A, this data is used.
3. Otherwise, average population data is used.

When applying the fill-in approach to the data, 11 data values are found in other data sheets, 12 data values are taken from other tasks and 54 data values are taken from the population average. This can be compared to the total number of data values of about 1 700 per student [12], i.e. 53 000 for 31 students. The analyses in this study are conducted on data validated by both of the approaches, and it is reported where the results differ.

In the Hayes and Over data set, between 222 and 277 data points out of the 298 students could be used. They did not apply any method to complete the data.
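As an illustration of how the fill-in approach could be mechanized, the sketch below implements rules 2 and 3; rule 1, recovering values found in the wrong data sheet, requires manual inspection and is omitted. The task-to-level mapping, the data layout and the use of a peer average for rule 2 are assumptions made for this sketch, not the study's tooling.

    from statistics import mean

    # Assumed mapping of tasks to PSP levels (illustrative only).
    PSP_LEVEL = {"1A": 0, "2A": 0, "3A": 0, "4A": 1, "5A": 1, "6A": 1,
                 "7A": 2, "8A": 2, "9A": 2}

    def fill_in(student, population):
        """student: dict task -> metric value, with None for missing values.
        population: list of such dicts, one per student."""
        filled = dict(student)
        for task, value in student.items():
            if value is not None:
                continue
            level = PSP_LEVEL[task]
            # Rule 2: use data from another task at the same PSP level.
            peers = [v for t, v in student.items()
                     if v is not None and PSP_LEVEL[t] == level]
            if peers:
                filled[task] = mean(peers)
                continue
            # Rule 3: otherwise fall back to the population average.
            filled[task] = mean(s[task] for s in population
                                if s[task] is not None)
        return filled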
3.3 Improvement study

The hypotheses in the improvement study are tested and compared to the previous studies, referred to as graduate [19] and industry [5] respectively. The analysis procedure follows the previous studies. Within each of the three groups, an ANOVA test is used to test whether there are any differences between adjacent PSP levels. If the ANOVA test rejects the null hypothesis that there is no difference, a pair-wise t-test is conducted to see in which step the improvements are made. For the freshmen and graduate groups, an F-test is conducted to test whether the dispersion is reduced at the more sophisticated PSP levels.
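A minimal sketch of this test chain, assuming one numeric sample per PSP level and scipy for the tests (the paper does not state which tools were used; Wesslén's analysis tools were reused):

    import numpy as np
    from scipy import stats

    def improvement_tests(psp0, psp1, psp2, alpha=0.05):
        groups = [np.asarray(g, dtype=float) for g in (psp0, psp1, psp2)]
        results = {}
        # ANOVA: is there any difference between the PSP levels at all?
        results["anova_p"] = stats.f_oneway(*groups).pvalue
        if results["anova_p"] < alpha:
            # Pair-wise t-tests between adjacent levels to locate the step.
            # (Independent samples here; a paired test may be more faithful,
            # since the same subjects appear at every level.)
            results["t_psp0_vs_psp1"] = stats.ttest_ind(groups[0], groups[1]).pvalue
            results["t_psp1_vs_psp2"] = stats.ttest_ind(groups[1], groups[2]).pvalue
        # F-tests on variances: is the dispersion reduced at the higher level?
        for name, (a, b) in (("f_psp0_vs_psp1", (groups[0], groups[1])),
                             ("f_psp1_vs_psp2", (groups[1], groups[2]))):
            ratio = np.var(a, ddof=1) / np.var(b, ddof=1)
            results[name] = stats.f.sf(ratio, len(a) - 1, len(b) - 1)
        return results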

The analysis results are summarized in Table 6, where an X means that the null hypothesis of no difference is rejected with confidence higher than 0.95.

[TABLE 6. Summary of results in the improvement analysis. For each area (size estimation accuracy, effort estimation accuracy, overall/compile/test defect density, pre-compile defect yield, productivity), an X marks a rejected hypothesis on the mean and on the dispersion, for PSP0 vs. PSP1 and PSP1 vs. PSP2, in the freshmen, graduate and industry groups (dispersion: student groups only). Notes: a. only for the reduction validation approach; b. only for the fill-in validation approach.]

It can be noted that the improvements are very much the same for the three groups. In the step from PSP0 to PSP1, the freshmen group improves significantly in four out of six areas, and a fifth area is improved as well under the reduction approach to data validation. There is no reduction in test defect density, but otherwise the result is consistent with both the graduate students and the industry people. Productivity is improved for the freshmen in the step from PSP0 to PSP1 but is, as for the other two groups, not improved in the subsequent step.

The dispersion analysis shows less consistent results. The dispersion in the freshmen group is not reduced in size estimation accuracy and productivity, while the graduate student group shows reduced dispersion also for the yield. The freshmen group tends to reduce the dispersion in the step from PSP1 to PSP2, while the graduate student group reduces the dispersion already in the step from PSP0 to PSP1.

The median improvements from PSP0 to PSP2 are of the same order of magnitude for the three groups, as presented in Table 7. The exception is the effort estimation accuracy, for which the freshmen improve by a factor of 14.9, while the graduate students and the industry people improve by factors of 3.0 and 1.75 respectively.

TABLE 7. Median improvement from PSP0 to PSP2

Area                       | Freshmen | Graduate | Industry
Size Estimation Accuracy   | 1.79     | 2.1      | 2.5
Effort Estimation Accuracy | 14.9     | 3.0      | 1.75
Overall Defect Density     | 1.8      | 1.4      | 1.5
Compile Defect Density     | 3.4      | 2.9      | 3.7
Test Defect Density        | 1.8      | 2.0      | 2.5
Pre-Compile Defect Yield   | 45%      | 39%      | 50%
Productivity               | 1.58     | 0.9      | (0.86) no gain or loss

It can be concluded that there are no other significant differences between the groups with respect to their improvement within the PSP context. The next question is whether the performance metrics show any statistical differences.

3.4 Performance study

In order to further investigate the differences between the groups, the metrics for the different development performance characteristics, collected in the PSP, are compared for the freshmen students and the graduate students. The limited access to industry data makes it impossible to make the same comparison to the industry group. The following metrics are compared:

- Size of program, measured in LOC
- Total development time, measured in minutes
- Productivity, measured in LOC per hour
- Total number of defects
- Defect density, measured as number of defects per LOC
- Defect intensity, measured as number of defects per development hour

For each of the metrics, a t-test is conducted to test the null hypothesis that the performance is the same for freshmen students and graduate students. Further, the mean percentage difference between the groups is calculated according to the following formula:

Diff = (F(freshmen) / F(graduate) - 1) * 100

where F is one of [Size, Time, Prod, Defects, Density, Intensity]. Relative differences are analyzed, not absolute values; hence the variety of languages used does not impact the size difference. The analyses are summarized in Table 8, where * refers to a significance level of 0.9 and ** refers to a significance level of 0.95.
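A sketch of this per-metric comparison, combining the t-test with the Diff formula above (scipy and the Welch variant of the t-test are assumptions; the paper does not specify either):

    import numpy as np
    from scipy import stats

    def performance_comparison(freshmen, graduate):
        """freshmen, graduate: dicts mapping a metric name
        (Size, Time, Prod, Defects, Density, Intensity) to per-student values."""
        out = {}
        for metric, f_values in freshmen.items():
            f = np.asarray(f_values, dtype=float)
            g = np.asarray(graduate[metric], dtype=float)
            # Null hypothesis: same performance in both groups.
            p = stats.ttest_ind(f, g, equal_var=False).pvalue
            # Mean percentage difference: Diff = (F(fresh)/F(grad) - 1) * 100
            diff = (f.mean() / g.mean() - 1.0) * 100.0
            out[metric] = {"p_value": p, "diff_percent": diff}
        return out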

In the performance analysis, the differences between freshmen students and graduate students are clearer than in the improvement analysis. Freshmen students write significantly smaller programs for the tasks at PSP0 and PSP1; the average difference is 19% relative to the graduate students. In tasks 1A, 4A and 6A the groups have different prerequisites, i.e. the freshmen students are allowed to use a list package. A comparison to the subset of graduate students using Java shows the same trend, although it differs for the individual tasks. However, there are only 9 students in the graduate group who used Java, so the basis for any conclusions is rather limited.

The freshmen students spend significantly more time on 8 out of 9 tasks; on average they spent 47% more time than the graduate students. A direct consequence of this large difference is that productivity is significantly lower for the freshmen students: they write shorter programs in longer time.

The number of defects does not differ between the groups. This is an issue where the data quality can be debated; it can be questioned whether the freshmen students really report all the problems they encounter. The time data indicates that they have more problems, but the defect data does not. Although there is no significant difference in the number of defects, the defect density is significantly higher for the freshmen in 4 out of 9 cases. On the other hand, the defect intensity, i.e. the number of defects per development hour, is lower. This indicates that the real difference lies in the time consumption.

3.5 Qualitative differences

Having experience from teaching the two student groups, there are also some qualitative differences worth mentioning [16]. Some of the issues are measurable, but they were not measured during the courses. The freshmen students tend to raise questions mostly on programming issues, while the graduate students are more focused on the process parts. This is not surprising, as the freshmen students attended the course directly after their first programming course, while the graduate students attended the course in their fourth year of studies. On the other hand, there may be some learning effects for the graduate student group as well, in particular for the electrical engineering students. They take their programming courses primarily in the first and second years, and focus on other topics during their third year. Hence, they have to recover their programming skills.

The variation within the groups is larger for the freshmen students. Few students in a graduate student group have serious problems, while the share of students with problems is larger in the freshmen group. This is indicated by the number of data points removed in the reduction approach to data validation. The graduate group of 131 subjects is reduced to 113, i.e. by 14%. The freshmen group of 31 is reduced to between 25 and 17, i.e. by 20 to 45% for the different analyses.

3.6 Threats to validity

The most important threats to the validity of the study are discussed below. Conclusion validity is threatened by the fact that the data is collected in different settings. This is particularly true for the industry data. However, since the PSP environment is well defined, this reduces the threat. The reliability of the measures can be questioned in this study, as in other PSP studies [3, 11, 12], and hence the conclusions as well.
Further, the data validation is performed using alternative approaches (fill-in and reduction), which give slightly different results.

TABLE 8. Summary of the performance analysis. Significance is marked per task 1A-9A, grouped under PSP0, PSP1 and PSP2 (* = significant at the 0.9 level, ** = significant at the 0.95 level; a. ** for Size(fresh) > Size(grad)).

Hypothesis                          | Mean difference (Diff)
Size(fresh) < Size(grad)            | 18.7%
Size(fresh) < Size(grad), Java only | 10.4%
Time(fresh) > Time(grad)            | 46.8%
Prod(fresh) < Prod(grad)            | 37.4%
Defects(fresh) < Defects(grad)      | 9.1%
Density(fresh) > Density(grad)      | 12.7%
Intensity(fresh) < Intensity(grad)  | 32.1%

Internal validity is threatened by instrumentation issues. In its standard format, the PSP material provides an extensive paper-based set of forms to fill out. In the student settings, most of the data is collected using electronic support. It is unknown to what degree this impacts the results. The selection of subjects within each group is based on convenience sampling, and is hence no true sample of any larger population. In all data sets there are subjects who drop out, and we do not know how this impacts the results.

Regarding construct validity, the use of the PSP context is the largest threat. It increases the internal validity, as it adds rigor to the process and the data collection, but it decreases the construct validity, since few software engineering settings are so well defined, nor are the tasks to be solved so small. However, as the key question is to investigate the validity of using students as subjects in experiments, and the PSP is quite similar to how experiment packages look, this validity threat is reduced with respect to the purpose of this study.

For the external validity of the study, the question is whether the study is representative of other software engineering experiments, as the purpose is to analyze whether students can be successfully used as subjects. We believe that the student groups and the industry group are quite similar to groups conducting different types of experiments, and thus the external validity is reasonably high. Whether an experiment conducted in a student environment generalizes is another issue; that is the question of the investigation as such.

4 DISCUSSION

The analysis presented shows two clear trends: 1) the improvements between the PSP levels are very much the same for all three groups, and 2) the freshmen students spend significantly more time on their tasks. The question is how this can be interpreted.

In the three groups, the process is the same: all groups follow the PSP course with minor variations. The technology is also rather similar, even though different languages are used; the tasks in the PSP are small, and thus there is no need for extensive tool support to do a good job. The people have different experience and knowledge. It can be debated which of these issues has the largest impact on the total result, but it is hard to measure. However, as the PSP course is designed around continuous improvement across the three PSP levels and the stepwise addition of new methods, the improvement is probably to a large extent due to the methods as such and not due to the people learning. In PSP0 there is no estimation method and very limited experience data available, while in PSP1 experience data is available and the PROBE estimation method is gradually introduced. Hence, it is not surprising that the estimation accuracy improves. The same holds for the pre-compile defect yield: code and design reviews are introduced in PSP2, while in PSP0 and PSP1 reviews are not a formal part of the process. Again, it is no surprise that applying reviews reveals more defects before compile than not applying reviews. These issues are related to the process, and it seems that, independently of the people, almost the same effects can be observed.

The direct measurements show a significant difference in time consumption: in the study, freshmen spend 47% more time than graduate students do. This indicates that the people part actually is different between the freshmen and graduate student groups. Unfortunately, industry data is not available to make the same comparison between graduate students and industry people.
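For reference, the regression-based estimation that PROBE builds on is what tasks 4A and 6A implement: a least-squares fit over historical data plus a prediction interval around the new estimate. The sketch below shows the standard textbook formulas, not Humphrey's exact PROBE procedure:

    import math

    def linear_regression(x, y):
        # Least-squares fit y = b0 + b1 * x (cf. task 4A).
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
              / sum((xi - mx) ** 2 for xi in x))
        return my - b1 * mx, b1

    def prediction_interval(x, y, x_new, t_value):
        # Prediction range around the estimate (cf. task 6A). t_value is the
        # Student t quantile for n - 2 degrees of freedom; in the PSP course
        # it is obtained by numerical integration (task 5A).
        n = len(x)
        b0, b1 = linear_regression(x, y)
        y_hat = b0 + b1 * x_new
        mx = sum(x) / n
        sigma = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2
                              for xi, yi in zip(x, y)) / (n - 2))
        half = t_value * sigma * math.sqrt(
            1 + 1 / n + (x_new - mx) ** 2 / sum((xi - mx) ** 2 for xi in x))
        return y_hat - half, y_hat, y_hat + half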
A final question related to the data is why there is no difference in defect levels between freshmen and graduate students. Here it is tempting to assume that the freshmen students do not report all defects. The reported repair time in the defect reporting log seems somewhat low relative to the time spent in compile and test, but no systematic investigation has been performed on this issue. The quality of PSP data has been investigated and debated [11, 12, 3], but not concerning the defect reporting.

How shall these results be interpreted in terms of the feasibility of using students as subjects in software engineering experiments? The improvement study may give the impression that any subject is feasible for a software engineering experiment. The performance study and the qualitative judgments point instead toward substantial differences between the two student groups. Unfortunately, industry data is not available to perform the same comparison to the industry group. Hence the general question remains unanswered, while it can be stated that freshmen students should not be used as subjects for software engineering experiments.

5 SUMMARY

It is generally accepted that people, process and technology are three different aspects that affect software engineering. In order to learn more about the different parts, experiments are conducted. An important question is whether students can be used as subjects and still give generalizable results. In this paper, three sets of PSP data are compared in order to evaluate differences regarding people issues between freshmen students, graduate students and industry people.

It is observed that almost the same improvements are made between the different PSP levels for the three groups. The estimation accuracy is improved and the defects are reduced. This is, however, primarily an effect of the PSP process as such rather than of the people: new steps for estimation and defect reduction are introduced, which give the observed effects.

The measurements of the absolute performance of the freshmen student group and the graduate student group show more varied results. The freshmen students spend significantly more time to fulfill the tasks than the graduate students do. From this we conclude that there is a difference in the people issue between the two student groups, which is also supported by the qualitative observations.

The conclusions drawn from the study can neither reject nor accept the hypothesis on differences between freshmen students, graduate students and industry people. The difference between freshmen and graduate students is observed, while the data is not sufficient to evaluate similarities or differences between industry people and graduate students. Hence, this relation is a subject for further studies.

ACKNOWLEDGEMENT

Thanks to Dr. Anders Wesslén for letting me use his analysis tools in the study and for guiding me in the data access. Thanks to Dr. Thomas Thelin and Dr. Magnus C. Ohlsson for good cooperation during the PSP course for freshmen students. Thanks to Dr. Martin Höst for reviewing a draft of this paper.

REFERENCES

[1] V. R. Basili, S. Green, O. Laitenberger, F. Lanubile, F. Shull, S. Sørumgård, and M. Zelkowitz, "The Empirical Investigation of Perspective-Based Reading", Empirical Software Engineering, 1(2):133-164, 1996.
[2] J. Börstler, D. Carrington, G. W. Hislop, S. Lisack, K. Olson and L. Williams, "Teaching PSP: Challenges and Lessons Learned", IEEE Software, Sep./Oct. 2002, pp. 42-48.
[3] A. M. Disney and P. M. Johnson, "Investigating Data Quality Problems in the PSP", FSE-6, 1998.
[4] P. Ferguson, W. S. Humphrey, S. Khajenoori, S. Macke and A. Matvya, "Results of Applying the Personal Software Process", IEEE Computer, No. 5, 1997, pp. 24-31.
[5] W. Hayes and J. W. Over, "The Personal Software Process (PSP): An Empirical Study of the Impact of PSP on Individual Engineers", Technical Report CMU/SEI-97-TR-001, ESC-TR-97-001, Software Engineering Institute, December 1997.
[6] W. Hayes, "Using a Personal Software Process to Improve Performance", Proc. 5th International Metrics Conference, pp. 61-71, 1998.
[7] W. S. Humphrey, A Discipline for Software Engineering, Addison Wesley, 1995.
[8] W. S. Humphrey, "Using a Defined and Measured Personal Software Process", IEEE Software, May 1996, pp. 77-88.
[9] W. S. Humphrey, Introduction to the Personal Software Process, Addison Wesley, 1997.
[10] M. Höst, B. Regnell and C. Wohlin, "Using Students as Subjects - A Comparative Study of Students and Professionals in Lead-Time Impact Assessment", Journal of Empirical Software Engineering, 5(3):201-214, 2000.
[11] P. M. Johnson and A. M. Disney, "The Personal Software Process: A Cautionary Case Study", IEEE Software, Nov./Dec. 1998, pp. 85-88.
[12] P. M. Johnson and A. M. Disney, "A Critical Analysis of PSP Data Quality: Results from a Case Study", Empirical Software Engineering, 4(4):317-349, 1999.
[13] M. Morisio, "Applying the PSP in Industry", IEEE Software, Nov./Dec. 2000, pp. 90-95.
[14] L. Prechelt and B. Unger, "An Experiment Measuring the Effects of Personal Software Process (PSP) Training", IEEE Trans. on Software Engineering, 27(5):465-472, 2000.
[15] P. Runeson, "A New Software Engineering Programme - Structure and Initial Experiences", Proc. 13th Conference on Software Engineering Education & Training, pp. 223-232, 2000.
[16] P. Runeson, "Experience from Teaching PSP for Freshmen", Proc. 14th Conference on Software Engineering Education & Training, pp. 98-107, 2001.
[17] J. Singer and N. G. Vinson, "Ethical Issues in Empirical Studies of Software Engineering", IEEE Trans. on Software Engineering, 28(12):1171-1180, 2002.
[18] T. Thelin, P. Runeson, and B. Regnell, "Usage-Based Reading - An Experiment to Guide Reviewers with Use Cases", Information and Software Technology, 43(15):925-938, 2001.
[19] A. Wesslén, "A Replicated Empirical Study of the Impact of the Methods in the PSP on Individual Engineers", Empirical Software Engineering, 5(2):93-123, 2000.
[20] C. Wohlin, "Meeting the Challenge of Large Scale Software Development in an Educational Environment", Proc. 10th Conference on Software Engineering Education & Training, pp. 40-52, 1997.
[21] C. Wohlin, "The Personal Software Process as a Context for Empirical Studies", IEEE TCSE Software Process Newsletter, No. 12, pp. 7-12, Spring 1998.
[22] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell and A. Wesslén, Experimentation in Software Engineering - An Introduction, Kluwer Academic Publishers, Boston, MA, USA, 2000.