Mining a Change-Based Software Repository



Similar documents
Towards Software Configuration Management for Test-Driven Development

JRefleX: Towards Supporting Small Student Software Teams

Recovering Business Rules from Legacy Source Code for System Modernization

An Agile Methodology Based Model for Change- Oriented Software Engineering

Peter Mileff PhD SOFTWARE ENGINEERING. The Basics of Software Engineering. University of Miskolc Department of Information Technology

Using Continuous Code Change Analysis to Understand the Practice of Refactoring

Introduction to Software Configuration Management. CprE 556 Electrical and Computer Engineering Department Iowa State University

Component visualization methods for large legacy software in C/C++

19 Configuration Management

Software Construction

Chapter 5. Regression Testing of Web-Components

CHAPTER 20 TESING WEB APPLICATIONS. Overview

Model-based Configuration Management for a Web Engineering Lifecycle

Is It Dangerous to Use Version Control Histories to Study Source Code Evolution?

Configuration Management: An Object-Based Method Barbara Dumas

Analyze Database Optimization Techniques

Version Control. Luka Milovanov

CPSC 491. Today: Source code control. Source Code (Version) Control. Exercise: g., no git, subversion, cvs, etc.)

Simplifying development through activity-based change management

Engineering Process Software Qualities Software Architectural Design

Does the Act of Refactoring Really Make Code Simpler? A Preliminary Study

White Paper. Software Development Best Practices: Enterprise Code Portal

Chapter 5. Choose the answer that mostly suits each of the sentences given:

Parsing Technology and its role in Legacy Modernization. A Metaware White Paper

Monitoring Replication

What is a life cycle model?

Supporting Software Development Process Using Evolution Analysis : a Brief Survey

CS 389 Software Engineering. Lecture 2 Chapter 2 Software Processes. Adapted from: Chap 1. Sommerville 9 th ed. Chap 1. Pressman 6 th ed.

Baseline Code Analysis Using McCabe IQ

Data Migration. How CXAIR can be used to improve the efficiency and accuracy of data migration. A CXAIR White Paper.

Xcode Source Management Guide. (Legacy)

Enhance Preprocessing Technique Distinct User Identification using Web Log Usage data

Configuration Management

Instructional Design Framework CSE: Unit 1 Lesson 1

Structural Design Patterns Used in Data Structures Implementation

DAVE Usage with SVN. Presentation and Tutorial v 2.0. May, 2014

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

Software Configuration Management (SCM)

SOFTWARE TESTING TRAINING COURSES CONTENTS

CHAPTER 2 LITERATURE SURVEY

TopBraid Insight for Life Sciences

Configuration Management Models in Commercial Environments

Theme 1 Software Processes. Software Configuration Management

Nick Ashley TOOLS. The following table lists some additional and possibly more unusual tools used in this paper.

Arti Tyagi Sunita Choudhary

Software Engineering. Software Engineering. Component-Based. Based on Software Engineering, 7 th Edition by Ian Sommerville

Kevin Lee Technical Consultant As part of a normal software build and release process

Chapter 24 - Quality Management. Lecture 1. Chapter 24 Quality management

Testing Automation for Distributed Applications By Isabel Drost-Fromm, Software Engineer, Elastic

Semantic Search in Portals using Ontologies

Chapter ML:XI. XI. Cluster Analysis

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Chapter-1 : Introduction 1 CHAPTER - 1. Introduction

Content. Development Tools 2(63)

AGILE vs. WATERFALL METHODOLOGIES

1.1 Difficulty in Fault Localization in Large-Scale Computing Systems

Backing Up the CTERA Portal Using Veeam Backup & Replication. CTERA Portal Datacenter Edition. May 2014 Version 4.0

Windchill Service Information Manager Curriculum Guide

IBM WebSphere ILOG Rules for.net

Windchill PDMLink Curriculum Guide

Complexities of Simulating a Hybrid Agent-Landscape Model Using Multi-Formalism

Software Engineering Best Practices. Christian Hartshorne Field Engineer Daniel Thomas Internal Sales Engineer

A Framework of Model-Driven Web Application Testing

PATROL From a Database Administrator s Perspective

Improving database development. Recommendations for solving development problems using Red Gate tools

Towards a taxonomy of software change

A Time Efficient Algorithm for Web Log Analysis

Context Capture in Software Development

An Eclipse Plug-In for Visualizing Java Code Dependencies on Relational Databases

Driving Your Business Forward with Application Life-cycle Management (ALM)

Software Configuration Management.

Software Configuration Management and Continuous Integration

Foundations of Business Intelligence: Databases and Information Management

The Scientific Data Mining Process

Computer programs (both source and executable) Documentation (both technical and user) Data (contained within the program or external to it)

SAS in clinical trials A relook at project management,

Continuous Integration. CSC 440: Software Engineering Slide #1

Simple Network Management Protocol

Basic Trends of Modern Software Development

WHITEPAPER. Improving database development

Pipeline Orchestration for Test Automation using Extended Buildbot Architecture

Work. MATLAB Source Control Using Git

Database Design Patterns. Winter Lecture 24

A Mind Map Based Framework for Automated Software Log File Analysis

Recommending change clusters to support software investigation: an empirical study

Transcription:

Mining a Change-Based Software Repository Romain Robbes Faculty of Informatics University of Lugano, Switzerland 1

Introduction The nature of information found in software repositories determines what we can infer from it. Researchers base themselves on popular repositories such as CVS and SubVersion despite their limitations. The number of available case studies is a determinant factor for choosing source code repositories. Versioning systems are designed to be used in a variety of contexts and hence must lower their assumptions about the objects they version. 2

Introduction This paper explores the benefits and drawbacks obtained by breaking the assumption that a popular repository must be used. Proposes a change-based software repository which fetches domainspecific information from an Integrated Development Environment (IDE) In this approach, changes are stored as first-class citizens This model better matches the actual evolution of a system since we reproduce how developers actually change the system. 3

Limitations of Versioning Systems Versioning systems are designed to be used in a variety of contexts and hence must lower their assumptions about the objects they version. Shortcomings: They are file-based and snapshot-based 4

Limitations of Versioning Systems File-based: Ubiqutous Not Domain-specific Heavy pre-processing must be performed to raise the abstraction level beyond file and lines 5

Limitations of Versioning Systems Snapshot-based Stores updates as deltas between two snapshots of the system Dependant on developer commits The larger the delta, the harder it is to distinguish individual changes Small deltas (stable) are not guaranteed Most commits are related to tasks: many changes are aggregated in a single transaction (changes muggled together) 6

Limitations of Versioning Systems Snapshot-based Impossible to determine if a change was performed at once or in a series of incremental steps in the same session. The exact sequence of changes between two versions is hard to recover; A further consequence for evolution research is these two effects reinforce each other with the practice of version sampling. 7

Change-based Software Repositories History of a software is not viewed as a sequence of versions, but rather a sum of change operations; Change operations can not be deduced with enough precision by differencing two arbitrary program versions; Change operations are recovered by monitoring the IDE usage; 8

Change-based Software Repositories Building a Change-based repository involves: Program representation Change operations Data retrieval 9

Program Representation How we represent a software system in a change-based repository to match the problem domain as closely as possible. Focus: object-oriented programs Entities stored: object-oriented constructs (classes, methods..) System representation: evolving abstract syntax trees (AST) of object oriented programs. 10

11

Program Representation A node a is a child of a node b if b contains a (a superclass is not the parent of a subclass, only packages are parents of classes). Nodes have properties. These properties vary with the type of node being considered, such as: Classes: name and superclass; Methods: name, return type and access modifier (public, protected or private, if the language supports them); Variables: name, type and access modifier. This abstract syntax tree represents one state of the program, i.e., one state the program went through during its evolution. 12

Change Operations Change operations represent the actual evolution of the system under study Change operations enable us to transition from one state of the evolving system to the next. Eg: adding a class to the system, removing a method, moving a class from one package to another, changing the implementation of a method, or performing a refactoring Types of Change operations: Atomic Composite 13

Atomic Change Operations Operations on the program s AST Creation: creates a node n for an entity of a given type t. Addition: adds a node n as a child of a given parent p. Removal: removes a node n from the children of its parent p. Property change: changes the value v of a property p of node n. Atomic change operations are executable. By iterating on the list of changes we can generate all the states the program went through during its evolution. Sufficient to model the evolution of programs. 14

Composite Change Operations Atomic changes operations groupped into higher-level change operations associated with a more abstract meaning; Specific order of execution; Eg: moving a class from one package to another consists in: first removing it from the old package, and then adding it to the new package. These two change operations can be grouped in a single, higher-level move class change operation 15

Composite Change Operations Developer-level actions: This set of changes constitute a unit from the developer s point of view. Eg: changing the definition of a class, adding or changing a Method; Refactorings: Refactorings are behavior-preserving code transformation. Bug fixes: Changes necessary to fix a given bug. Development sessions: Changes done during a single development session by a developer. Features: Changes necessary to build a given feature of the system. 16

Data Retrieval The data repository we devised rely on the interaction with an Integrated Development Environment (IDE). Advantages: User activity can be processed as it happens; Time stamps can have an up to the second precision; It is possible to be notified of a variety of events to make the analysis more precise. 17

SpyWare Implemented in an IDE plug-in for the Squeak Smalltalk environment SpyWare monitors the activity of the IDE user and store code modifications in a changebased repositor, alongside other useful information, such as refactorings or navigation paths through the code SpyWare also allows to exploit this information through interactive data reports and visualizations. It also can regenerate the code of the project at any given point in time, and offers dedicated code browsers for this task. 18

SpyWare Limitations Developer-level actions, refactorings performed by refactoring tools, and development sessions are detected automatically; We have not yet attempted to recover other changes (manual refactorings, bug fixes, features) in an automated way, but plan to. 19

Case Study Application of Change-Based Mining for Refactoring Detection SpyWare Project X 20

Case Study Limitations identified in diferente approaches: The focus is on refactorings which change the interface of classes; work best when the entities are not modified by other changes in the same session Identify refactoring based on manual search, comment logs or release notes for establishing a precision measurement 21

Questions What kind of refactorings were used? Are they interface-changing refactorings detected by other approaches? Were several refactorings applied to the same entities in the same session, making them harder to detect? Were refactorings performed alongside other changes, complicating their detection in the process? 22

Types of Refactorings Large impact on the interface of the class Little impact on class interfaces No changes to the interface of a class, only to its implementation 23

Multiple Refactorings During Sessions 24

Refactorings and Method Modification We computed how often a method was involved in several refactorings: For Project X, no method was involved in several refactorings. For SpyWare, we found 28 occurrences Most were affected only two or three times, but a few methods were affected by four or five refactorings. Sessions tended to have several methods associated with several refactorings. Refactorings such as extract method and extract expression to temporary often were applied several times to one method, as well as inlining refactorings: 15 times in total. 25

Refactorings and Method Modification 26

Results Refactorings are at the interface-level such as in Project X are easier to detect, corroborating the results found in practice by other approaches. Refactorings at the method body level such as in SpyWare are harder to detect, as they tend to come in groups and cause more body-level modifications,. Incidentally, they are not the focus of most refactoring-detection approaches. As for the usability of our repository, its accuracy and change-based model allowed us to query the data in a straightforward way, We think a change-based representation supporting composition is well suited for this type of analysis. 27

Change-based versus snapshot-based repositories It is Domain-specific Less information loss Changes are first-class citizens Rare case studies Memory and speed requirements Non-code repositories 28

Relations to Software Configuration Management It is NOT a new approach for SCM! 29

Conclusions By monitoring IDE usage instead of relying on the checkin/checkout model of versioning systems, a change-based repository can capture more accurate information about an evolving system, and represent its evolution as a sequence of changes rather than several successive versions. 30

Mining a Change-Based Software Repository Dúvidas? Obrigado! 31