Remote Backup Features


Extending Areca with Remote Backup Features

Author: Delong Yang
BSc (Hons) Computer Science
Third Year Project
Supervisor: Dr. Goran Nenadic
May 2010
The University of Manchester
School of Computer Science

ABSTRACT

The advance of computer systems and the overall increase in data usage have created a need to restore data when disaster strikes. Backing up files is a method that aids recovery when data loss occurs. The evolution of technology has enabled new kinds of backup media to be created: small and portable, such as USB pens; large and spacious, such as tape storage systems; and finally, remote storage devices. Because files are dispersed across different media, a solution is required that consolidates these files and backs them up using a single piece of software.

My final year project, entitled Extending Areca with Remote Backup Features, aimed to explore ways to access and back up remote storage. Areca is an existing open-source solution for which I chose to implement additional features. The code is freely available on SourceForge. As a student at the University of Manchester, it made sense to develop a project that helped me, and hopefully others, to initiate and run backups away from their usual desks.

This project was designed to incorporate useful features, with the ultimate goal of being able to seamlessly access and back up files that are located remotely. In addition, a scheduler function will be included to automatically start backups at predefined times.

This report details the background research on backup software, the justification for selecting Areca, the industry-wide protocols available to access remote servers, the evaluation of the finished product and, finally, a conclusion.

ACKNOWLEDGEMENTS

First and foremost, I would like to thank my supervisor, Dr. Goran Nenadic, for his support and sound advice. My thanks extend to Goran's research group. I would also like to thank my friends for their positive input, and my family for instilling confidence and for their continuing support.

TABLE OF CONTENTS

Chapter 1 Introduction: Motivation of Project; Aim and Objectives; Brief Outline of Achievements; Report Outline
Chapter 2 Background: Brief Overview on Backup Evolution; General Backup Solutions; Limitations of Existing Solutions; Potential Issues during Backup; Locating the Files; Read/Write Permissions; File Integrity; Summary
Chapter 3 Requirements: Problem and Proposed Solution; Constraints of Solution; Risks Associated with Implementing Features; User Requirements; Summary
Chapter 4 Design: Areca Architecture; Object-oriented Design of the Extension Code; UML Diagrams; Abstract Data Types; Graphical User Interface; Existing User Interface; The Target Window; The Sources Tab; The Scheduler Tab; The Remote Account Manager Window; Summary
Chapter 5 Implementation: Programming Environment; Existing Code Base; Overview of the Existing Code; Resource Manager; Inheritance; Generics; Peculiarities; Integrating the New Features; Check File Tree; Apache Commons Virtual Filesystem; Scheduler; A Hybrid Target; Simple Encryption Module; Remote Account Security Implications; Summary
Chapter 6 Testing and Evaluation: Overall Testing Structure; Standard Testing Techniques; Automated Testing Techniques; Unit Tests; Black and White Box Testing; Automated Testing; Functional Tests; Load Tests; Regression Tests; Connection Stability Tests; Testing Results; Evaluation of Project; Summary
Chapter 7 Conclusion: Objectives Review; Project Reflection; Future Development Ideas; Final Conclusion
Chapter 8 References
Appendix A: Gantt Chart
Appendix B: Initial Use Cases
Appendix C: Online Questionnaire
Appendix D: Test Results and Graphs

TABLE OF FIGURES

Figure 1.1: Conventional method of accessing remote storage
Figure 1.2: Proposed method of accessing remote storage
Figure 2.1: A comparison of features of backup software researched
Figure 3.1: A use case diagram depicting a user initiating a remote backup
Figure 3.2: A table of functional requirements
Figure 3.3: A table of non-functional requirements
Figure 4.1: An overview of the main classes in Areca
Figure 4.2: The original class AbstractFileSystemMedium, which inspired the creation of the new class RemoteFileSystemMedium
Figure 4.3: A Java abstract data type for the class RemoteFileSystemDriver
Figure 4.4: Areca graphical user interface (main window)
Figure 4.5: The original Sources tab
Figure 4.6: Option to add only a file or directory
Figure 4.7: The new Sources tab
Figure 4.8: The Check File Tree method to add sources
Figure 4.9: An initial design of the Scheduler tab
Figure 4.10: The Scheduler tab with a scheduled job
Figure 4.11: The Scheduler tab without a scheduled job (controls disabled)
Figure 4.12: A mock-up of the Remote Account Manager window
Figure 4.13: The finalised Remote Account Manager window
Figure 4.14: The proposed output of a serialised RemoteAccount object in an XML format
Figure 5.1: An example of not using generics/parameterised types
Figure 5.2: An example of exploiting the use of generics/parameterised types
Figure 5.3: An example of a long line in the existing code base
Figure 5.4: A derived example of a long line being separated onto different lines
Figure 5.5: An excerpt showing the creation of new instances of objects
Figure 5.6: Checking the previously selected file paths in the tree
Figure 5.7: Associating a target with a scheduler
Figure 5.8: HybridFileSystemTarget inheritance hierarchy
Figure 6.1: An error message from Areca as it was executed on a Mac OS X system
Figure 6.2: Assert fail when testing the hourly schedule
Figure 6.3: Assert pass after bugs were fixed
Figure 6.4: Excerpt of how the file sizes are checked against the threshold
Figure 6.5: The memory comparison between the original and new version over a 12hr period
Figure 6.6: The CPU usage of backing up three targets over a 12hr period
Figure 6.7: An error accessing a remote source
Figure 6.8: An excerpt of handling error windows in the automated test scripts
Figure 6.9: An excerpt showing how to check for error messages in Areca
Figure 6.10: The hard disk space usage during the backup of three targets
Figure 6.11: The automated test cases
Figure 6.12: The evaluation response of the implemented system
Figure 6.13: The functional requirements met in this project
Figure 6.14: The non-functional requirements met in this project
Figure 7.1: The time spent per week on the project

Chapter 1: Introduction

This chapter explains the motivation and direction of the project. It gives an overview of what was expected of the project and what has been achieved. A brief description of the protocols that are referred to frequently in this report is given below.

SFTP: Secure File Transfer Protocol allows files to be transferred securely between remote machines [1].
FTP: File Transfer Protocol allows files to be transferred between remote machines without security. The insecurity is due to the authentication details being sent across the network in plain text.
SMB: Server Message Block can be used to share files between remote machines. The Windows operating systems are commonly associated with using this protocol to share files between networked computers [2].

1.1 Motivation for Project

The increasing use of computers has led to an ever-growing amount of data used and created in computer systems worldwide. As society becomes increasingly reliant on digital information, the need for backing up is indisputable. There are many factors that could put data at risk: viruses could erase the contents of the hard drive; hardware malfunctions could prevent data from being retrieved; human error (e.g. misplacing the USB pen that contained important information) or natural disasters could damage or remove hardware. Without some form of backup, lost data cannot be recovered.

In the early periods of computing, a backup was produced by the tedious process of manually copying files from a source to a destination. For instance, a storage medium such as a floppy disk would have sufficed as a physical backup copy. Even ignoring the limited capacity of a floppy disk, storing these physical copies would have been cumbersome and expensive. In the current computing era, hard drive space is relatively cheap; it is much more viable to hold digital copies of files than, for example, to store a collection of punch cards in large filing cabinets.

Storing files digitally has many advantages. First, it allows files to be copied without much effort, and most importantly, it enables files to be accessed remotely. Nodes connected in a network allow one node (a client) to connect to another (a server) without a physical cable connection. Utilising this connection, files can be transferred. The methods by which users access remote storage are broad; protocols such as File Transfer Protocol (FTP) and Secure File Transfer Protocol (SFTP) are still widely used to create connections between local machines (clients) and remote machines (servers), and it is useful to exploit these protocols to provide users with a simple option to back up files from remote servers.

As an example, an occasion may arise when a user wishes to back up their remote files from different servers within an organisation. These servers are all connected through the same network, and files are accessed after correct authentication to the server. The usual backup software would run and perform general backing-up activities (see footnote 1) on the local workstation, but the files that need to be backed up would be located elsewhere. As a result, separate software such as an FTP client would be used to transfer these remote files to the local workstation, and only then would a local file backup be performed. It is arguably inconvenient for users to back up files from multiple remote servers in this way.

Consider Figure 1.1. The FTP client is required to transfer files from a remote server to the local drive, and then the process of backing up starts.

Figure 1.1: Conventional method of accessing remote storage

Footnote 1: General backup software should include full backup, incremental backup and file recovery features.

This project aimed to relieve users of this manual and ad hoc manner of transferring remote files to local storage. The idea of remote backups is more feasible given the rising Internet connection speeds (bandwidth) offered by Internet Service Providers. Areca was judged to be a suitable candidate to extend with these features, which will be a contribution to open-source code. The final result should act as an intermediary between local and remote files through one central point of control, which will be the extended version of Areca.

Figure 1.2: Proposed method of accessing remote storage

Using the method illustrated in Figure 1.2, it is possible to back up files from a range of media. USB pens connected to the local computer and CDs inserted into the CD-ROM drive are accessed through the operating system, while many FTP, SFTP and SMB servers can be traversed through a single, unified solution. This significantly reduces the overhead of backing up each medium individually (e.g. accessing one server and copying the necessary files).

1.2 Aim and Objectives

The aim of this project was to develop a piece of software that carries out general backing-up activities with the additional features of accessing and backing up files located on remote servers, scheduling these backups, and an improved file restore. The remote files will be accessed through widely used communication protocols.
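The single point of control in Figure 1.2 implies that every source, local or remote, can be addressed in a uniform way. The following is a minimal sketch of that idea in Java, assuming sources are written as URIs such as sftp://host/path. That notation is an assumption of this example and not necessarily how Areca stores its sources. The class SourceLocator, the enum Kind and both methods are hypothetical names; the port numbers are the conventional defaults for each protocol.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper: decides how a backup source should be reached. */
class SourceLocator {

    /** Kinds of storage a source may live on (illustrative, not Areca's types). */
    enum Kind { LOCAL, FTP, SFTP, SMB }

    /** Classifies a location by its URI scheme; a bare path counts as local. */
    static Kind classify(String location) {
        String scheme = URI.create(location).getScheme();
        if (scheme == null) {
            return Kind.LOCAL; // e.g. /home/user/docs
        }
        switch (scheme.toLowerCase(Locale.ROOT)) {
            case "file": return Kind.LOCAL;
            case "ftp":  return Kind.FTP;
            case "sftp": return Kind.SFTP;
            case "smb":  return Kind.SMB;
            default:
                throw new IllegalArgumentException("Unsupported scheme: " + scheme);
        }
    }

    /** Conventional default port for each protocol, or -1 for local storage. */
    static int defaultPort(Kind kind) {
        switch (kind) {
            case FTP:  return 21;
            case SFTP: return 22;  // SFTP runs over SSH
            case SMB:  return 445; // SMB directly over TCP
            default:   return -1;
        }
    }

    public static void main(String[] args) {
        String[] sources = {
            "/home/user/docs",
            "ftp://files.example.org/reports",
            "sftp://user@server.example.org/home/user",
            "smb://fileserver/share/projects"
        };
        for (String source : sources) {
            Kind kind = classify(source);
            System.out.println(source + " -> " + kind + " (port " + defaultPort(kind) + ")");
        }
    }
}
```

Keeping the classification in one place means the rest of a backup tool only needs to ask which kind of source it is dealing with, instead of scattering protocol checks throughout the code.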

A decision was made not to write a backup solution from scratch (i.e. start development of a simple backup solution and slowly add the necessary features), but instead to build on top of an existing solution that has only the general features of a backup solution implemented. This allows resources to be better dedicated to developing the outlined features. It also means that the design and construction of a graphical user interface (GUI) does not have to be one of the primary concerns.

The primary objectives to be integrated into Areca were:
- A single interface to remote storage devices (i.e. a server, a cluster of servers, etc.)
- Scheduling capabilities
- An improved file restore method

The single interface to remote storage devices would need to support a variety of protocols so that a wide range of servers is accessible. Therefore the SFTP, FTP and SMB protocols will be implemented.

The scheduler would allow users to set a time at which to run a backup and would have options to be enabled and disabled. The scheduler would run on a separate thread from the main application, effectively introducing multi-threading into Areca.

Some applications can restore a file only to a directory that is different from the one where the file was originally backed up. There are advantages and disadvantages to this method. One advantage is that it allows the user to recover the file into a convenient directory, so they do not have to delve into multiple directory levels. Alternatively, the user may simply not remember where the original file was placed, and giving the user a choice of directory lets them know exactly where the restored files are located. Another advantage is for the developer of the software, who does not need to worry about locked files. A locked file is essentially a file that is either in use or under special permissions (e.g. hidden in Windows).

When a user tries to restore a file into its original destination, the OS prevents external (i.e. non-kernel) applications from accessing and overwriting the file, thus causing the entire restore operation to fail. This is most prominent when a system file is being restored. Since the OS will be running, it is likely that the system file will be in use. Therefore, when the application tries to copy the system file to its original location, the operation will fail as the file is locked and cannot be overwritten.

1.3 Brief Outline of Achievements

During the course of the project, an extended version of Areca was produced. This new application is capable of accessing remote servers using the FTP, SFTP and SMB protocols. The method by which users select files to back up from these remote servers is the same as selecting files from the local drive, thus providing a familiar and seamless way of accessing remote files.

Unfortunately, because the elaborate and intricate nature of Areca was underestimated, it was difficult to dedicate time solely to development throughout the project. As a result, the proposed improved file restore method was not implemented, because the majority of the time was spent on understanding the existing code base. A large portion of time was also spent on carefully planning the integration of new code into the existing code.

1.4 Report Outline

This chapter has explained the necessity and importance of backups. The ways in which the problem is tackled will always be diverse in order to make it convenient for users to perform backups. The report is organised into six further chapters as follows:

Chapter 2 explores the general use of backup software and explains the background research carried out. It also details how Areca was chosen as a suitable candidate on which to build features.
Chapter 3 provides an overview of the system requirements and of any foreseeable constraints on the system.
Chapter 4 illustrates the design stages and the choices made. The design of both the code and the graphical interfaces is included.
Chapter 5 provides a detailed description of the code that has been written.

Chapter 6 describes the in-depth testing that has been carried out on the system. It also provides scope for new features that could be implemented in the future. An evaluation of the solution is also presented.
Finally, Chapter 7 discusses the objectives that were met and future ideas that could be undertaken, and concludes the report.
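The scheduling objective from Section 1.2 (a backup that starts at a set time, runs on its own thread, and can be enabled or disabled) can be sketched with the standard java.util.concurrent scheduler. This is an illustrative sketch only and not the scheduler implemented in the project; the class BackupScheduler and its methods are hypothetical. It assumes a daily schedule and a fixed 24-hour period, which ignores daylight-saving changes.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical daily scheduler that runs a backup task off the main thread. */
class BackupScheduler {
    // A single worker thread, separate from the thread that calls enable().
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> job; // null while the schedule is disabled

    /** Time from 'now' until the next occurrence of 'runAt' (today or tomorrow). */
    static Duration delayUntil(LocalDateTime now, LocalTime runAt) {
        LocalDateTime next = now.toLocalDate().atTime(runAt);
        if (!next.isAfter(now)) {
            next = next.plusDays(1); // today's slot has passed
        }
        return Duration.between(now, next);
    }

    /** Enables the schedule: run 'backup' at 'runAt', then every 24 hours. */
    synchronized void enable(LocalTime runAt, Runnable backup) {
        disable(); // replace any existing schedule
        long delayMs = delayUntil(LocalDateTime.now(), runAt).toMillis();
        job = executor.scheduleAtFixedRate(
                backup, delayMs, TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
    }

    /** Disables the schedule without interrupting a backup already running. */
    synchronized void disable() {
        if (job != null) {
            job.cancel(false);
            job = null;
        }
    }

    synchronized boolean isEnabled() {
        return job != null;
    }

    /** Stops the worker thread so the application can exit. */
    void shutdown() {
        executor.shutdownNow();
    }

    public static void main(String[] args) {
        BackupScheduler scheduler = new BackupScheduler();
        scheduler.enable(LocalTime.of(2, 0), () -> System.out.println("Running backup..."));
        System.out.println("Enabled: " + scheduler.isEnabled());
        scheduler.disable();
        System.out.println("Enabled: " + scheduler.isEnabled());
        scheduler.shutdown();
    }
}
```

Separating the delay calculation into a static method that takes the current time as a parameter makes the timing logic testable without waiting for a real clock.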

Chapter 2: Background

This chapter provides a background to the history of backups and to the alternative backup solutions that are available. It also explains common backup issues and the different requirements of home and enterprise users.

A backup can be described as copying one set of data onto a different medium such that a historical record can be made, e.g. day1backup, day2backup, etc. [3]. This has the advantage that if, for example, data loss occurs because a malicious virus deletes all the records created on day 2, the information can be recovered by restoring the data saved in day2backup. Usually, data and files are stored for later processing or for archiving purposes.

In a large organisation it is often a System Administrator's job to maintain backups of systems, but in smaller organisations this role may not be assigned. If the role is given to an employee who has less experience in making backups, they might fail to see the different approaches to backing up data; a naive way would be to make full hard drive backups at the end of every day. This would be infeasible for many large companies, as data could change drastically every second, minute or hour. A system crash shortly before the backup was due to start would cause a whole day's work to be lost.

It is generally considered excellent practice to back up files at regular, convenient and purposeful intervals. For large companies, backup strategies [4] are vital to ensure that resources (such as network bandwidth) are not tied up during busy or critical periods of business. A backup strategy dictates when to carry out a backup, which backup level to use, which files to back up, and where to store the backed-up files. As the solution was primarily aimed at personal use, a backup strategy does not need much thought, but it is still a relevant notion to consider.

Some approaches to backing up (also known as backup levels) are explained first [5]:

Differential backup: this backs up files that have been changed since the last full backup. This method is particularly useful for backing up the latest files.
Incremental backup: this backs up any files that have changed (i.e. by modified date) since the last backup of any type. This method is also a good way of backing up the latest files.

Full backup: this backs up all the selected files. This has the immediate advantage that every file is backed up, regardless of modification times, access times or file size differences, so the administrator is sure that the selected files are always backed up. However, this is also a disadvantage. In a scenario where only half of the files have actually changed, it could be argued that 50% of the resources (i.e. hard disk space, network bandwidth, time and effort) have been used inefficiently, and effectively wasted.

It is important to understand these basic backup levels in order to create a systematic and streamlined approach to backing up. Most of the backup software that was researched offered these three choices of backup level. The following text explains the process of researching the features of backup applications and, subsequently, finding a suitable candidate to extend.

2.1 Brief Overview on Backup Evolution

In the 1950s, the Universal Automatic Computer [6] provided the first glimpse into computerised backups. These were large systems that stored data on magnetic tapes. The tapes had a capacity of 1 megabyte (MB), which would have been a respectable capacity for data storage at the time. The popularity of magnetic tapes grew, and businesses and home users started creating tape backups to prevent complete data loss.

The introduction of hard disks in 1956 [7] offered the public an efficient and quick way of storing and retrieving data at a higher storage density. A decade later, this would generate competition between companies over whether to proceed with tape or hard disk backups. The debate over which medium is best is still ongoing.

Between 1960 and 1980, the arrival of floppy disks seemed revolutionary [8]. The would-be large capacity of 250MB disks, together with their portability and cost-effectiveness, evoked widespread satisfaction and acceptance. As floppy disks were relatively cheap, small businesses and home users found them an ideal medium on which to store their backups.

Other portable backup media such as CD/DVD-ROMs, Flash storage (USB pen drives) and, quite recently, Blu-ray discs are still in use as backup storage media. That said, it is fair to say that CD-ROMs are relatively low in capacity compared to Blu-ray discs, and are thus becoming increasingly obsolete [8].

Evolving computer technology brought the concept of network storage as a medium for storing backups. The introduction of Local Area Networks (LANs) created a link between computers connected by cables so that they exist in a network. This allowed computers to store their files, using the LAN connection, on Network Attached Storage (NAS) and Storage Area Network (SAN) devices [8]. This permitted remote access to files for computers that were connected to the same network.

The most interesting technology relating to my project was the appearance of data transfer protocols such as the File Transfer Protocol in 1985 [8]. It bridged a gap between computers on differing networks so that they could transfer files. Many data transfer protocols were introduced, and these protocols will be put into practice in the implementation stages (discussed later in chapter 5.3.2).

Recently, the availability of both bandwidth and online services has led to a new paradigm of backing up. The general term is cloud backups [9]. Online storage services such as Amazon S3 [10] allow users to store their personal files online so that they are accessible. The interaction is carried out through the World Wide Web; using a browser, users can upload, retrieve and maintain their files. Given a suitable desktop client that automatically scans files for changes and seamlessly connects and uploads to the online storage device, backing up to remote storage becomes a reality.

There are already several products that allow online backup. For example, SpiderOak [11] is a freeware, cross-platform application that offers users a limited capacity of 2GB of online storage. For additional storage, users can purchase a monthly storage subscription. Once a set of files has been chosen to be backed up, it continuously scans the local filesystem for changes, and if a change has been detected, the application automatically copies the selected files onto online storage. Another application that offers similar services is called SugarSync [12].

The main disadvantages of cloud backups include:
- If the online storage goes offline, backups and retrievals are not possible
- If the online storage company decides to cease operations, data may be lost
- Extreme dependence on the Internet connection (reliability, speed, bandwidth)
- Potentially very high expense if large volumes of backups are made (e.g. monthly subscription fees, charges for exceeding quota on the online storage server, violation of Fair Usage Policy bandwidth limitations)

Although the disadvantages of cloud backups are quite discouraging, the advantages far outweigh them. Listed below are the advantages of using cloud backups:
- Consolidated storage of files (e.g. removes the need for backing up on separate hard disk drives if one runs out of space)
- No physical space required to store the backups (e.g. no need for a dedicated server in the house/office)
- Relatively cheap, as the cost of backup services is spread across a year
- Reliable (online storage space is normally off-shored and mirrored across different data centres, e.g. Amazon S3 [ref])
- Backed-up files can be accessed from any computer that has a browser and an Internet connection

The concept of cloud backups was ideal for backing up files to remote storage. However, the requirement of additional software to access remote files defeated the idea of using a single interface to access remote files.
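The three backup levels defined at the start of this chapter differ only in the point in time against which a file's modification date is compared. The following Java sketch makes that difference explicit under simplifying assumptions: each file is represented only by its name and last-modified time, and the times of the last full backup and the last backup of any type are already known. The class BackupLevels and its method are hypothetical and are not taken from Areca.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of how backup levels choose files. */
class BackupLevels {

    enum Level { FULL, DIFFERENTIAL, INCREMENTAL }

    /**
     * Returns the names of the files a backup at the given level would copy.
     * lastFull is the time of the last full backup; lastAny is the time of
     * the last backup of any type.
     */
    static List<String> select(Map<String, Instant> lastModified, Level level,
                               Instant lastFull, Instant lastAny) {
        Instant cutoff;
        switch (level) {
            case DIFFERENTIAL: cutoff = lastFull; break; // changed since last full backup
            case INCREMENTAL:  cutoff = lastAny;  break; // changed since last backup of any type
            default:           cutoff = null;            // FULL: no cutoff, take everything
        }
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Instant> file : lastModified.entrySet()) {
            if (cutoff == null || file.getValue().isAfter(cutoff)) {
                selected.add(file.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Instant> files = new LinkedHashMap<>();
        files.put("a.txt", Instant.ofEpochSecond(10)); // older than both backups
        files.put("b.txt", Instant.ofEpochSecond(30)); // changed after the full backup
        files.put("c.txt", Instant.ofEpochSecond(50)); // changed after the latest backup
        Instant lastFull = Instant.ofEpochSecond(20);
        Instant lastAny = Instant.ofEpochSecond(40);
        for (Level level : Level.values()) {
            System.out.println(level + ": " + select(files, level, lastFull, lastAny));
        }
    }
}
```

In this example a full backup copies all three files, a differential backup copies b.txt and c.txt, and an incremental backup copies only c.txt, which shows why the full level uses the most resources when few files have changed.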

2.2 General Backup Solutions

In order to better understand the choice of backup software available to users, research was carried out into several solutions to gather a broad range of generic features that exist in a wide variety of applications. The range of backup solutions included well-known and branded software packages such as Apple Time Machine [13], as well as open-source ones such as Snap Backup [14]. Comparing backup solutions at different ends of the spectrum allowed a careful study of their feature sets and made it possible to find the median in terms of functionality. A comparison chart, included later as Figure 2.1, exhibits the features that exist in my research set of backup solutions. From the chart, readers can quickly gain an indication of which features of a backup solution are the most popular.

It is sometimes stated that the loss of data for a personal user is much more acceptable than data loss for a large business; this subjective statement implies that the value of retaining data after a system crash is much higher for a business than it is for a personal user. Without being philosophical, this exposes a clear distinction between the purposes of a backup solution for personal use and for enterprise use.

One attribute that some businesses absolutely require is speed of recovery. This is an extremely expensive requirement to meet. The software package that is usually bought [15] to aid swift recovery of files will most likely be a commercial product sold at a premium price. The hardware will normally consist of faster and larger hard disk drives (possibly configured as RAID, see footnote 2), the most efficient and quickest processing units, and plenty of RAM. It is usual for businesses to have a rack of servers dedicated to backing up their important data. In an enterprise environment, support for disaster recovery is imperative. Enterprise solutions must also be flexible, so that the administrators (of the software or IT systems) can devise and tailor backup strategies. The software must be resource-efficient and feature-packed, and must offer administration options such as a backup manager control interface. For personal users, the straightforward requirement that their data is backed up somewhere safe is perhaps the most desired one, i.e. speed of recovery may not be necessary.

Footnote 2: RAID (Redundant Array of Independent Disks) allows many hard disks to be connected together so that they act as one hard drive, thus increasing the disk space. Consequently, this allows disks to be mirrored for instantaneous backup [16].

There are many backup solutions available for both personal (home) and enterprise use. It is important to state that this project's functions were primarily designed for personal use; however, the software could be used in enterprises provided that it is used purely for backing up files and not for restoring them. The reason is that enterprises deal with vast amounts of sensitive files, and restoration of these files will be monitored to make sure that only the right departments can read them. In the existing solution, the file restore method is neither the most efficient nor the most secure (this will be discussed in detail later in the report), and it will simply not suffice, as it was not initially designed for speed or for enterprise use.

Areca does, however, fulfil the requirements of a personal backup solution. It provides a no-frills feature set, as well as being open-source. Therefore, it is feasible for home users to run the software on as many computers as they want. Being open-source also means that Areca invites a community of software developers to report and fix bugs and, consequently, to improve the reliability of Areca and implement additional features without charging users for software updates.

From the research that was undertaken, it was apparent that there was an enormous quantity of backup software available that carried out the general backup activities expected by personal users, as well as being viable candidates for implementation of the proposed features. The criteria for selecting a viable candidate to extend further were:
- Whether it was open-source (such that the code could be edited)
- Whether the code was in a familiar language, or in a language that could be learnt quickly
- Whether it had a graspable amount of code to work with
- Whether it had a simple set of backup features already implemented (i.e. no scheduler and no interface to remote storage)

2.2.1 Limitations of Existing Solutions

The feature that was missing from the set of existing solutions that were researched was access to remote storage. This was one of the most intriguing features that this project attempted to provide. Knowing that it was an inaugural attempt to provide such a feature in Areca became a strong motivation. Admittedly, it is perhaps not one of the most pertinent features to implement for any kind of backup software. However, having access to remote storage in a backup application certainly seems useful and convenient for end-users.

From the comparison chart in Figure 2.1, it was apparent that the smaller projects did not have built-in schedulers. Some solutions relied on the native job scheduler of the OS: Windows Task Scheduler on Windows and cron on Unix systems [17]. The exclusion of such features eased the jobs of software developers. It was also apparent, perhaps obviously, that there was a correlation between the number of features implemented and the number of software developers involved in producing the product.

Figure 2.1: A comparison of features of backup software researched

Explanations of the comparison chart shown in Figure 2.1 are given below:

Easy to install
Software was considered easy to install if it came with a familiar installer for its intended operating system (e.g. InstallShield installers for Windows [18], RPM for Linux, DMG files for Mac OS X, etc.).

Lightweight
Software was considered lightweight if it was uncomplicated to set up and use, i.e. it does not have many components to install before use.

Amanda
Amanda [ref] was an open-source piece of software that boasts of being the most popular open source backup and recovery software in the world [19]. The major downside was that the freeware version did not include a graphical user interface (GUI).

Apple Time Machine
Time Machine [13] was a visually appealing backup system developed by Apple Inc. The software included automatic scheduling of backups, and it was easy to install as it came pre-installed with the operating system (Mac OS X). Its disadvantages include not being cross-platform and being closed-source. This means that only owners of Apple-branded machines can make use of Time Machine.

Bacula
This versatile network backup solution was ideal for administrators who control and maintain a network of computers [20]. It was cross-platform and open-source, so it was a cheap solution for large businesses (cheap only in terms of money, as it could be time-consuming to set up due to the various modules of Bacula that need to be installed). However, this application would not be ideal for the general user, as a multitude of executables needed to be installed in order for it to function correctly.

DirSync Pro
DirSync Pro [21] was very easy to install and deploy, as it was a simple and lightweight executable on Windows. However, DirSync Pro was arguably not a backup solution, as its motto states that it synchronizes the content of one directory with another. Given the time constraints, it would therefore have been difficult to implement a backup framework in addition to the proposed features.

Snap Backup
This application was small in terms of its feature set [14]; however, it was cross-platform, free and easy to install. It lacked expected features such as incremental backups and file restoration, which made it unsuitable as a candidate.

Areca
Areca [17] is an open-source project that allows backups of personal files to be made easily and reliably. One lead programmer, Olivier Petrucci, developed the majority of the code, with some other classes being added by various SourceForge members.

I chose Areca as the candidate on which to implement my features because it already had a suitable set of features written but, most importantly, it did not provide access to remote storage, scheduling of backups or a file restore method.

2.3 Potential Issues during Backup

As one of the proposed features promised to access remote files, it was necessary to acknowledge issues that may occur in a system initially designed to operate on a local filesystem. Firstly, understanding the issues that could arise was essential for producing solutions; secondly, it was important to design and write code that disallowed incorrect behaviour. The following sections explain the issues that may emerge when backing up remote files.

Locating Files

The difficulties in locating files during a backup are worth explaining, especially when remote files are involved. An unreliable connection to the servers that host the remote files is one cause of file-not-found errors. Another cause is relocation: some files stay in the same directory for the rest of their existence, while others change directories quite frequently. If a user selects fileA to be backed up from location D1 and, for some reason, fileA is later moved to location D2, the backup will fail because fileA is not in its expected location D1.

Trivial as it may seem, a backup can fail because files are moved around by users, or even by applications. It is therefore imperative that the backup can still proceed, and it is logical to help users understand any errors by adding a warning to the log, or by presenting an error dialog that states the reasons for the failure. Any other files that do exist in their expected locations should be backed up as normal.
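The behaviour described above (warn about missing sources, but still back up everything else) can be sketched in a few lines of Java. This is only an illustration under assumptions: the class MissingSourceSketch, its Result type and the warning text are hypothetical and are not taken from Areca.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Areca. */
final class MissingSourceSketch {

    /** Outcome of checking the selected sources before a backup starts. */
    static final class Result {
        final List<Path> toBackUp = new ArrayList<>();
        final List<String> warnings = new ArrayList<>();
    }

    /**
     * Splits the selected sources into those that still exist and those
     * that have been moved or deleted since the user selected them.
     * Missing sources produce a warning instead of aborting the backup.
     */
    static Result partition(List<Path> selectedSources) {
        Result result = new Result();
        for (Path source : selectedSources) {
            if (Files.exists(source)) {
                result.toBackUp.add(source);
            } else {
                result.warnings.add("Source not found, skipped: " + source);
            }
        }
        return result;
    }
}
```

The warnings could then be written to the log or shown in a dialog, while the paths in toBackUp are processed as normal.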

The problem of relocated files is worse when remote servers are considered. For instance, virtual directories [22] attempt to provide user-friendly URLs that map onto long or complex physical locations. If, for any reason, the virtual directory link changes, then all references to the URL become incorrect, and file-not-found errors will occur. Although virtual directories are managed by the servers themselves, it is important for the software to deal with missing files appropriately, i.e. to show an error dialog and not crash.

Read/Write Permissions

Read/write permissions are an important topic when dealing with files. However, for brevity, the full extent of the problems that can arise from file permissions will not be presented in this report.

Strictly speaking, it is the responsibility of the user to set or alter file and directory permissions. However, modern operating systems tend to apply file permissions automatically, so users do not have to set permissions explicitly on every file that they create, which would be tedious. Operating systems try to make file permissions as easy as possible to understand and apply, and therefore implicitly attach certain permissions to the files that users create. This means that read/write permissions are relatively straightforward for local files on a home computer. The user has their own directory (also known as user space, or home directory) within the OS filesystem, and has read/write access to any files and directories that they create.
An undemanding user may only want to back up their files to another local destination, possibly an external hard drive or a different partition on the hard drive. In that case read/write permissions are not an immediate issue, as the user has local access: if they can access the drive, they can usually store files on it. This is not always true, however. For example, backups whose destinations are OS-only directories, such as the root user's home directory (e.g. /var/root on Mac OS X), are likely to fail because they require root access [23].

Read/write permission problems are multiplied when accessing or storing remote files. For example, the user account that logs onto the remote server must have the relevant permissions to read certain files and directories, and these permissions could change frequently (e.g. user privileges downgraded after a recent attack on the server [24], or write permissions revoked because the file server is running out of hard disk space). Unpredictable backup behaviour may therefore be introduced.

File Integrity

A file's integrity relates to its size, name and other attributes (e.g. file permissions, checksum) remaining intact when it is copied between destinations. Integrity could also imply the authenticity of the file; however, for the purposes of my project, authenticity will not be dealt with. The integrity of a file is maintained when, for example, fileA is backed up and stored as fileB, and the size, name and attributes of fileA are equal to those of fileB.

The most obvious indicator of file integrity is the size of the file; if the file sizes match, then it is quite likely that the two files are the same. However, there could be instances where erroneous bytes are written to the destination file and, coincidentally, produce a file of identical size (i.e. wrong bytes replacing the correct bytes). A checksum [25] gives a much stronger indication of integrity, because even a one-bit difference in the transferred file will, with very high probability, produce a hash value that differs from the hash value of the original file.

The integrity of files is harder to maintain across remote transfers. Remote connections may be unstable and could introduce invalid bits into the connection stream, causing incorrect bits to be written to the file. Integrity may also be lost when file attributes cannot be reproduced correctly, with heterogeneous operating systems being a contributing factor (e.g. translating Unix file attributes to Windows file attributes [26]). A file's integrity can be checked using Cyclic Redundancy Checks (CRC), file checksums (e.g. MD5) and Simple File Verification (SFV) [27] tools and techniques.
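As a rough illustration of the size and checksum comparison discussed above, the following Java sketch compares an original file with its backed-up copy. The class and method names are hypothetical, and MD5 is used only because it is the example named in the text; this is not a description of how Areca verifies archives.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical integrity check for illustration; not part of Areca. */
final class IntegrityCheckSketch {

    /** Streams the file through an MD5 digest and returns the hash bytes. */
    static byte[] md5(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
            return digest.digest();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /**
     * Cheap check first (file size), then the checksum. Equal sizes alone
     * are not sufficient, because wrong bytes can replace correct ones
     * without changing the length of the file.
     */
    static boolean sameContent(Path original, Path copy) {
        try {
            if (Files.size(original) != Files.size(copy)) {
                return false;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return MessageDigest.isEqual(md5(original), md5(copy));
    }
}
```

A backup could call sameContent on each transferred file and log a warning when it returns false. File attributes such as permissions would need a separate comparison.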

2.4 Background Summary

The choice of backup solutions will continue to grow. The large amount of backup software available allows each application to cater for a specific market: either users who require a general feature set, or a subset of users who need specific features that are less useful to the general population. The by-product is choice for users.

In the research conducted, none of the applications offered an easy way to access remote storage, and most of the open-source applications did not have a built-in scheduler. In this project, the single interface to remote storage is intended to be the selling point.

Having identified the potential problems that may arise during a backup, it is important to translate them into functional requirements. The next chapter will focus on gathering user requirements, and on assessing the functional and non-functional requirements.

Chapter 3: Requirements

This chapter describes the user requirements and gives an overview of what is expected of the final solution. Broadly speaking, backup software is simple in terms of its desired inputs and outputs. However, in order to develop backup software that consumers will actually use, it must be flexible in terms of features and provide a sensible range of backup options to the user. The scope of this project envisions users who have files in multiple remote locations and wish to back up these files along with their existing local data. Areca only backed up local files and therefore required new features to access remote storage. This chapter reflects the objectives to be met when extending the existing solution.

Figure 3.1: A use case diagram depicting a user initiating a remote backup

3.1 Problem and Proposed Solution

The user may have files located locally, remotely on a cluster of servers, or on some other storage (e.g. an external USB drive). The general problem is that a user who wants to back up remote files faces the monotonous task of manually copying them onto a local drive, and then using backup software to back up these copies along with their other local files.

The solution was to create a single interface to remote storage devices so that, for example, backing up an FTP directory would require the same actions from the user as backing up a local directory from a local drive. This requires the user to supply server credentials to access the remote storage media.

The desired solution would take an existing solution called Areca and extend its feature set so that the SFTP, FTP and SMB protocols can be used to access remote files. Local and remote files would therefore be retrieved and backed up by one central solution. In addition, a scheduler would be needed to run backups at regular intervals. The scheduler would run alongside the GUI of Areca instead of as a background process, such as a daemon [28].

3.2 Constraints of Solution

Backing up locked or in-use files might prove difficult. Trial and error would be needed to uncover the issues that arise across the different operating systems, and detailed research would be required for each of them. This was an extremely complex subject and, given the limited time allocated for the project, it was not possible to devise feasible solutions to eradicate this constraint.

Remote account objects are serialised to local files, and the passwords to remote storage devices are encrypted. The constraint is that the passwords to these remote accounts are not kept out of reach, e.g. on a dedicated secure server that houses a database of remote account credentials.
This implies that, should an attacker get hold of a local file that contains a remote account object, the password is vulnerable to being decrypted into plain text. A larger issue becomes apparent when the encryption key (or master key) is known. All passwords are encrypted using the same master key, and if this key is exposed, then all the remote accounts are susceptible to being illegally accessed. To counter this constraint, a long key will have to be used, so that a brute-force attempt to decipher the passwords takes an impractically long time [29].

Risks Associated with Implementing Features

The risks involved in developing any project are worth considering. An important point to consider was whether to compromise a feature for better stability or for enhanced functionality. For example, if a newly introduced feature reduced memory usage by 10% but, as a consequence, increased the possibility of crashes, then it would be fair to say that the feature was not worth implementing.

In order to compile a fully comprehensive risk assessment, one needs to know the system well enough to gauge accurately the problems that might be forthcoming. In my case, I did not know the system to be extended well to begin with; the system being the underlying code and other files that had yet to be discovered. Hence, the risk levels presented in the tables (Figures 3.2 and 3.3) were my initial predictions of the risks involved in implementing each feature. It was imperative that the new solution did not introduce undesirable effects on existing features, i.e. break any features that Areca originally had.

No. | Functional Requirement | Priority | Risk Level
1 | Create a full backup of selected remote files/directories. | High | High
2 | Create an incremental backup of selected remote files/directories. | High | Low
3 | Access and backup files from FTP servers. | Medium
4 | Access and backup files from SFTP servers. | High | Medium
5 | Access and backup files from SMB servers. | High | Medium
6 | A built-in scheduler to schedule backups to run at regular intervals. | High | High
7 | Implement a file restore method. | High | Medium
8 | A mechanism of saving user preferences/remote accounts. | Medium | Low
9 | Be cross-platform (support Microsoft Windows, Mac OS X and Linux). | Low | Medium

Figure 3.2: A table of functional requirements

No. | Non-functional Requirement | Priority | Risk Level
1 | The GUI must be simple to use and easy for the user to understand. | High | High
2 | The software must be easy to install. | High | Low
3 | The software must be lightweight. | Medium
4 | If a file fails to restore, the system should provide the user with options. | High | Medium
5 | Installed software should be easy to access (e.g. menu shortcuts, a visible and recognisable icon). | High | Medium
6 | The source code should be open. | High | High
7 | Provide an uninstaller. | High | Medium

Figure 3.3: A table of non-functional requirements

3.2.2 User Requirements

An early decision was taken to implement features so that the system "just works", i.e. the system contains a core (or general) set of features and carries out these activities without much intervention by the user.

An important consideration was whether to design the system for speed or for functionality. Systems that are responsive and do the right thing quickly are likely to gain positive feedback from users. However, as time was limited, I had to sacrifice some speed in my features, i.e. they were not designed to be the quickest possible methods. As Areca was written in Java, one of the options was to rewrite it in a different language. If a fast backup system had been the main priority, the entire project might have been written in C++, as many programmers argue that Java is slower than C++ [30]. For this project, however, functionality was prioritised above performance, and the advantages of continuing the project in Java were far greater.

3.3 Requirements Summary

Since the problem of backing up is broad, it was hard to pinpoint exactly what the user requirements were. However, the specific feature of accessing remote storage made it possible to define a typical user and to derive the requirements from that definition. As I am planning to build my features on top of an existing solution, it is assumed that at least some of these requirements have already been met. Although implementing my features will affect the functional and non-functional features of the original solution, the intention to retain these requirements remains.

The next chapter will deal with the crucial design choices involved in accessing remote storage, integrating a scheduler object into the existing solution, and offering a file restore method.

Chapter 4: Design

This chapter walks through the stages in which the features of the proposed extension were designed. It gives a simple introduction to the technologies and techniques adopted, and illustrates the graphical interfaces that have been designed as additions to Areca.

Some specific keywords will be used later in this report to describe certain aspects of Areca, and the most notable keywords are explained below.

Source: a source is a file or directory that is selected by the user to be backed up. A source is a single entity that represents a single file or directory.

Target: a target is an entity that couples a number of sources together. A target can be thought of as a container for multiple sources. A target has a name. When a target is backed up, all the sources in the target are backed up.

Group: a group is a collection of targets. A group of targets can be backed up so that the user does not have to back up each target individually. A group also has a name so that it can be distinguished from other groups.

Repository: the destination directory for the backed up files.

These keywords were devised in the original solution and their meanings have not been changed in the extension.
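The relationship between these keywords can be pictured with a small Java model. This is only a sketch under assumptions: the class names and fields are hypothetical and do not mirror Areca's actual classes, and attaching the repository to a target is a simplification made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the keywords above; not Areca's real classes. */
final class BackupModelSketch {

    /** A single file or directory selected for backup. */
    static final class Source {
        final String path;
        Source(String path) { this.path = path; }
    }

    /** A named container of sources, written to one repository. */
    static final class Target {
        final String name;
        final String repository; // destination directory (assumed to belong to the target)
        final List<Source> sources = new ArrayList<>();
        Target(String name, String repository) {
            this.name = name;
            this.repository = repository;
        }
    }

    /** A named collection of targets that can be backed up together. */
    static final class Group {
        final String name;
        final List<Target> targets = new ArrayList<>();
        Group(String name) { this.name = name; }

        /** Backing up a group means backing up every source of every target. */
        List<Source> allSources() {
            List<Source> all = new ArrayList<>();
            for (Target target : targets) {
                all.addAll(target.sources);
            }
            return all;
        }
    }
}
```

For example, a group containing two targets with two sources and one source respectively would yield three sources from allSources().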

4.1 Areca Architecture

One of the most difficult tasks during the design stage was deciding where to integrate the remote access feature. A good overview of the current system was required, and the advice of relevant design patterns had to be heeded. It was important to recognise how Areca was architected, so a diagram has been drawn to show a high-level view of the significant objects used.

Figure 4.1: An overview of the main classes in Areca

The additional classes to be implemented on top of the original version have been drawn with dashed lines. These classes will be discussed further in Chapter 5.

The diagram in Figure 4.1 was drawn by studying the code carefully and has been abstracted to a level that is easy to understand. Given the complex structure of Areca, it was obvious that there was a large amount of source code. This meant that careful planning and design had to be done before development started.

4.2 Object-oriented Design

There were several reasons why an object-oriented (OO) approach was used for the programming aspects of the project. OO design allows programmers to visualise real-life objects and convert them into their equivalent software objects. The fundamental difference is that real-life objects are tangible, and many factors (e.g. humans, other physical objects, the laws of physics) can affect them. In software, objects have attributes that describe them and operations that change them, such as engine.setRPM(18000), which sets the object engine to 18,000 revolutions per minute.

Besides the natural [31] way of converting real-life objects into software objects, another reason the project was geared towards an OO implementation was that the existing solution was already written in Java. It would have been extremely time-consuming to rewrite the entire existing solution in a procedural language such as C.

UML Diagrams

In order to understand the existing solution fully, a UML package diagram was drawn. This gave a concise overview of the existing solution, and indications of where (i.e. in which Java classes) to integrate the new features. As an example, Figure 4.2 shows a UML class diagram that models the class AbstractFileSystemMedium, which later helped in designing the extended RemoteFileSystemMedium.

Figure 4.2: The original class AbstractFileSystemMedium (left), which inspired the creation of the new class RemoteFileSystemMedium (right)

Abstract Data Types

It was useful to produce Abstract Data Types (ADTs) that model the operations and attributes allowed on certain objects, especially objects that were new additions to the existing code. These objects needed to be fully compatible with the existing solution. The use of ADTs helped build a concept on which an implementation could be built on top of the existing solution.

However, it could be argued that it is difficult to define the preconditions, postconditions and constraints fully without first understanding the code behind the solution, so preliminary ADTs do not contain sufficient detail. In other words, it is futile to write the preconditions, postconditions and constraints without reading at least some of the source code that governs that particular method or class. Understanding the code gives an insight into what the actual preconditions, postconditions and constraints are, and why they exist. Therefore, terse ADTs were constructed to begin with, and they became increasingly comprehensive as more code was explored.

public abstract class RemoteFileSystemDriver
        implements FileSystemDriver<FileObject> {
    private static long MAX_FILE_SIZE;
    private static boolean USE_BUFFER;

    public void applyMetaData(FileMetaData p, FileObject f) {}
    public boolean canRead(FileObject file) {}
    public boolean canWrite(FileObject file) {}
    public boolean createNewFile(FileObject file) {}
    public boolean delete(FileObject file) {}
    public boolean exists(FileObject file) {}
    public InputStream getCachedFileInputStream(FileObject file) {}
    public OutputStream getCachedFileOutputStream(FileObject file) {}
    ...
} // class

Figure 4.3: A Java abstract data type for the class RemoteFileSystemDriver

In Figure 4.3, the ADT describes how a RemoteFileSystemDriver should be modelled and used.

4.3 Graphical User Interface

In order to fulfil the requirements set out in section 3.2, a functional and presentable graphical user interface (GUI) was vital to the success of the project. The existing solution already housed a well-built and functional GUI. However, any changes to the original source code that composed the GUI could have introduced unforeseen bugs later in development.

For this reason, extensive design and development of new GUIs did not take place, the main reason being that this project did not warrant much change to the original GUI. The intention of providing a seamless method of accessing remote storage, and a transparent transition between backing up files from a local directory and from a remote directory, should not mean a complete overhaul of the graphical design. In fact, the opposite applies. The GUI should not have many noticeable changes, because previous end-users are expected to be accustomed to the original GUI.

Existing Graphical User Interface

As the GUI was deemed fully working, there was no need to alter its original form. However, some designs were revised, such as grouping buttons together in the Target Window to allow space for the additional controls that access remote filesystems. Some of the window captions and titles have also been changed. For instance, the Target Window used to be called Target Edition (Figure 4.5), and as this title did not make much sense, it was renamed to Target (Figure 4.7).

The following figure displays the main GUI. As much of the code behind the GUI has been modified, it was important to maintain its overall look and functionality.

Figure 4.4: Areca graphical user interface (main window). The annotations mark a group called Demo Targets and the targets that have been backed up.

4.3.2 The Target Window

The Target Window is essentially a window that allows a user to select and edit a target. A target can be described as a set of files that is compiled into one archive, and this archive can then be restored. Restoring an archive is a feature that copies all the previously selected files and pastes them into a backup destination directory. The window contains a list of options: General, Sources, Compression, Description, etc. The following sections describe what needs amending in this window. A tab is an entry in the list; please see Figure 4.5.

Figure 4.5: The original Sources tab

The separation of these tabs allows low coupling in the source code. For example, if another arbitrary feature such as Target Priority was added, any bugs in the Target Priority tab should not affect the other tabs.

The Sources Tab

The Sources tab lies within the Target Window, and it allows a user to select a set of files and directories through two separate buttons. The Sources tab has already been presented earlier (Figure 4.5). When the user clicks on the Add button, the dialog in Figure 4.6 is shown:

Figure 4.6: Option to add only a file or directory

In the existing Areca GUI a user can only add one directory at a time, or one file at a time. If the user wanted to add a file and a directory, it would require the following mouse-click trace:

Add → File → <Select file> → Save
Add → Directory → <Select directory> → Save

If, for instance, the user wanted to select five different directories and five different files, it would require (4 * 5) + (4 * 5) = 40 mouse clicks. This is a very tiring method of selecting a diverse set of files and directories.

A more efficient way of selecting sources was proposed: a file/directory listing in which users can select an entire directory, a single file, or a mixture of files and directories. This gives the user more flexibility, as the user does not have to add a file, then a directory, and so on, one at a time.

Comparing Figure 4.5 with Figure 4.7, the new Sources tab contains extra buttons that provide extra functionality: adding a remote source file or directory, and an account selector that allows the user to browse the files on a selected remote account.

Figure 4.7: The new Sources tab

The buttons are now more readable, as more text is used to describe the action that each one performs. For example, Add Local Source loads the check file tree (explained in the next section) and starts listing the files on the local filesystem. The button Add Remote Source is used in conjunction with the combo box that appears to its right. It loads a new check file tree and starts listing the files of whichever remote account is selected in the combo box.

The Check File Tree GUI

As Figure 4.6 showed a clear limitation when selecting multiple files, a tree containing the file listing of a filesystem was added to the existing solution as a clear and intuitive way to support the selection of dispersed files. The checkbox next to each file or directory lets the user select the desired item. The tree is flexible, as users can select an entire directory, a subset of files within that directory, or individual files that are dotted across the filesystem.
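The selection rule of such a tree can be sketched without any GUI code. In the hypothetical model below (not Areca's implementation), each node carries a checked flag; a checked directory is added as a single source, and unchecked directories are searched for individually checked items.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a checkbox file tree; not part of Areca. */
final class CheckTreeSketch {

    /** One file or directory shown in the tree, with its checkbox state. */
    static final class Node {
        final String path;
        final List<Node> children = new ArrayList<>();
        boolean checked;

        Node(String path) { this.path = path; }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Returns the paths that would become backup sources when Save is pressed. */
    static List<String> selectedSources(Node root) {
        List<String> selected = new ArrayList<>();
        collect(root, selected);
        return selected;
    }

    private static void collect(Node node, List<String> selected) {
        if (node.checked) {
            // A checked directory covers everything beneath it.
            selected.add(node.path);
            return;
        }
        for (Node child : node.children) {
            collect(child, selected);
        }
    }
}
```

With this rule, ticking a whole directory and one file elsewhere produces two sources, rather than one source per file inside the directory.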

Figure 4.8: The Check File Tree method to add sources

In Figure 4.8, the file banner.png is selected. To back up more files (or directories), a simple click on a checkbox places a check in it. When the Save button is pressed, all checked items are added as backup sources. The tree is simple to use; there are only two buttons for the user to interact with. In addition, the hierarchical listing of files is a tidy way of portraying the filesystem.

The Scheduler Tab

A scheduler schedules a target to run at specified times. A scheduler was associated with a target, instead of a group of targets, because this gives higher cohesion between a scheduler and a target and, equally importantly, keeps the source code highly cohesive in line with the High Cohesion design pattern [32]. An early design of the Scheduler tab is illustrated in Figure 4.9.

Figure 4.9: An initial design of the Scheduler tab

Figure 4.10 shows the scheduler enabled; the next backup time is displayed to inform the user of the next scheduled backup.

Figure 4.10: The Scheduler tab with a scheduled job

Figure 4.11: The Scheduler tab without a scheduled job (controls disabled). The label correctly states that the next backup is not applicable

As expected, when a schedule is not activated, the next backup time is not applicable (N/A). Disabling the controls for time and date follows human-computer interaction practice: as the schedule is not activated, the user should not be able to select a frequency, start date or start time.

A scheduled backup can run every hour, day, month or year. If the start time is set to a minute after the current time on the same date, then the backup starts within the hour.

As the existing solution was very graphically oriented, it would have been infeasible to have only one thread running. Therefore, the design of the scheduler incorporates multi-threading in order to manage user input (e.g. mouse clicks on buttons) and Areca processing (e.g. executing a backup and monitoring schedules) at the same time. The implementation of the scheduler is further explained in section
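One possible way to compute the "next backup" time shown in the Scheduler tab is sketched below. It assumes the four frequencies named above and uses the standard java.time API; the class name, the enum and the method are hypothetical and are not the implementation used in the project.

```java
import java.time.LocalDateTime;

/** Hypothetical next-run calculation for illustration; not part of Areca. */
final class NextRunSketch {

    enum Frequency { HOURLY, DAILY, MONTHLY, YEARLY }

    /** Returns the start time shifted forward by n periods of the given frequency. */
    static LocalDateTime shift(LocalDateTime start, Frequency frequency, long n) {
        switch (frequency) {
            case HOURLY:  return start.plusHours(n);
            case DAILY:   return start.plusDays(n);
            case MONTHLY: return start.plusMonths(n);
            default:      return start.plusYears(n);
        }
    }

    /**
     * Finds the first scheduled time strictly after 'now'. Each candidate
     * is computed from the original start time, so a monthly schedule that
     * starts on the 31st does not drift after passing a shorter month.
     */
    static LocalDateTime nextRun(LocalDateTime start, Frequency frequency, LocalDateTime now) {
        long periods = 0;
        LocalDateTime candidate = start;
        while (!candidate.isAfter(now)) {
            periods++;
            candidate = shift(start, frequency, periods);
        }
        return candidate;
    }
}
```

A background thread could sleep until the returned time, run the backup, and then call nextRun again, leaving the GUI thread free to handle user input.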

4.3.3 The Remote Account Manager Window

A remote account is an account that can access a remote server. A remote account (i.e. a RemoteAccount Java object) records the protocol used to communicate with the server and the user credentials needed to connect to it. An initial design is provided in Figure 4.12.

Figure 4.12: A mock-up of the Remote Account Manager window

The additional window shown in Figure 4.13 is called the Remote Account Manager. This window provides functions to manipulate remote accounts, such as add, edit and delete. The Remote Account Manager is also able to test a remote account, i.e. to check that the user authentication passes, that the user has read/write permissions, and that the host can actually be reached.

This was one of the most substantial GUI designs undertaken. All of the design and implementation was completed from scratch. Merging such a chunk of code with the existing code was initially complicated and demanding, but following several design patterns, such as low coupling, high cohesion, factory, information expert and façade, helped throughout the various stages of software engineering [32].
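Of the three checks performed when testing a remote account, the simplest is whether the host can be reached at all. The sketch below shows one way to do this with a plain TCP connection and a timeout. It is an assumption-laden illustration: the class and method are hypothetical, and the authentication and permission checks are not covered because they depend on the protocol library in use.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical connectivity check for illustration; not part of Areca. */
final class HostReachabilitySketch {

    /**
     * Tries to open a TCP connection to host:port within the timeout.
     * Returns false on refusal, timeout or an unknown host, so the GUI
     * can report the problem instead of crashing.
     */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

For the SFTP account used as an example in this chapter, the call would look like canConnect("sftpserver.serve.co.uk", 22, 5000).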

Figure 4.13: The finalised Remote Account Manager window

The Remote Account Manager window is represented by an object called RemoteAccountManager, which contains methods that carry out the actions offered by the buttons. The RemoteAccountManager object relates very closely to the Remote Account Manager window; the latter is a graphical representation of the object.

Each remote account listed as shown in Figure 4.13 is serialised into an XML file (i.e. the contents of a RemoteAccount object are written to a file and can be de-serialised to generate a replica of the object). The XML is designed to be output as shown in Figure 4.14.

<remoteaccount>
    <id>unique_id</id>
    <protocol>sftp</protocol>
    <hostname>sftpserver.serve.co.uk</hostname>
    <username>myname</username>
    <password>encrypted_password</password>
    <port>22</port>
    <defaultpath> </defaultpath>
    <isencrypted>true</isencrypted>
</remoteaccount>

Figure 4.14: The proposed output of a serialised RemoteAccount object in an XML format
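A minimal sketch of how an account could be written in the layout of Figure 4.14 is given below, using the XML APIs that ship with the JDK. The Account class and its field names are hypothetical stand-ins for the RemoteAccount object, and the password is assumed to have been encrypted already; the project's actual serialisation code may differ.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical serialiser for illustration; not the project's code. */
final class RemoteAccountXmlSketch {

    /** Stand-in for the RemoteAccount object described in the text. */
    static final class Account {
        String id, protocol, hostname, username, encryptedPassword, defaultPath;
        int port;
        boolean encrypted;
    }

    /** Builds a DOM tree with the element names of Figure 4.14 and prints it. */
    static String toXml(Account a) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("remoteaccount");
            doc.appendChild(root);
            append(doc, root, "id", a.id);
            append(doc, root, "protocol", a.protocol);
            append(doc, root, "hostname", a.hostname);
            append(doc, root, "username", a.username);
            append(doc, root, "password", a.encryptedPassword);
            append(doc, root, "port", Integer.toString(a.port));
            append(doc, root, "defaultpath", a.defaultPath);
            append(doc, root, "isencrypted", Boolean.toString(a.encrypted));

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialise the account", e);
        }
    }

    /** Adds one child element; special characters are escaped by the serialiser. */
    private static void append(Document doc, Element parent, String name, String value) {
        Element child = doc.createElement(name);
        child.setTextContent(value);
        parent.appendChild(child);
    }
}
```

Building the document through the DOM rather than by string concatenation means that characters such as & in a username are escaped automatically.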

4.4 Design Summary

The design stage was relatively straightforward; not many new graphical interfaces needed to be constructed, and the original GUI was deemed usable. Integrating the tree used to select sources involved adjusting several classes in order for it to work, and designing the new interfaces, such as the Scheduler tab and the Remote Account Manager, required careful study of the existing code in order to emulate the design of the original version.

In addition to this emulation, the design has created the opportunity to back up files from remote servers. The uncomplicated GUIs allow users to add remote accounts and initiate remote backups easily.

The next chapter discusses the technical implementation of the entire project and provides a detailed insight into how the new code was integrated with the existing code.

Chapter 5: Implementation

This chapter explains the implementation of the project and, consequently, the extension of Areca's features. The implementation included complex integration of my code with the existing code, and this chapter aims to describe the intricacies of that integration and the problems that were encountered. In addition, as Areca was an existing solution, agile development was adhered to. This meant that early prototypes containing the newly developed features were added and the entire system was tested.

5.1 Programming Environment

Continuing the development in Java meant that the popular integrated development environment (IDE) Eclipse could be used. Eclipse is an open-source, fully featured IDE that allows developers to write not only Java programs but also projects in many other languages, such as the Atlas Transformation Language [33]. In addition, a dedicated guide to building the source code for Areca in Eclipse was posted on Areca's Wiki page [34].

5.2 Existing Code Base

The existing code base was freely available on sourceforge.net and it contained a hefty amount of code. A glance at the diagram explained in section 4.1 (Figure 4.1) immediately gave an indication of the size of the upcoming task. The vast existing code base meant that a substantial amount of work had to be undertaken to decode the objects themselves and the relationships between them.

The entire Java project seemed to be packaged well; all the source files had a header stating that it is a GNU licensed project, and a signature from the original author. These headers have been respected. Extra information, such as the author of new edits and the date on which the file was edited, has also been included.

One of the first hurdles to overcome was the sheer amount of code that had to be understood. Understanding what the code meant was easy enough (with some exceptions); the challenge was understanding the effects it had on other objects, how these effects manifested themselves as local or global changes, and how they interacted with the multitude of objects that had not yet been discovered.

Overview of the Existing Code

This section delivers an overview of the existing code base. It is important to analyse the existing code base critically, as starting with a poor solution would greatly hinder the process of implementing the new features.

Resource Manager

In the existing code base, there was the notion of a resource manager. The resource manager's job was to read labels from a resource bundle. A resource bundle in Java can be described as a set of objects that contain locale-specific data, e.g. language translation files [35]. In Areca, the resource bundle gave titles to the windows, button text and other graphical elements presented on screen, inherently allowing Areca to be translated into many different languages³.

Overall, the ResourceManager object in the existing code base fulfils the Singleton design pattern: it is held in a static field and is only instantiated when necessary (i.e. it is instantiated at application start-up, and other classes can then invoke RM.getInstance() to obtain the same resource manager object).

³ There is a respectable number of translations available for Areca [36].
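The combination of a singleton and a resource bundle can be sketched as follows. This is an assumption-based illustration rather than Areca's ResourceManager: the class name, the label keys and the in-memory DemoLabels bundle are hypothetical, and a real application would normally load a properties file for the user's locale with ResourceBundle.getBundle.

```java
import java.util.ListResourceBundle;
import java.util.MissingResourceException;
import java.util.ResourceBundle;

/** Hypothetical singleton label reader; not Areca's ResourceManager. */
final class ResourceManagerSketch {

    private static ResourceManagerSketch instance;
    private final ResourceBundle bundle;

    /** Private constructor: instances are only created via getInstance(). */
    private ResourceManagerSketch(ResourceBundle bundle) {
        this.bundle = bundle;
    }

    /** Creates the single instance on first use and returns it afterwards. */
    static synchronized ResourceManagerSketch getInstance() {
        if (instance == null) {
            instance = new ResourceManagerSketch(new DemoLabels());
        }
        return instance;
    }

    /** Looks up a label; a missing key is shown in brackets instead of failing. */
    String getLabel(String key) {
        try {
            return bundle.getString(key);
        } catch (MissingResourceException e) {
            return "[" + key + "]";
        }
    }

    /** In-memory bundle with made-up keys, standing in for a translation file. */
    static final class DemoLabels extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                {"targetwindow.title", "Target"},
                {"sources.addlocal", "Add Local Source"},
            };
        }
    }
}
```

Any window class could then call ResourceManagerSketch.getInstance().getLabel("targetwindow.title") to obtain its title, and a translation would only require a different bundle.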

Inheritance

The existing code makes use of Java's polymorphism capabilities. The use of abstract classes, interfaces and general inheritance techniques meant that it was easier to write new classes as subclasses of generic base classes. Inheritance has been used in the additional classes in order to follow the standard that had been set.

Generics

Generics in Java allow a type or method to operate on objects of various types while providing compile-time type safety [37]. Even after a recent update [38] to the existing code base, it was very surprising to see hardly any generics used (generics were first introduced in Java 1.5 in 2004 [39]). This meant that, upon compilation, more than a thousand warnings from the Eclipse IDE were presented, stating: unsafe type operation involved raw types! Consider the following example:

    import java.util.ArrayList;

    ArrayList list = new ArrayList();     // Raw type: accepts any object.
    String name = "Delong Yang";
    int uid = 1;
    list.add(name);
    list.add(uid);                        // Autoboxed to an Integer.

    String strName = (String) list.get(0); // Correct cast to String.
    String strUid = (String) list.get(1);  // Incorrect cast: element 1 is an Integer.

Figure 5.1: An example of not using generics/parameterised types

Running the above code would cause a run-time exception:

    java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String

However, using generics, we can catch the error at compile time.

    ArrayList<String> list = new ArrayList<String>();
    String name = "Delong Yang";
    int uid = 1;
    list.add(name);
    list.add(uid); // Causes a compile-time error.

Figure 5.2: An example of exploiting the use of generics/parameterised types

As shown in the above code, the error is caught much earlier. The last line, list.add(uid), gives an error stating that the object list, parameterised by String objects, cannot take an argument of type int, and therefore the value cannot be added. Using this early warning system prevents run-time type errors that can instigate a wide array of problems, including system crashes. Generics have been used in the new code, and type parameters have been added to the existing code that contained unparameterised objects (e.g. HashSet and ArrayList).

Peculiarities

Having walked through the code, some aspects could be classed as poor engineering practice.

Talking to Strangers

According to Larman [32], the notion of talking to strangers is where one object calls a method that returns another object and then, on the same line, invokes another method on the returned object. This is best illustrated by an example (Figure 5.3).

    context.getTaskMonitor().getCurrentActiveSubTask().addNewSubTask(remaining, "backup-main");

Figure 5.3: An example of a long line in the existing code base

There are several problems with this approach. Ultimately, understanding this single line is difficult because of its poor readability; it would have been much easier to read if each method call had been placed on a separate line, as shown in Figure 5.4.

    TaskMonitor monitor = context.getTaskMonitor();
    TaskMonitor currentTask = monitor.getCurrentActiveSubTask();
    currentTask.addNewSubTask(remaining, "backup-main");

Figure 5.4: A derived example of a long line being separated onto different lines

Software engineers who are new to the project are often confused about what each method returns, because the intermediate objects are hidden within the chain of calls. The example in Figure 5.4 shows clearly the object returned by each method and is, subjectively, more readable. As readability was poor, understanding even a small part of the existing code took far longer than expected.

Abstract Classes

There were a few classes in the code base that were labelled as abstract. One interpretation of an abstract class is a class that has some abstract methods and some completed methods. However, the existing abstract classes did not comply with this interpretation; they were usually fully implemented. Although not detrimental to the functionality of the software, this added another stage to the learning curve of the existing code.

Long Lines

Long lines are lines of source code that require the reader/developer to scroll across the screen in order to read them. It is seldom the case that such a line of code is

understood correctly and completely in one glance. During development, this task was made harder because scrolling across the screen to read an entire line happened frequently, which decreased productivity.

Eclipse Standard Widget Toolkit (SWT)

Eclipse SWT widgets are graphical elements for Java projects. They boast efficiency through the use of native GUI elements, open-source status and cross-platform capabilities [40]. The problem arises when trying to use SWT programmatically. Prior experience of developing GUIs had taught that once a graphical control such as a text box has been instantiated, it can then be re-used. Under that assumption, clicking on remoteAccountA would populate textBoxHostName with remoteAccountA's host name. Similarly, clicking on remoteAccountB would populate the same textBoxHostName. However, this was not the case. In the Remote Account Manager GUI mentioned in chapter 4.2.3, the initial thought that one set of text boxes, labels, buttons, etc. could be used for all the remote accounts was wrong. Each remote account listing needed a new set of controls associated with it. This meant that, for each remote account, a new instance was created for each graphical control. An example of this is given in Figure 5.5.

For example:

    private void initNewAccountTab(Composite composite, RemoteAccount acc) {
        // Settings for the protocol type.
        element.lblProtocol = new Label(composite, SWT.NONE);
        element.cboProtocolTypes = new Combo(composite, SWT.READ_ONLY);
        for (int i = 0; i < supportedProtocols.length; i++) {
            element.cboProtocolTypes.add(supportedProtocols[i]);
        }

        // Settings for the port number.
        element.lblPortNo = new Label(composite, SWT.NONE);
        element.txtPortNo = new Text(composite, SWT.BORDER);

        // Settings for the hostname.
        element.lblHostName = new Label(composite, SWT.NONE);
        element.txtHostName = new Text(composite, SWT.BORDER);
    }
    ...

Figure 5.5: An excerpt showing the creation of new instances of objects

The reason for not using a standard Java GUI platform, such as AWT or Swing [41], was that the entirety of the Areca GUI was already programmed using the SWT platform. After much testing of a Swing implementation of the check file tree mixed with the SWT implementation of Areca, it proved nearly impossible to get the two GUI platforms to work well together. Thread problems, modal error dialogues, and constant instability of Areca were encountered. The fundamental problem was that mixing and matching two different GUI platforms was a poor approach from the beginning. Therefore, the right decision was to continue with the same GUI framework so that the aforementioned issues were not encountered again.

French Comments

Commenting source code is always encouraged. It promotes good reuse of code and readability, which helps the developer understand or quickly recap what a method does.

One of the lead developers for Areca was French, and his sections of code were commented in French or in poorly translated English. This made the task of understanding the existing code base significantly harder.

Mac OS X Support

This section explains the difficulties in transitioning a project that was initially developed only for the Microsoft Windows platform, with cross-platform support existing only in theory. Earlier communication with the original developer confirmed that Areca had never been tested on the Mac OS X platform, and as such, this project was the first to encounter problems running on that platform. This incurred another problem: in addition to the integration of features, code that ran well on multiple operating systems (i.e. code that prevented cross-platform bugs) had to be written.

5.3 Integrating the New Features

This section explains how the new features have been integrated and justifies some of the sacrifices that had to be made.

5.3.1 Check File Tree

Visually, the check file tree is a tree listing of the current filesystem (either the local filesystem or a remote filesystem) that places a checkbox next to each file and directory. This allows the user to easily select files and directories from the same interface. The check file tree is essentially an implementation of an SWT widget [42] that has been adapted for the purposes of this project. The implementation was completed with help from a blog post [43]. The code has been placed into the com.areca.gui package, and the GUI is in the Sources tab in the Target Window. In the existing solution, after a user has selected a file, the file is referenced as a File object. In the tree listing, every selected file (or directory) is

stored as a String object. Storing the path as a string initially saved computation time and memory, but had many side effects for the existing code. The side effects included class cast exceptions, invalid object assignments and no-such-method exceptions. Thus, vast amounts of the existing code required changes to accept file paths as String objects. Classes that used File objects directly, i.e. by calling File methods, had to be adjusted so that each time they received a String object, they would create a File object from the given String, thus mirroring the original implementation. The consequence is the increased workload of creating new File objects from String objects. Using the same tree GUI to select both local and remote files provides a seamless user experience, as the motion of selecting local files is the same as for remote files.

    private void checkPathsInTree(HashSet<String> paths, CheckboxTreeViewer tree) {
        if (paths == null || tree == null) {
            return;
        }
        try {
            // Resolve each stored path back into a FileObject and tick its subtree.
            for (String path : paths) {
                FileObject f = VFSUtils.createFileObject(path);
                tree.setSubtreeChecked(f, true);
            }
        } catch (Exception e) {
            System.err.println("Exception in checkPathsInTree(): " + e.getMessage());
        }
    } // checkPathsInTree

Figure 5.6: Checking the previously selected file paths in the tree

The code listing in Figure 5.6 is an example of how to check previously selected files programmatically. Notice that the utility class VFSUtils is used to resolve a FileObject, because only the path of the file is stored and not the FileObject itself. There were downsides to the check file tree approach. One of the prominent problems was the delay in listing large directories. Many factors could have caused the delays, such as the local Internet connection speed, the remote server being busy, the virtual filesystem being unresponsive (the virtual filesystem is explained in 5.3.2), and, most importantly, the newly written code. The listing delay also affected the local filesystem. This was an intriguing point, as factors such as connection speed or remote system lag could not have affected the local listing. Even after a thorough inspection of the relevant classes, it was not possible to alleviate the delays. The only consolation was that once the file listings had been loaded for the first time, running the check file tree a second time in the same instance of Areca produced the listing much more quickly. This quick response was most likely due to the file listings being stored in RAM, i.e. cached for subsequent file listings. There could be instances where a file listing is cached and the structure in the filesystem then changes. As the file tree calls the method getChildren() on load, if this method detects a change, createContents() re-generates new objects to reflect the change in directory structure.

5.3.2 Apache Commons Virtual Filesystem

In order to make this a challenging project, the complex aspects of developing such an application had to be recognised and dealt with. For example, the use of the Java Native Interface (JNI) [44] would involve low-level programming to exploit the different communication protocols.
There are several libraries available that allow access to SFTP/FTP servers [45]; however, difficulties are exposed when errors occur while using these libraries. When library code is used, it is often difficult to pinpoint where the errors originated, who or what caused them, and how to take the necessary action. JNI would allow native access to these protocols (or rather, be able

to make use of libraries written in a lower-level language), hence providing a robust remote access solution.

During the design stages, two ways of accessing remote storage devices were considered. The first idea was to allow users to authenticate themselves to the servers and mount the drive in the operating system. Mounting can be described as attaching an additional filesystem to the currently accessible filesystem of a computer [46]. The advantage of mounting a remote server is that it acts like a local drive: it appears as a local disk attached to the computer. There are already applications that support this feature for a fee, for instance ExpanDrive [47]. The main problem with this approach was the requirement of JNI to access OS kernel calls to mount the new filesystem, and the different filesystem technologies used by different operating systems would have necessitated considerable research. The time constraints imposed on the project would certainly not have been sufficient to fulfil this approach. There was also the requirement (or expectation) that, since the project is written in Java, it should be cross-platform; however, companies such as Microsoft have argued that JNI breaks the cross-platform attribute [48].

The second idea was to allow users to authenticate themselves to the servers and have the check file tree GUI show the filesystem. An existing package written in Java, the Apache Commons Virtual Filesystem (VFS) [49], already provides support for creating a virtual filesystem. The advantages of the Commons VFS package include being open source and written in Java. This seemed ideal, as it allowed access to remote servers through pure Java code and thus kept in line with the current Areca development. The VFS is a toolkit that lets Java code create a filesystem through which remote files are accessible just like local files.
Unfortunately, the Commons VFS did not contain any utility code to help migrate file permissions from a remote server to a local file. Therefore, the decision was made to affix only the read, write and execute permissions to files transferred from remote servers, as the complexity of configuring the remaining file attributes was beyond the scope of the project.
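As an illustration of affixing only the read, write and execute permissions, the following sketch applies three Boolean flags to a local file using the standard java.io.File methods. It is a sketch under stated assumptions rather than the project's code: the class name PermissionCopier and the method applyBasicPermissions are hypothetical, and the three flags are assumed to have been read from the remote file's metadata beforehand.

```java
import java.io.File;

// Hypothetical helper: copies only basic permission flags to a local file.
final class PermissionCopier {

    private PermissionCopier() {
    }

    // Applies owner-level read/write/execute flags to the given local file.
    // Returns true only if the platform accepted all three changes.
    static boolean applyBasicPermissions(File local, boolean readable,
                                         boolean writable, boolean executable) {
        boolean ok = local.setReadable(readable);
        ok &= local.setWritable(writable);
        ok &= local.setExecutable(executable);
        return ok;
    }
}
```

Other attributes (ownership, groups, timestamps, access control lists) are deliberately left untouched here, which matches the scope decision described above. Note that some platforms do not support every change, in which case the corresponding setter returns false.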

5.3.3 Scheduler

The purpose of the scheduler was to allow users to schedule backups, and more specifically, to schedule individual targets. Using the java.util.Timer class allowed the implementation of the relevant scheduling objects (e.g. ScheduledTask, TargetScheduler, HourlyIterator, etc.). One of the complications of carrying out a scheduled backup was the entangled nature of threads. There was a need to have at least one thread running to maintain the usual operations of the system (e.g. monitoring button clicks), and one to run a backup once its scheduled time had arrived. Explicitly creating and running threads increased the complexity of this project considerably. The concept of concurrent threads was an exciting addition to Areca.

Prior to designing and implementing a scheduler controller from scratch, various articles available online were consulted to gauge possible solutions. A Java library called org.tiling.scheduling, written by IBM, was available, and it contained examples of how to utilise the java.util.Timer and java.util.TimerTask classes [50]. Modifying the open-source package allowed the creation of customised objects and paved the way for scheduling backups. An excerpt of the code that runs a scheduler for a target is given below in Figure 5.7:

    public void setScheduler(TargetScheduler givenSched, boolean runIfValid) {
        this.scheduler = givenSched;
        if (runIfValid) {
            if (this.scheduler.isEnabled() && this.scheduler.isInitialised()) {
                this.scheduler.start();
            }
        }
    } // setScheduler

Figure 5.7: Associating a target with a scheduler

In the above code excerpt, when a scheduler is set for a target, the scheduler is checked to see whether it is enabled and initialised (i.e. frequency, time of day and date all set) before it starts. runIfValid is a flag; if it is set to false, the scheduler will not start until the method start() has been explicitly invoked. If runIfValid is true, and the scheduler object satisfies the two preceding Boolean expressions, then the scheduler is started. This straightforward execution of a scheduler should provide ease of use for future development.

When two targets (i.e. backups) are scheduled to start at the same time, the first target that gains control of the backupThread will start first. The second target has to wait until the first target releases the backupThread (i.e. uses notify() to wake waiting threads). This means that there is a queue if multiple targets want to commence backups at the same time.

Background scheduling was not one of the primary features to implement. Although it would be useful for users, time could not be afforded to design a daemon that ran as a background process to initiate scheduled jobs.

5.3.4 A Hybrid Target

A hybrid target is a target that has the ability to back up both local and remote files. The set of classes that handled the backup of local files was extremely intricate, as a broad range of objects were interlocked, i.e. a plethora of object references were passed from one object to another. In order to implement a hybrid function that backs up both local and remote files, a drastic change to the existing code base was required. The existing solution worked entirely with local files, which were represented by the simple Java File object. The fundamental objects used in Areca needed adapting to accept and utilise a new object type that could also represent remote files. Fortunately, as generic base classes and inheritance were used in the existing solution, these features were exploited to help with the problem.

Figure 5.8: HybridFileSystemTarget inheritance hierarchy

Figure 5.8 gives a quick overview of what has been implemented. It shows that HybridFileSystemTarget is an extension of FileSystemTarget, which is in turn an extension of AbstractTarget. A FileSystemTarget object dealt with a collection of local source files that were chosen to be backed up, the medium (i.e. hard disk drive) to back up to, and other helper objects related to the backup target. The HybridFileSystemTarget essentially imitated the actions of its superclass, FileSystemTarget, but handled both local and remote files. The most significant advantage was that local backups could be deferred to the superclass methods, hence reusing code, while any remote backup activities were handled by overriding methods (i.e. newly written code). The change of context between local and remote was managed using a Boolean field in HybridFileSystemTarget that dictated whether the target was currently dealing with a remote context or a local context. If the Boolean value was set to true, the new code would be run; if it was set to false, the original code would be run.
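The delegation scheme just described can be sketched in a few lines. The classes below are simplified stand-ins and not Areca's real AbstractTarget, FileSystemTarget or HybridFileSystemTarget: the names SketchTarget, LocalTarget and HybridTarget, the backup method and its string-based return value are assumptions chosen purely to make the control flow visible.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base type, standing in for an abstract target.
abstract class SketchTarget {
    abstract List<String> backup(List<String> paths);
}

// Hypothetical local-only target, standing in for the original code.
class LocalTarget extends SketchTarget {
    @Override
    List<String> backup(List<String> paths) {
        List<String> log = new ArrayList<>();
        for (String path : paths) {
            log.add("local:" + path);
        }
        return log;
    }
}

// Hypothetical hybrid target: reuses the superclass for local work
// and runs new code only when the remote context flag is set.
class HybridTarget extends LocalTarget {
    private boolean remoteContext;

    void setRemoteContext(boolean remoteContext) {
        this.remoteContext = remoteContext;
    }

    @Override
    List<String> backup(List<String> paths) {
        if (!remoteContext) {
            return super.backup(paths);   // original behaviour, unchanged
        }
        List<String> log = new ArrayList<>();
        for (String path : paths) {
            log.add("remote:" + path);    // newly written remote handling
        }
        return log;
    }
}
```

Because HybridTarget is a subtype of SketchTarget, any method that accepts a SketchTarget parameter can be handed a HybridTarget without modification. One trade-off of a mutable context flag is that callers must remember to reset it; passing the context as a method argument would be an alternative design.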

Because HybridFileSystemTarget is a subtype of AbstractTarget, any method that required a parameter of type AbstractTarget could be passed a HybridFileSystemTarget instead of a FileSystemTarget. This meant that a cast was only needed if specific methods from either FileSystemTarget or HybridFileSystemTarget were required; otherwise, AbstractTarget's methods sufficed.

5.3.5 Simple Encryption Module

A straightforward class was written that allowed text (String objects) to be encrypted and decrypted using a private (hardcoded) master key. This class utilised the supplied java.security and javax.crypto packages. The main intention of this class was to provide uncomplicated methods that any novice programmer can make use of. For example, to encrypt a string, one would invoke the encrypt(str) method, where str is the String object to encrypt. This returns a String object containing the encrypted value. For decryption, one would simply invoke the decrypt(str) method, and again, a String object is returned, this time containing the decrypted value. The algorithm is DES (Data Encryption Standard [51]) and uses a key of size 320 bits. This encryption and decryption module has been used throughout the project to hide sensitive information from unauthorised users. As the information about backup targets is stored locally in plain text, it is vital that passwords are encrypted so that they cannot be read directly from the stored files.

5.4 Remote Account Security Implications

The remote account manager, whose GUI is described in chapter 4.2.3, stores each remote account in an XML file. These XML files are stored locally in the user preferences directory. The OS defines the preferences directory, e.g. C:\Users\<username> in Windows 7 and /Users/<username> in Mac OS X, where <username> is the name of the user.
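Since the passwords in these XML files rely on the encrypt(str) and decrypt(str) methods from the Simple Encryption Module section, a minimal sketch of such a pair may help. It is not the project's implementation: the class name SimpleCrypto, the sample master key, the ECB mode and the Base64 text encoding are all assumptions, and the sketch uses a plain 8-byte DES key rather than reproducing the key handling described in the report.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical encrypt/decrypt helper with a hardcoded master key.
final class SimpleCrypto {
    // Assumed sample key: a raw DES key must be exactly 8 bytes long.
    private static final byte[] MASTER_KEY =
            "8bytekey".getBytes(StandardCharsets.US_ASCII);

    private SimpleCrypto() {
    }

    // Encrypts the text and returns it as a Base64 string.
    static String encrypt(String plainText) {
        byte[] cipherBytes = run(Cipher.ENCRYPT_MODE,
                plainText.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(cipherBytes);
    }

    // Decodes the Base64 string and decrypts it back to the original text.
    static String decrypt(String encoded) {
        byte[] plainBytes = run(Cipher.DECRYPT_MODE,
                Base64.getDecoder().decode(encoded));
        return new String(plainBytes, StandardCharsets.UTF_8);
    }

    // Runs DES in the given mode with the hardcoded master key.
    private static byte[] run(int mode, byte[] input) {
        try {
            Cipher cipher = Cipher.getInstance("DES/ECB/PKCS5Padding");
            cipher.init(mode, new SecretKeySpec(MASTER_KEY, "DES"));
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("DES operation failed", e);
        }
    }
}
```

A caller would store SimpleCrypto.encrypt(password) in the XML file and call SimpleCrypto.decrypt(...) when connecting. The sketch also makes the weakness of a hardcoded key visible: anyone who obtains the class can decrypt every stored value.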

Storing files that contain passwords leaves them vulnerable to attack, either through guessing the password or through cracking it (brute force). Although the passwords for the remote accounts are encrypted using a 320-bit key, this does not mean that they are completely safe [52]. Should the private master key become known, all the remote account passwords, and indeed the servers that they connect to, are vulnerable. In addition, because the master key is hardcoded, the remote account passwords stored on any other computer that runs this project could be decrypted with the same key.

5.5 Implementation Summary

The scale of integration for Areca was substantial, and the intricacy of blending new code with the existing code required delicate planning and management. To show the amount of code that was written, below is a brief comparison between the original code base and the implemented code base (source lines were counted using CLOC [53]).

Original source: 439 Java files, 45.5K source lines of code
Implemented source: 479 Java files, 63K source lines of code

It was evident that a considerable amount of code had been written. However, it is fair to say that almost half of the new code could be attributed to the third-party libraries, i.e. some classes were imported from the libraries.

Chapter 6: Testing and Evaluation

This chapter provides a detailed insight into the testing carried out for the project. It describes both standard testing procedures (i.e. manual black/white box testing) and automated testing. As with any type of software, it is necessary to run tests to see whether methods or classes work as expected. Releasing applications without testing is like walking into a field of landmines; a good path will not cause any problems, but as soon as a bug occurs, the method (or the entire application) can blow up. Carrying out testing on Areca was one of the most vital tasks. The main testing involved the newly written features and the integration of the new code with the existing code.

6.1 Overall Testing Strategy

As one of the requirements was to provide a cross-platform solution, this created concerns not only for implementation but also for testing. Thus, different operating systems were used for testing, and the work can be categorised as follows: standard testing on the development machine running Mac OS X, and automated testing on a Microsoft Windows XP machine. Using two different paradigms for testing ensured better test coverage. Testing on two or more operating systems also helped detect cross-platform bugs, e.g. / as the file path separator on Unix machines and \ on Windows.

Standard Testing Techniques

Standard testing of software is the method of manually testing software. Manual testing techniques were utilised, including unit testing, where the correctness of methods or entire classes was checked, and developer testing, which involved verifying code and making sure that new pieces of code did not cause problems for other parts of the system.

These tests were carried out during the integration stages of Areca. The development machine was running the Mac OS X operating system, which Areca had not previously been developed or tested for. Testing Areca under an operating system for which it was not specifically designed produced multiple errors during backup: files failed to back up, null pointer exceptions were raised by the Java Virtual Machine, and many other errors contributed to entire backups failing. For example, error messages such as the one shown in Figure 6.1 were encountered on the Mac OS X system when closing a Target Window. This did not occur when Areca was run on a Windows XP machine.

Figure 6.1: An error message from Areca as it was executed on a Mac OS X system

Automated Testing Techniques

To carry out automated testing, a commercial automated testing package called TestComplete [54] was used. TestComplete can be described as an integrated development environment that gives developers the opportunity to write automated test scripts in various scripting languages, such as VBScript, JScript, etc., to test the intended software. As previously mentioned, TestComplete is a commercial product, and it costs around $1000 for one standard license. Fortunately, a trial version was available; the trial license lasted one month, which was sufficient for the testing stages of this project.

For Areca, the intention was to test the serialisation and de-serialisation of remote account objects. Measurements of the system load (CPU usage, memory levels and hard disk space) over long periods of time (i.e. 6 hours or 12 hours) were also recorded. Measuring the load of the system had the additional benefit of verifying the correct functionality of the scheduler feature: a log is written when these tests run, so it was possible to check manually that the backups were being run at the expected times.

6.2 Unit Tests

A unit test isolates methods in a class to make sure that they are functionally correct [55]. Unit tests are normally repeated in order to provide a form of regression testing; regression tests are useful as they make sure that new releases of code do not break other methods, i.e. if methodA's code has been updated, it is expected that the code changes have not affected the functionality of methodB. Unit tests were carried out sparsely throughout the development of this project. JUnit can be described as "a simple framework to write repeatable tests" [56]. For instance, JUnit tests were written to evaluate the correct functioning of the scheduling iterators, and more specifically, to test the method that informs the user of the next backup time. Each scheduler object has its own getNextBackupTime() method, and this method decides when the next backup should occur.
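To make the idea of testing a next-backup-time method concrete, the sketch below computes the next run of an hourly schedule from a given point in time. It is an assumption-laden illustration, not the project's HourlyIterator: the class name HourlySchedule, its minute-of-hour parameter and the use of java.time (which postdates the original project) are all choices made for this example.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

// Hypothetical hourly schedule: runs once per hour at a fixed minute.
final class HourlySchedule {
    private final int minuteOfHour;

    HourlySchedule(int minuteOfHour) {
        if (minuteOfHour < 0 || minuteOfHour > 59) {
            throw new IllegalArgumentException("minute must be in 0..59");
        }
        this.minuteOfHour = minuteOfHour;
    }

    // Returns the first scheduled time strictly after the given moment.
    LocalDateTime getNextBackupTime(LocalDateTime now) {
        LocalDateTime candidate =
                now.truncatedTo(ChronoUnit.HOURS).withMinute(minuteOfHour);
        // If this hour's slot has already passed (or is now), use the next hour.
        return candidate.isAfter(now) ? candidate : candidate.plusHours(1);
    }
}
```

Passing the current time in as a parameter, rather than reading the system clock inside the method, is what makes the method easy to unit test: a test can supply fixed times, including awkward ones such as a few minutes before midnight, and compare the result with the expected value.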

Below are two screenshots of running JUnit tests: one in which an assertion failed (Figure 6.2), and another in which it succeeded (Figure 6.3). In other words, they show the expected failure and success cases for the method.

Figure 6.2: Assert fail when testing the hourly schedule

Figure 6.3: Assert pass after bugs were fixed

Utilising the tests helped uncover unforeseen errors in logic and coding. Unfortunately, not many JUnit tests were employed, as it was difficult to test methods that required the GUI to be initialised. This project depended heavily on user input, and without user input, the GUI elements would not have been initialised. Fortunately, automated testing was at hand to bridge the gap in functionality testing.

6.3 Black and White Box Testing

The strategy for testing Areca was to use automated testing for black box testing and manual testing for white box testing. The reason for this was that, with automated testing, the internal implementation details could be ignored (abstraction); these details are not tested during black box testing [57]. With automated testing, an input is supplied and the actual output is compared with the expected output; if the two match, it can be assumed that the method under test is operating correctly. Manual white box testing involved scrutinising paths within the implementation (i.e. following the data flow), and allowed good test cases to be developed, as vulnerabilities in the code could be found. White box testing was carried out manually during development using the debugging tools provided as part of the Eclipse IDE.

6.4 Automated Testing

Areca was subjected to automated testing of its functional behaviour. The details of the automated testing stages are described in the following sub-sections. The purpose of automated testing was to simulate user interaction with Areca without human intervention. The test scripts emulate an infallible user who always carries out the same actions. This is extremely useful for regression testing, and it avoids having to repeat mundane testing tasks manually.

Functional Tests

One of the aims of automated testing was to verify the functional requirements of Areca, and specifically, the features that were newly implemented. Another aim was to record whether anomalies, such as high CPU peaks whilst backing up, occurred. In the following figure, an example test is provided to illustrate the verification of the file sizes of remote account XML outputs, as explained in Chapter 4.

    // Get files from the XML directory & make sure directory is valid.
    if (aqFileSystem.Exists(xmlDir)) {
        var xmlFiles = aqFileSystem.FindFiles(xmlDir, "*.xml");
        // I.e. XML files found.
        if (xmlFiles != null) {
            var currentXmlFile;
            var fName;
            while (xmlFiles.HasNext()) {
                currentXmlFile = xmlFiles.Next();
                fName = currentXmlFile.Name;
                if (currentXmlFile.Size < FILE_SIZE_MIN) {
                    Log.Warning(fName + " is below the minimum size of " + FILE_SIZE_MIN + " bytes.");
                } else if (currentXmlFile.Size > FILE_SIZE_MAX) {
                    Log.Warning(fName + " is above the maximum size of " + FILE_SIZE_MAX + " bytes.");
                } else {
                    Log.Message(fName + " is averagely sized at " + currentXmlFile.Size + " bytes.");
                }
            }
        } // if xml files exist
    } // if directory exists

Figure 6.4: Excerpt of how the file sizes are checked against the threshold

The global variables FILE_SIZE_MIN and FILE_SIZE_MAX denote the minimum and maximum file size of a serialised XML file, respectively. The thresholds were derived by averaging the sizes of the existing remote account files, which led to a minimum file size of 300 bytes and a maximum of 450 bytes. It was clear that should a serialised remote account object (i.e. an XML file) fall below 300 bytes, then either data was not being serialised correctly or some fields were missing from the file. Likewise, if a file was larger than the maximum threshold, it could be the case that duplicated data had been written.
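The same threshold rule can also be expressed in Java, for instance if the check were ever moved into a JUnit test alongside the rest of the code base. The sketch below is illustrative only: the class XmlSizeCheck and the Verdict enum are hypothetical names, while the 300 and 450 byte limits are the thresholds stated above.

```java
// Hypothetical Java counterpart of the size check in Figure 6.4.
final class XmlSizeCheck {
    static final long FILE_SIZE_MIN = 300;   // bytes, from the text above
    static final long FILE_SIZE_MAX = 450;   // bytes, from the text above

    enum Verdict { TOO_SMALL, TOO_LARGE, OK }

    private XmlSizeCheck() {
    }

    // Classifies a serialised remote account file by its size in bytes.
    static Verdict classify(long sizeInBytes) {
        if (sizeInBytes < FILE_SIZE_MIN) {
            return Verdict.TOO_SMALL;   // fields may be missing
        }
        if (sizeInBytes > FILE_SIZE_MAX) {
            return Verdict.TOO_LARGE;   // data may be duplicated
        }
        return Verdict.OK;
    }
}
```

Both limits are inclusive, as in the script: a file of exactly 300 or 450 bytes is treated as acceptable. A caller could apply XmlSizeCheck.classify(file.length()) to each XML file in the preferences directory.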

More automated functional tests are available in Auxiliary Material B.

Load Tests

There were two primary reasons for load testing. Firstly, executing load tests made sure that Areca did not starve other applications of system resources. For example, if Areca had a few scheduled jobs that were not due to start until some time in the future, then it would not make sense for Areca to hog CPU, RAM and hard drive resources. Secondly, even while scheduled backups are running, the tests report whether Areca has acquired a high share of the CPU (i.e. over 80%) and has consequently blocked other applications from running. The following graphs were produced from a target that contained three different sources. The sources varied in size and type. The test script recorded load statistics every 15 minutes, as well as after each backup had been made.

Figure 6.5: The original memory usage of backing up three targets over a 12hr period

In Figure 6.5, the original Areca ran with 59MB of memory during start-up, whereas the extended version ran with only 35MB. This was most definitely due to the fact that String objects are now used as references, rather than the more memory-intensive File objects. However, the same success story could not be told for the backing up of three targets simultaneously. Figure 6.5 initially shows the new Areca using less memory on start-up, but as the backup progressed, the memory ballooned to almost double the usual usage. This increase was most likely due to Commons VFS creating new objects for every single connection. Only after the Java garbage collector has run does the memory drop to an acceptable level of around 75MB. The original version hovered relatively consistently around 60-70MB throughout.

Looking at the CPU load, the processor was never loaded above 5% when backing up all three targets. Figure 6.6 shows the CPU measurements, which were taken at the same time as the backing up of targets. As the average CPU usage was only around 0.25%, the new code does not appear to have had detrimental effects on the running of Areca.

Figure 6.6: The CPU usage of backing up three targets over a 12hr period

The graph values in Figure 6.6 indicate that the threads were running correctly and were not causing undue stress on the CPU.

Regression Tests

Regression tests ensure that updates to code do not have adverse effects on previously working methods. A reliable source of regression testing can be gained by harnessing the aforementioned automated test scripts, which stress both the GUI and data aspects of Areca. Code changes can be made and the resulting new version can be run against the test scripts. As the test scripts are both broad (i.e. recording CPU and memory usage) and specific to Areca (i.e. the Remote Account Manager), they provide a valuable measure of where and when things go wrong. The scripts can be run on new releases to expose bugs, and they also encourage a code-and-test attitude, as it is not too arduous for developers to run automated tests every time they write new sections of code.

Connection Stability Tests

Connection stability testing can strictly be categorised under regression testing, but it is explained in more detail in this section. The plan for testing remote connection stability involved running remote backups frequently while looking for errors in Areca. If no errors were found after repeated attempts, then it is fair to assume that the connection to the remote files during backup was stable and reliable.

Through development and general usage of Areca, the connection to remote servers would sometimes seemingly disconnect. The word "seemingly" is used because a backup that contains remote files usually succeeds. However, there were a few occasions where Areca would complain that the selected remote files did not exist (see Figure 6.7). If the backup were run again, say, a minute later, then it would succeed.

Figure 6.7: An error accessing a remote source

This unpredictability is unacceptable. However, the conclusion is that, ultimately, the problem lies with the internal workings of Commons VFS. Delving through the Commons VFS source code was not a feasible option, as the problem would have taken an exceedingly long time to solve.

The following excerpt of code is from an automated test script that looked for error messages present in the Areca window while a backup was in progress.

    if (haserrormessage()) {
        // Click on the error message.
        Aliases.Areca.ErrorDialog.OK.ClickButton();
    }
    // Retry
    if (retries > 0) {
        aqUtils.Delay(1000);
        runbackupwithretries(targetname, retries - 1);
    } else {
        Log.Message("Target " + targetname + " failed to back up.");
    }

Figure 6.8: An excerpt of handling error windows in the automated test scripts

If an error dialog presents itself at any time during the backup, an error entry is automatically written to the log. The code in Figure 6.8 sits within a while() loop that checks whether an error dialog has appeared. The script then either dismisses the message and retries the operation or, if retries have been disabled or have reached zero, logs an error entry.

The function in Figure 6.9 shows the elegant capabilities of the TestComplete engine for detecting windows. The alias Aliases.Areca.ErrorDialog is mapped directly to the error dialog in Areca, so that TestComplete knows whether or not it has appeared on screen.

    //
    // This function checks to see whether any error messages have come
    // up whilst processing a backup.
    //
    // Return:
    //   true if an error message exists, false otherwise.
    //
    function haserrormessage() {
        var window = Aliases.Areca.ErrorDialog;
        if (window.Exists) {
            return true;
        }
        return false;
    } // haserrormessage

Figure 6.9: An excerpt showing how to check for error messages in Areca
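The retry behaviour described above can be sketched independently of TestComplete as a loop with a bounded number of attempts. This is an illustrative sketch rather than the script used in the project: `runWithRetries` and its parameters are hypothetical names, and the backup attempt, the wait and the logger are passed in as functions so that the example runs without Areca or TestComplete.

```javascript
// Hypothetical helper: call `attempt` until it reports success or the
// retries are used up. `attempt` returns true on success, `wait` pauses
// between attempts, and `log` records the final failure.
function runWithRetries(attempt, retries, wait, log) {
  let attemptsLeft = retries + 1; // the first try plus the allowed retries
  while (attemptsLeft > 0) {
    if (attempt()) {
      return true;
    }
    attemptsLeft -= 1;
    if (attemptsLeft > 0) {
      // Pause for one second before trying again, as in Figure 6.8.
      wait(1000);
    }
  }
  log("Target failed to back up after " + (retries + 1) + " attempts.");
  return false;
}
```

Using a loop instead of the recursive call in Figure 6.8 keeps the number of attempts explicit; with `retries` set to zero the operation is tried exactly once and a failure is logged immediately.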

Monitoring and recording the hard disk usage was another method used to deduce that the connection between a local client and a remote server was maintained. The decrease in free hard disk space shows that backups were being carried out, as shown in Figure 6.10.

Figure 6.10: The hard disk space usage during the backup of three targets
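As a rough illustration of this idea, the sketch below scans a series of free-space readings and reports each point where the free space dropped, which is taken as evidence that backup data was written. The function name `findSpaceDrops` and the use of plain numbers for the readings are assumptions made for the example; the project's actual script is not reproduced here.

```javascript
// Hypothetical helper: given free-space readings in chronological order,
// return one entry per reading that is lower than the one before it.
function findSpaceDrops(freeSpaceReadings) {
  const drops = [];
  for (let i = 1; i < freeSpaceReadings.length; i++) {
    const previous = freeSpaceReadings[i - 1];
    const current = freeSpaceReadings[i];
    if (current < previous) {
      // The difference approximates how much data was written.
      drops.push({ index: i, amountWritten: previous - current });
    }
  }
  return drops;
}
```

Note that a drop in free space only suggests that a backup wrote data; other applications writing to the same disk would produce the same signal.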

6.5 Testing Results

A table has been produced to show the test results from the automated testing. Extended results can be found in Appendix D.

No. | Test | Result/Remarks
1 | Remote accounts are serialised (written to file). | Successful
2 | Remote accounts are de-serialised, i.e. Areca can parse the contents of the file correctly. | Successful
3 | The file sizes of the remote accounts that are created do not exceed the optimum threshold. Threshold was set between 300 and 450 bytes. | Successful
4 | Test the remote backup facility by backing up a file on a remote server whereby the file size of that remote file is known. | Unsuccessful initially. I had to fix some bugs in order to pass the test.
5 | Test the backup of one remote file in one target (called T1). | Successful
6 | Test the backup of two remote files in one target (called T12). | Successful
7 | Test the backup of three remote files in one target (called T123). | Successful
8 | Test the backup of a scheduled target (scheduling T1, T12 and T123). | Successful
9 | Test the backup of multiple scheduled targets. | Successful

Figure 6.11: The automated test cases

Test number 4 was a prime example of how automated testing helped to rid the system of cross-platform bugs. The test involved the backing up of a text file, but the Commons VFS code seemed to behave differently under Windows XP: the resolvefileobject() method worked under Mac OS X, but somehow failed to return a FileObject when the same code ran on Windows XP. A generic method from VFSUtils, which is part of the Commons VFS package, had to be used to retrieve a FileObject.

6.6 Evaluation of Project

To gauge the success of the project, five students from the University of Manchester evaluated the new version of Areca and provided feedback using an online form (please see Appendix C). A summary of the responses is given below.

Figure 6.12: The evaluation response of the implemented system

On the whole, the feedback was largely positive. However, some evaluators found the listing of files in the check file tree to be slow and unresponsive, and two other evaluators stated in the comments section that it needs improvement. The integration of the features garnered a good response, and Areca did not crash for any of the participants during the evaluation. Additional graphs are available in Appendix C.

The tables in Figure 6.13 and Figure 6.14 show which requirements have been met.

No. | Functional Requirement | Objective Met
1 | Create a full backup of selected remote files/directories. | Yes
2 | Create an incremental backup of selected remote files/directories. | Yes
3 | Access and backup files from FTP servers. | Yes
4 | Access and backup files from SFTP servers. | Yes
5 | Access and backup files from SMB servers. | Yes
6 | A built-in scheduler to schedule backups to run at regular intervals. | Yes
7 | Implement a file restore method. | No
8 | A mechanism of saving user preferences/remote accounts. | Yes
9 | Be cross-platform (support Microsoft Windows, Mac OS X and Linux). | Yes

Figure 6.13: The functional requirements met in this project

No. | Non-functional Requirement | Objective Met
1 | The GUI must be simple to use and easy for the user to understand. | Yes
2 | The software must be easy to install. | Yes (single JAR file to run)
3 | The software must be lightweight. | Yes
4 | If a file fails to restore, the system should provide the user with options. | No (only an error message appears)
5 | Installed software should be easy to access (e.g. menu shortcuts, a visible and recognisable icon). | Yes
6 | The source code should be open. | Yes
7 | Provide an uninstaller. | No (preferences still exist after deleting the JAR file)

Figure 6.14: The non-functional requirements met in this project

6.7 Testing and Evaluation Summary

Testing Areca has been valuable. Bugs that were not encountered during the usual development stages were exposed by these tests, which in turn made the bug fixing easier. The test results show that Areca is operating normally, even after such a complex transformation of the underlying code. Operating normally in this respect means that Areca does not consume huge amounts of resources and that spurious error messages are not shown. Resource consumption may not trouble normal users, but spurious error messages will certainly create confusion amongst them. Error dialogs were deliberately kept to a minimum, and therefore any error that did pop up during testing would have indicated a serious problem.

The extension to Areca does not seem to have had adverse effects on the stability of the original version. Evaluators were generally pleased with the newly implemented features. The next chapter reviews the objectives achieved and presents a conclusion to the project.

Chapter 7: Conclusion

This chapter concludes the work undertaken and consolidates all the objectives that were put forward. A reflection on the effort put into the overall project is also given.

7.1 Objectives Review

Of the three main original objectives, two were completed. The single interface to remote storage has been successfully integrated into the existing Areca code. It has given Areca the ability to access remote servers and back up files that it previously had no route to.

Users are now able to schedule backups so that data is backed up at regular intervals. The interface is simple yet effective. By utilising threads in Java, the scheduling mechanism is efficient and has not caused any errors internally or externally for the user. The automated testing has shown that the scheduler works as intended.

The feature that was not implemented was the file restore method. This can be attributed to underestimating the complexity of the features that I had intended to implement. My initial assumptions about Areca were largely incorrect, and my inexperience of working with large projects made it difficult to keep on track. Implementation started two weeks behind schedule, which had undesired effects on the project. These effects included sacrificing lower-priority features in favour of a better implementation of the core features. This trend continued into the final stages of the project, and consequently the file restore feature was not improved.

The status of each objective is as follows:

- Provide a single interface to remote storage devices (i.e. servers, cluster of servers): completed
- Provide scheduling capabilities: completed
- Provide an improved file restore method: not completed

The project could be regarded as a partial success, as the main features were implemented. The difficulty of understanding an entire application was exemplified in this project, as even the simplest of tasks took longer than expected.

7.2 Project Reflection

The newly implemented features in Areca are a solid start to providing a durable interface to remote storage. The scheduling feature is a bit primitive in terms of management. While it works as designed, it is not possible for users to retrieve an overview of all the scheduled jobs for all the targets. Should development continue, new requirements may be imposed on the project, such as protection from volatile connections, good management of scheduled jobs, and an optimised listing of remote files, in order to produce a robust and user-friendly solution.

In my opinion, Areca is not ready for deployment yet. Its unpredictable behaviour, due to the combination of remote connections and multiple threading, needs extensive and thoughtful testing, and additional development will be necessary to cure these setbacks. Some programming exceptions are caused by the lack of development for cross-platform usage; end-users should not have to deal with Java exception messages, as these will no doubt confuse them. However, I think Areca is a usable piece of backup software and is reliable enough on a small scale; a small number of users will help developers channel their efforts into fixing bugs. Releasing to a small group of users lessens the chance of bug reports in which external applications may have played a part (i.e. affected the operations of Areca). Thus, the impact of bugs is not as great as it would be should Areca be released to a large-scale population.

Figure 7.1: The time spent per week on the project

Figure 7.1 shows the amount of time spent on the project. During the initial inception phase, the time spent per week was around 10 hours. However, as the project progressed, the amount of time spent increased significantly.

Personally, it has been an excellent learning exercise. My project and time management skills have been tested, and I believe I have accomplished a considerable number of the complex tasks that were presented.

7.3 Future Development Ideas

The most interesting part of this project is the feature of accessing and backing up remote files. It would be very useful to be able to access remote files and also save these remote files to remote storage: a remote-to-remote backup (R2R hereafter). R2R backup provides endless possibilities for backing up. It is more flexible than the current solution, and it does not require the backed up files to be stored on the local disk drive.

Some foundations of R2R have already been implemented; the Remote Account Manager could be used to manage and control the remote servers as a backup destination. A central element for managing remote accounts should provide ease of use for end users.

As previously mentioned, the scheduler needs a form of manager: one that would handle the scheduling of jobs and provide an overview of all the active schedulers, where they can be initiated, edited, and stopped. Managing these jobs would provide users with flexibility as well as control.

An improved GUI for the selection of file sources using the check file tree would be ideal. Currently, the check file tree GUI looks unfinished and untidy, and does not fit in with the existing interfaces. It is fair to say that more effort has been put into the engine of the check file tree than into exterior aspects such as the GUI.

Uploading the implemented code to sourceforge.net would give other community members the opportunity to participate and develop additional features and/or fix bugs. It would also allow the community to give constructive criticism on the project.

Below is a list of features that, if implemented, would convert the project into a fully-fledged backup application:

- Bare-metal recovery: A bare-metal recovery essentially backs up the entire OS disk and restores these files onto a new disk that can be considered a clone of the previous disk. As this project runs on top of the OS, copying the system files would be near impossible. There are many issues that could arise from attempting to read system files. For example, there will be numerous locked files that cannot be backed up, and out of date system files such as graphics card drivers (i.e. if an old driver file is restored and replaces a newer driver file, then the system could crash due to the OS being registered with the latter driver) [5].
- Remote-to-Remote: back up remote files to a remote backup destination. The main problem to contemplate is whether to store the remote sources in a temporary local destination first, and then upload these to a selected remote destination. A direct connection from FTP to FTP already exists; this allows one FTP server to transfer files to another. This is called FXP [58]. Currently, there are no protocols available to connect two SFTP or SMB servers together.
- Graphically show differences between remote files and local files, and allow the user to pick and choose which files to fetch from the remote server to back up.

7.4 Final Conclusion

The project could be described as somewhat innovative, as it was the only open-source and freeware backup software aimed at home users that allowed the backup of their remote files. As the main task was to extend the features of an existing solution, this project could be classed as an integration project, since new code was intermixed with an existing code base.

The two features that have been implemented have created a solid foundation on which future ideas could be built. The last feature, the improved file restore method, was not implemented and could certainly be addressed should development of this project continue.

The performance figures from the testing show that memory usage decreased on the initial start-up of Areca. This improvement has been one of the most significant achievements in this project, together with the fact that the implemented features work seamlessly with the existing code. The positive response from the evaluators has added to the success of the project.

Overall, the outcome of the entire project can be considered a success. The useful and convenient feature of being able to access remote files whilst backing up is simple yet effective. Finally, this has been a very challenging and worthwhile project.

Chapter 8: References

1. University of Colorado, Dept. of Applied Mathematics. SSH and SFTP (Online, cited 2009 Nov 25)
2. The Australian National University. What is SMB? (Online, cited 2010 Mar 11)
3. Cougias D. The Backup Book: Disaster Recovery from Desktop to Data Center [cited 2010 May 4]
4. Developing a Backup Strategy: ICT Hub Knowledgebase (Online, cited 2010 May 5)
5. Curtis Preston W. Backup & recovery [cited 2010 May 5]
6. Universal automatic computer - definition of Universal automatic computer in the Free Online Encyclopedia (Online, cited 2010 Apr 23)
7. IBM Research History Highlights (Online, cited 2010 Apr 11)
8. The History of Backup (Online, cited 2010 May 5)
9. What Is Cloud Backup? (Online, cited 2010 Apr 22)
10. Amazon Simple Storage Service (Amazon S3) (Online, cited 2010 Feb 5)
11. Free Online Backup - SpiderOak.com (Online, cited 2010 Apr 14)
12. File Backup Software for Windows and Mac: Remote Backup and File Sync with SugarSync (Online, cited 2010 Apr 14)

13. Apple - Time Machine (Online, cited 2010 May 5)
14. Snap Backup (Online, cited 2010 May 5)
15. BounceBack Business Server Solutions (Online, cited 2010 May 5) URL: erver.html
16. What is RAID? (Online, cited 2010 Apr 12)
17. Petrucci O. Areca Backup (Online, cited 2009 Dec 15)
18. InstallShield (Online, cited 2010 May 5)
19. Amanda Network Backup: Open Source Backup for Linux, Windows, UNIX and OS X (Online, cited 2010 May 5)
20. Bacula, the Open Source, Enterprise ready, Network Backup Tool for Linux, Unix, and Windows (Online, cited 2010 May 5)
21. DirSync Pro (Directory Synchronize Pro) (Online, cited 2010 May 5)
22. Using Virtual Directories with FTP Sites (IIS 6.0) (Online, cited 2010 Mar 29) URL: e247-be2b-4b0d-8dee-04f71ad6c14a.mspx?mfr=true
23. Mac OS X Hidden Files & Directories (Online, cited 2010 May 5)
24. Restrict FTP access to a single directory for only one user (Online, cited 2010 May 5)

25. What is the MD5 hash? (Online, cited 2010 Apr 29)
26. Permissions (Online, cited 2010 April 2)
27. QuickSFV Overview (Online, cited 2010 May 1)
28. Unix Daemon Server Programming (Online, cited 2010 May 5)
29. LASEC (Online, cited 2010 May 5)
30. The Java (not really) Faster than C++ Benchmark (Online, cited 2010 May 5)
31. Martin J, Odell J. Object-oriented Analysis and Design. Prentice Hall
32. Larman C. Applying UML and Patterns - An Introduction to Object-Oriented Analysis and Design and Iterative Development. Third Edition.
33. ATL Project (Online, cited 2010 May 5)
34. SourceForge.net: Eclipse - areca (Online, cited 2010 Mar 24)
35. ResourceBundle (Java 2 Platform SE v1.4.2) (Online, cited 2010 Apr 1)
36. SourceForge.net: Translations - areca (Online, cited 2010 Mar 24)
37. JDK 5.0 Java Programming Language (Online, cited 2010 Apr 20)
38. Areca Backup - Versions History (Online, cited 2010 May 5)

39. Bracha G. Generics in the Java Programming Language (Online, 2004 Jul 5; cited 2009 Apr 21)
40. SWT: The Standard Widget Toolkit (Online, cited 2010 May 5)
41. Creating a GUI With JFC/Swing (The Java Tutorials) (Online, cited 2010 May 5)
42. Class Tree Help - Eclipse SDK (Online, cited 2009 Nov 20) URL: api/org/eclipse/swt/widgets/tree.html
43. Building custom Swing components (JTreeView with checkboxes) - Dumitru Postoronca (Online, cited 2009 Nov 16) URL: JTreeView-with-checkboxes.html
44. Java Native Interface: Programmer's Guide and Specification (Online, cited 2010 Apr 7)
45. Free FTP, SFTP and TFTP Source Code and Programming Libraries (Online, cited 2010 Apr 29)
46. Mounting definition by The Linux Information Project (Online, cited 2010 Mar 18)
47. ExpanDrive: Ridiculously simple SFTP/FTP/S3 drive access (Online, cited 2010 Mar 25)
48. Defendant Microsoft Corporation's Corrected Opposition to Plaintiff SUN Microsystems, Inc.'s Motion for Preliminary Injunction (Online, cited 2010 Apr 10)
49. Commons VFS - Commons Virtual File System (Online, cited 2009 Nov 12)
50. Scheduling recurring tasks in Java applications (Online, cited 2010 Mar 2)

51. Data Encryption Standard (Online, cited 2010 May 5)
52. The Electronic Frontier Foundation (Online, cited 2010 Apr 3) URL: faq.html#howsitwork
53. CLOC -- Count Lines of Code (Online, cited 2010 Jan 26)
54. Automated Testing Tools - TestComplete (Online, cited 2010 Mar 6)
55. Unit Testing (Online, cited 2010 Apr 13)
56. JUnit (Online, cited 2010 Apr 1)
57. What is black box/white box testing? (Online, cited 2010 Apr 12)
58. What Is FXP? (Online, cited 2010 Apr 20)

Appendix A: Gantt Chart



Appendix B: Initial Use Cases

Scenario 1

Use case: Creating a new backup
Actors: End-user
Pre-conditions:
- The backup software can load successfully on the user's machine.
- The backup destination (also called local repository) has enough free space on disk.

Success Scenario:
1. The user creates a new backup Group called UniWork by going into Edit > New group.
2. The user enters a Title and Description for the group. The title must not be blank; however, the description can be empty.
3. The user presses Save on the new group window.
4. The user creates a Target to be backed up by right-clicking on the UniWork group and choosing New target.
5. [Main Tab] On the subsequent new target window, the user enters a target name, a local repository and chooses the Standard storage mode.
6. [Sources Tab] The user presses the Add button to add a file/directory to the sources list.
7. [Compression Tab] The user chooses None as the compression option.
8. [Advanced Tab] The user selects Register empty directories, Store permissions and Follow directories in the File management box.
9. [Filters Tab] No filters are added by the user.
10. [Pre-processing and Post-processing Tabs] No actions are added for pre/post-processing.
11. The user enters a suitable description for the backup.
12. The user presses Save on the new target window.
13. The user right-clicks on the newly-created target and selects Backup.
14. The backup should now start and a progress window/tab should be displayed to the user.

Alternative Routes:
1a. The user presses Ctrl+G to invoke the Create New Group window.
2a. The user does not enter a title. The Save button is not enabled. The button is only enabled when there is text in the Title text box.
4a. The user presses Ctrl+T to invoke the Create New Target window.
5a. The user does not enter a target name or a valid path for the local repository. The Save button is not enabled. It is only enabled when the target name and local repository text boxes are filled in.
6a. The user is presented with a graphical view of a selected storage device with checkboxes next to the storage device (to back the whole device up), nested directories within that device (to back up a certain directory or directories), and files (to back up a select number of files).
7a. The user decides to add compression. Current options are Zip and Zip

Scenario 2

Use case: Restoring a backed up file
Actors: End-user
Pre-conditions:
- The backup software can load successfully on the user's machine.
- The restore destination has enough free space on disk.
- The file that needs to be restored can be restored to a valid¹ location.
- The backup/archive package contains the file that is to be restored.

Success Scenario:
1. The user selects a saved target in the GUI.
2. When a saved target has been selected in the Archives tab, there should be an archive which contains the file that the user wants to restore. The user selects the desired archive.
3. The user clicks on the Logical view tab and is presented with a tree directory of the files that have been backed up.
4. The user browses through the tree and finds the required file.
5. The user right-clicks on the file and clicks on Restore.
6. The system attempts to restore the file to its original location².
7. A dialog is shown, explaining that the file has been restored.

Alternative Routes:
2a. The user right-clicks on an archive and searches for the file to be restored.
3a. The user selects a directory to be restored.
3b. The user holds down the Ctrl key and selects more than one file to be restored.
6a. The system fails to restore the file. The user is presented with some options to Retry, Abort, or Select an alternate location to restore the file.

Notes:
1. Valid in this context means that the storage device has enough disk space to restore the file to, the file is not already in use (or locked), the location exists, and that the user has the correct permissions to write to the disk.
2. The original location refers to the file's location where it was last located in the previous backup.
3. A file restore is different to a file recover operation. A restore operation attempts to copy the backed up file to the location where the file was previously backed up. A file recover operation can recover the file to any location that is valid.

Appendix C: Online Questionnaire

