Data Provenance for e-social Science Cloud Applications




Data Provenance for e-social Science Cloud Applications

Che Wan Amiruddin Chek Wan Samsudin

MSc Computing and Management, Session 2010/2011

The candidate confirms that the work submitted is their own and that appropriate credit has been given where reference has been made to the work of others. I understand that failure to attribute material which is obtained from another source may be considered as plagiarism.

(Signature of student)

ABSTRACT

e-social Science applications use Grid computing technology to collect, process, integrate, share, and present social and behavioural data. The option to use cloud computing technology for both computation and data storage has now arisen thanks to its on-demand strategy, where customers pay only for what they use. However, data stored on the cloud can be heterogeneous, which can pose problems in identifying the source of the data and who stored it. The integrity of the data used is crucial for e-social Scientists when conducting experiments. These issues can be addressed by using data provenance when deriving data from the cloud environment.

The aim of this study was to derive the common data provenance requirements of e-social Scientists, to facilitate the uptake of cloud computing technology for the e-social Sciences and help social scientists use authenticated data from the cloud environment. The e-social Science provenance requirements were derived by interviewing e-social Scientists, followed by a survey of current provenance schemes. Experimental evaluations of cloud provenance were conducted on a real test bed to fulfil those requirements. The current provenance schemes were then compared with the simple provenance scheme designed here.

Three main requirements were derived from the interviews with the e-social Scientists: 1) the ability to replicate results based on the input data; 2) the ability to evaluate processes based on the metadata; and 3) the ability to secure sensitive data against access by unauthorised users. The survey of the current provenance schemes showed that schemes such as Chimera and PASS could support the provenance requirements needed by e-social Scientists. The proposed scheme also supported result replication. The experimental evaluation of performance, with regard to the time taken to record process details, the time taken to record generated results, and the time taken to retrieve results, showed that the current provenance schemes can be applied in a cloud environment. A simple provenance scheme for an e-social Science cloud application was designed as a proof of concept to show that recording provenance in a cloud environment is possible.

Findings showed that two out of the four provenance schemes discussed, as well as the cloud provenance scheme created, met the e-social Science requirements. The proof of concept showed that recording and querying data provenance in cloud applications is possible. Further work on provenance data security should be performed to prevent unauthorised access to data.

ACKNOWLEDGEMENT

First and foremost, I would like to thank my project supervisor, Dr Paul Townend, for suggesting this topic as my Master's thesis project, and for his invaluable advice, guidance, support, and time throughout the preparation of my MSc project. Without his support and guidance, I would not have been able to complete my project on time.

I would like to thank Andy Turner from the School of Geography, and Dr Colin Venters, for taking time out of their busy schedules to hold interview sessions with me so that I was able to derive the provenance requirements needed by e-social Scientists. Great appreciation also goes to Peter Garraghan for allowing me to use the ivic system to conduct the experiment. Without his generous help, it would have cost me a fortune to use an enterprise cloud provider such as Amazon.

To my MSc Computing and Management colleagues, as well as other Malaysian friends and families in Leeds, thank you for the support, motivation and friendship throughout my student life in Leeds. Thank you to my twin brother and my little sister for their encouragement and support throughout the year; they insisted that I work hard while keeping well so that I could submit this work on time. And most importantly, my utmost gratitude to my parents for their prayers, moral and financial support, invaluable encouragement, advice and guidance at all times. Many thanks Mom and Dad, I could not have done it without you.

Che Wan Amiruddin Chek Wan Samsudin
MSc Computing and Management (2010/2011)
University of Leeds
September 2011

Table of Contents

ABSTRACT
ACKNOWLEDGEMENT
LIST OF ABBREVIATIONS
LIST OF FIGURES
LIST OF TABLES

CHAPTER I
1. Introduction
   1.1 Project Outline
       1.1.1 Overall aim and objective of the project
       1.1.2 Minimum requirement and further enhancement
       1.1.3 List of deliverables
       1.1.4 Resource required
       1.1.5 Project schedule and progress report
       1.1.6 Research methods
   1.2 Chapter overviews

CHAPTER II
2. Background Reading
   2.1 e-Science
   2.2 e-social Science
       2.2.1 e-social Science Research
   2.3 Cloud Computing
       2.3.1 Deploying Models
   2.4 Provenance

CHAPTER III
3. Analysis & Survey
   3.1 Requirement Analysis
       3.1.1 Provenance requirement
       3.1.2 Provenance requirement for cloud
       3.1.3 Provenance requirement by the e-social Scientist
   3.2 Current Provenance Schemes Survey
       3.2.1 Chimera
       3.2.2 Earth System Science Workbench (ESSW)
       3.2.3 Provenance Aware Service-oriented Architecture (PASOA)
       3.2.4 Provenance-Aware Storage System (PASS)
   3.3 Discussion of the current provenance schemes
       3.3.1 Storage Repository
       3.3.2 Representation Scheme
       3.3.3 Result Replication
       3.3.4 Provenance Distribution
       3.3.5 Evaluate metadata

CHAPTER IV
4. Design and Implementation
   4.1 Proposed Design
       4.1.1 System Design Scenario
       4.1.2 System Architecture
       4.1.3 Database Design
   4.2 Implementation
       4.2.1 Creating the database
       4.2.2 Accessing the virtual machine
       4.2.3 Computational software

CHAPTER V
5. Evaluation
   5.1 Performance evaluation
       5.1.1 Recording the result
       5.1.2 Querying the result
       5.1.3 Discussion
   5.2 Do the current provenance schemes meet the e-social Science requirements?
   5.3 Current provenance schemes and cloud
   5.4 Summary

CHAPTER VI
6. Conclusion
   6.1 Overall project evaluation
   6.2 Problems encountered
   6.3 Future Work

REFERENCES
Appendix A - Personal Project Reflection
Appendix B - Contribution on project
Appendix C - Interim Report
Appendix D - Gantt chart
Appendix E - Interview session with Andy Turner
Appendix F - Database Tables Created
Appendix H - Email interaction to get hold of PASS system

LIST OF ABBREVIATIONS

AVHRR - Advance Very-High Resolution Radiometer
CMS - Compact Muon Solenoid
DAMES - Data Management through e-social Science
DReSS - Digital Record for e-social Science
DAG - Directed Acyclic Graph
DTD - Document Type Definitions
ESSW - Earth System Science Workbench
ESRC - Economic and Social Research Council
ERM - Entity Relationship Diagram
GENeSIS - Generative e-social Science
GeoVUE - Geographic Virtual Urban Environments
HaaS - Hardware as a Service
PC - Personal Computer
PHP - Hypertext Preprocessor
IaaS - Infrastructure as a Service
ivic - infrastructure Virtual Computing
KBDB - in-kernel Berkeley DB Database
IP - Internet Protocol
MeRC - Manchester eresearch Centre
MoSeS - Modelling and Simulation for e-social Science
NCeSS - National Centre for e-social Science
NeSC - National e-Science Centre
NHS - National Health Service
NIST - National Institute of Standards and Technology
NOAA - National Oceanic and Atmospheric Administration
NGN - Next Generation Network
ND-WORM - No-Duplicate Write Once Read Many
OS - Operating System
PaaS - Platform as a Service
PASOA - Provenance Aware Service-Oriented Architecture
PReP - Provenance Recording Protocol
PASS - Provenance-Aware Storage System
PReServ - Provenance Recording for Service

SDSS - Sloan Digital Sky Survey
SaaS - Software as a Service
OPM - The Open Provenance Model
VDC - Virtual Data Catalogue
VDL - Virtual Data Language
VFS - Virtual File System
VM - Virtual Machine
WBBIs - Web-based behavioural interventions

LIST OF FIGURES

Figure 1 - Illustration of a public cloud. Users accessing data and application without knowing how the underlying architecture works
Figure 2 - Illustration of a private cloud. Useful for large enterprise to handle a maximum workload
Figure 3 - Illustration of a hybrid cloud. Combination of public and private cloud together
Figure 4 - Illustration of a community cloud. Shared infrastructure among enterprise with a common purpose
Figure 5 - Example of provenance graph
Figure 6 - Schematic of Chimera Architecture
Figure 7 - ESSW conceptual diagram
Figure 8 - Lab Notebook and Labware (ND-WORM) architecture
Figure 9 - An illustration of how a service acts as another client
Figure 10 - PReServ layers
Figure 11 - PASS system architecture
Figure 12 - Edges in the OPM (sources are effects, and destinations are causes)
Figure 13 - Overview of the proposed design on how the system will work
Figure 14 - An architecture diagram of the proposed system design
Figure 15 - ERM diagram for the provenance database
Figure 16 - The structure for client table
Figure 17 - All the databases created in the provenance database
Figure 18 - Using the VNC viewer to connect to the virtual machine (:0 assigned to the virtual machine to access)

Figure 19 - The virtual machine that is ready to be used
Figure 20 - The class diagram of the system created. All the classes connect to the database when performing an operation
Figure 21 - Time taken to record results with and without the provenance data, based on the number of iterations generated from a single call operation
Figure 22 - Time taken to record results with and without the provenance data, based on the number of iterations generated using two different calls
Figure 23 - Time taken to get the result with and without the provenance data
Figure 24 - Initial schedule for completing the project
Figure 25 - Revised schedule for completing the project
Figure 26 - The structure for client table
Figure 27 - The structure for virtual machine table
Figure 28 - The structure for process table
Figure 29 - The structure for result table
Figure 30 - The structure for the table recording the time taken to store

LIST OF TABLES

Table 1 - Comparison between Chimera, ESSW, PASOA and PASS
Table 2 - The entity table schemes for the provenance database
Table 3 - Workload classification in a web database system
Table 4 - Comparison of the system created together with the provenance schemes discussed in Chapter Four
Table 5 - Description of the system used by each scheme
Table 6 - Time taken to record the results together with the provenance data in a single operation call
Table 7 - Time taken to record the results only, in a different operation call
Table 8 - Time taken to query results with provenance data
Table 9 - Time taken to query results without provenance data

CHAPTER I

1. Introduction

1.1 Project Outline

1.1.1 Overall aim and objective of the project

The aim of this project is to derive common data provenance requirements for e-social Science applications and research, and to extend existing provenance frameworks to facilitate the uptake of cloud computing technology for the e-social Sciences. This project should help social scientists use authenticated data from the cloud environment by using provenance to obtain that information. The project will address several objectives:

- Understand current e-social Science projects.
- Investigate cloud computing functionality and data provenance.
- Survey existing provenance schemes and extend them to the cloud paradigm, with an emphasis on meeting the e-social Science requirements.
- Identify the emergent functionality of provenance and cloud.
- Develop an initial schema for e-social Science data provenance in cloud computing.
- Conduct experiments according to a chosen framework.
- Evaluate the experiments conducted.
- Evaluate the whole project process.

1.1.2 Minimum requirement and further enhancement

Minimum requirements of the project:

- Perform requirements analysis for provenance in e-social Science projects.
- Develop an initial scheme for e-social Science data provenance in cloud computing.

1.1.3 List of deliverables

- A cloud-based data provenance schema for e-social Science.
- An analysis of emergent functionality in provenance and cloud.

- Experimental analysis of the feasibility of using provenance in the cloud for e-social Science applications.

1.1.4 Resource required

The resource needed is a cloud service to host sample data to be used by the experiment.

1.1.5 Project schedule and progress report

The schedules for both the initial and the revised plan can be found in Appendix D. The standard waterfall methodology will be used, covering:

- Background reading and literature review on each research area.
- Collection of qualitative analysis from e-social Scientists to derive provenance requirements for future e-social Science projects.
- Development process.
- Evaluation of the development.

Progress report: from the beginning of the project until the submission of the interim report, everything went as planned in the initial schedule. Deviations from the plan started at the beginning of the July 2 phase. Tasks undertaken: the further survey of existing provenance schemes extended into the July 3 and 4 phases while the process of implementing the system continued, and a simple cloud-based data provenance scheme was produced as a proof of concept.

1.1.6 Research methods

- Thorough literature research and analysis with regard to cloud computing, provenance and e-social Science.
- Interviews with e-social Scientists to derive the provenance requirements for e-social Science projects.
- Survey of existing provenance schemes, extending these to the cloud paradigm, with emphasis on meeting the e-social Science requirements.
- Experimental evaluation of cloud provenance using a real test bed.

1.2 Chapter overviews

Chapter One - introduction to the project, describing the aims and objectives.
Chapter Two - background reading and research into e-social Science, cloud computing, and data provenance.
Chapter Three - analysis of the provenance requirements from both the cloud and e-social Science perspectives, and a survey of the current provenance schemes.
Chapter Four - design and implementation of a simple cloud provenance scheme.
Chapter Five - evaluation of the system created and comparison with the previous provenance schemes.
Chapter Six - conclusion and further work on the project.

CHAPTER II

2. Background Reading

Section 2.1 gives a general description of e-Science. Section 2.2 discusses how e-Science technology is used in e-social Science, together with some examples of current projects. Sections 2.3 and 2.4 describe what cloud computing and provenance are.

2.1 e-Science

The National e-Science Centre (NeSC) defines e-Science as the large-scale science that will increasingly be carried out through distributed global collaborations enabled by the internet (Taylor). The fundamental idea behind e-Science is to enable scientists to make new discoveries and obtain advances in areas ranging from dentistry to medicine (escience-grid). It is also a tool that enables scientists to network their data with other researchers, and it deals with data storage and interpretation. The United Kingdom (UK) government has also defined e-Science as:

science increasingly done through distributed global collaborations enabled by the internet, using very large data collections, tera-scale computing resources and high performance visualisation (Illsley, 2011)

The e-Science programme was developed to invent and apply computer-enabled methods that facilitate distributed global collaborations over the internet, sharing very large data collections, tera-scale computing resources and high-performance visualisations (EPSRC, 2009). The government has funded the UK e-Science programme with more than £250 million. Funds were divided between the e-Science core programme, which focused on the development of generic technologies to integrate different resources across computer networks, and the individual research councils' e-Science programmes, each specific to its own discipline. Technically, the initial emphasis of the programme was on exploiting the Grid (the hardware, software, and necessary standards) to co-ordinate geographically distributed and possibly heterogeneous computing and data resources and deliver them over the internet for researchers to use (Halfpenny and Procter, 2010). The purpose was to demonstrate the potential of Grid technologies in advancing the social sciences and to encourage other researchers to adopt the emerging technologies. The Grid has been described as a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities (Foster, 2002), as well as an enabler for Virtual Organisations (Foster et al., 2001).

In e-Science, the aim of grid computing is to construct a cyber-infrastructure or e-infrastructure for use in research collaboration (Scot and Venters, 2006). It has emerged as a major paradigm shift for sharing resources such as processing power, data storage, scientific instruments, etc. (Turner et al., 2009). The grid has a number of distinct components (Procter):

- Access grids: provide advanced video conferencing and collaboration tools, an important element in Virtual Research Environments.
- Computational grids: provide scalable high-performance computing.
- Data grids: provide access to datasets in a way that makes their discovery, linkage and integration more transparent.
- Sensor grids: provide an opportunity to gather data in new ways.

Turner (2009) stated that the use of grid computing has gradually spread towards other sciences and research areas. Applications now cover the areas of business, economics and medicine, as well as the performing arts.

2.2 e-social Science

e-social Science is a term which encompasses technological developments and approaches within social science, in which Social Scientists and Computer Scientists work together on tools and research that Social Scientists can take and use to help their research (NCeSS, 2009). e-social Science is a valuable tool to help Social Science researchers conduct new research, or to conduct research more quickly. Within NCeSS, the 'e' in e-social Science also stands for 'enabling'. Another definition of e-social Science is a programme to use networked information to collect, process, and present social science research data. Like science, the essence of social science is to process (human) information. Unlike the sciences that have immediately benefited from the latest information technologies, most social science research has remained labour-intensive in data collection and processing (Xiaoming Li, 2006). The e-social Science programme was run by the UK National Centre for e-social Science (NCeSS) and was funded by the Economic and Social Research Council (ESRC). The programme was created to facilitate bigger, faster and more collaborative science, driven by a vision of researchers worldwide addressing key challenges in new ways (Halfpenny and Procter, 2010). The programme encompasses two strands that concern the application of grid technologies within social science, and the design, uptake, and use of e-Science (i.e. the design of the grid infrastructure) (Woolgar, 2004).

It was developed in conjunction with the wider developments in the e-Science programme mentioned earlier. NCeSS explores the use of grids in e-Science by exploring their potential for social science research. e-social Science research adopts the grid technologies and tools that have been applied in the natural sciences to advance social science (Scot and Venters, 2006). Distributed data in social science are now common. Some of the issues are data curation, data management, distributed access, platform and location independence, confidentiality, and access control (Turner et al., 2009).

2.2.1 e-social Science Research

NCeSS projects were carried out as three-year projects, or nodes, located across different universities in the UK. The projects ran in two phases. The first phase, which ran from 2004 to 2007, had seven nodes; the second phase, which ran from 2008 to 2012, had eight nodes, three of which were extensions of first-phase nodes, one a combination of two first-phase nodes, and the remaining four new (Halfpenny and Procter, 2010). A hub team based at the University of Manchester coordinated all the e-social Science projects under the directorship of Professor Rob Procter. He re-vitalised the e-social Science research programme with the launch of MeRC (Manchester eresearch Centre). The team was responsible for designing and managing the research programme and its dissemination strategies, played a key role on the commissioning panels, created and exploited synergies across the components of the programmes, and strategically planned future developments (MeRC).

There are two strands in the NCeSS research programme: the applications strand and the social shaping strand (Halfpenny and Procter, 2010). The applications strand focused on taking unfolding developments in technologies, tools and services from the e-Science programme and applying them to the needs of the social science research community. The objective of this strand is to improve existing methods or develop new methods that enable advances in the field that would not otherwise be possible. The second NCeSS research programme strand (the social shaping strand) falls within the social studies of science and technology tradition. The aim of this strand is to understand the social, economic and other influences on how e-Science technologies are being developed and used, and the technologies' implications for scientific practice and research outcomes. This strand seeks to understand the origins of the technological innovations, the difficulties and facilitators of their uptake within the scientific research communities, and to use that knowledge to extend the reach of e-Science (Halfpenny and Procter, 2010).

Below are some of the node programmes:

DAMES (Data Management through e-social Science) is a second-phase node of the NCeSS research programme from the University of Stirling / University of Glasgow (2008-2011). Data management refers to operations on data performed by social science researchers and the tasks associated with preparing and enhancing data for the benefit of analysis (i.e. matching data files together, cleaning data, and operationalising variables) (Lambert et al., 2008). There are four social science theme objectives in DAMES: 1) grid-enabled specialist data environments (occupations, education, ethnicity); 2) micro-simulation on social care data; 3) linking e-health and social science databases; and 4) training and interfaces for data management support.

DReSS (Digital Record for e-social Science) is a node from Nottingham University (2004-2011). It sought to develop Grid-based technologies for social science research through three driver projects with common methodological themes (recording data, replaying data, representing and re-representing data) (Rodden et al., 2008). Driver Project One explored these themes through the use of digital records in ethnographic research to investigate the social character of technology use. Driver Project Two used digital records in corpus linguistics to investigate the multi-modal character of spoken language. Driver Project Three employed digital records alongside psychological approaches to investigate the efficacy of e-learning.

GENeSIS (Generative e-social Science) is a second-phase node formed from the combination of two first-phase nodes, MoSeS (Modelling and Simulation for e-social Science) (Leeds University, 2004-2007) and GeoVUE (Geographic Virtual Urban Environments) (University College London, 2004-2007). MoSeS aimed to use e-Science techniques to develop a national demographic model and simulation of the UK population (Birkin et al., 2006; Townend et al., 2008). GeoVUE focused on generating environments using network-based technologies (i.e. the grid and web-based services) to enable users to map and visually explore spatially coded socio-economic data (Batty, 2006; Steed, 2006). The GENeSIS project seeks to develop models of social systems whose main applications are to build environments and cities using new simulation techniques involving complexity theory, agent-based models and micro-simulation (Birkin and Townend, 2009).

LifeGuide is a project from Southampton University (2008-2011). It is a social science research environment designed by both computer scientists and behavioural psychologists to accelerate research on web-based behavioural interventions (WBBIs) (Yang et al., 2009).

2.3 Cloud Computing

Grid computing requires software that can divide and farm out pieces of a program, as one large system image, to several thousand computers (Myerson, 2009). It is better suited to organisations with large amounts of data being requested by a small number of users, or few but large allocation requests (Schiff, 2010). In grid computing, if one piece of software on a node fails, other pieces of the software on other nodes may also fail. Users will not have access to the hardware and servers to upgrade, install, and virtualise servers and applications. One of the aims of grid computing is to provide a standard set of services and software that enables the sharing of storage resources and geographically distributed computing. This includes a security framework for managing data access and movement, utilisation of remote computing resources, and much more (Rings et al., 2009).

With cloud computing, customers no longer need to be at a personal computer (PC), use an application from a PC, or purchase a specific version of software to be configured on their smartphone or PDA devices. Customers do not have to worry about how servers and networks are maintained in the cloud, as they do not own the infrastructure, software or platform in the cloud. Finally, customers can access multiple servers anywhere in the world without having to know which one they are using or where the server is located (Myerson, 2009). One of the main reasons why people switched to the cloud was that it provides companies with scalable, high-speed data storage and services at an attractive price (Schiff, 2010). It offers a solution for organisations that need resources such as storage or computing (CPU) at highly dynamic levels of demand (Rings et al., 2009). While both grid and cloud aim to provide access to large computing or storage resources, the cloud utilises virtualisation to provide a standardised interface to dynamically scalable underlying resources, hiding physical heterogeneity, geographical distribution, and faults behind the virtualisation layer (Rings et al., 2009).

Cloud computing allows users to plug into and use web services and networked software applications with just a web browser. Using open standard interfaces, cloud computing also allows external users to write applications using web-based services (Adair, 2009). Adair (2009) explained that hardware is not an important factor, since the main emphasis is placed on software and networking. There are two hardware counterparts to the software-driven cloud computing concept mentioned:

1. Network-based storage systems that abstract some of the connectivity details of externally attached storage, making it location-independent for a host machine needing storage space.

2. Parallel or grid computing based on a form of virtualised server resources, where the underlying platform may consist of blade servers or on-demand hardware allocation.

The technologies used to build cloud applications are much the same as those used to develop web sites on the three-tier web architecture. The front end is the end user, client or applications, and the back end is the network of servers with the data storage systems and computer programs (Dave, 2009). IBM, Sun Microsystems, Microsoft, Google, and Amazon are some of the big manufacturers that provide cloud computing services and platforms (Adair, 2009). Based on its essential characteristics, service models, and deployment models, the National Institute of Standards and Technology (NIST) has proposed the following definition:

Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. (Mell and Grance, 2009)

Cloud computing can also be described as applications delivered over the internet, together with the system software and hardware in the data centres that provide those services (Armbrust et al., 2009). These services come to customers on an on-demand basis, where customers pay only for what they really use. The three most often mentioned services are Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS) (Mell and Grance, 2009). NIST defines those three services as:

Software as a Service (SaaS): the customer uses the provider's applications without having any control over the running of the operating system, hardware or network infrastructure. Examples of this service are the webmail provided by Yahoo! Mail, Hotmail and Gmail, as well as Google Docs provided by Google (Robison, 2010).

Platform as a Service (PaaS): provides customers with more than just the software. It allows customers to host their applications in the cloud infrastructure. The customer has control over the deployed applications, and possibly some control over the hosting environment, but has no control over the cloud infrastructure (network, servers, operating system and storage).

Infrastructure as a Service (IaaS): provides fundamental computing resources to the customer, such as processing, storage and networks. The customer has control over the operating system, storage, deployed applications, and possibly networking components (e.g. firewalls), but has no control over the cloud infrastructure beneath.

2.3.1 Deploying Models

Along with the services, there are several ways to deploy the cloud. Mell and Grance (2009) give four deployment models of cloud computing, as follows:

Public Cloud. The cloud infrastructure is made available to the general public or a large industry group and is owned by an organisation selling cloud services.

Figure 1: Illustration of a public cloud. Users access data and applications without knowing how the underlying architecture works. Source: (Amrhein et al., 2009)

Private Cloud. The cloud infrastructure is operated solely for an organisation. It may be managed by the organisation or a third party and may exist on premise or off premise.

Figure 2: Illustration of a private cloud. Useful for a large enterprise to handle a maximum workload. Source: (Amrhein et al., 2009)

Hybrid Cloud. The cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardised or proprietary technology that enables data and application portability (e.g., cloud bursting for load-balancing between clouds).

Figure 3: Illustration of a hybrid cloud. Combination of a public and a private cloud. Source: (Amrhein et al., 2009)

Community Cloud. The cloud infrastructure is shared by several organisations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be managed by the organisations or a third party and may exist on premise or off premise.

Figure 4: Illustration of a community cloud. Shared infrastructure among enterprises with a common purpose. Source: (Amrhein et al., 2009)

2.4 Provenance

Provenance refers to the sources of information, such as the entities and processes, involved in producing or delivering an object. The provenance of information is crucial to determining whether information can be trusted, how to integrate diverse information sources, and how to give credit to originators when reusing information. In a free environment such as the Web, users may find information that is often contradictory or questionable (W3C, 2005). As a working group, the W3C defined provenance as:

Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance. (W3C, 2005)

Other available definitions of provenance are:

The provenance of a piece of data is the process that led to that piece of data. (Groth et al., 2006)

Digital provenance is metadata that describes the ancestry or history of a digital object. (Muniswamy-Reddy et al., 2010)

Nowadays, data can easily be obtained through the web. Data available on the cloud can be heterogeneous. It is hard to know how data were updated and how trustworthy the data are. The use of provenance in the cloud is crucial in this case. Without provenance, data consumers have no means to verify the authenticity or identity of the data (Muniswamy-Reddy et al., 2010). Provenance can help to validate the process used to generate datasets, so that researchers can decide whether they want to use the data. Another reason is that hardware and software releases on the cloud could be faulty; provenance can help to identify whether a dataset was tainted by faulty hardware or software in the cloud (Muniswamy-Reddy and Seltzer, 2009). The use of provenance can establish an end user's trust, since it can serve as an indicator of data quality (Souilah et al., 2009). Providing information about the creation of a dataset is also one of the most important ways in which provenance can contribute to trust (Rau and Fear, 2011). Rau and Fear (2011) also stated that when users trust the data used, it is most likely that the data will be reused in the future. It allows users to accept the data and results created by the authors of the reports.

Provenance can also be abstractly defined as a directed acyclic graph (DAG), or provenance graph, where the edges of the graph signify the dependencies between the nodes, which helps to explain how a data product or an event came to be produced in an execution (Moreau, 2010; Groth, 2008). Moreau (2010) assumed that the nodes and edges in the DAG represent data items and data derivations. Figure 5 is a simple example for the computation 3 + 4 = 7, adopted from (Acar et al., 2010). The graph shows that 7 was the result of an addition process with inputs 3 and 4. The provenance graph illustrates how the origin of a particular process can be traced. In practice, scientific researchers may conduct the same experiment repeatedly, and hence experimental results need to be reproducible. A trace of the origin of the raw data is needed to ensure trusted and authenticated data were used. This is because scientists perceive provenance as a crucial component of workflow systems that supports the reproducibility of their scientific analyses and processes (Moreau et al., 2011a).

Figure 5: Example of provenance graph
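To make the graph structure concrete, the following is a minimal, hypothetical Java sketch of the 3 + 4 = 7 provenance graph of Figure 5; the class and method names are invented for illustration and are not part of any scheme discussed in this thesis. Edges point from a result back to the process and inputs that produced it, so walking the graph backwards reconstructs the lineage.

import java.util.ArrayList;
import java.util.List;

// Minimal provenance DAG sketch: each node records the nodes it was derived from,
// so edges point from an effect back to its causes.
class ProvNode {
    final String label;                      // e.g. a data value or a process name
    final List<ProvNode> causes = new ArrayList<>();

    ProvNode(String label) { this.label = label; }

    void derivedFrom(ProvNode cause) { causes.add(cause); }

    // Walk the graph backwards to reconstruct how this node came to be.
    void printLineage(String indent) {
        System.out.println(indent + label);
        for (ProvNode c : causes) c.printLineage(indent + "  ");
    }
}

public class ProvenanceGraphDemo {
    public static void main(String[] args) {
        ProvNode in1 = new ProvNode("input: 3");
        ProvNode in2 = new ProvNode("input: 4");
        ProvNode add = new ProvNode("process: addition");
        ProvNode out = new ProvNode("result: 7");

        add.derivedFrom(in1);   // the process used both inputs
        add.derivedFrom(in2);
        out.derivedFrom(add);   // the result was generated by the process

        // Prints the lineage of 7: the addition process and its two inputs.
        out.printLineage("");
    }
}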

Provenance also concerns a very broad range of sources and uses. Business applications may exploit provenance in deciding whether to trust a product, considering the manufacturing processes involved. In a scientific context, data are integrated depending on the collection and pre-processing methods used, and the validity of an experimental result is determined based on how each analysis step was carried out. In e-social Science, there are questions about how the data were created, analysed, or interpreted, and how conclusions were drawn. The use of provenance answers these questions. Provenance provides information about how data were created, as well as information about the context of the data (Philip et al., 2007). The context of the data is important in evaluating the quality and reliability of the data, the robustness of the analysis, the generalisability, and the validity of findings. This is particularly useful when using that data in research. Examples of such context include an account of the characteristics of secondary data and why it was used, and an account of data collection methods, including who collected the data and from whom, etc.

Philip et al. (2007) also stated that conducting research in an ethical manner is important in e-social Science. Researchers must be able to produce, through a transparent process, a robust system for the analysis of data which forms evidence of policy success or failure. This also applies to legal issues such as data protection, copyright and intellectual property rights. This is thus one of the important reasons for having provenance in e-social Science. Provenance can support decision makers by producing evidence of a context that shows all the data, methods and instruments used. According to The Green Book from the UK Government (Treasury, 2003), evidence is important to support conclusions and recommendations, to ensure that decision makers understand the assumptions underlying the conclusions of the analysis and the recommendations put forward.

The study conducted shows that provenance is an important component of e-social Science. The context of the data used is important when evaluating the research conducted, especially if the data were used to process human information such as health records.

Having adopted e-Science grid technology to collect, process, integrate, share, and present social and behavioural data, e-social Science could in future take up cloud computing technology, with its on-demand strategy where customers pay only for what they use. However, data stored on the cloud can be heterogeneous and can pose problems when trying to obtain information about the context of the data. The integrity of the data used will also be crucial for e-social Scientists when conducting experiments in a cloud environment. This issue can be solved by using data provenance when deriving data from the cloud environment. The study also shows that there is currently a lack of analysis and surveys of current provenance schemes for the cloud. To deal with this problem, provenance requirements for e-social Scientists are needed to ensure the right information is captured. The current provenance schemes also need to be studied.

Chapter Three begins by deriving the provenance requirements for both the cloud and e-social Science. A survey and analysis of the current provenance schemes is then conducted, in which various provenance schemes used in different domains are analysed. Examples of these schemes are the Provenance Aware Service-Oriented Architecture (PASOA) (Groth et al., 2005a), which has been used in the biology domain; the Earth System Science Workbench (ESSW) (Frew and Bose, 2001), used by Earth Science researchers; and Chimera (Foster et al., 2002), used in the physics and astronomy domain. Each of these schemes is discussed in further detail in Chapter Three.

CHAPTER III

3. Analysis & Survey

3.1 Requirement Analysis

Requirement analysis is part of the methodology for obtaining the necessary requirements. Two methods were used to derive the requirements: a study of published research papers related to the topic, and interviews with e-social Scientists to derive the provenance requirements. The analysis comprises three parts: the requirements of provenance itself, the provenance requirements for the cloud, and the provenance requirements of e-social Scientists. Each of these is derived in a separate subsection as follows.

3.1.1 Provenance requirement

Before defining the provenance requirements for the cloud and the provenance requirements of e-social Scientists, requirements regarding provenance itself need to be gathered. Dave (2009) stated that the technologies used to build cloud applications are the same as those used to develop web sites on the three-tier web architecture; hence the requirements identified by Groth et al. (2004) can be used. Some of the requirements identified are:

- Verifiability: the ability to verify a process in terms of the actors involved, their actions, and their relationships with one another.
- Reproducibility: the ability to repeat, and possibly reproduce, a process from the provenance that has been stored (illustrated in the sketch after this list).
- Preservation: the ability to maintain provenance information for a long period of time.
- Scalability: as provenance information could be bigger than the output data, it is necessary for the provenance system to be scalable.
- Generality: where different systems might be used, the provenance system should be general enough to record provenance from various different applications.
- Customisability: to allow for more application-specific use of provenance information, a provenance system should allow for customisation. Aspects of customisability could include constraints on the type of provenance recorded, time constraints on when recording can take place, and the granularity of the provenance to be recorded.
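To illustrate the reproducibility requirement, here is a small, hypothetical Java sketch of a replay check: re-run the process recorded in the provenance on its recorded inputs and compare the new output with the stored one. The record layout and all names are assumptions made for illustration only, not a prescribed interface.

import java.util.List;
import java.util.function.Function;

// Hypothetical provenance record: which process ran, on which inputs,
// and what result was stored at the time.
record ProvenanceRecord(String processName, List<Integer> inputs, int storedResult) {}

public class ReproducibilityCheck {
    // Re-execute the process described by the provenance and verify the result.
    static boolean reproduce(ProvenanceRecord rec, Function<List<Integer>, Integer> process) {
        int freshResult = process.apply(rec.inputs());
        return freshResult == rec.storedResult();
    }

    public static void main(String[] args) {
        // The 3 + 4 = 7 example from Chapter II, recorded as provenance.
        ProvenanceRecord rec = new ProvenanceRecord("addition", List.of(3, 4), 7);
        Function<List<Integer>, Integer> addition =
                inputs -> inputs.stream().mapToInt(Integer::intValue).sum();
        System.out.println(rec.processName() + " reproducible: " + reproduce(rec, addition)); // true
    }
}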

3.1.2 Provenance requirement for cloud

Muniswamy-Reddy et al. (2009) set out some requirements that cloud providers must meet in order to make provenance a first-class citizen. These requirements ensure that clients benefit from reduced development effort when storing provenance, and that providers can take advantage of the rich information inherent in provenance for applications ranging from search to improving the performance of their services. The requirements are:

- Co-ordination between Storage and Compute Facilities: cloud providers usually provide both storage and compute facilities. During computation, data can be generated and transmitted. A virtual machine can be installed that handles the automated transmission of provenance to the storage device together with the data, as well as keeping track of the provenance record.
- Allow Clients to Record Provenance of their Objects: clients may generate data locally, and the data may then be stored on the cloud. The client should be responsible for tracking the provenance and providing it to the cloud for storage. Cloud providers should provide an interface that allows clients to record provenance for data generated locally rather than on the cloud.
- Provenance Data Consistency: provenance stored on the cloud should be consistent with the data it describes. Since the cloud inherits from distributed systems, the provenance can become inconsistent if the provenance and the data are recorded by separate methods. Cloud providers should provide an interface that stores provenance and data together automatically, to ensure consistency between them (see the sketch after this list).
- Long Term Persistence: provenance should be kept for a longer period than the object it describes, since an unrelated object could still be connected to the provenance. This ensures that the provenance chain is not split when objects are deleted.
- Provenance Accessibility: clients may want to access the provenance database to verify properties or simply to check the lineage of the data used. Cloud providers should support efficient provenance queries by providing the right interface.
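As an illustration of the consistency requirement, below is a hypothetical Java sketch of a client-side interface that stores an object and its provenance in one call. CloudStore and its methods are invented for this example and do not correspond to any real provider's API.

import java.util.HashMap;
import java.util.Map;

// Imagined minimal storage interface, standing in for a cloud provider's SDK.
interface CloudStore {
    void putObject(String key, byte[] data);
    void putProvenance(String key, Map<String, String> provenance);
}

class ProvenanceAwareClient {
    private final CloudStore store;

    ProvenanceAwareClient(CloudStore store) { this.store = store; }

    // Store the data and its provenance as one logical operation. A real provider
    // would need to make this atomic on the server side; here the pairing is only
    // enforced by the wrapper, which is exactly the gap the requirement points at.
    void putWithProvenance(String key, byte[] data, Map<String, String> provenance) {
        store.putObject(key, data);
        store.putProvenance(key, provenance);
    }
}

public class ConsistencyDemo {
    public static void main(String[] args) {
        // A toy in-memory backend standing in for the cloud.
        Map<String, Object> backend = new HashMap<>();
        CloudStore store = new CloudStore() {
            public void putObject(String key, byte[] data) { backend.put(key, data); }
            public void putProvenance(String key, Map<String, String> p) { backend.put(key + ".prov", p); }
        };
        new ProvenanceAwareClient(store).putWithProvenance(
                "census/sample.dat", new byte[] {1, 2, 3},
                Map.of("creator", "client-42", "process", "anonymise", "input", "census/raw.dat"));
        System.out.println(backend.keySet()); // both the object and its provenance were stored
    }
}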

3.1.3 Provenance requirement by the e-social Scientist

Getting the requirements from the social scientists is crucial, as they will be the clients who use the system to retrieve, insert, and analyse the data saved on cloud storage for research purposes. In this study, interviews were used to obtain the requirements from the social scientists.

Andy Turner, a social scientist from the School of Geography, University of Leeds, who was involved in one of the e-social Science node projects in the UK (the GENeSIS project), was chosen as a contact. During the interview, Turner pointed out various general points about the data used by social scientists. In social science, various data about people are kept. Some of the data can be described as open data, which everyone is authorised to access. Another type is closed data, which is classified and only available to authorised research projects, and is usually stored in an anonymised form.

An example of such data mentioned by Turner is the Census survey data, considered one of the most important data collections for researchers (within Turner's research subject). The recently collected Census survey is now in the process of being updated and will soon be available for researchers to use in various output data forms. Turner stated that it is important to link the Census data with other databases, such as a council registry on health conditions that can be shared across health service providers, both public National Health Service (NHS) providers and private providers, using a shared common registry. In the example of people with cancer, this would allow the city council to track the record of patients' cancer stages so that more help can be provided systematically. All of this information can be linked between the databases, where the name and address can be used as the identifier (for the Census in the UK). In other parts of the world, a unique code stitched together with the data is used to link it with other government datasets.

If e-social Scientists were to use such a system on the cloud, data security would be one of the issues. If the cloud infrastructure is not secured, it will expose sensitive data to unauthorised users. This issue could be the subject of another study on securing cloud applications for the use of e-social Scientists. As for the data, Turner mentioned that provenance could be useful to provide confidence and trust when using data from the cloud.

Result replicability is also crucial. The example given was running a simulation-type model to simulate cities or populations, where changes in the characteristics of the population over time need to be repeated. When an average trend result is achieved during an experiment, the researcher will want, for debugging and other reasons, to be able to reproduce the exact same result from a given set of metadata inputs. Another example is the ability to reproduce the results of a city evacuation simulation using different scenarios of transport changes. Replicability is also useful when running the same program in a distributed fashion with ten different instances, where those instances are evaluated against the input. Other ways of running the model will have certain stages that start with a given set of data, process it, get the result, and archive it at a different repository. A separate process will be required to draw on these results.

An interview was also conducted with Dr Colin Venters, a computer scientist who worked with social scientists when he was a Grid Engineer at NCeSS. Dr Venters pointed out several data provenance issues:

- Currently there are no exact data provenance specification requirements to be kept. The data provenance used by social scientists varies depending on the project conducted.
- The quality of the data provenance is an important issue. The integrity of the data provenance collected is crucial, as the data used will have to be validated before being used for research purposes.
- e-social Science researchers usually conduct several experiments before using the final output data as input data for another experiment.

From the interview sessions above, the provenance requirements for e-social Science can be derived. The full interview with Andy Turner can be found in Appendix E. The key provenance requirements are:

1. The ability to replicate the results based on the input data.
2. The ability to evaluate the processes based on the metadata.
3. The ability to secure the sensitive data obtained from being accessed by unauthorised users.

In this project, security issues will not be discussed. An assumption regarding security will be made when designing the cloud provenance scheme as part of the alternative solution later.

3.2 Current Provenance Schemes Survey

Provenance can be applied to almost anything, in either a scientific or a business domain. In the business domain, organisations usually work with large volumes of organised data that are shared across the corporation. Even for data shared among trusted partners, validation of the originality of the data is still required. In the scientific domain, researchers often share the experiments conducted. Other researchers who would like to use the results may have several questions before using them, such as: who did the experiment? What methods were used? What are the original data? In this section, a survey of current scientific provenance schemes is presented, in order to understand how the recording and querying of provenance is done. Even though none of the schemes discussed below was created for e-social Science projects, as mentioned in Section 2.2, social science research is exploring the possibility of using the e-Science grid projects for conducting social science research. An understanding of the schemes below is thus important for understanding how recording and querying provenance can be done, so that it can be applied to cloud applications.

3.2.1 Chimera (Foster et al., 2002)

Chimera, a virtual data system for representing, querying, and automating data derivation, is a provenance system that manages the derivation and analysis of data ('virtual data') objects in collaborative environments. It has been applied in the physics and astronomy domains, where it has provided generic solutions for scientific communities, such as the generation and construction of simulated high-energy physics collision event data for the Compact Muon Solenoid (CMS) experiment at CERN (Innocente et al., 2001) and the detection of galactic clusters in Sloan Digital Sky Survey (SDSS) data (Annis et al., 2002). It uses a process-oriented model to record provenance, constructing workflows (in the form of directed acyclic graphs (DAGs)) using the high-level Virtual Data Language (VDL).

Figure 6: Schematic of Chimera Architecture

The architecture of the Chimera virtual data system (Figure 6) consists of two main components: the Virtual Data Catalogue (VDC), which implements the Chimera virtual data schema defining the objects and relations used to capture descriptions of program invocations, and the Virtual Data Language (VDL) interpreter, which is used for defining and manipulating the data derivation procedures stored in the VDC.

The Chimera virtual data schema in the VDC is divided into three parts: transformations, derivations, and data objects. A transformation is a schema defining the formal types of input and output required to execute a particular application and is mapped onto an executable; a derivation represents the execution of a particular transformation; and a data object is the input or output of a derivation (Arbree et al., 2003). VDL comprises data derivations for populating the system database and query statements for retrieving information from the database. The VDL conforms to a schema that represents data products as abstract typed datasets (comprising files, tables and objects) and their materialised replicas, which are available at physical locations.

Provenance in Chimera is represented in VDL and managed by the VDC service. The VDC maps the VDL to a relational schema and stores it in a relational database that can be accessed using SQL queries. The metadata of a process can be stored in single or multiple VDC stores, which enables scaling across organisations. The provenance information can then be retrieved from the VDC using queries written in VDL that search for the derivations that generated a particular dataset. The search can use criteria such as the input/output filename, the transformation name, or the application name itself (Annis et al., 2002).
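As a rough illustration of how such a catalogue might be queried, the following Java/JDBC sketch issues a lineage query over a hypothetical relational rendering of the transformation/derivation/data-object schema. The table and column names, the example filename, and the JDBC URL are all invented for illustration; they are not Chimera's actual schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class LineageQuery {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders for whatever database hosts the catalogue.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/provenance", "user", "password")) {
            // Find the derivations (process executions) that produced a given file,
            // together with the transformation each derivation invoked.
            String sql =
                "SELECT d.id, t.name AS transformation, d.output_file " +
                "FROM derivation d JOIN transformation t ON d.transformation_id = t.id " +
                "WHERE d.output_file = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "galaxy_clusters.dat");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.printf("derivation %d ran %s to produce %s%n",
                                rs.getLong("id"), rs.getString("transformation"),
                                rs.getString("output_file"));
                    }
                }
            }
        }
    }
}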

VDL comprises data derivation statements for populating the system database and query statements for retrieving information from the database. The VDL conforms to a schema that represents data products as abstract typed datasets (comprising files, tables and objects) and their materialised replicas, which are available at a physical location. Provenance in Chimera is represented in VDL and managed by the VDC service. The VDC maps the VDL onto a relational schema and stores it in a relational database that can be accessed using SQL queries. The metadata of a process can be stored in a single VDC or across multiple VDC stores, which enables scaling across organisations. The provenance information can then be retrieved from the VDC using queries written in VDL that search for the derivations that generated a particular dataset. The search can be made using criteria such as the input/output filename, the transformation name, or the application name itself (Annis et al., 2002). The result of the query is an abstract workflow plan in DAG format, where each node represents an application and each edge represents input/output data. When a requested dataset needs to be regenerated, the provenance is able to guide the workflow planner in selecting an optimal plan for resource allocation. Foster et al. (2002) stated that the design of Chimera is not only feasible for representing complex data derivation relationships; it can also integrate the virtual data concept into the operational procedures of large scientific collaborations.

3.2.2 Earth System Science Workbench (ESSW) (Frew and Bose, 2001)

The Earth System Science Workbench (ESSW) is a metadata management and data storage system used by earth science researchers who manage custom satellite-derived data products, such as images received from the Advanced Very High Resolution Radiometer (AVHRR) sensors on board National Oceanic and Atmospheric Administration (NOAA) satellites (Kidwell, 1998), and by researchers who manage metadata for ecological research projects. The key aspect of the metadata created in the workbench is the lineage. It is used to detect errors in deriving the data products and to determine the quality of the datasets collected. ESSW processes the data using a scripting model (Perl scripts). The scripts are used to wrap legacy code, reducing the burden of refashioning older programs to generate useful metadata, so that only minimal alterations need to be made.
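To make the wrapping idea concrete, the sketch below illustrates the general pattern of wrapping a legacy processing step so that it emits lineage metadata. It is written in Java (the implementation language used later in this project) rather than ESSW's Perl, and every class and field name in it is an illustrative assumption, not part of ESSW.

    import java.util.UUID;

    // A hypothetical wrapper around a legacy processing step. ESSW uses Perl
    // scripts for this role; the pattern, not the API, is what is illustrated.
    public class LineageWrapper {

        // Minimal lineage record: each output remembers the metadata ID of the
        // step that produced it and of the input it was derived from.
        public static class LineageRecord {
            public final String metadataId = UUID.randomUUID().toString();
            public final String parentId;   // metadata ID of the input object
            public final String scriptName; // the legacy program being wrapped

            public LineageRecord(String parentId, String scriptName) {
                this.parentId = parentId;
                this.scriptName = scriptName;
            }
        }

        // Run the legacy step unchanged, but record a parent-child lineage link.
        public static LineageRecord runStep(String scriptName, String inputMetadataId) {
            // ... invoke the unmodified legacy program here ...
            return new LineageRecord(inputMetadataId, scriptName);
        }

        public static void main(String[] args) {
            LineageRecord raw = runStep("ingest_avhrr", null);
            LineageRecord derived = runStep("derive_product", raw.metadataId);
            // Following parentId links downwards from 'derived' reproduces the
            // lineage trace described in the text.
            System.out.println(derived.metadataId + " <- " + derived.parentId);
        }
    }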

A DAG model serves as the framework for defining the workflow process and the metadata collected for each experiment and step. The metadata are defined in XML document type definition (DTD) format, which includes the specification of the metadata elements. The ESSW workflow scripts link the data flow between successive scripts to form the lineage trace of the data products using their metadata IDs. By chaining the scripts and the data with parent-child links, ESSW produces a balanced lineage between the data and the processes. This chain helps the data provider discover the source of errors in a derived data product by tracing down the lines of connected objects. The workflow metadata and lineage can then be navigated in DAG form through a web browser that uses Hypertext Preprocessor (PHP) scripts to access the provenance database (Bose and Frew, 2004). Figure 7 below shows the conceptual diagram of ESSW.

Figure 7: ESSW conceptual diagram. Source: (Frew and Bose, 2001)

The ESSW architecture consists of two basic components: the Lab Notebook and Labware. The Lab Notebook is an application that logs metadata and lineage for experiments in XML format and stores them in a relational database. Labware is a No-Duplicate Write Once Read Many (ND-WORM) service that manages the storage archive for the Lab Notebook by keeping unique processed files and namespace metadata, including data, in a relational database. The Lab Notebook is a Java client/server application. The server collects specific metadata values from the client and constructs them as XML documents.

These XML documents, as well as the parsed metadata, are then transferred to database records. The Lab Notebook server consists of three main components: the Lab Notebook Daemon (which responds to the client API), the Lab Notebook Console (which provides an interface for submitting XML DTDs), and the Lab Notebook Database (storage for the XML and metadata documents). The Labware (ND-WORM) service is also a Java client/server application. Its server operates through the interaction of three components: the Labware server (which copies files to archives, groups them, and sends pertinent information to clients in response to requests), the Labware Database, and dedicated disk storage.

Figure 8: Lab Notebook and Labware (ND-WORM) architecture. Source: (Frew and Bose, 2001)

3.2.3 Provenance Aware Service-oriented Architecture (PASOA) (Groth et al., 2004, Groth et al., 2005a)

The Provenance Aware Service-Oriented Architecture (PASOA) is a provenance infrastructure project built for recording, storing, and analysing provenance using an open provenance protocol, and is used by the e-Science community to foster interoperability. PASOA has been applied in the biology domain to identify the scripts and execution steps that were invoked, as well as to keep a copy of the changes made, so that differences between one execution and another can be detected. It has been used to decide whether two results were obtained by the same scientific process, for example to check whether valid operations were performed, or to determine the specific data item used as a computation input.

In PASOA, both clients and services act as actors; a service can act as a client that invokes other services. Figure 9 illustrates how a service acts as a client to another service.

Figure 9: An illustration of how a service acts as another client. Source: (Groth et al., 2004)

The PASOA architecture consists of two main parts: the Provenance Recording Protocol (PReP) and Provenance Recording for Services (PReServ). PReP defines the interaction provenance messages that are generated by actors, synchronously or asynchronously, with each service invocation (Simmhan et al., 2005). PReP is a four-phase protocol consisting of a negotiation phase, an invocation phase, a provenance recording phase, and a termination phase. In other words, actors reach an agreement before invoking a service; the interaction provenance is then recorded and the protocol is terminated. The interaction and provenance messages generated by actors in the workflow are linked using an ID that is carried in the provenance message itself. A trace can then link all assertions that have the same ID as the assertion containing the data as output. PReP only records documentation of the activities invoked by the actors; it is not designed for duplication of data (Groth and Moreau, 2009). PReP currently does not include security in its specification; this has been proposed as future work. PReServ is a Java-based web implementation of the PReP protocol that stores provenance in a relational database, a file system, or in memory (Groth et al., 2005b). It contains a provenance store for web services, a set of interfaces for recording and querying provenance messages, a client-side library for communicating with the provenance store, and an Apache Axis library that automatically records the messages exchanged in PReP and constructs the workflow. PReServ has three main components: the Message Translator, Plug-Ins, and Backend Storage. The Message Translator isolates the Provenance Store's storage and query logic from the message layer, allowing the Provenance Store to be easily modified to support different underlying message layers.

The Plug-Ins implement the functions provided by the Provenance Store (i.e. the store plug-in and the query plug-in). The last component is the Backend Storage, where the provenance assertions held in the Provenance Store are kept in a database, a file system, or memory. Figure 10 illustrates the PReServ layers.

Figure 10: PReServ layers. Source: (Groth et al., 2005b)

3.2.4 Provenance-Aware Storage System (PASS) (Muniswamy-Reddy et al., 2006)

The Provenance-Aware Storage System (PASS) is a storage system that collects provenance automatically and transparently for the objects stored in it. PASS has been used successfully on the cloud as a system for collecting provenance (Muniswamy-Reddy et al., 2010). PASS observes the system calls that applications make and captures the relationships between objects to construct the provenance graph. In the example given by Muniswamy-Reddy et al. (2006), when a process issues a read system call, PASS creates a provenance edge recording the fact that the process depends upon the file being read. When a write system call is issued, PASS creates an edge stating that the file written depends upon the process that wrote it, thus transitively recording the dependency between the file read and the file written. For each process working within the PASS system, it records several attributes, such as the command line arguments, environment variables, process name, process ID, and a reference to the parent of the process. PASS stores the data and the provenance records together to ensure that the provenance collected is consistent with the data. PASS also records the provenance of temporary objects, since persistent objects such as files may be related to each other via data flows through those temporary objects.
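The dependency edges that PASS derives from read and write system calls can be pictured with a small sketch. The code below is only an illustration of the idea in Java; PASS itself is implemented inside the Linux kernel, and all of the names here are hypothetical.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative model of PASS-style dependency edges; not PASS's actual API.
    public class PassEdgeSketch {

        // An edge "from depends upon to", as created on read/write system calls.
        record Edge(String from, String to) {}

        static final List<Edge> graph = new ArrayList<>();

        // On a read system call: the process depends upon the file being read.
        static void onRead(String process, String file) {
            graph.add(new Edge(process, file));
        }

        // On a write system call: the file written depends upon the process.
        static void onWrite(String process, String file) {
            graph.add(new Edge(file, process));
        }

        public static void main(String[] args) {
            onRead("sort[pid 4711]", "input.txt");
            onWrite("sort[pid 4711]", "output.txt");
            // Following the two edges transitively shows that output.txt
            // depends on input.txt, as described in the text.
            graph.forEach(e -> System.out.println(e.from() + " depends upon " + e.to()));
        }
    }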

Figure 11: PASS system architecture. Source: (Seltzer et al., 2005)

The PASS system architecture (Figure 11) consists of two main components, the Collector and the Storage Layer, which reside within the Virtual File System (VFS) Layer. The VFS Layer provides a uniform interface for the kernel to deal with various input/output requests and specifies a standard interface that each file system must support (Kroeger, 2001). The Collector generates a provenance record for each provenance-related system call and binds the record to the appropriate structure: it intercepts system calls and translates them into in-memory provenance records that are attached to the key kernel data structures. The Storage Layer is composed of a stackable file system called PASTA, which uses an in-kernel database engine to store the metadata, and an in-kernel port of the Berkeley DB embedded database library, called the in-kernel Berkeley DB Database (KBDB), to store and index the provenance data. The provenance in the database can be queried from a variety of programming languages, and PASS also has a built-in file system browser tool for querying the provenance stored in the Berkeley DB database. PASS does not currently implement security; a security model is planned for the second version of PASS, which will look at provenance ancestry and attributes separately.

3.3 Discussion of the current provenance schemes

The schemes above show that various systems are available to record and query provenance in different scientific domains. Data are now increasingly shared across organisations, and it is essential for provenance to be shared along with the data.

Cloud computing offers users a variety of services covering the entire computing stack, from the hardware up to the application level, by means of virtualisation technology on a pay-per-use basis. This gives researchers the ability to scale the computing infrastructure up and down according to the application requirements and the user's budget (Vecchiola et al., 2009). The cloud also offers access to a large distributed infrastructure and allows researchers to customise their execution environment so that they have the desired setup for their experiments. Each scheme discussed also has its own protocol for managing provenance, and there is no open standard for collecting, storing, representing and querying provenance shared between the four schemes described. This issue can be addressed by applying the Open Provenance Model (OPM) specification (Moreau et al., 2011b) to each of the schemes discussed above. The OPM is a provenance model designed to meet a set of requirements so that it will:

- Allow provenance information to be exchanged between systems, by means of a compatibility layer based on a shared provenance model.
- Allow developers to build shared tools that operate on such a provenance model.
- Define provenance in a precise, technology-agnostic manner.
- Support a digital representation of provenance for anything, whether produced by computer systems or not.
- Allow multiple levels of description to coexist.
- Define a core set of rules that identify the valid inferences that can be made on a provenance representation.

Figure 12: Edges in the OPM (sources are effects, and destinations are causes). Source: (Moreau et al., 2011b)

An OPM representation takes the form of a directed graph. Figure 12 above shows an example of what an OPM representation looks like. The first and second edges show that a process (P) used an artefact (A) and that an artefact (A) was generated by a process (P). The edge is annotated with a role (R). The role is important when mapping the graph, because a process may use more than one artefact, and each may have a specific role.
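As an illustration only, the two edge kinds just described could be written down as simple data types. The OPM defines a technology-agnostic model rather than an API, so the Java types below are assumptions made for the sake of the example.

    // A minimal sketch of the two OPM edges described above.
    public class OpmSketch {

        record Artifact(String id) {}
        record Process(String id) {}

        // "Process P used artifact A in role R"
        record Used(Process effect, Artifact cause, String role) {}

        // "Artifact A was generated by process P in role R"
        record WasGeneratedBy(Artifact effect, Process cause, String role) {}

        public static void main(String[] args) {
            Artifact input = new Artifact("A1");
            Process p = new Process("P1");
            Artifact output = new Artifact("A2");

            // Sources are effects and destinations are causes, as in Figure 12.
            Used u = new Used(p, input, "seed");
            WasGeneratedBy g = new WasGeneratedBy(output, p, "result");

            System.out.println(u);
            System.out.println(g);
        }
    }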

Table 1 below shows a comparison of the schemes discussed, based on the principles below, together with a finding of whether each meets the e-social Science requirements derived in the previous section.

3.3.1 Storage Repository

The size of the provenance data depends on how rich the collected provenance information is, and it may vary between processes. The manner in which provenance is stored is important for scalability, because the provenance of a given process can exist in many versions.

3.3.2 Representation Scheme

There are two ways to represent provenance: the annotation and the inversion approach (Simmhan et al., 2005). The annotation approach collects metadata comprising the derivation history of the data product as well as the process. The inversion method uses the property by which some derivations can be inverted to find the input data supplied to derive the output data.

3.3.3 Result Replication

A researcher might perform various experiments for a project. An experiment might use results derived from a previous stage before proceeding to the next stage. If an error is found at the current stage, the researcher might want to use the previous result dataset again. Result replication allows the researcher to reuse that dataset.

3.3.4 Provenance Distribution

A system should allow researchers to access the provenance in various ways. One method is a DAG, which a researcher can browse and inspect as a tree. Another is searching for datasets based on the provenance metadata, for example to locate datasets generated by a flawed execution or to find the owner of the source data used to derive a certain dataset.

3.3.5 Evaluate Metadata

A researcher usually conducts several experiments before using the final output data as input data for another experiment. The researcher will want to be able to evaluate the metadata so that the final output used is a reliable source for further use.

Criterion | Chimera | ESSW | PASOA | PASS
Domain applied | Physics, Astronomy | Earth Science | Biology | None
Storage Repository | VDC / Relational Database | Relational Database | PReServ; Relational Database, File System | Relational DB, File System
Representation Scheme | VDL Annotation | XML Annotation | Annotation | Annotation
Provenance Distribution | Queries | Browse | Queries | Queries
Will it be able to replicate the output result? | Yes | No | No | Yes
Does it evaluate based on the metadata? | Yes | Yes | No | Yes
Does it currently work on cloud? | No | No | No | Yes
Does it offer securing of the data obtained? | No | No | Proposed | Proposed

Table 1: Comparison between Chimera, ESSW, PASOA and PASS

Based on the requirement analysis conducted and the provenance schemes discussed in this chapter, a design for a simple provenance scheme for an e-social Science cloud application can be drawn up as a proof of concept to show that recording provenance in a cloud environment is possible. The next chapter presents the second solution to the problem stated at the end of Chapter Two, where a simple cloud provenance scheme is created using available resources.

CHAPTER IV

4. Design and Implementation

4.1 Proposed Design

The system created will work as a proof of concept to see whether the provenance schemes discussed in Chapter Three can run on a cloud application and meet the e-social Science requirements gathered. Properties not intended to be built into the test system are stated as assumptions.

Assumptions:

- Security of access to the cloud is ignored at this stage (the system is assumed to be already secure).
- Data are transferred automatically to the provenance database when a process is run.

4.1.1 System Design Scenario

Clients use a cloud virtual machine (VM) to process data retrieved from the cloud storage. Having retrieved the files, clients process the data and update the files in the cloud storage upon completing processing. Provenance is generated while the data are being processed. In order for the system to be able to track the provenance generated during the process, clients must send the provenance data along with the data uploaded to the cloud storage. The system should ensure that the data processed can be monitored and that the associated provenance is properly recorded in the provenance database. A different client will be able to query the provenance database before using the processed data stored in the cloud storage. The clients here are the e-social Scientists who will use the system. The experiment is intended to show that recording and querying provenance can be done in this setting, so that the concepts of the previous schemes can be applied in the cloud environment.
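The scenario can be summarised as a small interface sketch. Everything below (the names and method signatures) is a hypothetical outline of the flow just described; it is not part of ivic or of any existing library.

    // A hypothetical outline of the scenario: process data on a cloud VM,
    // upload the result together with its provenance, and let another client
    // query the provenance before reusing the result.
    public interface CloudProvenanceScenario {

        // Client A: retrieve input data from the cloud storage.
        byte[] retrieve(String objectName);

        // Client A: upload the processed data along with its provenance record.
        void upload(String objectName, byte[] data, String provenanceRecord);

        // Client B: inspect the provenance before deciding to reuse the data.
        String queryProvenance(String objectName);
    }

The point of the sketch is only that the provenance travels with the data; the following sections describe the concrete architecture and database used to realise this.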

Figure 13: Overview of the proposed design, showing how the system will work (clients A and B connect to the VMs running on a hypervisor; the legend distinguishes interaction between VMs, transfer of data provenance, and client interaction in the cloud, alongside the provenance database and cloud storage).

Figure 13 above shows an overview of how the scenario described will work. Clients access the VM in the cloud environment using the standard network protocols for connecting to the internet. The cloud provider (assuming a Service Level Agreement between the client and the cloud provider has been agreed) deploys the VM for the client to access at any time.

4.1.2 System Architecture

The system comprises two main components: a virtual computing environment for a Software as a Service (SaaS) cloud, which provides the software as a service over the internet for the use of e-social Scientists, and the cloud storage for storing the input and output of the processed data. Using the SaaS delivery model makes access to the software simpler: users are able to access the software on demand and transparently through the internet (Zhong et al., 2009). The system used will be based on the Internet-based Virtual Computing Infrastructure (ivic) system.

ivic is a virtual computing environment for both Hardware as a Service (HaaS) and SaaS (Jinpeng et al., 2007). The SaaS part of ivic is called vSaaS (the "v" stands for virtual). It mainly focuses on how to virtualise software so that it is accessible over the network without being redeveloped or modified.

Figure 14: An architecture diagram of the proposed system design (the ivic client above the User Agent, Schedule, Virtual Display (vSpace), Virtual Execution (vProcess), Virtual Resource Management and Virtual Resource layers, alongside the provenance database and data storage).

The ivic vSaaS architecture can be divided into six layers, described as follows:

- User Agent Layer: consists of many user agents that act as intermediaries among the user clients, the virtual display layer and the virtual execution layer.
- Schedule Layer: responsible for scheduling two different kinds of tasks. The first is finding a suitable resource in the back-end resource pool to create a virtual display instance when a user connects to the ivic vSaaS system. The second is finding a suitable resource for execution when a user requests the execution of certain virtual software.
- Virtual Display Layer: tracks user interaction with the presentation windows of distributed software executed on different physical or virtual machines. It provides two functions: virtual display instance management and desktop window merging.
- Virtual Execution Layer: an important layer that provides the virtual execution environment for the execution of virtual software.
- Virtual Resource Management Layer: responsible for the management and organisation of the underlying virtual resources. The functions in this layer allow the software executing in the upper layers to easily and efficiently find the most suitable resources to meet its resource requirements.

- Virtual Resource Layer: holds different types of resources such as compute machines, storage, networks, devices, and software applications. All the resources are combined into the virtual computing environment so that it supports different kinds of traditional and new network applications.

4.1.3 Database Design

In the above design scenario, a provenance database needs to be created for storing the data provenance of the computations and their outputs. A simple database design will be created to show that the system is able to record the provenance. From the scenario above, several entities have been identified:

- Virtual_Machine (VM): stores information about the VM created, such as its name, the operating system (OS) used, and the date and time created.
- Client: stores records of the clients that access the VM for computation.
- Process: stores the details of the process performed, such as the method used and the time taken to complete the job.
- Result: stores the output result generated by the process.

The entities listed above allow relationships to be created between them to show how they are connected. The relationships can be described as follows:

- One client can create one or many VMs to run processes, and a particular VM is created for only that one client.
- One VM can run one or many processes, and a particular process is created by only that one VM.
- One process can produce one or many results, and a particular result is created by only that one process.

The entities and relationships described above allow the creation of an Entity Relationship Model (ERM) diagram, which helps analyse the data requirements in a systematic way and produce a well-designed database. Figure 15 shows the ERM of the provenance database.

Figure 15: ERM diagram for the provenance database (entities Client, VM, Process and Result, linked by the Create, Run and Produce relationships, with the attributes listed in Table 2)

The production of the ERM above allows the entity-table schemes to be created, as shown in Table 2 below.

Entity | Attributes
Client | (Client_ID, Client_Name)
VM | (VM_name, OS_used, Date&time_created, Client_ID)
Process | (Process_ID, Process_method, Completion_time, VM_name)
Result | (Result_ID, Result_output, Process_method)

Table 2: The entity-table schemes for the provenance database

4.2 Implementation

Based on the design drawn up in the previous section, the ivic system will be used as the backend for the cloud infrastructure. The system comprises two virtual machine instances. The first VM acts as the database server, storing the processed data and the provenance data. The second VM acts as the machine that clients access from their own devices and use to run the process. Instances created on ivic run the Debian operating system, a non-commercial, free GNU/Linux distribution with standard UNIX-style commands. Due to limitations, cloud storage will not be used as the means of storing the generated results; a database will be used instead.

4.2.1 Creating the database

The database will be created using a Relational Database Management System (RDBMS); RDBMSs commonly use SQL to manipulate data. The database will be implemented using MySQL as the backend, another open source system freely available for developers to use. The use of SQL queries supports the automatic distribution of data from input to output using the standard select-insert-join queries; if all attribute values in the database are linked, then the other linked data can be transferred along with them (Tan, 2007). Based on the entity-table schemes drawn up, all the tables for the provenance database can be created. In this small-scale experiment, the ID for each entity is set to hold up to three characters only. Varchar is used as the data type for holding characters in order to optimise away trailing spaces when storing the data. The datetime_created field uses the timestamp data type to generate the current date and time when the virtual machine is created, and int is used as the data type for storing the time taken for the process insertion and the result insertion in the recordtimep table. Figure 16 shows an example: the creation of the client table. Further details of all the tables created in the database can be found in Appendix F.

Figure 16: The structure of the client table.

Figure 17: All the tables created in the provenance database.
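For illustration, the tables described above might be created as follows. The statements are shown as they would be issued from Java, the project's implementation language; the column widths beyond the three-character IDs, and the user name and password, are assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // A sketch of creating the provenance tables from Table 2. Column widths
    // beyond the three-character IDs are illustrative assumptions.
    public class CreateProvenanceTables {
        public static void main(String[] args) throws Exception {
            Class.forName("com.mysql.jdbc.Driver"); // driver quoted in Section 4.2.3
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://10.0.13.2:3306/provenance_database", "user", "password");
                 Statement st = conn.createStatement()) {

                st.executeUpdate("CREATE TABLE client (" +
                    "client_id VARCHAR(3) PRIMARY KEY, client_name VARCHAR(50))");

                st.executeUpdate("CREATE TABLE vm (" +
                    "vm_name VARCHAR(50) PRIMARY KEY, os_used VARCHAR(50), " +
                    "datetime_created TIMESTAMP, client_id VARCHAR(3) REFERENCES client(client_id))");

                st.executeUpdate("CREATE TABLE process (" +
                    "process_id VARCHAR(3) PRIMARY KEY, process_method VARCHAR(50), " +
                    "completion_time INT, vm_name VARCHAR(50) REFERENCES vm(vm_name))");

                // Table 2 lists Process_method as the attribute linking a result
                // to the process that produced it.
                st.executeUpdate("CREATE TABLE result (" +
                    "result_id VARCHAR(3) PRIMARY KEY, result_output VARCHAR(255), " +
                    "process_method VARCHAR(50))");
            }
        }
    }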

4.2.2 Accessing the virtual machine

To access the ivic virtual machine, a remote control software application called VNC, which allows viewing of and interaction with the ivic server, will be installed. The software can be downloaded from the VNC website (http://www.realvnc.com/products/download.html). As mentioned in Chapter Two section 2.3, when using a cloud application, clients do not have to worry about how the servers and networks in the cloud are maintained, as they do not own the infrastructure, software or platform in the cloud. Hence the ivic virtual machine can be accessed using its Internet Protocol (IP) address as the identifier for the virtual machine.

Figure 18: Using the VNC viewer to connect to the virtual machine; ":0" is the display assigned to the virtual machine being accessed.

Once the connection has been successfully made, the desktop of the virtual machine is presented, as shown in Figure 19. Access to the desktop allows a client to use the machine from the client's own device without having to be at the physical machine itself.

Figure 19: The virtual machine, ready to be used

4.2.3 Computational software

The software created to run the experiment on the virtual machine generates random numbers from a randomisation seed, based on the number of iterations selected. It acts as the software running on the simple cloud-based data provenance scheme. Besides generating random numbers from a seed, the software also records all the process details, including the result generated and the process completion time, into the database. The database is located on another virtual machine, and the connection between clients and the database virtual machine is made using its IP address. The purpose of this is to test whether the concept of the design can be applied in a cloud environment. Java is the programming language chosen to write the program.

A java-mysql connector is installed as the driver for making connections to the database server. The program calls the java-mysql driver, and the database connection is identified using the IP address of the virtual machine, as stated earlier. Each class calls Class.forName("com.mysql.jdbc.Driver") to load the connection driver and uses "jdbc:mysql://10.0.13.2:3306/provenance_database" as the URL for connecting to the database. Figure 20 shows the class diagram of the system created.

Figure 20: The class diagram of the system created. All the classes connect to the database when performing an operation.

The class recordprocess records the generated result together with the provenance data in the database. The class recordprocessonly records only the generated result, without the provenance data.
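A minimal sketch of what recordprocess might look like is given below. The source code of the actual classes is not reproduced here, so the method body, table columns and values are assumptions consistent with the schema in Table 2.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    // A sketch of the recordprocess behaviour described above: store the
    // generated result together with its provenance (the process details).
    public class RecordProcessSketch {

        public static void record(String processId, String method, String vmName,
                                  int completionTime, String resultId, String output)
                throws Exception {
            Class.forName("com.mysql.jdbc.Driver"); // driver quoted in the text
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://10.0.13.2:3306/provenance_database", "user", "password")) {

                // Provenance: the details of the process that produced the result.
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO process VALUES (?, ?, ?, ?)")) {
                    ps.setString(1, processId);
                    ps.setString(2, method);
                    ps.setInt(3, completionTime);
                    ps.setString(4, vmName);
                    ps.executeUpdate();
                }

                // The result itself, labelled with the method that produced it
                // (Table 2 lists Process_method as the linking attribute).
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO result VALUES (?, ?, ?)")) {
                    ps.setString(1, resultId);
                    ps.setString(2, output);
                    ps.setString(3, method);
                    ps.executeUpdate();
                }
            }
        }
    }

A recordprocessonly variant would simply omit the first insert statement.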

CHAPTER V

5. Evaluation

The goal of the study is to evaluate whether e-social Scientists can store and derive data provenance in a cloud application. Various methods are used in the evaluation process. The performance of the cloud provenance scheme created from the proposed design is evaluated as one of these methods. The provenance schemes surveyed in Chapter Three are also evaluated, to see whether they meet the e-social Science requirements derived during the analysis. The evaluation also discusses whether the provenance schemes surveyed in Chapter Three could run on the system created in Chapter Four.

5.1 Performance evaluation

Using the computational software created, the performance of storing the generated result together with the provenance data is compared with that of storing the result without the provenance data. Performance is measured as the time it takes the software to store and to query the output results in the database. Measuring the response time for storing and querying is one of the important aspects of performance testing when accessing a database (Li and L, 2000).

5.1.1 Recording the result

The first experiment evaluates the performance of recording the generated result together with the other provenance data, such as the process ID, the process method used, and the virtual machine ID. The experiment generates random numbers from a randomisation seed, based on the number of iterations set, using the computational software created in the previous chapter. The generated result is stored in the result entity, and the details of the process conducted are saved in the process entity of the database. The computation also records the time it takes to record the process details and the time it takes to record the generated result. A single call operation was made to record both the result and the provenance details.
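The timing itself can be done with a simple harness such as the one below. This is an assumed reconstruction of how the measurements could be taken, reusing the illustrative RecordProcessSketch class from Chapter Four; the measurement code actually used is not shown in the text.

    // A sketch of timing the two storage paths compared in this section.
    public class TimingHarness {
        private static int nextId = 0;

        public static void main(String[] args) throws Exception {
            for (int iterations = 20; iterations <= 200; iterations += 20) {
                long start = System.currentTimeMillis();
                for (int i = 0; i < iterations; i++) {
                    int id = nextId++;
                    // Store the result WITH provenance (process details + result).
                    RecordProcessSketch.record("p" + id, "randomSeed", "vm1",
                                               0, "r" + id, "42");
                }
                long withProvenance = System.currentTimeMillis() - start;
                System.out.println(iterations + " iterations: "
                                   + withProvenance + " ms with provenance");
                // Repeating the loop with a result-only insert (recordprocessonly)
                // gives the "without provenance" series plotted in Figure 21.
            }
        }
    }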

Figure 21 shows the time taken for storing the result with and without the provenance in a single operation call.

Figure 21: Time taken to record the result with and without the provenance data, based on the number of iterations generated from a single call operation (y-axis: time to store the result in the database, in milliseconds; x-axis: number of iterations generated).

The x-axis of the graph shows the number of iterations used to generate numbers from a randomisation seed, and the y-axis shows the time it takes to store the results in the database, in milliseconds. The graph shows that as the number of iterations progresses from 20 to 200, the time taken to store the result grows, increasing roughly linearly with the number of iterations. The increase is not constant between steps (see, for example, the difference in storage time between 140 and 160 iterations). The graph also shows little difference between the times taken to store the result with and without the provenance: as Table 6 in Appendix G shows, the difference is only 3 or 4 milliseconds for each iteration count. The second experiment evaluates the difference in time taken when storing the result with provenance and when storing the result alone in a separate operation. Figure 22 below shows the comparison of times for recording with and without the provenance data.

Figure 22: Time taken to record the result with and without the provenance data, based on the number of iterations generated, using two different calls (y-axis: time to store the result in the database, in milliseconds; x-axis: number of iterations generated; the result without provenance is stored in a separate operation).

Recording the result with provenance data uses the same operation as in the previous experiment. Surprisingly, storing the result without the provenance for iterations 20 to 100 appears to take longer than storing the result with provenance. Various factors could lead to this; one of them could be the network bandwidth, which can act as the system bottleneck during the process of storing the result (Anderson et al., 2005). The experiment also shows that storing the result with provenance can take nearly double the time of storing the result alone.

5.1.2 Querying the result

A further experiment evaluates the time it takes to retrieve the results that have been stored in the database. The procedure is similar to the previous experiment: the times taken to retrieve the results with and without the provenance data are compared. The operation retrieves the output in stages: the first operation retrieves the 1st to the 20th stored results, the next the 1st to the 40th, and the pattern continues, as in the iterations of the previous experiment, until the 1st to the 200th results are retrieved. The SQL command used to retrieve the results together with the provenance data is SELECT * from client c, vm v, process p, result r where r.resultid <= 120, and SELECT * from result where resultid <= 120 is used to retrieve the results without the provenance data.
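Run from Java, the retrieval timing might look like the sketch below. The queries are the ones quoted above, while the connection details and the timing loop are assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // A sketch of timing the retrieval of stored results, using the SQL
    // statements quoted in the text.
    public class QueryTimingSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://10.0.13.2:3306/provenance_database", "user", "password");
                 Statement st = conn.createStatement()) {

                for (int n = 20; n <= 200; n += 20) {
                    long start = System.currentTimeMillis();
                    // Query quoted in the text for results WITH provenance data.
                    try (ResultSet rs = st.executeQuery(
                            "SELECT * from client c, vm v, process p, result r " +
                            "where r.resultid <= " + n)) {
                        while (rs.next()) { /* consume the rows */ }
                    }
                    System.out.println(n + " results with provenance: "
                                       + (System.currentTimeMillis() - start) + " ms");
                    // The "without provenance" series uses:
                    //   SELECT * from result where resultid <= n
                }
            }
        }
    }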

The x-axis of the plot shows the number of results retrieved, and the y-axis shows the time it takes to retrieve the results, in milliseconds. Figure 23 below shows the difference in the time taken to retrieve the results with and without the provenance data.

Figure 23: Time taken to get the results with and without the provenance data (y-axis: time to query the results from the database, in milliseconds; x-axis: number of results retrieved from the database).

Rather than a linear graph, the time taken to retrieve the results varies. In some cases, retrieving the results with the provenance data takes more than double the time of retrieving them without it (at 20, 40, 60, 80, and 100 results retrieved). The effect on the time taken could be due to the CPU workload of either the client's virtual machine or the database server when the results are queried (Zhu and L, 2000). Zhu and L (2000) have proposed a workload classification for web database systems consisting of different types of queries, as shown in Table 3 below. It shows that a high load on the network occurs when querying a large result set from the database server.

Class | Description | DB server CPU | DB server disk | Web server CPU | Web server disk | Network
1 | Complex query with small result size | High | Low | Low | N/A | Low
2 | Complex query with large result size | High | High | High | N/A | High
3 | Simple query with small result size | Low | Low | Low | N/A | Low
4 | Simple query with large result size | Low | High | High | N/A | High
5 | No query, small file size | N/A | N/A | Low | Low | Low
6 | No query, large file size | N/A | N/A | High | High | High

Table 3: Workload classification in a web database system. Adapted from (Zhu and L, 2000).

5.1.3 Discussion

The above experiments were run on the ivic machine located at the School of Computing, University of Leeds. A connection between the client and the virtual machines was made over the Local Area Network (LAN). The time taken when using a different virtual machine or network type may vary with the service available; using other services to store and retrieve the provenance data together with the result could produce faster execution times. The experiments conducted on the ivic system showed that deriving data provenance in a cloud application can be done. They revealed that information about the process details can be saved separately from the generated result. These details give the e-social Scientist additional information about the generated results, such as the method used to produce them. The time taken to store the result, with or without the provenance data, on the ivic system was very long for a simple computation: storing 10 iterations already took 100,464 milliseconds, or 1.674 minutes, and 200 iterations took about 30.7 minutes. If the experiments were to generate more complex data, the time taken could be longer and clients would have to pay more for the computation usage. That said, the experiments were run on the ivic system, which is non-commercial; the time taken on a different system could differ and could give a better response time than ivic.

5.2 Do the current provenance schemes meet the e-social Science requirements?

This section looks again at the current provenance schemes discussed in Chapter Three and matches them against the provenance requirements needed by the e-social Scientist, to see whether the derived requirements are met. From the requirement analysis conducted, the three main e-social Science provenance requirements derived are:

1) The ability to replicate the results based on the input data.
2) The ability to evaluate the processes based on the metadata.
3) The ability to secure the sensitive data obtained from being accessed by unauthorised users.

As mentioned earlier, this project does not cover security issues, and hence the third point will not be discussed. Table 1 in Chapter Three described the principles for each scheme discussed and made comparisons between the schemes. Table 1 shows that only the Chimera and PASS schemes were able to

replicate the result based on the input data. As for evaluating the data based on the metadata, the schemes that met the requirement were Chimera, ESSW, and PASS. This shows that both the Chimera and PASS schemes support both of the provenance requirements needed by the e-social Scientist. The proposed scheme also supports result replication: the same numbers can be regenerated from the seed that was set. As for evaluating the process based on the metadata, an evaluation can be made by retrieving the provenance data stored in the database: each generated result is labelled with the process ID, which holds the record of the process method and the virtual machine used to run the process. Table 4 below is an extended table that compares the proposed scheme with the other provenance schemes described in Chapter Three.

Criterion | Chimera | ESSW | PASOA | PASS | Proposed Scheme
Domain applied | Physics, Astronomy | Earth Science | Biology | None | None
Storage Repository | VDC / Relational Database | Relational Database | PReServ; Relational Database, File System | Relational DB, File System | Relational Database
Representation Scheme | VDL Annotation | XML Annotation | Annotation | Annotation | Annotation
Provenance Distribution | Queries | Browse | Queries | Queries | Queries
Will it be able to replicate the output result? | Yes | No | No | Yes | Yes
Does it evaluate based on the metadata? | Yes | Yes | No | Yes | Yes
Does it offer securing of the data obtained? | No | No | Proposed | Proposed | No

Table 4: Comparison of the system created together with the provenance schemes discussed in Chapter Three.

5.3 Current provenance schemes and cloud

This section evaluates whether the current provenance systems discussed in Chapter Three can be applied to cloud applications. The evaluation is based on a look at the system infrastructure of each provenance scheme. Table 5 shows the system used by each scheme described in Chapter Three.

Scheme | System used
Chimera | Runs a virtual data system in collaborative environments
ESSW | An application built with Java for both the client and server applications
PASOA | A Java-based web implementation that runs on the Apache Axis library
PASS | A Linux-based system that consists of two portions: the kernel modifications and a set of user-level tools

Table 5: Description of the system used by each scheme

Cloud services provide different services for customers to use (SaaS, PaaS, and IaaS). Cloud services also provide different methods for deploying a service so that it can be shared between different services. This suggests that any of the systems listed in Table 5 could be deployed on a cloud. Of the schemes listed above, Chimera, which runs a virtual data system, is the only one that uses grid technologies. As mentioned in Chapter Two section 2.3, grid computing and cloud computing are two different technologies. Rings et al. (2009) have looked at the opportunities for integrating grid and cloud computing for the Next Generation Network (NGN). Four scenarios were drawn up to illustrate possible convergences of the NGN with grids and clouds. In the first scenario, an NGN application is deployed as an application server that is available via a standard interface, and the interface is accessed by the grid application server. The second scenario adds another subsystem to the NGN service layer to support the provisioning of grid or cloud services, giving access to virtualised grid-enabled cloud resources. The third scenario combines both grid and networking resources in a new architecture that separates the grid service's shared resources (CPU, storage, and network) in order to assign flexible resources to the grid of the NGN system. The final scenario implements NGN functionality using grid-enabled services and cloud resource virtualisation technology to enhance the entire NGN architecture. The study by Rings et al. (2009) shows that grid and cloud technology can be integrated, which indicates that it is possible to use Chimera in a cloud environment.

5.4 Summary

The proposed scheme shows that recording provenance on the cloud is possible. The provenance data, acting as additional information, provides an advantage for other clients when they use the result. Figure 21 also shows that the difference in the time taken to record the result with and without the provenance data is relatively small.

As mentioned in Chapter Two section 2.4, having provenance can establish end users' trust, since it can serve as an indicator of data quality. This ensures the credibility of the data used for other purposes, such as decision-making by the e-social Scientists. The requirement analysis and the survey of the current provenance schemes conducted in Chapter Three show that the current provenance schemes can be applied in a cloud environment, and two of the schemes discussed, as well as the cloud provenance scheme created in Chapter Four, show that the e-social Science requirements have been met.

CHAPTER VI

6. Conclusion

This chapter looks at the overall project evaluation and the problems encountered, and identifies further work for the project.

6.1 Overall project evaluation

This section looks back at the aim of the project and evaluates whether the project met the minimum requirements set at the beginning. The aim of this project was to derive common data provenance requirements for e-social Science applications and research on cloud applications, to help social scientists use authenticated data from the cloud environment by using provenance to obtain the information. The aim has been successfully achieved based on the objectives set:

- An understanding of current e-social Science projects was developed in Chapter Two.
- An investigation of cloud computing and data provenance was performed in Chapter Two.
- A survey of provenance schemes was carried out in Chapter Three.
- The emergent functionality of provenance and cloud was identified in Chapter Three.
- A cloud provenance scheme for e-social Science was developed in Chapter Four.
- An experiment to derive data provenance on the cloud was conducted in Chapter Five.
- The experiment, as well as the whole project process, was evaluated in Chapter Five.

The minimum requirements of the project were also met:

- An analysis of provenance requirements for e-social Science projects was conducted, as shown in 3.3.
- Schemes for e-social Science data provenance in cloud computing were developed, as shown in 4.6 and Chapter Four.

The minimum requirements were also exceeded by the following activities:

- A requirement analysis for provenance and provenance on the cloud was conducted, covering additional requirements needed by the e-social Scientist.

- The identification, during the evaluation process, of existing provenance schemes that meet the e-social Science requirements, which would enable e-social Scientists to adopt those schemes for future use.

6.2 Problems encountered

The biggest problem in this project arose during the development process. The challenge started during the July 2 phase, when problems implementing a cloud-based data provenance scheme for e-social Science arose. From the research conducted, not many cloud provenance schemes have been developed yet. Dr. Venters, who was working on provenance during the period of the project, was not aware of any cloud provenance work. Consequently, it was not easy to develop the provenance scheme on the cloud. The first problem encountered was in installing Support for Provenance Auditing in Distributed Environments (SPADE) (more information at http://code.google.com/p/dataprovenance/wiki/architecture), which runs in a Linux environment and supports the Neo4j graph database (Neo4j, 2011) and the H2 SQL database (H2, 2005). In order to use the SPADE system, an understanding of either the Neo4j or the H2 database had to be gained first. The second problem was with the PASS system itself. Having obtained the installer from the creators of the system, difficulties in installing and configuring it arose due to its complexity and a lack of the technical skills needed to deploy it in a Linux environment. Appendix H shows the email interaction to obtain the PASS system. These problems led to an extension of the development time, as shown in the revised chart in Appendix D: the time allocated for implementation was extended to allow for the development of the test system. The problems above were overcome by recording and querying the data provenance manually in a MySQL database running on the ivic system, as described in Chapter Four, instead of using the currently available schemes in the cloud environment.

6.3 Future Work

The immediate next step would be to adopt, for the use of e-social Scientists in a cloud environment, the schemes that meet the e-social Science requirements, namely the Chimera or PASS schemes identified in Chapter Three. A commercial cloud service provider such as Amazon EC2 or Microsoft Azure could be used to test whether the storing and querying of provenance data can be conducted there. Experiments on the performance of different database approaches for storing provenance data on the cloud are also encouraged, in order to identify the right database for storing provenance.

Further studies on provenance data security should be performed, to prevent unauthorised access to the data.

REFERENCES

ACAR, U., BUNEMAN, P., CHENEY, J., BUSSCHE, J. V. D., KWASNIKOWSKA, N. & VANSUMMEREN, S. 2010. A graph model of data and workflow provenance. Proceedings of the 2nd conference on Theory and practice of provenance. San Jose, California: USENIX Association.

ADAIR, B. 2009. Cloud Computing: Technology Overview.

AMRHEIN, D., ANDERSON, P., ANDRADE, A. D., ARMSTRONG, J., et al. 2009. Cloud Computing Use Cases White Paper, version 2.0.

ANDERSON, D. P., KORPELA, E. & WALTON, R. 2005. High-performance task distribution for volunteer computing. In: e-Science and Grid Computing, 2005. First International Conference on, 1-1 July 2005, 8 pp.-203.

ANNIS, J., ZHAO, Y., VOECKLER, J., WILDE, M., KENT, S. & FOSTER, I. 2002. Applying Chimera virtual data concepts to cluster finding in the Sloan Sky Survey. Proceedings of Supercomputing 2002 (SC2002). ACM Press.

ARBREE, A., AVERY, P., BOURILKOV, D., CAVANAUGH, R., KATAGERI, S., RODRIGUEZ, J., GRAHAM, G., VÖCKLER, J. & WILDE, M. 2003. Virtual Data in CMS Production. CHEP 2003. California.

ARMBRUST, M., FOX, A., GRIFFITH, R., JOSEPH, A. D., KATZ, R. H., KONWINSKI, A., LEE, G., PATTERSON, D. A., RABKIN, A., STOICA, I. & ZAHARIA, M. 2009. Above the Clouds: A Berkeley View of Cloud Computing. UC Berkeley Reliable Adaptive Distributed Systems Laboratory.

BATTY, M. 2006. GeoVUE: Geographic Virtual Urban Environments [Online]. Research Methods Festival, St. Catherine's College, Oxford, Tuesday 18th July 2006. Available: http://www.ncess.ac.uk/research/nodes/geovue/presentations/20060718-batty-GeoVUE.pdf [Accessed 12th June 2011].

BIRKIN, M. & TOWNEND, P. 2009. GENESIS: Generative e-social Science. In: LEEDS, U. O. (ed.). The White Rose Grid e-Science Centre.

BIRKIN, M., TURNER, A. & WU, B. 2006. A Synthetic Demographic Model of the UK Population: Method, Progress and Problems.

BOSE, R. & FREW, J. 2004. Composing Lineage Metadata with XML for Custom Satellite-Derived Data Products. SSDBM, 275-284.

DAVE, P. 2009. Introduction to Cloud Computing [Online]. dotnetslackers.com. Available: http://dotnetslackers.com/articles/sql/introduction-to-cloud-computing.aspx [Accessed 14th June 2011].

EPSRC. 2009. Introduction to e-Science programme [Online]. Engineering and Physical Sciences Research Council. Available: http://www.epsrc.ac.uk/about/progs/rii/escience/pages/intro.aspx [Accessed 5th May 2011].

ESCIENCE-GRID. What is e-Science [Online]. Available: http://www.escience-grid.org.uk/what-escience.html [Accessed 7th June 2011].

FOSTER, I. 2002. What is the Grid? A Three Point Checklist. Argonne National Laboratory & University of Chicago.

FOSTER, I., KESSELMAN, C. & TUECKE, S. 2001. The Anatomy of the Grid: Enabling Scalable Virtual Organizations.

FOSTER, I., VÖCKLER, J., WILDE, M. & ZHAO, Y. 2002. Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation. In Proceedings of the 14th Conference on Scientific and Statistical Database Management.

FREW, J. & BOSE, R. 2001. Earth System Science Workbench: A Data Management Infrastructure for Earth Science Products. In Proceedings of the 13th International Conference on Scientific and Statistical Database Management. IEEE Computer Society.

GROTH, P., JIANG, S., MILES, S., MUNROE, S., TAN, V., TSASAKOU, S. & MOREAU, L. 2006. An Architecture for Provenance Systems. Enabling and Supporting Provenance in Grids for Complex Problems.

GROTH, P., LUCK, M. & MOREAU, L. 2004. A protocol for recording provenance in service-oriented Grids. In Proceedings of the 8th International Conference on Principles of Distributed Systems (OPODIS 04).

GROTH, P., MILES, S., FANG, W., WONG, S. C., ZAUNER, K.-P. & MOREAU, L. 2005a. Recording and Using Provenance in a Protein Compressibility Experiment. In Proceedings of the 14th IEEE International Symposium on High Performance Distributed Computing (HPDC 05).

GROTH, P., MILES, S. & MOREAU, L. 2005b. PReServ: Provenance Recording for Services. In Proceedings of the UK OST e-Science second All Hands Meeting 2005 (AHM 05). Nottingham, UK.

GROTH, P. & MOREAU, L. 2009. Recording Process Documentation for Provenance. IEEE Transactions on Parallel and Distributed Systems.

GROTH, P. T. 2008. A Distributed Algorithm for Determining the Provenance of Data. Proceedings of the 2008 Fourth IEEE International Conference on eScience. IEEE Computer Society.

H2. 2005. H2 Database Engine [Online]. Available: http://www.h2database.com/html/main.html [Accessed 21st August 2011].

HALFPENNY, P. & PROCTER, R. 2010. The e-Social Science Research Agenda. The Royal Society, 368, 3761-3778.

ILLSLEY, M. 2011. What is e-Science [Online]. Science & Technology Facilities Council (STFC). Available: http://www.stfc.ac.uk/our+research/4570.aspx [Accessed 7th June 2011].

JINPENG, H., QIN, L. & CHUN-MING, H. 2007. CIVIC: a Hypervisor based Virtual Computing Environment. ICPP 2007, HotMP2P Workshop.

KIDWELL, K. B. 1998. NOAA Polar Orbiter Data User's Guide [Online]. U.S. Department of Commerce. Available: http://www.ncdc.noaa.gov/oa/pod-guide/ncdc/docs/podug/index.htm [Accessed 4th August 2011].

KROEGER, T. M. 2001. The Linux Kernel's VFS Layer [Online]. usenix.org. Available: http://www.usenix.org/event/usenix01/full_papers/kroeger/kroeger_html/node8.html [Accessed 5th August 2011].

LAMBERT, P., GAYLE, V., BOWES, A., MAXWELL, M., BELL, D., TURNER, K., JONES, S. & SINNOTT, R. 2008. DAMES: Data Management Through e-Social Science [Online]. DAMES. Available: http://www.dames.org.uk/docs/node/dames_slides_16jan2008.ppt [Accessed 10th June 2011].

LI, Y. & L, K. 2000. Performance Issues of a Web Database. Proceedings of the 11th International Conference on Database and Expert Systems Applications. London, UK: Springer-Verlag.

MELL, P. & GRANCE, T. 2009. The NIST Definition of Cloud Computing. National Institute of Standards and Technology, Information Technology Laboratory.

MERC. NCeSS Nodes [Online]. Manchester eResearch Centre. Available: http://www.merc.ac.uk/?q=node/686 [Accessed 8th May 2011].

MOREAU, L. 2010. The Foundations for Provenance on the Web. Foundations and Trends in Web Science, 2, 99-241.

MOREAU, L., CLIFFORD, B., FREIRE, J., FUTRELLE, J., GIL, Y., GROTH, P., KWASNIKOWSKA, N., MILES, S., MISSIER, P., MYERS, J., PLALE, B., SIMMHAN, Y., STEPHAN, E. & BUSSCHE, J. V. D. 2011a. The Open Provenance Model Core Specification (v1.1). Future Generation Computer Systems, 27, 743-756.

MOREAU, L., CLIFFORD, B., FREIRE, J., FUTRELLE, J., GIL, Y., GROTH, P., KWASNIKOWSKA, N., MILES, S., MISSIER, P., MYERS, J., PLALE, B., SIMMHAN, Y., STEPHAN, E. & BUSSCHE, J. V. D. 2011b. The Open Provenance Model Core Specification (v1.1). Future Generation Computer Systems, 27, 743-756.

MUNISWAMY-REDDY, K.-K., HOLLAND, D. A., BRAUN, U. & SELTZER, M. 2006. Provenance-aware storage systems. Proceedings of the annual conference on USENIX '06 Annual Technical Conference. Berkeley, CA, USA: USENIX Association.

MUNISWAMY-REDDY, K.-K., MACKO, P. & SELTZER, M. 2010. Provenance for the Cloud. 8th USENIX Conference on File and Storage Technologies.

MUNISWAMY-REDDY, K.-K. & SELTZER, M. 2009. Provenance as First Class Cloud Data. 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS'09).

MYERSON, J. M. 2009. Cloud computing versus grid computing: Service types, similarities and differences, and things to consider. IBM developerWorks.

NCESS. 2009. About e-Social Science [Online]. Available: http://www.ncess.ac.uk/about_ess/ [Accessed 5th May 2011].

NEO4J. 2011. Neo4j: The Graph Database [Online]. Available: http://neo4j.org/ [Accessed 21st August 2011].

PHILIP, L., CHORLEY, A., FARRINGTON, J. & EDWARDS, P. 2007. Data Provenance, Evidence-Based Policy Assessment, and e-Social Science.

PROCTER, R. e-Social Science and Mixed Methods [Online]. National Centre for e-Social Science. Available: http://www.ccsr.ac.uk/methods/events/mixed/documents/esocialscienceandmixedmethods.doc [Accessed 7th June 2011].

RAU, D. & FEAR, K. 2011. Provenance, End-User Trust and Reuse: An Empirical Investigation. Proceedings of TAPP'11, 3rd USENIX Workshop on the Theory and Practice of Provenance. USENIX.

RINGS, T., CARYER, G., GALLOP, J., GRABOWSKI, J., KOVACIKOVA, T., SCHULZ, S. & STOKES-REES, I. 2009. Grid and Cloud Computing: Opportunities for Integration with the Next Generation Network. Journal of Grid Computing, 7, 375-393.

ROBISON, W. J. 2010. Free at What Cost?: Cloud Computing Privacy Under the Stored Communications Act. The Georgetown Law Journal, 98, 1169-1232.

RODDEN, T., CRABTREE, A., GREENHALGH, C., BENFORD, S., ADOLPHS, R. C. S., O'MALLEY, C., CLARKE, D. & AINSWORTH, S. 2008. Report of Research Conducted during the 1st Phase of the NCeSS DReSS Research Node. Nottingham: University of Nottingham.

SCHIFF, J. 2010. Grid Computing and the Future of Cloud Computing [Online]. enterprisestorageforum.com. Available: http://www.enterprisestorageforum.com/outsourcing/features/article.php/3859956/grid-Computing-and-the-Future-of-Cloud-Computing.htm [Accessed 16th June 2011].

SCOTT, S. V. & VENTERS, W. 2006. The Practice of e-Science and e-Social Science: Method, Theory and Matter. Department of Management, Information Systems Group, London School of Economics and Political Science.

SELTZER, M., MUNISWAMY-REDDY, K.-K., HOLLAND, D. A., BRAUN, U. & LEDLIE, J. 2005. PASS: Provenance Aware Storage Systems (Poster). Harvard Industrial Partnership (HIP).

SIMMHAN, Y. L., PLALE, B. & GANNON, D. 2005. A Survey of Data Provenance Techniques.

SOUILAH, I., FRANCALANZA, A. & SASSONE, V. 2009. A formal model of provenance in distributed systems. Proceedings of TAPP'09, First Workshop on the Theory and Practice of Provenance. USENIX.

STEED, A. 2006. Large Scale Visualisation in the GeoVUE Project [Online]. Available: http://www.ncess.ac.uk/research/nodes/geovue/presentations/20060628-steed-GeoVUE.pdf [Accessed 12th June 2011].
TAN, W.-C. 2007. Provenance in Databases: Past, Current, and Future.
TAYLOR, J. Defining e-Science [Online]. National e-Science Centre. Available: http://www.nesc.ac.uk/nesc/define.html [Accessed 7th June 2011].
TOWNEND, P., XU, J., BIRKIN, M., TURNER, A. & WU, B. 2008. Modelling and Simulation for e-Social Science through the Use of Service-Orientation and Web 2.0 Technologies.
TREASURY, H. M. 2003. The Green Book: Appraisal and Evaluation in Central Government [Online]. London: HM Treasury. Available: http://www.hm-treasury.gov.uk/d/green_book_complete.pdf [Accessed 29th June 2011].
TURNER, K. J., TAN, K. L. L., BLUM, J. M., WARNER, G. C., JONES, S. B. & LAMBERT, P. S. 2009. Managing Data in e-Social Science. International Conference on Networks, 8th, 214-219.
INNOCENTE, V., SILVESTRIS, L. & STICKLAND, D. 2001. CMS Software Architecture: Software Framework, Services and Persistency in High Level Trigger, Reconstruction and Analysis. Computer Physics Communications, 140, 31-44.
VECCHIOLA, C., PANDEY, S. & BUYYA, R. 2009. High-Performance Cloud Computing: A View of Scientific Applications. ISPAN '09: Proceedings of the 2009 10th International Symposium on Pervasive Systems, Algorithms, and Networks. Kaohsiung.
W3C. 2005. What is Provenance [Online]. W3C. Available: http://www.w3.org/2005/incubator/prov/wiki/what_is_provenance [Accessed 14th June 2011].
WOOLGAR, S. 2004. Social Shaping Perspectives on e-Science and e-Social Science: the case for research support. A consultative study for the Economic and Social Research Council (ESRC).
XIAOMING LI, J. J. H. Z. 2006. Let Social Sciences Ride on the IT Bullet Train. Communications of the Chinese Computing Federation, 2, 43-46.
YANG, Y., OSMOND, A., WEAL, M., WILLS, G., ROURE, D. D., JOSEPH, J. & YARDLEY, L. 2009. LifeGuide: An Infrastructure for Empowering Behavioural Intervention Research. UK e-Science All Hands Meeting 2009.
ZHONG, L., WO, T., LI, J., LI, B. & HUAI, J. 2009. vSaaS: A Virtual Software-as-a-Service Architecture for the Cloud Computing Environment. Poster at the 5th IEEE International Conference on e-Science.
ZHU, Y. & L, K. 2000. Performance Analysis of Web Database Systems. Proceedings of the 11th International Conference on Database and Expert Systems Applications. Springer-Verlag.

Appendix A - Personal Project Reflection

This project presented a very challenging and steep learning curve for me. The main experience gained was knowledge of the importance of data provenance in cloud environments for e-social Scientists. The project gave me a detailed insight into each topic (data provenance, cloud computing, and e-social Science) and showed how they are interrelated, especially given that cloud computing can be applied to almost anything.

As for the requirements analysis, I discovered that I should have interviewed at least three e-social Scientists so that more provenance requirements could be derived. This would have given me a better understanding of the practitioners' point of view.

Time management was also one of the most important aspects of completing this MSc project on time. Even though full-time work on the project officially started after the May exams, I would recommend starting work immediately after the project title is approved by the supervisor, to allow sufficient time for the literature research. Despite there being plenty of time during the summer, problems can arise during any of the project phases. The ability to change the scheduled plan and adapt to a new one was one of the strengths that allowed me to complete this project as scheduled. Having had to work part time as well as study, I discovered that time management is key, and my ability to follow the work timetable established at the beginning of the project was a big help.

Another lesson learned was the critical thinking needed to derive useful data from the information gathered. Each of the topics has a vast amount of literature, and careful selection is crucial to ensure that the right information is gathered. For example, the literature on provenance alone is already huge, and some of the information gathered may not be applicable to the research conducted, as it may not relate to cloud computing and/or e-social Science. This is another reason to start early, preferably before the May exam period.

The weekly meetings with my supervisor were the most important of all. Besides helping to solve problems as they arose, the supervisor was in a position to provide feedback on whether the project was moving in the right direction. My supervisor was also a great motivator who added to the enjoyment of working on this project.

Overall, I found this project very enjoyable, and it was a great experience to be able to contribute to e-social Science in the area of data provenance for cloud applications.

Appendix B - Contributions to the Project

The iVIC system, which runs on the School of Computing Local Area Network, was set up by Peter Garraghan, a PhD student in the Distributed Systems and Services group.

Appendix C - Interim Report (attached as hard copy)

Data Provenance for e-social Science Cloud Applications, 2011

Appendix D - Gantt Chart

Figure 24: Initial schedule for completing the project
Figure 25: Revised schedule for completing the project

Appendix E - Interview Session with Andy Turner

Me: The project I am doing now is about data provenance for e-social Science on the cloud, and at the moment I am new to this topic. That is why I am trying to find out what the requirements are, especially for social scientists: when they need access to data from a resource, what are the important things they look at? I need to perform a requirements analysis, so I need to know what e-social Scientists need.

Andy: In social science there are lots of data about people. Some of it is open data, so everybody has access to it, and some of it is not so open, down to classified data that is only available to a single research project and/or stored in an anonymised form. Almost immediately after the data have been collected you lose touch with who it is really attributable to, so if you are looking at the provenance of information and where it comes from, sometimes that is intentionally lost within an organisation. Lots of data are collected about people in social surveys, and one of the most important is the Census. In the UK it is collected once every ten years. The most recently collected one is being processed at the moment and will be available for research in various output data forms, which will follow on from previous research on it. After about 100 years the Census generally becomes available in a less anonymised form: you have full names and addresses, so you are able to link things better. One hundred years is chosen because by then most of the people recorded will no longer be alive, and those who are would have been very young when it was collected, although it does still affect some people to some extent. For some research now, it is important to be able to link those Census records with other databases. For example, you might have a council registry; from that you will have quite a good record of who is at what stage from what the council knows, and that can be shared across health service providers. So you can have public NHS providers and private providers all sharing a common registry of who has cancer and what stage they are at. You can then link that with the population data, which contains everybody, to get more detailed information about those people, and about the rest of the population.

Me: So it is not something from just one source, then? You have to go from one source to another to get the full information? Is that what you meant by linking data from the Census to the council?

Andy: If you were to go from the council registry data to the Census records for those people, you need something to link them. Usually it is a name and an address for our Census in the UK. In other parts of the world, everybody is given a unique code that stays with them, and they can link Census records through time using that code, and link other government datasets with it. Data that has been released into a certain domain loses that identifier, but the identifier is still kept within the domain, so for survey research there might be a subset of the data where they still have the identifier that links the records.

Me: So as a social scientist, would that actually be one of the main issues in terms of data security?

Andy: Data security, yeah. If you have a cloud instance running and you populate it with some data, and then you want to share that across, let's say, another cloud provider, you can move that data out. What happens if other people access that data? That is the big issue: one thing is exposing the data, and another is the data then being used by someone for an unauthorised purpose. I am well aware that cloud infrastructures are used by commercial organisations that hold very sensitive data, like your bank account and so on. They use it and secure it, and that is not an issue, but knowing about the security issues is key. That is why I asked you to talk to Junaid.

Me: That's the thing. I have realised the project I am doing is very big and could be split into more detailed sections. For example, provenance and cloud are two separate things, and for the cloud itself you can look at security, reliability and so on. I have been told by Paul that in my case I should look at it generally, because I have found that there are not many papers on provenance for clouds. So I am looking at how cloud applications can help e-social Scientists.

Andy: Okay. I have various colleagues doing work on simulation models. There are social science simulation models that simulate cities or populations, modelling the movement or changing characteristics of a population over time, to some extent for forecasting, so something random is generated to drive change over time. Now, replicability of results is critical at times. If you know that on average your results tend to a trend result, then being able to reproduce the trend result is one thing of interest. But I would argue that for debugging and other reasons you also want to be able to reproduce the exact same result from a given set of metadata and input. It might not be the trend that you are interested in but an extreme result, and perhaps you did not output enough data about that simulation when you ran it the first time; you now want to analyse it and run it again with more diagnostics. For example, you could be evacuating cities under different random scenarios of transport changes, and under one scenario hardly anybody gets out. If everybody gets out very slowly, you want to know why: what was the blockage, what went wrong? If you cannot reproduce the run to get back to that point, that is an issue. So replicability is key for your input data. That will involve setting your pseudo-random sequence seeds sensibly and knowing what is doing what. It is difficult to take advantage of cloud resources when you are not controlling these things, but really you want to be controlling them. Then at the end you want to archive what has happened, to some extent. Sometimes you go from a reasonably small description to a large expanded database, so large in fact that you would rather remove it and recreate the result somewhere else: send the program and the input data and blow it out again whenever.

Some of the work we are doing at the moment is running a model a number of times and then choosing the result where the probability of the simulation outcome is similar to the probability that we input; that is the one we look at as the trend result. So we might run the same program in a distributed fashion as ten different instances, evaluate each of those instances against the input, return the evaluations, and then say: right, this one is the best. The rest are brought up to speed, and the one with the best result is used for the next step. The runs we are looking at now proceed in yearly steps, each of which involves daily time steps. We run for a year, see which instance has the best result, slightly change the mortality and fertility probabilities for the next year, and run again. It is a sort of scatter-gather parallel processing which, if you cannot reproduce the same result, involves moving data from one place to another. There is always a trade-off between storage, transfer and processing. Things can grow, and at some point, many iterations down the line, if you are starting from scratch somewhere, it might actually be cheaper to compress and send the data, and at that point just run the simulation from scratch to the same stage. So the metadata of how long it takes, what resources were required, and what was actually used as a minimum can all be wrapped up, so that the self-evaluation can go on: this is what it takes to get something else to this point. Do a back-of-the-envelope calculation of how long it will take to get there, have a little play with data transfer to see what the bottleneck is and what the bandwidth is from A to B, and decide whether to transfer the data or recreate the result.

The cloud provides us with resources that do not fit the general resource availability of grids, because grid nodes tend to have about 2 GB of memory and quite a lot of cores. If your problem does not fit into that, and what you really need is a 64 GB machine, that is not a resource that is so readily available, whereas with a cloud you can specify to some extent what you actually need and maybe get it. First, you should know what the differences between grid and cloud are. One of the key differences is that grids are all about crossing organisational boundaries while still being able to interoperate and run a program in a distributed fashion, and I am not sure that kind of inter-cloud working exists yet: each cloud is its own thing, and it does not really allow many transfers between clouds.

Me: I am about to get access to the test cloud account, so I am not really sure yet what I am allowed to do on it.

Andy: So this is the test cloud in the School of Computing?

Me: School of Computing, yeah.

Andy: There are lots of commercial clouds out there. If you look at the data sections of their terms, you can see who is responsible for what you upload. When we looked at it a year ago it basically said: upload the data at your own risk; anybody can do whatever they want with the data that you upload. That is an issue. Data security is a big issue for social science.

Me: Would there be any difference if we made the cloud private, so that other people would not be able to access the data?

Andy: Yeah, if they provide some data security, some guarantee that a third party cannot access your data, that's fine.

Me: Maybe that is an assumption I need to make, by saying the security is there.

Andy: I don't know how it will work with these.

Me: My initial thought was that the minimum I would try to cover is the general picture of provenance and the cloud themselves, with e-social Science as the context for how people can use those technologies. For now, I thought I could get a dataset, not a very big one, just something to experiment with, then try to run it and test it and see how to get the provenance from the cloud, and whether the provenance can detect if something has been changed or not. That was my initial starting point. Then, if it can be done on a small scale, perhaps it can be expanded in the future.

Andy: Sometimes the model we are running has all the data staged as a static thing: start with this data, process it, get the result, archive it in some way, and a separate process will draw on those results. Another set of models we have pulls data from a live feed. The model can keep a record of the data that has been pulled and provide that as metadata or provenance for recreating the result, but if you run the same program again it might pull slightly different data. For example, if you are pulling OpenStreetMap roads and doing routing from A to B using the road network, and you run the model a year after you previously read the data, the road network will have changed or the data will have been updated, so the routing will be slightly different and you will get a different result. To capture the full provenance of the roads, the cache would have to store a huge amount of data. It all depends on whether the data you are pulling in might be dynamically changing; if it is, that is another issue.
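A key technical requirement from this interview is that simulation runs must be exactly reproducible from their recorded inputs, which means recording the pseudo-random seed as part of the provenance. The following minimal sketch illustrates the idea only; it is not code from this project, and `run_simulation` is a deliberately trivial stand-in for the kinds of population models described above.

```python
import json
import random
import time

def run_simulation(seed: int, n_steps: int) -> float:
    # Toy stand-in for a stochastic social-science model: the only point
    # being made is the explicit, recorded seed.
    rng = random.Random(seed)
    state = 0.0
    for _ in range(n_steps):
        state += rng.gauss(0, 1)  # one random "daily" update
    return state

def run_with_provenance(seed: int, n_steps: int):
    # Run the model and return the result together with a provenance
    # record sufficient to replay the exact run, not just its trend.
    started = time.time()
    result = run_simulation(seed, n_steps)
    provenance = {
        "seed": seed,
        "n_steps": n_steps,
        "started_at": started,
        "duration_s": time.time() - started,
    }
    return result, provenance

r1, prov = run_with_provenance(seed=42, n_steps=365)
r2, _ = run_with_provenance(prov["seed"], prov["n_steps"])
assert r1 == r2  # identical seed and inputs give the identical result
print(json.dumps(prov, indent=2))
```

Replaying with the recorded seed reproduces an extreme run exactly, which is what makes the after-the-fact diagnosis Andy describes (for example, of a failed evacuation scenario) possible.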

Appendix F - Database Tables Created

Figure 26: The structure of the client table.
Figure 27: The structure of the virtual machine table.
Figure 28: The structure of the process table.
Figure 29: The structure of the result table.
Figure 30: The structure of the table recording the time taken to store data.
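The figures above are not reproduced here, so the exact column definitions are not shown. As a rough illustration of how the five tables might relate, the following sqlite3 sketch is offered; only the table names come from this appendix, and every column name and type is an assumption.

```python
import sqlite3

# Hypothetical reconstruction of the five tables named in Figures 26-30.
SCHEMA = """
CREATE TABLE client (
    client_id  INTEGER PRIMARY KEY,
    name       TEXT
);
CREATE TABLE virtual_machine (
    vm_id      INTEGER PRIMARY KEY,
    client_id  INTEGER REFERENCES client(client_id),
    hostname   TEXT
);
CREATE TABLE process (
    process_id INTEGER PRIMARY KEY,
    vm_id      INTEGER REFERENCES virtual_machine(vm_id),
    iterations INTEGER,
    started_at TEXT
);
CREATE TABLE result (
    result_id  INTEGER PRIMARY KEY,
    process_id INTEGER REFERENCES process(process_id),
    value      TEXT
);
CREATE TABLE store_timing (
    timing_id  INTEGER PRIMARY KEY,
    process_id INTEGER REFERENCES process(process_id),
    millis     INTEGER
);
"""

conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
conn.executescript(SCHEMA)
print("tables:", [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")])
```

The design point is that each result row points back through process and virtual machine to the client, so the provenance of any stored result can be traced with simple joins.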

Appendix G - Experiment Results

Process ID   No. of iterations   Time to record the process (provenance data), ms   Time to record the result, ms   Total time, ms
1            20                  4                                                  100460                          100464
2            40                  3                                                  200672                          200675
3            60                  3                                                  301235                          301238
4            80                  3                                                  401620                          401623
5            100                 4                                                  522098                          522102
6            120                 3                                                  707338                          707341
7            140                 3                                                  807729                          807732
8            160                 3                                                  1266830                         1266833
9            180                 4                                                  1442010                         1442014
10           200                 4                                                  1841973                         1841977

Table 6: Time taken to record the results together with the provenance data in a single operation call

No. of iterations   Time to record the result, ms
20                  196710
40                  365660
60                  598214
80                  799216
100                 891966
120                 602586
140                 702493
160                 802735
180                 903038
200                 1004167

Table 7: Time taken to record the results only, in a separate operation call

Number of results   Time taken, ms
20                  9
40                  10
60                  10
80                  10
100                 10
120                 13
140                 10
160                 11
180                 11
200                 12

Table 8: Time taken to query results with provenance data

Number of results   Time taken, ms
20                  3
40                  3
60                  4
80                  4
100                 4
120                 8
140                 7
160                 9
180                 9
200                 9

Table 9: Time taken to query results without provenance data
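Measurements of this kind are typically taken by timestamping either side of each database call. A minimal sketch of that measurement style is shown below, using an in-memory SQLite database; the table and column names are hypothetical, since the project's actual code and schema are not reproduced in this appendix.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE process (process_id INTEGER PRIMARY KEY, iterations INTEGER);
CREATE TABLE result  (result_id  INTEGER PRIMARY KEY,
                      process_id INTEGER REFERENCES process(process_id),
                      value      REAL);
""")

def record_with_provenance(iterations, values):
    # Store the provenance record and its results in one operation call,
    # timing each part separately in the style of Table 6.
    t0 = time.perf_counter()
    cur = conn.execute("INSERT INTO process (iterations) VALUES (?)",
                       (iterations,))
    pid = cur.lastrowid
    t1 = time.perf_counter()
    conn.executemany("INSERT INTO result (process_id, value) VALUES (?, ?)",
                     [(pid, v) for v in values])
    conn.commit()
    t2 = time.perf_counter()
    return (t1 - t0) * 1000.0, (t2 - t1) * 1000.0  # both in milliseconds

proc_ms, res_ms = record_with_provenance(20, [float(i) for i in range(20)])
print(f"process record: {proc_ms:.2f} ms, results: {res_ms:.2f} ms")

# Query with provenance (a join, cf. Table 8) vs. without (cf. Table 9).
t0 = time.perf_counter()
conn.execute("SELECT r.value, p.iterations FROM result r "
             "JOIN process p USING (process_id)").fetchall()
print(f"query with provenance: {(time.perf_counter() - t0) * 1000.0:.2f} ms")

t0 = time.perf_counter()
conn.execute("SELECT value FROM result").fetchall()
print(f"query without provenance: {(time.perf_counter() - t0) * 1000.0:.2f} ms")
```

The pattern mirrors what Tables 6-9 report: recording provenance adds a small fixed cost per process, and querying with provenance costs an extra join over querying the results alone.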

Appendix H - Email Interaction to Get Hold of the PASS System

-----Original Message-----
From: Peter Macko [mailto:pmacko@eecs.harvard.edu]
Sent: 28 July 2011 04:14
To: Che Wan Amiruddin Samsudin
Cc: PASS; Kiran-Kumar Muniswamy-Reddy
Subject: Re: [PASS] Re: PASS

Hi,

Thank you for your interest in PASS! A snapshot of our code is available at:
http://www.eecs.harvard.edu/~syrah/pass/download/v2/

The user/password are: *******

We are currently working on an updated version of PASS, which should be more stable.

Best,
-Peter

On Jul 26, 2011, at 4:09 PM, Kiran-Kumar Muniswamy-Reddy wrote:
> Redirecting you to the PASS group :-)
>
> On Tue, Jul 26, 2011 at 4:22 AM, Che Wan Amiruddin Samsudin
> <sc10cwac@leeds.ac.uk> wrote:
>> Dear Kiran,
>>
>> I am Amir, a student from the University of Leeds, UK, currently studying
>> MSc Computing and Management. I am now working on my thesis, titled "Data
>> Provenance for e-social Science Cloud Application". Having done the
>> research, I found that you have developed an automated provenance
>> recording system for the cloud called PASS. How can I get hold of the
>> PASS application? Is there a formal procedure that I need to follow in
>> order to obtain PASS and run it for the purposes of my project?
>>
>> Thanks
>>
>> ----
>> Che Wan Amiruddin Chek Wan Samsudin (Amir)
>> MSc Computing & Management
>> University of Leeds