Big Data for Security and Resilience: Challenges and Opportunities for the Next Generation of Policy-Makers
Proceedings of the Conference, March 2014
Edited by Jennifer Cole
STFC/RUSI Conference Series No. 4
Conference Report, October 2014
A joint publication of RUSI and the STFC

Royal United Services Institute for Defence and Security Studies, Whitehall, London SW1A 2ET, UK
Science and Technology Facilities Council, Polaris House, North Star Avenue, Swindon SN2 1SZ

Editor: Jennifer Cole
Sub-editor: Susannah Wright

Individual authors retain copyright of their contributions to this publication. This report may be copied and electronically transmitted freely. It may not be reproduced in a different form without the prior permission of RUSI and the STFC.
Contents

Foreword, Bryan Edwards (v)
Introduction: Machine Learning for Big Data, Alex Gammerman and Jennifer Cole (1)
I. The National Archives, Big Data and Security: Why Dusty Documents Really Matter, Tim Gollins (5)
II. Trends in Big Data: Key Challenges for Skills, Harvey Lewis (14)
III. Big Data and Financial Transactions: Providing New Means of Analysis, Gregory Mandoli (18)
IV. Characteristics of Terrorist Finance Networks: The Human Element, Neil Bennett (28)
V. Terrorism and Political Risk Modelling, Mark Lynch (32)
VI. Intelligent Use of Electronic Data to Enhance Public Health Surveillance, Edward Velasco (38)
VII. The Raxibacumab Experience: The First Novel Product Approved Under the US Food and Drug Administration Animal Rule, Chia-Wei Tsai (47)

Discussion Groups
Rapporteurs: Philippa Morrell, Chris Sheehan, Ed Hawker
Discussion Group 1: The Ethics and Legality of Big Data Sharing (57), Chair and Rapporteur: Edward Hawker
Discussion Group 2: Policing, Terrorism, Crime and Fraud (62), Chair: David Smart; Rapporteur: Philippa Morrell
Discussion Group 3: Health Data, Public Health and Public Health Emergencies (68), Chair: Chris Watkins
Discussion Group 4: Individual Privacy Versus Community Safety (76), Chair and Rapporteur: Jennifer Cole

Research Themes Identified in the Presentations and Discussion Groups (83)

An additional three presentations were given at the conference by Professor John Parkinson of the Medicines and Healthcare Products Regulatory Agency (MHRA), Michael Connaughton of Oracle, and Dr Catriona McLeish of the University of Sussex. For a variety of reasons, no written papers were produced for these presentations, but we would still like to acknowledge their contribution to the event. The PowerPoint presentations given by Michael Connaughton and Professor Parkinson, as well as those delivered by the speakers who have contributed a written paper, can be accessed on the RUSI website events page here: gl/9cxc3g.
Foreword
Bryan Edwards

Of all the challenges facing the UK today, few are as demanding as those affecting its national security. Some threats to the UK and its citizens are modern variants of those that the country has faced for many years. Others are entirely new and different to anything that has preceded them; while some, no doubt, have yet to be recognised or understood. One feature of this large, complex and constantly evolving array of challenges is that few, if any, lend themselves to single-discipline solutions. With this in mind, the Science and Technology Facilities Council (STFC) operates a Defence, Security and Resilience Futures Programme. Challenge-led and agnostic with respect to academic discipline, the STFC's aim is to identify and facilitate opportunities to engage relevant capabilities within the UK National Laboratories and university research groups in relation to some of the highest-priority and most demanding challenges in national security. As part of this programme, the STFC is delighted to fund, and proud to collaborate closely with, RUSI in delivering a series of conferences on topical issues within this domain. Each meeting is designed to explore the interface between academic research and government policy and operations, in order to stimulate debate on how a step change, rather than incremental change, in the protection of the UK could be achieved. The meetings are strategic in character, with contributions from an atypically broad community drawn from universities, industry, government and its agencies and partners. At the forefront of the organisers' minds is a deceptively simple question: what can academic research offer, now and in the future, to allow government to further enhance its capabilities in key areas, enabling it either to do significantly different things or to do what it does now in significantly different and better ways? In this context, Big Data is often identified as being of particular importance.
Certainly, there is little doubt that raw data are being generated at what appears to be an accelerating rate. This is a trend that seems set to continue for the foreseeable future. Not only that, but complementary improvements in data storage technologies and telecommunications infrastructure mean that more of these data can be archived (potentially indefinitely) and accessed on a global basis. And yet volume alone is insufficient to fully
appreciate either the nature of the challenge or the opportunities that exist. Indeed, if Big Data were defined according to volume alone, there would be few grounds for claiming a revolution. For example, during the 1990s, the strategy of the UK's Department for Social Security sought to migrate benefits, such as unemployment benefits and pensions, from traditional paper-based systems to IT systems. The data volumes associated with this enterprise were large, even by today's standards. It is therefore necessary to look instead at other characteristics of the data to identify what is qualitatively different, and to establish the source of the challenges and opportunities we are now presented with. These include features such as the diversity of the data, in terms of type and reliability. These in turn create new challenges for the development of the automated data-analysis and interpretation systems required. This raises questions not only over how one could, in principle, approach the analysis of such data, but equally over how systems based on these new principles could themselves be tested, verified and validated. While these technical challenges are significant, there are additional complexities associated with data residing in different organisations, and with a population that is increasingly aware of, and sensitive to, the possibility that data whose ownership it questions might be exploited in ways it considers inappropriate. In this meeting we look at some of the technical challenges that Big Data presents, and consider a range of possible uses of and perspectives on data in order to tease out new issues. In the course of a one-day event, the scope for exploring them in detail is extremely limited. However, it is hoped that identifying relevant questions to be explored elsewhere is, in itself, a useful contribution to the debate.
I would very much like to acknowledge the generous assistance and support offered by the US Department of Homeland Security, which contributed to making the day a success. Similarly, thanks must go to the staff at the STFC and RUSI, whose extremely hard work made this event possible. However, the final word of appreciation and gratitude is reserved for all those who participated so enthusiastically on the day itself, whether as speakers or as delegates. Anyone wishing to know more about the STFC's Defence, Security and Resilience Futures Programme in general, or about these conferences in particular, is invited to contact me using the address below.

Professor Bryan Edwards
Science and Technology Facilities Council
Introduction: Machine Learning for Big Data
Jennifer Cole and Alex Gammerman

This paper discusses the impact of the current high level of interest in Big Data from academia and industry, and comments on how this is influencing the approach taken to funding research and developing skills in particular areas of computer science. It also discusses the relationship between Big Data and machine learning (systems that have the ability to learn from data, rather than only following explicitly programmed instructions) and the influence Big Data has on machine learning. For Big Data (or, for that matter, Small Data) to have any value, machine learning needs to be applied in order to extract useful information from the data. The current approach to Big Data arguably places too much focus on the data as an end in themselves, at the expense of properly considering the techniques and approaches that will enable the best use to be made of them. For example, in 2012 the International Data Corporation estimated that while the global data supply had reached about 2.8 zettabytes (1 zettabyte equalling 10²¹ bytes), only an estimated 0.5 per cent of all data collected is used for analysis. 1 There is little point in Big Data per se; a problem needs to be defined, and then the amount of data needed to solve this problem can be decided. As a way of extracting useful information from data (irrespective of whether they are Big or Small Data), along with the academic disciplines and research that have contributed (and continue to contribute) to it, machine learning has much to offer in determining how the data are collected, analysed and used.

Buzzwords in Computer History
Big Data is a buzzword (or two), and it is not the first time in computer science that a new concept has been hailed as the answer to everything.
In 1982, the Japanese Ministry of International Trade and Industry (MITI) began the Fifth Generation Computer Systems (FGCS) 2 project to develop a supercomputer that would further develop artificial intelligence. The British response to the Japanese challenge was the Alvey Programme 3 in information technology. At that time, the way forward for artificial intelligence was largely considered to be expert systems: computer systems that could help a human in the decision-making process by emulating the reasoning abilities of an expert. Such systems were supposed to solve everything. Gradually, however, as it became clear that expert systems have narrow and limited areas of application, unsubstantiated claims died down and the boom was over. The expert-systems boom has much in common with the Big Data hullabaloo being experienced today. There seems to be an assumption that everything can be resolved by Big Data. It is somewhat naive to assume that theory is no longer needed to solve problems, just a lot of data and an ability to calculate a correlation between various items of data. This is nonetheless what some of the proponents of Big Data say. 4 The myth persists that Big Data will provide the answers to all our questions. Big Data will not do this, but combined with machine learning it may help to provide some of them.

Big Data and Machine Learning
Modern machine learning exists at the intersection between statistics and computer science. 5 Two main topics, inference (the process of reaching a conclusion from known facts) and data analysis, have been taken from statistics. In particular, non-parametric statistics (which makes no assumptions about probability distributions) has developed many methods and algorithms that are in use in machine learning. From computer science, on the other hand, come the questions of how to develop efficient algorithms and of knowledge representation, including the distinction between tractable, intractable and non-computable functions. Basically, machine learning tries to find regularities within past (or training) examples that allow the user to make predictions on future examples. This is done irrespective of the amount of data, big or small.

1. John Gantz and David Reinsel, 'The Digital Universe in 2020: Big Data, Bigger Digital Shadows and Biggest Growth in the Far East', International Data Corporation and EMC, 2012, <http://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf>, accessed 2 July 2014.
2. Ehud Shapiro, 'The Fifth Generation Project: A Trip Report', Communications of the ACM (Vol. 26, No. 9, 1983), <http://dl.acm.org/citation.cfm?id=358179>, accessed 2 July 2014.
3. 'The Alvey Programme', <http://www.chilton-computing.org.uk/inf/alvey/overview.htm>, accessed 30 July 2014.
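This idea of extracting a regularity from past examples and using it to predict future ones can be illustrated with a deliberately minimal sketch (not taken from the paper; the data and function names are invented for illustration): a one-nearest-neighbour classifier whose only learned "regularity" is that nearby points tend to share labels.

```python
import math

def nearest_neighbour_predict(train, test_point):
    """Predict the label of test_point from labelled training examples.

    train is a list of (features, label) pairs; the regularity the
    learner exploits is simply that nearby points share labels.
    """
    def distance(a, b):
        # Euclidean distance between two feature vectors
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # The closest training example supplies the prediction
    _, label = min(train, key=lambda pair: distance(pair[0], test_point))
    return label

# Toy training data: two clusters, labelled 'small' and 'large'.
training_examples = [
    ((1.0, 1.2), "small"), ((0.8, 1.1), "small"),
    ((9.0, 8.7), "large"), ((9.4, 9.1), "large"),
]

print(nearest_neighbour_predict(training_examples, (1.1, 0.9)))  # small
print(nearest_neighbour_predict(training_examples, (8.8, 9.3)))  # large
```

The same logic applies whether the training set holds four examples or four billion; what changes with scale is the engineering, not the principle.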
Researchers at Royal Holloway, University of London, have been doing this for years: in 1998, the Computer Learning Research Centre 6 was established there, and today two prominent researchers working there in the field of statistical learning theory (SLT) are Vladimir Vapnik and Alexey Chervonenkis, the theory's founders. Classical statistics usually deals with small scales and low dimensions of data; conceptual and computational difficulties may begin to arise when there are complex, sizable and high-dimensional data (roughly speaking, where the number of attributes or features is greater than the number of examples). Several machine learning methods are being developed to deal with these

4. Chris Anderson, 'The End of Theory: The Data Deluge Makes the Scientific Method Obsolete', Wired, 16 July 2008, <http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory>, accessed 30 July 2014.
5. Many disciplines, such as psychology, mathematics, philosophy, linguistics and biology, contribute to machine learning, but the main ones at present are statistics and computing.
6. Computer Learning Research Centre, Royal Holloway, University of London, accessed 30 July 2014.
problems, including online predictions, parallel algorithms and efficient methods. Some of the new techniques being developed at Royal Holloway include string-kernel techniques, prediction with expert advice and online conformal predictors (or transductive confidence machines), new learning techniques that make valid predictions. These techniques have been applied in a number of areas: for example, automatic target recognition, statistical profiling of offenders for the Home Office, material identification and atmospheric correction for military applications, and anomaly detection to identify suspicious behaviour of ships and other vehicles. They have also been applied in several medical fields, for example in detecting various abdominal diseases and ovarian cancer, and in finding the best treatment for depression. One of the central questions in the theory of learning concerns the quantity of data needed in order to achieve a solution with a desirable degree of accuracy. A simple pattern-recognition system to classify digits (0-9) can learn to recognise and correctly predict a shown digit after being trained on only a few hundred digits out of the hundreds of thousands of digits available for training. 7 That is only a fraction of the data, but enough to solve the problem. Pattern-recognition systems often need surprisingly small amounts of data to obtain an answer. While intuitively it seems that the more data are used, the more accurate the prediction will be, the founders of SLT, 8 Vapnik and Chervonenkis, have shown that it is not just the length of the training data that is important, but a concept called capacity or VC-dimension (after Vapnik and Chervonenkis). Roughly speaking, the VC-dimension is the number of parameters of a decision rule. The important factor for the quality of learning is the ratio of the length of the training set to the VC-dimension.
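The role of this ratio can be stated precisely. A standard textbook form of the Vapnik-Chervonenkis generalisation bound (reproduced here as background; it does not appear in the conference papers themselves) says that, with probability at least 1 − δ, every decision rule f drawn from a class of VC-dimension h satisfies:

```latex
R(f) \;\le\; R_{\mathrm{emp}}(f) \;+\;
\sqrt{\frac{h\left(\ln\frac{2\ell}{h} + 1\right) + \ln\frac{4}{\delta}}{\ell}}
```

where R(f) is the true (expected) error, R_emp(f) is the error on the training set, and ℓ is the number of training examples. The penalty term depends on h and ℓ essentially through the ratio ℓ/h: as that ratio grows, the bound tightens and performance on unseen data can be expected to approach performance on the training set.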
A large ratio is good from a learning perspective, as the results obtained on the test set are then close to those on the training set: to avoid overfitting, the test set should show about the same accuracy (number of errors) as the training set. If, however, there is a request to apply machine learning algorithms when Big Data is provided but the analysis cannot be handled on one machine, parallel algorithms can be developed and run on parallel machines. This requires more efficient methods to be developed, which is currently a challenge, though some progress is being made to resolve this. For example, in addition to well-known methods such as induction, there are some advances in developing

7. Alex Gammerman and Vladimir Vovk, 'Hedging Predictions in Machine Learning', The Computer Journal (Vol. 50, No. 2, 2007).
8. Olivier Bousquet et al., 'Introduction to Statistical Learning Theory', Max Planck Institute for Biological Cybernetics, 2004, <http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/pdfs/pdf2819.pdf>, accessed 2 July 2014.
transductive methods. 9 In induction, particular examples are used to formulate a general rule, and predictions are then made using this rule. The transductive approach instead goes from one example to another, which should be more efficient, as the model does not have to cover an infinite number of possible examples; it need only deal with one particular example, which will in turn predict the next one. This could be a way forward for developing new, efficient algorithms for prediction.

Conclusions
There is currently a great deal of research into machine learning taking place, and new algorithms are being developed. These are both simple and rigorous, and they yield a wide range of statistical learning methods. John Poppelaars 10 compared the current belief in Big Data with a fictional computer, Deep Thought, in The Hitchhiker's Guide to the Galaxy, which took seven-and-a-half million years to compute the answer to the ultimate question of life, the universe and everything; but because the beings who had programmed it never really knew what the question was, nobody knew what to make of the answer. Nowadays, people hope that Big Data will help to find the ultimate question, but to paraphrase The Hitchhiker's Guide to the Galaxy slightly, we would argue that it is not Big Data that will define the question: it is machine learning.

Jennifer Cole is a Senior Research Fellow in Resilience and Emergency Management at the Royal United Services Institute, where her research programme has included a number of reports and projects on the use of Big Data and cyber-security for the UK government, including the Foreign Office and Ministry of Defence. She is also a PhD candidate in the Computer Science Department at Royal Holloway, University of London.

Professor Alex Gammerman studied in Leningrad (now St Petersburg) and then worked in several research institutes of the Academy of Sciences of the USSR. In 1983 he moved to the UK.
He was appointed to the established Chair in Computer Science at the University of London (Royal Holloway and Bedford New College). Currently, he is Founding Director of the Computer Learning Research Centre at Royal Holloway, University of London, and a Fellow of the Royal Statistical Society. Professor Gammerman's research interest lies in the field of machine learning, particularly the development of inductive and transductive confidence machines. Areas in which these techniques have been applied include medical diagnosis, forensic science, genomics, environment and finance. This is a version of a paper written by the authors, which can be found at clrc.rhul.ac.uk/publications/techrep.htm.

9. Vladimir Vapnik, The Nature of Statistical Learning Theory (New York, NY: Springer, 1995).
10. John Poppelaars, 'Will Big Data End Operations Research?', 2013, <http://johnpoppelaars.blogspot.nl>, accessed 30 July 2014.
I. The National Archives, Big Data and Security: Why Dusty Documents Really Matter
Tim Gollins

This paper discusses three linked propositions. First, it considers the way in which the National Archives, as a national institution of the United Kingdom, can be regarded as a repository of Big Data. The paper will discuss the concept of Big Data and place it in the historical context of archival collections that have transformed the world, for example the library of the King of Assyria and the Library at Alexandria. Second, it will consider the way in which the National Archives are central to UK security, providing a point of reference for society and supporting citizens' rights and the rule of law. It will also discuss the potential threat that emerges from a loss of trust in the processes that underlie the transfer of records to the Archives. Third, the paper will cover how the challenges of sensitivity reviews of digital records, which ensure that sensitive government records are archived appropriately, 1 could give rise to further threats to the Archives and thus to the wider security of our society. The paper goes on to show that, in addressing the challenges of the sensitivity review of digital records by using the Big Data nature of archives, opportunities arise to counter the wider threats to the security of our society.

The Archives and Big Data
The classic definition of Big Data rests on volume, variety and velocity, 2 and is inherently assumed to be digital.
Taking a longer view, there are a number of points in history where such transformative conditions have existed with collections of other media, such as:

- The 30,000 clay tablets from the oldest surviving royal library in the world, that of Ashurbanipal, King of Assyria (seventh century BC), including the story of Gilgamesh 3
- The iconic Library of Alexandria, alleged to have collected the knowledge of the ancient world under one roof, with, by some estimates, several hundred thousand rolls in the collection 4

1. National Archives, 'Step 3: Sensitivity Reviews of Selected Records', <http://www.nationalarchives.gov.uk/information-management/manage-information/selection-and-transfer/sensitivity-reviews-on-selected-records/>, accessed 25 July 2014.
2. Anton Chuvakin, 'Broadening Big Data Definition Leads to Security Idiotics!', Gartner blog, 18 September 2013, <http://blogs.gartner.com/anton-chuvakin/2013/09/18/broadening-big-data-definition-leads-to-security-idiotics/>, accessed 18 July 2014.
3. British Museum, 'The Library of Ashurbanipal', research project at the British Museum, <http://www.britishmuseum.org/research/research_projects/all_current_projects/ashurbanipal_library_phase_1.aspx>, accessed 19 August 2014.
4. Heather Phillips, 'The Great Library of Alexandria', Library Philosophy and Practice, 2010, <http://unllib.unl.edu/lpp/phillips.htm>, accessed 18 July 2014.
In comparatively more recent times, as the practice and conventions of common law developed in Britain, the need to collect the records of cases and to access legal judgments for precedent gave rise to another example of the Big Data of its day. Drawing on information from the National Archives Catalogue, 5 we learn that:

The Dialogus de Scaccario, describing Exchequer administration in the 1170s, mentions a clerk who was deputy to the chancellor and had responsibility for the preparation and custody of formal Chancery enrolments. Thereafter, the chancellor's principal clerk was invariably associated with these duties, although progressively more and more remote from their direct execution; by 1388, and probably long before, a staff of subordinate clerks carried out the actual enrolments. From the mid-thirteenth century, this officer was generally known as the keeper of the rolls, and, as the first rank of Chancery clerks gradually came to be known as masters, the title Master of the Rolls had become the standard designation by the fifteenth century.

The holder of that post now chairs the Lord Chancellor's Advisory Council, which assures the transfer of records to the Archives. 6 Bringing the picture up to date, the paper holdings of the National Archives at Kew run to over 1 billion paper pages, representing 1,000 years of history. 7 At the same time, there are now over 2.5 billion archived pages accessible from the UK Government Web Archive (representing less than twenty years of contemporary history) 8 that are being aggregated and mined to answer novel research questions that would previously have been intractable. The Archive is, and always has been, Big Data.

The Archive and Security
Discussion of security should not be limited to considerations of criminality and terrorism. The security of UK society relies at its deepest level on the trust of the citizen in the state.
It is all about the rule of law and the fact that no one, not even the executive, is above that rule. 9 The British state is different from many others in that the citizen expects the state to be subservient to the citizen, rather than the reverse, which is the more common case. This is the very fabric of UK society; the rule of law supports and empowers the citizen.

5. National Archives Catalogue, <http://discovery.nationalarchives.gov.uk/searchui/>, accessed 19 August 2014.
6. National Archives, Advisory Council information, <http://www.nationalarchives.gov.uk/advisorycouncil/>, accessed 19 August 2014.
7. The author's own estimate, based on approximately 12 million entries in the National Archives catalogue that refer to boxes or folders of records which can reasonably be expected to hold upwards of 100 sheets of paper.
8. National Archives, UK Government Web Archive information, <http://www.nationalarchives.gov.uk/news/929.htm>, accessed 18 July 2014.
9. 'The Rule of Law', LexisNexis, <http://www.lexisnexis.co.uk/en-uk/about-us/rule-of-law.page>, accessed 18 July 2014.
The National Archives are fundamental to this aspect of security. The Archives provide the impartial witness that enables holding to account under the rule of law and in the court of history. They contain evidence of the transactions of the state and the executive, and evidence of the decisions and policies enacted. This is central to Lord Bingham's Fourth Principle: 'Ministers and public officers at all levels must exercise the powers conferred on them in good faith, fairly, for the purpose for which the powers were conferred, without exceeding the limits of such powers and not unreasonably.' 10 How can we know what the executive has done if the records are not kept? However, it is clearly not sufficient to consider the keeping of the record without considering how the record is selected and transferred to the Archives. The content of the Archives is clearly dependent on these processes. It follows, therefore, that the citizen must trust the process by which the Archives receive their material in order to sustain their rights.

Transfer to the Archive
The process by which public records are transferred to the National Archives is not widely understood, even among scholars who regularly use its content for their research. The principles of the appraisal that underlies transfer were laid down by the great archivist Hilary Jenkinson, who described many of the fundamentals of the UK system. 11 In setting out his approach, Jenkinson was trying to ensure that the UK archive (at that time the Public Record Office) was able to guard its independence under the rule of law, and could not fall foul of the criticism of complicity in wrongdoing that was evident in the case of the Nazi archive in Germany with respect to the Holocaust.
In summary, the transfer process consists of the following steps:

- Appraisal and selection: determining which records meet the collection policy of the National Archives, and then choosing which records should be transferred to the Archives or to a place of deposit
- Sensitivity review: deciding which records should be open on transfer, which must be closed, and which must be retained in departments (under the Lord Chancellor's blanket, see below)
- Preparation and delivery: the cataloguing, preparation and organisation of records for transfer, and the actual transportation of records to the National Archives or to a place of deposit
- Accessioning: the process by which the National Archives makes the records appropriately available.

A Threat
The principle of independence derived from and identified in the Grigg Report, 13 which initiated the Public Records Act 1958, has led, over the years, to a series of checks and balances intended to ensure that the necessary records of the activities of the executive are deposited. These checks and balances include: 14

- The right of access to information in departments under freedom of information legislation before information is transferred
- Departments' responsibility for the selection of the records, and for the identification of any sensitivity in the records that would cause an exemption under freedom of information legislation
- The fact that the exemptions that can be applied to delay transfer are prescribed in law, and that their application can be challenged through the information commissioner and thence by appeal to the Information Tribunal
- The public visibility of the selection criteria that the departments must apply, as agreed with the National Archives
- The National Archives' process of oversight during the creation of the criteria, and the Archives' process of monitoring their application
- The publication of information regarding transfers
- The formal oversight of the timeliness of the transfer process and of the application of freedom of information exemptions by the Lord Chancellor's Advisory Council on Public Records.

10. IAP Annual Conference, 'The Rule of Law in Prosecuting Big Businesses in Application to Regulatory Frameworks', 2013, p. 2, <http://www.iap-association.org/conferences/Annual-Conferences/18th-Annual-Conference-and-General-Meeting-Provisi/18AC_WS1D speech_alun_milford.aspx>, accessed 18 July 2014.
11. Hilary Jenkinson, A Manual of Archive Administration (London: P. Lund, Humphries & Co Ltd, 1963).
12. Eric Westervelt, 'Probe Details Culpability of Nazi-Era Diplomats', NPR, 28 October 2010, <http://www.npr.org/templates/story/story.php?storyid= >, accessed 18 July 2014.
Unfortunately, in 2012, negative publicity 15 concerning the migrated archives of the colonial administrations (papers of the British administrations which should have been passed to the Public Record Office in a timely fashion, but which were wrongly kept at the government's Hanslope Park facility) and subsequent questions concerning other collections of documents at the Foreign Office raised the issue of the degree of trust in this system.

13. James Grigg, Report of the Committee on Departmental Records, Cmnd 9163 (London: HMSO, 1954).
14. National Archives, 'History of the Public Records Act', <http://www.nationalarchives.gov.uk/information-management/legislation/public-records-act/history-of-pra/>, accessed 18 July 2014.
15. Ian Cobain and Richard Norton-Taylor, 'Sins of Colonialists Lay Concealed for Decades in Secret Archive', Guardian, 18 April 2012, <http://www.theguardian.com/uk/2012/apr/18/sins-colonialists-concealed-secret-archive>, accessed 22 July 2014.
While the process of selection, sensitivity review and transfer is in principle an open one, the process is complex and has opaque aspects (not least the use of the Lord Chancellor's Security and Intelligence Instrument, known colloquially as the 'Lord Chancellor's blanket', which is used to protect specific aspects of national security). 16 The very nature of such a situation, in which the shape of the process is open and yet the detail of the data passing through the process must be hidden (since to reveal that detail would render the process moot), creates conditions in which conspiracy theorists can ply their trade. 17 In essence, it can look as though the establishment has something to hide, and such appearances are important. Professor Margaret MacMillan is in no sense a conspiracy theorist, so when someone of her eminence feels compelled to challenge her own definitive works on the First World War, we should take note. 18 For trust to be maintained in the Archives, it is clear that any further barriers to the timely, open and transparent transfer of records must be avoided.

Sensitivity Review of Digital Records
The argument set out in this paper so far applies to all public records, regardless of format or media. There are, however, particular consequences of the transition to the use of digital records that need to be considered. During the three decades from 1984 to 2014, administrative practices were transformed by the introduction of a sequence of waves of technology. This started with the photocopier and moved on to the personal computer (PC), the local area network, the internet, a wide range of mobile devices and, most recently, the cloud. All of these technologies created the ability and the tendency to duplicate and proliferate information in ever-increasing volumes. This process was piecemeal and began in the early 1990s, but by the middle of the first decade of this century all UK government records were digital.
The impact of these technologies and the transformation of administrative practice on the records of the public sector has not been examined in detail; however, a detailed examination of the format and nature of the evidence presented to the Hutton Inquiry 19 is not encouraging. 20 In that evidence, the paper trail for a decision was no longer held in a single Manila file; instead, the record was found in a blizzard of e-mails sent from person to person and stored on multiple computing systems. It would appear that the previously clear and unambiguous rules for the creation and management of information in the public services have been challenged.

In July 2012, the government announced the transition towards releasing records when they are twenty years old, instead of thirty 21 (as had been the case since the amendment to the Public Records Act in 1967). 22 From 2013, two years' worth of government records will be transferred to the National Archives each year through a ten-year transition period, until the new twenty-year rule is fully in place in 2023. The records covered by this transition are those from 1983 to 2003, 23 coinciding with the period during which the most extreme aspects of the technical changes mentioned above took place.

When examining the process of transfer described above, and considering the impact of the change to digital records, it is clear that all of the steps in the process need to be examined. Appraisal and selection, preparation and delivery, and accessioning will all present challenges to departments and the Archives, but there are a number of mitigations, including the doctrine of macro appraisal and recent developments in digital preservation at the National Archives. 24 It is the process of sensitivity review that generates the most significant challenges, and where considerable work is needed to identify mitigations.

16. Notes on the Lord Chancellor's Security and Intelligence Instrument, <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/219905/notes-security-intelligence-instrument.pdf>, accessed 18 July 2014.
17. National Archives, '20 Year Rule: Record Transfer Report', <http://www.nationalarchives.gov.uk/about/record-transfer-report.htm>, accessed 30 September 2014.
18. Quoted in the Guardian: 'I am one of many historians who has benefited from using the British archives and who had confidence that the documents had not been weeded to suit particular interests. Now I am wondering whether I will have to go back and rethink my work on such matters as the outbreak of the First World War or the peace conference at the end. But when are we going to get the complete records? So far the pace of transferring them is stately, to put it politely.' Ian Cobain, 'Academics Consider Legal Action to Force Foreign Office to Release Public Records', Guardian, 13 January 2014, <http://www.theguardian.com/politics/2014/jan/13/foreign-office-secret-files-national-archive-historians-legal-action>, accessed 19 August 2014.
19. Lord Hutton, Report of the Inquiry into the Circumstances Surrounding the Death of Dr David Kelly C.M.G. [the Hutton Inquiry], HC 247 (London: The Stationery Office, 2004), <http://fas.org/irp/world/uk/huttonreport.pdf>, accessed 18 July 2014.
20. Michael Moss, 'The Hutton Inquiry, the President of Nigeria and What the Butler Hoped to See', English Historical Review (Vol. 120, No. 487, June 2005), <http://ehr.oxfordjournals.org/content/120/487/577>, accessed 19 August 2014.
21. National Archives, 'Government Confirms Transition to a 20-Year Rule Will Begin from 2013', 13 July 2012, <http://www.nationalarchives.gov.uk/news/739.htm>, accessed 18 July 2014.
22. Public Records Act 1967, <http://www.legislation.gov.uk/ukpga/1967/44>, accessed 22 July 2014.
23. Ibid.
24. Tim Gollins, 'Putting Parsimonious Preservation into Practice', The National Archives, 2012, <http://www.nationalarchives.gov.uk/documents/information-management/parsimonious-preservation-in-practice.pdf>, accessed 25 July 2014.

Additional Threats

The challenges of digital records to the process of sensitivity review are as follows:
• Volume and resources: Following advances in office technology during the late twentieth century, the consequent proliferation of information, and the broadening of the interests of the scholarly community, a much greater volume of material is being deemed worthy of preservation in the digital age. Against a background of budgetary constraint, the manual review of digitally born records is not practical.
• Complex context: Technology has challenged earlier clear and unambiguous rules for the creation and management of information. This will significantly complicate the process of digital sensitivity review, as understanding a record's context (including its distribution) is crucial in assessing its sensitivity.
• Risk: These challenges for review also occur in a context of significantly increased risk. Although the consequences of mistaken disclosure have not changed with the advent of digital records, the probability of discovering a mistake has. It is hard to discover particular information in the paper world, in marked contrast to the digital environment, where ubiquitous search engines index content rapidly. Risk-averse depositors may feel obliged to close large swathes of records if they cannot efficiently and effectively determine the sensitivity of each individual record with some clear degree of certainty.

If sensitivity review of digitally born records is not practical, then against a background of budgetary constraint and increasing litigation, unless something is done, large swathes of records will be closed in their entirety for long periods (up to 120 years in the case of some exemptions). Such precautionary closure (due to the costs or difficulty of review) is permissible under freedom-of-information legislation, but it contradicts citizens' expectations of openness in a democratic society and will only serve to exacerbate the threat to trust in the Archives described above, and the consequent threat to our security.
Opportunities

While digital records may challenge sensitivity review, and this may give rise to threats to our wider security, their very nature also offers opportunities to address those challenges and counter the threats. Some of the opportunities are as follows:

• Some sensitivities are not subtle: they can relate to specific terms, and an appropriately configured search system should thus be able to highlight them. For example, the records relating to the Al-Yamamah contract, 25 although still available on the Campaign Against Arms Trade (CAAT) website, have been officially closed to prevent further damage to international relations.

25. David Leigh and Rob Evans, 'Secrets of al-Yamamah', Guardian, [no date], <http://…>, accessed 18 July 2014.
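A term-based flag of this kind can be sketched in a few lines of Python. This is a minimal illustration only: the term list, record identifiers and record contents below are invented for the example, and a real review system would draw its terms from closure guidance and combine them with the contextual and machine-learning techniques discussed here.

```python
import re

# Hypothetical sensitive-term list, for illustration only.
SENSITIVE_TERMS = ["al-yamamah", "hanslope park"]

def flag_sensitive(records, terms=SENSITIVE_TERMS):
    """Return (record_id, term) pairs for every record containing a listed term.

    A crude stand-in for an 'appropriately configured search system':
    case-insensitive, whole-phrase matching only.
    """
    patterns = {t: re.compile(re.escape(t), re.IGNORECASE) for t in terms}
    hits = []
    for record_id, text in records.items():
        for term, pattern in patterns.items():
            if pattern.search(text):
                hits.append((record_id, term))
    return hits

# Invented example records.
records = {
    "FCO/1/1": "Minute on the Al-Yamamah contract negotiations.",
    "FCO/1/2": "Routine correspondence on embassy staffing.",
}
```

Running `flag_sensitive(records)` would surface only the first record for reviewer attention; such explicit matching is the easy case, which is precisely why it is listed first among the opportunities.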
• Consistency: by using electronic means, it is possible to drive some consistency across the review process.
• Accurate estimation of residual risk: unlike in the review of paper records, it is possible to estimate the risk posed by reviewed records using the concept of technologically assisted digital review.
• Exploitation of the Big Data aspects of digital records, coupled with the application of machine learning in the context of information-retrieval technology, can result in patterns emerging that inform reviewers of where to look.

All of the above requires significant research: first to determine what the digital record looks like, and then to demonstrate the opportunities that can be derived from it.

Conclusion

Freedom of information does not relate solely to openness. There is a fundamental difference between openness (driven by what the state wants its citizens to see) and freedom of information, which vests the right of access to information in the individual. 26 Freedom of information creates a balance between the public interest, the state interest and the personal interest based on human rights, all mediated and governed by the rule of law. Balance is crucial to achieving freedom of information alongside openness. Limits on openness are necessary for reasons of national security (for example, the location of Britain's nuclear weapons should not be revealed, nor should their targeting information), and individuals also need to be protected from harm, which requires some limits on public access to information. However, the ability to hold the executive to account, under the rule of law and in the court of history, is also central to the security of a modern democratic society. This can only be achieved through open and transparent access to the records of government.
How these challenges play out in the digital age of Big Data requires significant research, in order to gain a better understanding of how public records have changed and thus how they can be sensitivity reviewed and appropriately archived.

Tim Gollins is currently an Honorary Research Fellow in the School of Computing Science at Glasgow University, working on the technically assisted sensitivity review of digital public records while on secondment from the National Archives. Tim started his career in the UK civil service in 1987 and joined the National Archives in April 2008 to lead the delivery and procurement workstream of the Digital Continuity Project. Tim was part of the team that developed the National Archives' business information architecture and helped to initiate work on the new Discovery system to enable users to find and access the records held at the National Archives. He has recently worked on the design and implementation of a new digital-records infrastructure at the National Archives, which embodies the new parsimonious-preservation approach he developed. Tim is a Director of the Digital Preservation Coalition and a member of the University of Sheffield i-School's Advisory Panel.

26. S Curtis, 'Information Commissioner: Open Data is No Substitute for Freedom of Information', Daily Telegraph, 29 October 2013, <http://www.telegraph.co.uk/technology/news/…/information-commissioner-open-data-is-no-substitute-for-freedom-of-information.html>, accessed 29 July 2014.