1 Bringing Compute to the Data: Alternatives to Moving Data. Part of EUDAT's Training in the Fundamentals of Data Infrastructures

2 Introduction Why consider alternatives? The traditional approach Alternative approaches: Distributed Computing Workflows Bringing the Compute to the Data

3 Why should alternative approaches be considered? Moving data is still hard, even when you're using the right tools. Data volumes are expected to continue to increase, and this is expected to happen more rapidly than increases in transfer speeds. Alternatives require thinking about things differently, so it may be wise to start thinking about alternatives before current techniques break down.

4 Traditional Approach Input data is stored at location A. The compute resource is at location B. Output data is required at location C. 1. Move data from A to B. 2. Perform computation at B. 3. Move data from B to C. (A and C are often the same place.)
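The three steps above can be put into rough numbers. The following is a minimal sketch, assuming an idealised link with no latency or protocol overhead; the function names and all figures (1 TB in, 1 GB out, one hour of compute, a 1 Gbit/s link) are hypothetical and chosen only for illustration.

```python
def transfer_seconds(size_bytes: float, bandwidth_bits_per_s: float) -> float:
    """Idealised time to move size_bytes over a link (assumes no latency or overhead)."""
    return size_bytes * 8 / bandwidth_bits_per_s


def traditional_total_seconds(input_bytes: float, output_bytes: float,
                              compute_seconds: float,
                              bandwidth_bits_per_s: float) -> float:
    """Step 1: move A -> B, step 2: compute at B, step 3: move B -> C."""
    return (
        transfer_seconds(input_bytes, bandwidth_bits_per_s)
        + compute_seconds
        + transfer_seconds(output_bytes, bandwidth_bits_per_s)
    )


if __name__ == "__main__":
    # Hypothetical figures: 1 TB input, 1 GB output, 1 hour compute, 1 Gbit/s link.
    move_in = transfer_seconds(1e12, 1e9)
    total = traditional_total_seconds(1e12, 1e9, 3600.0, 1e9)
    print(f"input transfer: {move_in:.0f} s, total: {total:.0f} s")
```

With these assumed figures the input transfer alone takes longer than the computation, which is the situation the alternatives in the following slides try to avoid.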

5 Traditional approach Data Compute

6 Alternative Approaches: A Disclaimer None of the following approaches provides a silver bullet! Not all approaches will be useful for all problems and, in some cases, using these approaches can make things worse. These should complement existing approaches and be used where appropriate.

7 Distributed Computing Here, the idea is that you might not need to do all of the compute at B. In general, this approach could make things worse, depending on your data transfer pattern It will not be suitable for all kinds of problem Many of the considerations here are traditional parallel computing concepts

8 Distributed Computing as Parallel Computing Is the problem trivially parallel? Is it possible to solve parts of the problem using only part of the input data, and simply recombine the output at the end of a run? If all processors have access to all the data at the start, is it then possible for them to proceed with little or no communication during the runs? If there is the need to communicate during a run, how intensive are these communications? Do you have all-to-alls?
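A trivially parallel problem, in the sense of the questions above, can be sketched as follows. This is an illustration only: the threads stand in for separate compute sites, and the names `split`, `partial_sum_of_squares` and `distributed_sum_of_squares` are hypothetical. Each worker sees only its own part of the input, does not communicate during the run, and the outputs are simply recombined at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_parts):
    """Split data into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def partial_sum_of_squares(chunk):
    """Work on one chunk only; needs nothing from the other chunks."""
    return sum(x * x for x in chunk)


def distributed_sum_of_squares(data, n_workers=4):
    """Run the partial computations independently, then recombine the outputs."""
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(distributed_sum_of_squares(list(range(10)), n_workers=4))
```

A computation that needed an all-to-all exchange between the chunks part-way through would not fit this pattern, and distributing it could make the data transfer problem worse rather than better.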

9 When might Distributed Computing be a good alternative? When input data starts off distributed. This is fairly common with large-scale experimental data: sensors, detectors, etc. When input data is already mirrored. When you've had to move the data before anyway and you could have moved it to multiple places instead of just one. When the computation is trivially parallelisable or requires only limited communication.

10 A B1 B2 B3 B4 C

11 A1 A2 A3 A4 B1 B2 B3 B4 C

12 A1 A2 A3 A4 B1 B2 B3 B4 C

13 A1 B1 B2 C

14 A1 A2 B1 B2 B3 C

15 Is this Grid Computing? There are definite overlaps between these ideas of distributed computing and the grid computing that promised so much in the last decade Grid is not such a cool topic anymore, but many of the ideas could be reused in different contexts (possibly hidden from an end-user) This way of computing may still come into its own for certain kinds of big data problems

16 Scientific Computation in the cloud? Likely to be a while before this can get close to existing approaches in terms of efficiency, but it is being used in some places e.g. Amazon has Cluster Compute and Cluster GPU instances (see Some data sets are already in the cloud, e.g. Annotated Human Genome Data provided by ENSEMBL Various US Census Databases from The US Census Bureau UniGene provided by the National Center for Biotechnology Information Freebase Data Dump from Freebase.com

17 Big Input Data Likely to become more common as more and more data is stored and available for re-use. Projects like EUDAT will make it easier to access stored data. This will be the case for much data-intensive science, where this term is used here in the context of the "fourth paradigm": computers as datascopes.

18 Workflows Related to distributed computing Sometimes referred to as programming in the large Again, this potentially requires more data movement The idea is to break the computation down so that some of it can be done at A, some of it can be done at B, and some of it can be done at C. Also, instead of doing everything at B, this could instead be done at B1, B2, B3, B4,

19 Simple Motivating Example Big Input Data A B Small Output Data C
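The motivating case of big input and small output can be illustrated with a toy comparison. This is a sketch under stated assumptions: the "sites" are just Python functions, the selection step is a hypothetical threshold filter, and the count of records moved stands in for network traffic.

```python
def run_everything_at_b(records_at_a, threshold):
    """Traditional: the whole input crosses the network from A to B."""
    moved = list(records_at_a)                      # A -> B transfer
    selected = [r for r in moved if r >= threshold]  # computed at B
    return sum(selected), len(moved)


def prefilter_at_a(records_at_a, threshold):
    """Workflow: the selection step runs at A, so only the subset is moved."""
    selected = [r for r in records_at_a if r >= threshold]  # computed at A
    moved = selected                                        # A -> B transfer
    return sum(moved), len(moved)


if __name__ == "__main__":
    records = list(range(100))
    print(run_everything_at_b(records, 90))  # (result, records moved)
    print(prefilter_at_a(records, 90))
```

Both functions return the same result; only the number of records moved differs. Whether a real computation can be split this way depends on the algorithm.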

20 A A B1 B B2 C C

21 or a more realistic case? Image Source:

22 Difficulties with this approach Change to computation algorithm likely A trade-off, but it might only need to be done once Orchestration Coordinating computation at multiple sites Workflows can help with this Can help to address the added complexities of Multiple jurisdictions / access policies Job scheduling Automation

23 Approaches to orchestration Local: Each compute service works independently. Data can be pushed or pulled between services (or some combination). The route that the data should take can be passed with the data, predetermined at the service, or communicated manually to the service for each run. Orchestrated: The usual workflow approach. A workflow engine communicates with services or processing elements to control data flow.
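The difference between the two approaches can be sketched in a few lines. This is an illustration, not how any particular workflow engine works: the three services (`clean`, `square`, `total`), the `SERVICES` registry and both runner functions are hypothetical. In the local style the remaining route travels with the data; in the orchestrated style a central loop, standing in for a workflow engine, decides what runs next.

```python
def clean(values):
    """Drop missing values."""
    return [v for v in values if v is not None]


def square(values):
    return [v * v for v in values]


def total(values):
    return sum(values)


# Hypothetical registry standing in for services at different sites.
SERVICES = {"clean": clean, "square": square, "total": total}


def run_local(service_name, data, route):
    """Local: a service does its work, then forwards the result plus the remaining route."""
    result = SERVICES[service_name](data)
    if not route:
        return result
    return run_local(route[0], result, route[1:])


def run_orchestrated(data, steps):
    """Orchestrated: a central engine calls each service in turn and controls the data flow."""
    for name in steps:
        data = SERVICES[name](data)
    return data


if __name__ == "__main__":
    raw = [1, None, 2, 3]
    print(run_local("clean", raw, ["square", "total"]))
    print(run_orchestrated(raw, ["clean", "square", "total"]))
```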

24 An aside: Push & Pull Push: Service 1 completes processing. Service 1 makes a call to service 2 and sends the data to service 2. The arrival of data triggers service 2 to run. Pull: Service 1 runs and stores its output locally. Service 2 runs (triggered manually). Service 2 initiates data transfer from service 1.
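The two patterns can be sketched with two in-process objects standing in for the services. Everything here is hypothetical (class names, method names, and the doubling and summing used as placeholder processing); real services would communicate over a network.

```python
class Service2:
    def __init__(self):
        self.results = []

    @staticmethod
    def process(data):
        return sum(data)

    def receive(self, data):
        """Push: the arrival of data triggers service 2 to run."""
        self.results.append(self.process(data))

    def run(self, source):
        """Pull: service 2 is started separately and initiates the transfer itself."""
        self.results.append(self.process(source.fetch_output()))


class Service1:
    def __init__(self):
        self.stored_output = None

    @staticmethod
    def process(raw):
        return [x * 2 for x in raw]

    def run_push(self, raw, downstream):
        """Push: after processing, call service 2 and send it the data."""
        downstream.receive(self.process(raw))

    def run_pull(self, raw):
        """Pull: process and keep the output locally until it is requested."""
        self.stored_output = self.process(raw)

    def fetch_output(self):
        return self.stored_output


if __name__ == "__main__":
    s1, s2 = Service1(), Service2()
    s1.run_push([1, 2, 3], s2)   # service 2 runs as soon as the data arrives
    s1.run_pull([4, 5])          # nothing happens at service 2 yet
    s2.run(s1)                   # service 2 is triggered and pulls the data
    print(s2.results)
```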

25 Workflow Engines Scientific Workflows Kepler, Taverna, Triana, Pegasus (Condor), VisTrails Unicore, OGSA-DAI (for database-oriented flows) General Purpose / Business Orientated Service Oriented Architecture Solutions BPEL engines, e.g., Oracle BPEL Process Manager SAP Exchange Infrastructure WebSphere Process Server Many of these based on web services Datacentre orientated Hadoop (MapReduce), Storm (stream processing)

26 Moving the Compute to the Data A more general idea which is related to both the previous approaches. It relies to some extent on having an infrastructure that supports it. It can work particularly well where A and C are the same place.

27 Computing Close To The Data Relational Database Systems Send a query as SQL Virtual Machines Send a VM image to a virtualisation environment on a machine which can directly mount the data Allow a user to submit a script or executable on a machine close to the data SPARQL endpoints on RDF triple stores Data Services (e.g. as Web Services) with some API beyond file transfer Prefiltering / transformation / subsetting Application As A Service
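The first item, sending a query as SQL, is perhaps the most familiar instance of computing close to the data. The sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for a remote database server; the `readings` table and its contents are made up for illustration. The first function moves the rows to the client and computes there; the second sends the computation to the database and moves only the answer.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database (a stand-in for a remote data store)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        [("s1", 1.0), ("s1", 3.0), ("s2", 10.0), ("s2", 20.0)],
    )
    return conn


def mean_by_moving_data(conn, sensor):
    """Fetch every matching row, then compute the mean on the client side."""
    rows = conn.execute(
        "SELECT value FROM readings WHERE sensor = ?", (sensor,)
    ).fetchall()
    values = [row[0] for row in rows]
    return sum(values) / len(values)


def mean_by_moving_compute(conn, sensor):
    """Send the aggregation as SQL; only a single number comes back."""
    (mean,) = conn.execute(
        "SELECT AVG(value) FROM readings WHERE sensor = ?", (sensor,)
    ).fetchone()
    return mean


if __name__ == "__main__":
    conn = build_demo_db()
    print(mean_by_moving_data(conn, "s2"), mean_by_moving_compute(conn, "s2"))
```

With four rows the difference is negligible, but the amount of data returned by the second form stays constant as the table grows.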

28 Implications for Data Centres These approaches rely on data centres to provide computational resources and services Cons: Interface required to accept query or compute job Compute/processing resources required Pros: Less strain on the network

29 Conclusions Data movement will always be required. Moving large amounts of data is never likely to be easy. There is not one single solution, but considering alternative approaches to big data problems may help you to solve problems and answer questions that would otherwise have been impossible.

30 Acknowledgements These slides were produced by Adam Carter (EPCC, The University of Edinburgh) as part of the EUDAT project ( The University of Edinburgh You are welcome to re-use these slides under the terms of CC BY 4.0 (
