Visual Analysis for Extremely Large Scale Scientific Computing


D2.5 Big data interfaces for data acquisition

Deliverable Information
Grant Agreement no: 619439
Web Site: http://www.velassco.eu/
Related WP & Task: WP2, T2.4
Due date: March 31, 2015
Dissemination Level:
Nature:
Version: 1.0
Author/s: Benoit Lange, Toàn Nguyên
Contributors: Alvaro Janda, Andreas Dietrich, Miguel Tinte, Jochen Haenisch, Miguel Pasenau

Approvals
- Author: Toàn Nguyên, Andreas Dietrich, Benoit Lange
- Task Leader: Benoit Lange (INRIA)
- WP Leader: Toàn Nguyên (INRIA)
- Coordinator: INRIA
- Date: 16/01/2015

Change Log
- Version 0.1: First draft version of the document
- Version 1.0: First version of the document
- Version 2.0: Update of the document with different contributions
- Version 3.0: Added more contributions and comments; corrected style, numbering of sections, figures and footnotes

Table of contents
1. Introduction
2. Data Flow on the platform
   2.1 Current state of the platform
   2.2 Modules
       2.2.1 Simulation
       2.2.2 Ingestion
       2.2.3 Storage
       2.2.4 Access of Storage
       2.2.5 Analytics
       2.2.6 Query Manager
       2.2.7 Visualization on the User Workstation
3. Decomposition in Vqueries
   3.1 Session Queries
   3.2 Direct Result Queries
   3.3 Result Analysis Queries
   3.4 Data Ingestion Queries
4. Moving to HPC cloud
   Simulation (FEM simulations, Discrete based simulations)
   Ingestion
   Storage
   Access of Storage
   Analytics
   Query Manager
   Visualization on the User Workstation
5. Moving to Cloud
   Simulation
   Ingestion
   Storage
   Access of Storage
   Analytics
   Query Manager
   Visualization on the User Workstation
6. Conclusion
7. References

1. Introduction

By 2020, the amount of produced information will be huge. The tools used by the scientific community (simulation tools or observation tools) will be able to produce more data than ever before. As stated in [16], science produces information following three paradigms: observational data, experimental data and simulation data. Observational data comes from unexpected events. Experimental data is produced in a fully controlled environment, where all parameters are well known by the scientists; this strategy reduces the noise introduced by external sources and is mainly applied in closed environments such as laboratories. The third strategy is based on simulations. A simulation is the execution of a mathematical model, solved by a specific, usually iterative, computation; it produces data without the need to carry out a real experiment.

This discovery process has evolved in the past few years. Simulation software is no longer reserved to a few specific communities (chemistry, astrophysics, engineering, etc.); it is now used in all research fields. This evolution was driven by the transformation of computing hardware. HPC environments will no longer be the reference architecture; computation will move to cloud IT systems. This transformation is led by the evolving cost of such systems: in a cloud environment, CPU time is becoming cheaper than ever before and is now an affordable resource.

This increase in computed data will lead scientists to adopt a new paradigm. The production of information is no longer the problem; simulation tools exist for a wide variety of domains. Today's challenge is the management of the produced data. I/O operations have not been a main interest of the research community and suffer from this lack of attention: they have not been optimized at the same pace as computation on CPUs or GPUs. To avoid write-latency issues, simulation engines do not write all data to the file system; users select how many time steps shall be stored, and the intermediary time steps between stored ones are considered unnecessary.

In this project, we aim to provide a platform that can deal with engineering data without deleting intermediary information. This work targets two kinds of datasets: information already produced (owned by the scientific community) and data computed by simulation engines. An exhaustive list of the resources available to this project is presented in Figure 1. This table shows all the requirements for external data. The providers have limited access to storage and already use strategies to reduce the amount of information, i.e. they avoid storing some time steps of the computed data.

The simulation tools targeted by our partners (used by CIMNE and UEDIN) can compute multiple time steps per execution. These iterations may produce very large amounts of data; it is therefore necessary to use reduction methods to deal with the produced datasets.

Figure 1. List of already existing resources in this project.

In this deliverable, we enlarge the notion of data acquisition from the data-ingestion concept (described previously in Deliverables D2.1 to D2.4) to the production of data. We study the workflow of all components: simulation, analytics tools, etc. The document is therefore not limited to the ingestion of data produced by a specific large-scale application of the VELaSSCo user panel. It is important to take into account the large volumes of data that are likely to be stored, processed, produced and exchanged between the various modules of the platform. This has a direct impact on the way the modules interact, on the required communication protocols, on latency and fault tolerance and, hence, on performance.

Further, the e-Science applications targeted by the VELaSSCo project are simulations using FEM and DEM modeling approaches. Many large-scale collaborative projects have integrated data stores to support collaboration among sometimes numerous international stakeholders. This implies the management of, and effective access to and storage of, large volumes of remote heterogeneous data. First, the data is ingested, then analyzed and processed before dissemination and visualization (Figure 2).

Figure 2. NIST Big Data Interoperability Framework: Information Flow [8]

Next, the data might originate from various sources using different methods and tools, ranging from streaming to batch processing and from message-based transfers to packet-based communication protocols. The data can also range from very large numeric arrays to complex documents that include text, videos and hyperlinks to each other. These datasets have requirements related to the storage method used, and depending on the storage method and data complexity the storage system provides a more or less efficient data access path. An overview of the most used methods is presented in Figure 3.

Figure 3. NIST Big Data Interoperability Framework: Data Complexity [8]

This means that effective and standardized communication and transfer approaches must be used, on top of high-speed and low-latency hardware devices. Hierarchical memory devices are common today, including hard disk drives, SSDs, DRAM and caches, to support fast data access for data- and compute-demanding applications.

Also, data replication is a must today in order to support fast, locally stored data as well as fault tolerance.

In the rest of this document, we discuss the flow of information of an HPC deployment of the platform. Then we study the impact on this communication flow of an HPC-cloud architecture, and finally we discuss the move to a pure cloud infrastructure. This study introduces our workflow decomposition based on Vqueries. If we compare our proposed architecture with the architecture provided by NIST (see Figure 4), we can see that the two are compatible.

Figure 4. NIST Big Data Interoperability Framework: Reference Architecture [39]

2. Data Flow on the platform

In this section, our interest is focused on the current state of the developed platform and then on its data flow. A simplified version of the architecture is presented in Figure 5.

Figure 5. Schema of the VELaSSCo architecture.

2.1 Current state of the platform

During the first year of the project, one solution was elaborated by the consortium to enable an early start of the development of the VELaSSCo infrastructure. For this solution, the hardware was provided by CIMNE; the compute platform is composed of two dedicated nodes. The administration of these nodes has been given to VELaSSCo users in order to install all necessary software on the platform (without the usual HPC restrictions) and to experiment with it. Each node has two processors (Intel(R) Xeon(R) CPU, 2.33 GHz) and 32 GB of memory. For the management of the software stack, we have created a specific user named velassco, shared by the users of the platform; this user is used to store and start all necessary processes.

Furthermore, as part of the deployment plan of the VELaSSCo platform, two dedicated nodes have already been installed in the HPC cluster Eddie of the University of Edinburgh. The deployment of the VELaSSCo platform will first be tested on these two nodes, and later in the project the platform will be deployed on the Eddie HPC system involving a larger number of nodes. Currently, a refresh of the hardware and software stack is being conducted on the Eddie systems by the cluster management team. Interestingly, these refresh actions are in line with the architecture designed for the VELaSSCo platform.

The designed platform is decomposed into layers, and the layers are split into components, see Figure 5. Some of these components have already been described in previous deliverables and their implementations already exist; thus the deployment of these specific parts can be performed quickly.

With the Hadoop ecosystem we already cover a subset of the necessary tools, and some partners provide additional tools for the platform. For the simulation engine, CIMNE and UEDIN provide FEM and DEM simulation tools. These tools already exist and are deployed on compute nodes outside of the VELaSSCo architecture; this part of the project will not evolve, and a traditional strategy will be used to produce data.

The data layer is based on different components: Flume, HBase, Hive, a communication component (batch and real time), and HDFS (with HadoopAbstractFileSystem). Flume is an existing tool that aggregates information from multiple sources. It is based on agents, which gather information from one repository and store the resulting dataset into a target repository; the agents merge data from multiple sources and can write to different file systems or repositories. In this project we target writing information into HDFS (and from there further into Jotne's EDM after conversion into ISO 10303) and HBase (a Bigtable-like system on top of HDFS). HBase offers a fast indexing strategy for tabular data. It can be mapped onto any file system but is most suitable for HDFS; in this project it provides a fast and efficient solution to store and access data. EDM (EXPRESS Data Manager) is an alternative storage for the VELaSSCo simulation data; it does not follow the tabular paradigm but the object-oriented one specified in ISO 10303, STEP. Hive offers a simple, SQL-based query language on top of a distributed file system, enabling distributed computing with a well-known query language. It is useful because it allows interacting with a specific database stored on the file system, and it can also interact with HBase; this offers an additional strategy to interact with data stored in HDFS. Because it only supports table-based data, Hive is not applicable to the EDM use of the VELaSSCo platform.

The storage layer of the platform is managed by the Hadoop infrastructure. Hadoop comes with a dedicated file system called HDFS, but, using configuration files, it is possible to plug in file systems other than HDFS; an abstraction layer over the Hadoop Java classes enables this evolution of the platform. This strategy makes it possible to deploy different file systems in a Hadoop ecosystem, and we plan to use this feature to extend the Hadoop storage system with the EDM DB.

The last component of the data layer is named data query. This component provides the necessary communication layer to interact with the storage layer: with it, the engine layer of the platform uses only one communication protocol to interact with all components (HBase, Hive or HDFS I/O). This component is also in charge of routing queries to the real-time engine or the batch engine; this layer still needs to be developed.

The next layer of the platform is the engine layer, which is handled by YARN, the resource scheduler and manager of Hadoop. Four modules compose this layer: a query manager module, a monitoring module, a graphics module and an analytics module. The monitoring module will use the existing ZooKeeper tool to monitor the platform; this tool already exists and only needs to be deployed for our own set of tools. The analytics module executes queries that process the stored data to extract the information the scientist requests; for this purpose a specific computation is applied to the existing data. Examples of this feature are: extracting splines from a subset of the model, extracting a specific level of detail of the model, calculating the 0-level iso-surface of a fluid simulation, or computing the maximum damage result of a structural simulation. This component can also write intermediary information to ensure a higher reactivity of the platform. The extracted information is then passed to the graphics module. The graphics module prepares the Vquery results so that the information can be displayed by the visualization engines at high speed with minimal latency, using server-side HPC compute and memory resources. To this end, it converts the data into an internal representation: it receives data from the storage layer or from the analytics module, converts the extracted dataset into a format suitable for fast GPU rendering, and hands the resulting data structures over to the query manager module, which sends them back to the visualization client. Moreover, the graphics module will handle streaming and progressive data transfer: rather than sending the complete result of a query in one step, information is sent on demand in small parts based on user input (e.g., depending on the position of a moving camera). The final component of this layer is the query manager. This module is in charge of the communication with the visualization client and offers a communication gateway to the VELaSSCo platform. It can interact directly with the storage layer and ask the analytics module to perform computations on specific subsets of the stored data. It also analyzes the complexity of a query in order to use the most suitable data flow (direct access to the data, or through the analytics module).

The last layer represents the visualization part of the platform. It is composed of a single component: a plugin developed by the consortium, which enables interaction between visualization software (GiD, ifx) and the VELaSSCo platform. The next part of the document provides a deeper description of each component and of the communication pipelines among them.

2.2 Modules

In this section we discuss the different workloads of the platform, with a special focus on the component level.

2.2.1 Simulation

The data that is to be analyzed, processed and visualized using the VELaSSCo platform and the visualization clients originates in numerical simulation programs that run in HPC centers. A simulation of a physical process is performed by solving the equations describing this process on a discretization of the problem domain, which depends on the method used to solve these equations. Most numerical methods, like the Finite Element Method (FEM) or Finite Volumes (FV), are based on a discretization of the domain into small elements or cells defining a mesh. These elements can be surface elements, such as triangles, or volume elements, such as tetrahedra. Other numerical methods, like the Discrete Element Method (DEM), use particles to represent the domain of the simulation; these particles can be circles, spheres, or more complex shapes. The output of these methods can be scalars, vectors or tensors (referred to as simulation results), which can be attached to both nodes and elements. These simulation results can be viewed as attributes, like the typical per-face and per-vertex attributes of computer-graphics models: colours, normals, texture coordinates, and so on. For example, in the simulation of the aerodynamics of a racing car, the domain to be represented is the air surrounding the car's body, and it can be approximated using several million tetrahedra. The simulation program calculates the evolution of attributes like air pressure, velocity, density or viscosity on this fixed volume mesh along all the time steps of the analysis.

Scientific simulations that run on High Performance Computing (HPC) clusters follow the distributed-memory paradigm and partition the huge domain meshes into small portions, trying to minimize the interface between these portions [17, 18], as shown in Figure 6. When the simulation finishes, the post-processing, i.e. result analysis and visualization, is usually performed by merging these partitions, with their results, on one single computer.

Figure 6. Simulation of the air flow around a telescope. The mesh with 24 million tetrahedra was subdivided into 128 partitions in order to run on 128 cluster nodes, as the colour map shows. The streamlines (lines tangent to the vector field) of the velocity attribute were also calculated and visualized.

FEM simulations

Finite element simulation codes that run on HPC usually output their calculated results in bursts, at each time step of the analysis. These results are stored, in a single file or in one file per computation node (each corresponding to one partition of the domain), on a centralized, highly efficient file system like Lustre or NFS. In the case of the telescope example, the central NFS file system contains all 128 files corresponding to the 128 subdomains. Table 1 shows, for a single simulated model, the sizes of the data to be handled by the system: size of the mesh, number of attributes per node or particle, number of expected time steps, and number of subdomains for the single simulation.

Table 1. Characteristics of a single simulation to be handled by the VELaSSCo platform:
- Total size of the data for the simulated model: DEM: from 50 gigabytes to petabytes; FEM: from gigabytes to terabytes.
- Number of partitions: DEM: 1 to 10,000; FEM: 1 to 10,000.
- Number of particles / elements: DEM: 10 million particles; FEM: from millions to 1 billion tetrahedra.
- Number of written time steps: DEM: 1 billion; FEM: from 40 to 25,000.
- Number of variables per particle / node: DEM: particles: 12 (3 scalars + 2 vectors) plus user-defined variables (scalars and vectors); contacts: 8 (2 scalars + 2 vectors) plus user-defined variables (scalars and vectors); FEM: 6 (2 scalars + 1 vector) to 16 (8 scalars + 2 vectors).

A more detailed description of the simulation data was provided in deliverable D1.3.
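
To give a sense of the magnitudes in Table 1, the following back-of-the-envelope estimate shows how a DEM run quickly reaches the sizes quoted above. It is a sketch only: it assumes 8-byte double-precision values, particle results only (no contact network), no file-format overhead, and an assumed number of stored time steps.

    # Rough size estimate for one DEM simulation, using the Table 1 figures.
    particles = 10_000_000          # "10 million particles" (Table 1)
    variables_per_particle = 12     # per-particle variables (Table 1)
    bytes_per_value = 8             # assumption: double precision
    saved_time_steps = 10_000       # assumption: only a fraction of the computed steps is written

    bytes_total = particles * variables_per_particle * bytes_per_value * saved_time_steps
    print(f"{bytes_total / 1e12:.1f} TB")   # ~9.6 TB for 10,000 stored steps
    # Writing all of the ~1 billion computed steps at this rate (~0.96 GB per step)
    # would approach the exabyte range, hence the need for time-step reduction.

Even with aggressive subsampling of time steps, a single run sits in the terabyte range, which is why the platform treats data reduction and progressive access as first-class concerns.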

In the initial scenario of the project, the simulation data is already present and is to be ingested into the platform from existing files. To avoid the redundant storage of the data both in files, from the simulation programs, and inside the VELaSSCo platform, a more useful scenario contemplates that the results being calculated by the simulation programs are fed directly to the VELaSSCo platform, by means of Flume agents. To support this last scenario, the project will also provide a Data Ingestion library that sends the results to the platform at each time step of the simulation. The connection between the simulation program and the VELaSSCo platform is shown in Figure 7.

Figure 7. Simulation program with the DataIngestion library to send results to the VELaSSCo platform.

Kratos Multi-Physics is a free, open-source framework for the development of multidisciplinary solvers, developed at CIMNE. This simulation program was used to generate the data provided by CIMNE, using the free GiDPost library. Kratos Multi-Physics has also been successfully ported to HPC environments, as shown in Figure 8 [17].

Figure 8. Speedup achieved on the telescope problem on the MareNostrum supercomputer [19]

In this case, the VELaSSCo Data Ingestion library will be integrated into the GiDPost library that is already used by the simulation program to output the calculated results. This way, the interaction with the VELaSSCo platform will be transparent to the simulation code, requiring only to set the destination of the result data (VELaSSCo_platform instead of GiD_binary_files) and the access credentials (user_name and password) in the unavoidable initialization of the library. This will constitute the test case for the previously mentioned scenario of ingesting simulation data into the VELaSSCo platform from a running simulation program.

Discrete based simulations

In the case of discrete-based simulations, the raw data is produced by a DEM simulation solver. There already exist many different DEM simulation solvers that are extensively used for both scientific and industrial applications. Some of them are commercial software, such as EDEM, PFC, StarCCM+ or DEMpack, but there is also a wide range of open-source codes such as LAMMPS, LIGGGHTS, MercuryDPM or Yade. DEM computations are massively parallelizable using spatial domain decomposition and data-passing protocols such as MPI; therefore, most DEM simulation solvers are capable of working in distributed environments such as traditional HPC systems.

In the case of DEM solvers, the calculation is based on discrete particles that interact with their neighbours through contact forces. The velocity and position of the particles along the simulation are updated based on the forces acting on them, by means of explicit time integration of Newton's laws. Thus, for each time step of the simulation, the solver produces data related to the position, properties (mass, volume, etc.) and results (velocity, angular velocity, etc.) of the particles and to the contact-force network. A more detailed description of the simulation data was provided in deliverable D1.3. The writing of data to files is conducted using a user-predefined saving interval. Typically, the simulation solver produces either single files that contain the whole simulation data for all time steps, or one data file per time step. Moreover, some DEM solvers have the capability to save the data in a distributed way, i.e. each node or processor writes the data of the particles and contacts that it processes.

The data produced by the simulation solver needs to be ingested into the VELaSSCo platform for the post-processing and analysis of the results. To this end, the simulation data contained in files is read in order to ingest the data into the Big Data tables of the VELaSSCo platform. For the first prototype, it is assumed that the simulations have already finished before the data ingestion process is triggered. Nevertheless, for the final version of the platform, we expect to explore the possibility of ingesting the simulation data progressively while the simulation is running. In this latter case, a special triggering mechanism should be implemented in the VELaSSCo DataIngestion library (see Figure 7) in order to ingest the data into the platform in an on-line way.

2.2.2 Ingestion

This component is focused on the process of data ingestion. It is in charge of the communication between the simulation nodes and the storage layer, and it performs a formatting task on the datasets coming from the HPC nodes. Figure 7 displays the two main blocks involved in the ingestion process:
1. The Simulation module, in charge of generating simulation data files after each simulation process completes. These files are used as input for the Ingestion & processing module.
2. The Ingestion & processing part, which takes the simulation data files as input and processes their information in order to store it into the database. To do so, this module runs an ETL (Extract, Transform, Load) process where each simulation type is identified and processed in a specific way.
Figure 9 shows that the simulation results being generated are ingested into the VELaSSCo platform using the DataIngestion library, and that the Ingestion & processing module inserts this data into the Storage module, which, depending on the scenario, stores the data in HDFS or in the EDM engine.
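
To make the on-line scenario concrete, the sketch below shows the shape of a per-time-step push from a running solver to the ingestion service. The endpoint URL, parameter names and helper function are hypothetical placeholders, loosely modelled on the POST example given later in this section; they are not the actual VELaSSCo DataIngestion API.

    import json
    import requests  # assumption: the DataInjector is exposed as a RESTful HTTP service

    # Hypothetical endpoint and parameter names; illustrative only.
    INJECTOR_URL = "http://velassco-platform.example:8080/inject"

    def send_time_step(step, part_id, particles, auth=("user_name", "password")):
        """POST the results of one partition for one saved time step to the injector."""
        params = {"simulationName": "DEM_box", "analysisName": "p3w", "partID": part_id}
        payload = json.dumps({"timeStep": step, "particles": particles})
        response = requests.post(INJECTOR_URL, params=params, data=payload,
                                 auth=auth, timeout=60)
        response.raise_for_status()

    if __name__ == "__main__":
        # Stand-in for the solver loop: two saved steps of a tiny two-particle model.
        for step in range(2):
            particles = [{"id": i, "x": 0.1 * i, "y": 0.0, "z": 0.0,
                          "vx": 1.0, "vy": 0.0, "vz": 0.0} for i in range(2)]
            send_time_step(step, part_id=0, particles=particles)

In such a scheme the solver never touches the result files itself; each partition pushes its data as soon as a time step is saved, which is the behaviour the DataIngestion library is meant to provide.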

Figure 9. VELaSSCo platform ingestion sub-workflow, from the simulation program, on the left, to the Ingestion & processing component and the Storage module, in the data layer of the VELaSSCo platform.

The implementation of the Ingestion module makes use of different tools and services to achieve the predefined functionalities. Basically, three functional blocks can be observed:

1. The DataInjectorInstance component (described in the VQueries chapter, Section 3.4) will be deployed as a RESTful service. This first implementation allows asynchronous communication between the Simulation and Ingestion modules. The communication pipeline is a web service callable through HTTP methods; usually a POST method is used to send simulation data to the processing module. Example:
   URL:
   Parameters: simulationname=dem_box& analysisname=p3w& partid=

2. The second tool used to store information in the databases is Apache Flume, which is in charge of delivering large amounts of data through different agents. Flume implements a simple and flexible architecture based on streaming data flows, and it provides an easy integration with several NoSQL databases, such as HBase, which is the one chosen for our implementation. In this context, it is important to describe how Apache Flume agents are integrated with the final data model. A Flume agent has to be configured and deployed; for this it is necessary to indicate the table and data model it points to. This is done through the Flume properties file:

    # The configuration file needs to define the sources, the channels and the sinks.
    # Sources, channels and sinks are defined per agent, in this case called 'agent'
    agent.sources = avrosource
    agent.channels = channel1
    agent.sinks = hbasesink

    agent.sources.avrosource.type = avro
    agent.sources.avrosource.channels = channel1
    agent.sources.avrosource.bind =
    agent.sources.avrosource.port = 61616
    agent.sources.avrosource.interceptors = i1
    agent.sources.avrosource.interceptors.i1.type = timestamp

    agent.channels.channel1.type = memory
    agent.channels.channel1.capacity =
    agent.channels.channel1.transactionCapacity = 10000
    agent.channels.channel1.byteCapacityBufferPercentage = 20
    agent.channels.channel1.byteCapacity =

    agent.sinks.hbasesink.type = hbase
    agent.sinks.hbasesink.channel = channel1
    agent.sinks.hbasesink.table = velassco_models
    # filling second column
    agent.sinks.hbasesink.columnFamily = tableinformation
    agent.sinks.hbasesink.batchSize = 5000
    # splitting input parameters
    agent.sinks.hbasesink.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
    agent.sinks.hbasesink.serializer.regex = (.+) (.+) (.+) (.+) (.+) (.+)$
    agent.sinks.hbasesink.serializer.colNames = row_key,simulationid,boundingbox,validationstatus,numberpart,otherdata
    agent.sinks.hbasesink.serializer.rowKeyIndex = 0
    agent.sinks.hbasesink.serializer.row_key = row_key

The Flume agent configuration above specifies the table name, column family, column names and row-key information that HBase requires in order to store the data transported by the agent.

3. HBase is the NoSQL database chosen to represent the simulation process information and to provide access methods to retrieve that information efficiently. HBase is a column-oriented NoSQL database, which allows inserting information under different column names within each column family (CF). Therefore, the number of CFs defined in the data model is one of the important aspects for storing and delivering this information efficiently. Currently, three tables have been defined in order to satisfy all data-model requirements:
   o VELaSSCo_Models: stores general information about simulations already processed and stored, like the size of the simulation and its validation/verification status.

   o <simulation_id>_metadata: stores metadata related to the simulations, like mesh type and result type information.
   o <simulation_id>_simulationdata: stores the simulation data itself, like coordinates, element connectivities and result values.

Besides this, HBase will provide a data-service access layer, which could eventually be offered in an accessible manner. In this context, it can easily be integrated with other tools that provide an accessibility layer; for instance, Apache Hive facilitates this integration as well as querying and managing large datasets residing in distributed storage:

Figure 10. Apache HBase and Hive integration.
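
As an illustration of that integration, the sketch below queries an HBase-backed table through Hive's SQL interface from Python. It is a sketch only: it assumes a HiveServer2 instance is reachable on the default port and that a Hive external table named velassco_models has been mapped onto the corresponding HBase table; the PyHive client library is one common way to issue such queries.

    from pyhive import hive  # assumption: a HiveServer2 instance is running and reachable

    # Host name and user are placeholders.
    conn = hive.Connection(host="hive-server.example", port=10000, username="velassco")
    cursor = conn.cursor()

    # Assumption: "velassco_models" is a Hive external table mapped onto the HBase
    # table of the same name, so the same rows can be queried with plain SQL.
    cursor.execute("SELECT simulationid, validationstatus FROM velassco_models LIMIT 10")
    for simulation_id, status in cursor.fetchall():
        print(simulation_id, status)

    conn.close()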

2.2.3 Storage

Different tools are included in this component, most of which already exist, see Figure 11. These applications are part of the Hadoop ecosystem; in addition to the scenario where the simulation data is stored in HBase tables on HDFS, we will also use the Jotne repository EXPRESS Data Manager (EDM) for the storage of engineering objects. Using Hadoop provides a fully extensible storage framework for the VELaSSCo platform. As stated in D2.1, Hadoop already supports multiple file systems to store data, and it is also possible to use alternative storage solutions, such as the EDM DBMS. Developers and research communities have developed several extensions to Hadoop based on traditional database systems. In the case of this project, we will use this extensibility to provide the most suitable storage platform. We have already identified some plugins for indexing data and for extending the Hadoop storage to support the EDM database as a storage system.

Figure 11. Zoomed view on the data storage layer of the platform.

Hadoop provides all the necessary tools to distribute data among several nodes. This operation is available through HDFS, the virtual file system developed for Hadoop. In order not to force the use of this virtual file system, an abstraction layer named HadoopAbstractFileSystem was developed; with it, it is possible to extend the Hadoop storage with any kind of file system. Several examples have already been developed, such as QFS or KFS.

As stated in the NIST reference document, see Figure 12, storage can be specialized into two categories: based on a file-system paradigm, or on an indexing methodology. In a file-system environment, it is possible to benefit from the organization (centralized versus distributed), and a file system also provides an organized structure for files, controlled by a file-storage strategy: delimited, with a fixed-length parameter, or using binary storage. In an indexed paradigm, data can be retrieved efficiently using different strategies: relational databases, key-value stores, column stores, document-oriented stores and graph stores. In VELaSSCo we will extend this indexed paradigm to include object storage that is compliant with ISO 10303, STEP, using EDM.

Thanks to the extensibility of Hadoop, it is possible to use multiple strategies to access data. Several plug-ins have been developed to extend the Hadoop file system; one example is an indexing strategy linked to HDFS. In order to increase the access speed, multiple solutions can be used at the same time. In this project, we plan to use at least two plugins offering different data-access strategies and, thus, different ways to increase data-access speed. The first one is HBase, and another one can be Hive (these tools are designed to access data in batch; a specific process will be necessary to extract content in real time). To provide faster access than these two tools (and still fit the real-time requirement), the access strategy can be extended with Phoenix.
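
As an illustration of the HBase access path mentioned above, the following sketch scans a few rows through the HBase Thrift gateway using the happybase Python client. The table name, row-key layout and column-family name are assumptions following the naming sketched in the previous subsection, not the project's actual schema.

    import happybase  # Python client that talks to the HBase Thrift server

    # Assumptions: an HBase Thrift server runs on this host, and the table follows the
    # "<simulation_id>_simulationdata" naming introduced earlier.
    connection = happybase.Connection("hbase-thrift-host", port=9090)
    table = connection.table("dem_box_simulationdata")

    # Scan a bounded row-key range, e.g. all rows of one partition and time step,
    # fetching only the result-values column family (assumed here to be named "R").
    for row_key, columns in table.scan(row_prefix=b"part0000_step000010_",
                                       columns=[b"R"], limit=100):
        print(row_key, len(columns), "result cells")

    connection.close()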

Figure 12. NIST Big Data Interoperability Framework: Data Organization

For the commercial version of the VELaSSCo platform, we will provide plug-ins for the storage solution of the Jotne partner, the EDM database. Jotne has developed an object-oriented database specially designed to store engineering data. Hadoop will be extended by two EDM plug-ins. As depicted in Figure 11 and in Figure 14, one EDM plugin will allow the EDM DB to read files from the Hadoop file system and to write to it: the VELaSSCo test models will be read, whereas the EDM indexed database files will be written to HDFS; the latter ports the EDM DBMS to the distributed storage infrastructure. Figure 14 shows that the second plug-in resides in the YARN module: it translates the Query Manager queries into EDM-compliant queries and returns the results in a format that is readable by the VELaSSCo YARN implementation.

Figure 13. EDM plug-in for the data injection and direct data access.

These two EDM plug-ins will be the gateway between Hadoop and EDM.

2.2.4 Access of Storage

Figure 14. EDM extension to Hadoop for query access.

To avoid complex communication between the engine and data layers, an access component will be developed. This component is in charge of receiving queries from the engine layer and mapping them to the correct access plugin (HBase, Hive, etc.). With this strategy, the engine layer issues only one kind of query, and this query is directly mapped to the correct access software. Real-time and batch queries are handled by mapping them to the corresponding module (HBase, Phoenix, etc.).

Figure 15. External communication component for the storage layer.

All the communications with this component are represented in Figure 15. This component will interact with its sub-modules using the Thrift APIs provided by these applications; only the HDFS I/O is performed using the CLI (command-line interface) API.
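
The routing role of this access component can be pictured with a small dispatch sketch: a single entry point receives a query descriptor from the engine layer and forwards it to the HBase, Hive or HDFS back end, depending on whether a real-time or batch path is requested. The class and method names below are hypothetical; the real component will expose its interface through Thrift.

    from dataclasses import dataclass

    @dataclass
    class StorageQuery:
        """Minimal query descriptor handed down by the engine layer (illustrative only)."""
        table: str
        predicate: str
        mode: str  # "realtime" or "batch"

    class StorageAccess:
        """Single entry point hiding HBase/Hive/HDFS behind one call (sketch)."""
        def execute(self, query: StorageQuery):
            if query.mode == "realtime":
                return self._hbase_scan(query)   # low-latency path (HBase/Phoenix)
            return self._hive_job(query)         # batch path (Hive over HDFS)

        def _hbase_scan(self, query: StorageQuery):
            return f"HBase scan on {query.table} where {query.predicate}"

        def _hive_job(self, query: StorageQuery):
            return f"Hive query: SELECT * FROM {query.table} WHERE {query.predicate}"

    if __name__ == "__main__":
        access = StorageAccess()
        print(access.execute(StorageQuery("dem_box_simulationdata", "timestep=10", "realtime")))
        print(access.execute(StorageQuery("dem_box_simulationdata", "timestep<1000", "batch")))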

2.2.5 Analytics

The analytics module is in charge of analyzing and processing the stored data. This processing aims to produce new information in order to answer a request from the Query Manager (QM). To ensure a fast production of the desired data, the QM can ask for information to be produced using two different methods; analytics can thus be performed using multiple solutions, and these solutions can be triggered at the same time. The module will also provide a cost estimation of an analytics query, to help the Query Manager evaluate in which mode the analytics should be executed. For time-consuming queries the QM will trigger two analytics queries at the same time: one over a simplified version of the model, to provide fast feedback to the user, and another one over the full-resolution model. In collaboration with the graphics module, the results of the queries will be returned to the client using streaming, progressive and render-efficient protocols and formats.

Figure 16. Analytics module and its relation with other VELaSSCo modules.

This component is in charge of some specific queries that have already been identified. An example is GetBoundaryOfAMesh(), which extracts the boundary of a mesh from simulation model data stored in the VELaSSCo data layer. The workflow of the query can be observed in Figure 17. In this specific example, the analytics module is in charge of the operation CalculateBoundaryOfAMesh, which is composed of several components involving the data storage module and the analytics module (see Figure 18). Following MapReduce v2 (MR) on YARN, the computation pipeline of this query can be described as follows (a simplified sketch of the map and reduce phases is given at the end of this subsection):

1. Select the proper YARN application depending on the element type of the mesh.
2. In the map phase of the application: extract from the data storage the elements of the mesh and simulation that the user specified in the query (component GetElementsOfLocalMesh).
3. Still in the map phase: from the element data extracted from the data storage, the analytics module computes the unique triangles/lines of the volume/surface mesh, i.e. the boundary of the mesh.
4. In the reduce phase of the application: all the partial unique triangles/lines computed in the previous step are joined together, and the repeated triangles/lines are eliminated.
5. Finally, the whole boundary mesh is formatted for drawing by the graphics module.

Figure 17. Workflow of the VQuery GetBoundaryOfAMesh().

Figure 18. Components of the CalculateBoundaryOfAMesh operation from the Analytics module.

The communication pipeline to compute this query includes communication between the different modules of the platform:
o Query Manager - Analytics: the query manager requests the computation of the query from the analytics module, together with the input parameters of the query previously specified by the user.
o Analytics - Data storage: the analytics module receives from the data storage module the data of the elements of the mesh.
o Analytics - Query Manager: the analytics module sends the result of the computation to the query manager, which will send it to the graphics module for formatting.
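
The simplified sketch announced above: the "map" step emits every face of every tetrahedron under a canonical key, and the "reduce" step keeps only the faces seen exactly once, which are by definition the boundary faces. This is an in-memory illustration restricted to tetrahedral meshes, not the project's actual YARN application.

    from collections import Counter
    from itertools import combinations

    def map_faces(tetrahedra):
        """Map phase (sketch): emit the four triangular faces of each tetrahedron,
        with vertex ids sorted so that shared faces produce the same key."""
        for tet in tetrahedra:
            for face in combinations(tet, 3):
                yield tuple(sorted(face))

    def reduce_boundary(face_stream):
        """Reduce phase (sketch): a face shared by two tetrahedra appears twice and is
        interior; a face that appears exactly once belongs to the boundary."""
        counts = Counter(face_stream)
        return [face for face, n in counts.items() if n == 1]

    if __name__ == "__main__":
        # Two tetrahedra sharing the face (0, 1, 2): 6 of the 8 faces are on the boundary.
        tets = [(0, 1, 2, 3), (0, 1, 2, 4)]
        boundary = reduce_boundary(map_faces(tets))
        print(len(boundary), "boundary faces")   # -> 6

In the real application the face stream is partitioned across mappers, so each mapper produces partial unique faces and the reducers eliminate the duplicates globally, as described in steps 3 and 4 above.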

2.2.6 Query Manager

The goal of this component is to manage the VELaSSCo platform by providing everything necessary to communicate with the users (through visualization) and with the sub-modules of the platform. This component directly interacts with YARN (the Hadoop scheduler). It is one of the most complex modules: because it is in charge of the communication, it must also understand the data flow of the platform and ensure some feedback to the user while time-consuming queries are being executed. This component has a smart feature that enables pre-executing some queries in order to reduce their execution time. For this module, we have targeted two kinds of queries: simple and complex ones.

This module is the manager module of VELaSSCo; all queries are redirected to it. It is in charge of providing all the necessary mechanisms to communicate with all components; its goal is to simplify the communication process between all modules, and also between layers. When a query sent by the visualization tool is received by the QM, this query is analyzed and decomposed into sub-queries, called operations. This decomposition depends on the topic of the query and also on the desired response time. To retrieve the information faster, asynchronous queries can be triggered. The module studies the query and executes the desired computation on the data; for example, it can extract information from a coarse resolution of the dataset in order to provide a faster result.

As stated earlier, this module decomposes queries. The produced queries can be twofold: simple and complex. A simple query directly retrieves information from the storage layer, while a complex query produces content from a computation. This decomposition into multiple queries brings some complexity to the system; in fact, a query can be expressed as different aggregations of queries. Thus, it will be necessary to provide an evaluation utility to the QM to ensure the best decomposition of a query. Even with this tool, however, performance can turn out to be lower than planned.

Figure 19. Query Manager of VELaSSCo.

The flow process of this module is the following:
- A query is received from the client.
- The QM analyses this query and determines the most suitable decomposition. Multiple solutions are possible:
  o gather data directly from the storage layer;
  o execute an analysis on stored data;
  o execute an asynchronous query, which can in turn gather data directly from the storage layer or execute an analysis on stored data.
- When the result of one of the previous steps is available:
  o the QM asks the graphics module to gather the necessary information;
  o the graphics module gathers the data and compresses the dataset into a suitable GPU-friendly format;
  o the information is sent back to the QM;
  o the QM receives this information and sends it to the visualization platform.

The communication process is presented in Figure 20. In this figure, two execution workflows are shown: the workflow with black hexagons represents the simulation data flow (from the compute nodes to the storage layer), while the workflow composed of purple circles represents the communication pipeline between a user and the storage layer.

Figure 20. Communication pipeline for both solutions (for server and for user).

2.2.7 Visualization on the User Workstation

A user accesses the VELaSSCo platform by operating a local visualization client (see Figure 22). The visualization client is separated from the database infrastructure and communicates remotely with the platform by sending queries and receiving results. In order to exchange information with the platform, the visualization client makes use of the VELaSSCo access library as a communication layer. The access library provides a specific application programming interface (API) for managing queries and results. It can be linked to a visualization engine, which handles user input and displays query results on a local workstation. As part of the initial VELaSSCo implementation, GiD (CIMNE) [14] and ifx (Fraunhofer) [15] are employed as visualization engines. Both GiD and ifx feature a plugin mechanism to enable extensions. To attach the access library to the visualization engines, a plugin for each framework will be developed, and each plugin will be linked to the library. Keeping platform access in a separate library allows targeting other frameworks besides GiD and ifx.

Figure 21. Graphics module in the VELaSSCo platform.

Figure 22. Visualization client with the Access library to communicate with the VELaSSCo platform.

On the client side, a user interacts with the graphical user interface of one of the visualization engines. User actions are translated by the plugin component into a query message that is sent to the access library. To communicate with the engine layer, the library makes use of Apache Thrift [13], which interchanges information between the access library and the query manager module in the VELaSSCo engine layer (see above). Sending a query will trigger either the retrieval of simulation results (to be visualized on the client) or the computation of analysis algorithms over the simulation data (also to be rendered on the client). It is also possible to retrieve partial simulation data to be post-processed in the visualization client. The resulting data is sent back the same way to the visualization engine, which then displays or processes the data. The scheme in Figure 23 shows the steps that follow a request initiated by the user with the visualization client: the request is mapped and formatted to a Vquery (VELaSSCo query), which is then packed and sent to the platform. The platform unpacks it and passes the Vquery to the QueryManager for processing.

Figure 23. Workflow of a VQuery initiated by the user on the visualization client, blue box at the top, and received by the QueryManager module of the VELaSSCo platform, below.
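
Since the access library talks to the query manager over Apache Thrift, the client side of this exchange can be pictured as follows. The transport and protocol lines are standard Thrift usage; the QueryManager client class that would normally be generated from the VELaSSCo IDL does not exist here, so it is left as a commented, hypothetical placeholder.

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol

    def open_connection(host="velassco-platform.example", port=9090):
        """Open a buffered, binary-protocol Thrift connection (sketch)."""
        socket = TSocket.TSocket(host, port)
        transport = TTransport.TBufferedTransport(socket)
        protocol = TBinaryProtocol.TBinaryProtocol(transport)
        transport.open()
        return transport, protocol

    if __name__ == "__main__":
        transport, protocol = open_connection()
        # With the generated stub, a visualization plugin would now do something like:
        #   client = QueryManager.Client(protocol)                 # hypothetical service name
        #   mesh = client.getBoundaryOfAMesh("s1", "DEM_box", 10)  # hypothetical Vquery call
        # and hand the returned, GPU-ready buffers to the renderer.
        transport.close()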

Figure 24 shows the operations present in the platform access library that perform the previously detailed steps.

Figure 24. Components involved in the operations conforming the PlatformAccess library, tied to the visualization client.

On the server side, the graphics module within the platform's engine layer is responsible for preparing query results, in such a way that the information can be displayed by the visualization engines at high speed with minimal latencies. To this end, the graphics module converts the data into an internal format. The data structures resulting from this conversion are handed over to the query manager module, which sends them back to the visualization client. This workflow is reflected in the scheme of Figure 25.

Figure 25. Workflow of the returning results of a processed VQuery, top white box, which are formatted, packed and sent to the platform access library, which in turn hands the data to the visualization client.

Figure 26 shows the components involved in the operations that handle the reception of the results of the processed Vqueries on the client side.

Figure 26. Components involved in the operations conforming the PlatformAccess library, tied to the visualization client.
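
The kind of conversion the graphics module performs can be sketched as follows: per-node coordinates and a scalar result are packed into flat, GPU-friendly vertex and index buffers. The interleaved layout, 32-bit types and function name are assumptions for illustration, not the module's actual internal format.

    import numpy as np

    def to_gpu_buffers(nodes, triangles, result):
        """nodes: (N,3) coords, triangles: (M,3) node indices, result: (N,) scalar field.
        Returns interleaved float32 vertices (x, y, z, value) and flat uint32 indices."""
        vertices = np.hstack([np.asarray(nodes, dtype=np.float32),
                              np.asarray(result, dtype=np.float32).reshape(-1, 1)])
        indices = np.asarray(triangles, dtype=np.uint32).ravel()
        return vertices, indices

    if __name__ == "__main__":
        nodes = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
        tris = [[0, 1, 2], [0, 1, 3]]
        pressure = [1.0, 2.0, 3.0, 4.0]
        v, i = to_gpu_buffers(nodes, tris, pressure)
        print(v.shape, v.dtype, i.shape, i.dtype)   # (4, 4) float32 (6,) uint32

Buffers in this shape can be uploaded to the GPU as-is by the visualization engines, which is the point of doing the conversion on the server with HPC resources rather than on the client.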

3. Decomposition in Vqueries

In this project, the workload execution is expressed by queries named VELaSSCo queries (VQueries). A VQuery is a global functionality of the VELaSSCo platform; it can express functionality at the user level and also at the ingestion level. A VQuery is an aggregation of operations (each of which is an aggregate of components), which can act at different levels. The queries will be implemented for the two storage solutions in VELaSSCo: HBase and EDM. All of the preliminary queries belong to one of four classes, which are presented and discussed in the rest of this section.

3.1 Session Queries

The group of session queries provides the frame for access to the simulation content data. They manage user login with the corresponding session handling, control access to models, and maintain model metadata such as thumbnails and validation information. The specification of these queries has shown the need for the following modules in the VELaSSCo architecture:
- user access management
- session handling
- model administration
All of these modules will be distinct building blocks of the VELaSSCo platform, independent of the storage solutions. The functionalities will need to be mapped to the corresponding modules and data in EDM and HBase.

3.2 Direct Result Queries

This VQuery class defines queries that directly interact with the storage layer. It is currently decomposed into 12 queries, with two main objectives: extract information or delete information. The extracting queries gather the information stored in the storage layer. The information is stored following a hierarchical decomposition:
- the access point of a dataset is the model;
- a model may contain a static mesh and one or more analyses;
- an analysis contains one or several time steps;
- a time step may contain meshes, in the case of dynamic meshes;
- a mesh contains elements.
All sub-parts of the dataset can be accessed using different queries; for example, a vertex can be extracted by its ID or by its coordinates. For the deleting queries, different queries are provided to remove each part of the dataset; they delete information recursively according to which data has to be removed.

3.3 Result Analysis Queries

The Result Analysis Queries (RAQ) include queries that conduct computations over the simulation data in order to produce new results that help to understand the original raw results from the simulation solver. Currently, this VQuery class is composed of 4 queries:
- GetBoundingBox
- GetResultForPoints
- GetBoundaryOfAMesh
- GetDiscrete2ContinuumOfAModel
In all cases, the RAQ involve operations and/or components related to the Data Storage module that extract the whole or part of the simulation data models. The simulation data is stored in the storage layer and the extraction of data depends on the input arguments (model id, mesh id, time steps, etc.) of the RAQ specified by the user. In some cases the new results produced by the RAQ need to be saved temporarily or permanently inside the platform; these new results will be stored in memory, in files or in the HBase tables of the data storage layer, depending on their size.

3.4 Data Ingestion Queries

Data Ingestion Queries (DIQ) are one of the VQuery families defined in this chapter. This family focuses on the process of data insertion into the persistence layer and is composed of one component and five operations, described in the workflow below. As exposed in Figure 27, the Data Ingestion Query family is composed of only one component that satisfies all the described operations. This component (DataInjectorInstance) is in charge of managing all the logic associated with the data ingestion process through five main operations:
- GetInjectorInstance: creates an instance of the DataInjector component deployed on the HPC platform.
- InjectSimulationData: once the DataInjector component is instantiated, this operation reads the simulation data files and inserts them into the final databases.
- RunETLProcess: runs the Extract, Transform and Load process, where each type of simulation data is processed properly before being sent to the databases.
- SendDataToPipeline: sends the data to the datastores (HBase); it is implemented using Apache Flume as the software used to synchronize events carrying simulation data with the HBase database accesses.
- WriteDataIntoNOSQLStorage: finally, this operation writes the data received from the Flume agents into the HBase tables.
The workflow depicted represents the main functionalities identified during the VQueries definition. During the implementation phase, the component will be deployed on the HPC infrastructure and on the HPC cloud, respectively.
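
Read as a sequence, the five operations chain into a single ingestion pass, which the plain-Python sketch below illustrates. Every function body is a placeholder for the corresponding component described above (the REST DataInjector, the ETL step, the Flume pipeline and the HBase sink); only the order of the calls is the point.

    def get_injector_instance():
        """GetInjectorInstance: obtain a handle on the deployed DataInjector (placeholder)."""
        return {"endpoint": "http://velassco-platform.example:8080/inject"}  # hypothetical

    def inject_simulation_data(injector, files):
        """InjectSimulationData: read each simulation data file and push it onwards."""
        return [run_etl_process(injector, f) for f in files]

    def run_etl_process(injector, simulation_file):
        """RunETLProcess: extract and transform one file into insertable rows (placeholder)."""
        record = {"source": simulation_file, "rows": ["row1", "row2"]}
        return send_data_to_pipeline(injector, record)

    def send_data_to_pipeline(injector, record):
        """SendDataToPipeline: in the platform this hands events to a Flume agent."""
        return write_data_into_nosql_storage(record)

    def write_data_into_nosql_storage(record):
        """WriteDataIntoNOSQLStorage: the Flume HBase sink persists the rows."""
        print(f"writing {len(record['rows'])} rows from {record['source']} into HBase")
        return True

    if __name__ == "__main__":
        injector = get_injector_instance()
        inject_simulation_data(injector, ["DEM_box_step0001.res", "DEM_box_step0002.res"])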


More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

Apache HBase. Crazy dances on the elephant back

Apache HBase. Crazy dances on the elephant back Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage

More information

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets!! Large data collections appear in many scientific domains like climate studies.!! Users and

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

How to Choose Between Hadoop, NoSQL and RDBMS

How to Choose Between Hadoop, NoSQL and RDBMS How to Choose Between Hadoop, NoSQL and RDBMS Keywords: Jean-Pierre Dijcks Oracle Redwood City, CA, USA Big Data, Hadoop, NoSQL Database, Relational Database, SQL, Security, Performance Introduction A

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction

More information

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first

More information

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon.

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William sampd@stumbleupon. Building Scalable Big Data Infrastructure Using Open Source Software Sam William sampd@stumbleupon. What is StumbleUpon? Help users find content they did not expect to find The best way to discover new

More information

Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014

Big Data. White Paper. Big Data Executive Overview WP-BD-10312014-01. Jafar Shunnar & Dan Raver. Page 1 Last Updated 11-10-2014 White Paper Big Data Executive Overview WP-BD-10312014-01 By Jafar Shunnar & Dan Raver Page 1 Last Updated 11-10-2014 Table of Contents Section 01 Big Data Facts Page 3-4 Section 02 What is Big Data? Page

More information

Testing Big data is one of the biggest

Testing Big data is one of the biggest Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing

More information

Unified Batch & Stream Processing Platform

Unified Batch & Stream Processing Platform Unified Batch & Stream Processing Platform Himanshu Bari Director Product Management Most Big Data Use Cases Are About Improving/Re-write EXISTING solutions To KNOWN problems Current Solutions Were Built

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel Big Data and Analytics: Getting Started with ArcGIS Mike Park Erik Hoel Agenda Overview of big data Distributed computation User experience Data management Big data What is it? Big Data is a loosely defined

More information

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments Important Notice 2010-2015 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and

More information

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

Diagram 1: Islands of storage across a digital broadcast workflow

Diagram 1: Islands of storage across a digital broadcast workflow XOR MEDIA CLOUD AQUA Big Data and Traditional Storage The era of big data imposes new challenges on the storage technology industry. As companies accumulate massive amounts of data from video, sound, database,

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

Certified Big Data and Apache Hadoop Developer VS-1221

Certified Big Data and Apache Hadoop Developer VS-1221 Certified Big Data and Apache Hadoop Developer VS-1221 Certified Big Data and Apache Hadoop Developer Certification Code VS-1221 Vskills certification for Big Data and Apache Hadoop Developer Certification

More information

On a Hadoop-based Analytics Service System

On a Hadoop-based Analytics Service System Int. J. Advance Soft Compu. Appl, Vol. 7, No. 1, March 2015 ISSN 2074-8523 On a Hadoop-based Analytics Service System Mikyoung Lee, Hanmin Jung, and Minhee Cho Korea Institute of Science and Technology

More information

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction

More information

Integrating VoltDB with Hadoop

Integrating VoltDB with Hadoop The NewSQL database you ll never outgrow Integrating with Hadoop Hadoop is an open source framework for managing and manipulating massive volumes of data. is an database for handling high velocity data.

More information

Mitra Innovation Leverages WSO2's Open Source Middleware to Build BIM Exchange Platform

Mitra Innovation Leverages WSO2's Open Source Middleware to Build BIM Exchange Platform Mitra Innovation Leverages WSO2's Open Source Middleware to Build BIM Exchange Platform May 2015 Contents 1. Introduction... 3 2. What is BIM... 3 2.1. History of BIM... 3 2.2. Why Implement BIM... 4 2.3.

More information

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013 Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software SC13, November, 2013 Agenda Abstract Opportunity: HPC Adoption of Big Data Analytics on Apache

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science A Seminar report On Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science SUBMITTED TO: www.studymafia.org SUBMITTED BY: www.studymafia.org

More information

BIG DATA TRENDS AND TECHNOLOGIES

BIG DATA TRENDS AND TECHNOLOGIES BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using onhand database management tools.

More information

So What s the Big Deal?

So What s the Big Deal? So What s the Big Deal? Presentation Agenda Introduction What is Big Data? So What is the Big Deal? Big Data Technologies Identifying Big Data Opportunities Conducting a Big Data Proof of Concept Big Data

More information

Comparing SQL and NOSQL databases

Comparing SQL and NOSQL databases COSC 6397 Big Data Analytics Data Formats (II) HBase Edgar Gabriel Spring 2015 Comparing SQL and NOSQL databases Types Development History Data Storage Model SQL One type (SQL database) with minor variations

More information

How To Create A Data Visualization With Apache Spark And Zeppelin 2.5.3.5

How To Create A Data Visualization With Apache Spark And Zeppelin 2.5.3.5 Big Data Visualization using Apache Spark and Zeppelin Prajod Vettiyattil, Software Architect, Wipro Agenda Big Data and Ecosystem tools Apache Spark Apache Zeppelin Data Visualization Combining Spark

More information

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the

More information

Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

Hybrid Software Architectures for Big Data. Laurence.Hubert@hurence.com @hurence http://www.hurence.com

Hybrid Software Architectures for Big Data. Laurence.Hubert@hurence.com @hurence http://www.hurence.com Hybrid Software Architectures for Big Data Laurence.Hubert@hurence.com @hurence http://www.hurence.com Headquarters : Grenoble Pure player Expert level consulting Training R&D Big Data X-data hot-line

More information

Apache Hadoop FileSystem and its Usage in Facebook

Apache Hadoop FileSystem and its Usage in Facebook Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs

More information

International Journal of Innovative Research in Information Security (IJIRIS) ISSN: 2349-7017(O) Volume 1 Issue 3 (September 2014)

International Journal of Innovative Research in Information Security (IJIRIS) ISSN: 2349-7017(O) Volume 1 Issue 3 (September 2014) SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE N.Alamelu Menaka * Department of Computer Applications Dr.Jabasheela Department of Computer Applications Abstract-We are in the age of big data which

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

Introduction to Big Data Training

Introduction to Big Data Training Introduction to Big Data Training The quickest way to be introduce with NOSQL/BIG DATA offerings Learn and experience Big Data Solutions including Hadoop HDFS, Map Reduce, NoSQL DBs: Document Based DB

More information

PROPOSAL To Develop an Enterprise Scale Disease Modeling Web Portal For Ascel Bio Updated March 2015

PROPOSAL To Develop an Enterprise Scale Disease Modeling Web Portal For Ascel Bio Updated March 2015 Enterprise Scale Disease Modeling Web Portal PROPOSAL To Develop an Enterprise Scale Disease Modeling Web Portal For Ascel Bio Updated March 2015 i Last Updated: 5/8/2015 4:13 PM3/5/2015 10:00 AM Enterprise

More information

Luncheon Webinar Series May 13, 2013

Luncheon Webinar Series May 13, 2013 Luncheon Webinar Series May 13, 2013 InfoSphere DataStage is Big Data Integration Sponsored By: Presented by : Tony Curcio, InfoSphere Product Management 0 InfoSphere DataStage is Big Data Integration

More information

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014 BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014 Ralph Kimball Associates 2014 The Data Warehouse Mission Identify all possible enterprise data assets Select those assets

More information

Designing a Cloud Storage System

Designing a Cloud Storage System Designing a Cloud Storage System End to End Cloud Storage When designing a cloud storage system, there is value in decoupling the system s archival capacity (its ability to persistently store large volumes

More information

Upcoming Announcements

Upcoming Announcements Enterprise Hadoop Enterprise Hadoop Jeff Markham Technical Director, APAC jmarkham@hortonworks.com Page 1 Upcoming Announcements April 2 Hortonworks Platform 2.1 A continued focus on innovation within

More information

Information Architecture

Information Architecture The Bloor Group Actian and The Big Data Information Architecture WHITE PAPER The Actian Big Data Information Architecture Actian and The Big Data Information Architecture Originally founded in 2005 to

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84 Index A Amazon Web Services (AWS), 50, 58 Analytics engine, 21 22 Apache Kafka, 38, 131 Apache S4, 38, 131 Apache Sqoop, 37, 131 Appliance pattern, 104 105 Application architecture, big data analytics

More information

Next-Gen Big Data Analytics using the Spark stack

Next-Gen Big Data Analytics using the Spark stack Next-Gen Big Data Analytics using the Spark stack Jason Dai Chief Architect of Big Data Technologies Software and Services Group, Intel Agenda Overview Apache Spark stack Next-gen big data analytics Our

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

CitusDB Architecture for Real-Time Big Data

CitusDB Architecture for Real-Time Big Data CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

More information

HDP Hadoop From concept to deployment.

HDP Hadoop From concept to deployment. HDP Hadoop From concept to deployment. Ankur Gupta Senior Solutions Engineer Rackspace: Page 41 27 th Jan 2015 Where are you in your Hadoop Journey? A. Researching our options B. Currently evaluating some

More information

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015 Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document

More information

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes Highly competitive enterprises are increasingly finding ways to maximize and accelerate

More information

BIG DATA TECHNOLOGY. Hadoop Ecosystem

BIG DATA TECHNOLOGY. Hadoop Ecosystem BIG DATA TECHNOLOGY Hadoop Ecosystem Agenda Background What is Big Data Solution Objective Introduction to Hadoop Hadoop Ecosystem Hybrid EDW Model Predictive Analysis using Hadoop Conclusion What is Big

More information

Big Data: Tools and Technologies in Big Data

Big Data: Tools and Technologies in Big Data Big Data: Tools and Technologies in Big Data Jaskaran Singh Student Lovely Professional University, Punjab Varun Singla Assistant Professor Lovely Professional University, Punjab ABSTRACT Big data can

More information

I/O Considerations in Big Data Analytics

I/O Considerations in Big Data Analytics Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

Parallel Analysis and Visualization on Cray Compute Node Linux

Parallel Analysis and Visualization on Cray Compute Node Linux Parallel Analysis and Visualization on Cray Compute Node Linux David Pugmire, Oak Ridge National Laboratory and Hank Childs, Lawrence Livermore National Laboratory and Sean Ahern, Oak Ridge National Laboratory

More information

Big Data Challenges in Bioinformatics

Big Data Challenges in Bioinformatics Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Pla?orms Jordi Torres Jordi.Torres@bsc.es Talk outline! We talk about Petabyte?

More information

Enabling High performance Big Data platform with RDMA

Enabling High performance Big Data platform with RDMA Enabling High performance Big Data platform with RDMA Tong Liu HPC Advisory Council Oct 7 th, 2014 Shortcomings of Hadoop Administration tooling Performance Reliability SQL support Backup and recovery

More information

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File

More information

How To Use A Data Center With A Data Farm On A Microsoft Server On A Linux Server On An Ipad Or Ipad (Ortero) On A Cheap Computer (Orropera) On An Uniden (Orran)

How To Use A Data Center With A Data Farm On A Microsoft Server On A Linux Server On An Ipad Or Ipad (Ortero) On A Cheap Computer (Orropera) On An Uniden (Orran) Day with Development Master Class Big Data Management System DW & Big Data Global Leaders Program Jean-Pierre Dijcks Big Data Product Management Server Technologies Part 1 Part 2 Foundation and Architecture

More information

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum Siva Ravada Senior Director of Development Oracle Spatial and MapViewer 2 Evolving Technology Platforms

More information

Big Data Analytics Platform @ Nokia

Big Data Analytics Platform @ Nokia Big Data Analytics Platform @ Nokia 1 Selecting the Right Tool for the Right Workload Yekesa Kosuru Nokia Location & Commerce Strata + Hadoop World NY - Oct 25, 2012 Agenda Big Data Analytics Platform

More information

Testing 3Vs (Volume, Variety and Velocity) of Big Data

Testing 3Vs (Volume, Variety and Velocity) of Big Data Testing 3Vs (Volume, Variety and Velocity) of Big Data 1 A lot happens in the Digital World in 60 seconds 2 What is Big Data Big Data refers to data sets whose size is beyond the ability of commonly used

More information