Description of Work

Project acronym: BigFoot
Project full title: Big Data Analytics of Digital Footprints
Project budget: 3, Euro
Work programme topics addressed: Objective ICT: Cloud Computing, Internet of Services and Advanced Software Engineering
Name of the coordinating person: Pietro Michiardi
Fax:

List of Participants

Role   Number   Name                                        Short name   Country   Date enter   Date exit
CO     1        EURECOM                                     EUR          FR        1            36
CR     2        SYMANTEC                                    SYM          IR        1            36
CR     3        Technische Universität Berlin               TUB          DE        1            36
CR     4        Ecole Polytechnique Federale de Lausanne    EPFL         CH        1            36
CR     5        GridPocket                                  GRIDP        FR        1            36

Role: CO=Coordinator; CR=Contractor.

SEVENTH FRAMEWORK PROGRAMME, THEME FP7-ICT: Cloud Computing, Internet of Services and Advanced Software Engineering

Contents

1 Concept and objectives, progress beyond state-of-the-art, S/T methodology and work plan
   1.1 Concept and objectives
      Context
      Motivations
      Objectives: The BigFoot Approach
      Expected results
      Indicators and success criteria
      Relevance to the topics addressed in the call
   1.2 Progress beyond the state-of-the-art
      Application layer
         Parallel data processing
         Interactive query engines
         Distributed data stores
      Virtualization layer
      Relevant EU-funded projects
      Baseline
   S/T methodology and associated work plan
      Introduction
      Methodology
      Workplan Structure and Breakdown
      Overall System Description
      Usage Scenarios
      Risk and mitigation plans
      Work packages list
      Deliverables list
      List of milestones
2 Implementation
   Consortium as a whole
3 Impact
   Expected impacts listed in the work programme
      Strategic impact
      3.1.2 impacts listed in the work programme
      Scientific impact
      Social and economic impact
      The European dimension of BigFoot
   Plan for the use and dissemination of foreground
      Dissemination and communication strategy
      Exploitation strategies
      Standardization activities

1 Concept and objectives, progress beyond state-of-the-art, S/T methodology and work plan

1.1 Concept and objectives

The aim of BigFoot is to design, implement and evaluate a scalable system for processing and interacting with large volumes of data. The BigFoot software stack allows automatic and self-tuned deployment of data storage and parallel processing services for private cloud deployments, going beyond the best-effort services currently available in the state-of-the-art. The project addresses performance bottlenecks of current solutions and takes a cross-layer approach to system optimization, which is evaluated with a thorough experimental methodology using realistic workloads and datasets. The ultimate goal of the project is to contribute the BigFoot software stack to the open-source community.

Context

The amount of data in our world has been exploding. E-commerce, Internet security and financial applications, billing and customer services, to name a few examples, will continue to fuel exponential growth of large pools of data that can be captured, communicated, aggregated, stored, and analyzed. As companies and organizations go about their business and interact with individuals, they generate a tremendous amount of digital footprints, i.e., raw, unstructured data (for example, log files) created as a by-product of other activities. As discussed in the report in [56], there are many broadly applicable ways to leverage data and create value across sectors of the global economy:

- making data access and interaction simple;
- collecting and processing digital footprints to measure and understand the root causes of product performance and bring it to higher levels;
- leveraging large amounts of data to create highly specific user segmentations and to tailor products and services precisely to users' needs;
- producing sophisticated analytics to improve decision making with automated algorithms;
- using data analysis to create new products and services, enhance existing ones, and invent entirely new business models.

In summary, use of data is a key basis of competition and growth: companies failing to develop their analysis capabilities will fail to understand and leverage the big picture hidden in the data, and hence fall behind. Nowadays, the ability to store, aggregate, and combine large volumes of data and then use the results to perform deep analysis has become ever more accessible, as trends such as Moore's Law in computing, its equivalent in digital storage, and cloud computing continue to lower costs and other technology barriers. However, the means to extract insights from data require remarkable improvements, as software and systems to apply increasingly sophisticated mining techniques are still in their infancy.

Large-data problems require a distinct approach that sometimes runs counter to traditional models of computing. In BigFoot, we depart from high-performance computing applications and go beyond traditional techniques developed in the database community. In the remainder of this document we present a comprehensive system to process and interact with large amounts of data, which can be deployed on top of private, virtualized clusters of commodity hardware. Before delving into the technical objectives we address in BigFoot, in the next section we motivate our approach by focusing on a few use-cases and by clearly indicating the deficiencies of current approaches.

Motivations

We now illustrate the challenges organizations face when dealing with their own digital footprints. Although the following examples apply to a wide range of use-cases, we focus on the context and requirements of two companies that are part of the BigFoot consortium.

GridPocket: GridPocket provides energy-related value-added service solutions. The goal of this organization is to use and process consumption data generated by millions of customers to help electric, gas, and water utilities reduce their CO2 emissions. Data analysis tasks apply, for example, to the following cases: i) consumer billing: a monthly scan of all customer data to produce consumption reports; ii) consumer dashboard Web applications: analysis of the whole consumption data set to create personalized customer reports; iii) consumer segmentation: execution of sophisticated algorithms to classify consumers based on their consumption patterns and produce, for example, personalized contract offers; iv) provisioning applications: analysis of geographical consumption patterns and design of predictive algorithms to help operators provision their electric network.

Symantec: Symantec is one of the world's industry leaders in security software, focused on helping customers protect their infrastructures, their information, and their businesses. Through its Global Intelligence Network, Symantec has established some of the most comprehensive sources of Internet threat data in the world, with 240,000 sensors monitoring network attack activity in more than 200 countries through a combination of security products and managed services.

What is a private cloud? Private cloud deployments [44] resemble public ones: one or more datacenters (clusters of physical machines, interconnected via a high-speed local area network) host virtual server instances which can be customized from scratch, or which host services and applications exposed to end-users via simple and standard interfaces, e.g., HTTP. Such installations are private in that services and applications are not accessible from the outside world: they are confined to clients that interact with them from within the security perimeter of a company or organization. In addition, data stored and manipulated in a private cloud never leave the company's datacenter, hence policies and rules protecting data from unauthorized access can be enforced with legacy approaches to security and access control.

However, security analysts are challenged in their daily job of analyzing global Internet threats because of the sheer volumes of data Symantec collects around the globe [31]. In the cyber security domain, this is sometimes referred to as attack attribution and situational understanding, which are considered today as critical aspects to effectively deal with Internet attacks [43, 58, 70]. Attribution in cyberspace involves different methods and techniques which, when combined appropriately, can help to explain an attack phenomenon by (i) indicating the underlying root cause, and (ii) showing the modus operandi of the attackers. The goal is to help analysts answer important questions regarding the organization of cyber criminal activities by taking advantage of effective tools able to generate security intelligence about known or unknown threats.

What are the common requirements, with respect to analytics tasks, that characterize the two scenarios described above? Clearly, there are two main ways of interacting with data: i) batch processing of large amounts of data (for analysis, mining, classification, etc.), and ii) selective queries on subsets of the data, issued through manual inspection or by Web applications that read information related to a single user or a subset thereof. Hence the need for a unified system capable of supporting both kinds of interaction with data.

The requirements exemplified above can, to some extent, be met with existing technology. What are the typical solutions available today, and what novelty does BigFoot bring to the current state-of-the-art? We review the three most prominent approaches to large-data analytics: i) buying database management systems and appliances from big vendors, ii) using public cloud services, and iii) using open-source projects/products.

Data Analytics Appliances: examples of products that attack the Big Data market include EMC/GreenPlum, Splunk, Oracle Big Data Appliance and many more. Addressing large-data analytics problems with such an approach has the advantage of using a product that bundles together hardware and software and comes with production-level support. However, these products are closed-source, and little is known about their effective performance and how they behave compared to alternative solutions. Furthermore, the costs associated with a production-level deployment are exorbitant, with licensing fees proportional to the amount of data to be processed. It should also be noted that such systems are difficult (at best) to deploy and tune [62]. Last but not least, this approach suffers from the data lock-in problem: once data is loaded and analytic jobs are written for a specific platform, it is hard to move to another product.

Public-cloud Analytic Services: a prominent example of this kind of approach is Amazon Elastic MapReduce. The idea is, for organizations such as the ones discussed above, to ship their data to a public cloud storage service, prepare analytic jobs in an offline manner (e.g., a MapReduce program, but also SQL-like queries), submit the data processing code to a web application, and select the amount (and quality) of resources that will be dedicated to computation. Although this seems an appealing approach, it is not exempt from significant drawbacks. First, today's public cloud products offer only a best-effort service: in practice there is no performance guarantee, and current service-level agreements also account for down-time periods in which the service may be unavailable. Furthermore, current EU directives are strict in what concerns privacy issues, which drastically limits the applicability of public-cloud services.

Finally, it is often unacceptable for industries to put sensitive data sets on a public cloud infrastructure, for obvious confidentiality reasons (e.g., for a security software company, uploading to a public cloud the malware and attack data targeting its customers is not an option).

Open-source projects: as a prominent example we consider Apache Hadoop (we review other projects in Section 1.2), which consists of an ecosystem of sub-projects, each dealing with a particular technology (e.g., a parallel processing framework similar to Google's MapReduce [28], a distributed data store similar to Google's BigTable [17], etc.). Additionally, several commercial products based on Hadoop have emerged in the last two years: IBM BigInsights, MapR, and the Cloudera Hadoop Distribution. The main focus of Hadoop is to reach production-level quality: a lot of effort has been dedicated to addressing bugs, improving interoperability among components, improving custom deployments on dedicated clusters and, recently, eliminating single points of failure in the original architecture. In synthesis, open-source projects lack a structured effort toward system optimization and fail to cover the multiple layers involved in typical deployments.

In summary, although the current state-of-the-art offers a rich set of approaches to tackle large-scale data processing problems, we identify the following key points that are currently not addressed by existing technologies:

- Data interaction is hard. Current approaches lack an integrated interface to inspect and query (processed) data. Moreover, little work has been done in the literature to optimize the efficiency (and not only the performance) of interactive queries that operate on batch-processed data.
- Parallel algorithm design is hard. While the design of parallel algorithms is already a difficult topic per se, current systems make the implementation of even simple jobs a tedious and exhausting experience (see the sketch after this list). As such, parallel programs tend to have limited usability and a short life-time, i.e., code re-use is limited.
- Lack of optimizations. Current systems entrust users with the task of optimizing their queries and algorithms. Moreover, data-flow and storage mechanisms are data-processing oblivious, which leaves room for several optimizations that have not been addressed by current solutions.
- Deployment tools are poor. Management tools are still in their infancy and target solely bare-metal clusters. In addition, the effects of virtualization have been largely overlooked in the literature.
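To make the second point concrete, the sketch below shows the scaffolding a plain Hadoop MapReduce job requires for a task as simple as counting log records per customer. It is a minimal illustration written against the standard org.apache.hadoop.mapreduce API; the input layout (tab-separated text lines whose first field is a customer identifier) is an assumption made for this example, not a format defined by BigFoot.

// Minimal Hadoop MapReduce job: count log records per customer.
// Illustration only: the tab-separated input layout, with a customer
// identifier as the first field, is an assumed example format.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RecordsPerCustomer {

    public static class ExtractCustomer
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text customer = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Emit (customerId, 1) for every log record.
            customer.set(line.toString().split("\t", 2)[0]);
            ctx.write(customer, ONE);
        }
    }

    public static class SumCounts
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text customer, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable c : counts) {
                total += c.get();
            }
            ctx.write(customer, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "records-per-customer");
        job.setJarByClass(RecordsPerCustomer.class);
        job.setMapperClass(ExtractCustomer.class);
        job.setCombinerClass(SumCounts.class); // local pre-aggregation before the shuffle
        job.setReducerClass(SumCounts.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Even this trivial aggregation requires some fifty lines of typed plumbing, whereas the same computation in a declarative, SQL-like language is a one-liner (e.g., SELECT customer, COUNT(*) FROM logs GROUP BY customer). Closing this gap is precisely what the high-level language component introduced later in this section (Fundamental objective 2) targets.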

Illustrative example of a BigFoot deployment. We conclude this section with an illustrative example highlighting the benefits brought by the BigFoot project. For this example, we assume BigFoot is used by a company, say Symantec, willing to explore the secrets hidden in the vast amount of data it collects. We assume that Symantec is already in possession of a private cloud deployment, that is, the set of machines on which BigFoot will execute (in addition to existing, collocated services): BigFoot is physically deployed on their premises. What are the steps a Symantec user is required to follow? BigFoot exposes storage, data processing and querying components as a Platform-as-a-Service; in practice this translates into the following steps (a sketch of the interactive path follows the list):

- Using a standard interface (shell-based or web-based), the user specifies the location of the data she will operate on; the system injects such data into the relevant storage layer (distributed file system or data store).
- Using a standard interface (shell-based or web-based), the user specifies the data processing tasks to run on her data. Data processing can be delay-tolerant (that is, batch-oriented analysis) or latency-sensitive (that is, interactive queries), and the system can be instructed to direct such tasks to the corresponding engine.
- The system automatically deploys the necessary machinery (in terms of virtual machines) to execute data analysis tasks, and performs the necessary optimizations (data and virtual machine migration, data-flow enhancements, work-sharing optimizations) to obtain an aggregate result (that we also label metadata).
- Using a standard interface (shell-based or web-based), the user can further inspect aggregate statistics to extract useful information or to refine the data processing tasks.
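BigFoot does not prescribe a client API at this stage; as one concrete illustration of what the latency-sensitive, interactive path could look like behind such a standard interface, the sketch below issues a SQL-like query over JDBC to a Hive-style query server. The driver class and endpoint are those of Apache Hive's HiveServer2, used here only as a stand-in, and the consumption table (columns customer_id, kwh, billing_month) is invented for this example; none of these names are BigFoot components.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical interactive session: top-10 consumers for one month.
// Endpoint, table and column names are invented for illustration;
// the JDBC driver is Apache Hive's HiveServer2 driver, used as a stand-in.
public final class InteractiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn =
                 DriverManager.getConnection("jdbc:hive2://bigfoot.internal:10000/default");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT customer_id, SUM(kwh) AS total "
                     + "FROM consumption WHERE billing_month = '2012-05' "
                     + "GROUP BY customer_id ORDER BY total DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}

The point of the sketch is the division of labor: the user writes only the query, while provisioning, data placement and routing to the appropriate engine (the third step above) happen transparently behind the interface.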

As the example above illustrates, BigFoot offers a unified setting to store, process and interact with data, exposed to the user through simple and standard interfaces to a cloud service. All the complications related to deployment, tuning and optimization are handled transparently by the system. In addition to the above scenario, it should be noted that BigFoot offers tap-in points for experienced users willing to sacrifice the simplicity of this approach for a more controlled usage of, for example, the parallel processing layer. This gives users the additional freedom to decide how to interact with and use BigFoot.

Objectives: The BigFoot Approach

The key challenge of BigFoot is to conduct cutting-edge research on several issues related to Big Data Analytics applications and services, producing relevant output for the research community with the potential of being immediately relevant and available to real-world, industrial problems. BigFoot has a number of important scientific and industrial objectives. These include fundamental (scientific and research-oriented) and experimental elements, complemented by contributions to the open-source community.

Fundamental Objectives. The fundamental objectives of BigFoot are threefold, and each is described in detail in the following sub-paragraphs. The technical work packages (WP) described in the remainder of this document, including WP 2, 3, 4 and 5, contribute to the research-oriented work carried out in BigFoot.

Fundamental objective 1: Given the potential offered by BigFoot, it is important to clearly define use cases and scenarios, and to detail the workloads that derive from real-world applications. Moreover, it is important to make an effort to generalize workloads to encompass other applications that share similar traits with those addressed in BigFoot. The work for this first objective is executed in WP2, and it is scheduled to take place in the first phase of the project.

Fundamental objective 2: In BigFoot, the focus is on system optimization, from the top layer down to the lower layers of the system stack. Each component of the architecture requires special care in achieving efficiency, scalability and reliability goals. Furthermore, little is known about combining optimization techniques at different layers of the stack, an approach we label cross-layer. As such, part of this fundamental objective is the design, implementation and validation of:

1. a novel component transforming a high-level, declarative query language into parallel programs (this work corresponds to a task in work package 3);
2. optimizations to the inner data-flow of the parallel processing framework adopted in BigFoot (this work is carried out in work package 3);
3. a service-oriented query engine improving data interaction, and its integration with distributed data stores (this work is done in work package 3);
4. novel data partitioning and placement mechanisms aiming at optimizing the storage layer (this work is carried out in work package 4).

Fundamental objective 3: The BigFoot system stack is designed to work in a virtualized cluster, consisting of a pool of virtual machines and virtual networks. Our ultimate goal is to go beyond best-effort services and offer performance guarantees. An underlying objective is to adopt a cross-layer approach to optimization. This fundamental objective is achieved by the work in WP5, which covers several aspects of infrastructure virtualization and algorithms to support the specification of requirements and constraints dictated by BigFoot components.

Experimental Objectives. BigFoot strives for cutting-edge experimental work driven by the applications envisioned in the project. To this end, each partner will run experiments on platforms deployed at their respective premises. Each experimental platform will evaluate selected BigFoot components during the early stages of the development process, and the whole platform once it is available. Experimental objectives are grouped by partner type, academic first, then industrial.

Experimental objective 1: EUR, TUB, and EPFL will work towards the design and deployment of experimental test beds to validate and analyze the performance of each BigFoot component, both individually (that is, before a consolidated implementation is available, as part of the work in WP3, 4 and 5) and as a whole, when the BigFoot software stack is fully available.

As part of the performance evaluation and benchmarking tasks of WP2, the academic partners will use their platforms to perform fine-grained measurements on all system layers, including individual computing node resources, the network equipment and topology, and the BigFoot stack. In the process, the partners will use auxiliary tools to generate synthetic data and workloads during the early stages of the project. Tests on the consolidated platform, in addition, will use real-world data (or anonymized versions thereof) and workloads.

Experimental objective 2: As industrial partners, SYM and GRIDP will work towards the design, implementation and evaluation of algorithms for scalable clustering, data mining and multi-criteria analysis of very large data sets by leveraging the underlying layers of the BigFoot system. As such, they will deploy local test beds to perform an experimental validation and performance assessment of the BigFoot software stack, which is part of Task 2.4 of WP2. Such experiments will follow the development progress of BigFoot; that is, they can take place even if a consolidated version of the system is not fully available: each individual component strives to optimize system performance and can be tested in isolation. The final system will be deployed by the industrial partners which, together with the academic ones, will analyze in detail the effects of a cross-layer approach to optimization. As a complement, the industrial partners will also consider alternative commercial approaches, so as to establish a comprehensive comparative analysis of the performance and efficiency of BigFoot with respect to the state-of-the-art.

Open-source Objectives. BigFoot goes beyond the fundamental elements outlined above and contributes the outcomes of the project to the open-source community. BigFoot takes two distinct directions toward this goal: contributions to existing open-source projects, such as Apache Hadoop, when these exist and are widely adopted by the industry; and establishment of new open-source projects for components that cannot be integrated into an existing one. We remark here that contributing to existing open-source projects should be regarded as a feature rather than a limitation of BigFoot: several other projects related to BigFoot (which we overview in Section 1.2) failed largely because they had little or no impact. BigFoot adopts the Apache Software Foundation (ASF) v2.0 license, which is compatible with the exploitation plans, especially those of the two industrial partners of the consortium. As discussed above, BigFoot strives to contribute to the existing open-source projects that emerged in the past few years as de-facto standards for Big Data applications: in particular, by focusing on the Hadoop project, BigFoot embraces (and is bound to) the ASF v2.0 license. For other open-source initiatives, the software license will be decided on a case-by-case basis, with a preference for the ASF v2.0 license. As part of the open-source objectives, BigFoot will create a number of Git repositories on a dedicated server for all software deliverables. The consortium will push frequent updates to the released software, including early releases, bug fixes, and a consolidation effort to bring software quality as close as possible to a production environment.

The following table summarizes the BigFoot objectives and indicates, for each category, a metric to assess the progress and accomplishments achieved in the project. A detailed description of these metrics and related indicators is given in the Indicators and success criteria section below.

Fundamental: The achievement of fundamental objectives is measured by the impact of articles published in prestigious international journals and conferences in the research community. Furthermore, achievements can be measured by the interaction with related research projects and by the establishment of training activities, including doctoral schools and workshops.

Experimental: These objectives can be measured by asserting the deployment of experimental platforms by project partners and by their relevance in carrying out the validation and performance evaluation tasks defined in WP2. Additionally, the adoption of BigFoot (or a subset of its components) for the Symantec WINE platform is another measurable metric to judge the achievements of BigFoot.

Open-source: Automatically generated activity reports and graphs (number of contributors, feature requests, bug fixes, frequency of contributions, etc.) are an integral part of the public BigFoot repositories and constitute a natural measurement metric to assess project progress. In addition, the creation of new JIRA tickets and the eventual adoption by the open-source community of contrib modules produced by BigFoot is another metric to monitor the progress and accomplishments of the project. Note that JIRA tickets are uniquely numbered and associated with the identity who created them; as such, it is possible to correlate such tickets with BigFoot activities.

Expected results

We now summarize the expected results of the BigFoot project. Once the products delivered by the project are clearly defined, we describe the potential users of the BigFoot stack. The results we expect from BigFoot fall into two main categories, research and software, described below.

Research results. BigFoot blends theory and systems research in the domain of large-scale data management and distributed systems. Given the composition of the consortium, which includes three academic partners, the main expected results of BigFoot materialize in research articles submitted to prestigious venues, including international conferences and journals, as described in detail in Sec. 3. The main domains in which we expect to have an impact include (but are not limited to): scalable algorithm design, scheduling protocols, work-sharing optimization, automatic physical design and dynamic data layout, system reliability, and embedding algorithms. In addition, the research activity underlying BigFoot involves a massive implementation effort toward the design and deployment of benchmarking tools and workload suites, cross-layer and work-sharing optimization techniques, and a thorough experimental approach to evaluating system performance and scalability.

Software results. In addition to research activities, BigFoot aims at having an impact on industry, which is ensured by the two industrial members of the consortium and by the presence of an industrial advisory board that BigFoot will rely on to understand and define strategic decisions during the project execution. BigFoot will contribute to existing open-source projects, including Apache Hadoop and OpenStack, and establish new projects following the two-pronged approach described in detail in the Open-source Objectives paragraph above. Essentially, prompt and incremental commits to experimental repositories represent the first phase of the open-source strategy. Subsequently, selected components will be pushed to existing open-source communities for review, contributions and final acceptance.

We now describe the potential users of the BigFoot software stack:

- Engineers, data scientists and product managers: the first category of users are employees of small and large companies that wish to exploit the potential of their data. Such users should not have to bother with the tedious exercise of dimensioning, deploying and tuning a complex distributed system to store and process data. Instead, they will use BigFoot as a Platform-as-a-Service solution requiring minimal manual intervention, allowing this user category to focus on their task: extracting useful information from data. For such users, BigFoot represents a drastic improvement over the state-of-the-art (the only competitor for BigFoot is Amazon Elastic MapReduce, which is a best-effort service).
- Developers: this category encompasses the open-source community that will contribute to BigFoot and its components. It should be noted that many large corporations (including Facebook, Twitter, IBM, Cloudera, and many more) currently embrace the open-source model of contributing to successful projects, and this is especially true for Hadoop. As such, we believe that BigFoot offers fundamental challenges to these users and, at the same time, will benefit greatly from their contributions.
- Researchers: the research activity carried out in BigFoot will produce a set of fundamental tools to understand production workloads based on real-world data and to assess the performance and scalability of the underlying systems used to store and analyze such data. Essentially, the experimental approach of BigFoot will enable this set of users to carry out their own research and eventually reproduce and compare their work to BigFoot.

Indicators and success criteria

The success of BigFoot can be measured based on the following indicators and success criteria, which are related to its objectives:

Fundamental Objectives: peer-reviewed articles in the highest-quality international and national scientific journals and conferences. We expect no less than one article per year per partner: a large fraction of the work in BigFoot blends theoretical and systems aspects of a variety of topics (system virtualization, parallel processing engines, distributed data stores and query engines, high-level languages and scalable algorithms) that may require an important implementation effort, so it is advisable to target a realistic publication rate that incorporates the engineering effort behind a scientific article.


More information

Unified Cyber Security Monitoring and Management Framework By Vijay Bharti Happiest Minds, Security Services Practice

Unified Cyber Security Monitoring and Management Framework By Vijay Bharti Happiest Minds, Security Services Practice Unified Cyber Security Monitoring and Management Framework By Vijay Bharti Happiest Minds, Security Services Practice Introduction There are numerous statistics published by security vendors, Government

More information

Navigating Big Data business analytics

Navigating Big Data business analytics mwd a d v i s o r s Navigating Big Data business analytics Helena Schwenk A special report prepared for Actuate May 2013 This report is the third in a series and focuses principally on explaining what

More information

VIEWPOINT. High Performance Analytics. Industry Context and Trends

VIEWPOINT. High Performance Analytics. Industry Context and Trends VIEWPOINT High Performance Analytics Industry Context and Trends In the digital age of social media and connected devices, enterprises have a plethora of data that they can mine, to discover hidden correlations

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

Monitoring Best Practices for COMMERCE

Monitoring Best Practices for COMMERCE Monitoring Best Practices for COMMERCE OVERVIEW Providing the right level and depth of monitoring is key to ensuring the effective operation of IT systems. This is especially true for ecommerce systems

More information

Online Content Optimization Using Hadoop. Jyoti Ahuja Dec 20 2011

Online Content Optimization Using Hadoop. Jyoti Ahuja Dec 20 2011 Online Content Optimization Using Hadoop Jyoti Ahuja Dec 20 2011 What do we do? Deliver right CONTENT to the right USER at the right TIME o Effectively and pro-actively learn from user interactions with

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances INSIGHT Oracle's All- Out Assault on the Big Data Market: Offering Hadoop, R, Cubes, and Scalable IMDB in Familiar Packages Carl W. Olofson IDC OPINION Global Headquarters: 5 Speen Street Framingham, MA

More information

How Cisco IT Built Big Data Platform to Transform Data Management

How Cisco IT Built Big Data Platform to Transform Data Management Cisco IT Case Study August 2013 Big Data Analytics How Cisco IT Built Big Data Platform to Transform Data Management EXECUTIVE SUMMARY CHALLENGE Unlock the business value of large data sets, including

More information

Navigating the Big Data infrastructure layer Helena Schwenk

Navigating the Big Data infrastructure layer Helena Schwenk mwd a d v i s o r s Navigating the Big Data infrastructure layer Helena Schwenk A special report prepared for Actuate May 2013 This report is the second in a series of four and focuses principally on explaining

More information

Big Data Integration: A Buyer's Guide

Big Data Integration: A Buyer's Guide SEPTEMBER 2013 Buyer s Guide to Big Data Integration Sponsored by Contents Introduction 1 Challenges of Big Data Integration: New and Old 1 What You Need for Big Data Integration 3 Preferred Technology

More information