Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia





Distributed Log Collecting and Analysing Platform
Project Specifications
Category: Big Data and NoSQL
Software Requirements: Apache Hadoop ecosystem projects, Java, Python, MapReduce/Spark, machine learning
Hardware Requirements: Cluster hosting the data platform, test servers
Prerequisites: Java programming, Linux and networking fundamentals, a basic understanding of NoSQL systems
Capacity: 8 applicants

Distributed Log Collecting and Analysing Platform
Project Outline
As corporate and enterprise systems grow in size and become more distributed, centralized log collection, analysis and processing becomes increasingly challenging. The goal of this project is to create a platform that stores log data generated by corporate services and provides an interface for interactive querying and analysis.
Tasks
- Create a distributed Hadoop-based platform for log storage
- Create a MapReduce framework
- Organize the log data flow (Apache Flume, Fluentd, etc.)
- Organize querying with SQL syntax (Hive, Impala, Hortonworks Stinger/Stinger.next, etc.)
- Organize real-time search (Apache Solr, Elasticsearch)
- Run machine learning algorithms on the log data to provide health reports for the whole system
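As a rough illustration of the kind of MapReduce job such a platform might run, the sketch below counts log entries per service and severity level in plain Python. The log line format, the sample lines and all function names are assumptions made for this example; a real job would run on the Hadoop cluster rather than in a single process.

```python
from collections import Counter

# Hypothetical log format: "<timestamp> <service> <LEVEL> <message>"
SAMPLE_LOGS = [
    "2014-09-01T10:00:00 auth INFO user logged in",
    "2014-09-01T10:00:02 billing ERROR payment gateway timeout",
    "2014-09-01T10:00:05 auth WARN slow response",
    "2014-09-01T10:00:09 billing ERROR payment gateway timeout",
]


def map_line(line):
    """Map step: emit ((service, level), 1) for one log line."""
    parts = line.split(" ", 3)
    if len(parts) < 4:
        return []  # skip lines that do not match the assumed format
    _, service, level, _ = parts
    return [((service, level), 1)]


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per (service, level) key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_by_service_and_level(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_line(line))
    return reduce_counts(pairs)
```

A per-service error count like this could serve as one input to the health reports mentioned in the last task.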

Machine Learning and Parallel Computing
Project Specifications
Category: Big Data and NoSQL
Project Type: Applied Research
Software Requirements: Java, Eclipse
Hardware Requirements: None
Prerequisites: MapReduce, Java, object-oriented programming
Capacity: 2 applicants

Machine Learning and Parallel Computing
Project Outline
MapReduce is a software framework for distributed computing introduced by Google. While MapReduce might be criticized for its distributed computing capabilities, it is ideally suited to the type of experimental evaluation carried out in machine learning (ML) and data mining. Testing in machine learning typically involves running variants of algorithms on different datasets, and this type of work can be distributed across nodes very effectively using MapReduce.
Tasks
- Implement two machine learning algorithms in a MapReduce framework and demonstrate the evaluation of these algorithms across a number of datasets.
- The algorithms do not need to be implemented from scratch, as Java code for most machine learning algorithms is publicly available; see the Weka code, for instance.
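The idea of distributing "algorithm variant x dataset" evaluations can be sketched in a few lines. The example below is only an illustration: the toy threshold classifiers and the tiny datasets are invented stand-ins (the project itself would use Weka algorithms inside a MapReduce framework), and a thread pool plays the role of the cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical toy datasets: lists of (feature, label) pairs.
DATASETS = {
    "a": [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)],
    "b": [(0.2, 0), (0.3, 1), (0.7, 1), (0.8, 1)],
}

# Hypothetical algorithm variants: a threshold classifier with two settings.
VARIANTS = {"threshold_0.5": 0.5, "threshold_0.25": 0.25}


def evaluate(task):
    """Map step: score one (variant, dataset) pair independently."""
    variant, dataset = task
    threshold = VARIANTS[variant]
    rows = DATASETS[dataset]
    correct = sum(1 for x, label in rows if int(x >= threshold) == label)
    return variant, dataset, correct / len(rows)


def run_all(max_workers=4):
    """Evaluate every variant on every dataset in parallel."""
    tasks = list(product(VARIANTS, DATASETS))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate, tasks))


def mean_accuracy(results):
    """Reduce step: average the accuracy of each variant across datasets."""
    by_variant = {}
    for variant, _, accuracy in results:
        by_variant.setdefault(variant, []).append(accuracy)
    return {v: sum(a) / len(a) for v, a in by_variant.items()}
```

Because each (variant, dataset) pair is scored independently, the map step needs no communication between workers, which is what makes this workload a natural fit for MapReduce.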

Apache Spark vs. Classic MapReduce
Project Specifications
Category: Big Data and NoSQL
Project Type: Applied Research
Software Requirements: Apache Hadoop, Java/Scala/Python, MapReduce, machine learning
Hardware Requirements: Cluster hosting the data platform
Prerequisites: Java, Python, object-oriented programming, basic understanding of NoSQL systems
Capacity: 2 applicants

Apache Spark vs. Classic MapReduce
Project Outline
Apache Spark is a fast and general engine for large-scale data processing, and it is becoming increasingly popular. Its developers claim that Spark can be up to 100 times faster than classic MapReduce processing. The goal of this project is to create a data platform, collect some test data (e.g. Twitter tweet data), and provide a comparative analysis of both systems.
Tasks
- Create a Hadoop cluster for distributed data storage
- Create a YARN-based MapReduce framework
- Write test data of a chosen size (e.g. the test could be done on a 1 TB data set)
- Run some classic MapReduce tasks and document the results
- Rewrite the same logic with Apache Spark
- Run the Spark jobs and document the results
- Create a detailed comparison report
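A comparison like this is only meaningful if both implementations produce the same output on the same data and are timed the same way. The sketch below shows one possible shape for such a harness. The two word-count functions are local stand-ins, not real MapReduce or Spark jobs, and every name here is an assumption; in the project each callable would submit a job to the cluster and wait for it to finish.

```python
import time
from collections import Counter
from statistics import median


def wordcount_loop(lines):
    """Stand-in for the first implementation: explicit loop."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def wordcount_counter(lines):
    """Stand-in for the second implementation: same logic, different code."""
    return dict(Counter(word for line in lines for word in line.split()))


def time_job(job, data, repeats=3):
    """Return the median wall-clock time of several runs, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        job(data)
        timings.append(time.perf_counter() - start)
    return median(timings)


def compare(jobs, data, repeats=3):
    """Check that all jobs agree on the output, then time each one."""
    outputs = [job(data) for job in jobs.values()]
    if any(out != outputs[0] for out in outputs[1:]):
        raise ValueError("jobs disagree on output; timings are not comparable")
    return {name: time_job(job, data, repeats) for name, job in jobs.items()}
```

Using the median of several runs rather than a single run reduces the influence of one-off effects such as a cold cache.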

Machine Learning and Predictive Analysis
Project Specifications
Category: Statistical Data Analysis
Project Type: Applied Research
Software Requirements: Java and/or Python, Eclipse, R
Hardware Requirements: None
Prerequisites: Basic statistics, R
Capacity: 2 applicants

Machine Learning and Predictive Analysis
Project Outline
Any observational data, such as monitoring data, can be represented as time series. Multivariate time series can be represented as matrices. If the individual time series (matrix rows) vary similarly, machine learning and statistical tools can be used to predict the future behavior of the data. A very simple approach is the so-called multivariate linear regression method. It allows predicting future data and also finding a confidence level for every single time series.
Tasks
- Develop a program that reads data in JSON format and converts it to multivariate time series, i.e. (timestamp, value) pairs.
- First write a program for simple linear regression and prediction based on it.
- Examine other single-value prediction algorithms/methods such as Holt-Winters, linear regression, ARIMA and state-space models.
- Check whether the data is suitable for a given machine learning algorithm.
- Create a test data collection to test the proposed algorithm and interpret the results.
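The first two tasks can be sketched as follows. This is a minimal illustration only: the JSON layout (a mapping from metric name to a list of [timestamp, value] pairs) is an assumption, and the fit is an ordinary least-squares line per series, without the confidence levels mentioned above.

```python
import json


def load_series(json_text):
    """Parse the assumed layout {"metric": [[timestamp, value], ...]}
    into a dict of time-ordered (timestamp, value) pairs per metric."""
    raw = json.loads(json_text)
    return {
        name: sorted((float(t), float(v)) for t, v in points)
        for name, points in raw.items()
    }


def fit_line(points):
    """Ordinary least-squares fit of value = slope * timestamp + intercept."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    if sxx == 0:
        raise ValueError("all timestamps are identical")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in points)
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def predict(points, future_timestamp):
    """Extrapolate the fitted line to a future timestamp."""
    slope, intercept = fit_line(points)
    return slope * future_timestamp + intercept
```

For instance, a series with values 1, 3 and 5 at timestamps 0, 1 and 2 lies exactly on the line value = 2 * timestamp + 1, so the prediction for timestamp 3 is 7.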

Social Network Analysis
Project Specifications
Category: Software Development
Project Type: Applied Research
Software Requirements: Java and/or Python, Eclipse
Hardware Requirements: None
Prerequisites: Java, Python, object-oriented programming
Capacity: 1 applicant

Social Network Analysis
Project Outline
The goal of this project is to take a news topic (e.g. the Presidential Election) or a group of Twitter users (such as professional rugby players) and analyse the related activity on Twitter. Rather than attempting to analyse the content of the tweets, the idea is to analyse the characteristics of the Twitter network. For an example of this type of analysis, see below. The reading list below contains links to material on the Twitter API and Twitter4J, a Java library for Twitter. Similar resources are available for Python and Ruby.
Tasks
- Develop a system that takes a collection of Twitter user IDs as input and generates data about these users in a form that can be visualized.
- Visualization of the generated data on a web page is desired.
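One possible shape for the "data that can be visualized" is a node-and-link structure with simple network characteristics attached to each user. The sketch below assumes the follow relationships have already been fetched and are given as (follower, followed) pairs; it does not call the Twitter API, and the function names and the JSON layout are illustrative choices only.

```python
import json


def build_edges(user_ids, follows):
    """Keep only follow relationships between users in the input collection.
    `follows` is an iterable of (follower_id, followed_id) pairs that is
    assumed to have been fetched beforehand."""
    users = set(user_ids)
    return sorted(
        {(a, b) for a, b in follows if a in users and b in users and a != b}
    )


def degree_stats(user_ids, edges):
    """Count followers (in-degree) and following (out-degree) per user."""
    stats = {u: {"followers": 0, "following": 0} for u in user_ids}
    for follower, followed in edges:
        stats[follower]["following"] += 1
        stats[followed]["followers"] += 1
    return stats


def to_visualization_json(user_ids, edges):
    """Serialize the network as nodes and links for a web visualization."""
    stats = degree_stats(user_ids, edges)
    return json.dumps({
        "nodes": [{"id": u, **stats[u]} for u in user_ids],
        "links": [{"source": a, "target": b} for a, b in edges],
    })
```

A nodes/links layout of this kind is a common input for graph-drawing libraries in the browser, which would cover the web page visualization task.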

NoSQL Systems Comparison Analysis
Project Specifications
Category: Big Data and NoSQL
Project Type: Applied Research
Software Requirements: Java/Python, the NoSQL systems chosen for comparison
Hardware Requirements: Servers (virtual/dedicated) hosting the NoSQL systems
Prerequisites: Java, Python, object-oriented programming, basic understanding of NoSQL systems
Capacity: 2-3 applicants

NoSQL Systems Comparison Analysis
Project Outline
Recent trends and developments in "NoSQL" systems make the decision process for corporate system selection harder. A toolset together with comparison reports and/or charts therefore allows a better justified decision on system selection and customization. The goal of this project is to develop benchmarking tools to test NoSQL systems of a given type (e.g. in-memory stores), and to provide articles, benchmark test results and comparison reports produced with the developed tools.
Tasks
- Create a feature comparison report
- Develop benchmarking tools for the systems under test. The tool(s) should be able to run:
  - Write load test
  - Read load test
  - MapReduce test (if available)
  - Cluster tests
  - Sharding test
- Create a detailed report of the compared systems' features and benchmark tests
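The write and read load tests could be built around a small store interface, so that the same test code runs against every system being compared. The sketch below is an assumption-laden illustration: `InMemoryStore` is a dictionary-backed stand-in for a real NoSQL client, and the `put`/`get` method names are hypothetical rather than taken from any particular product.

```python
import time


class InMemoryStore:
    """Stand-in for a NoSQL client; a real adapter would wrap the
    client library of the system under test behind the same methods."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def write_load_test(store, operations):
    """Write `operations` key/value pairs and report the elapsed time."""
    start = time.perf_counter()
    for i in range(operations):
        store.put(f"key-{i}", f"value-{i}")
    elapsed = time.perf_counter() - start
    return {"operations": operations, "seconds": elapsed}


def read_load_test(store, operations):
    """Read back `operations` keys and report elapsed time and misses."""
    misses = 0
    start = time.perf_counter()
    for i in range(operations):
        if store.get(f"key-{i}") is None:
            misses += 1
    elapsed = time.perf_counter() - start
    return {"operations": operations, "seconds": elapsed, "misses": misses}
```

Reporting misses alongside the timing helps catch the case where a system looks fast only because it silently dropped writes.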

Statistical Data Analysis Environment
Project Specifications
Category: Statistical Data Analysis
Project Type: Applied Research
Software Requirements: Java, JavaScript and/or Python, Eclipse, R
Hardware Requirements: None
Prerequisites: Basic statistics, R
Capacity: 1 applicant

Statistical Data Analysis Environment
Project Outline
There are many tools based on R, Python and Java for statistical data analysis. Some web-based environments also exist for simple statistical analysis, but they come without any embedded test data. To check the reliability of analysis results, however, test data should exist, or the tool should be able to extract data from a given test database. The aim of this project is therefore to combine different statistical tools with data in order to run analyses and visualise the results. For examples, see RStudio Server and StatAce.
Tasks
- Develop a web interface for statistical analysis and visualisation.
- Create a test database and connect it to the web interface.
- Combine different statistical tools and programming languages, such as R, Python's SciPy and Java, to work in one environment with the same data.

Software Performance
Project Specifications
Category: Software Development
Project Type: Design and Implementation
Software Requirements: Java, unit testing framework (JUnit), logging (Log4J), Eclipse
Hardware Requirements: None
Prerequisites: Java, object-oriented programming
Capacity: 1 applicant

Software Performance
Project Outline
The goal of this project is to monitor the performance of any Java application using aspect-oriented programming (AOP). AOP is a programming concept that allows intercepting calls at run time and modifying the program's behavior. It is often used to separate concerns in software development, typically for crosscutting concerns such as logging or performance monitoring.
Tasks
- Run testing systems on several Java libraries/applications and study the testing process (such as when and where faults are injected and how faults are detected).
- Write a simple performance monitoring tool for testing systems using AOP.
- Visualize the performance of the testing systems on a web page.
- Integration with the Monitis system is desired. Cooperation with the external monitoring platform should be done via Java Management Extensions (JMX) technology.
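The interception idea behind AOP can be illustrated outside Java as well. The sketch below uses a Python decorator as a loose analogue of an "around" advice: it wraps a function, times every call and records the result in a shared table, without touching the function body. This is not AspectJ and not the project's required stack; the names `monitored`, `CALL_STATS` and `parse_order` are invented for the example.

```python
import functools
import time
from collections import defaultdict

# Shared table of per-function statistics, filled in by the wrapper below.
CALL_STATS = defaultdict(lambda: {"calls": 0, "total_seconds": 0.0})


def monitored(func):
    """Wrap `func` so that every call is timed and counted.
    The monitoring concern stays separate from the function's own logic."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs for normal returns and for exceptions alike.
            entry = CALL_STATS[func.__name__]
            entry["calls"] += 1
            entry["total_seconds"] += time.perf_counter() - start

    return wrapper


@monitored
def parse_order(text):
    """Hypothetical application function being monitored."""
    return [int(item) for item in text.split(",")]
```

In the Java setting, the equivalent wrapping would be declared once in an aspect and applied to many methods by a pointcut, and the collected statistics could then be exposed through JMX.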

Cloud Computing Platforms Comparison
Project Specifications
Category: Cloud Computing
Project Type: Applied Research
Software Requirements: Java/Python, Eclipse
Hardware Requirements: None
Prerequisites: Java, Python, object-oriented programming
Capacity: 2 applicants

Cloud Computing Platforms Comparison
Project Outline
Business applications are increasingly moving to the cloud. It's not just a fad: the shift from traditional software models to the Internet has steadily gained momentum over the last 10 years. The goal of this project is to compare and test the main functionality and performance available on Amazon EC2, Google Compute Engine, and Microsoft Windows Azure.
Tasks
- Give an overview of cloud computing technologies and architectures provided by well-known vendors
- Implement a cloud-based application and a benchmark test instrument
- Run the benchmark test on various well-known cloud platforms
- Provide a comparison of the Amazon EC2, Google Compute Engine and Microsoft Windows Azure cloud platforms

Projects Structure

Thank You! Questions?