Monitis Project Proposals for AUA September 2014, Yerevan, Armenia
Distributed Log Collecting and Analysing Platform
Project Specifications
Category: Big Data and NoSQL
Software Requirements: Apache Hadoop ecosystem projects, Java, Python, MapReduce/Spark, machine learning
Hardware Requirements: Cluster hosting the data platform, test servers
Prerequisites: Java programming, Linux and networking fundamentals, a basic understanding of NoSQL systems
Capacity: 8 applicants
Distributed Log Collecting and Analysing Platform
Project Outline
As corporate and enterprise systems grow in size, and with recent trends in distributed systems, centralized log collecting, analysing and processing become more and more challenging. The goal of this project is to create a platform that will store log data generated by corporate services and provide an interface for interactive querying and analysis.
Tasks
- Create a distributed, Hadoop-based platform for log storage
- Create a MapReduce framework
- Organize the log data flow (Apache Flume, Fluentd, etc.)
- Organize querying with SQL syntax (Hive, Impala, Hortonworks Stinger/Stinger.next, etc.)
- Organize real-time search (Apache Solr, Elasticsearch)
- Run machine learning algorithms on the log data to provide health reports for the whole system
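The MapReduce task above can be previewed with a small, hypothetical sketch: a mapper emits a (log level, 1) pair for each log line, and a reducer sums the pairs per level. The log lines and function names are invented for illustration; a real implementation would run this dataflow on the Hadoop cluster rather than in a single process.

```python
from collections import defaultdict

def map_phase(log_lines):
    """Emit a (level, 1) pair for each log line."""
    for line in log_lines:
        level = line.split()[0]          # e.g. "ERROR", "INFO"
        yield level, 1

def reduce_phase(pairs):
    """Group pairs by key and sum the counts per log level."""
    counts = defaultdict(int)
    for level, n in pairs:
        counts[level] += n
    return dict(counts)

# Invented sample log lines standing in for collected service logs.
logs = [
    "INFO service started",
    "ERROR disk full",
    "INFO request served",
    "ERROR timeout",
    "WARN high latency",
]
level_counts = reduce_phase(map_phase(logs))
```

Aggregates like `level_counts` are exactly the kind of per-service summary the health reports would be built from.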
Machine Learning and Parallel Computing
Project Specifications
Category: Big Data and NoSQL
Project Type: Applied Research
Software Requirements: Java, Eclipse
Hardware Requirements: None
Prerequisites: MapReduce, Java, object-oriented programming
Capacity: 2 applicants
Machine Learning and Parallel Computing
Project Outline
MapReduce is a software framework for distributed computing introduced by Google. While MapReduce might be criticized for its distributed computing capabilities, it is ideally suited for the type of experimental evaluations that are carried out in machine learning (ML) and data mining. Testing in machine learning typically involves running variants of algorithms on different datasets, and this type of work can be distributed across nodes very effectively using MapReduce.
Tasks
- Implement two machine learning algorithms in a MapReduce framework and demonstrate the evaluation of these algorithms across a number of datasets. The algorithms need not be implemented from scratch, as Java code for most machine learning algorithms is publicly available - see the Weka code, for instance.
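A minimal sketch of why this parallelizes well: every (algorithm, dataset) pair is an independent task, so mapping over the cross product distributes naturally. The toy classifiers and tiny datasets below are invented for illustration; in the project they would be Weka implementations evaluated on real datasets across Hadoop nodes.

```python
from itertools import product

def majority_classifier(train_labels, test_labels):
    """Predict the most common training label for every test point."""
    majority = max(set(train_labels), key=train_labels.count)
    correct = sum(1 for y in test_labels if y == majority)
    return correct / len(test_labels)

def constant_classifier(train_labels, test_labels):
    """Always predict label 0, regardless of the training data."""
    correct = sum(1 for y in test_labels if y == 0)
    return correct / len(test_labels)

algorithms = {"majority": majority_classifier, "constant": constant_classifier}
datasets = {
    "ds1": ([0, 0, 1], [0, 0, 1, 0]),   # (train labels, test labels)
    "ds2": ([1, 1, 0], [1, 1, 1, 0]),
}

# The "map" step: each (algorithm, dataset) evaluation is independent,
# so each key of this dict could run on a different node.
results = {
    (name, ds): algo(train, test)
    for (name, algo), (ds, (train, test))
    in product(algorithms.items(), datasets.items())
}
```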
Apache Spark vs. Classic MapReduce
Project Specifications
Category: Big Data and NoSQL
Project Type: Applied Research
Software Requirements: Apache Hadoop, Java/Scala/Python, MapReduce, machine learning
Hardware Requirements: Cluster hosting the data platform
Prerequisites: Java, Python, object-oriented programming, basic understanding of NoSQL systems
Capacity: 2 applicants
Apache Spark vs. Classic MapReduce
Project Outline
Apache Spark is a fast and general engine for large-scale data processing, and it is becoming more and more popular these days. Its developers claim that Spark is more than 100 times faster than classic MapReduce processing. The goal of this project is to create a data platform, collect some test data (e.g. Twitter tweet data), and provide a comparative analysis of both systems.
Tasks
- Create a Hadoop cluster for distributed data storage
- Create a YARN-based MapReduce framework
- Generate test data of meaningful size (e.g. the test could be done on a 1 TB data set)
- Run some classic MapReduce tasks and document the results
- Rewrite the same logic with Apache Spark
- Run the Spark jobs and document the results
- Create a detailed comparison report
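The "rewrite the same logic" step can be previewed with a pure-Python sketch of a word count in both styles. Neither version is actually distributed, and the helper names are invented, but they mirror the shape of classic Hadoop MapReduce code (explicit map, shuffle, reduce phases) versus a chained Spark-style pipeline.

```python
from functools import reduce

text = ["spark is fast", "mapreduce is classic", "spark is popular"]

# Classic MapReduce style: explicit map phase, shuffle (group by key),
# then a reduce phase per key.
mapped = [(w, 1) for line in text for w in line.split()]
groups = {}
for word, n in mapped:                       # shuffle step
    groups.setdefault(word, []).append(n)
mr_counts = {w: reduce(lambda a, b: a + b, ns) for w, ns in groups.items()}

# Spark style: the same logic expressed as one chained pipeline,
# in the spirit of flatMap(...).map(...).reduceByKey(...).
def flat_map(f, xs):
    return [y for x in xs for y in f(x)]

def reduce_by_key(f, pairs):
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v
    return acc

spark_counts = reduce_by_key(
    lambda a, b: a + b,
    [(w, 1) for w in flat_map(str.split, text)],
)
```

Both pipelines must of course produce identical counts; only their structure (and, on a cluster, their performance) differs.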
Machine Learning and Predictive Analysis
Project Specifications
Category: Statistical Data Analysis
Project Type: Applied Research
Software Requirements: Java and/or Python, Eclipse, R
Hardware Requirements: None
Prerequisites: Basic statistics, R
Capacity: 2 applicants
Machine Learning and Predictive Analysis
Project Outline
Observational data, such as monitoring data, can be represented as time series, and multivariate time series can be represented as matrices. If the individual time series (matrix rows) vary similarly, then it is possible to use machine learning and statistical tools to predict the future behavior of the data. A very simple approach is the so-called multivariate linear regression method. It allows predicting future data and also finding a confidence level for every single time series.
Tasks
- Develop a program that will read data in JSON format and convert it to multivariate time series, i.e. (timestamp, value) pairs.
- First, write a program for simple linear regression and prediction based on it.
- Examine other single-value prediction algorithms/methods such as Holt-Winters, linear regression, ARIMA, and state-space models.
- Check the suitability of the data for a given machine learning algorithm.
- Create a test data collection to test the proposed algorithms and interpret the results.
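The simple linear regression task can be sketched as follows, assuming the JSON parsing step has already produced (timestamp, value) pairs. The series below is invented and perfectly linear so the least-squares arithmetic is easy to check by hand.

```python
def fit_linear(ts, ys):
    """Fit y = a + b*t by ordinary least squares; return (a, b)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    b = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys)) \
        / sum((t - mean_t) ** 2 for t in ts)
    a = mean_y - b * mean_t
    return a, b

def predict(a, b, t):
    """Predict the value of the series at time t."""
    return a + b * t

# Timestamps and observed values of one monitoring metric (invented).
timestamps = [0, 1, 2, 3, 4]
values = [10.0, 12.0, 14.0, 16.0, 18.0]

a, b = fit_linear(timestamps, values)
next_value = predict(a, b, 5)
```

The multivariate case applies the same fit to each matrix row; the residuals of the fit are what the confidence-level computation would build on.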
Social Network Analysis
Project Outline
The goal of this project is to take a news topic, e.g. the Presidential Election, or a group of Twitter users, such as professional rugby players, and analyse the related activity on Twitter. Rather than attempting to analyse the content of the tweets, the idea is to analyse the characteristics of the Twitter network itself; for an example of this type of analysis, see the reading list below. The reading list contains links to material on the Twitter API and Twitter4J, a Java library for Twitter. Similar resources are available for Python and Ruby.
Tasks
- Develop a system that will take as input a collection of Twitter user IDs and will generate the data around these users in a form that can be visualized.
- Visualization of the generated data on a web page is desired.
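As a hypothetical sketch of analysing network characteristics rather than tweet content, the snippet below computes in-degree centrality (how often a user is followed within the group) over an invented follower graph. Real edges would come from the Twitter API, e.g. via Twitter4J or a Python client.

```python
from collections import Counter

# Invented (follower, followed) edges; in the project these would be
# fetched for the input collection of Twitter user IDs.
edges = [
    ("alice", "carol"),
    ("bob", "carol"),
    ("carol", "alice"),
    ("dave", "carol"),
    ("dave", "alice"),
]

# In-degree: number of followers each user has inside the group.
in_degree = Counter(followed for _, followed in edges)
most_central = in_degree.most_common(1)[0][0]
```

A dict like `in_degree` is already in a form a web visualization (e.g. sized graph nodes) can render directly.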
NoSQL Systems Comparison Analysis
Project Specifications
Category: Big Data and NoSQL
Project Type: Applied Research
Software Requirements: Java/Python, the NoSQL systems chosen for comparison
Hardware Requirements: Servers (virtual/dedicated) hosting the NoSQL systems
Prerequisites: Java, Python, object-oriented programming, basic understanding of NoSQL systems
Capacity: 2-3 applicants
NoSQL Systems Comparison Analysis
Project Outline
The recent trends and developments in "NoSQL" systems make the decision process for corporate system selection harder. Having a toolset, together with comparison reports and/or charts, therefore allows a more justified decision on system selection and customization. The goal of this project is to develop benchmarking tools for testing NoSQL systems of a given type (e.g. in-memory stores), and to provide articles, benchmark test results and comparison reports made with the developed tools.
Tasks
- Create a feature comparison report
- Develop benchmarking tools for the systems under test. The tool(s) should be able to run:
  - Write load tests
  - Read load tests
  - MapReduce tests (if available)
  - Cluster tests
  - Sharding tests
- Create a detailed report on the compared systems' features and benchmark test results
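The write- and read-load tests can be sketched with a small, hypothetical harness: the dict-backed store below stands in for a real NoSQL client (a Redis or Memcached driver, say), and the harness times any operation applied n times, reporting throughput.

```python
import time

class DictStore:
    """In-memory stand-in for a NoSQL client under test."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

def run_load(op, n):
    """Apply op to keys 0..n-1; return (elapsed seconds, ops/sec)."""
    start = time.perf_counter()
    for i in range(n):
        op(i)
    elapsed = time.perf_counter() - start
    return elapsed, n / elapsed if elapsed > 0 else float("inf")

store = DictStore()
n = 10_000
write_time, write_rate = run_load(lambda i: store.put(i, str(i)), n)
read_time, read_rate = run_load(store.get, n)
```

Swapping `DictStore` for each candidate system's client, while keeping `run_load` fixed, is what makes the resulting numbers comparable across systems.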
Statistical Data Analysis Environment
Project Outline
There are many tools based on R, Python and Java for statistical data analysis. Some web-based environments also exist for doing simple statistical analysis, but without any embedded test data. If one wants to test the reliability of an analysis result, however, some test data should exist, or the tool should be able to extract data from a given test database. The aim of this project is therefore to combine different statistical tools with data, for doing analysis and visualising the results. For examples, see RStudio Server and StatAce.
Tasks
- Develop a web interface for statistical analysis and visualisation.
- Create a test database and connect it to the web interface.
- Combine different statistical tools and programming languages, such as R, Python's SciPy, and Java, to work in one environment with the same data.
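A hypothetical sketch of the kind of routine such an environment might expose over the shared test data: summary statistics with an approximate 95% confidence interval for the mean, using only Python's standard library (R or SciPy equivalents would plug into the same interface). The sample values are invented.

```python
import statistics

def summarize(values):
    """Summary stats plus an approximate 95% CI for the mean
    (normal approximation; adequate for a quick reliability check)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    half_width = 1.96 * sd / n ** 0.5
    return {
        "n": n,
        "mean": mean,
        "stdev": sd,
        "ci95": (mean - half_width, mean + half_width),
    }

# Invented sample standing in for one column of the test database.
sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
report = summarize(sample)
```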
Software Performance
Project Outline
The goal of this project is to monitor the performance of any Java application using aspect-oriented programming (AOP). AOP is a programming concept that allows intercepting calls at run time and modifying the program, and it is often used to separate concerns in software development, typically for crosscutting concerns such as logging or performance monitoring.
Tasks
- Run testing systems on several Java libraries/applications and study the testing process (such as when and where faults are injected, and how faults are detected)
- Write a simple performance monitoring tool for the testing systems using AOP
- Visualize the performance of the testing systems on a web page. Integration with the Monitis system is desired; the cooperation with the external monitoring platform should be done via the Java Management Extensions (JMX) technology.
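Since the project targets Java AOP (e.g. with AspectJ), the following is only an analogy sketched in Python: a decorator intercepts calls and records timing, illustrating the crosscutting idea of wrapping a method with monitoring logic without editing its body. All names are invented.

```python
import time
import functools

timings = {}   # per-function call count and cumulative elapsed time

def monitored(func):
    """Intercept calls to func, AOP-style, and record timing stats."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            stats = timings.setdefault(
                func.__name__, {"calls": 0, "seconds": 0.0})
            stats["calls"] += 1
            stats["seconds"] += time.perf_counter() - start
    return wrapper

@monitored
def handle_request(n):
    # Stand-in for real application work being monitored.
    return sum(range(n))

for _ in range(3):
    handle_request(1000)
```

In the Java version, the `timings`-style statistics would be published as MBean attributes so an external platform like Monitis can read them over JMX.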
Cloud Computing Platforms Comparison
Project Outline
Currently, business applications are moving more and more to the cloud. It is not just a fad - the shift from traditional software models to the Internet has steadily gained momentum over the last 10 years. The goal of this project is to compare and test the main functionality and performance available on Amazon EC2, Google Compute Engine, and Microsoft Windows Azure.
Tasks
- Give an overview of cloud computing technologies and the architectures provided by well-known vendors
- Implement a cloud-based application and a benchmark test instrument
- Run the benchmark test on various well-known cloud platforms
- Provide a comparison of the Amazon EC2, Google Compute Engine and Microsoft Windows Azure cloud platforms
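A hypothetical sketch of the benchmark instrument's reporting side: given measured request latencies per platform (the numbers below are invented), report the median and 95th-percentile latency, which are usually more informative than the mean when comparing platforms.

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of numbers (p in 0..100)."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Invented latency samples (ms) standing in for measurements taken
# by the benchmark application on each platform.
latencies_ms = {
    "platform_a": [20, 22, 21, 25, 90, 23, 22, 24, 21, 26],
    "platform_b": [30, 31, 29, 32, 30, 33, 31, 30, 29, 32],
}

report = {
    name: {"p50": percentile(xs, 50), "p95": percentile(xs, 95)}
    for name, xs in latencies_ms.items()
}
```

Note how platform_a's single 90 ms outlier dominates its p95 while barely moving its median; this is why the comparison report should present percentiles, not just averages.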