Monitis Project Proposals for AUA September 2014, Yerevan, Armenia

Distributed Log Collecting and Analysing Platform
Project Specifications
Category: Big Data and NoSQL
Software Requirements: Apache Hadoop ecosystem projects, Java, Python, MapReduce/Spark, machine learning
Hardware Requirements: Cluster hosting the data platform, test servers
Prerequisites: Java programming, Linux and networking fundamentals, a basic understanding of NoSQL systems
Capacity: 8 applicants

Distributed Log Collecting and Analysing Platform
Project Outline
As corporate and enterprise systems grow in size and become increasingly distributed, centralized log collection, analysis and processing become more and more challenging. The goal of this project is to create a platform that stores log data generated by corporate services and provides an interface for interactive querying and analysis.

Tasks
- Create a distributed Hadoop-based platform for log storage
- Create a MapReduce framework
- Organize the log data flow (Apache Flume, Fluentd, etc.)
- Organize querying with SQL syntax (Hive, Impala, Hortonworks Stinger/Stinger.next, etc.)
- Organize real-time search (Apache Solr, Elasticsearch)
- Run machine learning algorithms on the log data to produce health reports for the whole system
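
As a rough illustration of where a health report might start, the sketch below aggregates log lines into a per-service error rate. The log line format (`timestamp service LEVEL message`), the service names and the function names are all assumptions made for this example; the proposal does not define them. No machine learning is involved here, only a simple aggregate that a later model could consume.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) log format: "<timestamp> <service> <LEVEL> <message>"
LOG_LINE = re.compile(r"^(\S+)\s+(\S+)\s+(DEBUG|INFO|WARN|ERROR)\s+(.*)$")


def parse_line(line):
    """Return (service, level) for a well-formed line, or None otherwise."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    _timestamp, service, level, _message = match.groups()
    return service, level


def error_rates(lines):
    """Compute the fraction of ERROR lines per service."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip malformed lines instead of failing
        service, level = parsed
        totals[service] += 1
        if level == "ERROR":
            errors[service] += 1
    return {service: errors[service] / totals[service] for service in totals}


# Hypothetical sample data, including one malformed line.
sample_logs = [
    "2014-09-01T10:00:00 auth INFO user logged in",
    "2014-09-01T10:00:01 auth ERROR token expired",
    "2014-09-01T10:00:02 billing INFO invoice created",
    "not a log line",
    "2014-09-01T10:00:03 auth INFO user logged out",
    "2014-09-01T10:00:04 billing INFO invoice sent",
]

health_report = error_rates(sample_logs)
```

On a real cluster the same aggregation would run as a distributed job over data delivered by the collection layer; the in-memory list only stands in for that input.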

Machine Learning and Parallel Computing
Project Specifications
Category: Big Data and NoSQL
Project Type: Applied Research
Software Requirements: Java, Eclipse
Hardware Requirements: None
Prerequisites: MapReduce, Java, object-oriented programming
Capacity: 2 applicants

Machine Learning and Parallel Computing
Project Outline
MapReduce is a software framework for distributed computing introduced by Google. Although MapReduce has been criticized for its distributed computing capabilities, it is ideally suited to the type of experimental evaluation carried out in machine learning (ML) and data mining. Testing in machine learning typically involves running variants of algorithms on different datasets, and this type of work can be distributed across nodes very effectively using MapReduce.

Tasks
- Implement two machine learning algorithms in a MapReduce framework and demonstrate the evaluation of these algorithms across a number of datasets
- The algorithms do not need to be implemented from scratch, as Java code for most machine learning algorithms is publicly available; see the Weka code, for instance
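
To make the idea of distributing evaluations concrete, here is a minimal single-process sketch in Python. Each (algorithm, dataset) pair is an independent map task, and the reduce step averages accuracy per algorithm. The two toy classifiers, the datasets and all function names are hypothetical stand-ins chosen for brevity; a real project would plug in library implementations (for example from Weka) and run the map tasks on a cluster.

```python
from collections import defaultdict
from itertools import product


def majority_classifier(train):
    """Always predict the most frequent label seen in training."""
    labels = [label for _, label in train]
    most_common = max(sorted(set(labels)), key=labels.count)
    return lambda x: most_common


def threshold_classifier(train):
    """Predict 1 when the feature exceeds the midpoint between the class means."""
    zeros = [x for x, label in train if label == 0]
    ones = [x for x, label in train if label == 1]
    midpoint = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > midpoint else 0


ALGORITHMS = {"majority": majority_classifier, "threshold": threshold_classifier}

# Toy datasets of (feature, label) pairs, split into train and test parts.
DATASETS = {
    "separable": {
        "train": [(1.0, 0), (2.0, 0), (3.0, 0), (8.0, 1), (9.0, 1)],
        "test": [(1.5, 0), (8.5, 1), (9.5, 1), (2.5, 0)],
    },
    "skewed": {
        "train": [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0), (9.0, 1)],
        "test": [(1.0, 0), (2.0, 0), (3.5, 0), (9.0, 1)],
    },
}


def map_task(task):
    """Map phase: evaluate one (algorithm, dataset) pair independently."""
    algorithm_name, dataset_name = task
    data = DATASETS[dataset_name]
    predict = ALGORITHMS[algorithm_name](data["train"])
    correct = sum(1 for x, label in data["test"] if predict(x) == label)
    return algorithm_name, correct / len(data["test"])


def reduce_by_key(pairs):
    """Reduce phase: average the accuracies emitted for each algorithm."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) / len(values) for key, values in grouped.items()}


tasks = list(product(ALGORITHMS, DATASETS))
mean_accuracy = reduce_by_key(map(map_task, tasks))
```

Because no map task depends on another, the task list can be split across nodes without coordination, which is the property the outline relies on.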

Apache Spark vs. Classic MapReduce
Project Specifications
Category: Big Data and NoSQL
Project Type: Applied Research
Software Requirements: Apache Hadoop, Java/Scala/Python, MapReduce, machine learning
Hardware Requirements: Cluster hosting the data platform
Prerequisites: Java, Python, object-oriented programming, basic understanding of NoSQL systems
Capacity: 2 applicants

Apache Spark vs. Classic MapReduce
Project Outline
Apache Spark is a fast, general engine for large-scale data processing, and it is becoming more and more popular these days. Its developers claim that Spark is more than 100 times faster than classic MapReduce processing. The goal of this project is to create a data platform, collect some test data (e.g. Twitter tweets), and provide a comparative analysis of both systems.

Tasks
- Create a Hadoop cluster for distributed data storage
- Create a YARN-based MapReduce framework
- Write test data of some size (e.g. the test could be run on a 1 TB data set)
- Run some classic MapReduce tasks and document the results
- Rewrite the same logic with Apache Spark
- Run the Spark jobs and document the results
- Create a detailed comparison report
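
Word count is a common choice for a "classic MapReduce task" in such comparisons. The sketch below spells out its three phases in plain Python so the logic being ported is explicit; it is a single-process illustration, not Hadoop or Spark code, and the sample records and function names are made up for the example. In Spark the same logic is typically expressed as a `flatMap`, `map`, `reduceByKey` chain on an RDD.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_phase(mapped_pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(records):
    """Run the three phases in sequence on an in-memory list of records."""
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical stand-in for collected tweet texts.
sample_tweets = [
    "Spark is fast",
    "MapReduce is classic",
    "spark and mapreduce",
]

counts = word_count(sample_tweets)
```

Keeping the map and reduce functions identical across both systems helps ensure the comparison measures the engines rather than differences in the job logic.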

Machine Learning and Predictive Analysis
Project Specifications
Category: Statistical Data Analysis
Project Type: Applied Research
Software Requirements: Java and/or Python, Eclipse, R
Hardware Requirements: None
Prerequisites: Basic statistics, R
Capacity: 2 applicants

Machine Learning and Predictive Analysis
Project Outline
Any observational data, such as monitoring data, can be represented as time series, and multivariate time series can be represented as matrices. If the individual time series (the matrix rows) vary similarly, then machine learning and statistical tools can be used to predict the future behavior of the data. A very simple approach is the so-called multivariate linear regression method. It allows predicting future data and also finding a confidence level for every single time series.

Tasks
- Develop a program that reads data in JSON format and converts it to multivariate time series, i.e. (timestamp, value) pairs
- First, write a program for simple linear regression and prediction based on it
- Examine other single-value prediction algorithms/methods, such as Holt-Winters, linear regression, ARIMA and state-space models
- Check whether the data is suitable for a given machine learning algorithm
- Create a test data collection to test the proposed algorithm and interpret the results
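
The first two tasks can be sketched in a few lines of Python. The JSON layout below (an array of objects with `timestamp` and `value` fields) is an assumption, since the proposal does not specify the input schema, and the function names are illustrative. The fit is ordinary least squares for a single series; confidence levels and the multivariate case are left out.

```python
import json

# Assumed (hypothetical) input: a JSON array of {"timestamp": ..., "value": ...} objects.
raw = """
[
  {"timestamp": 0, "value": 1.0},
  {"timestamp": 1, "value": 3.0},
  {"timestamp": 2, "value": 5.0},
  {"timestamp": 3, "value": 7.0}
]
"""


def load_series(text):
    """Convert the JSON document into (timestamp, value) pairs sorted by time."""
    records = json.loads(text)
    return sorted((float(r["timestamp"]), float(r["value"])) for r in records)


def fit_line(series):
    """Ordinary least squares fit of value = intercept + slope * timestamp."""
    n = len(series)
    mean_t = sum(t for t, _ in series) / n
    mean_v = sum(v for _, v in series) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in series)
    variance = sum((t - mean_t) ** 2 for t, _ in series)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept, slope


def predict(model, timestamp):
    """Extrapolate the fitted line to a future timestamp."""
    intercept, slope = model
    return intercept + slope * timestamp


series = load_series(raw)
model = fit_line(series)
forecast = predict(model, 4.0)
```

The sample data lies exactly on the line value = 1 + 2 * timestamp, so the fit recovers those coefficients; real monitoring data would leave residuals, which is where the confidence estimate mentioned in the outline comes in.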

Social Network Analysis
Project Specifications
Category: Software Development
Project Type: Applied Research
Software Requirements: Java and/or Python, Eclipse
Hardware Requirements: None
Prerequisites: Java, Python, object-oriented programming
Capacity: 1 applicant

Social Network Analysis
Project Outline
The goal of this project is to take a news topic (e.g. the Presidential Election) or a group of Twitter users (such as professional rugby players) and analyse the related activity on Twitter. Rather than attempting to analyse the content of the tweets, the idea is to analyse the characteristics of the Twitter network. For an example of this type of analysis, see below. The reading list below contains links to material on the Twitter API and Twitter4J, a Java library for Twitter. Similar resources are available for Python and Ruby.

Tasks
- Develop a system that takes as input a collection of Twitter user IDs and generates data about these users in a form that can be visualized
- Visualization of the generated data on a web page is desired
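
One simple network characteristic is how many followers and followees each user has within the group. The sketch below computes that from a list of "follows" edges and serializes the result as nodes and links for a web-based graph viewer. The user IDs and edges are invented for illustration; in the actual project they would be fetched through the Twitter API, and the JSON shape is an assumption rather than a requirement of any particular visualization library.

```python
import json
from collections import defaultdict

# Hypothetical "follows" relationships between user IDs: (follower, followed).
follows = [
    (101, 102),
    (101, 103),
    (102, 103),
    (104, 103),
    (103, 101),
]


def degree_table(edges):
    """Count followers (in-degree) and followees (out-degree) for every user."""
    stats = defaultdict(lambda: {"followers": 0, "following": 0})
    for follower, followed in edges:
        stats[follower]["following"] += 1
        stats[followed]["followers"] += 1
    return dict(stats)


def to_graph_json(edges):
    """Serialize the network as a JSON object with "nodes" and "links" arrays."""
    stats = degree_table(edges)
    nodes = [{"id": user, **counts} for user, counts in sorted(stats.items())]
    links = [{"source": a, "target": b} for a, b in edges]
    return json.dumps({"nodes": nodes, "links": links})


stats = degree_table(follows)
most_followed = max(stats, key=lambda user: stats[user]["followers"])
graph_json = to_graph_json(follows)
```

Other characteristics, such as reciprocity or clustering, can be derived from the same edge list, so collecting the edges once and analysing them offline keeps API usage low.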

NoSQL Systems Comparison Analysis
Project Specifications
Category: Big Data and NoSQL
Project Type: Applied Research
Software Requirements: Java/Python, NoSQL systems that would be chosen
Hardware Requirements: Servers (virtual/dedicated) hosting the NoSQL systems
Prerequisites: Java, Python, object-oriented programming, basic understanding of NoSQL systems
Capacity: 2-3 applicants

NoSQL Systems Comparison Analysis
Project Outline
Recent trends and developments in NoSQL systems make the decision process for corporate system selection harder. A toolset, together with comparison reports and/or charts, therefore allows a better-justified decision on system selection and customization. The goal of this project is to develop benchmarking tools to test NoSQL systems of a given type (e.g. in-memory stores), and to provide articles, benchmark test results and comparison reports made with the developed tools.

Tasks
- Create a feature comparison report
- Develop benchmarking tools for the systems under test. The tool(s) should be able to run:
  - Write load test
  - Read load test
  - MapReduce test (if available)
  - Cluster tests
  - Sharding test
- Create a detailed report on the features of the compared systems and on the benchmark tests
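
A write and read load test can be reduced to timing a batch of operations against a common client interface. The sketch below does this in Python against `InMemoryStore`, a hypothetical dictionary-backed stand-in; a real benchmark would replace it with a thin adapter around each NoSQL client under test. The class, function names and key format are assumptions made for the example.

```python
import time


class InMemoryStore:
    """Hypothetical stand-in for a NoSQL client with put/get operations."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def run_load_test(operation, keys):
    """Apply `operation` to every key and report throughput in operations per second."""
    start = time.perf_counter()
    for key in keys:
        operation(key)
    elapsed = time.perf_counter() - start
    return {
        "operations": len(keys),
        "seconds": elapsed,
        "ops_per_second": len(keys) / elapsed if elapsed > 0 else float("inf"),
    }


def benchmark(store, n_keys=10_000):
    """Run a write load test followed by a read load test on the same keys."""
    keys = [f"key-{i}" for i in range(n_keys)]
    write_result = run_load_test(lambda k: store.put(k, k.upper()), keys)
    read_result = run_load_test(store.get, keys)
    return {"write": write_result, "read": read_result}


store = InMemoryStore()
report = benchmark(store, n_keys=1_000)
```

Keeping the timing loop independent of the store means every system is measured by the same code path, which makes the resulting comparison charts easier to defend. Single-threaded timing like this understates what a cluster can sustain, so a fuller tool would also drive concurrent clients.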

Statistical Data Analysis Environment
Project Specifications
Category: Statistical Data Analysis
Project Type: Applied Research
Software Requirements: Java, JavaScript and/or Python, Eclipse, R
Hardware Requirements: None
Prerequisites: Basic statistics, R
Capacity: 1 applicant

Statistical Data Analysis Environment
Project Outline
Many tools based on R, Python and Java exist for statistical data analysis. Some web-based environments also allow simple statistical analysis, but without any embedded test data. To test the reliability of analysis results, however, either test data should exist or the tool should be able to extract data from a given test database. The aim of this project is therefore to combine different statistical tools with data in order to run analyses and visualise the results. For examples, see RStudio Server and StatAce.

Tasks
- Develop a web interface for statistical analysis and visualisation
- Create a test database and connect it to the web interface
- Combine different statistical tools and programming languages, such as R, Python's SciPy and Java, to work in one environment with the same data

Software Performance
Project Specifications
Category: Software Development
Project Type: Design and Implementation
Software Requirements: Java, unit testing framework (JUnit), logging (Log4J), Eclipse
Hardware Requirements: None
Prerequisites: Java, object-oriented programming
Capacity: 1 applicant

Software Performance
Project Outline
The goal of this project is to monitor the performance of any Java application using aspect-oriented programming (AOP). AOP is a programming concept that allows intercepting calls at run time and modifying the program; it is often used to separate concerns in software development, typically for crosscutting things such as logging or performance monitoring.

Tasks
- Run testing systems on several Java libraries/applications and study the testing process (such as when and where faults are injected and how faults are detected)
- Write a simple performance monitoring tool for the testing systems using AOP
- Visualize the performance of the testing systems on a web page
- Integration with the Monitis system is desired. Cooperation with the external monitoring platform should be done via Java Management Extensions (JMX) technology.
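
The interception idea behind AOP can be illustrated with a Python decorator, which wraps a function so that timing is recorded around each call without touching the function body. This is only an analogue chosen for brevity: the project itself targets Java, where an AOP framework such as AspectJ would weave the equivalent advice. The `monitored` decorator, the `timings` registry and the `checksum` function are hypothetical names for this sketch.

```python
import functools
import time
from collections import defaultdict

# Collected timings per function name: the "monitoring" side of the aspect.
timings = defaultdict(list)


def monitored(func):
    """Wrap `func` so every call is timed, without changing the function body."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even if the call raises an exception.
            timings[func.__name__].append(time.perf_counter() - start)

    return wrapper


@monitored
def checksum(values):
    """Example business logic that knows nothing about monitoring."""
    return sum(values) % 97


result = checksum(range(1000))
checksum(range(10))

# Per-function summary that a dashboard or an external monitor could consume.
summary = {
    name: {"calls": len(samples), "total_seconds": sum(samples)}
    for name, samples in timings.items()
}
```

The monitored function stays free of timing code, which is the separation of concerns the outline describes; the summary dictionary is the kind of data a JMX bean could expose in the Java version.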

Cloud Computing Platforms Comparison
Project Specifications
Category: Cloud Computing
Project Type: Applied Research
Software Requirements: Java/Python, Eclipse
Hardware Requirements: None
Prerequisites: Java, Python, object-oriented programming
Capacity: 2 applicants

Cloud Computing Platforms Comparison
Project Outline
Business applications are currently moving more and more to the cloud. It's not just a fad: the shift from traditional software models to the Internet has steadily gained momentum over the last 10 years. The goal of this project is to compare and test the main functionality and performance available on Amazon EC2, Google Compute Engine, and Microsoft Windows Azure.

Tasks
- Give an overview of cloud computing technologies and the architectures provided by well-known vendors
- Implement a cloud-based application and a benchmark test instrument
- Run benchmark tests on various well-known cloud platforms
- Provide a comparison of the Amazon EC2, Google Compute Engine and Microsoft Windows Azure cloud platforms
Projects Structure
Thank You! Questions?