E6893 Big Data Analytics Lecture 12: Final Project Proposals


1 E6893 Big Data Analytics Lecture 12: Final Project Proposals Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center November 20th, 2014

2 Course Structure
Class Date | Number | Topics Covered
09/04/14 | 1 | Introduction to Big Data Analytics
09/11/14 | 2 | Big Data Analytics Platforms
09/18/14 | 3 | Big Data Storage and Processing
09/25/14 | 4 | Big Data Analytics Algorithms -- I
10/02/14 | 5 | Big Data Analytics Algorithms -- II (recommendation)
10/09/14 | 6 | Big Data Analytics Algorithms -- III (clustering)
10/16/14 | 7 | Big Data Analytics Algorithms -- IV (classification)
10/23/14 | 8 | Big Data Analytics Algorithms -- V (classification & clustering)
10/30/14 | 9 | Linked Big Data Graph Computing I (Graph DB)
11/06/14 | 10 | Linked Big Data Graph Computing II (Graph Analytics)
11/13/14 | 11 | Linked Big Data Graph Computing III (Graphical Models & Platforms)
11/20/14 | 12 | Final Project First Presentations
11/27/14 | | Thanksgiving Holiday
12/04/14 | 13 | Next Stage of Big Data Analytics
12/11/14 | 14 | Big Data Analytics Workshop Final Project Presentations

3 Proposal List (#1 - #17, Page 7-76)
# | Industry Sector | Project name
1 | Finance | Exchange Rates Inquiry and Analysis
2 | Finance | Algorithm Trading Strategies Using Hadoop MapReduce
3 | Finance | Image Classification in the Cloud and GPU (H-Classification & G-Classification)
4 | Finance | Google-Analytics, Graph based Online Movie Recommendation System
5 | Finance | Currency Trend Analyzer
6 | Finance | Correlating Price / Volume of Low Volume Stocks with Social Media
7 | Finance | Trading Using Nonparametric Time Series Classification Models
8 | Finance | Stock Forecasting Using Hadoop Map-Reduce
9 | Finance | Real-time Risk Management System
10 | Finance | Stock Daily Price Predictions Based on News
11 | Finance | Stock Recommendation System
12 | Finance | Stock signal generation using real time news analysis
13 | Finance | Customer Complaint Analyses
14 | Finance | Financial Market Volatility
15 | Finance | Salary Engine
16 | Finance | Stock price Movement Prediction with Hadoop+Mahout & Pydoop+Scikit
17 | Finance | Sector-based Classification and Clustering of Financial News Articles

4 Proposal List (#18 - #33, Page ) Information Information Information Information Information Information Information Information Information Information Life Science Life Science Life Science Life Science Life Science Life Science TV Analytics Project Nova Game Outcome Analysis Exploring the Online and Offline Social World Yelp Review Analysis and Recommendation Sentiment Analysis on Movie Document Analysis with Latent Dirichlet Allocation Yelp Recommendation Analysis Using Big Data (Hadoop) for Identification of Aberrant Behavior Clusters in Server Performance Time Series Dat Crocuta: JavaScript Analytics System Learning Brain Activity From fMRI Images Network Analysis on the Big Cancer Genome Data Reversal Prediction from Physiology Data EEGoVid: An EEG-Based Interest Level Video Recommendation Engine Brain Edge Detection BrainBow: Reconstruction of Neurons

5 Proposal List (#34 - #55, Page ) Retail Retail Retail Retail Retail Retail Retail Retail Retail Telecom Telecom Telecom Telecom Telecom Transportation and Energy Transportation and Energy Transportation and Energy Transportation and Energy Transportation and Energy Transportation and Energy Transportation and Energy Transportation and Energy Yelp Fake Review Detection Predicting Excitement at donorschoose.org Market strategy suggestions for B2C websites Résumé Category Classification TV Genome Project / Recommendation Engine Analytics Media Acquire Valued Shoppers Twitter-Based Product / Sales Events Recommender Yelp-er: Analyzing Yelp Data Big Mobile Data Network Congestion Analysis Comparison Analysis of Different Telecoms Operators User's Web Events Analysis Based on Browser Extension Analysis of telecom service in cellular networks Human Activity Monitoring and Prediction Minimizing Risk in Energy Arbitrage Best Transportation Choice Manage Energy Consumption By Smart Meters Location Specific Optimization of Taxi Efficiency in NYC Citi Bike System Data Analysis PeopleMaps Project Transeo: Making public buses more efficient and accessible Image Based Geo-localization

6 Proposal List (#56 - #77, Page ) Media Media Media Media Media Media Media Media Media Media Media Media Media Media Social Science-Government Social Science-Government Social Science-Government Social Science-Government Social Science-Government Social Science-Government Social Science-Government Social Science-Government HVision PlayPalate Movie Exploration Fantasy Basketball Fantasy Basketball Prediction Using Previous Season's Data Hunting for NBA players Music-Links Affective Computational Cinematography Improving Movie Recommender System with User Behavior changes and Demographics MOVIE RECOMMENDATION AND ANALYSIS OF THIS APPLICATION TV Genome Project / Recommendation Engine TrendyWriter Analysis on pricing strategy for sports team Spark NLP Predicting usefulness of restaurants reviews from subtopics using Yelp data Study Buddy Big Data Analysis on Log Data of Standardized IBT Test (TOEFL) Taker for Effects of Selection Changing Improving Education for At-risk Students Scratch Analyzer Oscar Award Analysis based on big data Error Correction in Large Volume OCR Datasets How to name your new born baby (babies)?

7 E6893 Big Data Analytics Finance Group Proposals 7 Nov 20, 2014

8 E6893 Big Data Analytics: Project Name Exchange Rates Inquiry and Analysis Team Members: Mengnan Wang(mw2969), Xiaomeng Zhang(xz2350), Jianze Wang(jw3127), Wanding Li(wl2501) 8 Nov 20, 2014

9 Motivation As international trade and commerce grow in importance, daily access to up-to-date currency rates matters not only to specific industries but to all of us. By making exchange rate information more convenient to obtain, our project aims to equip people with the knowledge they need to make better decisions in the currency market.

10 Dataset, Algorithm, Tools
Dataset:
- Instant exchange rates from Bloomberg
- Historical exchange rates from
Algorithm: We will forecast the exchange rates of selected countries' currencies against the US Dollar in a developed market, applying a scalable model to forecast in real time. We will use RMSE to measure the accuracy and reliability of our predictions. In addition, we will report statistical significance (such as p-values) to support the repeatability of the results.
Tools: Eclipse, Tomcat, Apache
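The RMSE measure named on this slide can be illustrated with a minimal sketch. The rate values below are made up for illustration and are not project data; the function name `rmse` is likewise an assumption.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between observed and forecast values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

# Hypothetical daily exchange rates and forecasts (illustrative values only).
actual_rates = [1.10, 1.12, 1.11, 1.15]
forecast_rates = [1.11, 1.10, 1.12, 1.13]
print(rmse(actual_rates, forecast_rates))
```

A lower RMSE means the forecasts sit closer to the observed rates, in the same units as the rates themselves.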

11 Progress and Expected Contributions
Expected Contributions: Our project is initially designed to provide users with the following:
- Forward and cross exchange rates for most world currencies
- Both instant and historical data
- Basic analysis of exchange rates, including regression and K-means
- Latest news about exchange rates
Current Progress: A sketch webpage design that looks something like
Key Factors Affecting Exchange Rate:
- PPP (Purchasing Power Parity)
- INT (Interest Rate Differential), such as Libor
- GDP (The Difference in GDP Growth Rates)
- IGP (Income Growth Rate)
- Relative Economic Strength: we may use factors like GDP and IGP to measure it quantitatively

12 E6893 Big Data Analytics: Project Name: Algorithm Trading Strategies Using Hadoop MapReduce Team Members: Yifan Wu, Meibin Chen 12 Nov 20, 2014

13 Motivation
- Improvement of internet speed and storage
- Data flooding everywhere
- Hadoop is able to handle this => MapReduce is the tool for parallel computing of the data
Usage of algorithm trading: Algorithm trading is the use of computer programs for entering trading orders, in which computer algorithms decide on every aspect of the order, such as its timing, price, and quantity.
1. Backtest the algorithm using enough historical price data to validate and optimize the algorithm in terms of profitability, stability, etc.
2. For a complex algorithm, there are many parameters that need to be optimized.
Benefits:
1. The system will build and select the most suitable strategy 20% faster than before.
2. The enhanced platform doubled the number of strategy groups.
3. Last, the strategies can now be updated more frequently and can include more parameters in the analysis.

14 Dataset, Algorithm, Tools
Dataset: The dataset we chose is the stock index futures data from the Shanghai Futures Exchange (big bull market of A shares), or others.
Algorithm: The algorithm is a combination of Moving Average Convergence Divergence (MACD) and Relative Strength Index (RSI).
- MACD parameters: the long average period and the short average period
- RSI parameters: the upper threshold, the lower threshold, and the calculation period
The project will be divided into two parts:
1. The MapReduce platform (Hadoop)
2. The trading strategies to test on the data using this platform
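The two indicators and their parameters can be sketched as follows. This is a minimal illustration, not the team's implementation: it assumes an exponential moving average seeded with the first price, a simple-average RSI variant, and a purely hypothetical rule (`trade_signal`) for combining the two.

```python
def ema(prices, period):
    """Exponential moving average, seeded with the first price (an assumption)."""
    alpha = 2.0 / (period + 1)
    value = prices[0]
    for price in prices[1:]:
        value = alpha * price + (1 - alpha) * value
    return value

def macd(prices, short_period, long_period):
    """MACD line: short-period EMA minus long-period EMA."""
    return ema(prices, short_period) - ema(prices, long_period)

def rsi(prices, period):
    """RSI over the last `period` price changes, using simple averages."""
    if len(prices) <= period:
        raise ValueError("need more than `period` prices")
    changes = [b - a for a, b in zip(prices[-period - 1:-1], prices[-period:])]
    gains = sum(c for c in changes if c > 0)
    losses = sum(-c for c in changes if c < 0)
    if losses == 0:
        return 100.0
    return 100.0 - 100.0 / (1.0 + gains / losses)

def trade_signal(prices, short_period, long_period, rsi_period, lower, upper):
    """Hypothetical combination rule, chosen only for illustration."""
    m = macd(prices, short_period, long_period)
    r = rsi(prices, rsi_period)
    if m > 0 and r < upper:   # upward momentum, not yet overbought
        return "buy"
    if m < 0 and r > lower:   # downward momentum, not yet oversold
        return "sell"
    return "hold"
```

The six arguments after `prices` correspond to the parameters listed on the slide, which is what makes the strategy a natural target for a parameter search.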

15 Progress and Expected Contributions
Inner MapReduce:
- Input: Daily price data. Each line contains 100 days of price information
- Output: The performance of the parameters on the data
Outer MapReduce:
- Input: Parameter combinations. Each line contains one combination of parameters
- Output: The best parameters
Contribution: Algorithm trading has many uses in investment strategy, including market making, inter-market spreading, arbitrage, and pure speculation. With MapReduce, it can run faster, handle multiple tasks, and update closer to real time.
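The inner and outer jobs can be mimicked on a single machine to show the data flow. The sketch below is plain Python rather than Hadoop code, and the scoring function `evaluate` is a toy moving-average crossover assumed purely for illustration; the names `inner_job` and `outer_job` are also hypothetical.

```python
from functools import reduce
from itertools import product
from operator import add

def evaluate(params, window):
    """Toy score: profit from holding whenever the short average exceeds the long one."""
    short, long_ = params
    profit = 0.0
    for t in range(long_, len(window) - 1):
        short_avg = sum(window[t - short + 1:t + 1]) / short
        long_avg = sum(window[t - long_ + 1:t + 1]) / long_
        if short_avg > long_avg:
            profit += window[t + 1] - window[t]
    return profit

def inner_job(params, windows):
    """Inner job: map one parameter combination over every price window, then sum."""
    return reduce(add, map(lambda w: evaluate(params, w), windows), 0.0)

def outer_job(grid, windows):
    """Outer job: score every parameter combination and keep the best one."""
    scored = [(params, inner_job(params, windows)) for params in grid]
    return max(scored, key=lambda item: item[1])

# Hypothetical inputs: one steadily rising window and a small parameter grid.
windows = [list(range(1, 21))]
grid = list(product([2, 3], [5, 8]))
best_params, best_score = outer_job(grid, windows)
```

In a cluster setting each call to `evaluate` is independent, which is why the parameter sweep parallelizes well.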

16 E6893 Big Data Analytics: Image Classification in the Cloud and GPU (H-Classification & G-Classification) Team Members: Anand Rajan & Eric Johnson 16 Nov 20, 2014

17 Motivation
- With the advent of social media, the number of mobile pictures being taken and uploaded is increasing exponentially
- Although most photos are uploaded with some basic metadata (date, time, camera model, and possibly geo-location), a great deal of detail is missing when they enter the cloud. Unless users manually go through and tag each image, this can create a search nightmare
- Example: How do you find that picture you took a few years back while on vacation in Paris? It was under a bridge by the river, right? Resorting to clicking through hundreds of photos or waiting for images to cache on your phone can take forever when you want to show someone in a pinch.
Challenge: To make image search more effective, it will be important to develop and utilize advanced algorithms to help auto-tag images. Doing so can help narrow down image search and improve the quality of search results.
- Leveraging the Yahoo Labs Flickr dataset, we plan to test and build upon feature extraction methods, using a parallelized computing system to efficiently extract image characteristics.
- Using these image characteristics, we will train and test the classification of these images and evaluate it based on precision. Beyond this step, we also plan to experiment with a GPU-powered processing system to evaluate the added benefits and performance benchmarks it might offer during the image analysis stage over a standard distributed system.

18 Dataset, Algorithm, Tools
Dataset: Yahoo/Flickr Image Dataset (83GB)
- 2 Million User Photos (200,000 x 10 categories)
- Contains: photo_id, jpeg url, and some corresponding metadata such as the title, description, camera type, tags.
- Addition: Flickr API. Details like comments, favorites, and social network data can be queried.
- Data is broken into 10 categories: 1 nature, 2 food, 3 people, 4 wedding, 6 sky, 7 london, 8 beach, 10 travel, and music
Toolset: R, Java, Hadoop/Mahout, NVidia CUDA
- Linux based Cluster Array: 26x 2.7GHZ Intel Xeon CPUs, 64 GB of GDDR3 RAM
- Windows 7 Desktop: 2.7GHZ Intel i7 CPU, 24 GB of GDDR3 RAM, 2GB 256-Bit GDDR5 w/ 1344 CUDA Cores
Algorithms: Implemented by Stage
- Feature Extraction / Reduction: SIFT, HOG, PCA (Map stage), good for GPU
- Encoding: KMeans Clustering (Reduce stage)
- Classification: Naïve Bayes, Log SGD Regression

19 Progress and Expected Contributions
Progress
- Conducted thorough research on open-source image classification packages as well as the tools required to perform feature extraction and analysis
- Set up an environment for the Hadoop distributed system
- Acquired the Yahoo Flickr dataset and began initial testing
- Acquired hardware and performed initial tests on a GPU-based system
Expected Contributions
- Our goal is to research, implement, and build upon the current open-source offerings for image classification to help improve the auto-tagging process of digital photos
- If given the time and resources, we aim to develop a web-based interface that will allow users to upload an image and perform a feature analysis and extraction to determine which tags and keywords are associated with it
Potential Challenges
- Research on feature extraction is very resource intensive, so we have picked a challenging project for only two students
- Conducting such a challenging project will involve many abstract mathematical models
- Companies like Yahoo and Facebook invest millions in this field
- *We anticipate this to be an ongoing project that we can continue well beyond the course and perhaps into the second semester

20 E6893 Big Data Analytics: Google-Analytics, Graph based Online Movie Recommendation System Team Members: Tian Han, Yifan Du, and Hang Guan 20 Nov 20, 2014

21 Motivation - To build our own movie website and implement the movie recommendation functionality. 21

22 Process Flow [Diagram with components: Recommendation, Movie Website, Users log files, Graph Database, Google Analytics, Key words for query]

23 Dataset, Algorithm, Tools
- Dataset: MovieLens 1M dataset, which contains 1 million ratings from 6000 users on 4000 movies
- Algorithm: Various collaborative filtering algorithms, e.g. user-based recommendation, item-based recommendation, etc.
- Tools:
  - Web design: Dreamweaver / CoffeeCup
  - User log file analysis: Google Analytics
  - Graph database: Gremlin / Neo4j
  - Others: Mahout / Eclipse
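User-based recommendation, one of the collaborative filtering variants named above, can be sketched in a few lines. This is a minimal illustration in plain Python rather than Mahout; the cosine similarity choice, the function names, and the tiny ratings table are all assumptions made for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        numerator += sim * other_ratings[item]
        denominator += abs(sim)
    return numerator / denominator if denominator else None

def recommend(ratings, user, top_n=3):
    """Rank the items the user has not rated by predicted rating."""
    seen = set(ratings[user])
    candidates = {i for r in ratings.values() for i in r} - seen
    scored = [(item, predict(ratings, user, item)) for item in candidates]
    scored = [(item, score) for item, score in scored if score is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical ratings (user -> movie -> stars), not MovieLens data.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 5, "B": 3, "C": 4, "D": 5},
    "carol": {"A": 1, "B": 5, "D": 2},
}
print(recommend(ratings, "alice"))
```

Because bob's tastes match alice's more closely than carol's do, bob's rating of movie D dominates the prediction for alice.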

24 Progress and Expected Contributions
Timeline:
- Now - 11/27/14: Design and publish the movie website
- 11/28 - 12/04/14: Plug Google Analytics into the website and analyze the user log file
- 12/05 - 12/11/14: Generate queries for the graph database and update recommended movies on the website
Expected Contributions:
- To build our own online movie website.
- To implement the movie recommendation functionality.
- To do website analysis using Google Analytics.

25 E6893 Big Data Analytics: Currency Trend Analyzer Team Members: Tim Paine, Mark Aligbe 25 Nov 20, 2014

26 Motivation
- Forex trading requires statistical insight into the exchange market.
- Large quantity of data, visualization only utilized at the day/week/month level
- Difficult to see real time trends, analyze real time trends at granularity < 1 day
- Need to be able to collect, analyze, visualize data streaming in real time
Solution
- Distributed Computing
- Real Time Computation
- Statistical Analysis
- Data Visualization

27 Datasets
- Large amounts of intraday/daily forex/equity/other data
Algorithms
- Recommenders: suggesting trading prices and items to exchange
- Clustering: to analyze trends over a variable period of time
- Classifying: to classify trends into upward/downward movements, momentum
Tools
- Java and Mahout for the analytics
- Javascript, Python, and R for data gathering, web server, and visualization

28 Progress and Expected Contributions
- Forex data acquired, sanitized, formatted, ready to use
- Built system to batch collect data from multiple feeds when it becomes available
- Current stage: building design and field research
- Next steps: other distributed computing libraries
- End contribution: an extensible framework for collecting, analyzing, and visualizing real time data feeds

29 E6893 Big Data Analytics: Correlating Price / Volume of Low Volume Stocks with Social Media Jeff Ho, MS Statistics William Lee, MS Operations Research (CVN) 29 Oct 20, 2014

30 Hypothesis and Method Hypothesis Low volume stocks typically do not generate mainstream news coverage We hypothesize that social media could be a useful source of information Method Backtest different methods of using Big Data (specifically Twitter) to ultimately try to predict future price movements We will test various cases attempting to seek a correlation between tweets and movement of low volume stocks in price or volume We will verify whether these tweets are leading or lagging indicators of price or volume changes 30

31 Stock Criteria
- Low volume stocks
- No / low analyst coverage
- Stick with one industry
- Data Source: Yahoo Finance
Big Data Platform
- Twitter
- Data Source: Twitter API

32 Predicting stock movements with tweet volume
1. Built list of stocks to test
2. Found cases where tweets can predict absolute movement in prices and stock volume
3. Can build a strategy around tweet volumes
[Figure: ALSK shows Rise in Tweet Volume 2 Days Prior to Significant Stock Movement. Axes: Tweet Volume 2 Days Prior vs. Absolute Percent Change in Stock Price]
[Figure: ALSK shows Rise in Tweet Volume 2 Days Prior to Significant Stock Volume Increase. Axes: Tweet Volume 2 Days Prior vs. Stock Volume; fitted line y = 22249x]
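The idea of testing whether tweet volume leads price movement by a fixed number of days can be sketched as a lagged Pearson correlation. The series below are fabricated so that moves follow tweets by exactly two days; they are not ALSK data, and the helper names are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def lagged_correlation(tweets, moves, lag):
    """Correlate tweet volume on day t - lag with the absolute move on day t."""
    return pearson(tweets[:len(tweets) - lag], moves[lag:])

def best_lag(tweets, moves, max_lag):
    """Lag (in days) at which tweet volume correlates most with later moves."""
    return max(range(max_lag + 1),
               key=lambda lag: lagged_correlation(tweets, moves, lag))

# Hypothetical daily tweet counts and absolute price changes (illustrative only).
tweets = [1, 5, 2, 8, 3, 9, 4, 7]
moves = [0.0, 0.0] + [0.01 * t for t in tweets[:-2]]
print(best_lag(tweets, moves, max_lag=3))
```

A positive best lag would suggest tweets act as a leading indicator; a best lag of zero would suggest they merely coincide with the move.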

33 E6893 Big Data Analytics: Trading Using Nonparametric Time Series Classification Models Team Members: Yufan Cai, Bowen Wang, Junchao Zhang 33 Nov 20, 2014

34 Motivation Traditional trading strategies usually involve time series models such as GARCH. It is difficult to incorporate categorical parameters, such as Twitter data, into these models. Using classification models, we can predict whether the asset price will go up or down by incorporating unstructured data streams.

35 Dataset, Algorithm, Tools Dataset: Stock live price data, order book data (Bloomberg API, Bitcoin/USD). Twitter (Optional) Algorithms: Mahout Algorithms such as Logistic Regression Others like Weighted Majority Voting and Nearest-Neighbor Classification Tools: Java, Mahout, Hadoop 35

36 Progress and Expected Contributions
Progress:
- Researched various recent time series classification models
- Set up an interface with the data API
Expected Contributions:
- Provide Hadoop-based implementations of classification models for time series

37 E6893 Big Data Analytics: Project Name STOCK FORECASTING USING HADOOP MAP-REDUCE Team Members: Yi Yu, Yu Xia, Xiangliang Yang, Yumeng Xu 37 Nov 20, 2014

38 Motivation
- The stock market is both high-profit and high-risk, so stock market analysis and prediction research has attracted a lot of attention.
- The stock price trend is a complex nonlinear function, so the price has a certain predictability.
- Hadoop MapReduce is a recent framework specially designed for processing large datasets on distributed sources. Apache's Hadoop is an implementation of MapReduce.

39 Dataset, Algorithm, Tools
Algorithms: Pearson Correlation Similarity, Euclidean Distance Similarity, Stochastic Gradient Descent (SGD)
Tools: Hadoop, Mahout, HBase

40 Progress and Expected Contributions
Expected Contributions: We are going to analyze the dataset called Daily Holdings for All ProShares ETFs, which contains a large amount of information collected from the stock exchange market. The first step is to scrutinize the data and identify stocks that may go up. From these screened stocks, we will suggest to a given user a stock which she/he may be interested in.
Progress: We have already obtained the dataset and analyzed some similarities which could be useful in further steps.

41 E6893 Big Data Analytics: Real-time Risk Management System Team Members: Iljoon Hwang, Sungwoo Yoo, Sungjoon Huh 41 Nov 20, 2014

42 Motivation
1. Motivation
- Objective: Developing a Real-time Risk Management System (Intraday Value at Risk) for large, complex portfolios in a unified framework
- Expected Outcome: A system that performs stressed VaR calculation, "what-if" scenarios, and stress testing on complex portfolios with a large number of underlying risk factors and vectors in real time.
- Importance: Risk management is crucial throughout investment/trading activities, from the front trading desk to the back office. However, because of the complexity of calculating VaR for a large multi-asset portfolio, legacy systems cannot deliver VaR in real time. Big Data with in-memory multi-dimensional analytics can resolve this issue.

43 Dataset, Algorithm, Tools
2. Dataset: Yahoo Finance tick data for S&P
3. Algorithms
- Value at Risk: Monte Carlo simulation, Parametric method
- Big Data: Map-Reduce, In-Memory
4. Tools / Languages: Hadoop, Spark, Scala, R, Python, and Google Cloud
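The two Value at Risk methods named above can be contrasted in a short sketch. It assumes normally distributed single-asset returns for both methods, which is a simplification; the return series and function names are hypothetical, not the team's data or code.

```python
import random
from statistics import NormalDist, mean, stdev

def parametric_var(returns, value, confidence=0.99):
    """Parametric (variance-covariance) VaR under a normal-returns assumption."""
    mu, sigma = mean(returns), stdev(returns)
    z = NormalDist().inv_cdf(1 - confidence)  # negative quantile, e.g. about -2.33
    return -(mu + z * sigma) * value

def monte_carlo_var(returns, value, confidence=0.99, n_sims=100_000, seed=0):
    """Monte Carlo VaR: simulate returns, then read off the loss quantile."""
    mu, sigma = mean(returns), stdev(returns)
    rng = random.Random(seed)
    simulated = sorted(rng.gauss(mu, sigma) for _ in range(n_sims))
    cutoff = simulated[int((1 - confidence) * n_sims)]
    return -cutoff * value

# Hypothetical daily returns for one position (illustrative values only).
daily_returns = [0.01, -0.02, 0.015, -0.005, 0.003, -0.012, 0.007, -0.001]
portfolio_value = 1_000_000
print(parametric_var(daily_returns, portfolio_value))
print(monte_carlo_var(daily_returns, portfolio_value))
```

With the same distributional assumption the two estimates should agree closely; Monte Carlo earns its cost when the portfolio has nonlinear instruments or non-normal risk factors, and its independent simulations are what make it a candidate for distributed execution.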

44 Progress and Expected Contributions 5. Current Status - Developed environment system construction in Google Cloud - Have been collecting tick data from Yahoo Finance - Have decided specific algorithms for calculating Value at Risk - Have been studying related papers and articles 6. Team members and Expected Contributions - Iljoon Hwang: Environment systems construction development, Storage server programming, Testbed - Sungwoo Yoo: Data collection, Research papers / articles - Sungjoon Huh: Batch processing / Real-time app programming 44

45 E6893 Big Data Analytics: Stock Daily Price Predictions Based on News Team Members: Jie Liu, Jingnan Li, Lu Qiu, Ruixin Tan 45 Nov 20, 2014

46 Motivation Traditional technical trading takes into account only the quantitative, not the qualitative, factors that influence stock prices. It is well known that news items have a significant impact on stock indices and prices. To make better predictions, we combine quantitative methods with NLP feature analysis of headlines in our model.

47 Dataset, Algorithm, Tools
Dataset (Jan 2, 2013 - Nov 17, 2014):
Attributes | Data Source | Data Type | Description
Stock Prices | Yahoo Finance | Numeric | Google, Apple and IBM stock daily prices
Indices & Yield | Yahoo Finance / Bloomberg | Numeric | Nasdaq, S&P 500, Treasury Yield 5 Years
News | Yahoo Finance | Text | News articles related to Google, Apple and IBM
Algorithm: NLP, Lasso, SVR, ARIMAX
Tools: Java, Python, Hadoop, etc.
Data collection: Yahoo Finance RSS, around 3000 articles
Feature Selection: Lasso
Prediction: ARIMAX, SVR
[Diagram labels: News Articles, NLP, NLP Features, Indices, Yield, Historical Prices, Stock Prices]

48 Progress and Expected Contributions
What we have done:
- Collected numerical data and part of the news.
- Built preliminary models, verifying that our ideas are reasonable.
[Figure: Correlation (NLP features vs. AAPL Stock Prices), Day -3 to Day 2]
[Table: R² of Regression Methods (Train / Test) for Baseline: Linear; SVM with RBF kernel; SVM with RBF of Indexes and Lag = 1 on Stock]
What we will do:
- Implement the algorithm on the whole dataset.
- Demonstrate the positive impact of NLP on stock price prediction.

49 E6893 Big Data Analytics: Project Name Stock Recommendation System Team Members: Guangyang Zhang, Yuechen Qin, Bowen Dang, Zheng Fang 49 Nov 20, 2014

50 Motivation To help stock buyers to make wiser choices To find those who are very good at gaining profits from stocks (experts) Using user-based collaborative recommendation to find the similarity between the buyer and the expert Recommend the stock buyer some stocks from the most similar expert. To ease stock buyers from the heavy burden of looking through thousands of stocks and making a wise choice. 50

51 Dataset, Algorithm, Tools Dataset NASDAQ Stock Exchange Data Yahoo Finance dataset for historical prices Simulated trading records of users Algorithm User-based Collaborative Filtering with Inferred Tag Ratings Tools Eclipse J2EE, Mahout, Maven, MySQL, PHP server 51

52 Progress and Expected Contributions
Progress
- Implemented UI and database
- Determined the algorithm: User-based Collaborative Filtering with Inferred Tag Ratings.
Expected Contributions
- Achieve an improved Collaborative Filtering algorithm with Inferred Tag Ratings.
- Present clients with sound stock recommendations based on the analysis of experts' buying choices.

53 E6893 Big Data Analytics: Project Name Stock signal generation using real time news analysis Team Members: Mandeep Singh, Mayank Misra, Rajesh Madala, Shreyas Shrivastava 53 Nov 20, 2014

54 Motivation
- Stock movement due to White House related news on Twitter
- Twitter-based hedge fund
- Algorithmic trading
- Twitter data is traditionally used for sentiment analysis, but it is now also a great source for consuming news in real time
- Stock prices correlate with news
By applying appropriate filters to tweets by news agencies and then scoring the filtered tweets, we aim to generate signals for stock prices that could be consumed by algorithms or traders to make better trades. The framework we are building is scalable and could potentially be used to generate signals for a portfolio of heterogeneous stocks.

55 Dataset, Algorithm, Tools
[Diagram labels: Stream, Filter Algo (x3), Scorer Algo (x3)]
Dataset: Twitter Stream
Algorithms: Custom filtering and scoring algorithms with use of NLP, logistic regression
Tools: Stream processing, Python, JSON parser, Python ML libraries, d3.js

56 Progress and Expected Contributions So far we have completed the stream ingestion part and are working on refining our filtering and scoring algorithms to generate better correlation between generated signals and stock price movements. Below is a breakdown of work and contributions by team members:
1. Ingesting data stream - Mayank
2. Filtering Algo - Mandeep
3. Scoring Algo - Shreyas
4. Fetching real time stock data and plotting it together with stock signals generated in real time - Rajesh

57 E6893 Big Data Analytics Project Name Customer Complaint Analyses Insights Into Issues plaguing the Banking Sector Team Members Abhaar Gupta, Avinash Sridhar, Nachiket Rau, Sankalp Singayapally 57 Nov 20, 2014

58 Motivation One of the biggest challenges for banks is minimizing customer attrition rate which is directly dependent on customer satisfaction. Customers are inclined to choose the banks who can be trusted for their services. Banks make their decisions based on a subset of data because of absence of scalable solutions. In this project, we propose a scalable design to counter the above problems 58

59 Dataset, Algorithm, Tools
Consumer Complaints Database: The dataset contains retail consumer complaints with banks and financial institutions (provided by Consumer Financial Protection Bureau).
Algorithms: Various Clustering and Classification Algorithms
Tools and Languages: Hadoop, Mahout, Java, Python

60 Progress and Expected Contributions
- Identify major retail banking issues by state and match-analyze them based on geographic or socio-economic brackets.
- Identify the top concerns of consumers in various states.
- Derive the business impact on the institution of customer satisfaction or dissatisfaction with their complaints.
- Propose likely solutions that can serve as a first response for future complaints of a similar nature.
- Hypothesize a performance metric, applicable to all complaints, that can be used to prioritize complaints based on resolution time.

61 E6893 Big Data Analytics: Financial Market Volatility Team Members: John Terzis Tim Wu Oliver Zhou Jimmy Zhong 61 Nov 20, 2014

62 Motivation Understanding volatility in financial markets has long been of interest to hedgers and speculators. Empirical evidence has shown us that volatility is a highly nonlinear evolving process. Modeling this process using the Hadoop ecosystem can offer tremendous advantages over traditional econometric models that are limited to datasets which fit in main memory.

63 Dataset, Algorithm, Tools
Dataset: We have procured a massive dataset of price quotes on equities, exchange traded futures, futures, and market indices over the span of the last ten to fifteen years at the one minute granularity level. In addition to price quotes on specific instruments, our dataset features derivative indicators of price and volume activity.
Algorithm: We propose to train and test several scalable machine learning based regression models on our dataset with the goal of producing a functional form of future realized volatility at the symbol level that minimizes bias and variance and ultimately generalizes well to unforeseen data. Feature selection will be integral to the task given the likelihood that many of our input variables are highly correlated. We intend to build a framework on top of Apache Spark that can at a minimum perform an n-fold cross validation of a training model and use beam search or other established methods to calibrate the hyperparameters of our SVM, random forest, or regularized regression model in a reasonably fast time frame given the algorithmic complexity of the underlying routines employed.
Tools: Hadoop, Apache Spark, Mahout, AWS, Git, R, Java, Python
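The n-fold cross validation and hyperparameter calibration described above can be illustrated on a toy problem. This sketch is plain Python, not Spark; it swaps in an exhaustive grid search for beam search and a one-feature ridge regression for the models named on the slide, and all names and data are hypothetical. The interleaved folds ignore time ordering, which a real time series evaluation would need to respect.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]

def fit_ridge(xs, ys, lam):
    """Closed-form one-feature ridge regression without an intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(weight, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    return sum((weight * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_validate(xs, ys, lam, k):
    """Average held-out error over k folds for one hyperparameter value."""
    n = len(xs)
    scores = []
    for fold in k_fold_indices(n, k):
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        weight = fit_ridge([xs[i] for i in train], [ys[i] for i in train], lam)
        scores.append(mse(weight, [xs[i] for i in fold], [ys[i] for i in fold]))
    return sum(scores) / k

def select_lambda(xs, ys, grid, k):
    """Grid search: keep the regularization strength with the lowest CV error."""
    return min(grid, key=lambda lam: cross_validate(xs, ys, lam, k))

# Hypothetical noiseless data: y = 2x, so no regularization should win.
xs = [float(x) for x in range(1, 21)]
ys = [2.0 * x for x in xs]
print(select_lambda(xs, ys, grid=[0.0, 10.0, 100.0], k=5))
```

Each (fold, hyperparameter) pair is an independent fit, so the search distributes naturally across a cluster.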

64 Progress and Expected Contributions Progress: Set up AWS web server with Ubuntu, Hadoop, and Mahout environments. Purchased and uploaded dataset. Selected initial set of machine learning algorithms. Expected Contributions: Preprocessing: Jimmy/Tim/John Feature Selection: John/Oliver Spark SVM: John Mahout Random Forest: Oliver/Tim Evaluator: Jimmy/Tim R APIs: Tim Java APIs: Jimmy Forward Progress: Preprocessing & Setup: 11/22 Algorithm Application: 11/30 Evaluation: 12/6 Final Report: 12/11 64

65 E6893 Big Data Analytics: Project Name: Salary Engine Team Members: Lin Huang, Mingrui Xu, Wei Cao, Fan Ye 65 Nov 20, 2014

66 Motivation
Main idea:
- Employers would determine a more reasonable salary for a position.
- Employees could find more jobs that match their background by using our recommendation system.
We want to help employers and jobseekers figure out the market worth of different positions by building a prediction engine for the salary of any UK job ad. In this way, we would bring more transparency to this important market.
Simple sample of our Salary Engine


More information

lop Building Machine Learning Systems with Python en source

lop Building Machine Learning Systems with Python en source Building Machine Learning Systems with Python Master the art of machine learning with Python and build effective machine learning systems with this intensive handson guide Willi Richert Luis Pedro Coelho

More information

Overview. Introduction. Recommender Systems & Slope One Recommender. Distributed Slope One on Mahout and Hadoop. Experimental Setup and Analyses

Overview. Introduction. Recommender Systems & Slope One Recommender. Distributed Slope One on Mahout and Hadoop. Experimental Setup and Analyses Slope One Recommender on Hadoop YONG ZHENG Center for Web Intelligence DePaul University Nov 15, 2012 Overview Introduction Recommender Systems & Slope One Recommender Distributed Slope One on Mahout and

More information

Big Data and Data Science: Behind the Buzz Words

Big Data and Data Science: Behind the Buzz Words Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing

More information

CiteSeer x in the Cloud

CiteSeer x in the Cloud Published in the 2nd USENIX Workshop on Hot Topics in Cloud Computing 2010 CiteSeer x in the Cloud Pradeep B. Teregowda Pennsylvania State University C. Lee Giles Pennsylvania State University Bhuvan Urgaonkar

More information

Large-Scale Test Mining

Large-Scale Test Mining Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

More information

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance. Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics

More information

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

More information

Big Data Analytics OverOnline Transactional Data Set

Big Data Analytics OverOnline Transactional Data Set Big Data Analytics OverOnline Transactional Data Set Rohit Vaswani 1, Rahul Vaswani 2, Manish Shahani 3, Lifna Jos(Mentor) 4 1 B.E. Computer Engg. VES Institute of Technology, Mumbai -400074, Maharashtra,

More information

Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

More information

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

More information

DATA SCIENCE CURRICULUM WEEK 1 ONLINE PRE-WORK INSTALLING PACKAGES COMMAND LINE CODE EDITOR PYTHON STATISTICS PROJECT O5 PROJECT O3 PROJECT O2

DATA SCIENCE CURRICULUM WEEK 1 ONLINE PRE-WORK INSTALLING PACKAGES COMMAND LINE CODE EDITOR PYTHON STATISTICS PROJECT O5 PROJECT O3 PROJECT O2 DATA SCIENCE CURRICULUM Before class even begins, students start an at-home pre-work phase. When they convene in class, students spend the first eight weeks doing iterative, project-centered skill acquisition.

More information

GigaSpaces Real-Time Analytics for Big Data

GigaSpaces Real-Time Analytics for Big Data GigaSpaces Real-Time Analytics for Big Data GigaSpaces makes it easy to build and deploy large-scale real-time analytics systems Rapidly increasing use of large-scale and location-aware social media and

More information

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume

More information

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect Big Data & QlikView Democratizing Big Data Analytics David Freriks Principal Solution Architect TDWI Vancouver Agenda What really is Big Data? How do we separate hype from reality? How does that relate

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

HADOOP. Revised 10/19/2015

HADOOP. Revised 10/19/2015 HADOOP Revised 10/19/2015 This Page Intentionally Left Blank Table of Contents Hortonworks HDP Developer: Java... 1 Hortonworks HDP Developer: Apache Pig and Hive... 2 Hortonworks HDP Developer: Windows...

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

Image Search by MapReduce

Image Search by MapReduce Image Search by MapReduce COEN 241 Cloud Computing Term Project Final Report Team #5 Submitted by: Lu Yu Zhe Xu Chengcheng Huang Submitted to: Prof. Ming Hwa Wang 09/01/2015 Preface Currently, there s

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Online Content Optimization Using Hadoop. Jyoti Ahuja Dec 20 2011

Online Content Optimization Using Hadoop. Jyoti Ahuja Dec 20 2011 Online Content Optimization Using Hadoop Jyoti Ahuja Dec 20 2011 What do we do? Deliver right CONTENT to the right USER at the right TIME o Effectively and pro-actively learn from user interactions with

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics

E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2

More information

On a Hadoop-based Analytics Service System

On a Hadoop-based Analytics Service System Int. J. Advance Soft Compu. Appl, Vol. 7, No. 1, March 2015 ISSN 2074-8523 On a Hadoop-based Analytics Service System Mikyoung Lee, Hanmin Jung, and Minhee Cho Korea Institute of Science and Technology

More information

Challenges and Lessons from NIST Data Science Pre-pilot Evaluation in Introduction to Data Science Course Fall 2015

Challenges and Lessons from NIST Data Science Pre-pilot Evaluation in Introduction to Data Science Course Fall 2015 Challenges and Lessons from NIST Data Science Pre-pilot Evaluation in Introduction to Data Science Course Fall 2015 Dr. Daisy Zhe Wang Director of Data Science Research Lab University of Florida, CISE

More information

Neural Networks for Sentiment Detection in Financial Text

Neural Networks for Sentiment Detection in Financial Text Neural Networks for Sentiment Detection in Financial Text Caslav Bozic* and Detlef Seese* With a rise of algorithmic trading volume in recent years, the need for automatic analysis of financial news emerged.

More information

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

Mammoth Scale Machine Learning!

Mammoth Scale Machine Learning! Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data

Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data CS535 Big Data W1.A.1 CS535 BIG DATA W1.A.2 Let the data speak to you Medication Adherence Score How likely people are to take their medication, based on: How long people have lived at the same address

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

Find the Hidden Signal in Market Data Noise

Find the Hidden Signal in Market Data Noise Find the Hidden Signal in Market Data Noise Revolution Analytics Webinar, 13 March 2013 Andrie de Vries Business Services Director (Europe) @RevoAndrie andrie@revolutionanalytics.com Agenda Find the Hidden

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015 E6893 Big Data Analytics Lecture 8: Spark Streams and Graph Computing (I) Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Manifest for Big Data Pig, Hive & Jaql

Manifest for Big Data Pig, Hive & Jaql Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing

More information

Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

More information

Advanced In-Database Analytics

Advanced In-Database Analytics Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??

More information

Sources: Summary Data is exploding in volume, variety and velocity timely

Sources: Summary Data is exploding in volume, variety and velocity timely 1 Sources: The Guardian, May 2010 IDC Digital Universe, 2010 IBM Institute for Business Value, 2009 IBM CIO Study 2010 TDWI: Next Generation Data Warehouse Platforms Q4 2009 Summary Data is exploding

More information

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Is a Data Scientist the New Quant? Stuart Kozola MathWorks Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by

More information

Ubuntu and Hadoop: the perfect match

Ubuntu and Hadoop: the perfect match WHITE PAPER Ubuntu and Hadoop: the perfect match February 2012 Copyright Canonical 2012 www.canonical.com Executive introduction In many fields of IT, there are always stand-out technologies. This is definitely

More information

Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392. Research Article. E-commerce recommendation system on cloud computing

Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392. Research Article. E-commerce recommendation system on cloud computing Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 E-commerce recommendation system on cloud computing

More information

ANALYTICS CENTER LEARNING PROGRAM

ANALYTICS CENTER LEARNING PROGRAM Overview of Curriculum ANALYTICS CENTER LEARNING PROGRAM The following courses are offered by Analytics Center as part of its learning program: Course Duration Prerequisites 1- Math and Theory 101 - Fundamentals

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

HOW TO MAKE SENSE OF BIG DATA TO BETTER DRIVE BUSINESS PROCESSES, IMPROVE DECISION-MAKING, AND SUCCESSFULLY COMPETE IN TODAY S MARKETS.

HOW TO MAKE SENSE OF BIG DATA TO BETTER DRIVE BUSINESS PROCESSES, IMPROVE DECISION-MAKING, AND SUCCESSFULLY COMPETE IN TODAY S MARKETS. HOW TO MAKE SENSE OF BIG DATA TO BETTER DRIVE BUSINESS PROCESSES, IMPROVE DECISION-MAKING, AND SUCCESSFULLY COMPETE IN TODAY S MARKETS. ALTILIA turns Big Data into Smart Data and enables businesses to

More information

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data INFO 1500 Introduction to IT Fundamentals 5. Database Systems and Managing Data Resources Learning Objectives 1. Describe how the problems of managing data resources in a traditional file environment are

More information

An Introduction to Data Mining

An Introduction to Data Mining An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

More information

A Big Data Analytical Framework For Portfolio Optimization Abstract. Keywords. 1. Introduction

A Big Data Analytical Framework For Portfolio Optimization Abstract. Keywords. 1. Introduction A Big Data Analytical Framework For Portfolio Optimization Dhanya Jothimani, Ravi Shankar and Surendra S. Yadav Department of Management Studies, Indian Institute of Technology Delhi {dhanya.jothimani,

More information

Analysis Tools and Libraries for BigData

Analysis Tools and Libraries for BigData + Analysis Tools and Libraries for BigData Lecture 02 Abhijit Bendale + Office Hours 2 n Terry Boult (Waiting to Confirm) n Abhijit Bendale (Tue 2:45 to 4:45 pm). Best if you email me in advance, but I

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

Moving From Hadoop to Spark

Moving From Hadoop to Spark + Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com sujee@elephantscale.com Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee

More information

Microsoft Big Data. Solution Brief

Microsoft Big Data. Solution Brief Microsoft Big Data Solution Brief Contents Introduction... 2 The Microsoft Big Data Solution... 3 Key Benefits... 3 Immersive Insight, Wherever You Are... 3 Connecting with the World s Data... 3 Any Data,

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Using Tableau Software with Hortonworks Data Platform

Using Tableau Software with Hortonworks Data Platform Using Tableau Software with Hortonworks Data Platform September 2013 2013 Hortonworks Inc. http:// Modern businesses need to manage vast amounts of data, and in many cases they have accumulated this data

More information

III Big Data Technologies

III Big Data Technologies III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

White Paper: Hadoop for Intelligence Analysis

White Paper: Hadoop for Intelligence Analysis CTOlabs.com White Paper: Hadoop for Intelligence Analysis July 2011 A White Paper providing context, tips and use cases on the topic of analysis over large quantities of data. Inside: Apache Hadoop and

More information

Fast Data in the Era of Big Data: Twitter s Real-

Fast Data in the Era of Big Data: Twitter s Real- Fast Data in the Era of Big Data: Twitter s Real- Time Related Query Suggestion Architecture Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin Presented by: Rania Ibrahim 1 AGENDA Motivation

More information

Chase Wu New Jersey Ins0tute of Technology

Chase Wu New Jersey Ins0tute of Technology CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at

More information

Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

More information

Machine Learning for Data Science (CS4786) Lecture 1

Machine Learning for Data Science (CS4786) Lecture 1 Machine Learning for Data Science (CS4786) Lecture 1 Tu-Th 10:10 to 11:25 AM Hollister B14 Instructors : Lillian Lee and Karthik Sridharan ROUGH DETAILS ABOUT THE COURSE Diagnostic assignment 0 is out:

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# So What is Data Science?" Doing Data Science" Data Preparation" Roles" This Lecture" What is Data Science?" Data Science aims to derive knowledge!

More information

SEAIP 2009 Presentation

SEAIP 2009 Presentation SEAIP 2009 Presentation By David Tan Chair of Yahoo! Hadoop SIG, 2008-2009,Singapore EXCO Member of SGF SIG Imperial College (UK), Institute of Fluid Science (Japan) & Chicago BOOTH GSB (USA) Alumni Email:

More information

Sentiment Analysis on Big Data

Sentiment Analysis on Big Data SPAN White Paper!? Sentiment Analysis on Big Data Machine Learning Approach Several sources on the web provide deep insight about people s opinions on the products and services of various companies. Social

More information

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

More information

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012. Viswa Sharma Solutions Architect Tata Consultancy Services

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012. Viswa Sharma Solutions Architect Tata Consultancy Services Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012 Viswa Sharma Solutions Architect Tata Consultancy Services 1 Agenda What is Hadoop Why Hadoop? The Net Generation is here Sizing the

More information

Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics

Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics Please note the following IBM s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice

More information

Hong Kong Stock Index Forecasting

Hong Kong Stock Index Forecasting Hong Kong Stock Index Forecasting Tong Fu Shuo Chen Chuanqi Wei tfu1@stanford.edu cslcb@stanford.edu chuanqi@stanford.edu Abstract Prediction of the movement of stock market is a long-time attractive topic

More information

W H I T E P A P E R. Building your Big Data analytics strategy: Block-by-Block! Abstract

W H I T E P A P E R. Building your Big Data analytics strategy: Block-by-Block! Abstract W H I T E P A P E R Building your Big Data analytics strategy: Block-by-Block! Abstract In this white paper, Impetus discusses how you can handle Big Data problems. It talks about how analytics on Big

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

More information

Big Data Analysis: Apache Storm Perspective

Big Data Analysis: Apache Storm Perspective Big Data Analysis: Apache Storm Perspective Muhammad Hussain Iqbal 1, Tariq Rahim Soomro 2 Faculty of Computing, SZABIST Dubai Abstract the boom in the technology has resulted in emergence of new concepts

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

The Future of Data Management

The Future of Data Management The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is

More information

Bayesian networks - Time-series models - Apache Spark & Scala

Bayesian networks - Time-series models - Apache Spark & Scala Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly

More information

Testing Big data is one of the biggest

Testing Big data is one of the biggest Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing

More information

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

More information

Cisco Data Preparation

Cisco Data Preparation Data Sheet Cisco Data Preparation Unleash your business analysts to develop the insights that drive better business outcomes, sooner, from all your data. As self-service business intelligence (BI) and

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

White Paper. How Streaming Data Analytics Enables Real-Time Decisions White Paper How Streaming Data Analytics Enables Real-Time Decisions Contents Introduction... 1 What Is Streaming Analytics?... 1 How Does SAS Event Stream Processing Work?... 2 Overview...2 Event Stream

More information