E6893 Big Data Analytics Lecture 12: Final Project Proposals


1 E6893 Big Data Analytics Lecture 12: Final Project Proposals Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center November 20th, 2014

2 Course Structure

Class Date | Number | Topics Covered
09/04/14 | 1 | Introduction to Big Data Analytics
09/11/14 | 2 | Big Data Analytics Platforms
09/18/14 | 3 | Big Data Storage and Processing
09/25/14 | 4 | Big Data Analytics Algorithms -- I
10/02/14 | 5 | Big Data Analytics Algorithms -- II (recommendation)
10/09/14 | 6 | Big Data Analytics Algorithms -- III (clustering)
10/16/14 | 7 | Big Data Analytics Algorithms -- IV (classification)
10/23/14 | 8 | Big Data Analytics Algorithms -- V (classification & clustering)
10/30/14 | 9 | Linked Big Data Graph Computing -- I (Graph DB)
11/06/14 | 10 | Linked Big Data Graph Computing -- II (Graph Analytics)
11/13/14 | 11 | Linked Big Data Graph Computing -- III (Graphical Models & Platforms)
11/20/14 | 12 | Final Project First Presentations
11/27/14 | -- | Thanksgiving Holiday
12/04/14 | 13 | Next Stage of Big Data Analytics
12/11/14 | 14 | Big Data Analytics Workshop; Final Project Presentations

2

3 Proposal List (#1 - #17, Page 7-76)

Industry Sector: Finance
1. Exchange Rates Inquiry and Analysis
2. Algorithm Trading Strategies Using Hadoop MapReduce
3. Image Classification in the Cloud and GPU (H-Classification & G-Classification)
4. Google-Analytics, Graph based Online Movie Recommendation System
5. Currency Trend Analyzer
6. Correlating Price / Volume of Low Volume Stocks with Social Media
7. Trading Using Nonparametric Time Series Classification Models
8. Stock Forecasting Using Hadoop Map-Reduce
9. Real-time Risk Management System
10. Stock Daily Price Predictions Based on News
11. Stock Recommendation System
12. Stock signal generation using real time news analysis
13. Customer Complaint Analyses
14. Financial Market Volatility
15. Salary Engine
16. Stock price Movement Prediction with Hadoop+Mahout & Pydoop+Scikit
17. Sector-based Classification and Clustering of Financial News Articles

3

4 Proposal List (#18 - #33, Page ) Information Information Information Information Information Information Information Information Information Information Life Science Life Science Life Science Life Science Life Science Life Science TV Analytics Project Nova Game Outcome Analysis Exploring the Online and Offline Social World Yelp Review Analysis and Recommendation Sentiment Analysis on Movie Document Analysis with Latent Dirichlet Allocation Yelp Recommendation Analysis Using Big Data (Hadoop) for Identification of Aberrant Behavior Clusters in Server Performance Time Series Data Crocuta: JavaScript Analytics System Learning Brain Activity From fMRI Images Network Analysis on the Big Cancer Genome Data Reversal Prediction from Physiology Data EEGoVid: An EEG-Based Interest Level Video Recommendation Engine Brain Edge Detection BrainBow: Reconstruction of Neurons 4

5 Proposal List (#34 - #55, Page ) Retail Retail Retail Retail Retail Retail Retail Retail Retail Telecom Telecom Telecom Telecom Telecom Transportation and Energy Transportation and Energy Transportation and Energy Transportation and Energy Transportation and Energy Transportation and Energy Transportation and Energy Transportation and Energy Yelp Fake Review Detection Predicting Excitement at donorschoose.org Market strategy suggestions for B2C websites Résumé Category Classification TV Genome Project / Recommendation Engine Analytics Media Acquire Valued Shoppers Twitter-Based Product / Sales Events Recommender Yelp-er: Analyzing Yelp Data Big Mobile Data Network Congestion Analysis Comparison Analysis of Different Telecoms Operators User's Web Events Analysis Based on Browser Extension Analysis of telecom service in cellular networks Human Activity Monitoring and Prediction Minimizing Risk in Energy Arbitrage Best Transportation Choice Manage Energy Consumption By Smart Meters Location Specific Optimization of Taxi Efficiency in NYC Citi Bike System Data Analysis PeopleMaps Project Transeo: Making public buses more efficient and accessible Image Based Geo-localization 5

6 Proposal List (#56 - #77, Page ) Media Media Media Media Media Media Media Media Media Media Media Media Media Media Social Science-Government Social Science-Government Social Science-Government Social Science-Government Social Science-Government Social Science-Government Social Science-Government Social Science-Government HVision PlayPalate Movie Exploration Fantasy Basketball Fantasy Basketball Prediction Using Previous Season's Data Hunting for NBA players Music-Links Affective Computational Cinematography Improving Movie Recommender System with User Behavior changes and Demographics MOVIE RECOMMENDATION AND ANALYSIS OF THIS APPLICATION TV Genome Project / Recommendation Engine TrendyWriter Analysis on pricing strategy for sports team Spark NLP Predicting usefulness of restaurant reviews from subtopics using Yelp data Study Buddy Big Data Analysis on Log Data of Standardized IBT Test (TOEFL) Takers for Effects of Selection Changing Improving Education for At-risk Students Scratch Analyzer Oscar Award Analysis based on big data Error Correction in Large Volume OCR Datasets How to name your newborn baby (babies)? 6

7 E6893 Big Data Analytics Finance Group Proposals 7 Nov 20, 2014

8 E6893 Big Data Analytics: Project Name Exchange Rates Inquiry and Analysis Team Members: Mengnan Wang (mw2969), Xiaomeng Zhang (xz2350), Jianze Wang (jw3127), Wanding Li (wl2501) 8 Nov 20, 2014

9 Motivation With international trade and commerce growing in importance, daily access to up-to-date currency rates matters not only to a specific group of industries but to all of us. By making exchange rate information more convenient to obtain, which our project hopes to do, people will be better equipped to make informed decisions in the currency market. 9

10 Dataset, Algorithm, Tools Dataset: Instant exchange rates from Bloomberg; historical exchange rates from Algorithm: We will forecast the exchange rates of currencies in selected developed-market countries against the US dollar, applying a scalable model to forecast in real time. We will use RMSE to measure the reliability and accuracy of our predictions, and report statistical significance (e.g., p-values) to establish the repeatability of the outcome. Tools: Eclipse, Tomcat, Apache 10
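A minimal sketch of the RMSE measure the team plans to use for forecast accuracy; the function name and sample rates are illustrative, not from the proposal:

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between forecast and realized exchange rates."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

# e.g. comparing two days of EUR/USD forecasts against realized rates
error = rmse([1.10, 1.12], [1.11, 1.10])
```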

11 Progress and Expected Contributions Expected Contributions: Our project is initially designed to provide users with the following:
- Forward and cross exchange rates for most world currencies
- Both instant and historical data
- Basic analysis of exchange rates, including regression and K-means
- Latest news about exchange rates
Current Progress: A sketch webpage design that looks something like
Key Factors Affecting Exchange Rates:
- PPP (Purchasing Power Parity)
- INT (Interest Rate Differential), such as Libor
- GDP (the difference in GDP growth rates)
- IGP (Income Growth Rate)
- Relative Economic Strength, which we may measure quantitatively using factors like GDP and IGP
11

12 E6893 Big Data Analytics: Project Name: Algorithm Trading Strategies Using Hadoop MapReduce Team Members: Yifan Wu, Meibin Chen 12 Nov 20, 2014

13 Motivation Improvement of internet speed and storage means data is flooding in everywhere; Hadoop can handle it, and MapReduce is its tool for parallel computation over the data. Usage of algorithmic trading: algorithmic trading is the use of computer programs for entering trading orders, in which computer algorithms decide on every aspect of the order, such as its timing, price, and quantity. 1. Back-test the algorithm using enough historical price data to validate and optimize it in terms of profitability, stability, etc. 2. For a complex algorithm, there are many parameters that need to be optimized. Expected outcomes: 1. The system will build and select the most suitable strategy 20% faster than before. 2. The enhanced platform doubles the number of strategy groups. 3. Strategies can be updated more frequently and can include more parameters in the analysis. 13

14 Dataset, Algorithm, Tools The dataset we chose is the stock index futures of the Shanghai Futures Exchange (the big bull market of A shares) or others. Algorithm: the algorithm is a combination of Moving Average Convergence Divergence (MACD) and the Relative Strength Index (RSI). MACD: the long average period and the short average period. RSI: the upper threshold, the lower threshold, and the calculation period. The project is divided into two parts: the first is the MapReduce platform (Hadoop); the second is the trading strategies tested against the data on this platform. 14
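The two indicators can be sketched in plain Python; the default periods (12/26 for MACD, 14 for RSI) are the conventional choices, not values the team specifies:

```python
def ema(prices, period):
    """Exponential moving average with smoothing factor 2/(period+1)."""
    k = 2 / (period + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(p * k + out[-1] * (1 - k))
    return out

def macd(prices, short=12, long=26):
    """MACD line: short-period EMA minus long-period EMA."""
    return [s - l for s, l in zip(ema(prices, short), ema(prices, long))]

def rsi(prices, period=14):
    """Relative Strength Index over the last `period` closes."""
    gains = losses = 0.0
    for prev, cur in zip(prices[-period - 1:-1], prices[-period:]):
        change = cur - prev
        gains += max(change, 0.0)
        losses += max(-change, 0.0)
    if losses == 0:
        return 100.0        # all moves were up
    rs = gains / losses
    return 100.0 - 100.0 / (1.0 + rs)
```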

15 Progress and Expected Contributions Inner MapReduce: Input: daily price data; each line contains 100 days of price information. Output: the performance of the parameters on the data. Outer MapReduce: Input: parameter combinations; each line contains one combination of parameters. Output: the best parameters. Contribution: algorithmic trading has many uses in investment strategy, including market making, inter-market spreading, arbitrage, and pure speculation. With MapReduce, it can run faster, handle multiple tasks, and deliver more real-time updates. 15
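The inner/outer structure above amounts to a parameter sweep: the inner job scores one parameter combination on the data, the outer job reduces to the best. A toy single-process sketch; the backtest objective here is a placeholder, not the team's strategy:

```python
from itertools import product

def backtest(short, long, prices):
    """Toy stand-in for the inner MapReduce job: score one parameter pair.
    A real implementation would simulate trades over `prices`."""
    if short >= long:
        return float("-inf")            # invalid combination
    return -abs((long - short) - 10)    # placeholder objective

def sweep(shorts, longs, prices):
    """Outer job: map every combination to its score, reduce to the best."""
    scored = ((backtest(s, l, prices), (s, l)) for s, l in product(shorts, longs))
    return max(scored)[1]

best = sweep([5, 10, 15], [20, 30, 40], [])
```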

16 E6893 Big Data Analytics: Image Classification in the Cloud and GPU (H-Classification & G-Classification) Team Members: Anand Rajan & Eric Johnson 16 Nov 20, 2014

17 Motivation - With the advent of social media, the number of mobile pictures being taken and uploaded is increasing exponentially. - Although most photos are uploaded with some basic metadata (date, time, camera model, and possibly geo-location), a great deal of detail is missing when they enter the cloud. Unless users physically go through and tag each image, this can create a search nightmare. - Example: How do you find that picture you took a few years back while on vacation in Paris? It was under a bridge by the river, right? Resorting to clicking through hundreds of photos or waiting for images to cache on your phone can take forever when you want to show someone in a pinch. Challenge: To make image search more effective, it will be important to develop and utilize advanced algorithms that help auto-tag images. Doing so can narrow down image search and improve the quality of search results. - Leveraging the Yahoo Labs Flickr dataset, we plan to test and build upon feature extraction methods, utilizing a parallelized computing system to efficiently extract image characteristics. - Using these characteristics, we will train and test image classification and evaluate it based on precision. Going beyond this step, we also plan to experiment with a GPU-powered processing system to evaluate the added benefits and performance benchmarks that might be had during the image analysis stage over a standard distributed system. 17

18 Dataset, Algorithm, Tools Dataset: Yahoo/Flickr Image Dataset (83GB), 2 million user photos (200,000 x 10 categories) - Contains: photo_id, jpeg url, and some corresponding metadata such as the title, description, camera type, and tags. Addition: Flickr API details like comments, favorites, and social network data can be queried. Data is broken into 10 categories: 1 nature, 2 food, 3 people, 4 wedding, 6 sky, 7 london, 8 beach, 9 music, 10 travel. Toolset: R, Java, Hadoop/Mahout, NVidia CUDA - Linux based cluster array: 26x 2.7GHz Intel Xeon CPUs, 64 GB of GDDR3 RAM - Windows 7 desktop: 2.7GHz Intel i7 CPU, 24 GB of GDDR3 RAM, 2GB 256-bit GDDR5 w/ 1344 CUDA cores. Algorithms (implemented by stage): - Feature Extraction / Reduction: SIFT, HOG, PCA (Map stage; good for GPU) - Encoding: KMeans Clustering (Reduce stage) - Classification: Naïve Bayes, Log SGD Regression 18
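PCA in the Map stage reduces each image's feature vector to its top principal components before encoding; a small NumPy sketch (the function name is illustrative, and the random matrix stands in for real SIFT/HOG features):

```python
import numpy as np

def pca_project(features, k):
    """Project feature vectors (rows) onto their top-k principal components."""
    centered = features - features.mean(axis=0)
    # rows of vt are principal directions, ordered by singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# e.g. reduce 128-dim descriptors for 1000 images down to 16 dims
descriptors = np.random.rand(1000, 128)
reduced = pca_project(descriptors, 16)
```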

19 Progress and Expected Contributions Progress - Conducted thorough research on open-source image classification packages as well as the tools required to perform feature extraction and analysis - Set up an environment for a Hadoop distributed system - Acquired the Yahoo Flickr dataset and began initial testing - Acquired hardware and performed initial tests on a GPU-based system Expected Contributions - Our goal is to research, implement, and build upon the current open-source offerings for image classification to help improve the auto-tagging process of digital photos - If given the time and resources, we aim to develop a web-based interface that will allow users to upload an image and perform feature analysis and extraction to determine which tags and keywords are associated with it Potential Challenges - Research on feature extraction is very resource-intensive, so we have picked a challenging project for only two students - Conducting such a challenging project will involve many abstract mathematical models - Companies like Yahoo and Facebook invest millions in this field - *We anticipate this to be an ongoing project that we can continue well beyond the course and perhaps into the second semester 19

20 E6893 Big Data Analytics: Google-Analytics, Graph based Online Movie Recommendation System Team Members: Tian Han, Yifan Du, and Hang Guan 20 Nov 20, 2014

21 Motivation - To build our own movie website and implement the movie recommendation functionality. 21

22 Process Flow [Flow diagram connecting: Movie Website, users' log files, Google Analytics, key words for query, Graph Database, Recommendation] 22

23 Dataset, Algorithm, Tools - Dataset: MovieLens 1M dataset, which contains 1 million ratings from 6000 users on 4000 movies - Algorithm: Various collaborative filtering algorithms, e.g. user-based recommendation, item-based recommendation, etc. - Tools: Web design - Dreamweaver / CoffeeCup; Users' log file analysis - Google Analytics; Graph Database - Gremlin / Neo4j; Others - Mahout / Eclipse 23
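The user-based collaborative filtering listed above can be sketched as Pearson-weighted neighbour averaging; the function names and the tiny rating dicts are illustrative, not the team's code:

```python
def pearson(a, b):
    """Similarity between two users' rating dicts over co-rated movies."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    xs = [a[m] for m in common]
    ys = [b[m] for m in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0

def predict_rating(user, movie, ratings):
    """Similarity-weighted average of other users' ratings for `movie`."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other != user and movie in theirs:
            w = pearson(ratings[user], theirs)
            num += w * theirs[movie]
            den += abs(w)
    return num / den if den else None
```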

24 Progress and Expected Contributions Timeline: Now -- 11/27/14: Movie website design and publishing. 11/28 -- 12/04/14: Plug Google Analytics into the website and analyze user log files. 12/05 -- 12/11/14: Generate queries for the graph database and update recommended movies on the website. Expected Contributions: - To build our own online movie website. - To implement the movie recommendation functionality. - To do website analysis using Google Analytics.

25 E6893 Big Data Analytics: Currency Trend Analyzer Team Members: Tim Paine, Mark Aligbe 25 Nov 20, 2014

26 Motivation Forex trading requires statistical insight into the exchange market. There is a large quantity of data, but visualization is only utilized at the day/week/month level. It is difficult to see and analyze real-time trends at granularity < 1 day. We need to be able to collect, analyze, and visualize data streaming in real time. Solution: Distributed Computing, Real Time Computation, Statistical Analysis, Data Visualization 26

27 Datasets: Large amounts of intraday/daily forex/equity/other data. Algorithms: Recommenders - suggesting trading prices and items to exchange; Clustering - to analyze trends over a variable period of time; Classifying - to classify trends into upward/downward movements, momentum. Tools: Java and Mahout for the analytics; Javascript, Python, and R for data gathering, web server, and visualization 27

28 Progress and Expected Contributions Forex data acquired, sanitized, formatted, and ready to use. Built a system to batch-collect data from multiple feeds as it becomes available. Current stage: building design and field research. Next steps: other distributed computing libraries. End contribution: an extensible framework for collecting, analyzing, and visualizing real-time data feeds 28

29 E6893 Big Data Analytics: Correlating Price / Volume of Low Volume Stocks with Social Media Jeff Ho, MS Statistics William Lee, MS Operations Research (CVN) 29 Nov 20, 2014

30 Hypothesis and Method Hypothesis Low volume stocks typically do not generate mainstream news coverage We hypothesize that social media could be a useful source of information Method Backtest different methods of using Big Data (specifically Twitter) to ultimately try to predict future price movements We will test various cases attempting to seek a correlation between tweets and movement of low volume stocks in price or volume We will verify whether these tweets are leading or lagging indicators of price or volume changes 30

31 Stock Criteria Stock Criteria Low volume stocks No / Low analyst coverage Stick with one industry Data Source: Yahoo Finance Big Data Platform Twitter Data Source: Twitter API 31

32 Predicting stock movements with tweet volume 1. Built a list of stocks to test 2. Found cases where tweets can predict absolute movement in prices and stock volume 3. Can build a strategy around tweet volumes [Charts: "ALSK shows Rise in Tweet Volume 2 Days Prior to Significant Stock Movement" (absolute percent change in stock price vs. tweet volume 2 days prior) and "ALSK shows Rise in Tweet Volume 2 Days Prior to Significant Stock Volume Increase" (stock volume vs. tweet volume 2 days prior); fitted regression lines and R² values lost in transcription]
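The lead-lag relationship in the charts can be quantified as a lagged correlation; a sketch where the two-day lag mirrors the slide, while the function name and data shapes are assumptions:

```python
import numpy as np

def lagged_corr(tweet_volume, pct_change, lag=2):
    """Correlate tweet volume with the absolute price change `lag` days later."""
    x = np.asarray(tweet_volume[:-lag], dtype=float)
    y = np.abs(np.asarray(pct_change[lag:], dtype=float))
    return np.corrcoef(x, y)[0, 1]

# e.g. daily tweet counts vs. daily percent price moves for one ticker
r = lagged_corr([12, 30, 8, 45, 10, 60], [0.1, 0.2, 1.5, -0.3, 4.0, 0.5])
```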

33 E6893 Big Data Analytics: Trading Using Nonparametric Time Series Classification Models Team Members: Yufan Cai, Bowen Wang, Junchao Zhang 33 Nov 20, 2014

34 Motivation Traditional trading strategies usually involve time series models such as GARCH, in which it is difficult to incorporate categorical parameters such as Twitter data. Using classification models, we can predict whether the asset price will go up or down by incorporating unstructured data streams. 34

35 Dataset, Algorithm, Tools Dataset: Stock live price data, order book data (Bloomberg API, Bitcoin/USD), Twitter (optional) Algorithms: Mahout algorithms such as Logistic Regression; others like Weighted Majority Voting and Nearest-Neighbor Classification Tools: Java, Mahout, Hadoop 35

36 Progress and Expected Contributions Progress: Researched various recent time series classification models. Set up an interface with the data API. Expected Contributions: Provide Hadoop-based classification model implementations for time series 36

37 E6893 Big Data Analytics: Project Name Stock Forecasting Using Hadoop Map-Reduce Team Members: Yi Yu, Yu Xia, Xiangliang Yang, Yumeng Xu 37 Nov 20, 2014

38 Motivation The stock market has high-profit, high-risk features, so research on stock market analysis and prediction has drawn wide attention. The stock price trend is a complex nonlinear function, so prices have a certain predictability. Hadoop MapReduce is a recent framework specially designed for processing large datasets on distributed sources; Apache's Hadoop is an implementation of MapReduce. 38

39 Dataset, Algorithm, Tools Algorithms: Pearson Correlation Similarity, Euclidean Distance Similarity, Stochastic Gradient Descent (SGD) Tools: Hadoop, Mahout, HBase 39

40 Progress and Expected Contributions Expected Contributions: We are going to analyze the dataset called Daily Holdings for All ProShares ETFs, which contains a wealth of information collected from the stock exchange market. The first step is to scrutinize the data and identify stocks that may potentially go up. From these screened stocks, we will suggest to a given user a potential stock that he/she may be interested in. Progress: We have already obtained the dataset and analyzed some similarities which could be useful in further steps. 40

41 E6893 Big Data Analytics: Real-time Risk Management System Team Members: Iljoon Hwang, Sungwoo Yoo, Sungjoon Huh 41 Nov 20, 2014

42 Motivation 1. Motivation - Objective: Developing a Real-time Risk Management System (Intraday Value at Risk) for large, complex portfolios in a unified framework - Expected Outcome: A system that performs the calculation of stressed VaR, "what-if" scenarios, and stress-testing on a complex portfolio with a large number of underlying risk factors and vectors in real time. - Importance: Risk management is crucial throughout investment/trading activities, from the front trading desk to the back office. However, because of the complexity of calculating VaR on a large multi-asset portfolio, delivering VaR in real time is not feasible in legacy systems. Big Data with in-memory multi-dimensional analytics can resolve this big issue. 42

43 Dataset, Algorithm, Tools 2. Dataset: Yahoo Finance tick data for S&P 3. Algorithms: Value at Risk: Monte Carlo simulation, parametric method; Big Data: Map-Reduce, in-memory 4. Tools / Languages: Hadoop, Spark, Scala, R, Python, and Google Cloud 43
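A minimal Monte Carlo VaR sketch under a single normal risk factor; it only illustrates the quantile idea, since a real intraday multi-asset system like the one proposed would simulate correlated risk factors (parameter values and the function name are illustrative):

```python
import random

def monte_carlo_var(portfolio_value, mu, sigma, alpha=0.99, n=100_000, seed=7):
    """One-day Value at Risk: the alpha-quantile of simulated losses
    under normally distributed daily returns."""
    rng = random.Random(seed)
    losses = sorted(-portfolio_value * rng.gauss(mu, sigma) for _ in range(n))
    return losses[int(alpha * n) - 1]

# 99% one-day VaR for a $1M portfolio with 2% daily return volatility
var_99 = monte_carlo_var(1_000_000, mu=0.0, sigma=0.02)
```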

44 Progress and Expected Contributions 5. Current Status - Development environment construction in Google Cloud - Have been collecting tick data from Yahoo Finance - Have decided on specific algorithms for calculating Value at Risk - Have been studying related papers and articles 6. Team Members and Expected Contributions - Iljoon Hwang: Environment/systems construction and development, storage server programming, testbed - Sungwoo Yoo: Data collection, research papers/articles - Sungjoon Huh: Batch processing / real-time app programming 44

45 E6893 Big Data Analytics: Stock Daily Price Predictions Based on News Team Members: Jie Liu, Jingnan Li, Lu Qiu, Ruixin Tan 45 Nov 20, 2014

46 Motivation Traditional technical trading takes into account only the quantitative, not the qualitative, factors that influence stock prices. It is well known that news items have a significant impact on stock indices and prices. To make better predictions, we combine quantitative methods with headline NLP feature analysis in our model. 46

47 Dataset, Algorithm, Tools Dataset (Jan 2, 2013 -- Nov 17, 2014):

Attributes | Data Source | Data Type | Description
Historical Prices | Yahoo Finance | Numeric | Google, Apple and IBM stock daily prices
Indices, Yield | Yahoo Finance | Numeric | Nasdaq, S&P 500, Treasury Yield 5 Years
News Articles | Yahoo Finance / Bloomberg | Text (NLP features) | News articles related to Google, Apple and IBM

Data collection: Yahoo Finance RSS, around 3000 articles. Algorithm: NLP, Lasso, SVR, ARIMAX. Feature selection: Lasso. Prediction: ARIMAX, SVR on stock prices. Tools: Java, Python, Hadoop, etc. 47
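The Lasso feature-selection step can be sketched as coordinate descent with soft thresholding, where zeroed weights mark the NLP features that get dropped; this is a generic implementation for illustration, not the team's code:

```python
import numpy as np

def lasso(X, y, lam, iters=200):
    """Coordinate-descent Lasso: minimize ||y - Xw||^2/2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        for j in range(d):
            # residual with feature j's contribution removed
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r
            z = X[:, j] @ X[:, j]
            # soft-thresholding shrinks small coefficients to exactly zero
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return w
```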

48 Progress and Expected Contributions What we have done: Collected numerical data and part of the news. Built preliminary models, verifying that our approach is reasonable. [Charts: correlation of NLP features vs. AAPL stock prices over Day -3 through Day 2; R² of regression methods (train/test) for a baseline linear SVM, SVM with RBF kernel, and SVM with RBF on indexes with lag = 1 on the stock] What we will do: Implement the algorithm on the whole dataset. Demonstrate the positive impact of NLP on stock price prediction. 48

49 E6893 Big Data Analytics: Project Name Stock Recommendation System Team Members: Guangyang Zhang, Yuechen Qin, Bowen Dang, Zheng Fang 49 Nov 20, 2014

50 Motivation To help stock buyers make wiser choices. To find those who are very good at gaining profits from stocks (experts). Using user-based collaborative recommendation to find the similarity between the buyer and the experts, then recommend to the stock buyer some stocks from the most similar expert. This relieves stock buyers of the heavy burden of looking through thousands of stocks to make a wise choice. 50

51 Dataset, Algorithm, Tools Dataset NASDAQ Stock Exchange Data Yahoo Finance dataset for historical prices Simulated trading records of users Algorithm User-based Collaborative Filtering with Inferred Tag Ratings Tools Eclipse J2EE, Mahout, Maven, MySQL, PHP server 51

52 Progress and Expected Contributions Progress - Implemented the UI and database - Determined the algorithm: User-based Collaborative Filtering with Inferred Tag Ratings. Expected Contributions - Achieve an improved Collaborative Filtering algorithm with Inferred Tag Ratings. - Present clients with sound stock recommendations based on the analysis of experts' buying choices. 52

53 E6893 Big Data Analytics: Project Name Stock signal generation using real time news analysis Team Members: Mandeep Singh, Mayank Misra, Rajesh Madala, Shreyas Shrivastava 53 Nov 20, 2014

54 Motivation Stock movements due to White House-related news on Twitter; Twitter-based hedge funds; algorithmic trading. Twitter data is traditionally used for sentiment analysis, but it is now also a great source for consuming news in real time, and stock prices are correlated with news. By applying appropriate filters to tweets by news agencies, and then scoring the filtered tweets, we aim to generate signals for stock prices that could be consumed by algorithms or traders to make better trades. The framework we are building is scalable and could potentially be used to generate signals for a portfolio of heterogeneous stocks. 54

55 Dataset, Algorithm, Tools [Pipeline diagram: Stream → Filter Algos → Scorer Algos] Dataset: Twitter stream. Algorithms: custom filtering and scoring algorithms using NLP and logistic regression. Tools: stream processing, Python, JSON parser, Python ML libraries, d3.js 55
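The filter and scorer stages might look like the following sketch; the account names and keyword lists are purely illustrative placeholders, not the team's actual filters, and a real scorer would use the NLP/logistic-regression models mentioned above:

```python
# Hypothetical filter lists - every name and keyword here is illustrative.
NEWS_ACCOUNTS = {"Reuters", "AP", "BreakingNews"}
TICKER_KEYWORDS = {"AAPL": {"apple", "iphone"}}
POSITIVE = {"beats", "surge", "record", "upgrade"}
NEGATIVE = {"misses", "drop", "probe", "downgrade"}

def passes_filter(tweet, ticker):
    """Keep only tweets from news accounts mentioning the ticker's keywords."""
    text = tweet["text"].lower()
    return tweet["user"] in NEWS_ACCOUNTS and any(
        kw in text for kw in TICKER_KEYWORDS[ticker])

def score(tweet):
    """Signed keyword count: positive -> buy signal, negative -> sell signal."""
    words = tweet["text"].lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
```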

56 Progress and Expected Contributions So far we have completed the stream ingestion part and are working on refining our filtering and scoring algorithms to generate better correlation between generated signals and stock price movements. Below is a breakdown of work and contributions by team members 1. Ingesting data stream Mayank 2. Filtering Algo - Mandeep 3. Scoring Algo - Shreyas 4. Fetching real time stock data and plotting it together with stock signals generated in real time Rajesh 56

57 E6893 Big Data Analytics Project Name Customer Complaint Analyses: Insights into Issues Plaguing the Banking Sector Team Members: Abhaar Gupta, Avinash Sridhar, Nachiket Rau, Sankalp Singayapally 57 Nov 20, 2014

58 Motivation One of the biggest challenges for banks is minimizing the customer attrition rate, which is directly dependent on customer satisfaction. Customers are inclined to choose banks that can be trusted for their services. Banks make their decisions based on a subset of data because of the absence of scalable solutions. In this project, we propose a scalable design to counter the above problems. 58

59 Dataset, Algorithm, Tools Consumer Complaints Database: The dataset contains retail consumer complaints with banks and financial institutions (provided by Consumer Financial Protection Bureau) Algorithms: Various Clustering and Classification Algorithms Tools and Languages: Hadoop, Mahout, Java, Python

60 Progress and Expected Contributions - Identify major retail banking issues by state and match-analyze them based on geographic or socio-economic brackets. - Identify the top concerns of consumers in various states. - Derive the business impact on the institution of customers' satisfaction or dissatisfaction with how their complaints are handled. - Propose likely solutions that can serve as a first response for future complaints of a similar nature. - Hypothesize a performance metric that, applied to all complaints, can be used to prioritize complaints based on resolution time. 60

61 E6893 Big Data Analytics: Financial Market Volatility Team Members: John Terzis Tim Wu Oliver Zhou Jimmy Zhong 61 Nov 20, 2014

62 Motivation Understanding volatility in financial markets has long been of interest to hedgers and speculators. Empirical evidence has shown that volatility is a highly nonlinear evolving process. Modeling this process using the Hadoop ecosystem can offer tremendous advantages over traditional econometric models, which are limited to datasets that fit in main memory. 62

63 Dataset, Algorithm, Tools Dataset: We have procured a massive dataset of price quotes on equities, exchange-traded futures, and market indices over the span of the last ten to fifteen years at one-minute granularity. In addition to price quotes on specific instruments, our dataset features derivative indicators of price and volume activity. Algorithm: We propose to train and test several scalable machine-learning regression models on our dataset, with the goal of producing a functional form of future realized volatility at the symbol level that minimizes bias and variance and ultimately generalizes well to unseen data. Feature selection will be integral to the task, given the likelihood that many of our input variables are highly correlated. We intend to build a framework on top of Apache Spark that can, at a minimum, perform n-fold cross-validation of a training model and use beam search or other established methods to calibrate the hyperparameters of our SVM, random forest, or regularized regression model in a reasonably fast time frame, given the algorithmic complexity of the underlying routines. Tools: Hadoop, Apache Spark, Mahout, AWS, Git, R, Java, Python 63
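The n-fold cross-validation and hyperparameter search described above can be sketched generically; the `evaluate` callback stands in for training and scoring whichever model (SVM, random forest, regularized regression) is being calibrated:

```python
def kfold(n, k):
    """Yield (train, test) index lists for k roughly equal folds."""
    size = n // k
    for i in range(k):
        test = list(range(i * size, (i + 1) * size))
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        yield train, test

def cross_validated_search(grid, evaluate, n, k=5):
    """Pick the hyperparameter setting with the best mean fold score.
    `evaluate(params, train, test)` must return a score (higher is better)."""
    best_score, best_params = float("-inf"), None
    for params in grid:
        score = sum(evaluate(params, tr, te) for tr, te in kfold(n, k)) / k
        if score > best_score:
            best_score, best_params = score, params
    return best_params
```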

64 Progress and Expected Contributions Progress: Set up AWS web server with Ubuntu, Hadoop, and Mahout environments. Purchased and uploaded dataset. Selected initial set of machine learning algorithms. Expected Contributions: Preprocessing: Jimmy/Tim/John Feature Selection: John/Oliver Spark SVM: John Mahout Random Forest: Oliver/Tim Evaluator: Jimmy/Tim R APIs: Tim Java APIs: Jimmy Forward Progress: Preprocessing & Setup: 11/22 Algorithm Application: 11/30 Evaluation: 12/6 Final Report: 12/11 64

65 E6893 Big Data Analytics: Project Name: Salary Engine Team Members: Lin Huang, Mingrui Xu, Wei Cao, Fan Ye 65 Nov 20, 2014

66 Motivation Main idea: We want to help employers and jobseekers figure out the market worth of different positions by building a prediction engine for the salary of any UK job ad. Employers would determine a more reasonable salary for a position; employees could find more jobs matching their background by using our recommendation system. In this way, we would bring more transparency to this important market. [Figure: simple sample of our Salary Engine] 66

67 Dataset, Algorithm, Tools Dataset: Job(id (PK), title, descrip, loc, locnorm, jobtype, time, company, category, salary, salarynorm, sourcename); exceeds 2GB, over 100,000 records. Algorithm: Prediction model (to help companies offer a more reasonable salary) - Classification: Stochastic Gradient Descent, Naive Bayes, Random Forest - Regression: SVM+Kernel, Linear Regression, K-NN. Job recommendation system (to recommend positions to jobseekers): uses item-based and user-based concepts, based on the similarity between positions and employees. Tools: AWS (EC2, S3, DynamoDB), Hive, Tomcat, Mahout 67

68 Progress and Expected Contributions Progress - Data cleaning (text mining), model building (machine learning algorithms) - Web UI design, servlet API implementation, database construction - Chat forum design and recommendation evaluation - Result testing and model improvement, using the test dataset, which leaves the salary field blank. Work will focus on improving prediction and recommendation accuracy by adjusting the methods or parameters in the model. Expected Contributions - Implement and improve two models that achieve good accuracy in prediction and recommendation, respectively, and analyze their scope of application. - Make the website dynamic so that newly posted job info can be loaded in as part of the dataset. - Allow jobseekers to search for specific opportunities, get recommendations from the system, and exchange information with others. 68

69 E6893 Big Data Analytics: Project Name Stock Price Movement Prediction with Hadoop+Mahout & Pydoop+Scikit Team Members: Arman Arkilic, Ao Hong, Tian Zhang, Yuheng Liu 69

70 Motivation, Dataset and Expected Contribution Motivation: Predicting stock price movements is essential for portfolio risk management and at the core of any trading model. Test Hadoop and its tools' ability for such a basic and common purpose. Stock tick data is available to the public over various time ranges and frequencies. The output is straightforward and easy to evaluate. Datasets: Google Finance URL format: i=[period]&p=[days]d&f=d,o,h,l,c,v&df=cpct&q=[ticker]; Yahoo Finance; free S&P 500 daily pricing data. Expected Contribution: Develop 2 methods of stock price movement prediction and analyze their advantages and disadvantages. Create an original method of using Python to communicate with Hadoop. Compare the performance of the models on different datasets to get high accuracy. Apply the prediction model to future stock price movements. Help companies adjust business strategy for more profit. 70

71 Algorithms and Tools Linear regression in order to predict stock price movement: will the price of a stock be higher or lower the following day? Mahout (Stochastic Gradient Descent classifier algorithm): org.apache.mahout.classifier.sgd.TrainLogistic; given a stock's opening price, high price, low price, and closing price, predict its movement the following day; a similar approach to the supervised learning methods covered in class for project #3. Pydoop + scikit-learn: Hadoop's provided Python API doesn't support C/C++-wrapped Python libraries (the core of Python's scientific computation toolkits: numpy, scipy, matplotlib, scikit-image, scikit-learn, etc.); Pydoop tackles this by wrapping Hadoop's C++ pipes (Boost.Python) and libhdfs; Pydoop provides both HDFS access and MapReduce tasks in pure Python code (no Jython). This is better than the stdin/stdout utilities Hadoop offers to all languages, given that one might want to explore the data in HDFS and/or submit large chunks as part of a YARN task. Scikit-learn provides simple tools for data mining and analysis, including a Stochastic Gradient Descent approach to fitting linear models similar to Mahout's: sklearn.linear_model.SGDClassifier.
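The SGD training that both Mahout's TrainLogistic and sklearn's SGDClassifier perform can be sketched in a few lines of plain Python. This toy version (synthetic prices, invented scaling and learning rate, not the team's actual code) predicts whether the close will be higher the following day from scaled OHLC features:

```python
import math
import random

# Synthetic daily prices; label = 1 if the next day's close is higher.
random.seed(0)
closes = [100 + i + random.uniform(-2, 2) for i in range(60)]
rows, labels = [], []
for i in range(len(closes) - 1):
    o, c = closes[i] - 0.5, closes[i]
    h, l = c + 1.0, o - 1.0
    rows.append([o / 100, h / 100, l / 100, c / 100])  # scaled OHLC features
    labels.append(1 if closes[i + 1] > c else 0)

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(-z))

# Plain stochastic gradient descent on the logistic loss.
w, b, lr = [0.0] * 4, 0.0, 0.1
for _ in range(20):
    for x, y in zip(rows, labels):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        g = p - y  # gradient of the log loss w.r.t. the pre-activation
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g

preds = [1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5 else 0
         for x in rows]
accuracy = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
```

The production versions differ mainly in scale: Mahout streams updates over HDFS splits, and SGDClassifier vectorizes them with numpy.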

72 Progress Using Mahout: dataset from Yahoo Finance (open, close, highest, and lowest prices); split the dataset into two parts, training data and testing data; build the model and generate the prediction function; test accuracy. Tested with one company's stock data from 4/12/1996 to 1/28/2010, with an accuracy of 0.58; future improvement will focus on adjusting parameters to increase accuracy. Using Pydoop + scikit-learn: dataset imported to HDFS and converted to a numpy/scipy array within Python; submitted some small MapReduce jobs for practice using a framework identical to Mahout's (Hadoop in the virtual machine provided by the TAs [Ubuntu 14 LTS]) with the same number of cores as Mahout. Future work: implement the same algorithm in scikit-learn and verify it gives results similar to Mahout's; deploy the algorithm to Hadoop as a MapReduce job using Pydoop's Mapper and Reducer classes; compare the results to the Mahout findings; analyze scalability against the Mahout code under changing hardware constraints.

73 E6893 Big Data Analytics: Final Project Proposal Presentation: Sector-based Classification and Clustering of Financial News Articles Michael D'Andrea, CVN student. November 9, 2014

74 Project Summary Team members: Michael D'Andrea, CVN student. Motivation (the problem to solve): clustering and classifying sector-based financial articles. Changed project because of issues around confidentiality of datasets and shifting prioritization of needs; learning the Mahout algorithms also surfaced more use cases. Current status of project: completed collection of the initial training dataset; still scoping out technical requirements and learning appropriate applications of the machine learning algorithms.

75 Problem and Solution Background: importance of sector-based data for investment allocation decisions. "Sector exposure has been the second-most influential factor in the performance of the U.S. equity market during the past 20 years." (Fidelity Investments) "Portfolios built with equity sectors were consistently more efficient (higher return and lower risk) than those created using style box components." (Fidelity Investments) Problem: clustering and classifying sector-based financial articles; assists with investment decisions on a sector basis. Solution: Mahout machine learning algorithms. Clustering: Canopy and K-Means for clustering articles into sectors and extracting themes. Classification: C-Bayes (Complementary Naïve Bayes) and Naïve Bayes.

76 Dataset, Tools, and Workplan Dataset: keyword-based Bloomberg financial news article datasets; Reuters 20K article set; if time permits, utilize a web crawler for additional article extraction. Tools: Hadoop, Pig, Mahout, Java or Node.js, AWS or Azure, MongoDB, D3 (optional). Workplan: Phase 1, clustering and classifying financial news articles; Phase 2, storing and visualizing results, if time permits.

77 E6893 Big Data Analytics Information Group Proposals Nov 20, 2014

78 E6893 Big Data Analytics: TV Analytics Team Members: Shubhanshu Yadav (sy2511), Vaibhav Jagannathan (vj2192). Nov 20, 2014

79 Motivation It is always a challenge to keep track of which TV shows are going to air new episodes, and users may not be able to follow all the series they are interested in. Even if a user manages to keep track of all of his or her favourite shows, if many of them are released around the same time, he or she may not have the luxury of watching all of them. If the user is presented with a list of upcoming episodes, sorted by release date and a probabilistic rating, the choice becomes clear.

80 Dataset, Algorithm, Tools Languages: Python, JavaScript. Dataset: scraped from the IMDb website. Tools: Flask (website creation; uses an MVC framework); Scrapy (for scraping the data); AWS (hosting the website, offline processing tasks); Hadoop, Mahout (for the machine learning part); Pig (for database access); D3 (visualization).

81 Progress and Expected Contributions Progress: base website implementation; data scraping from IMDb; basic linear regression analysis on small data; D3 visualization. Future work: use Hadoop to store data and access it through Pig; use Mahout for machine learning on more features; host the website on AWS; move future data scraping to EC2 using SQS.

82 E6893 Big Data Analytics: Project Nova Team Members: Abdus Samad Khan (ak3674), Chithra Srinivasan (cs3315), Mingyun Guan (mg3419), Yuzhong Ji (yj2345). Nov 20, 2014

83 Motivation Build a system that collects and clusters news from the web (New York Times, CNN, Reuters, etc.) on a daily basis and provides a user-friendly interface. Articles covering the same story from different sources are grouped together and presented once, to avoid duplicate reading.

84 Dataset, Algorithm, Tools Data source: news articles from the web, crawled with Apache Nutch. Data storage: Apache HBase. Clustering: cluster news articles using K-Means (conf: k=unknown, n-gram=3), using TF-IDF as a vector space model, via the Mahout library in Java (Eclipse IDE). User interface: a website listing the news articles on the topics users are interested in.
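The TF-IDF + K-Means pipeline above can be sketched in pure Python on toy documents (fixed k=2 and seeded centroids for illustration; the project itself would run this through Mahout on Hadoop):

```python
import math
from collections import Counter

docs = [
    "stocks rally as markets rise",
    "markets fall as stocks drop",
    "team wins the football match",
    "football team loses the match",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for d in tokenized for w in d})
df = Counter(w for d in tokenized for w in set(d))  # document frequency

def tfidf(doc):
    """Term frequency times inverse document frequency, over a fixed vocab."""
    tf = Counter(doc)
    return [tf[w] / len(doc) * math.log(len(docs) / df[w]) for w in vocab]

vectors = [tfidf(d) for d in tokenized]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# K-Means with k=2, seeded with the first and third documents.
centroids = [vectors[0], vectors[2]]
for _ in range(5):
    assign = [min(range(2), key=lambda k: dist(v, centroids[k]))
              for v in vectors]
    for k in range(2):
        members = [v for v, a in zip(vectors, assign) if a == k]
        centroids[k] = [sum(col) / len(members) for col in zip(*members)]
```

The two finance articles and the two football articles end up in separate clusters because they share weighted terms.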

85 Progress and Expected Contributions Progress: successfully implemented the K-Means algorithm in Java to cluster dummy articles; implemented web crawling using Nutch. Contributions: develop a unified, unbiased, real-time news application using a completely Apache open-source stack; provide users with a one-stop news solution.

86 E6893 Big Data Analytics: Game Outcome Analysis Team Members: Raymond Barker (rjb2150). Nov 20, 2014

87 Motivation Motivation: to answer the following question: given the current game state, who will win? Specifically, I'll be looking at games whose state can be thought of as two opposing teams, each a multiset of well-defined members. For example: chess (each team is composed of N pawns, M rooks, etc.); deck-building games (each team is composed of cards from a predefined list); MMORPGs (each team is composed of players using predefined classes).

88 Dataset, Algorithm, Tools Dataset: there is a standardized format for storing chess games known as PGN (portable game notation); various groups make datasets of notable games available, such as at chessok.com/?page_id=694. Magic tournament results are available at mtganalytics.net, though the data is not in a consistent format. EVE makes its player battle data available via an API, which sites such as eve-kill.net aggregate. Algorithm: convert the game state into features, then classify games into win / loss / stalemate buckets. Tools: primarily Mahout, along with pre-processing code.
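One way the two-multiset game state could be converted into features is per-piece counts for each side plus a material score; the encoding below is a hypothetical sketch, not the project's actual feature set:

```python
from collections import Counter

# Standard chess material values; the game state is two multisets of pieces.
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}

def features(white, black):
    """Flatten two piece multisets into a fixed-length vector:
    per-piece counts for each side, plus the material difference."""
    w, b = Counter(white), Counter(black)
    vec = [w[p] for p in PIECE_VALUES] + [b[p] for p in PIECE_VALUES]
    material = sum(PIECE_VALUES[p] * (w[p] - b[p]) for p in PIECE_VALUES)
    return vec + [material]

# White has an extra knight and bishop relative to black.
x = features(["P"] * 8 + ["N", "N", "B", "B", "R", "R", "Q"],
             ["P"] * 8 + ["N", "B", "R", "R", "Q"])
```

Vectors like `x` would then be fed to a Mahout classifier with win/loss/stalemate labels.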

89 Progress and Expected Contributions Progress: I haven't started yet. I'll be working with the chess data first, for two reasons: there is a huge amount of chess data available, and it's in a relatively easy-to-use format; there is also a huge amount of existing literature about and analysis of chess, for example, standard charts of chess piece values. Once I am able to get some good results for chess, I can move on to working with datasets for other games (MTG, EVE, etc.) as time allows. Expected Contributions: I'll be doing all of the coding, since this is a one-person group.

90 E6893 Big Data Analytics: Exploring the Online and Offline Social World Large Scale Event-Based Social Network Analysis of Meetup.com Team Members: Mengge Li (ml3695, EE), Yiwei Zhang (yz2698, EE), Rahul Gaur (rg2930, CS), Rongyao Huang (rh2648, STAT). Nov 20, 2014

91 Motivation Unique value of event-based social networks (EBSNs): both online and offline social interactions; analyze and compare the properties and dynamics of the two networks; commercial value: industrial trends, recommendation of services/products based on user preference. Big fans of Meetup.com: popularity across academia, industry, and recreation; excellent API: user, group, event, tags, location and time. A great opportunity to apply big data techniques and tools: graph database (Neo4j with Cypher); clustering and recommendation; large-scale social network analysis.

92 Dataset, Algorithm, Tools Meetup Dataset # Users: 4,448,454 # Groups: 42,052 # Events: 1,595,833 # Tags: 77,810 # User-Group Pairs: 8,863,235 # User-Event Pairs: 13,553,134 # User-Tag Pairs: 15,057,535 # Group-Tag Pairs: 144,793 # Users with Locations: 3,741,699 # Events with Locations: 983,333 Analytics/Modelling: properties of social interactions (degree, centrality, separation, clustering coefficient, density, etc.); temporal and spatial patterns of specific groups/events; clustering (Fiedler method, linear combination, generalized SVD); recommendation (user-based, item-based). Tools: Neo4j, Java/Python
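Two of the interaction properties listed above, degree and clustering coefficient, can be sketched in plain Python on a toy graph (the edge list is illustrative; the real computation would run over the full Meetup network in Neo4j/Cypher):

```python
# Toy undirected Meetup-style graph: members connected if they attended
# the same event (edges invented for illustration).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def degree(n):
    return len(adj[n])

def clustering_coef(n):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    nbrs = list(adj[n])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))
```

Node "a" sits in a closed triangle (coefficient 1.0), while hub "c" has mostly unconnected neighbours.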

93 Progress and Expected Contributions Project Timeline: project planning and literature review (11/06-11/13); pull data from the Meetup server and convert to the proper format in Java (11/14-11/20); inject data into Neo4j (11/21-11/27); social network analysis using Cypher (11/21-11/27); recommendation using Cypher (11/28-12/04); use Neo4j with Java/Python for clustering (11/28-12/04); write up the report (12/05-12/11). Expected Contributions: up-to-date Meetup network construction/evaluation; recommendation of groups, events, and products; industrial trends of products/concepts of concern (dashboard).

94 E6893 Big Data Analytics: Yelp Review Analysis and Recommendation Team Members: Enrui Liao, Yuqing Guan, Ying Tan, Mingqing Wu. Nov 20th, 2014 E6893 Big Data Analytics Yelp Review Analysis and Recommendation 2014 Columbia University

95 Motivation 1. Besides ratings, users' reviews are a rich treasure of feedback, but the traditional methods we learnt in class simply discard review text, which leaves many latent factors difficult to interpret. 2. We aim to fuse latent rating dimensions with latent review topics; this deeper use of feedback yields better recommendations (Yelp Dataset Challenge). 3. Reading large quantities of reviews is a difficult and time-consuming task, so a visualization that summarizes the user-generated reviews is needed for perusing them.

96 Dataset, Algorithm, Tools Dataset: Yelp Academic Dataset. Algorithms: 1. syntactic analysis; 2. clustering: Latent Dirichlet Allocation; 3. classification: Naïve Bayes, KNN, etc.; 4. recommendation: collaborative filtering. Tools: Java, Hadoop, Mahout, JGibbLDA, Stanford Parser.

97 Progress and Expected Contributions Progress: 1. Discussed different clustering/classification algorithms to choose which to use on the data. 2. Analyzed and extracted valuable attributes from the raw data. 3. Prepared and configured a web server for the project. Expected Contributions: 1. Use syntactic analysis to retrieve information from raw review text. 2. Cluster the reviews with the LDA algorithm and compare the results with classification algorithms. 3. Improve the recommendation of businesses in the Yelp data with the clustering and classification results. 4. Analyze the correlation between ratings and reviews, and give businesses advice on how to get a high rating. 5. Build a website to visualize the clustering/classification and recommendation. 6. (Optional) Connect our work to a social network, like Facebook or Twitter, to perform online recommendations.

98 E6893 Big Data Analytics: Stock Forecasting Using Hadoop Map-Reduce Team Members: Yi Yu, Yu Xia, Xiangliang Yang, Yumeng Xu. Nov 20, 2014

99 Motivation The stock market features both high profit and high risk, so stock market analysis and prediction have long attracted attention. The stock price trend is a complex nonlinear function, so prices have a certain predictability. Hadoop MapReduce is a framework specially designed for processing large datasets on distributed sources; Apache's Hadoop is an implementation of MapReduce.

100 Dataset, Algorithm, Tools Algorithms: Pearson correlation similarity; Euclidean distance similarity; Stochastic Gradient Descent (SGD). Tools: Hadoop, Mahout, HBase
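Pearson correlation similarity reduces to a short formula; a stdlib sketch with invented price series (Mahout ships an equivalent similarity measure for its recommenders):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

a = [10, 11, 12, 13, 15]          # toy closing prices
b = [20, 22, 24, 26, 30]          # perfectly linearly related to a
c = [30, 28, 26, 24, 20]          # perfectly inversely related to a
```

Series that move together score near +1, inversely moving series near -1, which is what makes this usable as a stock (or user) similarity.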

101 Progress and Expected Contributions Expected Contributions: we are going to analyze the dataset called Daily Holdings for All ProShares ETFs, which contains a wealth of information collected from the stock exchange market. The first step is to scrutinize the data and surface stocks that may potentially go up. From these screened stocks, we then suggest to a given user a stock he or she may be interested in. Progress: we have already obtained the dataset and analyzed some similarities that could be useful in further steps.

102 E6893 Big Data Analytics: Sentiment Analysis on Movie Reviews Yunao Liu, Hao Tong, Di Liu. Nov 20, 2014

103 Motivation We love movies, and movie reviews help us find good ones. In this project we will use large-scale machine learning and natural language processing technologies to experiment with a sentiment analysis task on movie reviews. Rather than simply deciding whether a review is thumbs up or thumbs down, we want to label each review on a scale of five values: one to five stars.

104 Dataset, Algorithm, Tools Dataset: Source: Rotten Tomatoes. Format: reviews with scores ranging from 0 to 4. Size: 8500 reviews. Algorithm: multi-class SVM; logistic regression; optional: the algorithm proposed by Pang and Lee. Tools: Hadoop, Mahout, Stanford Parser
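A multinomial Naïve Bayes baseline for the five-star labelling can be sketched in pure Python; all reviews and labels below are invented, and the project itself would train multi-class SVM / logistic regression via Mahout at scale:

```python
import math
from collections import Counter, defaultdict

# Toy labelled reviews on a 1-5 star scale (texts and labels invented).
train = [
    ("terrible boring awful movie", 1),
    ("bad boring plot", 2),
    ("okay average movie", 3),
    ("good fun movie", 4),
    ("great wonderful amazing film", 5),
    ("awful bad acting", 1),
    ("wonderful great plot", 5),
]
class_docs = defaultdict(list)
for text, star in train:
    class_docs[star].extend(text.split())
vocab = {w for text, _ in train for w in text.split()}
priors = Counter(star for _, star in train)

def predict(text):
    """Pick the star class maximising log prior + log likelihood."""
    words = text.split()
    best, best_lp = None, float("-inf")
    for star, bag in class_docs.items():
        counts = Counter(bag)
        lp = math.log(priors[star] / len(train))
        for w in words:
            # Laplace smoothing over the shared vocabulary.
            lp += math.log((counts[w] + 1) / (len(bag) + len(vocab)))
        if lp > best_lp:
            best, best_lp = star, lp
    return best
```

Even this tiny model assigns unseen positive phrasing to 5 stars and negative phrasing to 1 star.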

105 Progress and Expected Contributions Progress: we've done background research and looked into datasets. Expected Contributions: feature extraction: Yunao Liu, Hao Tong; experiments and evaluation: Hao Tong, Di Liu; analytics and discussion: Yunao Liu, Di Liu; write-up: Yunao Liu, Hao Tong, Di Liu

106 E6893 Big Data Analytics: Document Analysis with Latent Dirichlet Allocation Team Members: Hao Fu, Xiuwen Sun. Nov 20, 2014

107 Motivation 1. Extend the document analysis algorithm to big datasets using Hadoop. 2. Analyze documents by answering these questions: 2.1 cluster documents using generative models; 2.2 gain knowledge about the clustering results; 2.3 find the hidden attributes of different documents.

108 Dataset, Algorithm, Tools 1. Dataset: Wikipedia database; the 50-topic browser of latent Dirichlet allocation fit to the 2006 arXiv. 2. Algorithm: Latent Dirichlet Allocation (collapsed Gibbs sampling). 3. Tools: Mahout, Hadoop

109 Progress and Expected Contributions 1. Build the LDA model for document analysis on Hadoop. 2. Cluster documents based on the LDA analysis results, e.g. their distribution over topics. 3. Give a description of each topic based on its distribution over words.

110 E6893 Big Data Analytics: Yelp Recommendation Analysis Team Members: Siddharth Sunil Boobna (ssb2171), Yash Parikh (yp2348), Prateek Sinha (ps2791). Nov 20, 2014

111 Motivation The data currently provided by Yelp can sometimes be overwhelming for users trying to make a choice: they are presented with a myriad of options even within a specific set of businesses. We aim to make this easier by taking into account users' preferences and those of similar users, a weighted score for reviews based on votes and recency, and various attributes of the business.

112 Dataset, Algorithm, Tools We are using the Yelp academic dataset. The dataset consists of details about businesses (geo-location, categories, opening hours, and other attributes), reviews (by users of various businesses), users (name, friends), check-ins (number of check-ins at different hours), and tips (by users for various businesses). We will implement a collaborative filtering algorithm using a weighted bipartite graph to find similar users and businesses. The weights will depend on the recency of the reviews (the newer a review is, the more accurate it will be), the number of useful votes for the reviews, and check-ins by users for the businesses. We will run sentiment analysis on the tips using a basic Naïve Bayes algorithm; these sentiments will help us further enhance the recommendation system. Tools: Apache Mahout, Apache Spark, AWS, NLTK
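The recency-and-votes weighting described above can be sketched as an exponentially decayed, vote-boosted average; the half-life and decay form here are assumptions for illustration, not the team's final scheme:

```python
import math

# Hypothetical review records for one business: (stars, useful_votes, age_days).
reviews = [(5, 10, 30), (2, 0, 900), (4, 3, 60)]

def weighted_score(revs, half_life=180.0):
    """Average star rating where newer and more-upvoted reviews count more
    (exponential time decay; useful votes boost a review's weight linearly)."""
    num = den = 0.0
    for stars, votes, age in revs:
        w = (1.0 + votes) * math.exp(-math.log(2) * age / half_life)
        num += w * stars
        den += w
    return num / den

score = weighted_score(reviews)
```

Here the old 2-star review is nearly discounted away, so the weighted score sits well above the plain mean of 3.67.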

113 Progress and Expected Contributions We have collected and studied the dataset. One thing we noticed was that the data is sparse: there are very few pairs of users who have rated the same business, so the recommendations might not be very accurate. To overcome this, we may have to cluster the users and the businesses into a more compressed graph. Contributions: Siddharth: trying out a clustering algorithm (perhaps K-Means) on the users and the businesses separately. Yash: running the sentiment analysis on the tips to further enhance our recommendation system. Prateek: implementing the basic bipartite graph between users and businesses and, later, including weights based on various features.

114 E6893 Big Data Analytics: Using Big Data (Hadoop) for Identification of Aberrant Behavior Clusters in Server Performance Time Series Data Team Members: Tad Kellogg, CVN. Nov 20, 2014

115 Motivation Large enterprise data/cloud centers employ vast numbers of distributed compute servers for various organizational information technology applications, e.g. accounting, human resources, high performance computing, risk management systems, etc. To benefit the enterprise, most servers are interconnected either by design or causally, due to shared resources, e.g. network infrastructure, storage, shared database servers, etc. Identification of aberrant behavior in individual servers is currently a critical operation for enterprise data center operations, and scaling that detection to large numbers of servers (>1000) is an open problem. This project proposes to add value to aberrant behavior detection by offering a scalable detection algorithm and by identifying clusters of servers based on their aberrant behavior characteristics. Identification of previously unknown server clusters could be vital to successfully resolving application outages or server performance problems.

116 Dataset, Algorithm, Tools Dataset: a collection of NMON performance log files for 100 Linux servers with a history of 90 days; the metric collection interval will be 5 minutes. Specifically, server performance time series, e.g. CPU utilization, memory utilization, and disk and network input/output rates and throughput, will be used for analysis. Algorithms: transform NMON data from log to tabular format (Pig dataflow script); time series aggregation for percentiles and standard deviation (Hive scripts); a Map/Reduce implementation of the Holt-Winters time series forecasting algorithm; Mahout K-Means clustering. Tools: HortonWorks HDP 2.1, Mahout 1.0-SNAPSHOT
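Before the Map/Reduce port, the additive Holt-Winters recursion itself is small enough to sketch in plain Python (the smoothing constants are arbitrary here, and initialisation uses the first two seasons):

```python
def holt_winters_additive(y, m, alpha=0.5, beta=0.5, gamma=0.5, horizon=3):
    """Additive Holt-Winters smoothing; m is the season length.
    Returns point forecasts for the next `horizon` steps."""
    level = sum(y[:m]) / m
    trend = sum((y[m + i] - y[i]) / m for i in range(m)) / m
    season = [y[i] - level for i in range(m)]
    for i in range(m, len(y)):
        prev = level
        level = alpha * (y[i] - season[i % m]) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev) + (1 - beta) * trend
        season[i % m] = gamma * (y[i] - level) + (1 - gamma) * season[i % m]
    return [level + h * trend + season[(len(y) + h - 1) % m]
            for h in range(1, horizon + 1)]

# A strictly periodic "load" series is forecast exactly.
series = [10, 20, 30] * 4
forecast = holt_winters_additive(series, m=3)
```

Large forecast residuals against real NMON metrics would then flag aberrant server behavior for the clustering stage.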

117 Progress and Expected Contributions Progress: dataset collection complete; Pig and Hive scripting complete; Holt-Winters map/reduce in progress; Mahout K-Means clustering in progress. Expected Contributions: Pig scripts for consuming NMON data files into the Hadoop cluster; Hive scripts for NMON data aggregation; a Hadoop Map/Reduce based implementation of the Holt-Winters algorithm; demonstration of K-Means clustering on server performance time series data

118 E6893 Big Data Analytics: Crocuta: JavaScript Analytics System Team Members: Rusty Nelson. Nov 20, 2014

119 Motivation Dependency management; move the compute to the data; a generic interface across more platforms; dynamic (open compute) scaling; many small, unreliable nodes with limited I/O: Node.js servers and opted-in web browsers.

120 Progress P.O.C. (proofs of concept): communication between nodes (browsers and servers); dependency management; async reads of large datasets.

121 Expected Contributions

122 E6893 Big Data Analytics Life Science Group Proposals Nov 20, 2014

123 E6893 Big Data Analytics: Learning Brain Activity From fMRI Images Team Members: Zhao Hu. Nov 20, 2014

124 Motivation Background knowledge: fMRI (functional magnetic resonance imaging) is an imaging procedure using MRI technology that measures brain activity by detecting associated changes in blood flow. Goal: use machine learning methods to classify the cognitive state of a human subject based on fMRI data over a single time interval. What is a cognitive state? For example: whether the human subject is looking at a picture or a sentence; whether the subject is viewing a word describing a food or a book.

125 Dataset, Algorithm, Tools Data: image data collected every 500 msec; extremely high dimensional (more than features); extremely sparse, noisy data. Algorithm: Gaussian Naive Bayes; linear support vector machine; some network algorithms (e.g. Bayesian networks). Tools: Hadoop, Mahout, Python and Amazon EMR

126 Progress and Expected Contributions Progress: dimensionality reduction for the fMRI data (done); feature selection (in process); training and evaluation (future). Expected Contributions: help people gain a clearer understanding of how the brain responds to stimuli; improve brain diagnosis, helping to find abnormal brain activity; etc.

127 E6893 Big Data Analytics: Network Analysis on the Big Cancer Genome Data Team Members: Tai-Hsien Ou Yang and Kaiyi Zhu Nov 20, 2014

128 Motivation Hallmarks of Cancer: cancers are deadly and hard to cure; common attributes across all types of cancer are associated with patient outcomes; targetable interactions are too complicated for treatment. A genomic-clinical regulation network based on big data analytics: nobody has really done this before, and the tools have not been built on Hadoop. A friendly framework for cancer diagnosis, treatment suggestion, and prognosis prediction. D. Hanahan and R.A. Weinberg, Cell, 2011

129 Dataset, Algorithm, Tools

130 E6893 Big Data Analytics: Reversal Prediction from Physiology Data Team Members: Hongzhuo Zhang, Ruoyu Wang. Nov 20, 2014

131 Motivation Complicated physiological data → monitor → reversal prediction → human-understandable activity, with a wealth of potential applications.

132 Dataset, Algorithm, Tools Dataset: Physiological dataset for ICML 2004. Algorithms: (1) Naïve Bayes; (2) SVM-based Markov model; (3) conditional random fields; (4) ultra-fast forest of trees. Tools: Java and Hadoop

133 Progress and Expected Contributions Progress: running existing algorithms on the dataset, and trying to combine multiple algorithms over the next two weeks. Expected Contributions: improve on existing algorithms so as to (1) utilize the unlabeled data rather than ignore it and (2) achieve higher prediction accuracy.

134 E6893 Big Data Analytics: EEGoVid: An EEG-Based Interest Level Video Recommendation Engine Team Members: Mohamed El-Refaey, Vincent Rubino, Jingwen Xie, Shi Zong. Nov 20, 2014

135 Motivation In collaboration with Neuromatters, there is a need to analyze interest-level datasets captured from EEG signals of people watching movies, and the potential to build a generic EEG-based recommendation engine that doesn't require user interaction (i.e. no explicit rating).

136 Dataset, Algorithm, Tools Dataset: interest-level data taken from Neuromatters; the overall level of interest for each commercial (not time-resolved); a time-resolved interest level and a list of Super Bowl videos; ImageNet. Tentative list of algorithms: algorithms for data pre-processing; event-related field/response analysis; classification and collaborative filtering. Tools: Apache Mahout, Apache Hadoop, Java, HBase, Hive, D3 and JavaScript.

137 Progress and Expected Contributions Done with the system design and architecture; acquired the dataset from Neuromatters; done with the initial feature extraction from ImageNet; started to process the interest-level dataset and map it to video frames.

138 Progress and Expected Contributions

139 E6893 Big Data Analytics: Brain Edge Detection Team Members: Terrence Adams. Nov 20, 2014

140 Motivation General goal: improve understanding of brain structure. Background: the last several years have led to amazing progress in understanding the brain's physical structure and function. Focused goal: improve detection of synapses. How: develop computer vision algorithms for volumetric (voxel) processing of image slices taken from high-throughput electron microscopy of the brain.

141 Dataset, Algorithm, Tools Data for this project will consist mostly of high-throughput electron microscopy image slices. Neural circuits are imaged at nanometer scales, which leads to terabytes of data; a mouse brain is estimated at 2M x 2M x 100k pixels. There exists limited hand-marked data of synapses; the markings are high-precision, but there is less guarantee of recall. I plan to develop a tailored computer vision algorithm for detecting synapses. I may use Caffe to train a deep belief network, but this may not be feasible without more training data. Tools: Hadoop, Spark, ScaleGraph, InfoSphere Streams, MATLAB, OpenCV, Caffe. This list is still tentative.

142 Progress and Expected Contributions I do not have prior experience in this area. I was able to meet Will Gray Roncal, a researcher in applied neuroscience at the Applied Physics Laboratory (Johns Hopkins University). Mr. Roncal explained the basics of the Open Connectome project and how to access its data, and showed me how to download the OCP API (MATLAB); he was extremely helpful. I was also able to speak with Jacob Vogelstein and Mark Chevillet, who were likewise very helpful. I expect to develop a new synapse detector that achieves state-of-the-art results on EM brain images.

143 E6893 Big Data Analytics: BrainBow: Reconstruction of Neurons Suraj Keshri, Min-hwan Oh, Gaurav Gupta. Nov 20, 2014

144 Motivation Neurons are tree-like structures so densely packed in a volume that an average neuron makes on the order of 10^4 synapses. This astronomical number suggests that neuronal processes are densely intertwined: even a small piece of neuronal tissue contains many meters of dendrites and axons from many neurons. Isolating individual neurons automatically in a given volume is an unsolved problem. The Brainbow method uses genetic engineering to express random fluorescent colors in individual neurons. Here, we build an optimization framework for reconstructing individual neurons in Brainbow tissues.

145 Dataset, Algorithm, Tools Data: we use generic data; each image of neurons has size V (voxels) ~10^6 and N (number of neurons) ~10. We define an object as a pair (x, c^T) such that x ∈ [0, 1]^{V×1} (i.e. V voxels) and c is a non-negative column vector whose size is the same for all objects; x is referred to as a neuron. N objects are observed by Y = XC + e. The problem is to recover both X and C. Y: V×3 matrix of observations; C: N×3 matrix of RGB colors; e: noise term. Algorithms: Beta process for dictionary learning; LARS for single-neuron reconstruction. Tools: Java, Matlab, Hadoop
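To make the observation model concrete: if X were known and the observations noise-free, recovering the colour matrix C from Y = XC is ordinary least squares. A tiny stdlib sketch with a hand-inverted 2×2 normal matrix (dimensions and colours are illustrative only; the project's actual problem, recovering X and C jointly, is much harder):

```python
# 4 voxels, 2 neurons: X[v][n] = 1 if neuron n occupies voxel v.
X = [[1, 0],
     [1, 1],
     [0, 1],
     [1, 0]]
C_true = [[0.9, 0.1, 0.0],   # RGB colour of neuron 0
          [0.0, 0.2, 0.8]]   # RGB colour of neuron 1

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

Y = matmul(X, C_true)        # noise-free observed voxel colours (V x 3)

# Normal equations: C = (X^T X)^{-1} X^T Y, with a hand-coded 2x2 inverse.
XtX = matmul(transpose(X), X)
(a, b), (c, d) = XtX
det = a * d - b * c
inv = [[d / det, -b / det], [-c / det, a / det]]
C_hat = matmul(inv, matmul(transpose(X), Y))
```

With noise and unknown X, this least-squares step becomes one sub-problem inside the dictionary-learning loop.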

146 Progress and Expected Contributions Implementation on 2D data: Figure (2) is the true neuron image from which Figure (1) has been generated with noise; Figure (3) shows a result from our reconstruction algorithm on Figure (2). Our goal is to implement our algorithm on bigger datasets using Hadoop and other tools. This algorithm can be used for general dictionary learning and representation problems.

147 E6893 Big Data Analytics Retail Group Proposals Nov 20, 2014

148 E6893 Big Data Analytics: Yelp Fake Review Detection Team Members: Dhruv Kuchhal, Duo Chen, Mo Zhou, Chen Wen. Nov 20, 2014

149 Motivation Negative effects caused by opinion spamming: unfair competition; deceitful information for users; detriment to Yelp's credibility. Yelp Dataset Challenge: a ready-to-use dataset; a $5,000 prize.

150 Dataset, Algorithm, Tools Dataset: the Challenge dataset, including rating data on business attributes, check-in sets, users, the social edge graph, tips, and reviews from Phoenix, Las Vegas, Madison, Waterloo, and Edinburgh. Algorithm: the GSRank algorithm (ref. "Spotting Fake Reviewer Groups in Consumer Reviews"). Tools: Hadoop, Mahout, Java, Python

151 Progress and Expected Contributions Progress: dataset acquisition; literature review. Expected Contributions: Dhruv Kuchhal - documenting the report and processing the dataset; Duo Chen - parsing and processing the dataset and giving the presentation; Mo Zhou - studying and implementing the algorithm; Chen Wen - implementing the algorithm and documenting the final report

152 E6893 Big Data Analytics: Predicting Excitement at DonorsChoose.org Team Members: Ran Ran (rr2950), Yi Jiang (yj2306), Lina Jin (lj2351), Yuezhi Wang (yw2586). Nov 20, 2014

153 Motivation How DonorsChoose.org works: DonorsChoose.org is an online charity that makes it easy for anyone to help students in need through school donations. Public school teachers from every corner of America post classroom project requests on DonorsChoose.org, and people, companies, and foundations can help fund those requests; once a project is funded, DonorsChoose.org sends the resources directly to the classroom. Goal: help DonorsChoose.org identify projects that are exceptionally exciting to the business at the time of posting. While all projects on the site fulfill some kind of need, certain projects have a quality above and beyond what is typical; by identifying and recommending such projects early, they will improve funding outcomes, better the user experience, and help more students receive the materials they need to learn.

154 Dataset, Algorithm, Tools Dataset: 5 relational datasets about projects' information: donations, outcomes, resources, essays, and projects. Any project posted prior to the cutoff date is in the training set (along with its funding outcomes); any project posted after is in the test set. Some projects in the test set may still be live and are ignored in the scoring; which projects are still live is not disclosed, to avoid leakage regarding funding status. (From Kaggle.) Algorithm: text clustering (K-Means, Canopy); classification (Naive Bayes, SGD). Tools: HBase and Pig as the database; Mahout for the clustering and classification; R for visualization

155 Progress and Expected Contributions Progress: 1. Derive relevant features from the provided dataset, e.g., how many exciting projects did one school have in the past? 2. Cluster essays.csv into a few categories that can be used as a new feature in the dataset. 3. Train a model with the features generated above. 4. Test the correctness of this model. Expected Contribution: Help DonorsChoose.org identify projects that are exceptionally exciting to the business, so that it can improve funding outcomes, better the user experience, and help more students receive the materials they need to learn. 155

156 E6893 Big Data Analytics: Project Name Market strategy suggestions for B2C websites. Team Members: Xuebo Wang, Zixuan Gong, Siyuan Zhang, Wenxin Wang 156 Nov 20, 2014

157 Motivation Traditional marketing tools vs. the Big Data Era: data of large scale covering all aspects; convincing and concrete references; user-customized 157

158 Dataset, Algorithm, Tools (slide diagram) Dataset: customer records (e.g., Jordan, Bryant, Johnson) with fields Average Expense, Age, Area, Family member. Customized clustering groups customers into clusters (Cluster 1-3) by age, average expense ($6000-$11000), and area (New York, Boston, etc.).

159 Progress and Expected Contributions Basic requisites: raw dataset found; framework constructed. Basic contributions: construct the whole function flow; customer-friendly interface. Future contributions: result visualization; adjustable clustering parameters 159

160 E6893 Big Data Analytics: Résumé Category Classification Team Members: Kaicheng Feng, Wentao Jiang, Hongliang Xu, Kaiwei Zhang 160 Nov 20, 2014

161 Motivation Candidate/employer matching helps people find jobs and employers find the right candidates. Lynxsy is a company that matches job seekers with startup companies. It is critical to match the right profiles to the relevant positions, but relying on HR to personally review thousands of resumes is both inefficient and unreliable. We propose a tool to streamline and automate the process: analyze, filter, classify, and evaluate resumes automatically and efficiently. 161

162 Dataset, Algorithm, Tools Dataset Lynxsy shared about 3,000 resumes in PDF format with our team, together with the candidate survey information. We will manually label them using the instructions they specified. Algorithm Extract text from PDF resumes; pre-filter, pre-process, and format the raw dataset; transform the dataset into vectors; apply classification and clustering algorithms; use cross-validation to evaluate. Tools Hadoop & Mahout, Java/Ruby 162
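A minimal sketch of the vectorize / classify / cross-validate steps, with scikit-learn standing in for Mahout; the resume snippets and the two category labels below are invented.

```python
# Vectorize resume text with TF-IDF, classify with Naive Bayes, and
# evaluate with cross-validation -- a stand-in for the Mahout pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

resumes = [
    "java python backend services",   # engineering
    "sql spark data pipelines",       # engineering
    "brand campaigns social media",   # marketing
    "seo content ad campaigns",       # marketing
]
labels = ["eng", "eng", "mkt", "mkt"]

pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores = cross_val_score(pipeline, resumes, labels, cv=2)

# The expected deliverable: predicted category plus a confidence score.
pipeline.fit(resumes, labels)
category = pipeline.predict(["python sql microservices"])[0]
confidence = pipeline.predict_proba(["python sql microservices"]).max()
```

The PDF extraction and pre-filtering steps would run before this, feeding plain text into the vectorizer.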

163 Progress and Expected Contributions Progress We just received the dataset from Lynxsy. We started coding the PDF extraction and pre-filtering program. We have planned the overall project implementation. Expected A tool that takes a PDF resume as input and outputs the predicted resume category along with a prediction confidence score. 163

164 E6893 Big Data Analytics: TV Genome Project / Recommendation Engine Analytics Media Group Team Members: Ishaan Sayal, Preeti Vaidya, Joshua Edgerton, Samuel Sharpe 164 Nov 20, 2014

165 Motivation TV Genome Project / Recommendation Engine for AMG: As a media company, AMG constantly analyzes people's viewing habits and television interests. The user inputs a few shows they currently watch, and the tool predicts a few shows they might also like. MOTIVE: A personalized TV recommendation system focused on viewers' historical viewing records or demographic data. Match the most promising targets with their actual viewing habits to find the best shows to recommend to them. FUTURE PROSPECTS Products designed to gauge the potential viewership/interest/success of a new show based on the characteristics of past shows and their relative success. 165

166 Dataset, Algorithm, Tools AMG Data: TV set-top boxes and online ad networks. Fields in dataset: user_id, series_id, time_viewed_of_series, series_tuning_instances, series_days_tuned, first_date_key_active, last_date_key_active, total_hours_viewed, total_tuning_instances, total_days_tv_watched. Algorithms Used: Item-Based Recommendation; possibly Clustering; Collaborative Filtering with user and movie attributes. Tools/Languages/IDE and their applications: SQL (data querying and preprocessing), Java/Eclipse and Mahout (recommendation algorithm), PHP/Yii Framework (UI design) 166
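Item-based recommendation over these fields can be sketched as cosine similarity between the series columns of a user x series viewing-hours matrix; the matrix below is invented, and NumPy stands in for Mahout's item-similarity machinery.

```python
# Item-based collaborative filtering over total_hours_viewed-style data.
# Rows are users, columns are series; all values are made up.
import numpy as np

hours = np.array([
    [10.0, 8.0, 0.0],
    [ 9.0, 7.0, 1.0],
    [ 0.0, 1.0, 6.0],
])

def item_similarity(m):
    """Cosine similarity between series (column) vectors."""
    norms = np.linalg.norm(m, axis=0)
    return (m.T @ m) / np.outer(norms, norms)

sim = item_similarity(hours)

def recommend(user, m, sim):
    """Score unseen series by similarity-weighted hours of seen series."""
    seen = m[user] > 0
    scores = sim @ m[user]
    scores[seen] = -np.inf  # never re-recommend a watched series
    return int(np.argmax(scores))

rec = recommend(0, hours, sim)  # best unseen series for user 0
```

In production this matrix would be built by the SQL preprocessing step and the similarity computation distributed via Mahout on Hadoop.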

167 Progress and Expected Contributions Task 1: Evaluation of data - collecting useful information, pseudo-rating experimentation (Joshua, Samuel) - Done on time. Task 2: Running recommendation algorithms on the collected data (Preeti, Ishaan) - Scheduled. Task 3: Comparing algorithms, rating metrics, and their performance (Samuel, Joshua) - Scheduled. Task 4: Expanding models to include user and show attributes (Preeti, Samuel, Ishaan) - Scheduled. Task 5: UI design (Preeti, Samuel) - Scheduled. Task 6: Database implementation (Joshua, Ishaan) - Scheduled. Task 7: Report and delivery (entire team) - Scheduled. 167

168 E6893 Big Data Analytics: Project Name Acquire Valued Shoppers Team Members: Ayushi Singhal, Dharmen Mehta, Nimai Buch 168 Nov 20, 2014

169 Motivation Competing retail companies; predicting sales; offering personalized deals; increasing the customer base 169

170 Dataset, Algorithm, Tools Dataset: Size: 22 GB Algorithm: User-Based Collaborative Filtering; Item-Based Collaborative Filtering; Logistic Regression; Naïve Bayes; Multilayer Perceptron Tools: Hadoop, Mahout 170

171 Progress and Expected Contributions Expected Contributions User Perspective Customized offers Company Perspective More Revenue Progress We have acquired, studied and cleaned our dataset. We have shortlisted the tools and algorithms we want to consider. We are setting up our Virtual machine to be able to run these algorithms on our dataset. 171

172 E6893 Big Data Analytics: Twitter-Based Product / Sales Events Recommender Team Members: Qianyi Zhong, Dongxue Liu, Chia-Jung Lin, Sung-Yen Liu 172 Nov 20, 2014

173 Motivation Motivation: 1. The need for an application that gives people effective and precise product recommendations simply by tweeting their desire for a product 2. A platform to benefit both customers and local retailers Our goal: A product and sales-event recommendation system based on a social network: 1. Reply to tweets with information on the mentioned products 2. A website for retailers to post sales events and for customers to subscribe/search (location-based) 173

174 Dataset, Algorithm, Tools Data Dataset: Tweets fetched from twitter, Feedback from our followers, Amazon products information Algorithm: User-based recommendation, Item name tagging after NLP Tools: Twitter API, Amazon API, Hadoop, TweetNLP/The Stanford Parser 174

175 Progress and Expected Contributions Progress: Fetch tweets using the Twitter API; website interface Expected Contributions: 1. Build a system integrating Twitter and Amazon that is able to recommend items simply and automatically to our app's followers by querying Amazon with filtered tweets 2. Diagram of the reciprocal platform 175

176 E6893 Big Data Analytics: Yelp-er: Analyzing Yelp Data Team Members: Naman Jain, Natasha Kenkre, Rhea Goel, Sanket Jain 176 Nov 20, 2014

177 Motivation/Problem Statement Sentiment Analysis What makes a review cool, funny, or useful; find the most-talked-about topics for a business (bag of words, word cloud) User Reputation System Based on number of reviews, friends, fans, compliments, votes, yelping since, iselite(), average rating Trending Today For businesses: popularity timeline; for users: location (heat map to find hubs for different cuisines within a city) Recommendation System Item-based and user-based 177
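One way to combine the listed reputation signals is a weighted sum; the weights and the elite bonus below are illustrative assumptions, not part of the proposal.

```python
# Hypothetical reputation score over the signals named on the slide.
# Weights are invented for illustration; tuning them is the real work.
def reputation(reviews, friends, fans, compliments, votes,
               years_yelping, is_elite, avg_rating):
    score = (0.3 * reviews + 0.1 * friends + 0.2 * fans
             + 0.1 * compliments + 0.2 * votes + 0.5 * years_yelping
             + 0.05 * avg_rating)
    if is_elite:
        score *= 1.2  # assumed bonus for elite status
    return round(score, 2)

r = reputation(reviews=120, friends=40, fans=10, compliments=25,
               votes=300, years_yelping=5, is_elite=True, avg_rating=3.8)
```

A production version would normalize each signal (e.g. per-percentile) before weighting so that high-volume signals like votes do not dominate.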

178 Dataset, Algorithm, Tools Yelp Dataset Challenge - Business, User, Review, Tip, Check-ins Bag-of-words, Neural nets, SVM, Recommendation algos Hadoop, Wordle, Heat Map API, Weka 178

179 Progress and Expected Contributions Cursory analysis of dataset; formulation of problem statement; figuring out the right tools and technologies that can be used; implementation of algorithms; integration of tools. Mostly collaborative effort, but: Naman Jain: Sentiment Analysis Natasha Kenkre: Recommendation System Rhea Goel: User Reputation System Sanket Jain: Trending Today 179

180 E6893 Big Data Analytics: Project Name Big Mobile Data Team Members: Kevin Wang (fw2253) 180 Nov 20, 2014

181 Motivation Apply big data analysis, ideally in a mobile setting, or organize mobile data. Understand how big data can be applied in a telecom/mobile setting, and how to combine this with retail. 181

182 Dataset, Algorithm, Tools Dataset Yahoo Research Lab Data Set Mobile traffic data Algorithm Classification, Clustering and Recommendation Tools Mahout 182

183 Progress and Expected Contributions Started to gather mobile data for analysis 183

184 E6893 Big Data Analytics Telecom Group Proposals 184 Nov 20, 2014

185 E6893 Big Data Analytics: Network Congestion Analysis Team Members: David Cadigan, Hongjie Wang, Wei Zhang, Jiayi Yan 1 Nov 20, 2014

186 Motivation The goal: Design an intuitive tool to parse and determine shortest-path information for a computer network Simple tool to parse our node / network based dataset from download to final processing Calculate costs and paths similarly to how a packet switched network with spanning tree functions What we learn and use: Expand upon the tools which we used during our homework assignment to extract and process datasets Tool designed to further understanding of Hadoop and Mahout, as well as explore other tools for big data analytics Exploration of what others are already doing in this field Examples include TCP traffic data analysis and spanning tree evaluation 186

187 Dataset, Algorithm, Tools Dataset: Stanford P2P network data, a Gnutella dataset which represents systems as nodes and the network links between them as edges. A stretch goal is to make the tool dataset-agnostic, able to process any dataset depending on the input; this is likely outside the scope of the initial project submission. Algorithms: Set up Mahout engines to build a shortest-cost-path tool similar to the spanning tree protocol in networking; use classification and recommendation engines within Apache Mahout; classification of slow vs. fast paths. Tools: Mahout on Hadoop. Develop an initial parsing engine to get the data into the format we need (Perl, C, or Java based; the language is less important than the functionality). 187
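The shortest-cost-path idea can be sketched with Dijkstra's algorithm over a Gnutella-style edge list; the tiny graph and its link costs below are invented (the real Gnutella edge list is unweighted).

```python
# Dijkstra shortest-cost path over an edge list of (node, node, cost).
import heapq
from collections import defaultdict

edges = [("A", "B", 1), ("B", "C", 2), ("A", "C", 5), ("C", "D", 1)]

graph = defaultdict(list)
for u, v, w in edges:          # treat links as bidirectional
    graph[u].append((v, w))
    graph[v].append((u, w))

def shortest_path_cost(src, dst):
    """Classic Dijkstra with a binary heap."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, w in graph[node]:
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")

cost = shortest_path_cost("A", "D")  # A -> B -> C -> D, cost 4
```

The Mahout/Hadoop version would distribute this computation, but the per-path cost logic is the same.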

188 Progress and Expected Contributions Progress: High-level design of tools and algorithm selection under way. The next step is to develop the tools around the structure of the dataset to pull and preprocess the data. After data formatting, begin development of the overall analysis tools. Contributions: Each team member will contribute to all aspects of the project. Tool, algorithm, and data selection have all been done at the team level. Note that the team is a mix of on-campus students and a CVN student; we need to formalize communication, currently working strictly through email and teleconference. 188

189 E6893 Big Data Analytics: Project Name: Comparison Analysis of Different Telecoms Operators Team Members: Zhenying Zhu, Jiahui Cheng, Chenyun Zhao, Lingxue Li 189 Nov 20, 2014

190 Motivation Analyze web crawl data to compare several ISPs. Multiple types of web news: promotion activities, public reviews, community service news, etc. - Cluster web pages for each ISP - Make comparisons

191 Dataset, Algorithm, Tools Dataset Common Crawl Data from Amazon S3 - Information on billions of web pages - Search through the contents - Use ARC and text files Algorithms MapReduce Clustering: - KNN Clustering, Spectral Clustering, Canopy Clustering Tools Hadoop Mahout Apache Pig Java for front-end development Elastic Search SQL/NoSQL database 191

192 Progress and Expected Contributions Progress Example clusters for AT&T: Cluster 1 (related to promotion plans), top terms: savings, free, limited time; Cluster 2 (related to community news), top terms: senior, charge, stores, shops; Cluster 3 (related to public reviews), top terms: price, speed, quality. Similar clusters to be produced for T-Mobile and Sprint. Expected Contribution A search-engine tool for customers choosing telecom operators, policy makers, and journalists 192

193 E6893 Big Data Analytics: User s Web Events Analysis Based on Browser Extension Zheang Li, Cong Zhu, Linjun Kuang, Yifei Xu 193 Nov 20, 2014

194 Motivation Finding it difficult to concentrate while studying? We are developing software to monitor your browser activities. It provides a straightforward visualization of your browser events and lets you compare with others. Feeling the pressure? Stay focused 194

195 Dataset, Algorithm, Tools Dataset: Our browser extension will collect user activities as the dataset. Algorithm: 1. Monitor the footprint of the browser, generating an event sequence for each user. 2. Identify user event patterns. 3. Evaluate users' events based on time, age, area, etc. Tools: JavaScript, Apache, Hortonworks 195

196 Progress and Expected Contributions Expected Workload 1. Develop a browser extension to monitor users' web events. 2. Build a server to collect the data. 3. Analyze the data from several aspects. 4. Implement a web page to demo the results. Expected Contribution: 1. Provide visualization of your everyday browser activities. 2. Compare with other people of similar and different backgrounds 3. Provide a dataset for further research. 196

197 E6893 Big Data Analytics: Analysis of telecom service in cellular networks Team Members: Zhilei Miao(zm2221), Yizhe Wang(yw2625), Shibiao Nong(sn2603), Yaqi Chen(yc2998) 197 Nov 20, 2014

198 Why choose this topic How people allocate the usage of their telecom service and data plan. Data calls vs. voice calls; peak phone-call times; plan selection. How to provide better plans to fit customers' needs 198

199 Dataset, Algorithm, Tools Dataset: Telecom Service Dataset From Churn Response Modeling Tournament, 2003 provided by Duke University Algorithm: For telecom company: Clustering For customer: User-based Recommendation Item-based Recommendation Tools: Hadoop Mahout Neo4j 199

200 Progress and Expected Contributions. For telecom companies Divide customers into groups by clustering algorithms. Help telecom companies analyze the behaviors of customers Provide optimized service for different customer groups.. For client Analyze the characteristics of client such as data usage, payment trends and billing history Set up a recommendation system to recommend the most fit plan to customers. 200

201 E6893 Big Data Analytics: Project Name Human Activity Monitoring and Prediction Team Members: Chao Chen, Junkai Yan, Qi Li 201 Nov 20, 2014

202 Motivation Human activity monitoring and prediction system: health care, near-emergency early warning, fitness monitoring, and assisted living. Sensor data from a practical, small, and unobtrusive platform -- the smartphone: accelerometer (3-axial linear acceleration) and gyroscope (3-axial angular velocity), sampling rate 50 Hz. Each person performed six activities: walking, walking_upstairs, walking_downstairs, sitting, standing, lying 202 (Activity recognition process pipeline)

203 Dataset, Algorithm, Tools Dataset: UCI Human Activity Recognition Using Smartphones Algorithm: Data Preprocessing: FFT Classification: Support Vector Machines; Naive Bayes networks; K-Nearest Neighbors Tools: Hadoop, Mahout, Pig, Java, Matlab 203
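The FFT preprocessing step can be sketched with NumPy on a synthetic 50 Hz window; the signal below is generated, not taken from the dataset. The dominant-frequency features it produces would then feed the SVM / Naive Bayes / k-NN classifiers.

```python
# FFT feature extraction on a synthetic accelerometer window.
import numpy as np

fs = 50                          # sampling rate (Hz), as in the proposal
t = np.arange(0, 2.56, 1 / fs)   # a 128-sample window

def fft_features(signal, k=3):
    """Frequencies of the k strongest spectral bins (excluding DC)."""
    spectrum = np.abs(np.fft.rfft(signal))
    spectrum[0] = 0.0            # drop the DC component
    top = np.argsort(spectrum)[-k:][::-1]
    return np.fft.rfftfreq(len(signal), 1 / fs)[top]

# Invented "walking" signal with a ~2 Hz dominant motion frequency.
walking = np.sin(2 * np.pi * 2.0 * t)
walk_peak = fft_features(walking, k=1)[0]
```

With 128-sample windows the frequency resolution is 50/128 ≈ 0.39 Hz, which is fine for separating walking cadences from stationary activities.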

204 Progress and Expected Contributions Progress Analyzed the raw dataset of ADLs (Activities of Daily Living) obtained from a competition. Working on extracting useful information to be used for classification. Expected Contribution: 1. Analyze the signals detected by the accelerometer and gyroscope and convert them into a dataset. 2. Try different classification methods on the useful data extracted from the raw signals and compare the accuracy of the classifiers. 3. Create a predictive model that can indicate the human activity state. 4. Apply our model on a different dataset of elderly volunteers with ages between years to test its effects. 204

205 E6893 Big Data Analytics Transportation and Energy Group Proposals 205 Nov 20, 2014

206 E6893 Big Data Analytics: Project Name: Minimizing Risk in Energy Arbitrage Team Members: Adeyemi Aladesawe (aoa2124) 206 Nov 20, 2014

207 Motivation The Enron bankruptcy was the biggest news some time back. Frank A. Wolak, Department of Economics, Stanford University, described in his paper "Arbitrage, Risk Management, and Market Manipulation: What Do Energy Traders Do and When Is It Illegal?" the events, especially the market manipulation and sharp practices, that led to rising prices and Enron's collapse. Energy traders buy power at low prices at location A and sell high at location B. Profitability comes when the difference in prices is higher than the cost incurred to make the transaction happen. Buying and delivering energy immediately to spot markets incurs less risk than buying to deliver at a future time. These moments of demand are fleeting, and speed of action separates profiteers from losers. Thus, traders require knowledge of demand and supply trends, and the ability to predict, with a certain amount of confidence, that a location will demand energy priced at most at a threshold. 207

208 Dataset, Algorithm, Tools The dataset contains multivariate data types, with date timestamps and meter readings of a French household's energy consumption over a 4-year period spanning 2006 to 2010. Suggested algorithms: 1) time-series analysis to spot trends 2) plotting the data points to intuit a pattern or class of models 3) regression to learn the underlying generative model of the dataset 4) predicting/generating data from the learned models; predicting/generating data points from the learned model will make use of Bayesian inference 208
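Steps (3) and (4) can be sketched as a least-squares regression on synthetic meter readings; the consumption series below is invented, and NumPy's polyfit stands in for a fuller generative model.

```python
# Fit a linear trend to (synthetic) daily energy readings, then predict
# a future point from the learned model.
import numpy as np

days = np.arange(30)
# invented consumption: slow upward trend plus weekly seasonality
consumption = 1.5 + 0.02 * days + 0.1 * np.sin(2 * np.pi * days / 7)

# step (3): least-squares linear regression on the series
slope, intercept = np.polyfit(days, consumption, 1)

# step (4): predict a future data point from the learned model
forecast_day_35 = slope * 35 + intercept
```

A Bayesian version would place priors on slope and intercept and report a posterior predictive interval, which is what the "maximum threshold of risk" on the next slide requires.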

209 Progress and Expected Contributions - Model hypothesis is ongoing; the next phase is to select a regression function to learn a model that fits the energy-time curve - The expected contribution is the ability to forecast, within a maximum threshold of risk, energy needs in the future* - Unfortunately, the dataset under investigation is from France, so predictions will only be accurate for France 209

210 E6893 Big Data Analytics: Project Name Best Transportation Choice Team Members: Joseph Kevin Machado, Xia Shang, Zhao Pan, Andre Cunha 210 Nov 20, 2014

211 Motivation There are many apps in the market for train routes or bus routes, but none of them tries to decide which is best, or even suggests a combination of both transit methods. The goal of our project is to find the quickest possible way for a user to reach their destination. The trip may include subway trips, bus trips, or a combination of both. The datasets are extremely large, and this is where a distributed file system framework such as Hadoop comes into play. 211

212 Dataset, Algorithm, Tools Dataset: MTA dataset (cannot be published, but derived results can be used; we plan to use 2 weeks of data) Algorithm: we make use of prediction algorithms built into Mahout Tools: Hadoop, Mahout, Pig, Hive, HBase (tentative) 212

213 Progress and Expected Contributions We have started by analyzing the available data and deciding how to proceed toward our ultimate goal. Contributions: Joseph Kevin Machado: Hadoop cluster and data management. Xia Shang: Analyzing data. Zhao Pan: Analyzing data. Andre Cunha: Writing the final report. 213

214 E6893 Big Data Analytics: Manage Energy Consumption By Smart Meters Team Members: Xun Zhang (xz2348) 214 Nov 20, 2014

215 Motivation Big Data Utility companies have some of the largest customer populations. A single-state energy provider might have to store and manage over a hundred TB of data, including customer information, weather and demographics, historical utility data, geographical data, and much more. Energy Efficiency Information technology will enable massive, smart collection and management of energy data. Personalized energy plans could therefore be developed to meet customer requirements while reducing consumption and boosting efficiency. 215

216 Approaches Objectives Delivering personalized service; proactively addressing potential safety risks; transforming the utility business into a data-driven service Analytics Data: historical climate and demographic data; energy consumption data; utility price fluctuation data Algorithms: classification, decision and recommendation, etc. Tools: Python, Hadoop 216

217 Agenda Current: Data collection and pre-processing (normalization and validation) Week 12-13: Data analysis Week 13-14: Algorithm implementation and testing 217

218 E6893 Big Data Analytics: Project Name: Location Specific Optimization of Taxi Efficiency in NYC Team Members: Nick DeGiacomo, Preetam Dutta, Aamir Jahan, Tingting Lei 218 Nov 20, 2014

219 Motivation Problem Taxi inefficiency in NYC: too many taxis in one area and not enough in others. Application to Big Data 485,000 yellow-cab trips/day carrying 600,000 passengers/day. We can analyze trip data to allocate taxis more efficiently. Solution Location-specific analysis based on time of day, seasonality, etc. Algorithms to help taxi drivers determine the optimal location to search for customers. 219

220 Dataset, Algorithm, Tools Dataset: NYC TLC, gigabytes of data. Description of the dataset: source and destination information, average time between rides, GPS coordinates, fares, etc. Algorithm/Methods: time series analyses, routing and scheduling algorithms, hypothesis testing, recommendation algorithms, etc. Tools: AWS, Hadoop, R, and Mahout 220

221 Progress and Expected Contributions Start (now): 11.0 GB of data; software; web interface. Middle: time-series analysis and recommendation algorithm. Finish (deliverable): web-based, mobile application 221

222 E6893 Big Data Analytics: Project Name Citi Bike System Data Analysis Team Members: Zhefeng Xu, Wenxuan Zhang, Sun-Yi Lin, Yen-Hsi Lin 222 Nov 20, 2014

223 Motivation Citi Bike is an innovative bike-sharing system set up in recent years, providing a simple, convenient, and eco-friendly way for New Yorkers and visitors to travel around the city. Now, being in touch with this outstanding system, we may face the following questions: Where do Citi Bikers ride? When do they ride? How far do they go? Which stations are most popular? On what days of the week are most rides taken? Thus, we want to make an overall evaluation of the system. 223

224 Dataset, Algorithm, Tools Dataset The Citi Bike kiosk system records a large quantity of trip and user information, and we will use it as our dataset. Algorithm K-means, Naïve Bayes, k-nearest neighbor classification, Complement Naïve Bayes, Fuzzy K-means clustering Tools Hadoop, Eclipse, Mahout, Hive, HBase 224

225 Progress and Expected Contributions Progress: 1. Find the corresponding dataset. 2. Perform exploratory research and find algorithms 3. Get familiar with using different tools for big data. Expected Contributions: 1. Energy conservation. 2. Citi Bike usage rate according to different user types, age, gender, and region. 3. Behavior prediction of Citi Bike users. 225

226 E6893 Big Data Analytics: PeopleMaps Team Members: Anirban Gangopadhyay, Esha Maharishi, Aditya Naganath, Abhinav Mishra 226 Nov 20, 2014

227 Motivation GoogleMaps, Waze, and other path-recommendation services use a limited and pre-defined set of attributes to determine the worth of a path. We wish to utilize the very rich knowledge of individuals in their known environments to recommend paths. We do not measure any attributes directly, but rather assume that a user taking a path is voting for that path as holistically better than any other. By dynamically updating our knowledge of an environment through end-users' choices, we can provide a nuanced, insider recommendation of the current optimal path between two locales 227

228 Dataset, Algorithm, Tools The data is crowdsourced from end-users using Apple's CoreLocation framework. On a user request for the best path from A to B, if our crowdsourced data contains user-submitted paths from A to B, we use our collected paths to recommend the best path. If we do not have data from A to B, we supply the baseline GoogleMaps route suggestion, guaranteeing a minimum standard of quality. We will use k-means clustering with Euclidean distance to group collected paths together. We will maintain a dynamic queue of paths based on a defined time interval, enqueueing and dequeueing accordingly to keep the location data up to date. 228
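The "dynamic queue of paths" can be sketched as a time-windowed deque; the one-hour window and the toy paths below are assumptions, not values from the proposal.

```python
# Time-windowed queue of crowdsourced paths: entries expire after a
# fixed interval so recommendations reflect current conditions.
from collections import deque

WINDOW = 3600  # keep paths submitted in the last hour (assumed interval)

class PathQueue:
    def __init__(self):
        self.q = deque()  # (timestamp, path) pairs, oldest first

    def enqueue(self, now, path):
        self.q.append((now, path))
        self._expire(now)

    def _expire(self, now):
        # dequeue paths older than the window
        while self.q and now - self.q[0][0] > WINDOW:
            self.q.popleft()

    def current_paths(self, now):
        self._expire(now)
        return [p for _, p in self.q]

pq = PathQueue()
pq.enqueue(0, ["A", "X", "B"])
pq.enqueue(1800, ["A", "Y", "B"])
paths = pq.current_paths(4000)  # the first path (age 4000 s) has expired
```

The surviving paths in the window are what the k-means step would then cluster to pick the current best route from A to B.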

229 Progress and Expected Contributions Determined the relevant APIs for collecting location data from users (CoreLocation framework) and for submitting requests to Google for baseline directions (GoogleMaps Directions API), and a well-fitting application stack for the system (including node.js as the server, MongoDB as the database, and EC2 as the host), since it links our entire stack through Javascript. Determined interfaces between different parts of the system, such as the representation of paths in each document in the database. iOS/insertion Backend/Server - Esha & Abhinav Mahout Clustering & Recommendation - Aditya & Anirban 229

230 E6893 Big Data Analytics: Project Transeo: Making public buses more efficient and accessible Dhruv Nair Omar Kiyani Manav Malhotra 230 November 20, 2014

231 Overview The MTA provides access to real-time bus location information (existing interfaces are single-route, text-based, and stop-based, with no discovery). This allows us to do two things: Provide consumers with an interface that makes accessing this information natural, based on the notion that bus use in NYC is mostly for short distances and there are many parallel routes. Understand the optimal bus spacing and timing based on demand 231

232 What we plan on building Front-end A mobile map which shows the nearest routes as well as bus locations in real time Back-end Poll the MTA API every 15 seconds; store bus location information; store consumer demand information; map with routes and bus overlap 232
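The back-end polling loop might look like the sketch below; `fetch_bus_locations` is a hypothetical stand-in for the real MTA BusTime call, and an in-memory list stands in for the database.

```python
# Polling-loop sketch: one snapshot of bus positions per interval.
import time

POLL_INTERVAL = 15  # seconds, per the slide

def fetch_bus_locations():
    # placeholder: a real implementation would call the MTA API here
    return [{"bus_id": "M15_001", "lat": 40.71, "lon": -73.99}]

def poll(store, iterations, sleep=time.sleep):
    """Append one timestamped snapshot per interval.
    `sleep` is injectable so tests can skip real waiting."""
    for _ in range(iterations):
        store.append({"ts": time.time(), "buses": fetch_bus_locations()})
        sleep(POLL_INTERVAL)

snapshots = []
poll(snapshots, iterations=2, sleep=lambda s: None)
```

Making the sleep function injectable also lets the same loop be driven by a scheduler in production instead of blocking a thread.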

233 Data uses Tie bus movement patterns to consumer demand and traffic; analyze bus overlap and demand (we will show a sample analysis with a published BusTime dataset); provide the MTA with a continuous analysis of their services for logistical optimization 233

234 E6893 Big Data Analytics: Image Based Geo-localization Team Members: Christopher Stathis, Yongchen Jiang 234 Nov 20, 2014

235 Motivation Q: Where was this picture taken? A computer vision problem requiring: A large database of imagery Considerable computational power General approach: Build a database of images tagged with GPS information Generate features for an input image Match features against those in the DB 235

236 Application Autonomous vehicle navigation Augment inaccurate GPS systems Auto-tag camera pictures Localize (human) users if lost in the street 236

237 Dataset, Algorithm, Tools Dataset Testing set: 100 images of Columbia University Practical set: Google Street View Algorithms Matching: SIFT, ASIFT, SURF Searching and matching: Mahout recommendation, Vocabulary tree Tools Python, OpenCV, Hadoop, Mahout, Google API 237
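The matching step can be sketched in NumPy as nearest-neighbour descriptor matching with Lowe's ratio test; the descriptors below are invented, whereas the project would obtain real SIFT/ASIFT/SURF descriptors from OpenCV.

```python
# Descriptor matching with Lowe's ratio test, on made-up 2-D descriptors.
import numpy as np

query = np.array([[1.0, 0.0], [0.0, 1.0]])           # query image descriptors
db = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])  # database descriptors

def ratio_match(query, db, ratio=0.8):
    """Keep a match only if the best distance is clearly ahead of the
    second best (Lowe's ratio test)."""
    matches = []
    for i, q in enumerate(query):
        d = np.linalg.norm(db - q, axis=1)
        best, second = np.argsort(d)[:2]
        if d[best] < ratio * d[second]:
            matches.append((i, int(best)))
    return matches

matches = ratio_match(query, db)
```

The GPS tag of the database image with the most surviving matches becomes the location estimate for the query photo.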

238 Progress and Expected Contributions Using Python-OpenCV to build SIFT Building Dataset Testing image matching algorithms Exploring distributed storage strategies 238

239 E6893 Big Data Analytics Media Group Proposals 239 Nov 20, 2014

240 E6893 Big Data Analytics Project: HVision Proposal Emad Barsoum (eb2871) DES Student, Dept. of Computer Science, focus on Computer Vision. HVision is a scalable computer vision platform on top of Hadoop. 240 November 20, 2014

241 Goal and Summary Goal: 1. Provide a scalable vision platform for the Computer Vision community, to run vision algorithms on huge amounts of data in parallel. 2. Provide a solution to Hadoop's small-files problem, in order to efficiently handle large numbers of images. 3. Write a scalable Content-Based Image Retrieval (CBIR) system using different algorithms. 4. HVision is to Computer Vision what Mahout is to Machine Learning. Summary: HVision will be composed of various components; here is the high-level list: 1. Image Processing: a number of Hadoop maps that perform per-image processing tasks. 2. Feature Extraction: a number of Hadoop maps that use Computer Vision to extract various image features (i.e. Histogram, SIFT, SURF, HOG, etc). 3. Analytics: a MapReduce task that solves an end-to-end image task, such as querying using an image as input. 4. Tools: tools that can pack the input set of images, and unpack the result for viewing or analysis. 241

242 HVision High Level Architecture (diagram) Components: HVision Mappers (Image Processing, Feature Extraction), MapReduce jobs (Image Retrieval (CBIR), Classification), and Tools (Images to Seq, Seq to Images). MapReduce reads data from and writes results to Hadoop HDFS; jobs are submitted and their status monitored through the Hadoop API. 242

243 Progress, Dependencies and What Next Progress, all the items below are done and checked in: 1. Maven-based project with multiple modules in Github with Apache license: github.com/ebarsoum/hvision.git 2. Most of the plumbing and architecture decisions are done (I will provide full details in the final report). 3. A command line tool for packing images into a sequence file, and another command line tool that unpacks a sequence file back. 4. A simple Mapper-only job for verification. 5. A main driver entry point that reroutes each command to the right job (inspired by Mahout). My goal is to simplify job submission so the syntax will be: hvision <cmd> <args>. Dependencies: 1. OpenCV and JavaCV. 2. Google Guava library. 3. The Hadoop library; you can change the version from the Maven build. What next: 1. Algorithms: parallelize content-based image search on top of MapReduce. 2. Test the above on various image datasets. 3. As time permits, keep porting more vision algorithms and optimizing existing ones. 243
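The "images to seq / seq to images" tools address Hadoop's small-files problem by packing many images into one file; the sketch below uses a simple length-prefixed blob as a stand-in for a real Hadoop SequenceFile.

```python
# Pack many small image files into one blob and unpack them again,
# illustrating the idea behind HVision's seq tools.
import io
import struct

def pack(images):
    """images: dict of filename -> raw bytes."""
    buf = io.BytesIO()
    for name, data in images.items():
        nb = name.encode()
        # record header: name length, data length (big-endian uint32)
        buf.write(struct.pack(">II", len(nb), len(data)))
        buf.write(nb)
        buf.write(data)
    return buf.getvalue()

def unpack(blob):
    images, off = {}, 0
    while off < len(blob):
        nlen, dlen = struct.unpack_from(">II", blob, off)
        off += 8
        name = blob[off:off + nlen].decode(); off += nlen
        images[name] = blob[off:off + dlen]; off += dlen
    return images

imgs = {"a.jpg": b"\xff\xd8fake", "b.jpg": b"\xff\xd8data"}
roundtrip = unpack(pack(imgs))
```

A real SequenceFile adds sync markers and compression, but the key/value packing idea is the same: one large splittable file instead of thousands of tiny ones.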

244 E6893 Big Data Analytics: PlayPalate, a music playlist generator Devin Jones & Andy Enkeboll 244 Nov 20, 2014

245 Motivation PlayPalate will find new music based on your taste in music. The webapp will deliver a personalized Spotify/Rdio playlist based on artist similarity measures and relationships and a user s social graph. 245

246 Dataset, Algorithm, Tools Datasets o Spotify API o Facebook API o Rovi API Algorithms o NLP & similarity computation (Tanimoto, log-likelihood, cosine, Euclidean distance, etc) Tools o Stack: Rails app on Heroku o User DB: MySQL o Graph DB: Neo4j (GrapheneDB) o Data warehouse: Hadoop (Treasure Data) o Message queuing: Beanstalkd o Data processing: Python (NLTK) 246
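Two of the listed similarity measures, sketched over toy artist data (the tag sets and play counts are invented): Tanimoto on tag sets and cosine on play-count vectors.

```python
# Artist similarity measures from the slide, on made-up data.
import math

def tanimoto(a, b):
    """Tanimoto coefficient between two sets (e.g. artist genre tags)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Cosine similarity between two play-count vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

t = tanimoto({"rock", "indie", "90s"}, {"rock", "indie", "pop"})
c = cosine([3, 0, 1], [6, 0, 2])
```

Tanimoto suits sparse binary features (shared tags), while cosine suits weighted usage vectors; the playlist generator would blend several such scores.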

247 Progress and Expected Contributions Progress o Vetted project ideas; researched available data sources, algorithms, and the tech stack Expected Contributions o Back end: Andy o API connections & ETL: Devin o Similarity/recommendation algorithms: Devin & Andy o Playlist generation: Devin o Front end: Andy 247

248 PlayPalate Data Pipeline (diagram) The user's music history / preferences feed into artist similarity measures computed using NLP & graph traversal (Tanimoto, log-likelihood, cosine, Euclidean distance, etc) 248

249 E6893 Big Data Analytics: Movie Exploration Team Members: Yongjie Cao yc2978 Fengyi Song fs2523 Yuzhe Shen ys2821 Hui Zou hz Nov 20, 2014

250 Motivation Audience preferences: I like comedies; I prefer blockbusters; I like Robert Downey Jr; I'd like to enjoy movies with my daughter; I... Direction for producers: What kind of movie would customers like to see? Which actor is a perfect match for this character? Are there any qualities that make a memorable movie? 11

251 Dataset, Algorithm, Tools Dataset: movie_review (source: Amazon; quantity: reviews; content: movie name, score, comments); movie_test (source: Yahoo Lab; quantity: ratings; content: movie name, score, who gives that score); movie_train (source: Yahoo Lab; quantity: ratings; content: movie name, score, who gives that score) Tools: Pig: process user profiles Algorithms: Cluster & Classification: make correlations among audiences; Recommendation: make predictions based upon the keywords, extractions, and preferences 12

252 Progress and Expected Contributions (diagram): a DATA POOL connects the AUDIENCE side (expected category/actor, recommended movies, preferred category, category average score, keywords, related movie name + rating) with the PRODUCER side (comments and reviews, ratings, expectations, keywords such as universe, disaster...; actors & actresses; profits, honors, awards, costs) via correlation (similarities, clusters such as k-means, ratings) and recommenders (user-based, item-based) 13

253 E6893 Big Data Analytics: Fantasy Basketball Winning Strategy Team Members: Hao-Hsiang Chuang, Kun-Yen Tsai, Lin Su, Yujia Gu 253 Nov 20, 2014

254 Motivation Sports play a big role in the USA, and people spend much of their leisure time watching the NBA, MLB, NFL, etc. Within the last five years, ESPN and Yahoo have released online fantasy games for all the professional sports leagues, giving people a chance to form a simulated team online and compete with other gamers. The ranking of the online teams is based on the real-world performance of the players they pick, so drafting becomes a key factor in winning. We wish to provide a winning drafting strategy. Moreover, we hope to provide a practical dynamic strategy for real NBA teams to choose players. Our goals: - Recommend a player to the online gamer every round - Cluster all players into groups by function, such as scoring, rebounding, etc. - Rank the players inside each cluster - Give suggestions based on the present status of the online gamer 254

255 Dataset, Algorithm, Tools Dataset - NBA player statistics Algorithms - Clustering - k-means, fuzzy k-means, etc. - Classification - PLA, linear regression, etc. - Recommendation Tools - Mahout, Matlab, and every resource provided in class 255
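As a sketch of how k-means would group players into functional roles, here is a tiny pure-Python version (a toy, not the Mahout implementation the proposal intends; the (points, rebounds) per-game numbers are made up):

```python
import random
from math import dist

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means with Euclidean distance over tuples of floats."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist(p, centers[c]))
            clusters[i].append(p)
        # Recompute centers as cluster means (keep old center if cluster empty)
        centers = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical (points, rebounds) per game: three scorers, three rebounders
players = [(28, 5), (25, 4), (27, 6), (8, 12), (10, 11), (7, 13)]
centers, clusters = kmeans(players, k=2)
```

With more stat dimensions (assists, steals, blocks) and k greater than 2, the same loop separates scoring, rebounding, and other functional groups.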

256 Progress and Expected Contributions Progress - Survey on the point system of Fantasy Basketball - Search for all the dataset that might be useful to our recommendation - Get familiar with all the tools and algorithms that we might use Expected Contributions - Set up a dynamic recommendation system for the game - Help the gamer to build a balanced champion team for the game 256

257 E6893 Big Data Analytics: Fantasy Basketball Prediction Using Previous Season s Data Team Members: ChiaKang Chao 257 Nov 20, 2014

258 Motivation Fantasy basketball has been around for a long time, and many people participate every season. There are many prediction algorithms online (e.g., ESPN, Yahoo), but these are fixed and cannot be altered; flexibility is lacking. Therefore, a self-made predictor trained on previous seasons' data can be really helpful and will fit personal needs. 258

259 Dataset, Algorithm, Tools The dataset is acquired from basketball-reference.com. The most up-to-date data costs 650 dollars, so the alternative is to use older data (up to 2009). Instead of using the current season to check accuracy, hold out the 2009 season for testing and train on earlier seasons. Use Mahout to train and classify with k-means, then the EM algorithm. 259

260 Progress and Expected Contributions Data has been acquired after talking to the admin of the website. The data has been rearranged into a processable format; so far, all progress has been on pre-processing the raw data. The value of K for k-means still needs to be determined. So far, superstars, starters, role players, and bench warmers are the set categories, but more classes/categories would make the algorithm more versatile. 260

261 E6893 Big Data Analytics: Project Name Hunting for NBA players Team Members: Su Shen, Tianji Wang, Miao Lin 261 Nov 20, 2014

262 Motivation Amar'e Stoudemire of the New York Knicks: salary this year $23,410,988 (2nd highest in the NBA). (Table: season, games played, rebounds, points.) Give insightful advice to team managers to support decisions: 1. Quantify players' performance 2. Evaluate each team's demand 3. Recommend the best-matched players to each team 262

263 Dataset, Algorithm, Tools Dataset: historical data for all teams, players, and games from the last 10 seasons. We write Python code to extract data from the source URLs: - Beautiful Soup - scrape data - pandas - Python data analysis library - Requests - HTTP for Python (based on an open-source project for extracting NBA data from ESPN). With the above dataset and tools, we will implement our own map-reducer and recommender. 263
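The extraction step could look like the following sketch. To stay self-contained it uses the standard library's html.parser as a lightweight stand-in for Beautiful Soup, and the table markup is a made-up simplification of a stats page, not actual ESPN HTML:

```python
from html.parser import HTMLParser

class StatsTableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []          # start a fresh row
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._row.append(data.strip())

# Hypothetical fragment of a player-stats page
html = """
<table>
  <tr><td>Amar'e Stoudemire</td><td>23410988</td></tr>
  <tr><td>Player B</td><td>1000000</td></tr>
</table>
"""
parser = StatsTableParser()
parser.feed(html)
```

In the real pipeline, Requests would fetch the page and Beautiful Soup would replace the parser class; the extracted rows then feed pandas and the map-reducer.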

264 Progress and Expected Contributions Project Status: 1. Got some datasets from espn.com 2. Tailoring the datasets to meet the input format of our map-reducer 3. Working on implementing the map-reducer Dataset acquisition and formatting: Tianji Wang Algorithms, implementation: Su Shen, Miao Lin 264

265 E6893 Big Data Analytics: Music-Links: suggesting music and potential friends with similar tastes Team Members: Boren Liu (bl2547), Yihan Zou (yz2575) Jiayin Xu (jx2238), John Grossmann 265 Nov 20, 2014

266 Motivation The rise of portable mp3 players and downloaded music has made music recommendation a larger part of major e-commerce and widely used applications (iTunes, Amazon). With the widespread use of social media sites, it is possible to efficiently mine user contextual data along with the music preferences of a vast and diverse population. This new data renders old music recommendation algorithms, based solely on music content and preference, obsolete. 266

267 Dataset, Algorithm, Tools Dataset: Million Musical Tweet Dataset Algorithms: Mahout recommendation algorithms (user-based and item-based, with different similarity measures), geographic averaging Optional: clustering on geographic information Tools: Mahout, Hadoop, Java 267
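A user-based recommender of the kind Mahout provides can be sketched in plain Python (a toy illustration with invented listening data, not the Mahout API):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity over the items two users have both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def recommend(target, ratings, top_n=2):
    """Score unseen items by similarity-weighted ratings of other users."""
    scores = {}
    for other, their in ratings.items():
        if other == target:
            continue
        sim = cosine(ratings[target], their)
        for item, r in their.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical play counts: user -> {track: count}
ratings = {
    "alice": {"t1": 5, "t2": 3},
    "bob":   {"t1": 4, "t2": 3, "t3": 5},
    "carol": {"t2": 1, "t4": 4},
}
print(recommend("alice", ratings))
```

An item-based variant transposes the same idea, comparing tracks instead of users; the geographic averaging mentioned above would adjust scores using listeners' tweet locations.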

268 Progress and Expected Contributions After sufficient research, we now know specifically what we're going to do. So far, we've got our datasets and a plan to accomplish the project. By the end, we expect to be able to recommend different music to different users based on the music they like and their geospatial context. 268

269 E6893 Big Data Analytics: Affective Computational Cinematography Team Members: Brendan Jou & Joseph G. Ellis 269 Nov 20, 2014

270 Motivation Q: Which movies and portions of movies are most appealing to a user? A: The movies that appeal to a user's desired emotion. Why movies? Movies are created to elicit strong and varied emotional responses from an audience; movie creators are specialists in eliciting specific emotions at particular points within a film. Perceived emotion should correlate highly across movie-goers at the same point in a movie, whereas for user-generated content and generic videos it will not correlate as highly. Project goals: create an emotionally inspired, mid-level movie-scene feature ontology for emotion classification in movie videos; build concept detectors for emotionally relevant scene attributes that can be used for emotion prediction of movie scenes. Examples of concepts: Fight, Silence, ... 270

271 Dataset, Algorithm, Tools Dataset: Movie trailers that have been crawled from IMDB. Movie Concept Labels: Crawled from IMDB plot keywords that are used to describe each movie. Examples: Fight, Sex Scene, Yelling, Night, Day, etc. Algorithms: Feature Extraction: Clustering (K-means), SIFT, Neural Networks, Fisher Vector Encoding Classification: Multiple Instance Learning, SVM, Logistic Regression Tools: Python, BeautifulSoup, CUDA, Vlfeat, Matlab, Cluster Computing 271

272 Progress and Expected Contributions Current Progress: Wrote crawlers for IMDB dataset trailers and labels. Downloaded the dataset ~1400 Trailers w/ labels Classification framework written and completed In Progress: Feature Extraction: Dense SIFT w/ Fisher Vector Encoding Face Detection and Feature Extraction Audio Feature Extraction if time permits Shot Detection Train Concept Detectors Compute accuracy Contributions: A concept detector bank for emotionally relevant movie scene attributes that can be used for movie annotation or emotion prediction. 272

273 E6893 Big Data Analytics: Improving Movie Recommender System with User Behavior changes and Demographics Team Members: Jimin Choi (CVN Student) 273 Nov 20, 2014

274 Motivation - Most movie recommender systems rely on raw user ratings without considering other important factors such as opinion changes over time, demographics, and genre bias. - Human mood changes from time to time, and the interpretation of a rating scale differs from person to person; the same movie can be rated differently by the same user after some time. - Traditional recommendation algorithms need to be improved to take into account the human aspects of movie rating. Movie and music ratings naturally have an emotional component; relying purely on numbers will not give the best results. - My small but ambitious big data project is to come up with a significant improvement to movie recommendation systems. 274

275 Dataset, Algorithm, Tools - Dataset: Yahoo Labs Movies User Ratings and Descriptive Content Information, v.1.0 (R4); Yahoo Music ratings for User Selected and Randomly Selected songs (R3); and other movie ratings if available - Algorithm: similarity-based user- and item-based recommendation; my own filtering and weighting methods for improving recommendation algorithms - Tools: Java, the Apache Mahout open-source library, Excel for charts and graphs, a REST API framework (if time permits, I will develop APIs out of the research) 275

276 Progress and Expected Contributions - Solo project - Learning exercises for Apache Mahout have been completed, and the datasets for experimentation have been approved and compiled - Several default recommendation algorithms were tested against the chosen dataset - The algorithm design for improving the existing recommendation system still needs to be finalized - Implementation of the algorithms and evaluation of the new scheme remain to be done - Refinement and testing - Writing the report and final presentation 276

277 E6893 Big Data Analytics: MOVIE RECOMMENDATION AND ANALYSIS OF THIS APPLICATION ZIHAO WANG MINGYUAN WANG JING GUO Nov 20, 2014

278 Motivation Background With the rapid growth of the movie industry, people face numerous choices among different kinds of movies. They are overwhelmed, and it may take a lot of time to decide which movie to watch. We can recommend movies for users: 1. Users register with some information, like age and gender. 2. Cluster movies according to several factors, like category and age group. 3. Recommend movies to users by priority.

279 Dataset, Algorithm, Tools Dataset IMDb (Internet Movie Database) Algorithms 1. Write Java code to process the dataset extracted from the website 2. Fuzzy k-means 3. User- and item-based recommendation Tools Eclipse Mahout

280 Progress and Expected Contributions Progress 1. Download the dataset and format it. 2. Design the appearance and functions of the application. 3. Set up the development environment. 4. Implement the application's functions. Contribution Obtaining and formatting the dataset (Mingyuan Wang) Implementing the functions (Zihao Wang) Testing, analyzing the results, and improvement (Jing Guo)

281 E6893 Big Data Analytics: TV Genome Project / Recommendation Engine Analytics Media Group Team Members: Ishaan Sayal, Preeti Vaidya, Joshua Edgerton, Samuel Sharpe 281 Nov 20, 2014

282 Motivation TV Genome Project / Recommendation Engine for AMG: as a media company, AMG constantly analyzes people's viewing habits and television interests. A user inputs a few shows they currently watch, and the tool predicts a few shows they might also like. MOTIVE: a personalized TV recommendation system focused on viewers' historical viewing records and demographic data; match the most promising targets with their actual viewing habits to find the best shows to recommend to them. FUTURE PROSPECTS: products aimed at gauging the potential viewership/interest/success of a new show based on the characteristics of past shows and their relative success. 282

283 Dataset, Algorithm, Tools AMG Data: TV set-top boxes and online ad networks. Fields in dataset: user_id, series_id, time_viewed_of_series, series_tuning_instances, series_days_tuned, first_date_key_active, last_date_key_active, total_hours_viewed, total_tuning_instances, total_days_tv_watched. Algorithms Used: item-based recommendation; possibly clustering; collaborative filtering with user and movie attributes.
Tools/Languages/IDE and their application:
  SQL - data querying and preprocessing
  Java/Eclipse - recommendation algorithm
  Mahout - recommendation algorithm
  PHP/Yii Framework - UI design 283

284 Progress and Expected Contributions
Task 1: Evaluation of data (collecting useful information; pseudo-rating experimentation) - Joshua, Samuel - Done
Task 2: Running recommendation algorithms on collected data - Preeti, Ishaan - On time
Task 3: Comparing algorithms, rating metrics, and their performance - Samuel, Joshua - Scheduled
Task 4: Expanding models to include user and show attributes - Preeti, Samuel, Ishaan - Scheduled
Task 5: UI design - Preeti, Samuel - Scheduled
Task 6: Database implementation - Joshua, Ishaan - Scheduled
Task 7: Report and delivery - Entire team - Scheduled 284

285 E6893 Big Data Analytics: Trendy Writer Yin Hang Meng-Yi Hsu 285 Nov 20, 2014

286 Outline 286

287 General Approach (1) Grab massive data (2) Filter out stop words (3) Split data into training and testing sets (4) Train a model to find popular words GOAL: - Return popular single words in real time - Return popular related words in real time 287

288 Data Source (1) Source: major magazine websites, e.g., the Huffington Post (2) Fetch data from those websites programmatically (3) GOAL: - Grab over 10 thousand sentences in total - Grab new data in real time 288

289 Word Count Single word count (Pig): (1) Load data from file (2) Filter out stop words (3) Group by word to get word counts (4) Export results. Related words (Mahout clustering): (1) Split data into training and testing sets (2) Find the best existing model for this project (3) Improve and customize the chosen model 289
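The Pig steps above (load, filter stop words, group, count) map directly onto a few lines of Python, shown here as a local sketch (the stop-word list and sample sentence are placeholders, not the project's data):

```python
from collections import Counter
import re

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}  # placeholder list

def word_counts(text):
    """Tokenize, drop stop words, and count occurrences (Pig's LOAD/FILTER/GROUP)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

sample = "The rise of big data and the rise of data analytics"
print(word_counts(sample).most_common(2))
```

At scale the same logic runs as a Pig script on Hadoop, with each step becoming a LOAD, FILTER, GROUP BY, and STORE statement.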

290 Graph Database (1) Inject data into a graph DB (Neo4j) according to the clustering results (2) Use Cypher to query the data (3) Get the keywords of each article (4) Recommend topics by keywords 290

291 E6893 Big Data Analytics: Analysis on pricing strategy for sports team Team Members:Han Cui 291 Nov 20, 2014

292 Motivation Professional sports ticket pricing is dynamic: it changes constantly with the competitiveness of the team, which depends mainly on the team's performance and is in turn affected by many other factors. One important factor in a sports team's ticket pricing is the team's performance. Buying a ticket is a strategic decision for customers. With various tickets available, pricing them is also a strategic decision. Thus there is a relatively optimal pricing decision for the sports team. 292

293 Dataset, Algorithm, Tools This project will analyze data on major European soccer teams from regions such as England and Germany. The tool used will be Apache Mahout, and there will be a pseudo-application written in C released as open source. One important reason for choosing Apache Mahout is that the results can be displayed visually. Since this is a generally complicated project, only the main factors whose data can be legally extracted and applied will be considered. 293

294 Dataset, Algorithm, Tools Project Result: for this project I should obtain a set of optimized parameters for each variable considered, and make an optimal pricing decision for the sports team. The results should be parameters within certain ranges. The results should also be displayed visually. The results should be able to explain the real performance-wise and pricing-wise differences between clubs. 294

295 Progress and Expected Contributions Week 1: pseudo-C code created for the project Week 2: Apache Mahout project created according to the pseudo-code Week 3: test the project with all available data and tune the code until optimal 295

296 E6893 Big Data Analytics: Spark NLP Team Members: Pierre Arnoux, Neraj Bobra, Talha Ansari 296 Nov 20, 2014

297 Motivation NLP is currently one of the hottest fields in machine learning. Hadoop has machine learning libraries, but MapReduce is becoming a bottleneck. Let's do something more efficient: Spark. 297

298 Dataset, Algorithm, Tools Dataset: NewsGroup and/or Wikipedia Dataset Algorithm: K-means, and potentially LDA Tools: Spark, AWS 298

299 Progress and Expected Contributions Accuracy benchmarks have been set and we will try to get comparable results Contribute to an open source Machine Learning library for Spark 299

300 E6893 Big Data Analytics Social Science and Government Group Proposals 300 Nov 20, 2014

301 E6893 Big Data Analytics: PeopleMaps Team Members: Anirban Gangopadhyay, Esha Maharishi Aditya Naganath, Abhinav Mishra Nov 20,

302 Motivation GoogleMaps, Waze, and other path-recommendation services use a limited, predefined set of attributes to determine the worth of a path. We wish to utilize individuals' very rich knowledge of their known environments to recommend paths. We do not measure any attributes directly, but rather assume that a user taking a path is voting for that path as holistically better than any other. By dynamically updating our knowledge of an environment through end-users' choices, we can provide a nuanced, insider recommendation of the current optimal path between two locales. 302

303 Dataset, Algorithm, Tools The data is crowdsourced from end-users using Apple's CoreLocation framework. On a user request for the best path from A to B, if our crowdsourced data contains user-submitted paths from A to B, we use our collected paths to recommend the best one. If we do not have data from A to B, we supply the baseline GoogleMaps route suggestion, guaranteeing a minimum standard of quality. We will use k-means clustering with Euclidean distance to group collected paths together. We will maintain a dynamic queue of paths based on a defined time interval, enqueueing and dequeueing accordingly to keep the location data up to date. 303
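The dynamic queue of recent paths can be sketched with a time-windowed deque (a minimal sketch; the window length and the list-of-waypoints path representation are assumptions, not part of the proposal):

```python
from collections import deque

class PathWindow:
    """Keep only paths submitted within the last `window` seconds."""
    def __init__(self, window=3600):
        self.window = window
        self.paths = deque()  # (timestamp, path) in arrival order

    def add(self, timestamp, path):
        self.paths.append((timestamp, path))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop paths older than the retention interval from the front
        while self.paths and now - self.paths[0][0] > self.window:
            self.paths.popleft()

    def current(self):
        return [p for _, p in self.paths]

w = PathWindow(window=60)
w.add(0, ["A", "X", "B"])
w.add(30, ["A", "Y", "B"])
w.add(120, ["A", "Z", "B"])  # evicts the two older paths
```

Because paths arrive in time order, eviction from the front of the deque is O(1) per expired entry, which suits a continuously updated stream of user submissions.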

304 Progress and Expected Contributions Determined the relevant APIs for collecting location data from users (CoreLocation framework) and for submitting requests to Google for baseline directions (GoogleMaps Directions API), and a well-fitting application stack for the system (node.js as the server, MongoDB as the database, and EC2 as the host), since it links our entire stack through Javascript. Determined the interfaces between the parts of the system, such as the representation of paths in each document in the database. iOS/insertion, Backend/Server - Esha & Abhinav; Mahout Clustering & Recommendation - Aditya & Anirban 304

305 E6893 Big Data Analytics: Predicting usefulness of restaurants reviews from subtopics using Yelp data Team Members: Yu-Hua Cheng (yc2911) Jingchi Wang (jw3153) 305 Nov 20, 2014

306 Motivation Description of Tasks Topic modeling: categorize restaurant reviews into a number of subtopics; ideally, we expect these subtopics to be food, service, environment, delivery, etc. Classification: after getting the subtopics, use several classification algorithms to predict the usefulness of reviews in each subtopic. Why subtopics? (Topic modeling) Users have different purposes or topics of interest for a given business. For instance, some may be interested in food quality, while others care about staff service. Those who order online may care mostly about delivery quality (speed and food), while those who dine in may care a lot about environment and parking. Therefore, categorizing reviews into subtopics may help users find the reviews they are looking for. Why predict usefulness? (Classification) First, why not use the available votes (how many users found a review helpful)? Because for less popular restaurants, all reviews may have zero useful votes simply because they are barely viewed; and a newly posted review is likely to have fewer votes, which does not necessarily mean it is less useful. If we can predict the usefulness of reviews, we can surface high-quality reviews in a specific subtopic to Yelp users. 306

307 Dataset, Algorithm, Tools Data: Yelp 2014 Dataset Challenge (review dataset). Five cities: Phoenix, Las Vegas, Madison, Waterloo, and Edinburgh; 42,153 businesses; 252,898 users; 1,125,458 reviews. Algorithms: topic modeling and classification. Latent Dirichlet allocation (LDA): to get the subtopics for our classification tasks, we will first use LDA, a widely used topic model for generating topics from documents; the result is a given number of topics, each composed of a set of terms (topic words). Naive Bayes: selected because it is a traditional probabilistic method for text categorization; it applies Bayes' theorem and assumes strong independence between variables. Logistic model: a popular probabilistic model for predicting a categorical dependent variable; it models the probability that the response belongs to a particular category. Support vector machines (SVM): among the most widely used supervised learning algorithms for classification; unlike the logistic model, it is a non-probabilistic binary linear classifier, and non-linear classification can also be performed with a suitable kernel. Tools: R, Python, Mahout
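Of the classifiers listed, Naive Bayes is simple enough to sketch directly (a toy version with add-one smoothing; the "useful"/"not useful" training snippets are invented, not Yelp data):

```python
from collections import Counter
from math import log

def train(docs):
    """docs: list of (label, text). Returns per-class word counts and class priors."""
    counts, totals = {}, Counter()
    for label, text in docs:
        counts.setdefault(label, Counter()).update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    """Pick the label maximizing log P(label) + sum log P(word|label), Laplace-smoothed."""
    vocab = {w for c in counts.values() for w in c}
    best, best_lp = None, float("-inf")
    for label, c in counts.items():
        n = sum(c.values())
        lp = log(totals[label] / sum(totals.values()))
        for w in text.lower().split():
            lp += log((c[w] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Invented review snippets labeled by usefulness
docs = [
    ("useful", "great pasta friendly staff quick delivery"),
    ("useful", "detailed review of the food and service"),
    ("not_useful", "meh"),
    ("not_useful", "ok i guess"),
]
counts, totals = train(docs)
print(classify("friendly staff great food", counts, totals))
```

The same train/classify split applies per subtopic after LDA assigns each review to food, service, environment, or delivery.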

308 Progress and Expected Contributions Progress: merged the review data with the business data in order to keep only restaurant reviews, and converted the JSON data to a data frame. Expected Contributions: a more functional and targeted review service for Yelp users, surfacing valuable reviews based on each user's topics of interest. 308

309 E6893 Big Data Analytics: Study Buddy ---- Education Resources Search, Organization, Recommendation Team Members: Zidong Gao(zg2185) Huan Gao(hg2357) Yuanhui Luo(yl3026) Yifan Yang(yy2495) 309 Nov 20, 2014

310 Motivation Pertinence Free of Advertisements Recommendation of study partners and potential interests 310

311 Dataset, Algorithm, Tools Dataset: Searching results from general purpose searching engine (i.e. Google, Baidu, etc.) Self-produced user data Algorithm: Various algorithms for Classification, Recommendation Tools: Java, Mahout, Hadoop, neo4j, HTML, Javascript 311

312 Progress and Expected Contributions Classification: Data retrieval and preprocessing Model training and analysis Classification and results display Recommendation: Collection of search history and preferences Graph database integration with Java Visualization of users and topics Recommendation and results display Expected Contributions: Yuanhui Luo: Front-end work, UI, Train Model, Retrieve Data Zidong Gao: Classification: Train Model, Front-end work + Visualization Yifan Yang: Classification: Retrieve Data + Preprocessing + Train Model Huan Gao: Graph Database + Recommendation 312

313 E6893 Big Data Analytics: Big Data Analysis on Log Data of Standardized IBT Test (TOEFL) Taker for Effects of Selection Changing Team Member: Jiaming Gu (jg3460) 313 Nov 20, 2014

314 Motivation Test takers often change their selected answers during a test. It will therefore be interesting to analyze log data to find out whether test takers gain more points when they change their answers, and to gain more insight into test taker behavior generally. 1. Help test developers understand IBT test taker behavior regarding answer switching. 2. Help test takers prepare for the standardized IBT test more effectively and efficiently. 3. Improve the reliability and validity of IBT adaptive test questions by analyzing the log data. 314

315 Dataset, Algorithm, Tools Dataset: ETS internal IBT database (standardized tests database and test taker behavior tracking system) Algorithms: ETS testing behavior algorithms; ETS scoring algorithms; ETS fraud detection techniques; statistical analysis (including statistical methods and modeling analysis) Tools: Hadoop, Pig, Python, D3.js 315

316 Progress and Expected Contributions Progress: the datasets are ready, analysis methods have been decided, and implementation on the data has started for a trial phase. Expected Contributions: determine whether changing answers helps test takers gain points, and the effects of changing behavior under different sub-situations or categories, i.e., under what circumstances test takers are more likely to gain or lose points. The analysis may also contribute to research on detecting testing fraud. 316

317 E6893 Big Data Analytics: Improving Education for At-risk Students Team Members: Jairo Pava 317 Nov 20, 2014

318 Motivation This project will use data from the Department of Education to identify elementary school students who are at risk of low middle school academic performance. These students will be compared to similar students who became successful middle school students, so that individual learning plans can be designed to improve their chances of academic success. The project will offer a web front end to facilitate analysis. 318

319 Dataset, Algorithm, Tools Dataset The Early Childhood Longitudinal Study, Kindergarten Class of (link) Focuses on children's early school experiences, beginning with kindergarten and following children through middle school Enables researchers to study how a wide range of family, school, community, and individual factors are associated with school performance Algorithm Clustering using Mahout to create groups of students based on the factors described above Classification to identify whether a student belongs to an at-risk group Tools Hadoop Hive Mahout Neo4j (depends on whether the performance problems faced during HW 3 can be addressed) 319

320 Progress and Expected Contributions Progress The data has been retrieved from the Department of Education website The Features that will be used for clustering students have been identified A literature review of research on the data set has been completed Clustering and classification of students is pending Creation of web front end is pending Expected Contributions To enable educators to identify children that may be at risk of poor academic performance. To enable educators to identify learning paths for at risk children based on their similarity to other successful students. 320

321 E6893 Big Data Analytics Project Name: Scratch Analyzer Team Members: Jeff Bender 321 Nov 20, 2014

322 Motivation Scratch Initial Learning Environment Constructionism Collaborative Communities 322

323 Dataset, Algorithm, Tools Scratch Website Existing Tools: Scrape Eclipse IDE Mahout Ecosystem 323

324 Progress and Expected Contributions Scratch JSON Lucene Framework SAGE Literature Map 324

325 E6893 Big Data Analytics: Oscar Award Analysis based on big data Team Members: Xi Chen, Yunge Ma, Yuxuan Liu, Zhiyuan Guo 25 Nov 20, 2014

326 Motivation Why is this year's Oscar winner that movie? Want to know whether your favourite movie will win the Oscar for Best Picture? 326

327 Dataset, Algorithm, Tools Yahoo Lab movie dataset: - 211,231 ratings Amazon movie reviews dataset: - 8 million reviews over 10 years Algorithms: - Canopy - K-means Tools: 27

328 Progress and Expected Contributions Progress: - Finished homework 1 & 2 & 3 - Learned clustering and classification - Prepared dataset Expected Contribution: - Find movie taste of people within a specific age group - Find the group whose taste is the closest to the Oscar winner - Predict 2015 Oscar nomination 28

329 E6893 Big Data Analytics: Error Correction in Large Volume OCR Datasets Team Members: Thomas Adams 329 Nov 20, 2014

330 Motivation Large scale Optical Character Recognition (OCR) of historical documents produces quantities of text that are impractical for correction by human review. Often, methods for error identification are ad-hoc and only address well-known and easily detectable errors that are specific to a given dataset. Proposed: Development of a generalized Data Correction Toolkit, along with configurable workflows, that can apply to and adapt to the idiosyncrasies of any particular dataset; for example: different languages or data set types (names, addresses, locations, events). 330

331 Dataset, Algorithm, Tools Dataset 1.3 billion records comprising US City Directories spanning 160 years. US City Directories were historical documents published by cities that often enumerated residents, addresses, occupations, familial relationships, etc.* Algorithm Use probability analysis on term frequencies; Markov models on character sequences; configurable heuristics. Tools MapReduce, HBase, Hive, Oozie (or AWS equivalents), Java *Provided by special permission from Ancestry.com 331
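The character-sequence Markov model mentioned above can be sketched as a bigram model that scores how plausible a token is, flagging low-probability tokens as likely OCR errors (a toy version; the name corpus, smoothing constant, and charset size are placeholders):

```python
from collections import Counter
from math import log

def train_bigrams(names):
    """Count character bigrams over ^name$ with start/end boundary markers."""
    bigrams, unigrams = Counter(), Counter()
    for name in names:
        padded = "^" + name.lower() + "$"
        for a, b in zip(padded, padded[1:]):
            bigrams[a + b] += 1
            unigrams[a] += 1
    return bigrams, unigrams

def avg_log_prob(token, bigrams, unigrams, alpha=1.0, charset=28):
    """Mean smoothed log P(next char | char) across the token (higher = more plausible)."""
    padded = "^" + token.lower() + "$"
    lp = sum(
        log((bigrams[a + b] + alpha) / (unigrams[a] + alpha * charset))
        for a, b in zip(padded, padded[1:])
    )
    return lp / (len(padded) - 1)

names = ["robert", "roberta", "rob", "bob", "robin", "bobby"]  # placeholder corpus
bigrams, unigrams = train_bigrams(names)
# An OCR-garbled "Bobert" should score worse than a genuine name form
print(avg_log_prob("robert", bigrams, unigrams))
print(avg_log_prob("bobert", bigrams, unigrams))
```

Trained over the full directory corpus in a MapReduce counting pass, the same score lets correction remain unsupervised: "Bobert" falls below a plausibility threshold while "Bob" and "Rob" do not.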

332 Progress and Expected Contributions Goals To allow the unsupervised correction of large datasets by detecting highly probable OCR errors. For example: correction of "Bobert" to "Robert" should occur while allowing the existence of both "Bob" and "Rob". Of historical interest and specific to this dataset, a secondary goal is the identification of fictitious individuals intentionally but covertly included by the publishers to facilitate copyright protection. 332

333 E6893 Big Data Analytics: Project Name How to name your new-born baby(babies)? Team Members: Pei Huang (ph2325) Nov 20,

334 Motivation In this project, I would like to work on something (relatively) simple but very important for answering a real-life question: for soon-to-be parents, what name(s) should you give your new-born baby (or babies)? I will look at historically popular male/female baby names and use various recommendation, classification, and/or clustering techniques to help expecting parents make the right name choice, so their children do not complain about having not-so-cool names for the rest of their lives. 334

335 Dataset, Algorithm, Tools Dataset Popular male/female baby names going back to the 19th century, from the Social Security Administration. Algorithms User-based and item-based recommendation, clustering and/or classification Tools R HDFS/Hive/Pig/HBase Mahout 335


More information

NetView 360 Product Description

NetView 360 Product Description NetView 360 Product Description Heterogeneous network (HetNet) planning is a specialized process that should not be thought of as adaptation of the traditional macro cell planning process. The new approach

More information

Overview. Introduction. Recommender Systems & Slope One Recommender. Distributed Slope One on Mahout and Hadoop. Experimental Setup and Analyses

Overview. Introduction. Recommender Systems & Slope One Recommender. Distributed Slope One on Mahout and Hadoop. Experimental Setup and Analyses Slope One Recommender on Hadoop YONG ZHENG Center for Web Intelligence DePaul University Nov 15, 2012 Overview Introduction Recommender Systems & Slope One Recommender Distributed Slope One on Mahout and

More information

DATA EXPERTS MINE ANALYZE VISUALIZE. We accelerate research and transform data to help you create actionable insights

DATA EXPERTS MINE ANALYZE VISUALIZE. We accelerate research and transform data to help you create actionable insights DATA EXPERTS We accelerate research and transform data to help you create actionable insights WE MINE WE ANALYZE WE VISUALIZE Domains Data Mining Mining longitudinal and linked datasets from web and other

More information

Big Data and Data Science: Behind the Buzz Words

Big Data and Data Science: Behind the Buzz Words Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing

More information

CiteSeer x in the Cloud

CiteSeer x in the Cloud Published in the 2nd USENIX Workshop on Hot Topics in Cloud Computing 2010 CiteSeer x in the Cloud Pradeep B. Teregowda Pennsylvania State University C. Lee Giles Pennsylvania State University Bhuvan Urgaonkar

More information

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance. Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics

More information

Big Data Analytics OverOnline Transactional Data Set

Big Data Analytics OverOnline Transactional Data Set Big Data Analytics OverOnline Transactional Data Set Rohit Vaswani 1, Rahul Vaswani 2, Manish Shahani 3, Lifna Jos(Mentor) 4 1 B.E. Computer Engg. VES Institute of Technology, Mumbai -400074, Maharashtra,

More information

Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]

More information

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

More information

Large-Scale Test Mining

Large-Scale Test Mining Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

More information

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume

More information

DATA SCIENCE CURRICULUM WEEK 1 ONLINE PRE-WORK INSTALLING PACKAGES COMMAND LINE CODE EDITOR PYTHON STATISTICS PROJECT O5 PROJECT O3 PROJECT O2

DATA SCIENCE CURRICULUM WEEK 1 ONLINE PRE-WORK INSTALLING PACKAGES COMMAND LINE CODE EDITOR PYTHON STATISTICS PROJECT O5 PROJECT O3 PROJECT O2 DATA SCIENCE CURRICULUM Before class even begins, students start an at-home pre-work phase. When they convene in class, students spend the first eight weeks doing iterative, project-centered skill acquisition.

More information

GigaSpaces Real-Time Analytics for Big Data

GigaSpaces Real-Time Analytics for Big Data GigaSpaces Real-Time Analytics for Big Data GigaSpaces makes it easy to build and deploy large-scale real-time analytics systems Rapidly increasing use of large-scale and location-aware social media and

More information

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect Big Data & QlikView Democratizing Big Data Analytics David Freriks Principal Solution Architect TDWI Vancouver Agenda What really is Big Data? How do we separate hype from reality? How does that relate

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

HADOOP. Revised 10/19/2015

HADOOP. Revised 10/19/2015 HADOOP Revised 10/19/2015 This Page Intentionally Left Blank Table of Contents Hortonworks HDP Developer: Java... 1 Hortonworks HDP Developer: Apache Pig and Hive... 2 Hortonworks HDP Developer: Windows...

More information

Neural Networks for Sentiment Detection in Financial Text

Neural Networks for Sentiment Detection in Financial Text Neural Networks for Sentiment Detection in Financial Text Caslav Bozic* and Detlef Seese* With a rise of algorithmic trading volume in recent years, the need for automatic analysis of financial news emerged.

More information

E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics

E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2

More information

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Managing Big Data with Hadoop & Vertica A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database Copyright Vertica Systems, Inc. October 2009 Cloudera and Vertica

More information

On a Hadoop-based Analytics Service System

On a Hadoop-based Analytics Service System Int. J. Advance Soft Compu. Appl, Vol. 7, No. 1, March 2015 ISSN 2074-8523 On a Hadoop-based Analytics Service System Mikyoung Lee, Hanmin Jung, and Minhee Cho Korea Institute of Science and Technology

More information

Online Content Optimization Using Hadoop. Jyoti Ahuja Dec 20 2011

Online Content Optimization Using Hadoop. Jyoti Ahuja Dec 20 2011 Online Content Optimization Using Hadoop Jyoti Ahuja Dec 20 2011 What do we do? Deliver right CONTENT to the right USER at the right TIME o Effectively and pro-actively learn from user interactions with

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Image Search by MapReduce

Image Search by MapReduce Image Search by MapReduce COEN 241 Cloud Computing Term Project Final Report Team #5 Submitted by: Lu Yu Zhe Xu Chengcheng Huang Submitted to: Prof. Ming Hwa Wang 09/01/2015 Preface Currently, there s

More information

Mammoth Scale Machine Learning!

Mammoth Scale Machine Learning! Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data

Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data CS535 Big Data W1.A.1 CS535 BIG DATA W1.A.2 Let the data speak to you Medication Adherence Score How likely people are to take their medication, based on: How long people have lived at the same address

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the

More information

Find the Hidden Signal in Market Data Noise

Find the Hidden Signal in Market Data Noise Find the Hidden Signal in Market Data Noise Revolution Analytics Webinar, 13 March 2013 Andrie de Vries Business Services Director (Europe) @RevoAndrie [email protected] Agenda Find the Hidden

More information

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015 E6893 Big Data Analytics Lecture 8: Spark Streams and Graph Computing (I) Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Manifest for Big Data Pig, Hive & Jaql

Manifest for Big Data Pig, Hive & Jaql Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,

More information

Sources: Summary Data is exploding in volume, variety and velocity timely

Sources: Summary Data is exploding in volume, variety and velocity timely 1 Sources: The Guardian, May 2010 IDC Digital Universe, 2010 IBM Institute for Business Value, 2009 IBM CIO Study 2010 TDWI: Next Generation Data Warehouse Platforms Q4 2009 Summary Data is exploding

More information

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

ANALYTICS CENTER LEARNING PROGRAM

ANALYTICS CENTER LEARNING PROGRAM Overview of Curriculum ANALYTICS CENTER LEARNING PROGRAM The following courses are offered by Analytics Center as part of its learning program: Course Duration Prerequisites 1- Math and Theory 101 - Fundamentals

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# So What is Data Science?" Doing Data Science" Data Preparation" Roles" This Lecture" What is Data Science?" Data Science aims to derive knowledge!

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

Advanced In-Database Analytics

Advanced In-Database Analytics Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??

More information

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Is a Data Scientist the New Quant? Stuart Kozola MathWorks Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by

More information

Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392. Research Article. E-commerce recommendation system on cloud computing

Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392. Research Article. E-commerce recommendation system on cloud computing Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2015, 7(3):1388-1392 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 E-commerce recommendation system on cloud computing

More information

Ubuntu and Hadoop: the perfect match

Ubuntu and Hadoop: the perfect match WHITE PAPER Ubuntu and Hadoop: the perfect match February 2012 Copyright Canonical 2012 www.canonical.com Executive introduction In many fields of IT, there are always stand-out technologies. This is definitely

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Hong Kong Stock Index Forecasting

Hong Kong Stock Index Forecasting Hong Kong Stock Index Forecasting Tong Fu Shuo Chen Chuanqi Wei [email protected] [email protected] [email protected] Abstract Prediction of the movement of stock market is a long-time attractive topic

More information

A Big Data Analytical Framework For Portfolio Optimization Abstract. Keywords. 1. Introduction

A Big Data Analytical Framework For Portfolio Optimization Abstract. Keywords. 1. Introduction A Big Data Analytical Framework For Portfolio Optimization Dhanya Jothimani, Ravi Shankar and Surendra S. Yadav Department of Management Studies, Indian Institute of Technology Delhi {dhanya.jothimani,

More information

How To Make Sense Of Data With Altilia

How To Make Sense Of Data With Altilia HOW TO MAKE SENSE OF BIG DATA TO BETTER DRIVE BUSINESS PROCESSES, IMPROVE DECISION-MAKING, AND SUCCESSFULLY COMPETE IN TODAY S MARKETS. ALTILIA turns Big Data into Smart Data and enables businesses to

More information

An Introduction to Data Mining

An Introduction to Data Mining An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

More information

Analysis Tools and Libraries for BigData

Analysis Tools and Libraries for BigData + Analysis Tools and Libraries for BigData Lecture 02 Abhijit Bendale + Office Hours 2 n Terry Boult (Waiting to Confirm) n Abhijit Bendale (Tue 2:45 to 4:45 pm). Best if you email me in advance, but I

More information

Big Data Analysis: Apache Storm Perspective

Big Data Analysis: Apache Storm Perspective Big Data Analysis: Apache Storm Perspective Muhammad Hussain Iqbal 1, Tariq Rahim Soomro 2 Faculty of Computing, SZABIST Dubai Abstract the boom in the technology has resulted in emergence of new concepts

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information

More information

White Paper: Hadoop for Intelligence Analysis

White Paper: Hadoop for Intelligence Analysis CTOlabs.com White Paper: Hadoop for Intelligence Analysis July 2011 A White Paper providing context, tips and use cases on the topic of analysis over large quantities of data. Inside: Apache Hadoop and

More information

How To Handle Big Data With A Data Scientist

How To Handle Big Data With A Data Scientist III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution

More information

Sentiment Analysis on Big Data

Sentiment Analysis on Big Data SPAN White Paper!? Sentiment Analysis on Big Data Machine Learning Approach Several sources on the web provide deep insight about people s opinions on the products and services of various companies. Social

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012. Viswa Sharma Solutions Architect Tata Consultancy Services

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012. Viswa Sharma Solutions Architect Tata Consultancy Services Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, 2012 Viswa Sharma Solutions Architect Tata Consultancy Services 1 Agenda What is Hadoop Why Hadoop? The Net Generation is here Sizing the

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

More information

Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

More information

Chase Wu New Jersey Ins0tute of Technology

Chase Wu New Jersey Ins0tute of Technology CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at

More information

Fast Data in the Era of Big Data: Twitter s Real-

Fast Data in the Era of Big Data: Twitter s Real- Fast Data in the Era of Big Data: Twitter s Real- Time Related Query Suggestion Architecture Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, Jimmy Lin Presented by: Rania Ibrahim 1 AGENDA Motivation

More information

Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University [email protected] Chetan Naik Stony Brook University [email protected] ABSTRACT The majority

More information

Machine Learning for Data Science (CS4786) Lecture 1

Machine Learning for Data Science (CS4786) Lecture 1 Machine Learning for Data Science (CS4786) Lecture 1 Tu-Th 10:10 to 11:25 AM Hollister B14 Instructors : Lillian Lee and Karthik Sridharan ROUGH DETAILS ABOUT THE COURSE Diagnostic assignment 0 is out:

More information

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics BIG DATA & ANALYTICS Transforming the business and driving revenue through big data and analytics Collection, storage and extraction of business value from data generated from a variety of sources are

More information

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data INFO 1500 Introduction to IT Fundamentals 5. Database Systems and Managing Data Resources Learning Objectives 1. Describe how the problems of managing data resources in a traditional file environment are

More information

Assignment # 1 (Cloud Computing Security)

Assignment # 1 (Cloud Computing Security) Assignment # 1 (Cloud Computing Security) Group Members: Abdullah Abid Zeeshan Qaiser M. Umar Hayat Table of Contents Windows Azure Introduction... 4 Windows Azure Services... 4 1. Compute... 4 a) Virtual

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University [email protected] [email protected] I. Introduction III. Model The goal of our research

More information

Moving From Hadoop to Spark

Moving From Hadoop to Spark + Moving From Hadoop to Spark Sujee Maniyam Founder / Principal @ www.elephantscale.com [email protected] Bay Area ACM meetup (2015-02-23) + HI, Featured in Hadoop Weekly #109 + About Me : Sujee

More information

Testing Big data is one of the biggest

Testing Big data is one of the biggest Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing

More information

The Future of Data Management

The Future of Data Management The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

HiBench Introduction. Carson Wang ([email protected]) Software & Services Group

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group HiBench Introduction Carson Wang ([email protected]) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is

More information

Bayesian networks - Time-series models - Apache Spark & Scala

Bayesian networks - Time-series models - Apache Spark & Scala Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly

More information

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

More information

Databricks. A Primer

Databricks. A Primer Databricks A Primer Who is Databricks? Databricks was founded by the team behind Apache Spark, the most active open source project in the big data ecosystem today. Our mission at Databricks is to dramatically

More information

Keywords social media, internet, data, sentiment analysis, opinion mining, business

Keywords social media, internet, data, sentiment analysis, opinion mining, business Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Real time Extraction

More information

Getting to Know Big Data

Getting to Know Big Data Getting to Know Big Data Dr. Putchong Uthayopas Department of Computer Engineering, Faculty of Engineering, Kasetsart University Email: [email protected] Information Tsunami Rapid expansion of Smartphone

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Microsoft Big Data. Solution Brief

Microsoft Big Data. Solution Brief Microsoft Big Data Solution Brief Contents Introduction... 2 The Microsoft Big Data Solution... 3 Key Benefits... 3 Immersive Insight, Wherever You Are... 3 Connecting with the World s Data... 3 Any Data,

More information