E6893 Big Data Analytics Lecture 12: Final Project Proposals




E6893 Big Data Analytics Lecture 12: Final Project Proposals
Ching-Yung Lin, Ph.D.
Adjunct Professor, Dept. of Electrical Engineering and Computer Science
Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center
November 20th, 2014

Course Structure

Class Date   #    Topics Covered
09/04/14     1    Introduction to Big Data Analytics
09/11/14     2    Big Data Analytics Platforms
09/18/14     3    Big Data Storage and Processing
09/25/14     4    Big Data Analytics Algorithms -- I
10/02/14     5    Big Data Analytics Algorithms -- II (recommendation)
10/09/14     6    Big Data Analytics Algorithms -- III (clustering)
10/16/14     7    Big Data Analytics Algorithms -- IV (classification)
10/23/14     8    Big Data Analytics Algorithms -- V (classification & clustering)
10/30/14     9    Linked Big Data -- Graph Computing I (Graph DB)
11/06/14     10   Linked Big Data -- Graph Computing II (Graph Analytics)
11/13/14     11   Linked Big Data -- Graph Computing III (Graphical Models & Platforms)
11/20/14     --   Final Project First Presentations (12)
11/27/14     --   Thanksgiving Holiday
12/04/14     13   Next Stage of Big Data Analytics
12/11/14     14   Big Data Analytics Workshop; Final Project Presentations

Proposal List (#1 - #17, Pages 7-76) -- Industry Sector: Finance
1. Exchange Rates Inquiry and Analysis
2. Algorithm Trading Strategies Using Hadoop MapReduce
3. Image Classification in the Cloud and GPU (H-Classification & G-Classification)
4. Google-Analytics, Graph-based Online Movie Recommendation System
5. Currency Trend Analyzer
6. Correlating Price / Volume of Low Volume Stocks with Social Media
7. Trading Using Nonparametric Time Series Classification Models
8. Stock Forecasting Using Hadoop Map-Reduce
9. Real-time Risk Management System
10. Stock Daily Price Predictions Based on News
11. Stock Recommendation System
12. Stock Signal Generation Using Real-Time News Analysis
13. Customer Complaint Analyses
14. Financial Market Volatility
15. Salary Engine
16. Stock Price Movement Prediction with Hadoop+Mahout & Pydoop+Scikit
17. Sector-based Classification and Clustering of Financial News Articles

Proposal List (#18 - #33, Pages 77-147)

Industry Sector: Information
18. TV Analytics
19. Project Nova
20. Game Outcome Analysis
21. Exploring the Online and Offline Social World
22. Yelp Review Analysis and Recommendation
23. Sentiment Analysis on Movie
24. Document Analysis with Latent Dirichlet Allocation
25. Yelp Recommendation Analysis
26. Using Big Data (Hadoop) for Identification of Aberrant Behavior Clusters in Server Performance Time Series Data
27. Crocuta: JavaScript Analytics System

Industry Sector: Life Science
28. Learning Brain Activity From fMRI Images
29. Network Analysis on the Big Cancer Genome Data
30. Reversal Prediction from Physiology Data
31. EEGoVid: An EEG-Based Interest Level Video Recommendation Engine
32. Brain Edge Detection
33. BrainBow: Reconstruction of Neurons

Proposal List (#34 - #55, Pages 148-239)

Industry Sector: Retail
34. Yelp Fake Review Detection
35. Predicting Excitement at donorschoose.org
36. Market Strategy Suggestions for B2C Websites
37. Résumé Category Classification
38. TV Genome Project / Recommendation Engine
39. Analytics Media
40. Acquire Valued Shoppers
41. Twitter-Based Product / Sales Events Recommender
42. Yelp-er: Analyzing Yelp Data

Industry Sector: Telecom
43. Big Mobile Data Network Congestion Analysis
44. Comparison Analysis of Different Telecom Operators
45. Users' Web Events Analysis Based on a Browser Extension
46. Analysis of Telecom Service in Cellular Networks
47. Human Activity Monitoring and Prediction

Industry Sector: Transportation and Energy
48. Minimizing Risk in Energy Arbitrage
49. Best Transportation Choice
50. Manage Energy Consumption By Smart Meters
51. Location-Specific Optimization of Taxi Efficiency in NYC
52. Citi Bike System Data Analysis
53. PeopleMaps
54. Project Transeo: Making Public Buses More Efficient and Accessible
55. Image-Based Geo-localization

Proposal List (#56 - #77, Pages 240-336)

Industry Sector: Media
56. HVision
57. PlayPalate
58. Movie Exploration
59. Fantasy Basketball
60. Fantasy Basketball Prediction Using Previous Season's Data
61. Hunting for NBA Players
62. Music-Links
63. Affective Computational Cinematography
64. Improving Movie Recommender System with User Behavior Changes and Demographics
65. Movie Recommendation and Analysis of This Application
66. TV Genome Project / Recommendation Engine
67. TrendyWriter
68. Analysis on Pricing Strategy for Sports Teams
69. Spark NLP

Industry Sector: Social Science-Government
70. Predicting Usefulness of Restaurant Reviews from Subtopics Using Yelp Data
71. Study Buddy
72. Big Data Analysis on Log Data of Standardized IBT Test (TOEFL) Takers for Effects of Selection Changing
73. Improving Education for At-Risk Students
74. Scratch Analyzer
75. Oscar Award Analysis Based on Big Data
76. Error Correction in Large Volume OCR Datasets
77. How to Name Your Newborn Baby (Babies)?

E6893 Big Data Analytics: Finance Group Proposals

E6893 Big Data Analytics: Exchange Rates Inquiry and Analysis
Team Members: Mengnan Wang (mw2969), Xiaomeng Zhang (xz2350), Jianze Wang (jw3127), Wanding Li (wl2501)

Motivation
As international trade and commerce grow in importance, daily access to up-to-date currency rates matters not only to specific industries but to everyone. By making exchange-rate information easier to obtain, as our project aims to do, people will be better equipped to make informed decisions in the currency market.

Dataset, Algorithm, Tools
Dataset: instant exchange rates from Bloomberg; historical exchange rates from www.livecharts.co.hk/historicaldata.php
Algorithm: We will forecast the exchange rates of selected countries' currencies against the US Dollar, applying a scalable model that forecasts in real time. We will use RMSE to measure the reliability and accuracy of our predictions, and report statistical significance (e.g., p-values) to demonstrate the repeatability of the results.
Tools: Eclipse, Apache Tomcat
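The RMSE metric mentioned above is straightforward to compute; a minimal sketch (the rate series are placeholders, not the team's data):

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between forecast and realized exchange rates."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )
```

A lower RMSE indicates the forecast tracks the realized rates more closely; a perfect forecast gives 0.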

Progress and Expected Contributions
Expected Contributions: Our project is initially designed to provide users with:
- Forward and cross exchange rates for most world currencies
- Both instant and historical data
- Basic analysis of exchange rates, including regression and K-means
- Latest news about exchange rates
Current Progress: a sketch of the webpage design.
Key Factors Affecting Exchange Rates:
- PPP (Purchasing Power Parity)
- INT (Interest Rate Differential), such as Libor
- GDP (the difference in GDP growth rates)
- IGP (Income Growth Rate)
- Relative economic strength, which we may measure quantitatively using factors like GDP and IGP

E6893 Big Data Analytics: Algorithm Trading Strategies Using Hadoop MapReduce
Team Members: Yifan Wu, Meibin Chen

Motivation
Improvements in internet speed and storage mean data is flooding in everywhere; Hadoop can handle it, and MapReduce is the tool for parallel computation over that data.
Algorithmic trading is the use of computer programs to enter trading orders, with algorithms deciding every aspect of the order, such as timing, price, and quantity. Two computational needs drive this project:
1. Backtesting the algorithm against enough historical price data to validate and optimize it for profitability, stability, etc.
2. Optimizing the many parameters of a complex algorithm.
Expected benefits:
1. The system will build and select the most suitable strategy 20% faster than before.
2. The enhanced platform doubles the number of strategy groups.
3. Strategies can be updated more frequently and can include more parameters in the analysis.

Dataset, Algorithm, Tools
Dataset: stock index futures from the Shanghai Futures Exchange (the big bull market of A shares), or similar.
Algorithm: a combination of Moving Average Convergence Divergence (MACD) and the Relative Strength Index (RSI).
- MACD parameters: a long averaging period and a short averaging period
- RSI parameters: an upper threshold, a lower threshold, and a calculation period
The project is divided into two parts: first, the MapReduce platform (Hadoop); second, the trading strategies to test against the data on that platform.
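The two indicators above can be sketched in plain Python (a minimal illustration, not the team's Hadoop code; the window lengths are conventional defaults, not values from the proposal):

```python
def ema(prices, period):
    """Exponential moving average over a price series."""
    k = 2.0 / (period + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(p * k + out[-1] * (1 - k))
    return out

def macd(prices, short=12, long=26):
    """MACD line: short-period EMA minus long-period EMA."""
    s, l = ema(prices, short), ema(prices, long)
    return [a - b for a, b in zip(s, l)]

def rsi(prices, period=14):
    """Relative Strength Index over the last `period` price changes."""
    deltas = [b - a for a, b in zip(prices, prices[1:])][-period:]
    gains = sum(d for d in deltas if d > 0)
    losses = -sum(d for d in deltas if d < 0)
    if losses == 0:
        return 100.0          # all gains: maximally overbought
    rs = gains / losses
    return 100.0 - 100.0 / (1.0 + rs)
```

The parameter sweep the slides describe would then evaluate many (short, long, upper, lower, period) combinations of these indicators in parallel across MapReduce tasks.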

Progress and Expected Contributions
Inner MapReduce:
- Input: daily price data; each line contains 100 days of price information
- Output: the performance of the parameters on the data
Outer MapReduce:
- Input: parameter combinations; each line contains one combination of parameters
- Output: the best parameters
Expected contributions: algorithmic trading supports many investment strategies, including market making, inter-market spreading, arbitrage, and pure speculation. With MapReduce, it can run faster, handle multiple tasks, and deliver more real-time updates.

E6893 Big Data Analytics: Image Classification in the Cloud and GPU (H-Classification & G-Classification)
Team Members: Anand Rajan & Eric Johnson

Motivation
- With the advent of social media, the number of mobile pictures being taken and uploaded is increasing exponentially.
- Although most photos are uploaded with some basic metadata (date, time, camera model, and possibly geo-location), a great deal of detail is missing when they enter the cloud. Unless users manually tag each image, this creates a search nightmare.
- Example: how do you find that picture you took a few years back while on vacation in Paris? It was under a bridge by the river, right? Clicking through hundreds of photos, or waiting for images to cache on your phone, can take forever when you want to show someone in a pinch.
Challenge: making image search more effective requires developing and applying advanced algorithms that auto-tag images, narrowing the search space and improving the quality of results.
- Leveraging the Yahoo Labs Flickr dataset, we plan to test and build upon feature extraction methods, using a parallelized computing system to efficiently extract image characteristics.
- Using these characteristics, we will train and test image classifiers and evaluate them on precision. Beyond this, we plan to experiment with a GPU-powered processing system to measure what benefits it offers over a standard distributed system during the image analysis stage.

Dataset, Algorithm, Tools
Dataset: Yahoo/Flickr Image Dataset (83 GB), 2 million user photos (200,000 x 10 categories). Contains photo_id, JPEG URL, and corresponding metadata such as title, description, camera type, and tags. In addition, the Flickr API can be queried for details like comments, favorites, and social network data.
The data is broken into 10 categories:
1. nature, 2. food, 3. people, 4. wedding, 5. music, 6. sky, 7. london, 8. beach, 9. 2012, 10. travel
Toolset: R, Java, Hadoop/Mahout, NVIDIA CUDA
- Linux-based cluster array: 26x 2.7 GHz Intel Xeon CPUs, 64 GB RAM
- Windows 7 desktop: 2.7 GHz Intel i7 CPU, 24 GB RAM, 2 GB 256-bit GDDR5 GPU with 1344 CUDA cores
Algorithms, by stage:
- Feature extraction / reduction: SIFT, HOG, PCA (Map stage; well suited to GPU)
- Encoding: K-means clustering (Reduce stage)
- Classification: Naïve Bayes, logistic SGD regression
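The encoding stage above (clustering local descriptors into a visual vocabulary, then representing each image as a histogram over cluster assignments) can be sketched with NumPy. This is a generic bag-of-visual-words illustration, not the team's Mahout pipeline; the SIFT/HOG step is assumed to have already produced the descriptor vectors:

```python
import numpy as np

def kmeans(descriptors, k, iters=10, seed=0):
    """Plain Lloyd's k-means: returns the visual-vocabulary centroids."""
    rng = np.random.default_rng(seed)
    centroids = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest centroid
        d = np.linalg.norm(descriptors[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned descriptors
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def encode(image_descriptors, centroids):
    """Bag-of-visual-words histogram for one image's descriptors."""
    d = np.linalg.norm(image_descriptors[:, None] - centroids[None], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(centroids))
    return hist / hist.sum()
```

The resulting fixed-length histograms are what a Naïve Bayes or logistic-regression classifier would consume in the final stage.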

Progress and Expected Contributions
Progress:
- Conducted thorough research on open-source image classification packages and the tools required for feature extraction and analysis
- Set up an environment for the Hadoop distributed system
- Acquired the Yahoo Flickr dataset and began initial testing
- Acquired hardware and performed initial tests on a GPU-based system
Expected Contributions:
- Our goal is to research, implement, and build upon current open-source offerings for image classification, to improve the auto-tagging of digital photos
- Given the time and resources, we aim to develop a web-based interface that lets users upload an image and runs feature analysis and extraction to determine its tags and keywords
Potential Challenges:
- Feature extraction research is very resource intensive, so this is a challenging project for only two students
- The project involves many abstract mathematical models
- Companies like Yahoo and Facebook invest millions in this field
- We anticipate this will be an ongoing project we can continue well beyond the course, perhaps into the second semester

E6893 Big Data Analytics: Google-Analytics, Graph-based Online Movie Recommendation System
Team Members: Tian Han, Yifan Du, and Hang Guan

Motivation
- To build our own movie website and implement the movie recommendation functionality.

Process Flow (diagram): the movie website generates user log files; Google Analytics turns them into keywords for querying the graph database; the graph database returns recommendations to the website.

Dataset, Algorithm, Tools
- Dataset: MovieLens 1M dataset, which contains 1 million ratings from 6,000 users on 4,000 movies
- Algorithm: various collaborative filtering algorithms, e.g., user-based recommendation, item-based recommendation, etc.
- Tools:
  Web design: Dreamweaver / CoffeeCup
  User log file analysis: Google Analytics
  Graph database: Gremlin / Neo4j
  Others: Mahout / Eclipse
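User-based collaborative filtering of the kind Mahout provides can be sketched in a few lines of Python (cosine similarity over rating vectors; the tiny rating dictionaries below are invented for illustration, not MovieLens data):

```python
import math

def cosine(u, v):
    """Cosine similarity between two rating dicts {movie: rating}."""
    common = set(u) & set(v)
    num = sum(u[m] * v[m] for m in common)
    den = math.sqrt(sum(r * r for r in u.values())) * \
          math.sqrt(sum(r * r for r in v.values()))
    return num / den if den else 0.0

def recommend(target, others, top_n=2):
    """Score unseen movies by similarity-weighted ratings of other users."""
    scores = {}
    for user in others:
        sim = cosine(target, user)
        for movie, rating in user.items():
            if movie not in target:
                scores[movie] = scores.get(movie, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Item-based recommendation follows the same pattern with the similarity computed between item columns rather than user rows.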

Progress and Expected Contributions
Timeline:
- Now -- 11/27/14: design and publish the movie website
- 11/28 -- 12/04/14: plug Google Analytics into the website and analyze user log files
- 12/05 -- 12/11/14: generate queries for the graph database and update recommended movies on the website
Expected Contributions:
- Build our own online movie website
- Implement the movie recommendation functionality
- Do website analysis using Google Analytics

E6893 Big Data Analytics: Currency Trend Analyzer
Team Members: Tim Paine, Mark Aligbe

Motivation
Forex trading requires statistical insight into the exchange market.
- Despite the large quantity of data, visualization is only utilized at the day/week/month level
- It is difficult to see and analyze real-time trends at granularity finer than one day
- We need to collect, analyze, and visualize data streaming in real time
Solution: distributed computing, real-time computation, statistical analysis, data visualization

Datasets, Algorithms, Tools
Datasets: large amounts of intraday/daily forex/equity/other data
Algorithms:
- Recommenders: suggesting trading prices and items to exchange
- Clustering: to analyze trends over a variable period of time
- Classification: to classify trends into upward/downward movements and momentum
Tools: Java and Mahout for the analytics; JavaScript, Python, and R for data gathering, the web server, and visualization

Progress and Expected Contributions
- Forex data acquired, sanitized, formatted, and ready to use
- Built a system to batch-collect data from multiple feeds as it becomes available
- Current stage: build design and field research
- Next steps: evaluate other distributed computing libraries
- End contribution: an extensible framework for collecting, analyzing, and visualizing real-time data feeds

E6893 Big Data Analytics: Correlating Price / Volume of Low Volume Stocks with Social Media
Team Members: Jeff Ho, MS Statistics; William Lee, MS Operations Research (CVN)

Hypothesis and Method
Hypothesis:
- Low-volume stocks typically do not generate mainstream news coverage
- We hypothesize that social media could be a useful source of information about them
Method:
- Backtest different ways of using Big Data (specifically Twitter) to try to predict future price movements
- Test various cases, seeking a correlation between tweets and price or volume movements of low-volume stocks
- Verify whether these tweets are leading or lagging indicators of price or volume changes

Stock Criteria
- Low-volume stocks; no / low analyst coverage; stick to one industry. Data source: Yahoo Finance
Big Data Platform
- Twitter. Data source: Twitter API

Predicting stock movements with tweet volume
1. Built a list of stocks to test
2. Found cases where tweets can predict absolute movement in prices and stock volume
3. Can build a strategy around tweet volumes
[Charts] ALSK shows a rise in tweet volume two days prior to significant stock movement: absolute percent change in stock price vs. tweet volume two days prior, with fit y = 0.0033x + 0.0144, R² = 0.5572. ALSK also shows a rise in tweet volume two days prior to a significant stock volume increase: stock volume vs. tweet volume two days prior, with fit y = 22249x + 146257, R² = 0.6768.
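The trend lines and R² values on the slide come from ordinary least squares; a minimal version of that fit, with a synthetic series standing in for the ALSK tweet/price data:

```python
def ols(xs, ys):
    """Least-squares line y = a*x + b, plus the R² of the fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot
```

Here xs would be tweet volume two days prior and ys the subsequent price (or volume) change; an R² near 0.56-0.68, as on the slide, indicates a moderately strong linear relationship.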

E6893 Big Data Analytics: Trading Using Nonparametric Time Series Classification Models
Team Members: Yufan Cai, Bowen Wang, Junchao Zhang

Motivation
Traditional trading strategies usually involve time series models such as GARCH, which make it difficult to incorporate categorical inputs such as Twitter data. Using classification models, we can predict whether an asset price will go up or down while incorporating unstructured data streams.

Dataset, Algorithm, Tools
Dataset: live stock price data and order book data (Bloomberg API, Bitcoin/USD); Twitter (optional)
Algorithms: Mahout algorithms such as logistic regression; others such as weighted majority voting and nearest-neighbor classification
Tools: Java, Mahout, Hadoop

Progress and Expected Contributions
Progress:
- Researched various recent time series classification models
- Set up an interface with the data API
Expected Contributions:
- Provide Hadoop-based implementations of classification models for time series

E6893 Big Data Analytics: Stock Forecasting Using Hadoop Map-Reduce
Team Members: Yi Yu, Yu Xia, Xiangliang Yang, Yumeng Xu

Motivation
The stock market combines high profit with high risk, so stock market analysis and prediction have long drawn attention. The stock price trend is a complex nonlinear function, yet prices retain a degree of predictability. Hadoop MapReduce is a framework designed for processing large datasets from distributed sources; Apache Hadoop is an implementation of MapReduce.

Dataset, Algorithm, Tools
Algorithms: Pearson correlation similarity, Euclidean distance similarity, Stochastic Gradient Descent (SGD)
Tools: Hadoop, Mahout, HBase

Progress and Expected Contributions
Expected Contributions: We will analyze the dataset Daily Holdings for All ProShares ETFs, which contains a wealth of information collected from the stock exchange market. The first step is to scrutinize the data and flag stocks that may rise. From these screened stocks, we will suggest to a given user a stock they may be interested in.
Progress: We have obtained the dataset and analyzed some of the similarity measures that could be useful in further steps.

E6893 Big Data Analytics: Real-time Risk Management System
Team Members: Iljoon Hwang, Sungwoo Yoo, Sungjoon Huh

Motivation
- Objective: develop a real-time risk management system (intraday Value at Risk) for large, complex portfolios in a unified framework.
- Expected outcome: a system that computes stressed VaR, "what-if" scenarios, and stress tests on complex portfolios with large numbers of underlying risk factors and vectors, in real time.
- Importance: risk management is crucial throughout investment and trading activities, from the front trading desk to the back office. However, because calculating VaR on a large multi-asset portfolio is so complex, real-time VaR is not available on legacy systems. Big Data with in-memory, multi-dimensional analytics can resolve this.

Dataset, Algorithm, Tools
Dataset: Yahoo Finance tick data for the S&P 500
Algorithms: Value at Risk via Monte Carlo simulation and the parametric method; Big Data via Map-Reduce and in-memory processing
Tools / Languages: Hadoop, Spark, Scala, R, Python, and Google Cloud
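The Monte Carlo approach to Value at Risk named above can be sketched as follows. This is a toy single-asset version with a normal-returns assumption and made-up parameters, not the team's multi-asset model:

```python
import random

def monte_carlo_var(mu, sigma, portfolio_value,
                    confidence=0.99, n_sims=100_000, seed=42):
    """One-day VaR: simulate returns, read the loss at the given quantile."""
    rng = random.Random(seed)
    losses = sorted(-portfolio_value * rng.gauss(mu, sigma)
                    for _ in range(n_sims))
    return losses[int(confidence * n_sims)]
```

For a $1M portfolio with daily return volatility of 2%, the 99% one-day VaR comes out near the parametric value 2.33 * 0.02 * $1M, which is roughly $46,500; the parametric method computes that figure directly from the distribution instead of simulating. Distributing the simulation paths is what makes this a natural fit for Map-Reduce or Spark.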

Progress and Expected Contributions
Current status:
- Constructed the development environment in Google Cloud
- Collecting tick data from Yahoo Finance
- Decided on the specific algorithms for calculating Value at Risk
- Studying related papers and articles
Team members and contributions:
- Iljoon Hwang: environment construction, storage server programming, testbed
- Sungwoo Yoo: data collection, research papers / articles
- Sungjoon Huh: batch processing / real-time application programming

E6893 Big Data Analytics: Stock Daily Price Predictions Based on News
Team Members: Jie Liu, Jingnan Li, Lu Qiu, Ruixin Tan

Motivation
Traditional technical trading takes into account the quantitative but not the qualitative factors that influence stock prices, yet it is well known that news items have a significant impact on stock indices and prices. To make better predictions, our model combines quantitative methods with NLP features extracted from news headlines.

Dataset, Algorithm, Tools
Dataset (Jan 2, 2013 -- Nov 17, 2014):
- Historical prices (Yahoo Finance, numeric): Google, Apple, and IBM daily stock prices
- Indices and yield (Yahoo Finance, numeric): Nasdaq, S&P 500, 5-year Treasury yield
- News articles (Yahoo Finance / Bloomberg, text, converted to NLP features): articles related to Google, Apple, and IBM, collected via the Yahoo Finance RSS feed (around 3,000 articles)
Algorithms: NLP, Lasso, SVR, ARIMAX
- Feature selection: Lasso
- Prediction: ARIMAX, SVR
Tools: Java, Python, Hadoop, etc.
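The Lasso feature-selection step above can be sketched with cyclic coordinate descent and soft-thresholding (a generic NumPy illustration on synthetic data, not the team's pipeline; the penalty `lam` is an example value):

```python
import numpy as np

def lasso_cd(X, y, lam, iters=200):
    """Lasso (0.5*||y - Xw||^2 + lam*||w||_1) via coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(iters):
        for j in range(p):
            # residual with feature j's contribution removed
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r
            z = X[:, j] @ X[:, j]
            # soft-thresholding shrinks small coefficients exactly to zero
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / z
    return w
```

Features whose coefficients are driven to zero (here, irrelevant NLP features) are dropped before fitting the ARIMAX or SVR prediction model.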

Progress and Expected Contributions
What we have done:
- Collected the numerical data and part of the news
- Built preliminary models, verifying that our approach is reasonable
[Chart] Correlation of NLP features vs. AAPL stock prices across Day -3 through Day 2 (values up to about 0.05).
R² of regression methods:
Method | Train | Test
Baseline: linear | 0.335 | 0.094
SVM with RBF kernel | 0.426 | 0.172
SVM with RBF, indices, lag = 1 on stock | -0.001 | -0.108
What we will do:
- Implement the algorithm on the whole dataset
- Demonstrate the positive impact of NLP features on stock price prediction

E6893 Big Data Analytics: Stock Recommendation System
Team Members: Guangyang Zhang, Yuechen Qin, Bowen Dang, Zheng Fang

Motivation
- Help stock buyers make wiser choices
- Find those who are very good at gaining profits from stocks (the experts)
- Use user-based collaborative recommendation to measure similarity between a buyer and the experts
- Recommend to the buyer stocks held by the most similar expert
- Free stock buyers from the heavy burden of sifting through thousands of stocks to make a wise choice

Dataset, Algorithm, Tools
Dataset: NASDAQ stock exchange data; Yahoo Finance historical prices; simulated trading records of 100,000 users
Algorithm: user-based collaborative filtering with inferred tag ratings
Tools: Eclipse J2EE, Mahout, Maven, MySQL, PHP server

Progress and Expected Contributions
Progress:
- Implemented the UI and database
- Chose the algorithm: user-based collaborative filtering with inferred tag ratings
Expected Contributions:
- Achieve an improved collaborative filtering algorithm with inferred tag ratings
- Present clients with sound stock recommendations based on analysis of experts' buying choices

E6893 Big Data Analytics: Stock Signal Generation Using Real-Time News Analysis
Team Members: Mandeep Singh, Mayank Misra, Rajesh Madala, Shreyas Shrivastava

Motivation
- Stocks have moved on White House-related news posted to Twitter; there is even a Twitter-based hedge fund; algorithmic trading is widespread
- Twitter data is traditionally used for sentiment analysis, but it is now also a great source for consuming news in real time, and stock prices are correlated with news
- By applying appropriate filters to tweets from news agencies and then scoring the filtered tweets, we aim to generate signals for stock prices that algorithms or traders could consume to make better trades
- The framework we are building is scalable and could potentially generate signals for a portfolio of heterogeneous stocks

Dataset, Algorithm, Tools
Pipeline (diagram): Stream → Filter Algo → Scorer Algo, run in parallel across several filter/scorer instances
Dataset: the Twitter stream
Algorithms: custom filtering and scoring algorithms using NLP and logistic regression
Tools: stream processing, Python, a JSON parser, Python ML libraries, d3.js
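The filter/scorer pipeline above can be sketched as follows. Everything here is hypothetical stand-in logic (a keyword filter and a word-count lexicon scorer); the team's actual algorithms use NLP and logistic regression:

```python
# Hypothetical keyword map and sentiment lexicons for illustration only.
TICKER_KEYWORDS = {"AAPL": ["apple", "iphone"], "TSLA": ["tesla", "musk"]}
POSITIVE = {"surge", "beat", "record"}
NEGATIVE = {"drop", "miss", "recall"}

def filter_tweet(tweet):
    """Filter stage: tag a tweet with the tickers whose keywords it mentions."""
    words = tweet.lower().split()
    return [t for t, kws in TICKER_KEYWORDS.items()
            if any(k in words for k in kws)]

def score_tweet(tweet):
    """Scorer stage: positive minus negative word counts."""
    words = set(tweet.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

def signals(stream):
    """Fold a tweet stream into per-ticker sentiment signals."""
    totals = {}
    for tweet in stream:
        s = score_tweet(tweet)
        for ticker in filter_tweet(tweet):
            totals[ticker] = totals.get(ticker, 0) + s
    return totals
```

In the real system each filter/scorer instance would process a partition of the stream, and the per-ticker totals would be merged and plotted against live prices.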

Progress and Expected Contributions
So far we have completed the stream ingestion stage and are refining our filtering and scoring algorithms to improve the correlation between generated signals and stock price movements. Breakdown of work and contributions by team member:
1. Ingesting the data stream: Mayank
2. Filtering algorithm: Mandeep
3. Scoring algorithm: Shreyas
4. Fetching real-time stock data and plotting it alongside the signals generated in real time: Rajesh

E6893 Big Data Analytics Project Name Customer Complaint Analysis: Insights into Issues Plaguing the Banking Sector Team Members Abhaar Gupta, Avinash Sridhar, Nachiket Rau, Sankalp Singayapally 57 Nov 20, 2014

Motivation One of the biggest challenges for banks is minimizing the customer attrition rate, which is directly dependent on customer satisfaction. Customers are inclined to choose banks that can be trusted for their services. Banks make their decisions based on a subset of data because of the absence of scalable solutions. In this project, we propose a scalable design to counter the above problems. 58

Dataset, Algorithm, Tools
Consumer Complaints Database: The dataset contains retail consumer complaints about banks and financial institutions (provided by the Consumer Financial Protection Bureau): http://www.consumerfinance.gov/complaintdatabase/
Algorithms: Various clustering and classification algorithms
Tools and Languages: Hadoop, Mahout, Java, Python 59

Progress and Expected Contributions
Identify major retail banking issues by state and match-analyze them based on geographic or socio-economic brackets.
Surface the top concerns of consumers in various states.
Derive the business impact on the institution of customer satisfaction or dissatisfaction with their complaints.
Propose likely solutions that can serve as a first response for future complaints of a similar nature.
Hypothesize a performance metric, applied to all complaints, that can be used to prioritize complaints based on resolution time. 60

E6893 Big Data Analytics: Financial Market Volatility Team Members: John Terzis Tim Wu Oliver Zhou Jimmy Zhong 61 Nov 20, 2014

Motivation Understanding volatility in financial markets has long been of interest to hedgers and speculators. Empirical evidence has shown that volatility is a highly nonlinear evolving process. Modeling this process using the Hadoop ecosystem can offer tremendous advantages over traditional econometric models, which are limited to datasets that fit in main memory. 62

Dataset, Algorithm, Tools
Dataset: We have procured a massive dataset of price quotes on equities, exchange-traded futures, futures, and market indices over the span of the last ten to fifteen years at one-minute granularity. In addition to price quotes on specific instruments, our dataset features derivative indicators of price and volume activity.
Algorithm: We propose to train and test several scalable machine-learning regression models on our dataset, with the goal of producing a functional form of future realized volatility at the symbol level that minimizes bias and variance and ultimately generalizes well to unseen data. Feature selection will be integral to the task, given the likelihood that many of our input variables are highly correlated. We intend to build a framework on top of Apache Spark that can, at a minimum, perform n-fold cross-validation of a training model and use beam search or other established methods to calibrate the hyperparameters of our SVM, random forest, or regularized regression model in a reasonably fast time frame, given the algorithmic complexity of the underlying routines.
Tools: Hadoop, Apache Spark, Mahout, AWS, Git, R, Java, Python 63
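The n-fold cross-validation loop described above can be sketched without Spark. This toy uses closed-form ridge regression on synthetic data; the dataset shape, features, and lambda grid are invented for illustration, and the real framework would distribute the folds:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the volatility dataset (shapes and features invented).
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

def ridge_fit(A, b, lam):
    # Closed-form ridge regression via the normal equations.
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

def cv_mse(lam, k=5):
    # n-fold cross-validation: average held-out mean squared error.
    folds = np.array_split(np.arange(len(X)), k)
    errs = []
    for f in folds:
        mask = np.ones(len(X), bool)
        mask[f] = False
        w = ridge_fit(X[mask], y[mask], lam)
        errs.append(np.mean((X[f] @ w - y[f]) ** 2))
    return float(np.mean(errs))

# Pick the regularization strength with the lowest cross-validated error.
best = min([0.01, 0.1, 1.0, 10.0], key=cv_mse)
```

Beam search over a larger hyperparameter space works the same way: each candidate is scored by `cv_mse` and only the best few are expanded.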

Progress and Expected Contributions
Progress: Set up an AWS web server with Ubuntu, Hadoop, and Mahout environments. Purchased and uploaded the dataset. Selected an initial set of machine learning algorithms.
Expected Contributions: Preprocessing: Jimmy/Tim/John; Feature Selection: John/Oliver; Spark SVM: John; Mahout Random Forest: Oliver/Tim; Evaluator: Jimmy/Tim; R APIs: Tim; Java APIs: Jimmy
Forward Progress: Preprocessing & Setup: 11/22; Algorithm Application: 11/30; Evaluation: 12/6; Final Report: 12/11 64

E6893 Big Data Analytics: Project Name: Salary Engine Team Members: Lin Huang, Mingrui Xu, Wei Cao, Fan Ye 65 Nov 20, 2014

Motivation
Main idea: job ads. Employers could determine a more reasonable salary for a position; employees could find more jobs matching their background by using our recommendation system.
We want to help employers and jobseekers figure out the market worth of different positions by building a prediction engine for the salary of any UK job ad. In this way, we would bring more transparency to this important market.
[Figure: simple sample of our Salary Engine] 66

Dataset, Algorithm, Tools
Dataset: Job(id (PK), title, descrip, loc, locnorm, jobtype, time, company, category, salary, salarynorm, sourcename); exceeds 2 GB, over 100,000 records
Algorithm:
Prediction model (to help companies offer a more reasonable salary): Classification: Stochastic Gradient Descent, Naive Bayes, Random Forest; Regression: SVM + kernel, linear regression, K-NN
Job recommendation system (to recommend positions to jobseekers): Use item-based and user-based concepts, based on the similarity between positions and employees.
Tools: AWS (EC2, S3, DynamoDB), Hive, Tomcat, Mahout 67

Progress and Expected Contributions
Progress: Data cleaning (text mining) and model building (machine learning algorithms); Web UI design, servlet API implementation, database construction; chat forum design and recommendation evaluation; result testing and model improvement using the test dataset, which leaves the salary field blank. Work will focus on improving prediction and recommendation accuracy by adjusting the methods or parameters in the model.
Expected Contributions: Implement and improve two models that achieve good accuracy in prediction and recommendation respectively, and analyze their scope of application. Make the website dynamic so that newly posted job info can be loaded in as part of the dataset. Allow jobseekers to search for specific opportunities, get recommendations from the system, and exchange information with others. 68

E6893 Big Data Analytics: Project Name Stock Price Movement Prediction with Hadoop+Mahout & Pydoop+Scikit Team Members: Arman Arkilic, Ao Hong, Tian Zhang, Yuheng Liu 69

Motivation, Dataset and Expected Contribution
Motivation: Predicting stock price movements is essential for portfolio risk management and is at the core of any trading model. Test Hadoop and its tools' ability to serve such a basic and common purpose. Stock tick data is available to the public at various time ranges and frequencies. The output is straightforward and easy to evaluate.
Datasets: Google Finance URL format: www.google.com/finance/getprices?i=[period]&p=[days]d&f=d,o,h,l,c,v&df=cpct&q=[ticker]; Yahoo Finance; free S&P 500 daily pricing data: https://quantquote.com/historical-stock-data
Expected Contribution: Development of two methods of stock price movement prediction and analysis of their advantages and disadvantages. Create an original method of using Python to communicate with Hadoop. Compare the performance of the models on different datasets to achieve high accuracy. Apply the prediction model to future stock price movements. Help companies adjust business strategy for more profit. 70 Stock Price Movement Prediction Using Mahout & Pydoop+Scikit

Algorithms and Tools
Linear regression to predict stock price movement: whether the price of the stock is higher or lower the following day.
Mahout (Stochastic Gradient Descent classifier algorithm): org.apache.mahout.classifier.sgd.TrainLogistic. Given a stock's opening price, high price, low price, and closing price, predict its movement the following day; a similar approach to the supervised learning methods covered in class for project #3.
Pydoop + scikit-learn: Hadoop's provided Python API doesn't support C/C++-wrapped Python libraries (the core of Python's scientific computation toolkits: numpy, scipy, matplotlib, scikit-image, scikit-learn, etc.). Pydoop tackles this by wrapping Hadoop's C++ pipes (Boost.Python) and libhdfs, providing both HDFS access and MapReduce tasks with pure Python code (no Jython). This is better than using the stdin/stdout utilities within Hadoop that are common to all languages, given that one might want to explore the data in HDFS and/or submit large chunks as part of the YARN task. Scikit-learn provides simple tools for data mining and analysis, including a Stochastic Gradient Descent approach to fitting linear models similar to Mahout's: sklearn.linear_model.SGDClassifier. 71
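The SGD logistic-regression idea can be sketched in pure Python. The OHLC rows below are synthetic (not real quotes), the label is "close above open" rather than a true next-day label, and the learning rate and epoch count are arbitrary; a real run would use Mahout's TrainLogistic or scikit-learn's SGDClassifier:

```python
import math
import random

random.seed(0)
# Synthetic OHLC rows; label = 1 if the close is above the open. This is
# only illustrative; real quotes and a next-day label would be used.
data = []
for _ in range(500):
    o = random.uniform(10, 100)
    c = o * random.uniform(0.95, 1.05)
    h, l = max(o, c) * 1.01, min(o, c) * 0.99
    data.append(([1.0, (c - o) / o, (h - l) / o], 1 if c > o else 0))

w = [0.0, 0.0, 0.0]
lr = 0.5
for _ in range(200):                                   # SGD epochs
    for x, label in data:
        p = 1 / (1 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        for i in range(3):
            w[i] += lr * (label - p) * x[i]            # logistic-loss step

acc = sum((sum(wi * xi for wi, xi in zip(w, x)) > 0) == (label == 1)
          for x, label in data) / len(data)
```

The in-sample accuracy here is high only because the synthetic label is a deterministic function of the features; real next-day movement is far noisier, as the 0.58 accuracy on the Progress slide reflects.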

Progress
Using Mahout: Dataset from Yahoo Finance: open price, close price, highest price, lowest price. Split the dataset into two parts, training data and testing data; build the model, generate the prediction function, and test accuracy. Tested with one company's stock data from 4/12/1996 to 1/28/2010, with accuracy at 0.58. Future improvement will focus on adjusting parameters to increase accuracy.
Using Pydoop + scikit-learn: Dataset imported into HDFS and converted to a numpy/scipy array within Python. Submitted some small MapReduce jobs for practice using a framework identical to the Mahout setup (Hadoop 2.5.0 in the Virtual Machine provided by the TAs [Ubuntu 14 LTS]) with the same number of cores as Mahout.
Future Work: Implement in scikit-learn the same algorithm used with Mahout and check for similar results. Deploy this algorithm to Hadoop as a MapReduce job using Pydoop's Mapper and Reducer classes. Compare the results to the Mahout findings. Analyze scalability performance against the Mahout code under changing hardware constraints. 72

E6893 Big Data Analytics: Final Project Proposal Presentation: Sector-based Classification and Clustering of Financial News Articles Michael D'Andrea, CVN student 73 November 9, 2014

Project Summary
Team members: Michael D'Andrea, CVN student
Motivation (the problem to solve): Clustering and classifying sector-based financial articles. Changed project because of issues around confidentiality of datasets and shifting prioritization of needs; also, learning the Mahout algorithms surfaced more use cases.
Current Status of Project: Completed collection of the initial training dataset. Still scoping out technical requirements and learning appropriate applications of machine learning algorithms. 74

Problem and Solution
Background: Importance of sector-based data for investment allocation decisions. "Sector exposure has been the second-most influential factor in the performance of the U.S. equity market during the past 20 years." (Fidelity Investments) "Portfolios built with equity sectors were consistently more efficient (higher return and lower risk) than those created using style box components." (Fidelity Investments)
Problem: Clustering and classifying sector-based financial articles assists with investment decisions on a sector basis.
Solution: Mahout machine learning algorithms. Clustering: Canopy and K-means for clustering articles into sectors and extracting themes. Classification: C-Bayes and Naïve Bayes. 75

Dataset, Tools, and Workplan
Dataset: Keyword-based Bloomberg financial news article datasets; Reuters 20K article set; if time permits, utilize a web crawler for additional article extraction.
Tools: Hadoop, Pig, Mahout, Java or Node.js, AWS or Azure, MongoDB, D3 (optional)
Workplan: Phase 1: Clustering and classifying financial news articles. Phase 2: Storing and visualizing results, if time permits. 76

E6893 Big Data Analytics Information Group Proposals 77 Nov 20, 2014

E6893 Big Data Analytics: TV Analytics Team Members: Shubhanshu Yadav(sy2511) Vaibhav Jagannathan(vj2192) 78 Nov 20, 2014

Motivation It is always a challenge to keep track of which TV shows are going to air new episodes. Users may not be able to follow all the series they are interested in. Even if a user manages to keep track of all of his or her favourite shows, if many of them are released around the same time, he or she may not have the luxury of watching all of them. If the user is presented with a list of upcoming episodes, sorted by release date and a probabilistic rating, then the choice is clear. 79

Dataset, Algorithm, Tools
Languages: Python, JavaScript
Dataset: The dataset will be scraped from the IMDb website.
Tools: Flask (website creation; uses an MVC framework); Scrapy (for scraping the data); AWS (hosting the website, offline processing tasks); Hadoop, Mahout (for the machine learning part); Pig (for database access); D3 visualization 80

Progress and Expected Contributions
Progress: Base website implementation; data scraping process from IMDb; basic linear regression analysis on a small dataset; D3 visualization.
Future Work: Use Hadoop to store data and access it through Pig; use Mahout for ML on more features; host the website on AWS; move future scraping of data to EC2 using SQS. 81

E6893 Big Data Analytics: Project Nova Team Members: Abdus Samad Khan(ak3674), Chithra Srinivasan(cs3315), Mingyun Guan(mg3419), Yuzhong Ji(yj2345) 82 Nov 20, 2014

Motivation Build a system that collects and clusters news from the web (New York Times, CNN, Reuters, etc.) on a daily basis, and provides a user-friendly interface. Articles on the same story from different sources are grouped together and presented once, to avoid duplicated reading. 83

Dataset, Algorithm, Tools
Data Source: News articles from the web, via Apache Nutch
Data Storage: Apache HBase
Algorithm: Cluster news articles using K-Means (configuration: k to be determined, n-gram = 3), using TF-IDF as the vector space model; Mahout library in Java, Eclipse IDE
User Interface: A website listing the news articles on the topics users are interested in 84
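The TF-IDF vectorization step can be sketched as follows; in the project, vectors like these would be fed to Mahout's K-Means for clustering. The toy headlines are invented:

```python
import math
from collections import Counter

# Tiny TF-IDF sketch over toy headlines; the project would build these
# vectors with Mahout and cluster them with K-Means.
docs = [
    "fed raises interest rates markets react",
    "new phone released by tech giant",
    "tech giant earnings beat expectations",
]

def tfidf(doc_idx):
    words = docs[doc_idx].split()
    tf = Counter(words)
    out = {}
    for term, freq in tf.items():
        df = sum(term in d.split() for d in docs)          # document frequency
        out[term] = (freq / len(words)) * math.log(len(docs) / df)
    return out

vec = tfidf(1)
```

Terms that appear in every document get an IDF of zero, which is exactly why TF-IDF vectors cluster better than raw counts: shared filler words stop dominating the distance.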

Progress and Expected Contributions
Progress: Successfully implemented the K-Means algorithm in Java to cluster dummy articles; implemented web crawling using Nutch.
Contributions: Develop a unified, unbiased, real-time news application using an entirely Apache open-source stack, providing users with a one-stop news solution. 85

E6893 Big Data Analytics: Game Outcome Analysis Team Members: Raymond Barker (rjb2150) 86 Nov 20, 2014 E6893 Big Data Analytics Lecture 4: Big Data Analytics Algorithms

Motivation Motivation: to answer the following question: given the current game state, who will win? Specifically, I'll be looking at games where the game state can be thought of as two opposing teams, each a multiset of well-defined members. For example: chess (each team is composed of N pawns, M rooks, etc.); deck-building games (each team is composed of cards from a predefined list); MMORPGs (each team is composed of players using predefined classes). 87

Dataset, Algorithm, Tools
Dataset: There is a standardized format for storing chess games known as PGN (Portable Game Notation); various groups make datasets of notable games available, such as at chessok.com/?page_id=694. Magic tournament results are available at mtganalytics.net, though the data is not in a consistent format. EVE makes its player battle data available via an API, which sites such as eve-kill.net aggregate.
Algorithm: Convert the game state into features, and then classify games into win / loss / stalemate buckets.
Tools: Primarily Mahout, along with pre-processing code. 88
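The "convert the game state into features" step might look like this for chess; the encoding is one possible choice (not the author's), and the piece values are the classic material values:

```python
# Hypothetical encoding of a game state (two piece multisets) as a feature
# vector; the piece values are the classic chess material values.
PIECES = ["P", "N", "B", "R", "Q"]
VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}

def featurize(white, black):
    # One count feature per piece type per side, plus a material-difference
    # feature that a classifier can pick up directly.
    f = [white.get(p, 0) for p in PIECES] + [black.get(p, 0) for p in PIECES]
    diff = sum(VALUES[p] * (white.get(p, 0) - black.get(p, 0)) for p in PIECES)
    return f + [diff]

x = featurize({"P": 6, "R": 2, "Q": 1}, {"P": 6, "R": 1, "B": 1})
```

The same multiset-to-vector idea carries over to deck-building games (one count per card) and MMORPGs (one count per class), which is what makes the framing generic.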

Progress and Expected Contributions
Progress: I haven't started yet. I'll be working with the chess data first, for two reasons: there is a huge amount of chess data available, and it's in a relatively easy-to-use format; and there is a huge amount of existing literature about and analysis of chess (for example, published charts of chess piece values). [Chart of chess piece values omitted.] Once I am able to get some good results for chess, I can move on to working with datasets for other games (MTG, EVE, etc.) as time allows.
Expected Contributions: I'll be doing all of the coding, since this is a one-person group. 89

E6893 Big Data Analytics: Exploring the Online and Offline Social World Large-Scale Event-Based Social Network Analysis of Meetup.com Team Members: Mengge Li (ml3695, EE), Yiwei Zhang (yz2698, EE), Rahul Gaur (rg2930, CS), Rongyao Huang (rh2648, STAT) 90 Nov 20, 2014

Motivation
Unique value of event-based social networks (EBSNs): both online and offline social interactions; analyze and compare the properties and dynamics of the two networks; commercial value: industrial trends, recommendation of services/products based on user preference.
Big fans of Meetup.com: popularity across academia, industry, and recreation; excellent API: user, group, event, tags, location & time.
Great opportunity to apply big data techniques and tools: graph database (Neo4j with Cypher); clustering and recommendation; large-scale social network analysis. 91

Dataset, Algorithm, Tools
Meetup Dataset: # Users: 4,448,454; # Groups: 42,052; # Events: 1,595,833; # Tags: 77,810; # User-Group Pairs: 8,863,235; # User-Event Pairs: 13,553,134; # User-Tag Pairs: 15,057,535; # Group-Tag Pairs: 144,793; # Users with Locations: 3,741,699; # Events with Locations: 983,333
Analytics/Modelling: Properties of social interactions: degree, centrality, separation, clustering coefficient, density, etc.; temporal & spatial patterns of specific groups/events; clustering: Fiedler method, linear combination, generalized SVD; recommendation: user-based, item-based
Tools: Neo4j, Java/Python 92
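Two of the listed interaction properties, degree and clustering coefficient, can be computed directly from an edge list. The toy graph below stands in for the member network (an edge could mean two users share a group); the real analysis would run in Neo4j with Cypher:

```python
from itertools import combinations

# Toy undirected graph standing in for the Meetup member network; the
# project computes these metrics at scale in Neo4j/Cypher.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def clustering(node):
    # Fraction of a node's neighbour pairs that are themselves connected.
    nbrs = adj[node]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for x, y in combinations(nbrs, 2) if y in adj[x])
    return links / (len(nbrs) * (len(nbrs) - 1) / 2)

degrees = {n: len(adj[n]) for n in adj}
```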

Progress and Expected Contributions Project Timeline - project planning, literature review 11/06-11/13 - pull data from Meetup Server and convert to proper format (Java) 11/14-11/20 - inject data into Neo4j 11/21-11/27 - Social network analysis using Cypher 11/21-11/27 - recommendation using Cypher 11/28-12/04 - use Neo4j with Java/Python for clustering 11/28 12/04 - write up report 12/05-12/11 Expected Contributions Up-to-date Meetup network construction/evaluation Recommendation of groups, events, products Industrial Trends of product/concepts of concern (dashboard) 93

E6893 Big Data Analytics: Yelp Review Analysis and Recommendation Team Members: Enrui Liao, Yuqing Guan, Ying Tan, Mingqing Wu 94 Nov 20th, 2014 E6893 Big Data Analytics Yelp Review Analysis and Recommendation 2014 Columbia University

Motivation 1. Besides ratings, users' reviews are a rich treasure of feedback, but the traditional methods we learnt in class simply discard review text, leaving many latent factors difficult to interpret. 2. We aim to fuse latent rating dimensions with latent review topics; this deeper use of feedback yields better recommendations (Yelp Dataset Challenge). 3. Reading large quantities of reviews is a difficult and time-consuming task, so a visualization that summarizes the user-generated reviews is needed for perusing them. 95

Dataset, Algorithm, Tools Dataset: Yelp Academic Dataset http://www.yelp.com/dataset_challenge Algorithms: 1. Syntactic Analysis 2. Clustering: Latent Dirichlet Allocation 3. Classification: Naïve Bayes, KNN, etc. 4. Recommendation: Collaborative Filtering Tools: Java, Hadoop, Mahout, JGibbLDA (http://jgibblda.sourceforge.net), Stanford Parser (http://nlp.stanford.edu/software/lex-parser.shtml) 96
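As a sketch of the classification step, here is a tiny multinomial Naive Bayes over review words; the training reviews and labels are invented, and the project would run this at scale with Mahout:

```python
import math
from collections import Counter

# Tiny multinomial Naive Bayes over review words; reviews and labels are
# invented for illustration.
train = [("great food friendly staff", "pos"),
         ("terrible service cold food", "neg"),
         ("friendly staff great value", "pos"),
         ("cold terrible experience", "neg")]

counts = {"pos": Counter(), "neg": Counter()}
for text, label in train:
    counts[label].update(text.split())
vocab = {w for c in counts.values() for w in c}

def classify(text):
    def score(label):
        total = sum(counts[label].values())
        # Log-space with add-one (Laplace) smoothing.
        return sum(math.log((counts[label][w] + 1) / (total + len(vocab)))
                   for w in text.split())
    return max(counts, key=score)
```

The Laplace smoothing matters: without it, a single unseen word would zero out a class's probability entirely.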

Progress and Expected Contributions
Progress: 1. Discussed different clustering/classification algorithms to choose which to use on the data 2. Analyzed and extracted valuable attributes from the raw data 3. Prepared and configured a web server for the project
Expected Contributions: 1. Use syntactic analysis to retrieve information from raw review text 2. Cluster the reviews with the LDA algorithm and compare the results with classification algorithms 3. Improve the recommendation of businesses in the Yelp data using the clustering and classification results 4. Analyze the correlation between ratings and reviews, and give advice to businesses about how to get a high rating 5. Build a website to visualize the clustering/classification and recommendation 6. (optional) Connect our work to a social network, like Facebook or Twitter, to perform online recommendations 97

E6893 Big Data Analytics: Project Name STOCK FORECASTING USING HADOOP MAP-REDUCE Team Members: Yi Yu, Yu Xia, Xiangliang Yang, Yumeng Xu 98 Nov 20, 2014

Motivation The stock market has high-profit, high-risk features, so stock market analysis and prediction research has long received attention. The stock price trend is a complex nonlinear function, so the price has a certain predictability. Hadoop MapReduce is a framework specially designed for processing large datasets on distributed sources; Apache's Hadoop is an implementation of MapReduce. 99

Dataset, Algorithm, Tools Algorithms: Pearson Correlation Similarity, Euclidean Distance Similarity, Stochastic Gradient Descent (SGD) Tools: Hadoop, Mahout, HBase 100
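The two similarity measures named above can be sketched in a few lines; the formulas follow the usual Mahout-style definitions over co-rated items, and the user/ticker ratings are invented:

```python
import math

def pearson(a, b):
    # Pearson correlation over co-rated items (as in Mahout's
    # PearsonCorrelationSimilarity).
    common = sorted(set(a) & set(b))
    n = len(common)
    if n < 2:
        return 0.0
    xs, ys = [a[i] for i in common], [b[i] for i in common]
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs)
                    * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0

def euclidean_sim(a, b):
    # Mahout-style Euclidean similarity: 1 / (1 + distance) over co-rated items.
    common = set(a) & set(b)
    dist = math.sqrt(sum((a[i] - b[i]) ** 2 for i in common))
    return 1.0 / (1.0 + dist)

# Invented user rating vectors keyed by ticker.
u = {"AAPL": 4, "MSFT": 2, "IBM": 5}
v = {"AAPL": 5, "MSFT": 1, "IBM": 4}
```

Pearson captures agreement in shape (users who rank stocks the same way score high even with different scales), while the Euclidean form rewards ratings that are numerically close.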

Progress and Expected Contributions
Expected Contributions: We are going to analyze the dataset called "Daily Holdings for All ProShares ETFs," which contains a large amount of information collected from the stock exchange market. The first step is to scrutinize the data and identify stocks that may potentially go up. With these screened stocks, suggest to a given user a potential stock that he/she may be interested in.
Progress: We have already obtained the dataset and analyzed some similarities which could be useful in further steps. 101

E6893 Big Data Analytics: Sentiment Analysis on Movie Reviews Yunao Liu, Hao Tong, Di Liu 102 Nov 20, 2014

Motivation We love movies, and movie reviews help us find good ones. In this project we will use large-scale machine learning and natural language processing technologies to experiment with a sentiment analysis task on movie reviews. Rather than simply deciding whether a review is thumbs up or thumbs down, we want to label each review on a scale of five values: one to five stars. 103

Dataset, Algorithm, Tools
Dataset: Source: Rotten Tomatoes; format: reviews with scores ranging from 0 to 4; size: 8,500 reviews
Algorithm: Multi-class SVM; logistic regression; optional: the algorithm proposed by Pang and Lee (2005)
Tools: Hadoop, Mahout, Stanford Parser 104

Progress and Expected Contributions
Progress: We've done background research and looked into datasets.
Expected Contributions: Feature extraction: Yunao Liu, Hao Tong; experiments and evaluation: Hao Tong, Di Liu; analytics and discussions: Yunao Liu, Di Liu; paper work: Yunao Liu, Hao Tong, Di Liu 105

E6893 Big Data Analytics: Project Name Documents Analysis with Latent Dirichlet Allocation Team Members: Hao Fu, Xiuwen Sun 106 Nov 20, 2014

Motivation 1. Extend the document analysis algorithm to big datasets using Hadoop. 2. Analyze documents by answering these questions: 2.1 cluster documents using generative models; 2.2 gain knowledge about the clustering results; 2.3 find the hidden attributes of different documents. 107

Dataset, Algorithm, Tools 1. Dataset: Wikipedia database; 50-topic browser of latent Dirichlet allocation fit to the 2006 arXiv. 2. Algorithm: Latent Dirichlet Allocation (Collapsed Gibbs Sampling). 3. Tools: Mahout, Hadoop 108
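Collapsed Gibbs sampling for LDA can be sketched compactly. This toy runs on four tiny documents with K = 2; the corpus and hyperparameters are illustrative, and a real run would operate on the full dataset via Mahout on Hadoop:

```python
import random
from collections import defaultdict

random.seed(0)
# Four toy documents, two obvious themes; a real run uses a large corpus.
docs = [["stock", "market", "trade"], ["gene", "cell", "protein"],
        ["market", "trade", "price"], ["cell", "protein", "dna"]]
K, alpha, beta = 2, 0.1, 0.01
V = len({w for d in docs for w in d})

# Count tables for collapsed Gibbs sampling.
z = [[random.randrange(K) for _ in d] for d in docs]   # topic assignments
ndk = [[0] * K for _ in docs]                          # doc-topic counts
nkw = [defaultdict(int) for _ in range(K)]             # topic-word counts
nk = [0] * K                                           # topic totals
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]; ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

for _ in range(200):                                   # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
            # Full conditional p(z_i = j | rest), up to a constant.
            ps = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                  for j in range(K)]
            k = random.choices(range(K), weights=ps)[0]
            z[d][i] = k
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
```

After sampling, the per-document topic distributions come from `ndk` (plus alpha) and the per-topic word distributions from `nkw` (plus beta), which is exactly what the "gain knowledge about the clustering results" step inspects.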

Progress and Expected Contributions 1. Build the LDA model for document analysis on Hadoop. 2. Cluster documents based on the LDA analysis results, e.g. the distribution over topics. 3. Give a description of each topic based on its distribution over words. 109

E6893 Big Data Analytics: Yelp Recommendation Analysis Team Members: Siddharth Sunil Boobna (ssb2171), Yash Parikh (yp2348), Prateek Sinha (ps2791) 110 Nov 20, 2014

Motivation The data provided by Yelp can sometimes be overwhelming for users trying to make a choice: they are presented with a myriad of options even for a specific set of businesses. We aim to make this easier for users by taking into account their preferences and those of similar users, a weighted score for reviews based on votes and recency, and various attributes of the business. 111

Dataset, Algorithm, Tools
We are using the Yelp academic dataset. The dataset consists of details about businesses (i.e. geo-location, categories, open hours, and other attributes), reviews (by users of various businesses), users (i.e. name, friends), check-ins (# of check-ins at different hours), and tips (by users for various businesses).
We will implement a collaborative filtering algorithm using a weighted bipartite graph to find similar users and businesses. The weights will depend on the timeline of reviews (the newer a review is, the more accurate it will be), the # of useful votes for the reviews, and check-ins by users at the businesses. We will run a sentiment analysis on the tips using a basic Naïve Bayes algorithm; these sentiments will help us further enhance the recommendation system.
Tools: Apache Mahout, Apache Spark, AWS, NLTK 112
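One way the review weighting described above might look; the half-life, the vote transform, and the sample reviews are assumptions for illustration, not the team's actual scheme:

```python
import math

# Illustrative review weighting: newer reviews and more "useful" votes count
# more; the half-life and vote transform are assumed values, not Yelp's.
def review_weight(age_days, useful_votes, half_life=365.0):
    recency = 0.5 ** (age_days / half_life)    # exponential time decay
    return recency * (1 + math.log1p(useful_votes))

def weighted_rating(reviews):
    # Weighted average of star ratings under the scheme above.
    num = sum(review_weight(a, v) * stars for stars, a, v in reviews)
    den = sum(review_weight(a, v) for _, a, v in reviews)
    return num / den

# (stars, age_days, useful_votes): a fresh, well-voted 5 outweighs a stale 2.
reviews = [(5, 30, 10), (2, 1000, 0), (4, 90, 3)]
```

Weights of this form would then become the edge weights on the user-business bipartite graph before the similarity computation.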

Progress and Expected Contributions We have collected and studied the dataset. One thing we noticed was that the data is sparse: there are very few pairs of users rating the same business, so the recommendations might not be very accurate. To overcome this, we may have to cluster the users and the businesses into a more compressed graph. Contributions: Siddharth: trying out a clustering algorithm (maybe K-means) on the users and the businesses separately. Yash: running the sentiment analysis on the tips to further enhance our recommendation system. Prateek: implementing the basic bipartite graph between users and businesses and, later, including weights based on various features. 113

E6893 Big Data Analytics: Using Big Data (Hadoop) for Identification of Aberrant Behavior Clusters in Server Performance Time Series Data Team Members: Tad Kellogg CVN 114 Nov 20, 2014

Motivation Large enterprise data/cloud centers employ vast numbers of distributed compute servers for various organizational information technology applications, e.g. accounting, human resources, high-performance computation, and risk management systems. To benefit the enterprise, most servers are interconnected either by design or causally, due to shared resources, e.g. network infrastructure, storage, or a shared database server. Identification of aberrant behavior on individual servers is currently a critical operation for enterprise data centers, and scaling aberrant-behavior detection to large numbers of servers (>1000) is an open problem. This project adds value to aberrant-behavior detection by offering a scalable detection algorithm and by identifying clusters of servers based on their aberrant-behavior characteristics. Identification of previously unknown server clusters could be vital to successfully resolving application outages or server performance problems. 115

Dataset, Algorithm, Tools Dataset: A collection of NMON (http://nmon.sourceforge.net/pmwiki.php) performance log files for 100 Linux servers with a history of 90 days. Metric collection interval will be 5 minutes. Specifically, server performance, e.g. cpu utilization, memory utilization, disk and network input/output rates and throughput time series data from servers will be used for analysis. Algorithms: Transform NMON data from log to tabular format: PIG dataflow script Time series aggregation for percentiles and standard deviation: HIVE scripts Map/Reduce implementation of Holt-Winters time series forecasting algorithm Mahout K-Means clustering Tools: HortonWorks HDP 2.1 (http://hortonworks.com) Mahout 1.0-SNAPSHOT 116
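The Holt-Winters step can be sketched as additive triple exponential smoothing with a residual check; the CPU series, season length, and smoothing constants below are toy values, and the project implements this as a Hadoop Map/Reduce job:

```python
# Minimal additive Holt-Winters smoother with a residual-based anomaly flag;
# the series, season length m, and smoothing constants are toy values.
def holt_winters(series, m, alpha=0.5, beta=0.1, gamma=0.3):
    # Initialise level, trend, and seasonals from the first season(s).
    level = sum(series[:m]) / m
    trend = (sum(series[m:2 * m]) - sum(series[:m])) / (m * m)
    season = [series[i] - level for i in range(m)]
    fitted = []
    for t, y in enumerate(series):
        s = season[t % m]
        fitted.append(level + trend + s)       # one-step-ahead forecast
        last_level = level
        level = alpha * (y - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        season[t % m] = gamma * (y - level) + (1 - gamma) * s
    return fitted

cpu = [10, 20, 10, 20, 10, 20, 10, 90]         # last point: injected anomaly
fit = holt_winters(cpu, m=2)
residuals = [abs(y - f) for y, f in zip(cpu, fit)]
```

Points whose residual exceeds a band (e.g. a multiple of the residual standard deviation) are flagged as aberrant, and per-server residual profiles are what the K-Means stage would then cluster.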

Progress and Expected Contributions Progress: Dataset collection complete PIG and HIVE scripting complete Holt-Winters map/reduce in progress Mahout K-Means clustering in progress Expected Contributions: PIG scripts for consuming NMON data files into Hadoop cluster HIVE scripts for NMON data aggregation Hadoop Map/Reduce based implementation of Holt-Winters algorithm Demonstration of K-Means clustering on server performance time series data 117

E6893 Big Data Analytics: Crocuta: JavaScript Analytics System Team Members: Rusty Nelson 118 Nov 20, 2014

Motivation Dependency Management Move the Compute to the Data Generic Interface Across More Platforms Dynamic (Open Compute) Scaling Many Small Unreliable Nodes with Limited IO NodeJS Servers Opted in Web Browsers 119

Progress P.O.C. (Proofs of Concept): Communication between nodes, browsers, and servers; dependency management; async read of large datasets 120

Expected Contributions 121

E6893 Big Data Analytics Life Science Group Proposals 122 Nov 20, 2014

E6893 Big Data Analytics: Project Name: Learning Brain Activity from fMRI Images Team Members: Zhao Hu Nov 20, 2014 123

Motivation
Background knowledge: fMRI (functional magnetic resonance imaging) is an imaging procedure using MRI technology that measures brain activity by detecting associated changes in blood flow.
Goal: Use machine learning methods to classify the cognitive state of a human subject based on fMRI data over a single time interval.
What are cognitive states? For example: whether the human subject is looking at a picture or a sentence; whether the subject is viewing a word describing a food or a book. 124

Dataset, Algorithm, Tools
Data: Image data collected every 500 msec; extremely high-dimensional (more than 100,000 features); extremely sparse, noisy data
Algorithm: Gaussian Naive Bayes; linear Support Vector Machine; some network algorithms (Bayesian networks)
Tools: Hadoop, Mahout, Python, and Amazon EMR 125
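Gaussian Naive Bayes, the first listed algorithm, reduces to per-class feature means and variances. The two-feature toy data below is invented to stand in for the (much higher-dimensional) fMRI voxel features:

```python
import math

# Toy stand-in: rows = two voxel activations, labels = the cognitive state.
X = [[1.0, 0.1], [1.2, 0.0], [0.9, 0.2],   # "picture" trials
     [0.1, 1.1], [0.0, 0.9], [0.2, 1.0]]   # "sentence" trials
y = ["picture"] * 3 + ["sentence"] * 3

def fit(X, y):
    # Per-class feature means/variances plus a class prior.
    model = {}
    for c in set(y):
        rows = [x for x, lab in zip(X, y) if lab == c]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [sum((v - mu) ** 2 for v in col) / len(rows) + 1e-6
                     for col, mu in zip(zip(*rows), means)]
        model[c] = (means, variances, len(rows) / len(X))
    return model

def predict(model, x):
    # Pick the class with the highest Gaussian log-likelihood plus log-prior.
    def loglik(c):
        means, variances, prior = model[c]
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - mu) ** 2 / (2 * v)
            for xi, mu, v in zip(x, means, variances))
    return max(model, key=loglik)

nb = fit(X, y)
```

Because each feature is treated independently, this scales linearly in the feature count, which is what makes it a reasonable first choice for 100,000+ voxel features after dimensionality reduction.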

Progress and Expected Contributions
Progress: Dimensionality reduction for fMRI data (done); feature selection (in process); training and evaluation (future)
Expected Contributions: Help people gain a clearer understanding of how the brain responds to stimuli; improve brain diagnosis by helping to find abnormal brain activity; etc. 126

E6893 Big Data Analytics: Network Analysis on the Big Cancer Genome Data Team Members: Tai-Hsien Ou Yang and Kaiyi Zhu 127 Nov 20, 2014

Motivation
Hallmarks of Cancer: Cancers are deadly and hard to cure; there are common attributes of all types of cancer that are associated with patient outcomes; the targetable interactions are too complicated for treatment.
A genomic-clinical regulation network based on big data analytics: nobody has really done this before, and the tools have not been built on Hadoop.
A friendly framework for cancer diagnosis, treatment suggestion, and prognosis prediction. 128 D. Hanahan and R.A. Weinberg, Cell, 2011

Dataset, Algorithm, Tools 129

E6893 Big Data Analytics: Reversal Prediction from Physiology Data Team Members: Hongzhuo Zhang, Ruoyu Wang Nov 20, 2014 130

Motivation [Diagram: complicated physiological data, via a monitor and reversal prediction, to human-understandable activity, with many potential applications] 131

Dataset, Algorithm, Tools Dataset: Physiological dataset for ICML 2004 Algorithm: (1) Naïve Bayes; (2) SVM-based Markov model; (3) conditional random fields; (4) ultra-fast forest of trees Tools: Java and Hadoop 132

Progress and Expected Contributions Progress: Running existing algorithms on the dataset; in the next two weeks we will try to combine multiple algorithms. Expected Contributions: Improve existing algorithms to (1) utilize the unlabeled data rather than ignoring it, and (2) achieve higher prediction accuracy. 133

E6893 Big Data Analytics: EEGoVid: An EEG-Based Interest Level Video Recommendation Engine Team Members: Mohamed El-Refaey, Vincent Rubino, Jingwen Xie, Shi Zong 134 Nov 20, 2014

Motivation In collaboration with Neuromatters, there is a need to do analytics on interest-level datasets captured from EEG signals of people watching movies. There is also the potential for a generic EEG-based recommendation engine that doesn't require user interaction (i.e. no explicit rating). 135

Dataset, Algorithm, Tools
Dataset: Interest-level dataset taken from Neuromatters; overall level of interest for each commercial (not time-resolved); a time-resolved interest-level dataset and Super Bowl videos list; ImageNet.
Tentative list of algorithms: Algorithms for data pre-processing; event-related field/response analysis; classification and collaborative filtering.
Tools: Apache Mahout, Apache Hadoop, Java, HBase, Hive, D3 and JavaScript. 136

Progress and Expected Contributions
Done with the system design and architecture. Acquired the dataset from Neuromatters. Done with the initial feature extraction from ImageNet. Started to process the interest-level dataset and map it onto video frames.

Progress and Expected Contributions

E6893 Big Data Analytics: Brain Edge Detection. Team Member: Terrence Adams, tma2131@columbia.edu. Nov 20, 2014

Motivation
General goal: Improve understanding of brain structure.
Background: The last several years have led to amazing progress in understanding the brain's physical structure and its function.
Focused goal: Improve detection of synapses.
How: Develop computer vision algorithms for volumetric (voxel) processing of image slices taken from high-throughput electron microscopy of the brain.

Dataset, Algorithm, Tools
Data for this project will consist mostly of high-throughput electron microscopy image slices downloaded from http://www.openconnectomeproject.org/. Neural circuits are imaged at nanometer scales, which leads to terabytes of data; a mouse brain is estimated at 2M x 2M x 100k pixels. There exists limited hand-marked data of synapses; the markings are high-precision, but there is less guarantee of recall.
I plan to develop a tailored computer vision algorithm for detecting synapses. I may use Caffe to train a deep belief network, but this may not be feasible without more training data.
Tools: Hadoop, Spark, ScaleGraph, InfoSphere Streams, MATLAB, OpenCV, Caffe. This list is still tentative.

Progress and Expected Contributions
I do not have prior experience in this area. I was able to meet Will Gray Roncal, a researcher in applied neuroscience at the Applied Physics Laboratory (Johns Hopkins University). Mr. Roncal explained the basics of the Open Connectome Project and how to access data, and showed me how to download the OCP API (MATLAB); he was extremely helpful. I also spoke with Jacob Vogelstein and Mark Chevillet, who were very helpful as well.
I expect to develop a new synapse detector that achieves state-of-the-art performance on EM brain images.

E6893 Big Data Analytics: BrainBow: Reconstruction of Neurons. Suraj Keshri, Min-hwan Oh, Gaurav Gupta. Nov 20, 2014

Motivation: Neurons are tree-like structures densely packed in a volume, so that an average neuron makes on the order of 10^4 synapses. This astronomical number suggests that neuronal processes are densely intertwined, and even a small piece of neuronal tissue contains many meters of dendrites and axons from many neurons. Isolating individual neurons automatically in a given volume is an unsolved problem. The Brainbow method uses genetic engineering to express random fluorescent colors in individual neurons. Here, we build an optimization framework for reconstructing individual neurons in Brainbow tissues.

Dataset, Algorithm, Tools
Data: We use generic data; each image of neurons has size V (voxels) ~ 10^6 and N (# of neurons) ~ 10.
We define an object as a pair (x, c^T) such that x ∈ [0, 1]^(V x 1) (i.e., V voxels) and c is a non-negative column vector whose size is the same for all objects; x is referred to as a neuron. The N objects are observed as Y = XC + e, where Y is the V x 3 matrix of observations, C is the N x 3 matrix of RGB colors, and e is a noise term. The problem is to recover both X and C.
Algorithms: Beta process for dictionary learning; LARS for single-neuron reconstruction.
Tools: Java, MATLAB, Hadoop
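Written out, the observation model and recovery problem above can be stated as follows. The Frobenius-norm objective with an l1 sparsity penalty is an assumption about how the dictionary-learning/LARS step is posed, not something stated on the slide:

```latex
Y = XC + e, \qquad
Y \in \mathbb{R}^{V \times 3},\quad
X \in [0,1]^{V \times N},\quad
C \in \mathbb{R}_{\ge 0}^{N \times 3}

\min_{X,\,C}\; \lVert Y - XC \rVert_F^2
\;+\; \lambda \sum_{n=1}^{N} \lVert x_n \rVert_1
```

Here each column x_n of X is one neuron's voxel support, and the penalty encourages each neuron to occupy a sparse set of voxels.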

Progress and Expected Contributions
Implementation on 2D data: Figure (2) is the true neuron image from which Figure (1) has been generated with noise; Figure (3) shows a result from our reconstruction algorithm on Figure (2).
Our goal is to implement our algorithm on a bigger data set using Hadoop and other tools. This algorithm can be used for general dictionary learning and representation problems.

E6893 Big Data Analytics: Retail Group Proposals. Nov 20, 2014

E6893 Big Data Analytics: Yelp Fake Review Detection. Team Members: Dhruv Kuchhal, Duo Chen, Mo Zhou, Chen Wen. Nov 20, 2014

Motivation
Negative effects caused by opinion spamming: unfair competition, deceitful information for users, detriment to Yelp's credibility.
Yelp Dataset Challenge: ready-to-use dataset, $5,000 prize.

Dataset, Algorithm, Tools
Dataset: the Challenge dataset, including rating data of business attributes, check-in sets, users, the edge social graph, tips, and reviews from Phoenix, Las Vegas, Madison, Waterloo, and Edinburgh.
Algorithm: the GSRank algorithm (ref. "Spotting Fake Reviewer Groups in Consumer Reviews")
Tools: Hadoop, Mahout, Java, Python

Progress and Expected Contributions
Progress: dataset acquisition, literature review.
Expected contributions: Dhruv Kuchhal: documenting report and processing dataset; Duo Chen: parsing and processing dataset and giving the presentation; Mo Zhou: studying and implementing the algorithm; Chen Wen: implementing the algorithm and documenting the final report.

E6893 Big Data Analytics: Predicting Excitement at DonorsChoose.org. Team Members: Ran Ran (rr2950), Yi Jiang (yj2306), Lina Jin (lj2351), Yuezhi Wang (yw2586). Nov 20, 2014

Motivation
How DonorsChoose.org works: DonorsChoose.org is an online charity that makes it easy for anyone to help students in need through school donations. Public school teachers from every corner of America post classroom project requests on DonorsChoose.org, and people, companies, and foundations can help fund requests. Once funded, DonorsChoose.org sends the resources directly to the classroom.
Goal: The goal of this project is to help DonorsChoose.org identify projects that are exceptionally exciting to the business at the time of posting. While all projects on the site fulfill some kind of need, certain projects have a quality above and beyond what is typical. By identifying and recommending such projects early, they will improve funding outcomes, better the user experience, and help more students receive the materials they need to learn.

Dataset, Algorithm, Tools
Dataset: five relational datasets with project information: donations, outcomes, resources, essays, and projects. Any project posted prior to 2014-01-01 is in the training set (along with its funding outcomes); any project posted after is in the test set. Some projects in the test set may still be live and are ignored in the scoring; which projects are still live is not disclosed, to avoid leakage regarding funding status. (From Kaggle)
Algorithms: text clustering (k-means, Canopy); classification (Naive Bayes, SGD)
Tools: HBase and Pig as the database; Mahout for the clustering and classification; R for visualization
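The clustering step (k-means, which Mahout implements at scale) reduces to the iteration below. This is a minimal in-memory sketch of Lloyd's algorithm on 2-D points with fixed initial centroids; the toy points stand in for vectorized essays:

```python
def kmeans(points, centroids, iters=10):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# two well-separated toy groups
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
```

In the real pipeline the essays would first be turned into TF-IDF vectors, and Canopy clustering would supply the initial centroids instead of the fixed ones used here.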

Progress and Expected Contributions
Progress: 1. Find relevant features by computing statistics over the provided dataset, e.g., how many exciting projects one school had in the past. 2. Cluster essays.csv into a few categories that can be used as a new feature. 3. Train a model with the features generated above. 4. Test the correctness of this model.
Expected contribution: help DonorsChoose.org identify projects that are exceptionally exciting to the business, so that they improve funding outcomes, better the user experience, and help more students receive the materials they need to learn.

E6893 Big Data Analytics: Market Strategy Suggestions for B2C Websites. Team Members: Xuebo Wang, Zixuan Gong, Siyuan Zhang, Wenxin Wang. Nov 20, 2014

Motivation
Beyond traditional marketing tools, the big data era offers data of large scale covering all aspects, convincing and concrete references, and user-customized results.

Dataset, Algorithm, Tools
Dataset: customer records (e.g., name, average expense, age, area, family members).
Customized clustering groups customers into clusters by attributes such as age (e.g., 37-59), average expense (e.g., $6,000-$11,000), and area (e.g., New York, Boston).

Progress and Expected Contributions
Basic prerequisites: raw dataset found; framework constructed.
Basic contributions: construct the whole function flow; customer-friendly interface.
Future contributions: result visualization; adjustable cluster parameters.

E6893 Big Data Analytics: Résumé Category Classification. Team Members: Kaicheng Feng, Wentao Jiang, Hongliang Xu, Kaiwei Zhang. Nov 20, 2014

Motivation
Candidate/employer matching helps people find jobs and employers find the right candidates. Lynxsy is a company that matches job seekers with startup companies. It is critical to match the right profiles to the relevant positions, but relying on HR to personally review thousands of resumes is both inefficient and unreliable. We propose a tool to streamline and automate the process: analyze, filter, classify, and evaluate resumes automatically and efficiently.

Dataset, Algorithm, Tools
Dataset: Lynxsy shared about 3,000 resumes with our team in PDF format, together with candidate survey information. We will manually label them using the instructions they specified.
Algorithm: extract text from the PDF resumes; pre-filter, pre-process, and format the raw dataset; transform the dataset into vectors; apply classification and clustering algorithms; evaluate with cross-validation.
Tools: Hadoop & Mahout; Java/Ruby

Progress and Expected Contributions
Progress: We just received the dataset from Lynxsy, started coding the PDF extraction and pre-filtering program, and have a plan for the overall project implementation.
Expected: a tool that takes a PDF resume as input and outputs the predicted resume category and a prediction confidence score.

E6893 Big Data Analytics: TV Genome Project / Recommendation Engine. Analytics Media Group. Team Members: Ishaan Sayal, Preeti Vaidya, Joshua Edgerton, Samuel Sharpe. Nov 20, 2014

Motivation
TV Genome Project / Recommendation Engine for AMG: as a media company, AMG constantly analyzes people's viewing habits and television interests. A user inputs a few shows they currently watch, and the tool predicts a few shows they might also like.
Motive: a personalized TV recommendation system focused on viewers' historical viewing records and demographic data; match the most promising targets with their actual viewing habits to find the best shows to recommend to them.
Future prospects: products designed to gauge the potential viewership/interest/success of a new show based on the characteristics of past shows and their relative success.

Dataset, Algorithm, Tools
AMG data: TV set-top boxes and online ad networks. Fields in dataset: user_id, series_id, time_viewed_of_series, series_tuning_instances, series_days_tuned, first_date_key_active, last_date_key_active, total_hours_viewed, total_tuning_instances, total_days_tv_watched.
Algorithms used: item-based recommendation; possibly clustering; collaborative filtering with user and movie attributes.
Tools: SQL (data querying and preprocessing); Java/Eclipse and Mahout (recommendation algorithm); PHP/Yii Framework (UI design).
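Item-based recommendation, the first algorithm listed (and one Mahout provides), boils down to scoring unseen shows by their similarity to shows the user already watches. A minimal cosine-similarity sketch over a toy user-by-show hours matrix; the show names and numbers are illustrative, not AMG data:

```python
import math

# rows: users, columns: shows; values: hours viewed (toy data)
shows = ["A", "B", "C"]
hours = [
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [0.0, 1.0, 5.0],
]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# item-item similarity computed from the show (column) vectors
cols = list(zip(*hours))
sim = [[cosine(cols[i], cols[j]) for j in range(len(shows))]
       for i in range(len(shows))]

def recommend(user_row):
    """Score each unwatched show by similarity-weighted hours of watched shows."""
    scores = {}
    for i, s in enumerate(shows):
        if user_row[i] == 0.0:
            scores[s] = sum(sim[i][j] * user_row[j]
                            for j in range(len(shows)) if user_row[j] > 0)
    return max(scores, key=scores.get) if scores else None
```

At AMG scale, the same item-item similarities would be precomputed over the set-top-box fields above (e.g., total_hours_viewed per series) on Hadoop rather than in memory.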

Progress and Expected Contributions
Task 1: Evaluation of data (collecting useful information, pseudo-rating experimentation): Joshua, Samuel. Done.
Task 2: Running recommendation algorithms on collected data: Preeti, Ishaan. On time (11-21-2014).
Task 3: Comparing algorithms, rating metrics, and their performance: Samuel, Joshua. Scheduled (11-28-2014).
Task 4: Expanding models to include user and show attributes: Preeti, Samuel, Ishaan. Scheduled (11-28-2014).
Task 5: UI design: Preeti, Samuel. Scheduled (12-07-2014).
Task 6: Database implementation: Joshua, Ishaan. Scheduled (12-07-2014).
Task 7: Report and delivery: entire team. Scheduled (12-13-2014).

E6893 Big Data Analytics: Acquire Valued Shoppers. Team Members: Ayushi Singhal, Dharmen Mehta, Nimai Buch. Nov 20, 2014

Motivation
Competing retail companies; predicting sales; offering personalized deals; increasing the customer base.

Dataset, Algorithm, Tools
Dataset: https://www.kaggle.com/c/acquire-valued-shoppers-challenge/data (size: 22 GB)
Algorithms: user-based collaborative filtering, item-based collaborative filtering, logistic regression, Naïve Bayes, multilayer perceptron
Tools: Hadoop, Mahout
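Of the listed algorithms, logistic regression is compact enough to sketch directly. Batch gradient descent on log-loss for a toy single-feature problem; the data and learning rate are illustrative, and Mahout's trainers do the scaled-out equivalent over the 22 GB dataset:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logreg(xs, ys, lr=0.5, epochs=500):
    """Batch gradient descent on log-loss for one feature plus a bias term."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y   # gradient of log-loss w.r.t. the logit
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# toy data: label 1 iff x > 2 (e.g., "will the shopper repeat-purchase?")
xs = [0.0, 1.0, 3.0, 4.0]
ys = [0, 0, 1, 1]
w, b = fit_logreg(xs, ys)
```

The fitted model puts the decision boundary near x = 2, so low-x shoppers score below 0.5 and high-x shoppers above.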

Progress and Expected Contributions
Expected contributions: from the user perspective, customized offers; from the company perspective, more revenue.
Progress: We have acquired, studied, and cleaned our dataset; shortlisted the tools and algorithms we want to consider; and are setting up our virtual machine to run these algorithms on the dataset.

E6893 Big Data Analytics: Twitter-Based Product / Sales Events Recommender. Team Members: Qianyi Zhong, Dongxue Liu, Chia-Jung Lin, Sung-Yen Liu. Nov 20, 2014

Motivation
1. The need for an application that gives people effective and precise product recommendations simply by tweeting their desire for some product.
2. A platform to benefit both customers and local retailers.
Our goal: a product and sales event recommendation system based on a social network: (1) reply to tweets with information on the mentioned products; (2) a website for retailers to post sales events and for customers to subscribe and search (location-based).

Dataset, Algorithm, Tools
Dataset: tweets fetched from Twitter, feedback from our followers, Amazon product information
Algorithm: user-based recommendation; item-name tagging after NLP
Tools: Twitter API, Amazon API, Hadoop, TweetNLP/The Stanford Parser

Progress and Expected Contributions
Progress: fetching tweets using the Twitter API; website interface.
Expected contributions: 1. Build a system integrating Twitter and Amazon that is able to recommend items simply and automatically to our app's followers by querying Amazon with filtered tweets. 2. A diagram of the reciprocal platform.

E6893 Big Data Analytics: Yelp-er: Analyzing Yelp Data. Team Members: Naman Jain, Natasha Kenkre, Rhea Goel, Sanket Jain. Nov 20, 2014

Motivation/Problem Statement
Sentiment analysis: what makes a review cool, funny, or useful; find the most-talked-about topics for a business (bag of words, word cloud).
User reputation system: based on number of reviews, friends, fans, compliments, votes, yelping since, isElite(), average rating.
Trending today: for businesses, a popularity timeline; for users, location (heat map to find hubs for different cuisines within a city).
Recommendation system: item-based and user-based.

Dataset, Algorithm, Tools
Yelp Dataset Challenge: business, user, review, tip, check-ins
Bag-of-words, neural nets, SVM, recommendation algorithms
Hadoop, Wordle, Heat Map API, Weka

Progress and Expected Contributions
Cursory analysis of dataset; formulation of problem statement; figuring out the right tools and technologies; implementation of algorithms; integration of tools.
Mostly a collaborative effort, but: Naman Jain: sentiment analysis; Natasha Kenkre: recommendation system; Rhea Goel: user reputation system; Sanket Jain: Trending Today.

E6893 Big Data Analytics: Big Mobile Data. Team Member: Kevin Wang (fw2253). Nov 20, 2014

Motivation
Apply big data analysis, ideally in a mobile setting, or to organizing mobile data. Understand how big data can be applied in a telecom/mobile setting, and how to combine this with retail.

Dataset, Algorithm, Tools
Dataset: Yahoo Research Lab data set; mobile traffic data
Algorithm: classification, clustering, and recommendation
Tools: Mahout

Progress and Expected Contributions
Started to gather mobile data for analysis.

E6893 Big Data Analytics: Telecom Group Proposals. Nov 20, 2014

E6893 Big Data Analytics: Network Congestion Analysis. Team Members: David Cadigan, Hongjie Wang, Wei Zhang, Jiayi Yan. Nov 20, 2014

Motivation
The goal: design an intuitive tool to parse and determine shortest-path information for a computer network; a simple tool that takes our node/network-based dataset from download to final processing; calculate costs and paths similarly to how a packet-switched network with spanning tree functions.
What we learn and use: expand upon the tools we used during our homework assignment to extract and process datasets; a tool designed to further our understanding of Hadoop and Mahout, as well as explore other tools for big data analytics; exploration of what others are already doing in this field (examples include TCP traffic data analysis and spanning tree evaluation).

Dataset, Algorithm, Tools
Dataset: Stanford P2P network data, a Gnutella dataset which defines systems as nodes and the networks between them as edges. A stretch goal is to make the tool dataset-agnostic, able to process any dataset depending on the input; this is likely outside the scope of the initial project submission.
Algorithms: set up Mahout engines to build a shortest-cost-path tool similar to the spanning tree protocol in networking; use the classification and recommendation engines within Apache Mahout; classify slow vs. fast paths.
Tools: Mahout on Hadoop. Develop an initial parsing engine to get the data into the format we need (Perl, C, or Java based; the language is less important than the functionality).
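The shortest-cost-path core described above can be sketched with Dijkstra's algorithm over the same node/edge representation the Gnutella dataset uses. The edge costs here are hypothetical (Gnutella edges are unweighted), and the real tool would run the computation on Mahout/Hadoop rather than in memory:

```python
import heapq

def dijkstra(graph, source):
    """graph: {node: [(neighbor, cost), ...]}; returns cheapest cost to each node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                       # stale queue entry, already improved
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# toy network with hypothetical link costs
net = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 5)],
    "C": [("D", 1)],
}
```

The slow-vs.-fast path classification could then label paths by comparing their cost against the Dijkstra optimum.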

Progress and Expected Contributions
Progress: high-level design of tools and algorithm selection is under way. The next step is to develop the tools around the structure of the dataset we are using to pull and preprocess the data; after data formatting, we begin development of the overall analysis tools.
Contributions: each team member will contribute to all aspects of the project; tools, algorithm, and data selection have all been at the team level. Note that the team is a mix of on-campus students and a CVN student; we need to formalize communication, currently strictly through email and teleconference.

E6893 Big Data Analytics: Comparison Analysis of Different Telecom Operators. Team Members: Zhenying Zhu, Jiahui Cheng, Chenyun Zhao, Lingxue Li. Nov 20, 2014

Motivation
Analyze web crawl data to compare several ISPs across multiple types of web news: promotion activities, public reviews, community service news, etc. Cluster web pages for each ISP and make comparisons.

Dataset, Algorithm, Tools
Dataset: Common Crawl data from Amazon S3: information on billions of web pages; search through the contents; use ARC and text files.
Algorithms: MapReduce; clustering (k-NN clustering, spectral clustering, Canopy clustering).
Tools: Hadoop, Mahout, Apache Pig, Java for front-end development, Elasticsearch, SQL/NoSQL database

Progress and Expected Contributions
Progress, example clusters for AT&T: Cluster 1 (related to promotion plans), top terms: savings, free, limited time; Cluster 2 (related to community news), top terms: senior, charge, stores, shops; Cluster 3 (related to public reviews), top terms: price, speed, quality. Corresponding clusters for T-Mobile and Sprint.
Expected contribution: a search engine tool for customers choosing telecom operators, policy makers, and journalists.

E6893 Big Data Analytics: User Web Events Analysis Based on a Browser Extension. Zheang Li, Cong Zhu, Linjun Kuang, Yifei Xu. Nov 20, 2014

Motivation
Finding it difficult to concentrate while studying? We are developing software to monitor your browser activity. It provides a straightforward visualization of your browser events and lets you compare with others. Feeling the pressure? Stay focused.

Dataset, Algorithm, Tools
Dataset: our browser extension will collect user activities as the dataset.
Algorithm: 1. Monitor the footprint of the browser, generating an event sequence for each user. 2. Identify user event patterns. 3. Evaluate users' events based on time, age, area, etc.
Tools: JavaScript, Apache, Hortonworks

Progress and Expected Contributions
Expected workload: 1. Develop a browser extension to monitor users' web events. 2. Build a server to collect the data. 3. Analyze the data from several aspects. 4. Implement a web page to demo the results.
Expected contributions: 1. Provide visualization of your everyday browser activities. 2. Compare with other people of similar and different backgrounds. 3. Provide a dataset for further research.

E6893 Big Data Analytics: Analysis of Telecom Service in Cellular Networks. Team Members: Zhilei Miao (zm2221), Yizhe Wang (yw2625), Shibiao Nong (sn2603), Yaqi Chen (yc2998). Nov 20, 2014

Why choose this topic
How people allocate the usage of their telecom service and data plans: data calls vs. voice calls; peak phone call times; plan selection; how to provide better plans to fit customers' needs.

Dataset, Algorithm, Tools
Dataset: telecom service dataset from the Churn Response Modeling Tournament, 2003, provided by Duke University
Algorithms: for the telecom company, clustering; for the customer, user-based and item-based recommendation
Tools: Hadoop, Mahout, Neo4j

Progress and Expected Contributions. For telecom companies Divide customers into groups by clustering algorithms. Help telecom companies analyze the behaviors of customers Provide optimized service for different customer groups.. For client Analyze the characteristics of client such as data usage, payment trends and billing history Set up a recommendation system to recommend the most fit plan to customers. 200

E6893 Big Data Analytics: Human Activity Monitoring and Prediction. Team Members: Chao Chen, Junkai Yan, Qi Li. Nov 20, 2014

Motivation
A human activity monitoring and prediction system for health care, near-emergency early warning, fitness monitoring, and assisted living. Sensor data come from a practical, small, and unobtrusive platform, the smartphone: an accelerometer (3-axial linear acceleration) and a gyroscope (3-axial angular velocity), sampled at 50 Hz. Each person performed six activities: walking, walking upstairs, walking downstairs, sitting, standing, lying.
Activity recognition process pipeline.

Dataset, Algorithm, Tools
Dataset: http://www.smartlab.ws and http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
Algorithm: data preprocessing with FFT; classification with support vector machines, Naive Bayes networks, and k-nearest neighbors
Tools: Hadoop, Mahout, Pig, Java, MATLAB
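The pipeline (windowed sensor signal, then features, then a classifier) can be sketched end-to-end. This minimal version uses mean/variance features per window and a 1-nearest-neighbor classifier on synthetic single-axis data; the real project uses 3-axial 50 Hz recordings and FFT features:

```python
def window_features(signal, size):
    """Split a 1-D signal into non-overlapping windows; (mean, variance) each."""
    feats = []
    for i in range(0, len(signal) - size + 1, size):
        w = signal[i:i + size]
        m = sum(w) / size
        v = sum((x - m) ** 2 for x in w) / size
        feats.append((m, v))
    return feats

def knn1(train, query):
    """1-NN: return the label of the closest training feature vector."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda fy: d2(fy[0], query))[1]

# synthetic accelerometer-like data: 'standing' is flat, 'walking' oscillates
standing = [0.0, 0.1, 0.0, 0.1] * 4
walking = [0.0, 1.0, -1.0, 1.0] * 4
train = [(f, "standing") for f in window_features(standing, 4)]
train += [(f, "walking") for f in window_features(walking, 4)]
```

Swapping the (mean, variance) features for FFT magnitudes, and 1-NN for an SVM, recovers the pipeline the slide describes.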

Progress and Expected Contributions
Progress: analyzed the raw dataset of ADLs (activities of daily living) obtained from a competition; working on extracting useful information to be used for classification.
Expected contributions: 1. Analyze the signals detected by the accelerometer and gyroscope and convert them into a dataset. 2. Try different classification methods on the features extracted from the raw signals and compare classifier accuracy. 3. Create a predictive model that indicates the human activity state. 4. Apply our model to a different dataset of elderly volunteers aged 60-70 years to test its effectiveness.

E6893 Big Data Analytics: Transportation and Energy Group Proposals. Nov 20, 2014

E6893 Big Data Analytics: Minimizing Risk in Energy Arbitrage. Team Member: Adeyemi Aladesawe (aoa2124). Nov 20, 2014

Motivation
The Enron bankruptcy was the biggest news some time back. Frank A. Wolak, Department of Economics, Stanford University, described in a paper, "Arbitrage, Risk Management, and Market Manipulation: What Do Energy Traders Do and When Is It Illegal?", the events, especially market manipulation and sharp practices, that led to rising prices and Enron's collapse.
Energy traders buy power at low prices at location A and sell high at location B; profitability comes when the difference in prices exceeds the cost incurred to make the transaction happen. Buying and delivering energy immediately to spot markets incurs less risk than buying to deliver at a future time. These moments of demand are almost fleeting, and speed to act almost separates losers from profiteers. Thus, traders require knowledge of demand and supply trends, and the ability to predict, with a certain amount of confidence, that a location will demand energy priced at most at a threshold.

Dataset, Algorithm, Tools
The dataset contains multivariate data with date-timestamps and meter readings of a French household's energy consumption over a 4-year period spanning 2006 to 2010.
Suggested algorithms: 1) time-series analysis to spot trends; 2) plotting the data points to intuit a pattern or class of models; 3) regression to learn the underlying generative model of the dataset; 4) predicting/generating data points from the learned models, using Bayesian inference.
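Step 3 (regression) in its simplest form is a closed-form ordinary-least-squares fit of a linear trend, which also gives a one-step-ahead forecast for step 4. The numbers below are toy values, not readings from the French household dataset:

```python
def fit_trend(ts):
    """Closed-form OLS fit of y = a*t + b over t = 0..n-1."""
    n = len(ts)
    t_mean = (n - 1) / 2
    y_mean = sum(ts) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ts))
    den = sum((t - t_mean) ** 2 for t in range(n))
    a = num / den                 # slope: consumption change per time step
    b = y_mean - a * t_mean       # intercept
    return a, b

# toy consumption series with a clean upward trend
series = [10.0, 12.0, 14.0, 16.0]
a, b = fit_trend(series)
forecast = a * 4 + b              # one step ahead
```

The Bayesian-inference variant would replace the point estimates (a, b) with posterior distributions, so the forecast carries the risk bound the proposal asks for.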

Progress and Expected Contributions
Model hypothesis is ongoing; the next phase is to select a regression function to learn a model that fits the energy-time curve. The expected contribution is the ability to forecast future energy needs within a maximum threshold of risk. Unfortunately, the dataset under investigation is from France, so predictions will only be accurate for France.

E6893 Big Data Analytics: Best Transportation Choice. Team Members: Joseph Kevin Machado, Xia Shang, Zhao Pan, Andre Cunha. Nov 20, 2014

Motivation
There are many apps in the market for train routes or bus routes, but none of them tries to decide the best between them, or even to suggest a combination of both transit methods. The goal of our project is to find the quickest possible way for a user to reach their destination; the trip may include subway trips, bus trips, or a combination of both. The datasets are extremely large, and this is where a distributed file system framework such as Hadoop comes into play.

Dataset, Algorithm, Tools
Dataset: MTA dataset (cannot be published, but derived results can be used; we plan to use 2 weeks of data)
Algorithm: we make use of prediction algorithms built into Mahout
Tools: Hadoop, Mahout, Pig, Hive, HBase (tentative)

Progress and Expected Contributions
We have started by analyzing the available data and how to proceed toward our ultimate goal.
Contributions: Joseph Kevin Machado: Hadoop cluster and data management; Xia Shang: analyzing data; Zhao Pan: analyzing data; Andre Cunha: final report.

E6893 Big Data Analytics: Manage Energy Consumption with Smart Meters. Team Member: Xun Zhang (xz2348). Nov 20, 2014

Motivation
Big data: utility companies have one of the largest customer populations. A single state's energy provider may have to store and manage over a hundred TB of data, including customer information, weather and demographics, historical utility data, geographical data, and much more.
Energy efficiency: information technology will enable massive, smart collection and management of energy data. Personalized energy plans could therefore be developed to meet customer requirements while reducing consumption and boosting efficiency.

Approaches
Objectives: deliver personalized service; proactively address potential safety risks; transform the utility business into a data-driven service.
Analytics data: historical climate and demographic data; energy consumption data; utility price fluctuation data.
Algorithms: classification, decision and recommendation, etc.
Tools: Python, Hadoop

Agenda
Current: data collection and pre-processing (normalization and validation)
Weeks 12-13: data analysis
Weeks 13-14: algorithm implementation and testing

E6893 Big Data Analytics: Location-Specific Optimization of Taxi Efficiency in NYC. Team Members: Nick DeGiacomo, Preetam Dutta, Aamir Jahan, Tingting Lei. Nov 20, 2014

Motivation
Problem: taxi inefficiency in NYC; too many taxis in one area and not enough in others.
Application to big data: 485,000 yellow cab trips per day carrying 600,000 passengers per day; we can analyze trip data to allocate taxis more efficiently.
Solution: location-specific analysis based on time of day, seasonality, etc.; algorithms to help taxi drivers determine the optimal location to search for customers.

Dataset, Algorithm, Tools
Dataset: NYC TLC 2013, 11.0 gigabytes of data, containing source and destination information, average time between rides, GPS coordinates, fares, etc.
Algorithms/methods: time-series analyses, routing and scheduling algorithms, hypothesis testing, recommendation algorithms, etc.
Tools: AWS, Hadoop, R, and Mahout

Progress and Expected Contributions
Start (now): 11.0 GB of data, software, web interface. Middle: time-series analysis and recommendation algorithm. Finish (deliverable): web-based mobile application.

E6893 Big Data Analytics: Citi Bike System Data Analysis. Team Members: Zhefeng Xu, Wenxuan Zhang, Sun-Yi Lin, Yen-Hsi Lin. Nov 20, 2014

Motivation
Citi Bike is an innovative bike-sharing system set up in recent years that provides a simple, convenient, and eco-friendly way for New Yorkers and visitors to travel around the city. Engaging with this outstanding system, we face questions such as: Where do Citi Bikers ride? When do they ride? How far do they go? Which stations are most popular? On what days of the week are the most rides taken? Thus, we want to produce a comprehensive evaluation of the system.

Dataset, Algorithm, Tools
Dataset: the Citi Bike kiosk system records a large quantity of trip and user information, which we will use as our dataset.
Algorithms: k-means, Naïve Bayes, k-nearest-neighbor classification, Complement Naïve Bayes, fuzzy k-means clustering
Tools: Hadoop, Eclipse, Mahout, Hive, HBase

Progress and Expected Contributions
Progress: 1. Found the corresponding dataset. 2. Performed exploratory research and selected algorithms. 3. Became familiar with different big data tools.
Expected contributions: 1. Energy conservation. 2. Citi Bike usage rates by user type, age, gender, and region. 3. Behavior prediction for Citi Bike users.

E6893 Big Data Analytics: PeopleMaps. Team Members: Anirban Gangopadhyay, Esha Maharishi, Aditya Naganath, Abhinav Mishra. Nov 20, 2014

Motivation
GoogleMaps, Waze, and other path-recommendation services use a limited, pre-defined set of attributes to determine the worth of a path. We wish to utilize the very rich knowledge individuals have of their known environments to recommend paths: not measuring any attributes directly, but rather assuming that a user taking a path is voting for that path as holistically better than any other. By dynamically updating our knowledge of an environment through end users' choices, we can provide a nuanced, insider recommendation of the current optimal path between two locales.

Dataset, Algorithm, Tools
The data is crowdsourced from end users using Apple's CoreLocation framework. On a user request for the best path from A to B, if our crowdsourced data contains user-submitted paths from A to B, we use our collected paths to recommend the best one; if we do not have data from A to B, we supply the baseline GoogleMaps route suggestion, guaranteeing a minimum standard of quality. We will use k-means clustering with Euclidean distance to group collected paths together, and we will maintain a dynamic queue of paths based on a defined time interval, enqueueing and dequeueing to keep the location data updated.
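The "dynamic queue of paths based on a defined time interval" can be sketched as a timestamped deque that drops paths older than the interval on every insert. The path records and the 60-second interval below are hypothetical:

```python
from collections import deque

class PathWindow:
    """Keep only paths submitted within the last `interval` seconds."""

    def __init__(self, interval):
        self.interval = interval
        self.q = deque()          # (timestamp, path) pairs, oldest first

    def add(self, timestamp, path):
        self.q.append((timestamp, path))
        self._expire(timestamp)

    def _expire(self, now):
        # oldest entries are at the left, so expiry is a series of popleft()s
        while self.q and now - self.q[0][0] > self.interval:
            self.q.popleft()

    def paths(self):
        return [p for _, p in self.q]

win = PathWindow(interval=60)
win.add(0, ["A", "X", "B"])
win.add(30, ["A", "Y", "B"])
win.add(80, ["A", "Z", "B"])    # the t=0 path now falls outside the window
```

The k-means grouping would then run only over `win.paths()`, so stale routes never influence the recommendation.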

Progress and Expected Contributions
Determined the relevant APIs for collecting location data from users (CoreLocation framework) and for submitting requests to Google for baseline directions (GoogleMaps Directions API), and a well-fitting application stack for the system (node.js as the server, MongoDB as the database, and EC2 as the host), since it links our entire stack through JavaScript. Determined the interfaces between different parts of the system, such as the representation of paths in each document in the database.
iOS/insertion and backend/server: Esha & Abhinav. Mahout clustering & recommendation: Aditya & Anirban.

E6893 Big Data Analytics: Project Transeo: Making Public Buses More Efficient and Accessible. Dhruv Nair, Omar Kiyani, Manav Malhotra. November 20, 2014

Overview
The MTA provides access to real-time bus location information (the existing interface is single-route, text-based, and stop-based, with no discovery). This allows us to do two things:
- Provide consumers with an interface that makes accessing this information natural. This is based on the notion that bus use in NYC is mostly for short distances, and there are many parallel routes.
- Understand the optimal bus spacing and timing based on demand.

What we plan on building
Front-end: a mobile map that shows the nearest routes as well as bus locations in real time, plus a map with routes and bus overlap.
Back-end: poll the MTA API every 15 seconds; store bus location information; store consumer demand information.
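The back-end polling loop described above can be sketched in a few lines. This is a minimal illustration, not the team's implementation: `fetch` and `store` are hypothetical stand-ins for the MTA BusTime client and the datastore.

```python
import time

def poll_bus_positions(fetch, store, interval_s=15, max_polls=3):
    """Poll a bus-position feed every `interval_s` seconds and store each record.

    `fetch` and `store` are hypothetical callables standing in for the MTA
    BusTime API client and the datastore, respectively.
    """
    for _ in range(max_polls):
        snapshot = fetch()            # e.g. one record per active bus
        for record in snapshot:
            store(record)             # persist bus id, route, lat/lon, timestamp
        time.sleep(interval_s)

# Tiny in-memory demo with stubbed fetch/store:
stored = []
fake_feed = [{"bus_id": "M60-1", "route": "M60", "lat": 40.81, "lon": -73.95}]
poll_bus_positions(lambda: fake_feed, stored.append, interval_s=0, max_polls=2)
print(len(stored))  # 2 polls of a 1-bus feed -> 2 stored records
```

In a real deployment the loop would run under a scheduler and write to a database rather than a list.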

Data uses
Tie bus movement patterns to consumer demand and traffic
Analyze bus overlap and demand (we will show a sample analysis with a published BusTime dataset)
Provide the MTA with a continuous analysis of their services for logistical optimization

E6893 Big Data Analytics: Image-Based Geo-localization. Team Members: Christopher Stathis, Yongchen Jiang. Nov 20, 2014

Motivation
Q: Where was this picture taken?
A computer vision problem requiring: a large database of imagery; considerable computational power.
General approach: build a database of images tagged with GPS information; generate features for an input image; match features against those in the DB.
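The general approach above (GPS-tagged database, feature extraction, matching) reduces to nearest-neighbor search over image descriptors. A toy sketch, with made-up three-dimensional descriptors and approximate landmark coordinates standing in for real SIFT-style features:

```python
import math

# Hypothetical database: one global descriptor per image, tagged with GPS.
db = [
    ([0.9, 0.1, 0.0], (40.8075, -73.9626)),  # Columbia campus (approx.)
    ([0.1, 0.8, 0.1], (40.7484, -73.9857)),  # Empire State Building (approx.)
    ([0.0, 0.2, 0.8], (40.6892, -74.0445)),  # Statue of Liberty (approx.)
]

def locate(query_desc):
    """Return the GPS tag of the database image whose descriptor is closest
    (Euclidean distance) to the query image's descriptor."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, gps = min(db, key=lambda entry: dist(entry[0], query_desc))
    return gps

print(locate([0.85, 0.15, 0.05]))  # nearest descriptor is the Columbia entry
```

Real systems replace the linear scan with an index such as a vocabulary tree, which is exactly what the proposal lists under "searching and matching".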

Application
Autonomous vehicle navigation
Augment inaccurate GPS systems
Auto-tag camera pictures
Localize (human) users lost on the street

Dataset, Algorithm, Tools
Dataset: testing set of 100 images of Columbia University; practical set: Google Street View
Algorithms: matching with SIFT, ASIFT, SURF; searching and matching with Mahout recommendation and a vocabulary tree
Tools: Python, OpenCV, Hadoop, Mahout, Google API

Progress and Expected Contributions
Using Python-OpenCV to build SIFT
Building the dataset
Testing image-matching algorithms
Exploring distributed storage strategies

E6893 Big Data Analytics Media Group Proposals. Nov 20, 2014

E6893 Big Data Analytics Project: HVision Proposal. Emad Barsoum (eb2871), DES student, Dept. of Computer Science, focusing on Computer Vision. HVision is a scalable computer vision platform on top of Hadoop. November 20, 2014

Goal and Summary
Goal:
1. Provide a scalable vision platform for the Computer Vision community, to run vision algorithms on huge amounts of data in parallel.
2. Provide a solution to Hadoop's small-files problem, in order to efficiently handle large numbers of images.
3. Write a scalable Content-Based Image Retrieval (CBIR) pipeline using different algorithms.
4. HVision is to Computer Vision what Mahout is to Machine Learning.
Summary: HVision will be composed of various components; here is the high-level list:
1. Image Processing: Hadoop map tasks that perform per-image processing.
2. Feature Extraction: Hadoop map tasks that use Computer Vision to extract various image features (e.g. histogram, SIFT, SURF, HOG, etc.).
3. Analytics: MapReduce tasks that solve an end-to-end image task, such as querying with an image as input.
4. Tools: tools that pack the input set of images, and unpack the result for viewing or analysis.
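Goal 2, the Hadoop small-files problem, is usually addressed by packing many small images into one large file of key/value records. A minimal Python sketch of the idea behind the "Images to Seq" tool (the project itself uses Hadoop SequenceFiles and Java; the length-prefixed binary layout here is illustrative, not the actual SequenceFile format):

```python
import io
import struct

def pack(images):
    """Pack many small (name, bytes) pairs into one container blob:
    length-prefixed key/value records, so one large file replaces many
    tiny ones on HDFS."""
    buf = io.BytesIO()
    for name, data in images:
        key = name.encode()
        buf.write(struct.pack(">II", len(key), len(data)))  # key length, value length
        buf.write(key)
        buf.write(data)
    return buf.getvalue()

def unpack(blob):
    """Inverse of pack(): recover the (name, bytes) records."""
    off, out = 0, []
    while off < len(blob):
        klen, vlen = struct.unpack_from(">II", blob, off)
        off += 8
        name = blob[off:off + klen].decode()
        off += klen
        data = blob[off:off + vlen]
        off += vlen
        out.append((name, data))
    return out

imgs = [("a.jpg", b"bytes-of-a"), ("b.jpg", b"bytes-of-b")]
assert unpack(pack(imgs)) == imgs
```

Mappers can then stream records out of the one packed file instead of opening thousands of tiny files, which is what makes the approach scale.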

HVision High-Level Architecture (diagram): HVision components, namely Mappers (Image Processing, Feature Extraction), MapReduce jobs (Image Retrieval/CBIR, Classification), and Tools (Images to Seq, Seq to Images), run on Hadoop, reading data from and writing results to HDFS, with job submission and status handled through the Hadoop API.

Progress, Dependencies and What's Next
Progress (all the items below are done and checked in):
1. Maven-based project with multiple modules on GitHub with an Apache license: https://github.com/ebarsoum/hvision.git
2. Most of the plumbing and architecture decisions are done (full details will be in the final report).
3. A command-line tool for packing images into a sequence file, and another that unpacks a sequence file back into images.
4. A simple mapper-only job for verification.
5. A main driver entry point that reroutes each command to the right job (inspired by Mahout); the goal is to simplify job submission so the syntax is: hvision <cmd> <args>.
Dependencies:
1. OpenCV and JavaCV.
2. The Google Guava library.
3. The Hadoop library (the version can be changed in the Maven build).
What's next:
1. Algorithms: parallelize content-based image search on top of MapReduce.
2. Test the above on various image datasets.
3. As time permits, keep porting more vision algorithms and optimizing existing ones.

E6893 Big Data Analytics: PlayPalate, a music playlist generator. Devin Jones & Andy Enkeboll. Nov 20, 2014

Motivation
PlayPalate will find new music based on your taste in music. The webapp will deliver a personalized Spotify/Rdio playlist based on artist similarity measures and relationships and a user's social graph.

Dataset, Algorithm, Tools
Datasets: Spotify API, Facebook API, Rovi API
Algorithms: NLP & similarity computation (Tanimoto, log-likelihood, cosine, Euclidean distance, etc.)
Tools: stack: Rails app on Heroku; user DB: MySQL; graph DB: Neo4j (GrapheneDB); data warehouse: Hadoop (Treasure Data); message queuing: Beanstalkd; data processing: Python (NLTK)
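Two of the similarity measures listed above, Tanimoto and cosine, can be illustrated in a few lines of Python (the project plans to compute these over artist data; the inputs below are made up):

```python
import math

def tanimoto(a, b):
    """Tanimoto coefficient on two sets (e.g. the sets of users who follow
    each artist): |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(u, v):
    """Cosine similarity on two equal-length numeric vectors
    (e.g. term-frequency vectors from NLP over artist descriptions)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

print(tanimoto({"u1", "u2", "u3"}, {"u2", "u3", "u4"}))  # 2 shared of 4 total = 0.5
print(round(cosine([1, 0, 1], [1, 1, 1]), 3))
```

Tanimoto suits binary follow/like data, while cosine suits weighted feature vectors; trying both is consistent with the proposal's list.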

Progress and Expected Contributions
Progress: vetted project ideas; researched available data sources, algorithms, and tech stack
Expected contributions: back end: Andy; API connections & ETL: Devin; similarity/recommendation algorithms: Devin & Andy; playlist generation: Devin; front end: Andy

PlayPalate Data Pipeline (diagram): the user's music history/preferences feed artist similarity measures computed using NLP & graph traversal (Tanimoto, log-likelihood, cosine, Euclidean distance, etc.).

E6893 Big Data Analytics: Movie Exploration. Team Members: Yongjie Cao (yc2978), Fengyi Song (fs2523), Yuzhe Shen (ys2821), Hui Zou (hz2361). Nov 20, 2014

Motivation
Audience preference: I like comedies; I prefer blockbusters; I like Robert Downey Jr; I'd like to enjoy movies with my daughter; I...
Direction for producers: What kinds of movies would customers like to see? Which actor is a perfect match for this character? Are there any qualities that make a memorable movie?

Dataset, Algorithm, Tools
Dataset:
- movie_review: Source: Amazon; Quantity: 7,911,684 reviews; Content: movie name, score, comments
- movie_test: Source: Yahoo Lab; Quantity: 211,231 ratings; Content: movie name, score, who gives that score
- movie_train: Source: Yahoo Lab; Quantity: 11,915 ratings; Content: movie name, score, who gives that score
Algorithms:
- Pig: process user profiles
- Clustering & classification: find correlations among audiences
- Recommendation: make predictions based upon the keywords, extractions, and preferences

Progress and Expected Contributions (pipeline diagram)
Data pool inputs: from the audience: expected category/actor, comments and reviews, ratings, expectations; from producers: profits, honors, awards, costs.
Processing: keywords (universe, disaster, ...), actors & actresses, correlation (similarities, clusters (k-means, ...), ratings, ...), recommender (user-based and item-based recommendations, ...).
Outputs: recommended movies, preferred category, category average score, related movie name + rating.

E6893 Big Data Analytics: Fantasy Basketball Winning Strategy. Team Members: Hao-Hsiang Chuang, Kun-Yen Tsai, Lin Su, Yujia Gu. Nov 20, 2014

Motivation
Sports play a big role in the USA; as a result, people spend much of their leisure time watching the NBA, MLB, NFL, etc. Within the last five years, ESPN and Yahoo have released online fantasy games for all the professional sports leagues, giving people the chance to form a simulated team online and compete with other gamers. The ranking of the online teams is based on the real-world performance of the players they pick. Therefore, drafting becomes a key factor in winning the game. We wish to provide a winning strategy for drafting. Moreover, we hope to provide a practical dynamic strategy for real NBA teams to choose players.
Our goals:
- Recommend a player to the online gamer every round
- Cluster all the players into groups by functionality, such as scoring, rebounding, etc.
- Rank the players inside each clustered group
- Give suggestions based on the present status of the online gamers

Dataset, Algorithm, Tools
Dataset: NBA player statistics, http://downloads.nbastuffer.com/nba-player-data-sets
Algorithms:
- Clustering: k-means clustering, fuzzy k-means clustering, etc.
- Classification: PLA algorithms, linear regression, etc.
- Recommendation
Tools: Mahout, Matlab, and every resource provided in class
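The clustering step, grouping players into roles by per-game statistics, can be sketched with plain k-means and Euclidean distance (the project plans to use Mahout; the stats below are invented two-dimensional examples of points and rebounds per game):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means with Euclidean distance: a minimal sketch of grouping
    players by per-game stats (here 2-D tuples: points, rebounds)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each player to the nearest center (squared distance).
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Recompute each center as the mean of its cluster (keep old if empty).
        centers = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical (points-per-game, rebounds-per-game) stats:
players = [(25.0, 5.0), (27.0, 6.0), (8.0, 11.0), (7.0, 12.0)]
centers, clusters = kmeans(players, k=2)
print(sorted(len(c) for c in clusters))  # two groups of two: scorers vs rebounders
```

Fuzzy k-means, the other listed option, replaces the hard assignment step with per-cluster membership weights.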

Progress and Expected Contributions
Progress:
- Surveyed the point system of Fantasy Basketball
- Searched for all the datasets that might be useful for our recommendation
- Got familiar with all the tools and algorithms we might use
Expected contributions:
- Set up a dynamic recommendation system for the game
- Help gamers build a balanced championship team

E6893 Big Data Analytics: Fantasy Basketball Prediction Using Previous Seasons' Data. Team Members: ChiaKang Chao. Nov 20, 2014

Motivation
Fantasy basketball has been around for a long time, and many people participate every season. There are many prediction algorithms online (e.g. ESPN, Yahoo), but these are fixed and cannot be altered; flexibility is lacking. Therefore, a self-made predictor using previous seasons' data can be really helpful and will fit personal needs.

Dataset, Algorithm, Tools
The dataset is acquired from basketball-reference.com. The most up-to-date data costs 650 dollars, so the alternative is to use older data (up to 2009). Instead of using the current season to check accuracy, use the 2009 season for testing and train with 1949-2008. Use Mahout to train and classify with k-means, then the EM algorithm.

Progress and Expected Contributions
The data has been acquired after talking to the admin of the website, and its format has been rearranged for processing; so far, all progress has been on pre-processing of the raw data. The value of k for k-means still needs to be determined. So far, super stars, starters, role players, and bench warmers are set as classes, but more classes/categories would make the algorithm more versatile.

E6893 Big Data Analytics Project: Hunting for NBA Players. Team Members: Su Shen, Tianji Wang, Miao Lin. Nov 20, 2014

Motivation
Amar'e Stoudemire of the New York Knicks; salary this year: $23,410,988 (2nd highest in the NBA league).
Season / Games Played / Rebounds / Points:
12-13: 29, 5.0, 14.2
13-14: 65, 4.9, 11.9
14-15: 10, 7.7, 10.7
Give insightful advice to team managers to make decisions:
1. Quantify players' performances
2. Evaluate each team's demand
3. Recommend the most-matched players to each team

Dataset, Algorithm, Tools
Dataset from http://espn.go.com/nba/statistics and http://stats.nba.com/: historical data for all teams, players, and games from the last 10 seasons.
We write Python code to extract data from the above URLs:
- Beautiful Soup: scrape the data
- pandas: Python data analysis library
- Requests: integrates Python with HTTP seamlessly
(based on the open-source project "Extracting NBA data from ESPN", http://danielfrg.com/blog/2013/04/01/nba-scraping-data/)
With the above dataset and tools, we will implement our own map-reducer and recommender.
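The scraping step, pulling rows out of an HTML stats table, can be illustrated with only the standard library (the project uses Beautiful Soup and Requests; the `TableScraper` class and the HTML fragment below are illustrative stand-ins):

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Minimal stand-in for the Beautiful Soup step: collect the text of
    every <td> cell in an HTML stats table, using only the stdlib."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

# A hypothetical fragment of a player-stats page:
html = ("<table><tr><td>Amar'e Stoudemire</td>"
        "<td>10</td><td>7.7</td><td>10.7</td></tr></table>")
scraper = TableScraper()
scraper.feed(html)
print(scraper.cells)
```

Beautiful Soup replaces this hand-rolled state machine with `find_all("td")`, and pandas can then turn the cell lists into a DataFrame for the map-reducer's input format.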

Progress and Expected Contributions
Project status:
1. Got some datasets from espn.com
2. Tailoring the datasets to meet the input format of our map-reducer
3. Working on implementing the map-reducer
Dataset acquisition and formatting: Tianji Wang. Algorithms and implementation: Su Shen, Miao Lin.

E6893 Big Data Analytics: Music-Links: suggesting music and potential friends with similar tastes. Team Members: Boren Liu (bl2547), Yihan Zou (yz2575), Jiayin Xu (jx2238), John Grossmann. Nov 20, 2014

Motivation
The rise of portable mp3 players and downloaded music has made music recommendation a larger part of major e-commerce and massively used applications (iTunes, Amazon). With the widespread use of social media sites, it is possible to efficiently mine user contextual data along with the music preferences of a vast and diverse population. This new data renders old music recommendation algorithms, based solely on music content and preference, obsolete.

Dataset, Algorithm, Tools
Dataset: Million Musical Tweets Dataset
Algorithms: Mahout recommendation algorithms (user-based and item-based, with different similarity measurements), geographic averaging; optional: clustering for geographic information
Tools: Mahout, Hadoop, Java
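A user-based recommender of the kind Mahout provides can be sketched in miniature. This toy version uses a deliberately crude similarity (the count of identically rated shared items) rather than Mahout's similarity measures, and the users and songs are invented:

```python
def recommend(ratings, target, k=2):
    """User-based collaborative filtering sketch: score items the target
    hasn't rated by summing the ratings of the k most similar users."""
    def sim(u, v):
        shared = set(ratings[u]) & set(ratings[v])
        return sum(1 for i in shared if ratings[u][i] == ratings[v][i])

    neighbors = sorted((u for u in ratings if u != target),
                       key=lambda u: sim(target, u), reverse=True)[:k]
    scores = {}
    for u in neighbors:
        for item, r in ratings[u].items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0) + r
    return max(scores, key=scores.get) if scores else None

ratings = {
    "ann": {"songA": 1, "songB": 1},
    "bob": {"songA": 1, "songB": 1, "songC": 1},
    "cat": {"songD": 1},
}
print(recommend(ratings, "ann", k=1))  # bob is most similar, so songC
```

The proposal's geographic angle would enter here as an extra weighting term, boosting neighbors whose tweets come from nearby locations.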

Progress and Expected Contributions
After sufficient research, we now know specifically what we're going to do; so far, we've got our datasets and a plan to accomplish the project. By the end, we look forward to being able to recommend different music to different users based on the music they like and their geospatial context.

E6893 Big Data Analytics: Affective Computational Cinematography. Team Members: Brendan Jou & Joseph G. Ellis. Nov 20, 2014

Motivation
Q: What movies, and which portions of movies, are most appealing to a user?
A: The movies that appeal most to a user's desired emotion.
Why movies? Movies are created in a way that elicits strong and varied emotional responses from an audience. Movie creators are specialists in creating a movie that elicits specific emotions at particular points within the movie. There will be a high degree of perceived-emotion correlation between different movie-goers at the same point in time, whereas the perceived emotion between viewers of user-generated content and generic videos will not correlate as highly as within movies.
Project goals: create an emotionally-inspired movie-scene mid-level feature ontology for emotion classification in movie videos; build concept detectors for emotionally relevant scene attributes that can be used for emotion prediction of movie scenes. Examples of concepts: Fight, Silence.

Dataset, Algorithm, Tools
Dataset: movie trailers crawled from IMDB. Movie concept labels: crawled from the IMDB plot keywords used to describe each movie; examples: Fight, Sex Scene, Yelling, Night, Day, etc.
Algorithms: feature extraction: clustering (k-means), SIFT, neural networks, Fisher vector encoding; classification: multiple-instance learning, SVM, logistic regression
Tools: Python, BeautifulSoup, CUDA, VLFeat, Matlab, cluster computing

Progress and Expected Contributions
Current progress: wrote crawlers for the IMDB trailers and labels; downloaded the dataset (~1400 trailers with labels); classification framework written and completed.
In progress: feature extraction (dense SIFT with Fisher vector encoding; face detection and feature extraction; audio feature extraction if time permits; shot detection); training concept detectors; computing accuracy.
Contributions: a concept-detector bank for emotionally relevant movie-scene attributes that can be used for movie annotation or emotion prediction.

E6893 Big Data Analytics: Improving a Movie Recommender System with User Behavior Changes and Demographics. Team Members: Jimin Choi (CVN student). Nov 20, 2014

Motivation
- Most movie recommender systems rely on raw user ratings without considering other important factors such as opinion changes over time, demographics, and genre bias.
- Human mood changes from time to time, and the interpretation of a rating scale differs from person to person; the same movie can be rated differently by the same user after some time.
- Traditional recommendation algorithms need to be improved to take the human aspects of movie rating into account. Movie and music ratings naturally have an emotional aspect; relying purely on numbers would not give the best results.
- My small but ambitious big data project is to come up with a significant improvement to movie recommendation systems.

Dataset, Algorithm, Tools
- Dataset: Yahoo Labs' Movies User Ratings and Descriptive Content Information, v.1.0 (R4); Yahoo Music ratings for User Selected and Randomly Selected songs (R3); and other possible movie ratings if available
- Algorithm: similarity-based user- and item-based recommendations; my own filtering and weighting methods for improving the recommendation algorithms
- Tools: the Java programming language, the Apache Mahout open-source library, Excel for charts and graphs, and a REST API framework (if time permits, I will develop APIs out of the research)
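One plausible form for the proposed "weighting methods", given the motivation's focus on opinion change over time, is recency weighting, where a user's older ratings count for less. The half-life scheme below is an illustrative assumption, not the author's algorithm:

```python
import math

def decayed_rating(history, half_life_days=180.0):
    """Exponentially decayed average of one user's ratings for one movie:
    a rating loses half its weight every `half_life_days` (an assumed value).
    `history` is a list of (rating, age_in_days) pairs."""
    lam = math.log(2) / half_life_days
    num = sum(r * math.exp(-lam * age) for r, age in history)
    den = sum(math.exp(-lam * age) for _, age in history)
    return num / den

# A 2-star rating from a year ago vs. a 4-star rating from last week:
blended = decayed_rating([(2.0, 365), (4.0, 7)])
print(round(blended, 2))  # close to 4: the recent opinion dominates
```

The decayed value can then be fed to a standard Mahout-style recommender in place of the raw rating, which is exactly the kind of pre-filtering layer the proposal describes.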

Progress and Expected Contributions
- Solo project
- The learning exercises for Apache Mahout have been successfully completed, and the data sets for experimentation are approved and compiled
- Several default recommendation algorithms were tested against the specific data set
- Remaining: finalize the algorithm design for improving the existing recommendation system; implement the algorithms and evaluate the new scheme; refinement and testing; write the report and final presentation

E6893 Big Data Analytics: MOVIE RECOMMENDATION AND ANALYSIS OF THIS APPLICATION. Zihao Wang, Mingyuan Wang, Jing Guo. Nov 20, 2014

Motivation
Background: with the rapid growth of the movie industry, people face numerous choices among different kinds of movies. People are overwhelmed by the choices, and it may take a lot of time to decide which movie to watch. We can recommend movies for users:
1. Users register with some information, like age, gender, etc.
2. Cluster movies according to several factors, like category and age.
3. Recommend movies to users by priority.

Dataset, Algorithm, Tools
Dataset: IMDB (Internet Movie Database)
Algorithms:
1. Write Java code to process the dataset extracted from the website
2. Fuzzy k-means
3. User- and item-based recommendation
Tools: Eclipse, Mahout

Progress and Expected Contributions
Progress:
1. Download the dataset and format it.
2. Design the appearance and functions of the application.
3. Set up the development environment.
4. Implement the functions of the application.
Contributions:
- Obtaining and formatting the dataset (Mingyuan Wang)
- Implementing the functions (Zihao Wang)
- Testing, analyzing the results, and improvement (Jing Guo)

E6893 Big Data Analytics: TV Genome Project / Recommendation Engine. Analytics Media Group. Team Members: Ishaan Sayal, Preeti Vaidya, Joshua Edgerton, Samuel Sharpe. Nov 20, 2014

Motivation
TV Genome Project / Recommendation Engine for AMG: as a media company, AMG constantly analyzes people's viewing habits and television interests. The user inputs a few shows they currently watch, and the tool predicts a few shows they might also like.
Motive: a personalized TV recommendation system focused on viewers' historical viewing records and demographic data; match the most promising targets with their actual viewing habits to find the best shows to recommend to them.
Future prospects: products aimed at gauging the potential viewership/interest/success of a new show based on the characteristics of past shows and their relative success.

Dataset, Algorithm, Tools
AMG data: TV set-top boxes and online ad networks. Fields in the dataset: user_id, series_id, time_viewed_of_series, series_tuning_instances, series_days_tuned, first_date_key_active, last_date_key_active, total_hours_viewed, total_tuning_instances, total_days_tv_watched.
Algorithms used: item-based recommendation; possibly clustering; collaborative filtering with user and movie attributes.
Tools/languages/IDE and their application: SQL: data querying and preprocessing; Java/Eclipse and Mahout: recommendation algorithm; PHP/Yii Framework: UI design.

Progress and Expected Contributions
Task / Team Members / Status:
1. Evaluation of data (collecting useful information, pseudo-rating experimentation) / Joshua, Samuel / Done
2. Running recommendation algorithms on collected data / Preeti, Ishaan / On time (11-21-2014)
3. Comparing algorithms, rating metrics, and their performance / Samuel, Joshua / Scheduled (11-28-2014)
4. Expanding models to include user and show attributes / Preeti, Samuel, Ishaan / Scheduled (11-28-2014)
5. UI design / Preeti, Samuel / Scheduled (12-07-2014)
6. Database implementation / Joshua, Ishaan / Scheduled (12-07-2014)
7. Report and delivery / Entire team / Scheduled (12-13-2014)

E6893 Big Data Analytics: Trendy Writer. Yin Hang, Meng-Yi Hsu. Nov 20, 2014

Outline

General Approach
(1) Grab massive data
(2) Filter out stop words
(3) Split data into training and testing sets
(4) Train a model to get popular word(s)
Goal:
- Return popular single words in real time
- Return popular related words in real time

Data Source
(1) Source: major magazine websites, e.g. the Huffington Post
(2) Fetch data from those websites programmatically
Goal:
- Grab over 10 thousand sentences in total
- Grab new data in real time

Word Count
Single word count (Pig):
(1) Load data from file
(2) Filter out stop words
(3) Group by word to get word counts
(4) Export the result
Related words (Mahout clustering):
(1) Split data into training and testing sets
(2) Find the best existing model for this project
(3) On top of the chosen model, improve and customize it
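The single-word-count pipeline above maps directly onto a few lines of Python (the project uses Pig; the stop-word list and sample sentences below are placeholders):

```python
from collections import Counter

# A placeholder stop-word list; a real run would use a fuller set.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}

def word_count(sentences):
    """Mirror of the Pig pipeline in plain Python: load the data, filter
    out stop words, group by word, and return the counts."""
    counts = Counter()
    for line in sentences:                  # (1) load data
        for word in line.lower().split():
            if word not in STOP_WORDS:      # (2) filter out stop words
                counts[word] += 1           # (3) group by word / count
    return counts                           # (4) export result

data = ["the model is trending", "trending words in the news"]
print(word_count(data).most_common(1))  # [('trending', 2)]
```

In Pig, steps (1)-(4) become LOAD, FILTER, GROUP BY with COUNT, and STORE over the full crawled corpus.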

Graph Database
(1) Inject data into a graph DB (Neo4j) according to the clustering result
(2) Use Cypher to query the data
(3) Get the keywords of each article
(4) Recommend topics by keywords

E6893 Big Data Analytics: Analysis of Pricing Strategy for Sports Teams. Team Members: Han Cui. Nov 20, 2014

Motivation
Professional sports ticket pricing is a dynamic business, since it constantly changes with the competitiveness of the team, mainly its performance, which is in turn affected by many other elements of the team.
- One important factor in a sports team's ticket pricing is the team's performance
- Buying a ticket is a strategic decision for customers
- There are various tickets available, and pricing these tickets is also a strategic decision
- Thus there is a relatively optimal decision on ticket pricing for a sports team

Dataset, Algorithm, Tools
This project will analyze data from major European soccer teams from regions like England, Germany, etc. The tool used for the project will be Apache Mahout, and there will be a pseudo-application based on the C language as open-source code. One important reason for choosing Apache Mahout is that the results can be displayed visually. Since it is a generally complicated project, only the main factors whose data can be legally extracted and applied will be considered.

Dataset, Algorithm, Tools: Project Result
For this project I should obtain a set of optimized parameters for each variable considered, yielding an optimal pricing decision for the sports team.
- The results should be parameters within a certain range
- The results should also be displayed visually
- The results should be able to explain the real performance-wise and pricing-wise differences between clubs

Progress and Expected Contributions
Week 1: pseudo C code created for the project
Week 2: Apache Mahout project created according to the pseudo code
Week 3: test the project with all the available data and adjust the project code until optimal

E6893 Big Data Analytics: Spark NLP. Team Members: Pierre Arnoux, Neraj Bobra, Talha Ansari. Nov 20, 2014

Motivation
NLP is one of the most trending fields in machine learning. Hadoop has machine learning libraries, but MapReduce is becoming a bottleneck. Let's do something more efficient: Spark.

Dataset, Algorithm, Tools
Dataset: NewsGroup and/or Wikipedia dataset
Algorithm: k-means, and potentially LDA
Tools: Spark, AWS

Progress and Expected Contributions
Accuracy benchmarks have been set, and we will try to get comparable results. We aim to contribute to an open-source machine learning library for Spark.

E6893 Big Data Analytics Social Science and Government Group Proposals. Nov 20, 2014

E6893 Big Data Analytics: PeopleMaps. Team Members: Anirban Gangopadhyay, Esha Maharishi, Aditya Naganath, Abhinav Mishra. Nov 20, 2014

Motivation
GoogleMaps, Waze, and other path-recommendation services use a limited and predefined set of attributes to determine the worth of a path. We wish to utilize the very rich knowledge of individuals in their known environments to recommend paths. We do not measure any attributes directly, but rather assume that a user taking a path is voting for that path as holistically better than any other. By dynamically updating our knowledge of an environment through end-users' choices, we can provide a nuanced, insider recommendation of the current optimal path between two locales.

Dataset, Algorithm, Tools
The data is crowdsourced from end-users using Apple's CoreLocation framework. On a user request for the best path from A to B, if our crowdsourced data contains user-submitted paths from A to B, we use our collected paths to recommend the best one; if we do not have data from A to B, we supply the baseline GoogleMaps route suggestion, guaranteeing a minimum standard of quality. We will use k-means clustering with Euclidean distance to group collected paths together. We will maintain a dynamic queue of paths based on a defined time interval, enqueueing and dequeueing accordingly to keep the location data up to date.
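The dynamic queue of paths can be sketched with a deque that evicts submissions older than the chosen interval. The class name, the one-hour window, and the sample paths below are illustrative assumptions, not the team's design:

```python
from collections import deque

class PathWindow:
    """Sketch of the dynamic path queue: keep only user-submitted paths whose
    timestamp falls inside a fixed time interval, dequeuing stale ones as new
    submissions arrive. Timestamps are plain seconds for simplicity."""
    def __init__(self, interval_s):
        self.interval_s = interval_s
        self.q = deque()                  # (timestamp, path) pairs, oldest first

    def submit(self, ts, path):
        self.q.append((ts, path))
        while self.q and self.q[0][0] < ts - self.interval_s:
            self.q.popleft()              # drop paths older than the window

    def current_paths(self):
        return [p for _, p in self.q]

w = PathWindow(interval_s=3600)           # one-hour window (an assumed value)
w.submit(0, ["A", "x", "B"])
w.submit(1800, ["A", "y", "B"])
w.submit(7200, ["A", "z", "B"])           # this submission expires the first two
print(w.current_paths())
```

The k-means step would then cluster only `current_paths()`, so recommendations always reflect recent end-user behavior.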

Progress and Expected Contributions
Determined the relevant APIs for collecting location data from users (the CoreLocation framework), for submitting requests to Google for baseline directions (the GoogleMaps Directions API), and a well-fitting application stack for the system (node.js as the server, MongoDB as the database, and EC2 as the host), since it links our entire stack well through JavaScript. Determined the interfaces between the different parts of the system, such as the representation of paths in each document in the database.
iOS/insertion and backend/server: Esha & Abhinav. Mahout clustering & recommendation: Aditya & Anirban.

E6893 Big Data Analytics: Predicting Usefulness of Restaurant Reviews from Subtopics Using Yelp Data. Team Members: Yu-Hua Cheng (yc2911), Jingchi Wang (jw3153). Nov 20, 2014

Motivation
Description of tasks:
- Topic modeling: categorize restaurant reviews into a number of subtopics. Ideally, we expect these subtopics to be food, service, environment, delivery, etc.
- Classification: after getting the subtopics, use several classification algorithms to predict the usefulness of the reviews in each subtopic.
Why subtopics? (Topic modeling) Users have different purposes or topics of interest toward one business. For instance, some may be interested in the food quality, while others may want to know about the staff service. Those who order online may mostly care about delivery quality (speed and food), whereas those who go to restaurants may care a lot about environment and parking. Therefore, if we can categorize the reviews into subtopics, it may help users find the targeted reviews.
Why predict usefulness? (Classification) First, why not use the available votes (how many users found a review helpful)? For some less popular restaurants, all the reviews may have zero useful votes because they are barely viewed. Second, a newly posted review is more likely to have fewer votes, but that does not necessarily mean it is less useful. If we can predict the usefulness of reviews, we can surface high-quality reviews in one specific subtopic to Yelp users.

Dataset, Algorithm, Tools
Data: Yelp 2014 Dataset Challenge (review dataset). Five cities: Phoenix, Las Vegas, Madison, Waterloo, and Edinburgh; 42,153 businesses; 252,898 users; 1,125,458 reviews.
Algorithms (topic modeling and classification):
- Latent Dirichlet Allocation (LDA): to get the subtopics for our classification tasks, we will first use LDA. It is a widely used topic model for generating topics from documents; the result is a given number of topics, each composed of a set of terms (topic words).
- Naïve Bayes: selected because it is a traditional probabilistic method for text categorization. The model applies Bayes' theorem and assumes strong independence between variables.
- Logistic model: a very popular probabilistic model for predicting a categorical dependent variable; it models the probability that the response belongs to a particular category.
- Support Vector Machines (SVM): perhaps the most widely used supervised learning algorithm for classification. It differs from the logistic model in that it is a non-probabilistic binary linear classifier; moreover, non-linear classification can also be performed with SVM using a specific kernel.
Tools: R, Python, Mahout
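As a concrete illustration of the Naïve Bayes step, here is a tiny multinomial classifier with add-one smoothing over toy "useful"/"not useful" reviews (the project itself will use R/Python/Mahout on the Yelp data; the training examples are invented):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes model with add-one smoothing.
    `docs` is a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, label_counts, vocab

def classify(model, text):
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    def log_prob(label):
        lp = math.log(label_counts[label] / total)              # class prior
        denom = sum(word_counts[label].values()) + len(vocab)   # smoothed denominator
        for w in text.lower().split():
            lp += math.log((word_counts[label][w] + 1) / denom)  # word likelihood
        return lp
    return max(label_counts, key=log_prob)

docs = [
    ("great detailed review of the food", "useful"),
    ("helpful notes on service and parking", "useful"),
    ("meh", "not_useful"),
    ("ok", "not_useful"),
]
model = train_nb(docs)
print(classify(model, "detailed notes on the food"))  # "useful"
```

The same train/classify interface could be reused for the logistic and SVM baselines, making the comparison across classifiers straightforward.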

Progress and Expected Contributions
Progress: merged the review data with the business data in order to filter out the restaurant-type reviews only, and converted the JSON data to a data frame.
Expected contributions: providing a more functional and targeted review service to Yelp users, assuring them of valuable reviews based on their topics of interest.

E6893 Big Data Analytics: Study Buddy: Education Resources Search, Organization, Recommendation. Team Members: Zidong Gao (zg2185), Huan Gao (hg2357), Yuanhui Luo (yl3026), Yifan Yang (yy2495). Nov 20, 2014

Motivation
- Pertinence
- Free of advertisements
- Recommendation of study partners and potential interests

Dataset, Algorithm, Tools
Dataset: search results from general-purpose search engines (e.g. Google, Baidu, etc.); self-produced user data
Algorithm: various algorithms for classification and recommendation
Tools: Java, Mahout, Hadoop, Neo4j, HTML, JavaScript

Progress and Expected Contributions
Classification: data retrieval and preprocessing; model training and analysis; classification and results display.
Recommendation: collection of search history and preferences; graph database integration with Java; visualization of users and topics; recommendation and results display.
Expected contributions: Yuanhui Luo: front-end work, UI, train model, retrieve data; Zidong Gao: classification (train model), front-end work + visualization; Yifan Yang: classification (retrieve data + preprocessing + train model); Huan Gao: graph database + recommendation.

E6893 Big Data Analytics: Big Data Analysis of Log Data from Standardized IBT Test (TOEFL) Takers for the Effects of Changing Selections. Team Member: Jiaming Gu (jg3460). Nov 20, 2014

Motivation
Test takers often change their selections or choices on questions during a test. Therefore, it will be interesting to find out the effects of this behavior via analysis of log data: whether test takers gain more points when they change their choices or not. Moreover, we can gain more insight into test-taker behavior:
1. Helping test developers understand IBT test-taker behavior regarding choice switching.
2. Helping test takers prepare for the standardized IBT test more effectively and efficiently.
3. Improving the reliability and validity of IBT adaptive test questions via analysis of the log data.

Dataset, Algorithm, Tools
Dataset: ETS internal IBT database (standard tests database and test-taker behavior tracking system)
Algorithms: ETS testing behavior algorithms; ETS scoring algorithms; ETS fraud detection techniques; statistical analysis (including statistical methods and modeling analysis)
Tools: Hadoop, Pig, Python, D3.js

Progress and Expected Contributions
Progress: the datasets are ready; the analysis methods are decided, and implementation on the data has started for a trial phase.
Expected contributions: whether changing choices or selections helps test takers gain more points or not; moreover, the effects of changing behavior under sub-situations or categories, namely under what kinds of circumstances test takers are more likely to gain or lose points. The analysis may also contribute to research on testing fraud detection.

E6893 Big Data Analytics: Improving Education for At-risk Students. Team Members: Jairo Pava. Nov 20, 2014

Motivation
This project will use data from the Department of Education to identify elementary school students who are at risk of low middle school academic performance. These students will be compared to similar students who went on to succeed in middle school, so that individual learning plans can be designed to improve their chances of success. The project will offer a web front end to facilitate analysis.

Dataset, Algorithm, Tools
Dataset: The Early Childhood Longitudinal Study, Kindergarten Class of 1998-1999 (Link). Focuses on children's early school experiences, beginning with kindergarten and following children through middle school; enables researchers to study how a wide range of family, school, community, and individual factors are associated with school performance.
Algorithm: Clustering using Mahout to create groups of students based on the factors described above; classification to identify whether a student belongs to an at-risk group.
Tools: Hadoop, Hive, Mahout, Neo4j (depending on whether the performance problems faced during HW 3 can be addressed)
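The cluster-then-classify plan can be sketched in a few lines. This is a stdlib-only toy of the same k-means algorithm that Mahout's `kmeans` job runs over HDFS; the two-feature student vectors (e.g. scaled reading and math scores) are my own illustrative assumption, not fields from the ECLS-K dataset:

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(xs) / len(pts) for xs in zip(*pts))

def kmeans(points, k, iters=20):
    # Naive init: first k points (Mahout typically seeds via canopy or random).
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        # Recompute each centroid as its cluster mean; keep old one if empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids

def assign(p, centroids):
    """Nearest-centroid step: which group does a new student fall into?"""
    return min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))

# Hypothetical (reading, math) scores scaled to [0, 1]:
students = [(0.2, 0.3), (0.25, 0.2), (0.1, 0.15),   # lower-performing group
            (0.8, 0.9), (0.85, 0.8), (0.9, 0.95)]   # higher-performing group
centroids = kmeans(students, 2)
```

Once the clusters exist, classifying a student as "at risk" reduces to the `assign` step against the learned centroids.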

Progress and Expected Contributions
Progress:
- The data has been retrieved from the Department of Education website
- The features that will be used for clustering students have been identified
- A literature review of research on the data set has been completed
- Clustering and classification of students is pending
- Creation of the web front end is pending
Expected Contributions: To enable educators to identify children who may be at risk of poor academic performance, and to identify learning paths for at-risk children based on their similarity to other successful students.

E6893 Big Data Analytics Project Name: Scratch Analyzer Team Members: Jeff Bender

Motivation Scratch Initial Learning Environment Constructionism Collaborative Communities

Dataset, Algorithm, Tools
Dataset: Scratch Website
Existing Tools: Scrape, Eclipse IDE, Mahout Ecosystem

Progress and Expected Contributions Scratch JSON Lucene Framework SAGE Literature Map

E6893 Big Data Analytics: Oscar Award Analysis Based on Big Data Team Members: Xi Chen, Yunge Ma, Yuxuan Liu, Zhiyuan Guo

Motivation
Why did this year's Oscar winner win? Want to know whether your favourite movie will win the Oscar for Best Picture?

Dataset, Algorithm, Tools
Datasets:
- Yahoo Lab movie dataset: 211,231 ratings
- Amazon movie reviews dataset: 8 million reviews over 10 years
Algorithms:
- Canopy
- K-Means
Tools:
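The Canopy step is the cheap single pass that Mahout uses to seed K-Means with initial centers. A minimal 1-D sketch, assuming illustrative thresholds `t1 > t2` and a toy list of rating values (not the actual Yahoo/Amazon data):

```python
def canopy(points, t1, t2, dist):
    """Canopy clustering: one cheap pass over the data.

    t1 > t2. Points within t1 of a canopy center join the canopy;
    points within t2 are also removed from the pool, so they cannot
    seed another canopy. Canopy centers then seed K-Means.
    """
    pool = list(points)
    canopies = []
    while pool:
        center = pool.pop(0)
        members = [center]
        remaining = []
        for p in pool:
            d = dist(center, p)
            if d < t1:
                members.append(p)      # loosely belongs to this canopy
            if d >= t2:
                remaining.append(p)    # still eligible to seed a new canopy
        pool = remaining
        canopies.append((center, members))
    return canopies

# Toy 1-D "ratings" with an absolute-difference metric:
groups = canopy([0.0, 0.1, 0.2, 5.0, 5.1], t1=1.0, t2=0.5,
                dist=lambda a, b: abs(a - b))
```

In practice the distance would be computed over user-rating vectors, and the canopy centers would be handed to the K-Means job as its initial clusters.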

Progress and Expected Contributions
Progress:
- Finished homework 1, 2 & 3
- Learned clustering and classification
- Prepared dataset
Expected Contributions:
- Find the movie tastes of people within a specific age group
- Find the group whose taste is closest to the Oscar winners'
- Predict the 2015 Oscar nominations

E6893 Big Data Analytics: Error Correction in Large Volume OCR Datasets Team Members: Thomas Adams

Motivation
Large-scale Optical Character Recognition (OCR) of historical documents produces quantities of text that are impractical to correct by human review. Often, methods for error identification are ad hoc and address only well-known, easily detectable errors that are specific to a given dataset.
Proposed: development of a generalized Data Correction Toolkit, along with configurable workflows, that can apply and adapt to the idiosyncrasies of any particular dataset, for example different languages or data set types (names, addresses, locations, events).

Dataset, Algorithm, Tools
Dataset: 1.3 billion records comprising US City Directories spanning 160 years, 1829-1989. US City Directories were historical documents published by cities that often enumerated residents, addresses, occupations, familial relationships, etc.*
Algorithm: probability analysis on term frequencies; Markov models on character sequences; configurable heuristics.
Tools: MapReduce, HBase, Hive, Oozie (or AWS equivalent), Java
* Provided by special permission from Ancestry.com
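A character-sequence Markov model of the kind listed above can be sketched as a smoothed bigram scorer: tokens whose character transitions are improbable under a model trained on trusted text get low scores and become correction candidates. The function names, the `'^'`/`'$'` boundary markers, and the training corpus below are all illustrative assumptions:

```python
import math
from collections import Counter

def train_bigram_model(corpus_tokens):
    """Character-bigram counts from trusted tokens ('^'/'$' mark boundaries)."""
    counts = Counter()
    for tok in corpus_tokens:
        s = '^' + tok.lower() + '$'
        for a, b in zip(s, s[1:]):
            counts[(a, b)] += 1
    return counts

def log_prob(token, counts, alpha=1.0, vocab=28):
    """Add-alpha smoothed average log-probability per character transition."""
    s = '^' + token.lower() + '$'
    totals = Counter()
    for (a, _), n in counts.items():
        totals[a] += n
    lp = 0.0
    for a, b in zip(s, s[1:]):
        lp += math.log((counts[(a, b)] + alpha) / (totals[a] + alpha * vocab))
    return lp / (len(s) - 1)

# Toy "trusted" corpus; at scale this comes from term-frequency analysis:
counts = train_bigram_model(['robert'] * 50 + ['bob'] * 20 + ['rob'] * 20)
```

Because "bobert" starts with a word-initial transition the model has rarely seen, it scores below "robert", matching the project's goal of flagging "Bobert" while leaving "Bob" and "Rob" alone. The per-name counting and scoring parallelize naturally as MapReduce jobs.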

Progress and Expected Contributions
Goals: To allow the unsupervised correction of large datasets by detecting highly probable OCR errors; for example, the correction of "Bobert" to "Robert" should occur while allowing the existence of both "Bob" and "Rob".
Of historical interest and specific to this dataset, a secondary goal is the identification of fictitious individuals intentionally but covertly included by the publishers to facilitate copyright protection.

E6893 Big Data Analytics: Project Name: How to name your new-born baby (or babies)? Team Members: Pei Huang (ph2325)

Motivation
In this project, I would like to work on something (relatively) simple but very important in terms of answering a real-life question: for soon-to-be parents, what name(s) should you give your new-born baby (or babies)? I will look at historically popular male/female baby names and use various recommendation, classification, and/or clustering techniques to help expecting parents make the right name choice, so their children do not complain about having not-so-cool names for the rest of their lives.

Dataset, Algorithm, Tools
Dataset: Popular male/female baby names going back to the 19th century, from the Social Security Administration.
Algorithms: User-based and item-based recommendation; clustering and/or classification
Tools: R, HDFS/Hive/Pig/HBase, Mahout
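Item-based recommendation needs some notion of co-preference, which the raw SSA counts do not provide directly, so this sketch assumes a hypothetical input of parents' name shortlists (the "users"). It shows the co-occurrence idea behind item-based recommenders such as Mahout's item-similarity job:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: each "user" is one set of names a parent considered.
shortlists = [
    {'Emma', 'Olivia', 'Ava'},
    {'Emma', 'Olivia', 'Sophia'},
    {'Olivia', 'Sophia', 'Mia'},
    {'Liam', 'Noah'},
]

# Count how often each pair of names appears on the same shortlist.
cooc = defaultdict(int)
for names in shortlists:
    for a, b in combinations(sorted(names), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

def recommend(liked, k=2):
    """Score every unseen name by total co-occurrence with liked names."""
    scores = defaultdict(int)
    for name in liked:
        for (a, b), n in cooc.items():
            if a == name and b not in liked:
                scores[b] += n
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

For example, a parent who likes "Emma" is pointed at "Olivia" first, since the two co-occur most often. Clustering names by popularity trajectories over the decades could feed the same scorer as an alternative similarity signal.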