MLg. Big Data and Its Implication to Research Methodologies and Funding. Cornelia Caragea TARDIS 2014. November 7, 2014. Machine Learning Group



Similar documents
Keyphrase Extraction for Scholarly Big Data

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

Big Data and Open Data

Software Engineering for Big Data. CS846 Paulo Alencar David R. Cheriton School of Computer Science University of Waterloo

Chapter 7. Using Hadoop Cluster and MapReduce

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

Exploiting Data at Rest and Data in Motion with a Big Data Platform

Big Data and Society: The Use of Big Data in the ATHENA project

Big Data: Study in Structured and Unstructured Data

Sunnie Chung. Cleveland State University

Danny Wang, Ph.D. Vice President of Business Strategy and Risk Management Republic Bank

Hadoop for Enterprises:

Business Intelligence meets Big Data: An Overview on Security and Privacy

Location-Based Social Media Intelligence

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Exploring Big Data in Social Networks

Sentiment Analysis on Big Data

BIG DATA IN BUSINESS ENVIRONMENT

Text Analytics Beginner s Guide. Extracting Meaning from Unstructured Data

North Highland Data and Analytics. Data Governance Considerations for Big Data Analytics

The 4 Pillars of Technosoft s Big Data Practice

ANALYTICS IN BIG DATA ERA

BUDT 758B-0501: Big Data Analytics (Fall 2015) Decisions, Operations & Information Technologies Robert H. Smith School of Business

Text Mining - Scope and Applications

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Introduction to Big Data & Basic Data Analysis. Freddy Wetjen, National Library of Norway.

Of all the data in recorded human history, 90 percent has been created in the last two years. - Mark van Rijmenam, Think Bigger, 2014

How To Make Sense Of Data With Altilia

COMP9321 Web Application Engineering

Transforming the Telecoms Business using Big Data and Analytics

Big Data. Fast Forward. Putting data to productive use

Big Data Analytics. Lucas Rego Drumond

Web intelligence on Big Data in Today s Life. Web intelligence on Big Data in Today s Life,

Connecting library content using data mining and text analytics on structured and unstructured data

Big Data Executive Survey

ISSN: CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS

Data Warehousing in the Age of Big Data

Big Data Systems CS 5965/6965 FALL 2014

How To Use Big Data For Business

Next presentation starting soon Business Analytics using Big Data to gain competitive advantage

Analysis of Social Media Streams

CAP4773/CIS6930 Projects in Data Science, Fall 2014 [Review] Overview of Data Science

THE STATE OF Social Media Analytics. How Leading Marketers Are Using Social Media Analytics

PDF PREVIEW EMERGING TECHNOLOGIES. Applying Technologies for Social Media Data Analysis

SURVEY REPORT DATA SCIENCE SOCIETY 2014

Introduction to Text Mining and Semantics. Seth Grimes -- President, Alta Plana

BIG DATA FUNDAMENTALS

Big Data Buzzwords From A to Z. By Rick Whiting, CRN 4:00 PM ET Wed. Nov. 28, 2012

Web Archiving and Scholarly Use of Web Archives

Volume 3, Issue 8, August 2015 International Journal of Advance Research in Computer Science and Management Studies

Big Data. Patrick Derde. Use Cases and Architecture

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

CitationBase: A social tagging management portal for references

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari

UNIVERSITY OF INFINITE AMBITIONS. MASTER OF SCIENCE COMPUTER SCIENCE DATA SCIENCE AND SMART SERVICES

5 Keys to Unlocking the Big Data Analytics Puzzle. Anurag Tandon Director, Product Marketing March 26, 2014

Applications of Deep Learning to the GEOINT mission. June 2015

Data Refinery with Big Data Aspects

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Big Data Explained. An introduction to Big Data Science.

BIG DATA What it is and how to use?

UNIFY YOUR (BIG) DATA

Spatio-Temporal Patterns of Passengers Interests at London Tube Stations

Where is... How do I get to...

A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

Big Data and Analytics: Challenges and Opportunities

Information Management course

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

BIG DATA Alignment of Supply & Demand Nuria de Lama Representative of Atos Research &

Are You Ready for Big Data?

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Big Data Analytics: 14 November 2013

BIG DATA CHALLENGES AND PERSPECTIVES

Modern (Computational) Approaches to Big Data Analytics. CSC 576 Computer Science, University of Rochester Instructor: Ji Liu

Sources: Summary Data is exploding in volume, variety and velocity timely

The Future of Business Analytics is Now! 2013 IBM Corporation

How To Scale Out Of A Nosql Database

Statistics for BIG data

Introduction to Engineering Using Robotics Experiments Lecture 17 Big Data

Big Data. What is Big Data? Over the past years. Big Data. Big Data: Introduction and Applications

Mark E. Pruzansky MD. Local SEO Action Plan for. About your Local SEO Action Plan. Technical SEO. 301 Redirects. XML Sitemap. Robots.

IJRCS - International Journal of Research in Computer Science ISSN:

CSE4334/5334 Data Mining Lecturer 2: Introduction to Data Mining. Chengkai Li University of Texas at Arlington Spring 2016

Video and Image Analytics for Business, Consumer and Social Insights. Jialie Shen School of Information Systems, SMU

Transcription:

Big Data and Its Implication to Research Methodologies and Funding Cornelia Caragea TARDIS 2014 November 7, 2014 UNT Computer Science and Engineering

Data Everywhere Lots of data is being collected and warehoused News articles and news comments Weblogs, e-commerce data, customer reviews, forum threads Scientific documents PubMed currently has over 24 million documents Google Scholar is estimated to have 160 million documents Social network data Facebook passes 1.23 billion monthly active users, 945 million mobile users, and 757 million daily users Twitter usage: 284 million monthly active users, 500 million tweets sent per day, 80% of Twitter active users are on mobile UNT Computer Science and Engineering 2

Big Data - What is it? A term used to describe the exponential growth and availability of data in almost all domains. Types of data: Relational data (Tables / Transaction) Text data (Web) Semi-structured data (XML) Graph data Social networks, Semantic Web (RDF) Streaming data UNT Computer Science and Engineering 3

The Three Vs of Big Data Volume Huge data volumes stored Due, in part, to the decrease of storage costs. Velocity Drink from a fire hose! Data is now streaming in at unprecedented speed It must be dealt with in a timely manner. Variety Large number of diverse data sources to integrate Data comes in all forms: structured numeric data such as weather data, sensor data unstructured text video audio UNT Computer Science and Engineering 4

Big Data - Why it Matters? Big Data has become as important as the Internet A source of great benefits to discovery, learning, and staying informed It may lead to more accurate analyses, leading to more confident decision making Better decisions => reduced risk / improved revenue. Quickly identify customers who matter the most. Detect fraudulent behavior in real time. It can make our lives easier and more productive, if we can infer the relationships of interest to us from the data. It has the potential to save lives during disaster events. UNT Computer Science and Engineering 5

Potential Pitfalls of Big Data UNT Computer Science and Engineering 6

Big Data - Challenges To make effective use of these data E.g., how to reduce false positives in medical data, which may lead to unnecessary surgeries. Need to understand how to mine it effectively Do we store it all? Do we analyze it all? How can we mine it to our best advantage? What if the data volume is so large and varied that we do not know how to deal with it? How to ensure sensitive information cannot be inferred UNT Computer Science and Engineering 7

Implications of Big Data on Research Methodologies Move from simple (SQL) analytics to complex (non-sql) analytics: Knowledge Discovery Discovery of useful, possibly unexpected, patterns in data Data Mining Machine Learning Incorporate massive data and modalities in analysis Use high-performance technologies, e.g., Hadoop, MapReduce, etc. Determine upfront which data is relevant UNT Computer Science and Engineering 8

Example Scenarios using Big Data Extracting Keyphrases from Document Networks Understanding Disaster Events through Social Media Analyzing Images Privacy for the Modern Web UNT Computer Science and Engineering 9

Extracting Keyphrases from Document Networks Project funded by NSF UNT Computer Science and Engineering 10

Why Keyphrase Extraction? Keyphrase extraction is the task of automatically extracting descriptive phrases or concepts from a document. Keyphrases: Allow for efficient processing of more information in less time Are useful in many applications: topic tracking, information filtering and search, classification, clustering, and recommendation. UNT Computer Science and Engineering 11

Previous Approaches to Keyphrase Extraction Use generally only the textual content of the target document [Mihalcea and Tarau, 2004], [Liu et al., 2010]. Recently, models are proposed that incorporate a local neighborhood of a document [Wan and Xiao, 2008]. Obtained improvements over models that use only textual content. However, their neighborhood is limited to textually-similar documents. During these Big Data times access to giant document networks In addition to a document s textual content and textuallysimilar neighbors, are there other informative neighborhoods that exist in research document collections? Can these neighborhoods improve keyphrase extraction? UNT Computer Science and Engineering 12

From Data to Knowledge A typical scientific research paper: Proposes new problems or extends the state-of-the-art for existing research problems Cites relevant, previously-published papers in appropriate contexts The citations between research papers gives rise to an interlinked document network, commonly referred to as the citation network. UNT Computer Science and Engineering 13

Citation Networks In a citation network, information flows from one paper to another via the citation relation [Shi et al., 2010]. The influence of one paper on another as well as the flow of information are captured by means of citation contexts (short text segments surrounding a paper's mention) They serve as micro summaries of a cited paper! UNT Computer Science and Engineering 14

A Small Citation Network Citation contexts are very informative! UNT Computer Science and Engineering [Das G. and Caragea, 2014]; [Caragea et al., 2014] 15

Understanding Disaster Events through Social Media Project funded by NSF UNT Computer Science and Engineering 16

Social Media Social media is now part of our daily lives and everyday communication patterns. Scholars of disasters see hope in social media Used around crises, it can produce accurate results, often in advance of official communications. However, social media data has not been incorporate much in emergency response systems. UNT Computer Science and Engineering 17

Proof of Concept Using Twitter data from Hurricane Sandy, we identify the sentiment of tweets and then measure the distance of each categorized tweet from the epicenter of the hurricane. Sandy Hurricane Twitter Data: We crawled 12,933,053 tweets between 10-26-2012 and 11-12-2012. UNT Computer Science and Engineering 18

Why Sentiment Analysis in Disaster Events? Can help understand the dynamics of the social network The main users concerns and panics The emotional impacts of interactions among users. Can help obtain a holistic view about the general mood and the situation on the ground. Strong value to those experiencing the disaster and those seeking information about the disaster, as well as to the responder organizations. Extracting sentiments during a disaster could help responders develop stronger situational awareness of the disaster zone itself. UNT Computer Science and Engineering 19

Geo-Tagged Tweets Sentiment Analysis n n Could be integrated into systems to help response organizations have a real time map to display the physical disaster and the spikes of intense emotional activity in its proximity. Using Big Data : Automatically infer tweets geo-location Automatically identifying trustworthy information spread around disaster events UNT Computer Science and Engineering [Caragea, Squiciarrini, Stehle, Neppalli, Tapia; 2014] 20

Analyzing Images Privacy for the Modern Web Project funded by NSF UNT Computer Science and Engineering 21

Why Online Image Privacy? Yahoo! Claims 880 billion images are shared in 14. 30K images per minute in Instagram. 200K images per minute in Facebook. Sharing sensitive images is also on a rise. UNT Computer Science and Engineering http://www.sociallystacked.com/2014/01/the-growth-of-social-media-in-2014-40-surprising-stats-infographic/ 22

Why Online Image Privacy? http://www.wordstream.com/articles/google-privacy-internet-privacy UNT Computer Science and Engineering SocialNetworkSecurityPrivacy. Barracudalabs.com 23

Why Online Image Privacy? With the advancements in mobile technology and Web 2.0 online image sharing is very easy. Many users are ignorant of privacy policies and risks of image sharing. Social network privacy policies are complex Facebook explains 61 content privacy settings across 7 pages Linkedin explains 52 content privacy settings across 18 pages Great need for methods to detect sensitivity of an image and recommend privacy policies. UNT Computer Science and Engineering 24

Image Analysis for Privacy Setting Image features Content and tag feature, e.g., RGB, SIFT, Edge direction, and Face detection. Metadata types Tags Comments People Notes/Description Contextual information Type of objects Names of people Place of photo etc. Using thousands of Flickr images!! [Squiciarrini, Caragea, Balakavi; 2014]; [Godea, Caragea, Squiciarrini; 2014] UNT Computer Science and Engineering 25

Implications of Big Data on Research Funding Great opportunities for complex data analytics Great funding opportunities NSF awards for BIGDATA shown geographically UNT Computer Science and Engineering 26

Conclusions Machine learning for Big Data is an exciting field of research with limitless practical application: Finance, robotics, vision, machine translation, medicine, etc. Open field, lots of room for new work 12 IT skills that employers cannot say No to Machine Learning is #1 The beauty of machine learning? It never stops learning! UNT Computer Science and Engineering 27

References S. Das Gollapalli and C. Caragea (2014). Extracting Keyphrases from Research Papers using Citation Networks. In: Proceedings of the 28 th American Association for Artificial Intelligence (AAAI 2014). C. Caragea, F. Bulgarov, A. Godea, and S. Das Gollapalli. (2014). Citation-Enhanced Keyphrase Extraction from Research Papers: A Supervised Approach. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014). C. Caragea, A. Squicciarini, S. Stehle, K. Neppalli, A. H. Tapia. (2014). Mapping Moods: Geo- Mapped Sentiment Analysis During Hurricane Sandy. In: Proceedings of the 11th International Conference on Information Systems for Crisis Response and Management (ISCRAM 2014). A. Squicciarini, C. Caragea, and R. Balakavi. (2014). Analyzing Images' Privacy for the Modern Web. In: Proceedings of the 25th ACM Conference on Hypertext and Social Media (HT 2014). A. Godea, C. Caragea, A. Squicciarini. (2014). Analyzing Images to Improve Tag Recommendation. Work in Progress. UNT Computer Science and Engineering 28

References Z. Liu, W. Huang, Y. Zheng, and M. Sun. (2010). Automatic keyphrase extraction via topic decomposition. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 10). Mihalcea, R. & Tarau, P. (2004). Textrank: Bringing order into text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 04). Shi, X., Leskovec, J., & McFarland, D. A. (2010). Citing for high impact. In Proceedings of the Joint Conference on Digital Libraries (JCDL 10). Wan, X. & Xiao, J. (2008). Single document keyphrase extraction using neighborhood knowledge. In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI 08). UNT Computer Science and Engineering 29

Thank you! Andreea Godea Kishore Neppalli UNT Computer Science and Engineering Florin Bulgarov 30