
BIG DATA MEET PHARMACEUTICAL INDUSTRY: AN APPLICATION ON SOCIAL MEDIA DATA

Caterina Liberati, Paolo Mariani
Department of Economics, Management and Statistics, University of Milano-Bicocca
email: caterina.liberati@unimib.it; paolo.mariani@unimib.it

KEYWORDS: Data Analysis; Big Data; Pharmaceutical industry.

1 An introduction

Big Data represent the new frontier of data analysis. There is still substantial confusion between having a lot of data available and operating on Big Data. To clarify this misunderstanding, we can consider seven characteristics that highlight the differences between the two types of data and the peculiarities of each (Table 1).

Table 1: Data vs Big Data: the 7 V

V | Data | Big Data
Volume | Megabyte (MB, 10^6 bytes) | Zettabyte (ZB, 10^21 bytes)
Velocity | Static | Real time
Variety | Structured and rarely integrated from different sources | Structured and unstructured; not integrated; collected from different sources
Value | High | Not verified
Veracity | High | Low
Validity | High | Limited and with high obsolescence due to time

Nowadays businesses are attempting to employ Big Data in their operating contexts because they recognize the innovative and strategic value of such a source of information. The pharmaceutical industry, for example, although it takes a classical approach to handling data, is exploring this new context (Santoro, E., 2009). Inspired by a report produced by Cubeyou on Facebook data concerning the pharmaceutical sector (Cubeyou, 2014), we analyzed the microdata by applying statistical techniques for reducing sparse matrices. The present work describes the survey, the data collected and the modeling employed, which is based on a pre-processing step carried out with Principal Component Analysis. Finally, part of the results is shown in the last section.

2 The Goal and the Data

The goal of the Cubeyou report is to help businesses understand who their customers, or potential customers, are and what they do. It also aims to help businesses structure their marketing activities and make informed marketing decisions in the areas of media and content. In the report, the observed instances were selected among users of social media pages, websites and forums that write about drugs and health. Some of the pages visited are: Bristol-Myers Squibb; Amgen; Boehringer Ingelheim; Schering Plough; Baxter International; Takeda Pharmaceutical; AIFA, just to cite a few. The research was conducted on 5607 Italian subjects at the end of 2014, tracking all interactions between people and brands, products and services (shares, likes, tweets, pins, posts, ...) on Facebook and classifying them into categories. The algorithm employed in the report transforms the data into meaningful customer information such as personalities, psychographics, location, hobbies, interests and media (Kosinski et al., 2013). It was derived on a dataset including Facebook Likes provided by over 58,000 volunteers, together with their detailed demographic profiles and the results of several psychometric tests.

Cubeyou raw data are stored on a cloud platform with 20 servers active on the Amazon Web Services (AWS) infrastructure. Thanks to Hadoop2 and HBase, a distributed database of more than 5 TB can be stored and updated daily. Three main tables are used to store different types of data. The behavioural table, with more than 8 billion rows, stores users' actions on Facebook pages; each row contains a single user-by-page interaction. The User demographic table contains unstructured data about users' profiles. The Pages demographic table contains unstructured data about Facebook pages.
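The behavioural table holds one row per user-by-page interaction, while the analysis works on one row per user. A minimal sketch of how such rows might be pivoted into a 0/1 user-by-page matrix is given here. The function name `build_dummy_matrix`, the user identifiers and the toy rows are hypothetical, and the code runs in memory rather than on HBase; it is not the Cubeyou extraction pipeline.

```python
from collections import defaultdict


def build_dummy_matrix(interactions):
    """Pivot (user_id, page_id) rows into a 0/1 user-by-page matrix.

    Repeated interactions of the same user with the same page
    collapse to a single 1 (a dummy variable, not a count).
    """
    liked = defaultdict(set)
    for user_id, page_id in interactions:
        liked[user_id].add(page_id)

    users = sorted(liked)
    pages = sorted({page for user_pages in liked.values() for page in user_pages})
    matrix = [[1 if page in liked[user] else 0 for page in pages] for user in users]
    return users, pages, matrix


# Hypothetical toy rows standing in for the behavioural table.
rows = [
    ("u1", "Amgen"), ("u1", "AIFA"), ("u1", "Amgen"),
    ("u2", "AIFA"),
    ("u3", "Baxter International"),
]
users, pages, matrix = build_dummy_matrix(rows)
for user, row in zip(users, matrix):
    print(user, row)
```

At the scale described in the text the same aggregation would be expressed as a distributed job; the sketch only illustrates the shape of the result, one row per user and one dummy column per page.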
Starting from this infrastructure, for each project the data are extracted with queries based on users' profiles and behaviour. The extraction process aggregates the data by Facebook user and creates a user-by-variable matrix stored in a structured data frame. The logistic/linear regression employed to predict individual psycho-demographic profiles¹ therefore benefits from such external survey responses. The ambitious goal of our work is to propose a strategy to explore and synthesize the same raw data set using only the Facebook information, in order to help businesses make strategic decisions about communication, activities and targets. The matrix analysed has 5607 rows (Facebook users) and 159 dummy columns (pages visited/liked). To reduce the dimensionality of the data we exploited the classification of subjects, which distinguishes among: Pet Lovers; Outdoor Enthusiast; Techies; Car Lovers; Book Lovers; Social Activist; Gamers; Movie Lovers; Politically Active; Sport Lovers; Fashion Lovers; Music Lovers; Travel Lovers; Public Figures Followers; Food Lovers; Home Decorators & DIYs; Beauty and Wellness Aware; Business People; Housekeepers. In our application we focus on the selection of public figures who could be used to front a media campaign.

¹ Psychographics classifies people using their interests, attitudes, habits, values and opinions, and not only their objective demographic characteristics, to better understand what drives them, or could drive them, to purchase and engage with the company. It is based on the assumption that the types of products and brands an individual purchases reflect who that person is and how they live.

3 Methodology and results

Before considering techniques or models we have to face the question: are such data Big Data? Yes, if we consider their origin; no, if we consider their present format. The data have in fact been collapsed into a matrix using a MapReduce technique (as in Hadoop) and are now completely manageable. The second question is: how should this data set be processed? We suggest performing a pre-processing step as if the data had a big volume. The dimensions of the data matrix were reduced by generating a contingency table of 'likes', which yields 19 psychographic profiles (in rows) and 140 topics (in columns). In this way we have squeezed the data volume. An alternative choice could have been Multiple Correspondence Analysis, but the results might have been difficult to interpret. On this new data set we performed a Principal Component Analysis (Bolasco S., 2010), retaining only those factors with an eigenvalue greater than or equal to 1. In this way we generated a principal plane which explains 93.65% of the inertia, with a KMO index of 0.949 and a significance of 0.000 for Bartlett's test. The first factor, named Hedonism (47.97% of explained inertia), groups: Outdoor Enthusiast; Car Lovers; Gamers; Movie Lovers; Sport Lovers; Music Lovers; Beauty and Wellness Aware. The second factor, called Commitment (45.68% of explained inertia), groups: Pet Lovers; Techies; Book Lovers; Social Activist; Politically Active; Business People; Home Decorators & DIYs. The other profiles (Housekeepers; Public Figures Followers; Fashion Lovers; Food Lovers; Travel Lovers) are transversal with respect to the two axes.
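The two steps described above, collapsing the user-level matrix into a profile-by-topic table of likes and then keeping the principal components with eigenvalue of at least 1, can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the function names, the toy matrices, the use of NumPy, the PCA on the correlation matrix, and the orientation (profiles as variables, topics as observations) are all assumptions made for the example.

```python
import numpy as np


def profile_by_page_counts(membership, likes):
    """Cross-tabulate likes: rows = profiles, columns = pages."""
    # membership: users x profiles (0/1); likes: users x pages (0/1).
    return membership.T @ likes


def pca_kaiser(data):
    """PCA on the correlation matrix of `data` (observations x variables),
    keeping only the components whose eigenvalue is >= 1."""
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)  # standardise
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)    # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort, largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals >= 1.0
    explained = eigvals[keep] / eigvals.sum()  # share of total inertia
    scores = z @ eigvecs[:, keep]              # coordinates on the kept axes
    return eigvals[keep], explained, scores


# Hypothetical toy data: 6 users, 3 profiles, 4 pages.
membership = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0],
                       [0, 1, 1], [0, 0, 1], [1, 0, 1]])
likes = np.array([[1, 0, 0, 1], [1, 1, 0, 0], [0, 1, 0, 0],
                  [0, 1, 1, 0], [0, 0, 1, 1], [1, 0, 1, 1]])

counts = profile_by_page_counts(membership, likes)  # profiles x pages
eigvals, explained, scores = pca_kaiser(counts.T)   # pages as observations
print(counts)
print(eigvals, explained.sum())
```

In this sketch each row of `scores` gives the position of one page (topic) on the retained axes, which is the kind of coordinate used to place items on a factor plane.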
Using only the Italian public figures, we can describe the peculiarities of the users of social media pages, websites and forums who show an interest in drugs and health. In the first quadrant of the factor plane, where Hedonism and Commitment are both positive, we found Marco Travaglio; Fiorello; Beppe Grillo; Luciana Litizzetto. The second quadrant, with negative Hedonism and positive Commitment, contains Gino Strada; Papa Francesco; Massimo Gramellini. The third quadrant, where Hedonism and Commitment are both negative, contains Sonia Peronaci and Giulio Golia. The fourth quadrant, with positive Hedonism and negative Commitment, contains BAZ Marco Bazzoni; Belen Rodriguez; Paolo Bitta; Alessia Marcuzzi; Alessandro Borghese (Figure 1).
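The quadrant reading above follows directly from the signs of the two factor scores. A minimal sketch, assuming Hedonism is plotted on the horizontal axis and Commitment on the vertical one; the function name, the labels and the coordinates are hypothetical placeholders, not the scores obtained in the analysis.

```python
def quadrant(hedonism, commitment):
    """Return the quadrant (1 to 4) of a point on the factor plane,
    or None when the point lies exactly on one of the axes."""
    if hedonism > 0 and commitment > 0:
        return 1
    if hedonism < 0 and commitment > 0:
        return 2
    if hedonism < 0 and commitment < 0:
        return 3
    if hedonism > 0 and commitment < 0:
        return 4
    return None


# Hypothetical coordinates, for illustration only.
toy_scores = {
    "figure_a": (0.8, 0.5),
    "figure_b": (-0.6, 0.9),
    "figure_c": (-0.4, -0.7),
    "figure_d": (1.1, -0.3),
}
by_quadrant = {name: quadrant(h, c) for name, (h, c) in toy_scores.items()}
print(by_quadrant)
```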

Figure 1: PCA plane, Hedonism vs Commitment (Source: elaboration on Cubeyou data, November 2014)

The choice of a testimonial by a business will take into account the preferences of the psychographic profiles visualized here, in order to provide a coherent brand image.

4 Further developments

Social data offer a set of information associated with a user, which can be declared, if it comes from spontaneous self-profiling by the user (sex, age, workplace), or gathered, if it is derived from interactions with other content, business pages, users, etc. (Moubarak G. et al., 2010). Social data are Big Data when the seven V are respected. In our case the application was performed after pre-processing, so the use of standard statistical models was legitimate. Our results confirm how Big Data can meet the pharmaceutical industry and highlight how small the cluster of people interested in the pharmaceutical environment is. Within this cluster most users are interested in wellness in general and not in drugs or pathologies. We limited the analysis to public figures, but the further information available from the survey could be modeled at a later stage. The results are especially interesting given the context: pharmaceutical companies are attempting to measure adherence to therapies and the R&D area. To this end, the marketing departments of such companies (Greene J. A., Kesselheim A. S., 2010) are focusing their attention on Big Data.

References

BOLASCO, S. 2010. Analisi multidimensionale dei dati. Metodi, strategie e criteri di interpretazione. Roma: Carocci.

CUBEYOU 2014. The Pharmaceutical Industry: How Pharmaceutical Market Customers Benchmark Against Average Italian People. Industry Report, 11/2014, Cubeyou Inc., Redwood City, CA, USA.

GREENE, J. A. & KESSELHEIM, A. S. 2010. Pharmaceutical Marketing and the New Social Media. The New England Journal of Medicine, November 25.

KOSINSKI, M., STILLWELL, D. & GRAEPEL, T. 2013. Private traits and attributes are predictable from digital records of human behaviour. www.pnas.org/cgi/doi/10.1073/pnas.1218772110

MOUBARAK, G., GUIOT, A., BENHAMOU, Y. & HARIRI, S. 2010. Facebook activity of residents and fellows and its impact on the doctor-patient relationship. J Med Ethics. doi:10.1136/jme.2010.036293

SANTORO, E. 2009. Web 2.0 e Medicina: come social network, podcast, wiki e blog trasformano la comunicazione, l'assistenza e la formazione in sanità. Milano: Il Pensiero Scientifico Editore.