BIG DATA MEET PHARMACEUTICAL INDUSTRY: AN APPLICATION ON SOCIAL MEDIA DATA

Caterina Liberati (1), Paolo Mariani (1)

(1) Caterina Liberati, Paolo Mariani, Department of Economics, Management and Statistics, University of Milano-Bicocca; email: caterina.liberati@unimib.it; paolo.mariani@unimib.it

KEYWORDS: Data Analysis; Big Data; Pharmaceutical industry.

1 An introduction

Big Data represent the new frontier of data analysis. There is still substantial confusion between having a lot of data available and operating on Big Data. To clarify this misunderstanding, we can consider seven characteristics that highlight the differences between the two types of data and the peculiarities of each (Table 1):

Table 1: Data vs Big Data: the 7 V

V          Data                                                      Big Data
Volume     Megabyte (MB), 10^6 bytes                                 Zettabyte (ZB), 10^21 bytes
Velocity   Static                                                    Real time
Variety    Structured and rarely integrated from different sources   Structured and unstructured; not integrated; collected from different sources
Value      High                                                      Not verified
Veracity   High                                                      Low
Validity   High                                                      Limited, with high obsolescence over time

Nowadays businesses are attempting to employ Big Data in their operating contexts because they recognize the innovative and strategic value of this source of information. The pharmaceutical industry, for example, although it takes a classical approach to handling data, is exploring this new context (Santoro, 2009). Inspired by a report produced by Cubeyou on Facebook data concerning the pharmaceutical sector (Cubeyou, 2014), we analyzed the microdata by applying statistical techniques for reducing sparse matrices. The present work describes the survey, the data collected and the modeling employed, which is based on a pre-processing step carried out with Principal Component Analysis. Finally, part of the results is shown in the last section.

2 The Goal and the Data

The goal of the Cubeyou report is to help businesses understand who their customers or potential customers are and what they do. It also shows how to help businesses structure their marketing activities and make informed marketing decisions in the areas of media and content. In the report, the observed units were selected among users of social media pages, websites and forums that write about drugs and health. Some of the pages visited are: Bristol-Myers Squibb; Amgen; Boehringer Ingelheim; Schering Plough; Baxter International; Takeda Pharmaceutical; AIFA, just to cite a few. The research was conducted on 5607 Italian subjects at the end of 2014, tracking all interactions between people and brands, products and services (shares, likes, tweets, pins, posts, ...) on Facebook and classifying them into categories. The algorithm employed in the report transforms the data into meaningful customer information such as personalities, psychographics, location, hobbies, interests and media (Kosinski et al., 2013); it was derived on a data set including Facebook Likes provided by over 58,000 volunteers, together with their detailed demographic profiles and the results of several psychometric tests.

Cubeyou raw data are stored on a cloud platform with 20 servers active on the Amazon Web Services (AWS) infrastructure. Thanks to Hadoop2 and HBase, a distributed database of more than 5 TB can be stored and updated daily. Three main tables are used to store different types of data. The behavioural table, with more than 8 billion rows, stores users' actions on Facebook pages; each row contains a single user-by-page interaction. The User demographic table contains unstructured data about users' profiles. The Pages demographic table contains unstructured data about Facebook pages.
Starting from this infrastructure, data are extracted for each project with queries based on users' profiles and behaviour. The extraction process aggregates data by Facebook user and creates a user-by-variable matrix stored in a structured data frame. Therefore the logistic/linear regression employed to predict individual psycho-demographic profiles [1] benefits from such external survey responses. The ambitious goal of our work is to propose a strategy to explore and synthesize the same raw data set using only the Facebook information, in order to help businesses make strategic decisions about communication, activities and targets. The matrix analysed has 5607 rows (Facebook users) and 159 dummy columns (pages visited/liked). To reduce the dimensionality of the data we exploited the classification of subjects, which distinguishes among: Pet Lovers; Outdoor Enthusiast; Techies; Car Lovers; Book Lovers; Social Activist; Gamers; Movie Lovers; Politically Active; Sport Lovers; Fashion Lovers; Music Lovers; Travel Lovers; Public Figures Followers; Food Lovers; Home Decorators & DIYs; Beauty and Wellness Aware; Business People; Housekeepers. In our application we focus on the selection of public figures that can be used to front a media campaign.

[1] Psychographics classifies people using their interests, attitudes, habits, values and opinions, and not only their objective demographic characteristics, to better understand what drives or could drive them to purchase and engage with the company. It is based on the assumption that the types of products and brands an individual purchases reflect who that person is and how he or she lives.

3 Methodology and results

Before considering techniques or models we have to face the question: are these data Big Data? Yes, if we consider their origin; no, if we consider their present format. The data have in fact been collapsed into a matrix using a MapReduce technique (as in Hadoop) and are now completely manageable. The second question to consider is: how should this data set be processed? We suggest performing a pre-processing step as if the data had a big volume. The dimensions of the data matrix were reduced by generating a contingency table of 'likes'. We thus obtained 19 psychographic profiles (in rows) and 140 topics (in columns). In this way we squeezed the data volume. An alternative choice could have been Multiple Correspondence Analysis, but the interpretation of the results might have been difficult.

On this new data set we performed a Principal Component Analysis (Bolasco, 2010), retaining only those factors with eigenvalue greater than or equal to 1. In this way we generated a principal plane which explains 93.65% of the inertia, with KMO equal to 0.949 and Bartlett's test significance equal to 0.000. The first factor, named Hedonism (47.97% of explained inertia), groups: Outdoor Enthusiast; Car Lovers; Gamers; Movie Lovers; Sport Lovers; Music Lovers; Beauty and Wellness Aware. The second factor, called Commitment (45.68% of explained inertia), groups: Pet Lovers; Techies; Book Lovers; Social Activist; Politically Active; Business People; Home Decorators & DIYs. Other profiles, such as Housekeepers; Public Figures Followers; Fashion Lovers; Food Lovers; Travel Lovers, are transversal with respect to the two axes.
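The pre-processing and factor extraction described above can be illustrated with a minimal sketch in Python with NumPy. It is not the code used for the analysis: the data are randomly generated toy values, the function names `profile_by_page_table` and `pca_kaiser` are hypothetical, and two assumptions are made that the text does not state, namely that each user carries exactly one profile label and that the profiles play the role of variables while the topics play the role of observations.

```python
import numpy as np


def profile_by_page_table(likes, profiles, n_profiles):
    """Aggregate a user-by-page 0/1 matrix into a profile-by-page table of 'like' counts.

    Assumption (not from the paper): each user has exactly one profile label.
    """
    likes = np.asarray(likes)
    table = np.zeros((n_profiles, likes.shape[1]), dtype=int)
    # np.add.at accumulates every user row into the row of that user's profile
    np.add.at(table, np.asarray(profiles), likes)
    return table


def pca_kaiser(observations):
    """PCA on the correlation matrix, keeping components with eigenvalue >= 1."""
    x = np.asarray(observations, dtype=float)
    corr = np.corrcoef(x, rowvar=False)          # variables are the columns
    eigvals, eigvecs = np.linalg.eigh(corr)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals >= 1.0                        # eigenvalue-greater-or-equal-to-1 rule
    explained = eigvals / eigvals.sum()          # share of inertia per component
    z = (x - x.mean(axis=0)) / x.std(axis=0)     # standardized observations
    scores = z @ eigvecs[:, keep]                # coordinates on the retained factors
    return eigvals[keep], explained[keep], scores


# Toy data only: 200 users, 12 pages, 4 profiles (the real matrix is 5607 x 159).
rng = np.random.default_rng(0)
likes = rng.integers(0, 2, size=(200, 12))
profiles = rng.integers(0, 4, size=200)

table = profile_by_page_table(likes, profiles, n_profiles=4)
# Assumption: profiles are the variables, so pages (topics) are the observations.
eigvals, explained, scores = pca_kaiser(table.T)

print("retained eigenvalues:", eigvals)
print("share of inertia explained:", explained.sum())
```

The sketch does not compute the KMO index or Bartlett's test; it only shows how the aggregation squeezes the user dimension and how the eigenvalue rule selects the factors.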
Using only the Italian public figures we can explain the peculiarities of the users of social media pages, websites and forums who show an interest in drugs and health. In the first quadrant of the factor plane, where Hedonism and Commitment are both positive, we found: Marco Travaglio; Fiorello; Beppe Grillo; Luciana Littizzetto. The second quadrant, with negative Hedonism and positive Commitment, is filled by Gino Strada; Papa Francesco; Massimo Gramellini. Hedonism and Commitment are both negative in the third quadrant: Sonia Peronaci and Giulio Golia. The fourth quadrant, with positive Hedonism and negative Commitment, is filled by BAZ Marco Bazzoni; Belen Rodriguez; Paolo Bitta; Alessia Marcuzzi; Alessandro Borghese (Figure 1).
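The reading of the factor plane by quadrant can be made explicit with a small sketch. The coordinates below are invented placeholders, since the factor scores of the public figures are not reported in the text, and the names `quadrant` and `toy_scores` are hypothetical; only the sign convention (Hedonism on the first axis, Commitment on the second) comes from the description above.

```python
def quadrant(hedonism, commitment):
    """Return the quadrant (1 to 4) of a point on the Hedonism x Commitment plane.

    Points lying exactly on an axis belong to no quadrant and return None.
    """
    if hedonism > 0 and commitment > 0:
        return 1
    if hedonism < 0 and commitment > 0:
        return 2
    if hedonism < 0 and commitment < 0:
        return 3
    if hedonism > 0 and commitment < 0:
        return 4
    return None


# Placeholder scores (hedonism, commitment); NOT the values behind Figure 1.
toy_scores = {
    "figure_a": (0.8, 1.1),
    "figure_b": (-0.6, 0.9),
    "figure_c": (-1.2, -0.4),
    "figure_d": (1.5, -0.7),
}

# Group the labels by quadrant, as done when reading the factor plane.
by_quadrant = {}
for name, (h, c) in toy_scores.items():
    by_quadrant.setdefault(quadrant(h, c), []).append(name)

for q in sorted(by_quadrant):
    print(q, by_quadrant[q])
```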
Figure 1: PCA plane, Hedonism vs Commitment (Source: elaboration on Cubeyou data, November 2014)

The choice of a testimonial by a business should take into account the preferences of the psychographic profiles visualized here, in order to provide a coherent brand image.

4 Further developments

Social data offer a set of information associated with a user, which can be declared, if it comes from spontaneous self-profiling by the user (sex, age, workplace), or gathered, if it is derived from interactions with other contents, business pages, users, etc. (Moubarak et al., 2010). Social data are Big Data when the seven V are respected. In our case the application was performed after pre-processing, which legitimated the use of standard statistical models. Our results confirm how Big Data can meet the pharmaceutical industry and highlight how small the cluster of people interested in the pharmaceutical environment is. Within this cluster most users are interested in wellness in general and not in drugs or pathologies. We limited the analysis to the public figures, but the further information available from the survey could be modeled at a later stage. The results illustrated are relevant, especially if we consider the context: pharmaceutical companies are attempting to measure adherence to therapies and the R&D area. To this end the marketing departments of such companies (Greene and Kesselheim, 2010) are focusing their attention on Big Data.

References

BOLASCO, S. 2010. Analisi multidimensionale dei dati. Metodi, strategie e criteri di interpretazione. Roma: Carocci.
CUBEYOU 2014. The Pharmaceutical Industry: How Pharmaceutical Market Customers Benchmark Against Average Italian People. Industry Report, 11/2014, Cubeyou Inc., Redwood City, CA, USA.
GREENE, J. A. & KESSELHEIM, A. S. 2010. Pharmaceutical Marketing and the New Social Media. The New England Journal of Medicine, November 25.
KOSINSKI, M., STILLWELL, D. & GRAEPEL, T. 2013. Private traits and attributes are predictable from digital records of human behaviour. www.pnas.org/cgi/doi/10.1073/pnas.1218772110
MOUBARAK, G., GUIOT, A., BENHAMOU, Y. & HARIRI, S. 2010. Facebook activity of residents and fellows and its impact on the doctor-patient relationship. J Med Ethics. doi:10.1136/jme.2010.036293
SANTORO, E. 2009. Web 2.0 e Medicina: come social network, podcast, wiki e blog trasformano la comunicazione, l'assistenza e la formazione in sanità. Milano: Il Pensiero Scientifico Editore.