Social Sentiment Analysis Financial IndeXes, ICT-15-2014, Grant: 645425. D3.1 Data Requirement Analysis and Data Management Plan V1


Social Sentiment Analysis Financial IndeXes
ICT-15-2014 Grant: 645425
D3.1 Data Requirement Analysis and Data Management Plan V1

Project Coordinator: Dr. Brian Davis (NUI Galway)
Document Authors: Mr. Angelo Cavallini (3rdPLACE SRL)
With Contributions from: Dr. Brian Davis (NUI Galway), Mr. Ross McDermott (NUI Galway), Dr. Laurentiu Vasiliu (Peracton Ltd), Mr. Juha Vilhunen (Lionbridge)
First Quality Reviewer: Ms. Helen Lippell
Second Quality Reviewer: Dr. Brian Davis (NUI Galway)
Deliverable Nature: Report (R)
Dissemination Level: PU (Public)
Contractual Delivery Date: 01/07/2015
Actual Delivery Date: 01/07/2015
Document Version: 1.0
Total Number of Pages: 12

Document Information

Grant Agreement No: 645425
Full Title: Social Sentiment Analysis Financial IndeXes
Project Acronym: SSIX
Project URL:
Document URL: N/A
Project Start Date: 01/03/2015
Project Duration: 3 Years
EU Project Officer: Ms Alina Lupu
Workpackage No.: 3
Workpackage Title: Data Management
Deliverable No.: 3.1
Deliverable Title: Data Requirement Analysis and Data Management Plan V1
Abstract (for dissemination): The present document aims to provide a detailed overview of the platforms and techniques that can be used as data sources for the entire SSIX platform.
Keywords: Data Management, DMP, Data Sources, WP3, SSIX, Platforms, Data Assessment, Data Collection

Document History

Version  Date       Author (Partner)                 Comments
         /04/2015   Mr. Angelo Cavallini (3rdPLACE)  Document created
         /06/2015   Mr. Angelo Cavallini (3rdPLACE)  Initial draft
         /06/2015   Ms. Helen Lippell                First quality review
         /06/2015   Dr. Brian Davis (NUI Galway)     Second quality review
         /06/2015   Dr. Brian Davis (NUI Galway)     Final version

Executive Summary

This document aims to provide a detailed overview of the platforms and techniques that can be used as data sources for the entire SSIX platform. The document lists all the public data that can be retrieved and processed by the SSIX platform, along with the detailed results of the assessments performed on the identified data sources. It will help to highlight important structural aspects of the platform and to identify all the critical issues that have to be taken into consideration when dealing with certain data collection techniques.

* This is a public shortened version of D3.1. The rest of the content was considered commercially sensitive by the consortium members and therefore was not made public. The full deliverable was submitted to the EC. For any questions and queries, please contact the SSIX Coordinator for further details. *

Table of Contents

Document Information
Document History
Executive Summary
Table of Contents
1 Introduction
2 Data Sources Assessment
  2.1 Analysis Criteria
4 Data Management Plan
  4.1 Open Research Data Pilot (Open Access to Scientific Publications and Research Data)
5 Technical Issues
  5.1 Geographic Data Availability
  5.2 Real Time Data Processing
  5.3 Batch Data Processing
  5.4 Missing Data Handling
  5.5 Errors Handling
6 Conclusions

1 Introduction

The present document aims to provide a detailed overview of the platforms and techniques that can be used as data sources for the entire SSIX platform. The main activity of WP3 consists of implementing the processes dedicated to gathering data and metadata from several platforms and websites: the assorted information needed to calculate the SSIX indices that form the core logic of the platform. These processes will allow applications to interact with different social platforms, blogs and newsfeeds, and thus require the implementation of complex pieces of software dedicated to the collection and processing of increasing amounts of data.

This introductory document contains the results of the assessments performed on the identified data sources providing APIs, which helped to highlight important structural aspects of each platform and to identify all the critical issues that have to be taken into consideration when dealing with certain data collection techniques. For instance, almost every social platform (such as Facebook, Twitter or Google+) exposes public APIs that can be used to retrieve data from the available endpoints. In these cases, a fundamental factor driving the definition of the functional specifications is the usage limits imposed by these platforms. It is therefore important to keep an eye on these limits when defining the scope of the external data to be collected.

A dedicated chapter has been produced about the data gathering techniques to be used on those sources that do not provide API access (e.g. websites, forums, etc.), which require interaction with RSS feeds or HTML pages. Moreover, the document lists all the public data available on the different sources that can be retrieved and processed by the SSIX platform. These tables will help to identify the significant fields to be stored and sent to the subsequent NLP processes.
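To make the impact of such usage limits concrete, the Python sketch below shows a quota-aware collection loop. The endpoint URL, quota size and window length are hypothetical placeholders rather than the values of any specific platform; this is an illustration of the general pattern, not part of the SSIX implementation.

```python
import time
import requests  # third-party HTTP client

# Hypothetical values for illustration only: real endpoints, quotas
# and windows depend on the platform being queried (see Section 2).
API_ENDPOINT = "https://api.example.com/search"
MAX_REQUESTS_PER_WINDOW = 180   # e.g. a 15-minute rate-limit window
WINDOW_SECONDS = 15 * 60

def collect_within_quota(queries):
    """Issue at most MAX_REQUESTS_PER_WINDOW requests per window,
    sleeping until the window resets when the quota is exhausted."""
    window_start = time.monotonic()
    used = 0
    for query in queries:
        if used >= MAX_REQUESTS_PER_WINDOW:
            elapsed = time.monotonic() - window_start
            if elapsed < WINDOW_SECONDS:
                time.sleep(WINDOW_SECONDS - elapsed)  # wait for reset
            window_start, used = time.monotonic(), 0
        response = requests.get(API_ENDPOINT, params={"q": query}, timeout=30)
        used += 1
        yield response.json()
```

In practice each platform publishes its own limits, so the quota and window would be configured per source rather than hard-coded as above.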

2 Data Sources Assessment

2.1 Analysis Criteria

All the sources listed here have been analysed and evaluated using the same criteria. The following list provides a short description of each criterion considered during the assessment. If a criterion is not applicable to the analysed source, the label N/A is used; if no information is found about a criterion, the label UNREP is used.

- Source name: common name for the data source.
- Status: current status of access to the source (active, inactive or closed).
- API name: common name of the API exposed by the data source.
- Latest version: latest version available at the time this document is updated.
- Update frequency: frequency with which the API is updated.
- Costs: the cost and pricing policies for querying the data source, if applicable.
- Description: brief description of the API used.
- Interface type: the kind of protocol exposed by the API (e.g. SOAP, RESTful, etc.).
- Output type: description of the data format returned by the source.
- Authentication: description of the authentication process, if required.
- Data timezone: timezone used in the data returned by the source.
- Available languages: if the source allows filtering the returned contents by language, the list of supported languages.
- Region: the world region in which the source is valid, if applicable.
- Quota limits: documented limits on the number of possible calls to the API.
- Maximum amount of data per request: the maximum amount of data returned per request when the source is queried using the API.
- Maximum historical data depth: how far back in time data can be requested and retrieved from the source.
- Most recent data available: the last hour/day available when performing a request; this indicates the freshness of the data.
- Documentation: where to find official documentation about the source.
- Support: whether official support exists and where to find it.
- Resources: tools and resources available to test, debug or explore the API.
- Public data available: list of the public data that can be retrieved from the data source using the described method.
- Final considerations and known criticalities.
- Alternatives: possible services to use as an alternative in case of a major disruption of the official APIs.
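For illustration, the criteria above can also be read as a machine-readable record schema. The Python sketch below encodes one assessment as a dataclass; the field names are our own shorthand (not part of the deliverable's schema), with None standing in for N/A and a string sentinel for UNREP, as described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

UNREP = "UNREP"  # sentinel: no information found for this criterion

@dataclass
class SourceAssessment:
    """One assessment record, mirroring the criteria listed above.
    Optional fields hold None where a criterion is not applicable."""
    source_name: str
    status: str                      # active, inactive or closed
    api_name: Optional[str] = None
    latest_version: str = UNREP
    update_frequency: str = UNREP
    costs: str = UNREP
    description: str = ""
    interface_type: str = UNREP      # e.g. SOAP, RESTful
    output_type: str = UNREP         # e.g. JSON, XML
    authentication: Optional[str] = None
    data_timezone: str = UNREP
    available_languages: List[str] = field(default_factory=list)
    region: Optional[str] = None
    quota_limits: str = UNREP
    max_data_per_request: str = UNREP
    max_historical_depth: str = UNREP
    most_recent_data: str = UNREP
    documentation: str = UNREP
    support: str = UNREP
    resources: str = UNREP
    public_data_available: List[str] = field(default_factory=list)
    known_criticalities: str = ""
    alternatives: List[str] = field(default_factory=list)

# Example with illustrative values:
# twitter = SourceAssessment(source_name="Twitter", status="active")
```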

4 Data Management Plan

As reported in the official Guidelines on Data Management in Horizon 2020, the purpose of the Data Management Plan (DMP) is to provide an analysis of the main elements of the data management policy that will be used by the applicants with regard to all the datasets generated by the project. In detail, the DMP describes the data management lifecycle for all the data sets that will be collected, processed or generated by the project.

The DMP is not a fixed document but evolves during the lifespan of the project. Three versions of this document will therefore be released with the following cadence:

- V1 in M4
- V2 in M18
- V3 in M24

The Data Management Plan for the SSIX project can be found in Appendix A1 of the CO version of this deliverable.

4.1 Open Research Data Pilot (Open Access to Scientific Publications and Research Data)

The SSIX project is participating in the Open Research Data Pilot (ORDP), meaning that all publications, data and metadata needed to reproduce scientific experiments should be open access. SSIX will share the following as part of the ORDP:

- All open source software and components developed as part of the project work. Where some code is not open, it may be made available by industry partners as a web service/API for academic/research use, but not freely for commercial use.
- All public deliverables.
- Results and enriched data derived from experiments, allowing scientists/researchers to verify and repeat the experiments. This applies only to data that are not proprietary or commercially sensitive and do not have any ethical/legal implications. This is in line with the ORDP (see page 9 of the Guidelines on Open Access to Scientific Publications and Research Data in Horizon 2020), whereby a participant can opt out for reasons related to commercial interests, security or the protection of personal data.

All publications will ideally be made open access of type gold (immediately accessible for free), or failing that type green, which involves a period of embargo. Note that if a peer-reviewed publication contains any commercially sensitive content, it will pass through IPR screening before being published under open access, i.e. "protect first and publish later" [1]. Note also that if any publishers are not "open access friendly", SSIX can always opt to publish pre-print forms of articles as open access; this is becoming quite common across the research community.

All data to be shared with or as part of the ORDP will be placed in a repository that will point to all data entities shared within the ORDP, so that these can be accessed, mined, exploited, reproduced etc. [2]

[1] See page 3 of the Guidelines on Open Access to Scientific Publications and Research Data in Horizon 2020.
[2] See (a) and (i), page 2 of the Guidelines on Data Management in Horizon 2020.

The Open Access Infrastructure for Research in Europe (OpenAIRE) is the single point of entry for open access data and publications recommended by the EC [3]. We will seek to ensure that there is a single point of entry to all SSIX publications and data. ARAN (Access to Research at National University of Ireland, Galway) is already registered as an open access repository in OpenAIRE, as is OPUS - Volltextserver Universität Passau (OPUS, University of Passau). The consortium will ensure that all publications deposited within these repositories are correctly attributed via OpenAIRE to the SSIX project; likewise, any publications that are not deposited through NUIG or PASSAU will be submitted directly to OpenAIRE. The advantage of using ARAN or OPUS is that we automatically adhere to all the guidelines listed by the EC, since both repositories are listed under the Directory of Open Access Repositories (OpenDOAR).

Finally, the mandatory first version of the Data Management Plan (DMP) must be produced at month six to participate in the ORDP. The DMP is attached to the CO version of this deliverable in Appendix A1.

Not all the data collected or produced by the project will be made available to the public, due to legal implications: for example, the raw data gathered from Twitter, Facebook or other social media platforms are protected by strict terms and conditions that forbid distributing the contents to third parties. Again, this is in line with page 9 of the Guidelines on Open Access to Scientific Publications and Research Data in Horizon 2020: "if participation in the Pilot on Open Research Data is incompatible with existing rules concerning the protection of personal data".

The DMP provided in Appendix A1 of the CO version of this deliverable helps to identify the different datasets of the SSIX project, with particular attention to data sharing aspects, which may vary from case to case for an individual dataset.

[3] See page 7 of the Guidelines on Open Access to Scientific Publications and Research Data in Horizon 2020.

5 Technical Issues

5.1 Geographic Data Availability

A relevant piece of information for the calculation of the SSIX indices would be the geographic data derived from the collected contents. This would make it possible to attribute a specific origin to the detected sentiment trends, modulating the algorithms according to the position of the user who generated the content. Unfortunately, geolocation procedures cannot be implemented due to the lack of statistical relevance. In the case of Twitter, which is the main source for most of the incoming contents, we detected that less than 1% of tweets in English contain geographic coordinates, and only about 2% of the total tweets have the place field populated (information explicitly provided by the user). These numbers make it impossible to work with a statistically significant sample. Among the other sources, only Google+ and StockTwits seem to provide geolocation information through their APIs (StockTwits returns an undocumented location field). These platforms have not been tested yet, so it is not possible to provide any statistical sample of geographic data.

5.2 Real Time Data Processing

Real time data processing (or nearly real time, NRT, in our case) consists of collecting, analyzing and returning a content a few moments after it has been published on the original source. The delay may vary from milliseconds to seconds according to different technical and functional factors, among which: computing power, storage performance, incoming traffic, the number of filters and enrichments applied to the original data, and the complexity of the algorithms that manipulate the data. Among the sources assessed in the present document, only Twitter and StockTwits are suitable for processing data with an NRT approach. This is because they provide real time streaming APIs that push the contents to the clients as soon as they are posted, unlike the other sources, which are queried with a traditional REST API approach. These aspects are important for the definition of the algorithms created for the calculation of the SSIX indices.

5.3 Batch Data Processing

This kind of data processing refers to the procedures implemented in order to retrieve data from sources exposing traditional REST APIs (like Facebook, Google+ or LinkedIn) or not providing API access at all (as in the case of web page scraping or RSS feeds). These procedures, to be considered completely independent pieces of software, have to be scheduled to query the remote endpoints at given intervals. The interval suitable for each source cannot be determined a priori, since it is strongly related to the number of items (keywords, stocks, users, companies, etc.) to track and to the limits imposed by the API, such as the maximum number of requests per minute. The aim of the project, within the technical boundaries of the available infrastructure, is to collect and analyze the data with the highest frequency possible; much effort will therefore be put into the creation of data gathering procedures acting at least on a 15-minute basis.
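As a rough illustration of the scheduling pattern described in Section 5.3, the sketch below polls several sources at fixed intervals. The fetcher names and the 15-minute intervals are placeholders, not the project's actual procedures.

```python
import time

# Hypothetical fetchers: each would wrap one REST endpoint, RSS feed
# or scraping routine. Bodies are placeholders for illustration.
def fetch_facebook_posts():
    return []  # a real implementation would query the Graph API

def fetch_rss_items():
    return []  # a real implementation would parse one or more feeds

# (source label, fetch function, polling interval in seconds); the
# interval per source must be tuned to the number of tracked items
# and to the quota limits documented in the assessments.
SCHEDULE = [
    ("facebook", fetch_facebook_posts, 15 * 60),
    ("rss", fetch_rss_items, 15 * 60),
]

def run_batch_loop():
    """Poll each source whenever its interval has elapsed."""
    next_run = {name: 0.0 for name, _, _ in SCHEDULE}
    while True:
        now = time.monotonic()
        for name, fetch, interval in SCHEDULE:
            if now >= next_run[name]:
                items = fetch()  # query the remote endpoint
                next_run[name] = now + interval
                # items would be handed to the downstream NLP pipeline
        time.sleep(1)
```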

5.4 Missing Data Handling

Missing data will be addressed with dedicated handlers raising alerts in the following scenarios:

- The designated technical staff can be alerted (via email) in case of missing data for certain items or for repeated occurrences of data loss;
- The final user can be alerted with proper messages on the front-end, warning that some data are partial or missing.

It is important to distinguish between data missing because of malfunctions and data missing because of an effective lack of contents on the remote source. In the latter case, the lack of data itself provides significant information that should be taken into account by the algorithms.

5.5 Errors Handling

Errors occurring during the data retrieval processes have to be promptly flagged through dedicated alerting systems (e.g. email or SMS). In these cases the designated technical staff will intervene in order to understand the cause of the problem, recover the process and apply software patches if needed. Blocking errors may be caused by different factors, such as unreported changes in the remote endpoints (e.g. different field names in the JSON response) or technical malfunctions occurring on the server.
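The following minimal sketch combines the two mechanisms above: a guarded fetch that retries on blocking errors, alerts staff by email when all attempts fail, and separately flags an empty but successful response (a possible genuine lack of content). The addresses, SMTP host and function names are hypothetical assumptions, not SSIX components.

```python
import logging
import smtplib
from email.message import EmailMessage

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ssix.collector")

# Placeholder addresses and SMTP host: deployment-specific values.
ALERT_FROM, ALERT_TO, SMTP_HOST = "collector@example.org", "ops@example.org", "localhost"

def send_alert(subject, body):
    """Notify the designated technical staff by email (an SMS channel
    could be added behind the same interface)."""
    msg = EmailMessage()
    msg["From"], msg["To"], msg["Subject"] = ALERT_FROM, ALERT_TO, subject
    msg.set_content(body)
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

def guarded_fetch(source_name, fetch, retries=3):
    """Run a fetch routine, retrying on failure; alert staff when all
    attempts fail or when the source returns no data at all."""
    for attempt in range(1, retries + 1):
        try:
            items = fetch()
        except Exception as exc:  # blocking error: network, schema change...
            log.warning("%s failed (attempt %d): %s", source_name, attempt, exc)
            continue
        if not items:
            # The source answered but had no content: flag it, since the
            # absence of data may itself be significant (Section 5.4).
            send_alert(f"[SSIX] no data from {source_name}",
                       "Empty result; may be a genuine lack of content.")
        return items
    send_alert(f"[SSIX] {source_name} unreachable",
               f"All {retries} attempts failed; manual intervention needed.")
    return None
```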

6 Conclusions

The considerations that emerged from this document demonstrate the effectiveness of the assessments performed, since the reader can readily appreciate the risks and critical issues deriving from the data gathering activities, along with the complete lists of the collectable data.

First of all, there is a marked difference between real-time and batch processing: in our case, only Twitter and StockTwits are suitable for real-time processing, since they provide streaming APIs that push the contents to the connected clients as soon as they are published. For all the other sources it is necessary to develop ad hoc procedures that can be scheduled to request and retrieve specific data at regular intervals, in compliance with the limitations applied to certain APIs.

Another relevant topic emerging from this document is the variety of logics to be implemented in order to support the different data gathering techniques. For the SSIX project, the data will be sourced from APIs, RSS feeds, CSV files and web pages via HTML scraping: every modality requires a different approach that must take into consideration the substantial differences between the queried platforms.

The assessments collected in this document also helped to identify the critical issues related to this kind of activity. Most of them derive from experience, while others are clearly stated in the available documentation. In general we were able to identify common critical issues, mainly related to the following risks:

- The application being blocked because of excessive API usage;
- The application becoming obsolete because of changes in the API specifications, resulting in the inability to retrieve new data;
- The application becoming obsolete because of changes in the data structures, resulting in the inability to retrieve new data;
- Difficulty in finding appropriate and complete documentation during development activities, leading to the deployment of potentially wrong procedures;
- Difficulty in finding complete and reliable channels to monitor in order to stay updated on potential changes to the sources.

These risks can be reduced by adopting the following measures:

- Accurate analysis of the limitations before the definition of the functional specifications;
- Distribution of the applications on clustered systems in order to prevent IP blockage;
- Creation of dedicated tasks able to constantly monitor the status of the queried sources and send appropriate alerts to request manual intervention (a minimal sketch follows at the end of this chapter);
- Correct handling of application errors and exceptions raised by failures in data requests, in order to address specific warnings to the right persons;
- Accurate and deep testing sessions during development activities and after each deploy.

An ideal scenario would involve a 24-hour service of constant human monitoring, especially if the number of required servers increases exponentially. This would allow prompt intervention in case of errors or disruptions, but it requires substantial financial resources and cannot be instituted during this phase of the project.
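As one example of the monitoring tasks listed above, a collector can defensively validate the structure of each JSON record before processing it, so that an unreported change in field names (one of the risks identified in this chapter) raises an alert instead of silently breaking the pipeline. The field names below are illustrative assumptions; the real expected fields would come from the assessments in Section 2.

```python
import logging

log = logging.getLogger("ssix.monitor")

# Illustrative only: the fields the downstream pipeline depends on.
EXPECTED_FIELDS = {"id", "created_at", "text", "user"}

def validate_schema(record: dict, source_name: str) -> bool:
    """Return False and log an alert when a JSON record no longer
    carries the fields the pipeline expects (unreported API change)."""
    missing = EXPECTED_FIELDS - record.keys()
    if missing:
        log.error("schema change in %s: missing %s", source_name, sorted(missing))
        return False
    return True
```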