Introduction
This survey has been developed jointly by the United Nations Statistics Division (UNSD) and the United Nations Economic Commission for Europe (UNECE). Our goal is to provide an overview of active Big Data projects in Official Statistics in order to facilitate a more informed discussion. The survey has two focuses: sharing broad information about potential Big Data projects in the statistical community and sharing specific information about partnerships, data sources, and tools. Each submission addresses a single project together with its partnerships, data sources and tools, so please submit the survey multiple times - once for each project.
For this survey, a fairly wide definition of "Big Data" has been adopted: Big data are data sources with a high volume, velocity and variety of data, which require new tools and methods to capture, curate, manage, and process them in an efficient way. The UNECE working classification of types of big data may also help define the range of potential sources of big data being considered. This is a working classification and is not expected to be complete, so if you find a missing area please let us know.
The survey is meant for projects at every stage of development. If your project is still in the idea phase, we would like to hear about it and the data sources and partnerships you are exploring. Just leave blank any area that is not relevant to you.
At the start of the survey you will have a chance to let us know how widely you are able to share the submitted information. At a minimum, all information submitted will be shared between the survey authors and used in aggregate or anonymous form at the upcoming International Conference on Big Data for Official Statistics in Beijing and in reports to the UNECE High-Level Group for the Modernisation of Statistical Production and Services.
If you have any information you would rather email directly, or have a question, email tradestat@un.org.
Questions may also be submitted online at the Big Data Inventory Q&A page. Thank you for your time and participation.
PLEASE NOTE: submission for this survey is online only. This PDF copy is only for reference. Submit answers at: https://www.surveymonkey.com/s/bigdataproject
Page 1
Organizational Information
Organization: If there are multiple organizations, then the one leading the project.
Division: If applicable, the division or subunit of the organization doing the work.
Country:
Point of contact:
Name:
Position:
E-mail:
Can we share your organization and project title publicly?
- Yes, you may share it publicly [published openly online]
- Yes, you may share it with organizations participating in the survey [published online behind password]
- No, do not share this information except in aggregate / anonymous form
Can we share the detailed information you submit? Please be as open as possible. We are collecting this information primarily to help the wider official statistics community have an informed discussion. If there are a few details you would like to keep confidential you may submit them by email instead of including them in this survey.
- Yes, you may share it publicly [published openly online]
- Yes, you may share it with organizations participating in the survey [published online behind password]
- No, do not share this information except in aggregate / anonymous form
Further comments:
Page 2
Project Information
Project title: A descriptive title for the project or proposed project. If no official title has been chosen, use something that communicates the main idea.
Project status:
- Idea phase [skip to page 6]
- Proposed (in planning - not yet approved or funded)
- Approved (approved - not yet funded)
- Funded (approved and funded - not yet started)
- Ongoing (in execution phase)
- Completed
Page 3
Potential areas of use for this project: Select all that apply.
- Demographic and social statistics (including subjective well-being)
- Economic and financial statistics
- Environmental statistics
- Information society / ICT statistics
- Labour statistics
- Mobility statistics
- Price statistics
- Tourism statistics
- Transportation statistics
- Vital and civil registration statistics
- Other domains of official statistics
Would you qualify the project as:
- Exploratory / research
- Pilot with a goal of moving it to production if successful
- For the production of statistics
- Other (please specify)
Page 4
Project overview: Include broad information about your project objectives and scope with an emphasis on the implications for official statistics. Also indicate whether the project is primarily for research purposes or for production of statistics based on Big Data.
1-3 paragraphs
Page 5
Project Information
Outcomes (for incomplete projects include project goals): A summary of the results or desired results of the project with an emphasis on the implications for official statistics. When discussing actual outcomes, please note how detailed the project output is, e.g. coordinate (GPS), regional, or national information updated daily, monthly, or annually.
1-2 paragraphs
Most important lessons learned so far in the project: These might have to do with methodological issues, project management, training personnel, how to get funding, the technical tools used in the project, or something else entirely. Essentially, describe the largest challenges you have faced so far and how you have overcome (or plan to overcome) them.
1-2 paragraphs
Page 6
Project Information
Future directions: For completed projects: what are your next steps? For projects still in the early stages: discuss upcoming plans and challenges. Detailed questions about partnerships and data sources appear later in the questionnaire.
1-2 paragraphs
Page 7
Partnerships
Do you have any partnerships with other organizations or data providers on this project? The partnerships may still be in the very early stages.
- Yes
- No [skip to page 10]
Page 8
Partnerships
Please discuss any arrangements you have with your primary partner organization. If you have more than one partner on this project, please discuss them in the other comments space at the end of the partnerships section.
Name of partner: If you do not wish to disclose the name, please supply a working label - e.g. "Partner - Mobile Phone Data Provider".
Have you already discussed this partner when submitting information about a different project? There is no need to enter partner information again if you have already done so on another project - you may leave the rest of this section blank. But if there are details about the partnership that are specific to this project and that you would like to provide, you may do so.
- Yes (skip the partnerships section)
- Yes (do not skip)
- No [skip to page 10]
If yes, please specify the project title:
Page 9
Partnerships
Type of partner organization: Select all that apply.
- International Organization
- Government
- Commercial
- NGO
- Academia
- Other (please specify)
Type of partnership: Select all that apply.
- Data provider
- Data consumer / data aggregator (not first origin of data sources)
- Design partner
- Technology partner
- Analytical partner
- Other (please specify)
Current status of the partnership: We understand that forming a partnership may not fit cleanly into these categories. Please include further details if required in the 'Other comments' section below.
- In discussion
- Prototyping / Testing (some data partners allow this before a contract is signed)
- Contract in place
- Other (please specify)
Page 10
Are there any payments or financial arrangements with this partner?
- Yes
- No
- Not applicable / Do not wish to share
Details of the financial arrangements:
Other comments: Please discuss the organizational arrangements and the history of the partnership if applicable. If you have other partners on this project you may discuss them here.
1-2 paragraphs
Page 11
Data sources
Do you have any data sources for this project?
- Yes, we already had the data in our organization [skip to page 12]
- Yes, we have identified a new source and received the data [skip to page 12]
- Yes, we have a new source and are in discussions with the data provider to obtain the data
- Yes, we have identified a new source, but no discussion with the data provider has taken place
- No specific source has been identified yet
Page 12
Data sources & analysis (idea / discussion phase)
If there are sources that have been explored, but for which you still do not have data, please discuss them here:
Please discuss your planned data analysis tools and skills: For instance, are you considering using R, SAS, Python or other tool(s) for analysis? What tools are you already familiar with? What are you considering for the data store - local files, Hadoop, a NoSQL database, or a traditional relational database? Is your preference to run this on your own infrastructure, or on external infrastructure? Either way, what challenges do you face?
[SKIP TO PAGE 15 - FINAL COMMENTS]
Page 13
Data sources
Name of data source:
Have you already discussed this data source when submitting information about a different project? There is no need to enter the information again if you have already done so on another project - you may leave the rest of this section blank. But if there are details about the data source that are specific to this project and that you would like to provide, you may do so.
- Yes (skip the data sources section)
- Yes (do not skip)
- No [skip to page 14]
If yes, please specify the project title:
Page 14
Data sources
Data source description: A brief description of the data source.
Type of Big Data: Choose the most specific category that describes your data source. (The list does not appear in this PDF. See: http://www1.unece.org/stat/platform/display/bigdata/classification+of+types+of+big+data)
Who is the provider of the data source?
What is the geographical scope of the data source?
- Local
- Regional
- National
- International
- Other (please specify)
Page 15
How granular is the information in the data source? This should correspond to the unit of time used to mark individual records. For instance, a weather station might have a timestamp associated with each observation, but in the data set from the provider the data may be aggregated and averaged by hour. If multiple levels of granularity are available, specify the most detailed and describe the mix in the data description.
- Timestamp (seconds, milliseconds, or more specific)
- Minutes
- Hours
- Days
- Weeks
- Months
- Years
- Other (please specify)
How frequently are data source updates made available? You may not consume each update, but the updates are made available for consumption. If the data source falls between two categories, choose the higher-frequency category, e.g. a data source that posts updates every half hour can be considered constant.
- Constantly
- Hourly
- Daily
- Weekly
- Monthly
- Quarterly
- Annually
- Nearly static (highly infrequent / no schedule)
- Other (please specify)
Page 16
Have you established automatic links for transmitting this data source (e.g. API, automatic file download)?
- Yes
- No
- Other (please specify)
Links to the data source (if available): If available, include both the data source and a link to any data documentation. If there aren't public links but you would like us to host the files, please email tradestat@un.org.
Data (URL):
Documentation (URL):
Is this data source publicly available?
- Yes - accessible to everyone in an easy-to-use format (CSV, XML, JSON, API, Excel, etc.)
- Yes - accessible to everyone, but requires significant work to reformat (e.g. PDF, screen scraping, etc.)
- No - requires explicit permission and is not publicly posted
Are there any privacy and confidentiality issues related to this data source? If yes, please provide details about how you have addressed those issues. For instance, did you remove personal characteristics or change the geographic scope of the data? Was this done by you or by the provider? Did this degrade the usefulness of the data for analysis?
- No
- Yes (please give details):
Page 17
Any other comments about this data source or data provider: Some topics to consider addressing are...
- What were the largest limitations in working with this data source and how did you overcome them?
- What were the most useful levels of aggregation?
- What were the greatest challenges you had working with the data?
Page 18
Data analysis, tools and skills
Do you integrate traditional data sources with the new "Big Data" source discussed above?
- No
- Yes (please give details):
In your project, what technologies, methods and tools did you use during the Big Data processing life cycle? e.g. the SVM implementation in Python/scikit-learn to identify likely tourists, and Hadoop / MapReduce for preprocessing aggregation.
Hosting provider and/or partner: Did you use a third party, such as Amazon, deploy on your own servers, or share resources with a partner organization? If you are comfortable sharing it, approximately how much did this cost?
Page 19
Final comments
Do you have any other comments you would like to share?
Page 20