Overview of edX Analytics

I. Data Available from edX

EdX provides researchers with data about your institution's classes running on edx.org and edge.edx.org. This includes:

- Course data
- Student information
- Event tracking
- Courseware data
- Discussion forum data
- Wiki data
- Student state

For more details, see Research Data Package Details. Complete data formats are described in edX Data Documentation.

In addition, course staff can access some real-time course data from the Instructor Dashboard in the edX Learning Management System.

II. Description of edX Research Packages

EdX provides two types of research data to partners who are running classes on edx.org and edge.edx.org:

- Log (event tracking) data
- Database data, including student information

To access log and database data packages, you download files from edX, then extract the information from those files for analysis, as described in How Do I Get my Research Data Package?

Event tracking data

The edX platform gathers tracking information on almost every interaction of every student. Details about collected information are available in the Tracking Log documentation.
Event tracking data for your institution is collected in a file named: Date-Institution-tracking.tar

For example: 2013-10-27-UniversityA-tracking.tar

When you extract the contents of this TAR file, sub-directories are created for each edX server that the course is running on. For example, you may see the following sub-directories:

- prod-edxapp-003
- prod-edxapp-004
- prod-edxapp-005

Each of these sub-directories contains a file of tracking data for each day. The TAR file is cumulative; that is, it contains files for all previous days your course was running on that server.

The filename format for event tracking data files is: Date_Institution.log.gpg

For example: 2013-10-22_UniversityA.log.gpg

You must decrypt these files.

Note: Because a course runs on multiple servers, during analysis you must combine events from each server to get a complete picture of course activity.

Database data

Database data files are collected in a ZIP file named: Institution-Date.zip

For example: UniversityA-2013-10-27.zip

When you extract the contents of this ZIP file, files are placed in the same directory as the ZIP file. The filename format of extracted files is: institution-course-date-data_type-server-analytics.sql.gpg

For example: UniversityA-Physics101-2013_user_id_map-prod-analytics.sql.gpg

You must decrypt these files.

Data file types

The data files are views on database tables used by the edX Learning Management System. The following table describes the types of data files that edX delivers.
| Type | Filename Format | Description | edX Data Formats Documentation |
|---|---|---|---|
| Authorized Users | Institution-Course-Date-auth_user-Server-analytics.sql.gpg | Information about users authorized to access the course. | auth_user table |
| Authorized User Profiles | Institution-Course-Date-auth_userprofile-Server-analytics.sql.gpg | Information about student demographics. | auth_userprofile table |
| Generated Certificates | certificates_generatedcertificate-Server-analytics.sql.gpg | Certificate status for graded students after course completion. | certificates_generatedcertificate table |
| Courseware | courseware_studentmodule-Server-analytics.sql.gpg | Information about courseware state for each student. There is a separate row for each (UNIT?) in the course. For courses that do not have any records in this table, no file is produced. | courseware_studentmodule table |
| Forums | Server.mongo.gpg | Course discussion forum data. | Discussion forum data |
| Course Enrollment | student_courseenrollment-Server-analytics.sql.gpg | Information about students enrolled in the course, enrollment status, and type of enrollment. | student_courseenrollment table |
| User IDs | Institution-Course-Date-user_id_map-Server-analytics.sql.gpg | A mapping of user IDs and obfuscated IDs used in surveys. | user_id_map table |
| Wiki articles | Institution-Course-Date-wiki_article-Server-analytics.sql.gpg | Course wiki data. | Wiki data |

III. Frequently Asked Questions

1. What kind of data does edX store?

EdX collects course data of two different types: stateful data and event data. The stateful data includes course XML, the self-reported demographic data that students supply when they register, the posts they make to course discussions and wikis, and student answers to assessments. The event data is a timestamped record of page requests and explicit events made in a course over a period of time.

2. Who has access to edX data?

Partner institutions can arrange to download raw stateful and event data for their edX courses, even while the courses are still in progress. Course staff have access to some of the data for a course from the Instructor Dashboard, including some aggregate statistics, as soon as they create the course.

3. How is data delivered? What is a data package?

To package course data and deliver it to researchers, edX uses Amazon Web Services (AWS) Simple Storage Service (Amazon S3). EdX creates an account on Amazon S3 for each partner institution, and a single designated "data czar" can access that account. Only an institution's data czar can access S3 to download the data package, which is the collection of files that contain raw, unprocessed stateful and event data.

4. How often are data packages delivered?

Data packages are available for download from Amazon S3 weekly. They are usually available on Saturdays or Sundays.

5. What do the data packages contain? Do data packages include custom reports about each of my courses?

The data package contains a ZIP file with the database state, that is, the
stateful data, and a compressed TAR file with daily event data. The data packages contain only raw, unprocessed data, without aggregation or customizations.

6. Our university does not yet have live courses. Can we get a sample of all the data formats, so that we can begin setting up research projects now?

Sample data is available only on request; however, the data formats are described in the edX Data Documentation guide.

7. What is the typical size of the data for a 7-10 week course?

Data packages contain the data for all courses offered by a partner institution. They include data for courses that are in progress, not yet started, and complete, and that have different course assets and numbers of enrolled students. All of these factors affect the amount of data collected for each course. That said, in general the stateful data for a course can be approximately 100 MB or larger, and the event data can be approximately 1 GB or larger.

8. Is there a sample data package?

An obfuscated data sample is available from edX on request.

9. What resources does a university need to have in order to start doing research with the data?

Educational research draws on different skills and areas of expertise, so you are likely to need a team of contributors. The team is likely to include database administrators who can work with raw, NoSQL data to set up a SQL database and queries, engineers who can interpret files in JSON and XML format, statisticians and data analysts to mine the data, and educational researchers to pose questions and interpret results.

10. What is a Data Czar? How do we get one?

A data czar is the single representative of a partner institution who is given credentials to securely download and decrypt edX data packages, and who is the primary contact for data within the organization. The data czar is also responsible for transferring the data securely after it is received by your organization.
After partners select an individual to be their data czar, they work with their edX Program Manager to get the required credentials set up.

11. What technical qualifications should a Data Czar have?

Data czars have experience working with sensitive student data, are familiar with encryption/decryption and file transfer protocols, and can sanity check, copy, move, and store large files. Some data czars are also database administrators who can work with SQL and NoSQL databases and write queries on the data after it is downloaded.

12. I am a Data Czar. How do I get data for my university?

You work with your edX Program Manager to set up a public/private key pair for GNU Privacy Guard. EdX creates an account on Amazon S3, and provides your Program Manager with the credentials for account access. (See How Do I Get my Research Data Package?) You download your data packages from Amazon S3 and decrypt them using the private key.

13. I am a course author, and would like some statistics about my course. Whom should I contact?

The Instructor Dashboard provides access to certain demographics, and provides options to download CSV files of student data and course grades. For complete course data and help working with it, contact your data czar and the team that is working with your institution's data to conduct research.

14. Is there documentation?

The edX Data Documentation guide provides information about the data and its structure. EdX also hosts a wiki with information and a discussion forum.

Getting the first Data Package

1. I am a Data Czar for an xConsortium partner University or Organization. How do I get my data packages?

You download data packages from an Amazon S3 account. To access the account, you use the credentials provided by edX. Contact your edX Program Manager if you have not received your Amazon S3 username and credentials.

2. How many files should I be downloading for one data package?

To deliver the data each week, the edX team uploads two archives of encrypted files to Amazon S3.

3. What format are the files in the data package in? What do I do to "open" the data package?

The data package contains a TAR file of event data and a ZIP file of stateful data. To open the package, you uncompress each archive. Each archive contains a folder with subfolders and encrypted GPG files. You then use your private GPG key to decrypt the GPG files.
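The steps of opening a data package can be sketched in Python. This is a minimal sketch under stated assumptions, not edX tooling: the function names (`unpack`, `tracking_files_by_date`, `gpg_decrypt_command`) are hypothetical, the file name pattern is inferred from the example 2013-10-22_UniversityA.log.gpg, and the decryption command assumes that the gpg command-line tool is installed and that your private key is already in your keyring.

```python
import re
import tarfile
import zipfile
from collections import defaultdict
from pathlib import Path

# Assumed pattern, inferred from the example 2013-10-22_UniversityA.log.gpg.
TRACKING_NAME = re.compile(r"^(\d{4}-\d{2}-\d{2})_(.+)\.log\.gpg$")


def unpack(archive, dest):
    """Extract a TAR or ZIP archive into dest and return dest as a Path."""
    archive, dest = Path(archive), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    if tarfile.is_tarfile(archive):
        with tarfile.open(archive) as tar:
            tar.extractall(dest)
    elif zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
    else:
        raise ValueError(f"not a TAR or ZIP archive: {archive}")
    return dest


def tracking_files_by_date(root):
    """Group encrypted tracking logs by date across all server sub-directories.

    A course runs on several servers, so one day of activity is spread
    over one file per server; grouping by date shows which files must be
    combined during analysis.
    """
    by_date = defaultdict(list)
    for path in sorted(Path(root).rglob("*.log.gpg")):
        match = TRACKING_NAME.match(path.name)
        if match:
            by_date[match.group(1)].append(path)
    return dict(by_date)


def gpg_decrypt_command(encrypted, output):
    """Build (but do not run) a gpg command that decrypts one file."""
    return ["gpg", "--output", str(output), "--decrypt", str(encrypted)]
```

To decrypt a file, the list returned by `gpg_decrypt_command` could be passed to `subprocess.run(..., check=True)`; building the command separately keeps the sketch testable on machines without gpg.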
4. What encryption mechanism does edX use?

You define a GNU Privacy Guard (GPG) key pair, which consists of a public key and a private key. You share only your public key with edX. EdX uses the public key to encrypt your data files before compressing them.

5. How do I decrypt the files in my data package?

You use your private GPG key to decrypt the data package. Different utilities for the decryption process are available for the Windows and Mac operating systems.

6. What do all the acronyms mean? What are S3, PGP/GPG, and AWS?

AWS is Amazon Web Services, Amazon's cloud computing platform. S3 is the Simple Storage Service from AWS, an online service for storing files, that edX uses for transferring data packages. PGP, or Pretty Good Privacy, is a data encryption and decryption program. GPG, or GNU Privacy Guard, is a free implementation of the OpenPGP standard that can be used in place of PGP.

Sources:

https://www.edx.org
https://edge.edx.org
http://edx.readthedocs.org/projects/devdata/en/latest/
https://edx-wiki.atlassian.net/wiki/display/oa/open+edx+analytics+home