Big Data Privacy Scenarios Elizabeth Bruce, Karen Sollins, Mona Vernon, and Danny Weitzner




Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR-2015-030, October 1, 2015
Big Data Privacy Scenarios
Elizabeth Bruce, Karen Sollins, Mona Vernon, and Danny Weitzner
Massachusetts Institute of Technology, Cambridge, MA 02139 USA, www.csail.mit.edu

Big Data Privacy Scenarios
Big Data Privacy Working Group
September 2015

Big Data Privacy Working Group Chairs:
Elizabeth Bruce (MIT)
Karen Sollins (MIT)
Mona Vernon (Thomson Reuters)
Danny Weitzner (MIT)

Acknowledgements

We gratefully acknowledge the many contributors to this Scenario Working Document. This includes all of the Big Data Privacy Working Group leaders, team members, and guides for their thoughtful efforts. A special thank you to Dazza Greenwood of MIT Media Lab and Simon Thompson from BT for creating the original template for the scenario summaries.

Big Data Privacy Scenario Contributors/Teams: Micah Altman (MIT), Elizabeth Bruce (MIT), David Dietrich (EMC), John Ellenberger (SAP), Dazza Greenwood (MIT), Maritza Johnson (Facebook), Lalana Kagal (MIT), Jake Kendall (Gates Foundation), Cameron Kerry (MIT), Ilaria Liccardi (MIT), Yves-Alexandre de Montjoye (MIT), Una-May O'Reilly (MIT), Michael Power (Osgoode Hall Law School), Arnie Rosenthal (Mitre), Karen Sollins (MIT), Simon Thompson (BT), Mona Vernon (Thomson Reuters), Evelyne Viegas (Microsoft), and James Williams (Google/University of Toronto)

Big Data Privacy Working Group Editor: Barbara Mack (Pingry Hill Enterprises, Inc.)

Table of Contents

Executive Summary
    Use Case: Massive Open Online Courses (MOOCs) and Online Learning Environments (OLEs)
    Use Case: Research Infrastructure for Social Media
    Use Case: Data for Good: Public Good and Public Policy Research Using Sensor Data/Mobile Devices
    Other Use Cases
    Conclusions
1 Introduction
    1.1 Overarching Observations
    1.2 Stakeholders
    1.3 Open Questions and Issues
    1.4 Remainder of This Document
2 Privacy Issues for Data Collected from MOOCs and Online Learning Environments
    2.1 Abstract
    2.2 Detailed Narrative
    2.3 Privacy Impact Assessment - The Specific Context of Scenario 1
    2.4 Goals of OLEs
    2.5 Data
    2.6 Systems
    2.7 Risks
    2.8 Rules/Regulations
    2.9 Technologies
    2.10 Privacy Constraints
    2.11 Technology Informing and Supporting OLE Data Privacy and Confidentiality Policy
3 Research Infrastructure for Social Media
    3.1 Abstract
    3.2 Scenario Introduction
    3.3 Stakeholders and Interactions
    3.4 Systems
    3.5 Analyze the Scenario
    3.6 Innovation Ideas and Opportunities
    3.7 Notes on Scenario
    3.8 References
4 Data for Good: Public Good and Public Policy Research Using Sensor Data/Mobile Devices
    4.1 Abstract
    4.2 Scenario Development
    4.3 Operation of Scenarios
    4.4 Regulatory Environment
    4.5 Data Utility
    4.6 Privacy
    4.7 Critical Issues
    4.8 Promising Paths Forward
    4.9 References
5 Additional Use Cases
    5.1 Privacy in Aggregated Diverse Data Sets
    5.2 Creation, Management, Application and Auditing of Consent on Personal Data
    5.3 Consumer Privacy/Retail Marketing
    5.4 Genomics and Health
6 Conclusions
A. Appendix: Privacy Scenario Template
B. Appendix: Stakeholders
C. Appendix: Stakeholder Data from MOOCs and Online Learning Environments (OLEs)

Executive Summary
Karen Sollins (MIT)

The MIT Big Data Privacy Working Group launched a series of workshops beginning in 2013 to explore the challenges and possible technological solutions to elements of those challenges. As a successor to those workshops, the working group began to focus on a collection of real world scenarios and use cases, to illuminate the challenges more concretely.

The deeper question explored by this exercise is what is distinctive about privacy in the context of big data. Although privacy as a general issue in computing and communications remains a topic of significant attention and disagreement, in this effort we narrow our attention to the "Big Data" context, to understand more clearly the particular challenges and possible approaches that derive from the collection, pooling, and combination of vast amounts of data, specifically about people. This focus on people as the subjects of attention in the Big Data context is central to the definition of privacy, which itself focuses on control of data, information and inferences about people and how they can or should be used, exposed, or otherwise made available.

We summarize here an initial list of issues for privacy that derive specifically from the nature of big data. These derive from observations across the real world scenarios and use cases explored in this project as well as wider reading and discussions.

- Scale: The sheer size of the datasets leads to challenges in creating, managing and applying privacy policies.
- Diversity: The increased likelihood of more and more diverse participants in Big Data collection, management, and use leads to differing agendas and objectives. By nature, this is likely to lead to contradictory agendas and objectives.
- Integration: With increased data management technologies (e.g. cloud services, data lakes, and so forth) and integration across datasets, with new and often surprising opportunities for cross-product inferences, will also come new "information" about individuals and their behaviors.
- Impact on secondary participants: Because many pieces of information are reflective of not only the targeted subject, but secondary, often unattended, participants, the inferences and resulting information will increasingly be reflective of other people, not originally considered as the subject of privacy concerns and approaches.
- Need for emergent policies for emergent information: As inferences over merged datasets occur, emergent information or understanding will occur. Although each unique dataset may have existing privacy policies and enforcement mechanisms, it is not clear that it is possible to develop the requisite and appropriate emerged privacy policies and appropriate enforcement of them automatically.

The primary content of this report is a number of real world scenarios, resulting from discussion and then subgroup efforts within the privacy working group. Each case was analyzed along a collection of axes: key stakeholders, data lifecycle, key systems, potential privacy risks, and existing best practices within the context of that scenario. The template was laid out initially by Dazza Greenwood of the MIT Media Lab and Simon Thompson of BT and can be found in Appendix A.

As a result of collating these scenarios, two kinds of points emerged across them. The first is a small set of common questions. The second is a list of categories of stakeholders. We summarize those here.

The key questions that arose are:
- What new/unique challenges emerge when it comes to managing privacy in the context of big data?
- How do we assess benefit vs. risk?
- How do we evaluate "harm"? Given that harm is subjective, difficult to quantify, and falls on a spectrum from inappropriate online advertisements to discrimination in setting insurance rates to life or death medical intervention, is it possible to evaluate harm uniformly and if so, how would one do that?
- How can we establish and assess trust among the stakeholders? What mechanisms/models do we have for understanding trust?

A table of the categories of stakeholders derived from the scenarios can be found in Appendix B. In addition, Appendix C demonstrates an application of these stakeholder categories to the first use scenario on MOOCs and OLEs.

The initial list of categories of stakeholders includes:
- Data subject(s)
- Decision-maker
- Data collector
- Data curator
- Data analyst
- Data platform provider
- Policy enforcer
- Auditor

Both of these sets of points are discussed in more detail in the companion technology-mapping document, and are provided here to identify crosscutting observations from the various scenarios. Although the currently identified set of potential stakeholders is listed here, it is important to recognize that privacy is a much more complex problem that concerns more than the stakeholders alone.

The Working Group explored seven use cases. This report presents three in their complete forms in Sections 2-4; those three cases are described briefly in the executive summary. In addition, in the final section of the report, Section 5, summaries of the additional four cases are presented, because these were studied in less detail.

Use Case: Massive Open Online Courses (MOOCs) and Online Learning Environments (OLEs)

Any online learning situation provides an opportunity to record all the activities of everyone involved in the teaching experience, primarily but not exclusively students and teaching staff. MOOCs as a subset of online learning take this to new scales and often to new levels of automation as well as expanding roles in the collection of, responsibility for, and use of the data that derives from those teaching experiences.

In focusing on privacy in this context, one is concentrating on questions of which behaviors and information about individuals may be exposed in ways that they may find contradict their models of privacy. The challenges arise at least in part from the new opportunities that MOOCs provide to collect, merge, and reason over educational data at a scale and with an ease not previously possible. The data may now be used in novel ways and involve new stakeholders including data curators, data platform providers, researchers, and those interested in novel approaches to pedagogy. The challenge is to achieve that in ways that respect the privacy of the individual student, perhaps the teaching staff, and possibly secondary people as well, such as parents and guardians, especially in the face of asymmetric power relationships. One aspect of the challenge is to understand the implications of privacy "violations" in this context. They may arise not only from the direct exposure of information about the individual that was neither intended nor desired, but also from more subtle concerns over discrimination, harassment, inaccessibility, or violation of other civil and human rights.

The contribution identifies a number of key insights into privacy challenges that arise in the MOOC and OLE arenas, including:
- The nature of the information being collected, including clickstreams, contributions to online discussions, forums, and questionnaires, as well as behaviors with respect to both accessing and submitting content (reading, watching online lectures or videos, attempts at doing homework, etc.);
- Tools and norms for expression of privacy policies, including current, future, aggregation, and integration with other data;
- The tussles in objectives among students, teaching staff, owners of the educational content, crowd or student provision of contributions (through grading or social networking facilities) to the experience of other students, institutional hosts, educational systems (such as municipal school systems or state university systems), researchers and analysts, and service providers such as data curators, data storage and analysis services;
- The nature of the potential privacy violation harms to the various stakeholders;
- Translation of the Family Educational Rights and Privacy Act (FERPA) into this increasingly rich, complex, growing, and evolving domain in which collections of educational data are collected, curated, collated and perhaps integrated;
- The fact that this is previously uncharted territory with social, legal, and moral challenges as yet not clearly identified, which is also evolving due to increased technological capabilities, often independently of privacy objectives and interests.

Use Case: Research Infrastructure for Social Media

The behaviors of individuals and groups online can provide the basis for significantly deeper understanding and prediction of human behaviors and interests. The kinds of data that can be useful in gaining that increased "social" understanding range from the various contributions made by individuals, such as text, photos, various kinds of streaming media and other information relating to the participants, to logged information such as clickstreams, frequency and other patterns of access, etc. At present, access to such social media information is primarily restricted to in-house analysis by social media organizations.

The question explored by this group is whether and how one might provide a "privacy" framework for such information, giving the subjects opt-in control of which information about themselves can be made available for broader studies and wider availability of the information. The intention is that permission for use remains with the subjects, but by giving them the opportunities to share, richer and larger studies can occur, with all the potential societal benefits that those studies might entail. The subject must be given control over both the granularity and types of the data, including both static data such as birthdate, address, job history and so forth, and dynamic data such as ongoing posts in various media.

In terms of the stakeholders, there are three key participants: 1) the subjects themselves, 2) the social media organizations who will play the role of data collectors, often data curators, and data platform providers, and 3) the data analysts, who may also play the role of data curators, if they provide added understanding (curation) over the datasets. There are two general approaches to making the data available. The first is to generate slices, on some regular basis, of the data that is to be exposed and deliver them to the analysts. The alternative is to retain all data on a controlled service with a clearly defined API, providing only constrained access to the data. The first gives the analyst more freedom to explore, but reduces the subject's ability to retain control, especially with respect to withdrawing from a study retroactively.

There are at least four contexts in which such a system must operate: legal, social, business, and technical. The challenge is that privacy must be respected in the context of all of these domains simultaneously.
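
The difference between the two data-release approaches described above (periodic slices versus a controlled service with a constrained API) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Subject` and `ControlledService` classes, the method names, and the sample records are hypothetical and do not describe any system used in the study. The point it shows is only that a withdrawal takes effect immediately on the live API, while a slice that was already delivered still contains the withdrawn data.

```python
from dataclasses import dataclass, field


@dataclass
class Subject:
    subject_id: str
    data: dict                                        # everything the platform holds
    shared_fields: set = field(default_factory=set)   # opt-in: nothing shared by default


class ControlledService:
    """Hypothetical platform-side store that enforces opt-in at query time."""

    def __init__(self):
        self._subjects = {}

    def add(self, subject):
        self._subjects[subject.subject_id] = subject

    def opt_in(self, subject_id, *fields):
        # The subject chooses which types of data (static or dynamic) to share.
        self._subjects[subject_id].shared_fields.update(fields)

    def withdraw(self, subject_id):
        # Withdrawal removes all permissions for this subject.
        self._subjects[subject_id].shared_fields.clear()

    def query(self, requested_fields):
        """Constrained API: return only fields each subject currently shares."""
        rows = []
        for s in self._subjects.values():
            visible = {f: s.data[f] for f in requested_fields
                       if f in s.shared_fields and f in s.data}
            if visible:
                rows.append(visible)
        return rows

    def export_slice(self, requested_fields):
        """Slice approach: hand the analyst a detached copy of the current view."""
        return [dict(row) for row in self.query(requested_fields)]


service = ControlledService()
service.add(Subject("s1", {"birthdate": "1990-01-01", "posts": ["hello"]}))
service.add(Subject("s2", {"birthdate": "1985-05-05", "posts": ["hi"]}))
service.opt_in("s1", "posts")                 # s1 shares dynamic data only
service.opt_in("s2", "posts", "birthdate")    # s2 shares static and dynamic data

slice_copy = service.export_slice(["posts", "birthdate"])  # delivered to the analyst
service.withdraw("s2")                                     # s2 later leaves the study
live_view = service.query(["posts", "birthdate"])          # API reflects the withdrawal
```

After the withdrawal, `live_view` no longer contains any of the second subject's data, whereas `slice_copy` still does, which is the loss of retroactive control the text attributes to the slice approach.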

The study group identified a list of risks or challenges to privacy that must be considered in such a scenario, including:
- Unexpected inference resulting from the analysis;
- Unexpected harm due to modifications of the data platform, due to inferences, or to the nature of the research itself;
- Unpredictable bias in the resulting research based on bias in the self-selecting nature of participation;
- Unexpected correlation between the study subject population and the general population;
- Removal from studies after agreeing to participate;
- Control of downstream use of the data, beyond the original analyst agreement. This raises questions ranging from provenance (who has touched the data and how might they have modified it), to how to enforce policies beyond the bounds of pairwise agreements, to identification and recourse for misuse, for starters;
- Responsibility for data breaches both by the social media provider acting as repository and curator and by the researchers and analysts;
- Finding the balance between privacy and publication of results;
- Management of informed consents;
- Automation of as much of this as possible, while understanding the risks that may be introduced through such automation.

The study also identified some key technologies that exist and some places where technologies are needed, but not yet available.

The scenario is based on a current collaborative study involving the Technical University of Denmark and the MIT Human Dynamics Laboratory.

Use Case: Data for Good: Public Good and Public Policy Research Using Sensor Data/Mobile Devices

The challenge faced in this scenario is to take advantage of mobile phone data (mobility data) with one of two possible objectives. The first is to model and predict outbreaks of epidemics and the second is to enable micro-targeting of individuals or groups of people with interventions in order to reduce or prevent outbreaks of epidemics. The geographic region of focus in this work is Africa. Of particular interest are people moving across areas where an epidemic may be more prevalent and those where it may be less so.

In addition to the two kinds of objectives, the study examines two distinct system designs or implementations. In all cases, the original data is collected by the mobile network operators (MNOs). In one implementation, each MNO anonymizes and coarsens the data both spatially and temporally. Thus, for example, the time may be reported in 12-hour blocks representing day and night and location may be represented as particular regions where malaria is prevalent or not. The individuality of each record is retained. This enables the targeting of individuals through one of two means. The anonymized identifier is presented to the MNO, which in turn either provides access information to the analyst or acts as an intermediary conveying information between the analyst and subject. In the other implementation design, data is merged on a regional basis before being aggregated, so for example, the MNO might report that a specific percentage of the residents of one area spent a different specific percentage of nights in a different target area. This second design significantly increases the subject's privacy and reduces the possibility of re-identification or exposure, while also reducing the accuracy and potential utility of the data.
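
The two system designs described above can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the procedure used by any MNO: the record layout, the keyed-hash pseudonym, the 06:00-18:00 definition of "day", the region labels, and the `home_of` mapping are all hypothetical. The first function keeps one coarsened record per observation (design one); the second reports only a regional share (design two).

```python
import hashlib
import hmac
from collections import defaultdict
from datetime import datetime

SECRET_KEY = b"held-only-by-the-MNO"  # hypothetical key; never shared with analysts


def pseudonym(subscriber_id: str) -> str:
    """Keyed hash so that only the key holder can map a pseudonym back to a subscriber."""
    digest = hmac.new(SECRET_KEY, subscriber_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:12]


def coarsen(record: dict) -> dict:
    """Design one: keep individual records, but coarsen time and drop the raw identifier."""
    ts = datetime.fromisoformat(record["timestamp"])
    block = "day" if 6 <= ts.hour < 18 else "night"  # assumed 12-hour blocks
    return {
        "pid": pseudonym(record["subscriber_id"]),
        "date": ts.date().isoformat(),
        "block": block,
        "region": record["region"],  # already a coarse region label in this sketch
    }


def regional_share(coarse_records, home_of, home, target, min_share):
    """Design two: fraction of `home` residents who spent at least `min_share`
    of their observed nights in `target`. Only this aggregate is released."""
    nights = defaultdict(list)
    for r in coarse_records:
        if r["block"] == "night":
            nights[r["pid"]].append(r["region"])
    residents = [p for p in nights if home_of.get(p) == home]
    if not residents:
        return 0.0
    hits = sum(
        1 for p in residents
        if nights[p].count(target) / len(nights[p]) >= min_share
    )
    return hits / len(residents)


raw = [
    {"subscriber_id": "A", "timestamp": "2015-03-01T23:10:00", "region": "endemic"},
    {"subscriber_id": "A", "timestamp": "2015-03-02T23:30:00", "region": "low-risk"},
    {"subscriber_id": "B", "timestamp": "2015-03-01T22:00:00", "region": "low-risk"},
    {"subscriber_id": "B", "timestamp": "2015-03-02T13:00:00", "region": "endemic"},
]
home_of = {pseudonym("A"): "low-risk", pseudonym("B"): "low-risk"}

coarse = [coarsen(r) for r in raw]
share = regional_share(coarse, home_of, "low-risk", "endemic", 0.5)
```

In the first design the analyst receives `coarse` and can ask the MNO to contact the holder of a given `pid`; in the second the analyst receives only `share`, which cannot be traced to an individual record but also carries far less detail.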

This study identified a number of challenges:
- The scenario exposes a direct tradeoff between health risks (and possible mitigation) for the individual and personal privacy;
- The scenario also exposes a direct tradeoff between analysis capabilities and personal privacy;
- MNOs are generally not in the business of anonymizing, curating and providing data to other entities. In these cases, the analyst role is often taken on by national health ministries;
- The legal bases for privacy in Africa are complex and generally based in historical tradition from the countries that colonized them in previous centuries. Those of Western and Northern Africa mostly derive from the French civil code, with explicit privacy frameworks, and are closely related to the European privacy directive. Those such as South Africa that derive from the English common law tradition have much less concrete policies with respect to privacy. To add to this, as populations move from one country to another, they may also be moving from one privacy policy model to another.

The intention of this use-case study was to allow the group to elicit commonalities and distinctions among the cases that might allow us to generalize. That in turn has also provided the basis for a companion paper, which concentrates on current and near-term future tools to improve the possibility of providing privacy, while continuing to allow for Big Data analysis and the benefits that accrue from that.

Other Use Cases

The report concludes with a brief summary of the additional four use cases examined by the working group. These include privacy under conditions of integrating over diverse datasets, the creation and management of user consent over exposure and use of personal data, consumer privacy and retail marketing, and genomics and health.

Conclusions

From these scenarios we draw three categories of conclusions. The first is a set of common overarching challenges. In order of increasing complexity these are:
- Scale: The sheer size of both the data itself and the accompanying meta-data that is necessary to manage it and provide privacy policies is increasing.
- Diversity: With growth, we also see an increase in the types of data, interests of analysts or users of the data, and richness of privacy policies in these new scenarios.
- Integration: There is increasing pressure and opportunity to merge or cross-fertilize among these diverse datasets. This leads to results that may have previously been inaccessible, but that are exposed through perhaps differing integrated observations of the individual.
- Secondary subjects: Although much data is based on primary subjects, it may also, perhaps inadvertently, reflect on secondary subjects. Handling privacy policies for this more integrated situation is significantly more complex than the policies applicable to a single subject.
- Emergent privacy policies: With both the integration of datasets and the increasing capture of data about secondary subjects, there is also a need for privacy policies to reflect this emergent data. The challenge of how these new policies come into existence will play an increasingly important role.

These scenarios have provided us with a basis for an initial observation about the differing stakeholders involved in the handling of big data and the privacy policies applicable to them. We begin with the subjects themselves, perhaps both primary and secondary, and the decision-makers who set out to have the data collected and made available. We then identify a set of different stakeholders having to do with the collection, management and provision of the data. This includes the actual data collector, the data curator, and the data platform provider. We then identify three kinds of stakeholders involved in the activities of usage of the data: the data analyst, the privacy policy enforcer, and the data access auditor.

With these challenges and observations in mind, we also recognize that there are a number of open questions. These questions revolve around several key elements. The first is whether big data brings new challenges to the provision of privacy or whether it exposes existing problems perhaps more clearly. More importantly, there are questions of risk vs. benefit tradeoffs. One of the challenges one faces here is that privacy, and the risk of violation of privacy, is not binary and perhaps not even measurable. Thus, one is then led to ask about the harms that may result from different levels of privacy policies and/or the violations of those privacies. Finally, we are left with a set of questions related to trust: how it comes into existence, how it may evolve, how humans' trust can be modeled, and how trust may be supported technically.

We note that this set of observations, challenges and questions is only representative of what one might draw even from this limited set of scenarios. A broader study might lead to yet more challenges and questions.

1 Introduction
Karen Sollins (MIT)

The vast amounts of diverse data that are now being called "Big Data" present society with an extremely interesting set of challenges, starting with how to use any one such dataset for a wide and increasing set of opportunities. These may range from improved product recommendations to improved modeling of human mobility in regions of infectious diseases to many other points in between. But Big Data presents additional opportunities that include a broader and deeper understanding across such datasets. If one can merge mobility data with medical histories, for example, one might provide a much more accurate model of potential epidemics, depending on both mobility and prior epidemics of diseases to which immunities are developed.

At the same time, societies and communities are becoming increasingly concerned over the questions of who knows what about them and whether or not they have control over those data collectors and analyzers knowing things about them. The concern is captured in the word "privacy." The "problem of privacy" is in fact a complex and subtle one, with many challenges and often too few solutions to those challenges. One must ask questions such as, "Who is the subject of the data?" There may be a primary subject, but data about interactions may have multiple primary subjects. There may be secondary subjects, such as the parents or legal guardians of a child who happens to be the subject of the data. In addition, one can ask questions about who else is involved with the data in various ways, such as collecting or storing it, protecting it, "curating" it for accuracy and completeness, analyzing it, and so forth. One can also ask what policies should be applied to the data for controlling access to it, to meet any privacy constraints from a legitimate policy source. Or, how might that policy be enforced? Or how can one be confident (trust) that the policy is either being defined by a legitimate policy source or being enforced by a trustworthy enforcer? And so forth. The questions of what is meant by privacy, who can define appropriate privacy and how that might be implemented are only now beginning to be examined, with significant progress in some areas and less advancement in others.

The challenge we face in the Big Data arena is at the intersection of these two driving forces: big data itself and all that it has the potential to provide, and privacy, as it becomes increasingly well-understood to be a design-driver for systems in the cyber-age.

The MIT Big Data Privacy Working Group concentrates on this problem domain. To that end, several workshops were organized by and held at MIT. [1] In addition, the Working Group took on two initial agenda items: 1) documentation of a set of scenarios in order to better illuminate some of the central challenges to providing privacy in a "Big Data" world; 2) roadmapping of current and near-term future technologies that have promise of addressing parts of the privacy in big data challenge. This document is the first of these.

Below, in the remainder of this section, we summarize a number of conclusions we draw from the scenarios. These take three forms. The first is a set of issues that derive from the larger challenge. The second is a set of categories of stakeholders we extract from the scenarios. Finally, we conclude the introduction with a set of questions, which remain unanswered, but appear to be central to the problem domain.

[1] See workshop reports:
1. Big Data Privacy: Exploring the Future Role of Technology in Protecting Privacy, June 19, 2013. Available at: http://bigdata.csail.mit.edu/sites/bigdata/files/u9/mitbigdataprivacy_wkshp_2013_finalvweb.pdf
2. MIT White House Big Data Privacy Workshop: Advancing the State of the Art in Technology and Practice, March 3, 2014. Available at: http://web.mit.edu/bigdata-priv/images/mitbigdataprivacyworkshop2014_final05142014.pdf

1.1 Overarching Observations

In examining the use scenarios here, we can identify an initial set of significant issues on the consideration of privacy, which derive specifically from the nature of big data. These are also informed by wider reading and discussions on the topic:
- Scale: The sheer size of the datasets leads to challenges in creating, managing and applying privacy policies. Because the datasets themselves are of such increasing size, the management of the meta-data that reflects privacy policies about them will incur parallel growth. One of the challenges is that as datasets grow, efficiency will play an increasing role. That will also be true of the privacy policy management associated with the growing data.
- Diversity: As datasets become "big data," it will be increasingly likely that more and more diverse stakeholders will be involved. Each may come to the effort with his or her own agenda. With an increasing number of stakeholders with different responsibilities will also come an increased probability that their interests, agendas and objectives will be less aligned with each other and hence their approaches to privacy policies will also be more divergent and possibly conflicting. Thus, privacy policy conflict resolution will play an increasingly important role.
- Integration: With increased data management technologies (e.g. cloud services, data lakes, and so forth) and integration across datasets, with new and often surprising opportunities for cross-product inferences, will also come new "information" about individuals and their behaviors. The challenge is that reasoning, inference and other analysis tools will allow for the recognition or discovery of hitherto hidden facts (data) about the subjects. This raises a question of how to create and enforce privacy policies on this new "data."
- Impact on secondary participants: Much data about individual subjects tends to reflect on other people as well. This may range from people who "liked" a post to people who are mentioned in email or posts, to true secondary participants, such as family members or co-workers. One question that will become increasingly important is how to observe the privacy rights of these other people, who are not the primary subject of the data and may not be available to apply a privacy policy when that is possible. Even if these secondary people are available, it is not clear how to handle conflicting privacy policies in this domain.
- Need for emergent policies for emergent information: As inferences over merged datasets occur, emergent information or understanding will occur; this will be based, as mentioned above, not only on simply merging datasets but, perhaps more importantly, on the exposure of previously hidden data that is only revealed in the merging of datasets. Although each unique dataset may have existing privacy policies and enforcement mechanisms, it is not clear that it is possible to automatically develop the requisite and appropriate emerged privacy policies and appropriate enforcement of them.
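
The integration issue in the list above can be made concrete with a small sketch of how merging two datasets exposes information that neither holds alone. The datasets, field names, and the `link` helper are invented for illustration and are not drawn from any scenario in this report. Neither input connects a name to a travel pattern, yet joining them on shared attributes does so for anyone whose attribute combination is unique.

```python
from collections import Counter

# Hypothetical dataset 1: mobility summary with no names.
mobility = [
    {"zip": "02139", "birth_year": 1990, "nights_in_endemic_region": 4},
    {"zip": "02139", "birth_year": 1985, "nights_in_endemic_region": 0},
]

# Hypothetical dataset 2: a roster with names but no mobility data.
roster = [
    {"name": "Alice", "zip": "02139", "birth_year": 1990},
    {"name": "Bob", "zip": "02139", "birth_year": 1985},
    {"name": "Carol", "zip": "02139", "birth_year": 1985},
]


def link(left, right, keys):
    """Join two lists of records on the shared attributes named in `keys`."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    linked = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            linked.append({**match, **row})
    return linked


keys = ["zip", "birth_year"]
linked = link(mobility, roster, keys)

# A combination that matches exactly one person yields an unambiguous new fact.
matches_per_key = Counter(tuple(r[k] for k in keys) for r in linked)
unique = [r for r in linked if matches_per_key[tuple(r[k] for k in keys)] == 1]
```

Here `unique` holds a single record that ties one name to a mobility value, which is emergent "data" in the sense used above: no privacy policy attached to either source dataset anticipated it.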

1.2 Stakeholders

As the reader will see in the scenarios themselves, there are a number of key stakeholder categories that appear repeatedly. Not all cases will include all of these stakeholders. In some cases, individuals may play more than one stakeholder role. Thus, for example, the data collector and the data curator may be the same, or the data platform provider, the policy enforcer and the auditor might be the same. But other combinations are likely to be found as well. It is also important to remember that the privacy policies for a dataset may be defined by people in different roles in different situations and, in some cases, the policies may be defined by outsiders on behalf of one or more of these stakeholders, as for example may be true under a regulatory regime. Thus, it may be that on behalf of the data subject, the government requires certain privacy policies.
- Data subject(s)
- Decision-maker
- Data collector
- Data curator
- Data analyst
- Data platform provider
- Policy enforcer
- Auditor

This list was drawn from the scenarios and should only be considered representative rather than complete. Appendix B includes a table with definitions of each of these stakeholder roles. It is also considered at greater length in the companion paper on technologies. Appendix C demonstrates an application of these definitions to the first scenario on MOOCs.

1.3 Open Questions and Issues

In studying these scenarios, we are left with a number of challenging questions and issues:
- Novelty: What new/unique challenges emerge when it comes to managing privacy in the context of big data?
- Tradeoff: How do we assess benefit vs. risk? Part of the challenge in these domains is that the risks and tradeoffs need to be evaluated, to the extent that they can be evaluated by metrics, both by different metrics and at different timescales. As an extremely simple example, the benefits of MOOC analysis may be to future students, while the risks may be to the subjects of the data, the students about whom data has been collected. A keystroke logging system may help current students if the teaching staff can get immediate feedback on how long it takes each individual student to complete a particular exercise, but it may be that systematic changes only occur on a longer term basis than the period during which a particular student is involved with a particular course. At the same time, to the extent that the data can profile individual students in numerous ways both in real time and perhaps over the longer life-time of the dataset, and perhaps in conjunction with the data from other courses the student has taken, their risks of violation of privacy may continue to grow, and definitely are unrelated to the benefits for future students. One of the challenges in this domain of metrics is that privacy is not binary. In part because it is contextual and in part simply because the privacy of some information is more critical than that of other information, this question of the tradeoff or balance between benefit and risk is both complex at any instant and a moving target.
- Harm: How do we evaluate "harm"? As mentioned above, the risk to privacy is neither binary nor necessarily stable. The deeper challenge is to understand the potential harm that may accrue from potential risks. In fact, we may need to turn this issue around. The question we may need to ask is, "Which harms are important to the individuals and in what contexts?" Thus, harms could be imagined on a spectrum from inappropriate online advertising to discrimination in setting insurance rates to something that is a life or death matter in terms of medical intervention. From that we might be able to consider whether there is some metric for evaluating harm generically, or whether any comparative evaluation can only be done in terms of specific harms. In turn, from the identification of harms, we may also be able to identify the risks that would lead to those harms. This is another way of talking about the related topic from the security community: threats.
- Trust: How can we establish and assess trust among the stakeholders? What does it mean for the various stakeholders to trust or mistrust each other or sets of others? What models do we have for understanding trust? What are the current and predictable future mechanisms and technologies for establishing trust and how do they relate to the models in people's minds and perception? How is trust established and maintained? How does it evolve over time?

With all these questions and issues in mind, the remainder of this report presents the scenario analysis done by various subgroups of the Big Data Privacy Working Group from which we drew these observations, thoughts and questions.
1.4 Remainder of This Document
The remainder of the document focuses on descriptions of the scenarios as outlined by subgroups of the larger working group. The first focuses on MOOCs (Massive Open Online Courses) and OLEs (OnLine Educational systems). The second addresses the challenges in using social networking data for research. The third considers the use of mobile cell phone data to reflect human mobility into and out of regions of highly infectious diseases, especially in developing parts of the world. The final section of the paper summarizes a number of additional scenarios addressed by the group, but in less depth. They illuminate more of the breadth of the problem domain. The paper concludes with three appendices: A) the template developed by the group for organizing the individual scenarios, B) a more in-depth table of the stakeholder categories, C) an application of the stakeholder analysis to the first scenario about MOOCs and OLEs, as an example.

2 Privacy Issues for Data Collected from MOOCs and Online Learning Environments
Team: Una-May O'Reilly (MIT), David Dietrich (EMC), Lalana Kagal (MIT)
2.1 Abstract
MOOCs (Massive Open Online Courses) represent a specific type of Online Learning Environment (OLE), which can be deployed on Internet-served platforms that collect large volumes of granular behavioral information about students' learning activities. Some data reveal each individual student's detailed study behavior such as video usage, consultation of text or learning tools, and the sequence in which material was navigated. Other data include assessments, grades, and social interactions and communication on forums within the platform. Collectively the data can be linked to auxiliary demographic information such as age, sex, and socioeconomic status. It can also be linked, if not anonymized, to public online behavior. A general set of legitimate uses of this data includes education research, examination, and analyses that directly or indirectly help instructors teach and conduct student assessments. Some, but not all, of these use cases have commercializable models for parties beyond the platform provider.
2.1.1 Definition of a MOOC and the Scope of OLE and MOOC in this document
MOOC is an acronym (Massive Open Online Course) originating in 2012. The acronym has been short-lived, as MOOC has evolved into a noun with meanings falling outside the acronym. For example, today we see MOOCs that are not open to all comers and MOOCs that are only partially online, because they are integrated into blended learning or flipped classroom models. [2] MOOCs share history with ITS (Intelligent Tutoring Systems) and other learning management systems, such as Blackboard and Moodle.
We are focusing on data and its related privacy and confidentiality issues in this document. No two OLE platforms collect exactly the same data, but wherever it is largely unimportant to differentiate each platform by its specific name, we will refer to them all as OLEs.
2.1.2 State of Data Privacy Organization
OLEs, and MOOCs in particular, at their current scale are relatively recent, so data privacy and access policies are emergent and dynamic. Policy makers range in governance scale from the federal government to platform providers, and further to institutional and independent content providers. Both de facto policies and interim policies, which have been necessary to cover fast-paced OLE activity, exist. Furthermore, existing policies on data privacy have been interpreted in new circumstances. Policy committees and meetings abound. Policy making is at the information collecting, option drafting, and revision stages. There is a potential to leverage the experience from many other data domains and shape a strong national example. This will require input from data stakeholders, the legal community, and technology experts. The latter are important because they can advise on technical risks of privacy and confidentiality breaches, while also indicating the capabilities and potential power of new technologies.
[2] Given this fluidity of the meaning of MOOC, some people reasonably dispute the origin of the widely recognized first MOOC, believing large scale online courses at the college level preceding Ng's or Thrun's at Stanford in 2012 to be valid examples. It is arguable that Coursera and MITx/edX examples are more precisely called "xMOOCs," while previous online learning courses, which are generally much more fluid in nature in terms of content deployment, are more precisely called "connectivist" or "cMOOCs."
2.2 Detailed Narrative
The Online Learning Environment (OLE) data privacy scenario is relatively straightforward compared to some other domains, such as health records or personal genotyping, for several reasons:
- Because OLEs are recent, there are few data legacy complexities.
- Because the number of platforms is modest right now, the kinds of data are enumerable and their formats are known. However, this will change.
- Because there are enumerable classes of stakeholders in the space and policy precedents in related domains, there is generally less divergence and/or disagreement on what a policy should cover and what the principles should be.
2.2.1 Open Issues
- Recognizing the dynamic nature of control of the data and acknowledging that the circumstances around that control may change. The data is replicated and passed by the platform provider to the institution of the content provider. At this point, two parties have control. Hereafter, designated controllers may expand in number, or the control may be passed from party to party in stages. Different controllers have different interests in the data and allow various parties to access it under a diverse set of goals and agreements. There is no uniformity to institutional practices across the country. If a broader policy and set of practices were to be developed by government, their interpretation might still result in heterogeneous local practices.
- Defining and determining legitimate uses of the data and how these uses should be controlled in a clear, specific, and open-ended manner.
- Setting guidelines or stated policies related to the sale, trade, or sharing of this data in OLEs and MOOCs.
- Defining and determining the legitimate commercial use of the data, if any.
- Defining the role of technology in aiding the drafting and governance of policy.
- Anticipating commercial and educational activities around OLE data, as well as potential malicious activities, and considering what technology can do to support them (or prevent them), as necessary.
The tradeoffs for policy around data control and access include:
- Students' right to confidentiality, privacy, and access to their own data.
- Institutions' and content providers' right to access because of content provision.
- Platform providers' right to access because of service provision.
- The benefit of research, the research-motivated right to access, and the countervailing risk of identification.
- The potential linking of anonymized data with outside data.
- Commercialization opportunities that may be unforeseen or unanticipated by students who grant permission to collect and control their data.
- The reasonable limits of technology for privacy and confidentiality policy support.
2.2.2 Additional Privacy Concerns
- Forum discussions and data linkability.
- One common way to grade assignments in MOOCs is via peer grading, which may create power relationships and opportunities for misuse.
- Power dynamics may not respect basic rights, as they relate to the linked data or the textual information from the discussion forums. In addition, the MOOCs can present asymmetrical power dynamics. Consider the case of children and prisoners, where people within a system (educational, correctional) may be required to do things as part of that system, or in this case, the MOOC, and they may be influenced to bend the rules, given the existing power dynamics.
- Therefore, this area needs additional protection, since MOOCs have the potential to enable coercion and power imbalance. There are free MOOCs and MOOCs focused on certifications and jobs. There is an asymmetric power relationship in some situations and when this exists, there should be separate regulations governing these MOOCs to ensure that the dynamics are fair and there is free will and clear consent.
2.3 Privacy Impact Assessment: The Specific Context of Scenario 1
2.3.1 Actors
Students: Users who take the course, complete the assignments, and receive a grade.
Teaching content providers: Faculty and teaching staff that provide the teaching material, monitor and support the discussions, and handle the grading.
Crowd Participants: At-large parties who might volunteer to grade or offer feedback on assessments, programming assignments, and so forth, but who are not students or core teaching staff.
Peer Graders: A specific case of students, in which students are expected to grade each other's work in order to manage the grading at large scales, as occurs in some MOOC contexts.
Institutional content provider: The institution behind the teaching content providers. Examples include an enterprise offering an in-house learning platform, a university offering a MOOC, or an enterprise offering product education for clients or the general public.
Platform provider (e.g. Coursera, edX, Stanford U): A party that deploys the course on the Web via a platform. In some cases, the same party develops and maintains the platform. For example, edX is a not-for-profit organization that develops, maintains, and deploys a MOOC platform as a service with a consortium of university partners, including MIT and Harvard. Coursera is a commercial entity and has different university relationships. Open edX is an open source platform that any content provider can adopt and use for content deployment.
Analyst: A party who examines the data collected from OLEs. Analysts include researchers, their students (if the researchers are academics), and education technologists. Teaching staff, platform providers, and institutional content providers may also act as analysts.
Data controllers: Data control of OLE data is not always centralized or stationary. Examples of data controllers include the platform provider and institutional content provider. Within each of these institutions, there could be multiple controllers. They may control the data at different times, or they may control it concurrently. For example, at MIT, the Office of Digital Learning receives the data, controls its distribution at one point, and then later passes this role on to the Institute Registrar.
2.3.2 Actors and Relationships
Analysts interact with data controllers to gain access to the data. The data controller often asks the analysts to formally submit to a policy. Eventually analysts will transform source data by linking and interpretation into more abstract representations of student behavior, e.g. variables for modeling, all the while trying to preserve student anonymity. Analysts will interact with data controllers to work out how to more widely share such variables and to evaluate the risk of re-identification that they, and models using them, present.
Data controllers interact with each other to pass or share the data.
Students interact with the platform provider and the teaching content provider. They register with the platform provider to gain entry to the platform and course. They provide background information, participate in the course, including its forums and assessments, and provide survey information. As data controllers, both providers will access this information. It should be noted that students often confuse the platform and content providers. A student is shown a privacy and access policy by the platform when he or she registers. A student agrees to a platform use policy when registering. For example, edX's use policy stipulates no scraping.
Students indirectly interact via the OLE with the Institutional Content Provider when they have grades placed in their academic records, or when they receive credit or proficiency certificates.
Students indirectly interact with analysts. They gain a benefit from assistance that could be founded on the researchers' analysis of their data: both a student's individual data and the data of other students in aggregate.
Students interact with other students, generating data of great interest. These interactions frequently take place on forums within the platform. Importantly, for data privacy reasons, they may take place outside the platform, informally arising, rather than being organized by the course structure. Examples of digital records of these interactions are Facebook or LinkedIn groups. Sometimes students assess the work of other students in peer-to-peer relationships. Students may also work in groups on projects or homework.
Students interact with Crowd Participants when they receive feedback from them. For example, one course at MIT invites alumni to comment on student software designs.
Students rarely interact with data controllers at this time and have zero or little access to their data beyond official records created for their education purposes.
Institutional content providers employ the teaching staff, i.e. teaching content providers, and have agreements with them regarding intellectual property related to the course, and remuneration for instruction. The institution is usually the data controller, rather than the teaching content provider. In fact, the latter party may need to seek permission for data access to the very courses he or she has taught.
Teaching content providers interact with Crowd Participants to provide guidelines on grading and get feedback on student performance and interest in the course.
Teaching content providers provide feedback to institutional content providers and platform providers on usability, additional features, and student performance, for example.
Teaching content providers may interact with analysts to understand how students learn and interact with their teaching content in order to improve that content.
Teaching content providers may interact with data controllers to get access to data about their course in order to analyze it and to improve the teaching content.
Institutional content providers interact with the platform providers to ensure that the courses are supported properly and provide feedback on additional features.
Institutional content providers interact with data controllers to identify and/or specify the policies that they wish to enforce and to discuss enforcement mechanisms.
The core interaction is the student learning via an OLE. Around this point, students interact with each other and teaching staff. In terms of privacy, students are identified by their login ID on the OLE platform. They may also reveal their "offline" identity to each other and staff in the content of their discussion posts. Students agree to a platform use agreement that implies that they accept the platform's data use policy. During the learning process, the platform provider captures click stream, assessment, discussion, and wiki data. In real time, or at longer intervals, the platform provider aggregates this data from many students' interactions. The platform and the institutional content providers control these data. They are generally not accessible to the student, but they are accessible to teaching content providers and analysts. Institutions controlling the data are responsible for meeting FERPA requirements and pseudo-anonymizing data to which they will link and provide access. They also develop and provide technical support for data access policies. Analysts transform source data in the course of their modeling activities. They may combine low level observations (e.g. mouse click activity) into variables (e.g. referrals to text during problem solving) and compile large datasets of them. These datasets describe student behavior at a recognizable level of human activity. They are destined to become the "data currency" of analytic research. How to handle the control and privacy protection of such secondary data (i.e. with whom it can be shared, given the potential for student re-identification) remains to be resolved.
2.4 Goals of OLEs
General: To educate. With college OLEs, the education could have (secondary) outreach, accessibility goals. With corporate OLEs, the education could have (secondary) product adoption, sentiment, and publicity goals. In addition, goals specific to actors are:
Teaching content providers: Providing teaching materials, job tasks for an employer.
Crowd Participants: Altruistic or professional education goals.
Peer Graders: Evaluate other student work in an appropriate, objective manner.

Institutional content provider: Sometimes through generating revenue directly or indirectly; reputation.
Platform provider: Revenue streams via advertising, signature tracks, recruiting. Possible cross selling to steer people toward formal degree programs at universities that provide content. Own the ecosystem, as they own the actual platform and access the data.
Analysts: Research into education, improvement of OLE experience for students and teachers by interpreting historical data. Inevitably, financial profit could be a goal for this kind of actor.
Data controllers: These are the data gatekeepers. They regulate access to the data at the moment for analysts and other potential controllers. Their goal is to ensure that the privacy and confidentiality policies governing the data are respected, while providing access to appropriate analysts.
There is a lurking unnamed adversarial goal/actor in this space: Those exploiting the data for commercial or hacking purposes, outside the realm of educational use, i.e. to identify someone and target her or him specifically for revelations or for profit-based activities. For example, there is a significant potential for targeted advertising.
2.5 Data
MOOCs offer a potential social science laboratory or study setting where students' behavior and interaction with course content can be almost microscopically observed. Technology allows us to capture a tremendous amount of detailed data, including:
- Click-stream interactions between a student and content.
- Use of videos and other e-resources, such as digitized reference material, wikis, and forums.
- Assessment behavior: attempts, correctness, use of immediate feedback.
- Self-reported background, pre- and post-test surveys.
- More data than in a residential setting, but with less contextual information accompanying it.
This data can be segmented in several ways, as outlined below.
2.5.1 Course-related
- Course content from content provider.
- Data exhaust from platform, as students interact with Web servers. This is often called click stream data. For edX, it is JSON logs of every GET/POST of data to the Web site.
- Student input to the OLE via wiki and discussion forum entries, questionnaires, and self-reporting surveys.
- Assessments: both grades and responses; certificate achievement.
2.5.2 Institution or Platform-related
- Curricular data related to courses taken, timing, and learning paths.
- Registration data, such as profile information about students.
- Payment data perhaps (e.g., Coursera Signature Tracks, other third parties).
- Certificate data.
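As an illustration of how click stream records might be turned into the student-oriented variables that analysts work with, the following minimal Python sketch groups JSON log lines by student and counts event types. The field names (`user`, `time`, `event`) and the sample lines are assumptions made for this example; they are not the edX log schema or any platform's actual data.

```python
import json
from collections import defaultdict

# Hypothetical log lines, one JSON object per line. Field names and values
# are invented for illustration and do not reflect a real platform schema.
RAW_LOG = """\
{"user": "s1", "time": "2014-03-01T10:00:00", "event": "play_video"}
{"user": "s1", "time": "2014-03-01T10:05:00", "event": "problem_check"}
{"user": "s2", "time": "2014-03-01T10:06:00", "event": "play_video"}
{"user": "s1", "time": "2014-03-01T10:07:00", "event": "play_video"}
"""


def student_oriented(log_text):
    """Group raw events by student and count how often each event type occurs."""
    per_student = defaultdict(lambda: defaultdict(int))
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        per_student[record["user"]][record["event"]] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {user: dict(counts) for user, counts in per_student.items()}


features = student_oriented(RAW_LOG)
print(features)
```

Even this small aggregation shows why secondary data raises privacy questions: the per-student counts are derived variables, yet each row still describes one identifiable login.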

These data are in diverse formats and can be linked to form student-oriented or time-oriented descriptions (the former being more actionable) of learning activity within the platform. One such open organization of MOOC platform data is MoocDB within the MoocDB project. MoocDB is a platform agnostic functional data model for data exhaust from MOOCs. The MoocDB project will provide open source software of MOOC tools and frameworks.
2.6 Systems
Business systems. As an example, Coursera is a for-profit organization, providing an online service. In the past, Coursera offered a "freemium" model in the marketplace, and has evolved to offer low cost courses and specializations. Signature tracking verifies student authenticity, recruiters are in the model and serve as a revenue source, and lifelong learners take courses well beyond the typical student years. In the case of a corporate MOOC, HR learning systems are part of this picture as well.
2.7 Risks
The biggest data risk is that someone in the data is identified and this causes harm to them. Data has to be pseudo-anonymized before release, but that does not assure with 100% confidence that re-identification will be impossible. Re-identification can take place in at least three ways:
- Pseudo-randomized data has confidential cross-reference tables to true identity. These tables, if not adequately protected, could be compromised.
- Some reference in the content of the data, for example free text posts in discussions or timestamps, will directly or indirectly allow cross-referencing to public data that reveals identity.
- A previously compromised dataset can potentially be used to learn the behavior of a student and this behavior pattern can then be applied to new datasets to identify the student.
Several additional risks exist:
- Data control is not in the hands of the data providers, i.e. the students. Therefore, there is a risk that the data can be used in a way that the data provider did not anticipate, or for a reason that they do not approve.
- Data released for research purposes will be used for commercial purposes.
- Data will be used to evaluate the teaching ability of the teaching content provider and to compare teaching content across different institutional content providers without explicit consent related to individual data sharing.
- Students may not understand the privacy policy that they have agreed to at sign-up, and their personal data gets shared or monetized without their informed consent.
2.8 Rules/Regulations
In the United States much of the regulation of academic data falls under the Family Educational Rights and Privacy Act (1974), [3] which defines the rights of parents and guardians to access, and to have some control over who has access to, information about children under 18 years old. It also defines the rights of students over 18, such as students in college. It is important to recognize that there may be a number of non-FERPA regulations with respect to the privacy of information about students. An example of this is the U.S. Health Insurance Portability and Accountability Act (HIPAA), but there are others as well. This group did not discuss the relationships among these various different factors in the privacy of educational data, but just noted that such differences and possible conflicts exist.
[3] See http://www2.ed.gov/policy/gen/guid/fpco/ferpa/students.html for general information about FERPA.
2.9 Technologies
- Learning platforms (using this broadly to refer to platforms such as edX, Coursera, Udacity, and other MOOC providers, as well as more traditional Learning Management Systems (LMS) such as Blackboard);
- Software frameworks for processing large datasets, such as Hadoop and data lakes that store a combination of structured and unstructured data;
- Web browsers and front end tools;
- Analytical tools;
- Cloud computing platforms (e.g., Amazon Web Services and others);
- Code on different systems;
- Mobile devices.
2.10 Privacy Constraints
Privacy constraints in a MOOC are very different from those of a physical classroom experience. There is a perspective that since MOOCs are much more open, students are more vulnerable online, compared with a traditional classroom setting.
2.11 Technology Informing and Supporting OLE Data Privacy and Confidentiality Policy
2.11.1 What tools and approaches can (new) technology provide?
Some possible technologies:
- Differential privacy.
- Analysis is carried out on encrypted data, so even the platform provider does not see the data (homomorphic encryption).
- The analyst uses a trusted and privacy-aware API to write up their analysis and submit their code to the data controller; the API prevents the abuse of data.
- Store extensive audit logs about analyst access to ensure that the analyst is not able to chain queries in order to gain access to inappropriate data.
- Privacy-aware analysis framework that helps the analyst be policy compliant.
Some initial thinking has been given to managing MOOC data via decision and policy engines based on heuristics. This approach would require separating the databases and using different access controls.
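To make the first item in the list above more concrete, here is a minimal Python sketch of one standard differential privacy technique, the Laplace mechanism applied to a counting query. It is offered as an illustration only and is not a mechanism the working group specified; the record layout, the `private_count` name, and the choice of epsilon are all assumptions.

```python
import random


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records matching `predicate`.

    A counting query changes by at most 1 when one student's record is
    added or removed (sensitivity 1), so Laplace noise with scale
    1/epsilon is added. The difference of two independent exponential
    draws with rate epsilon is Laplace-distributed with that scale.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Hypothetical records: whether each student passed an exercise.
records = [{"passed": True}] * 3 + [{"passed": False}] * 2
rng = random.Random(0)  # seeded only to make the sketch repeatable
noisy = private_count(records, lambda r: r["passed"], epsilon=1.0, rng=rng)
print(f"noisy count of students who passed: {noisy:.2f}")
```

A smaller epsilon adds more noise and gives stronger protection, at the cost of less accurate answers. Repeated queries consume privacy budget, which is one reason the audit logs mentioned above matter.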

2.11.2 Risks
What risks are there to even the new technology?
- Differential privacy only works within a closed dataset; privacy breaches are possible when external datasets are linked.
- Encryption acts like access control and is useful when the platform provider is untrusted.
- A restricted API acts like an access control combined with audit.
- Auditing can handle post facto problems.
- The analyst platform provides a holistic approach to access control, privacy awareness, and ensuring policy compliance. However, it restricts the analyst to a single platform.
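The linkage risk named in the first item above can be sketched in a few lines of Python. The example joins a pseudonymized release to an outside dataset on shared attributes and reports every pseudonym that matches exactly one outside record. All records, names, and the choice of linking attributes (`age`, `sex`, `city`) are invented for illustration.

```python
def link(pseudonymized, public, keys=("age", "sex", "city")):
    """Map pseudonym -> name wherever the linking attributes match
    exactly one record in the outside (public) dataset."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = {}
    for row in pseudonymized:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the row
            matches[row["pseudonym"]] = names[0]
    return matches


# Hypothetical pseudonymized course release (no names, only pseudonyms).
release = [
    {"pseudonym": "u01", "age": 34, "sex": "F", "city": "Boston", "grade": 0.91},
    {"pseudonym": "u02", "age": 21, "sex": "M", "city": "Austin", "grade": 0.55},
]
# Hypothetical outside data, e.g. public profile pages.
outside = [
    {"name": "Person A", "age": 34, "sex": "F", "city": "Boston"},
    {"name": "Person B", "age": 21, "sex": "M", "city": "Austin"},
    {"name": "Person C", "age": 21, "sex": "M", "city": "Austin"},
]

print(link(release, outside))
```

In this made-up data only `u01` is re-identified, because two outside records share the attributes of `u02`. The sketch suggests why coarsening or withholding linking attributes lowers, but does not remove, the risk.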

3 Research Infrastructure for Social Media
Team: Maritza Johnson (Facebook), Dazza Greenwood (MIT), Mona Vernon (Thomson Reuters)
3.1 Abstract
Most social media platforms provide at least two basic features: the ability to share user-generated content and the ability to connect with an audience. Different social media platforms make it possible for users to share a range of content types and some allow the user to selectively choose the audience for individual pieces of content. On Facebook, for example, the user could share text-based status updates, photos, or website URLs. The user is also able to comment on content posted by other users, install applications that utilize the Facebook API, or communicate with others within a self-organized group of people. Between the user-generated content and the server logs that capture how and when people interact with the platform, these services are an invaluable source of information about human behavior at the individual, group, and even country levels.
The goal of this scenario is to evaluate technical solutions that would open this data up to researchers while offering data subjects informed consent and control over their data. Studies of social media to date have provided insights on topics as wide-ranging as social capital, social influence, meme evolution, emotional contagion, mobility, and politics. For a variety of reasons, much of this research is currently limited to employees of social media companies.
3.2 Scenario Introduction
Studies of social media to date have provided insights on topics as wide-ranging as social capital, social influence, meme evolution, emotional contagion, mobility, and politics. Unfortunately, much of this research is currently limited to in-house researchers at social media companies. Academics and other researchers have, in some cases, leveraged publicly available content or APIs, when they are available, but there are notable limitations to collecting data through these channels. In some cases, studying a group of people yields the most interesting insights but this requires that a critical mass of the population opts in to a research program. In other cases, the user-generated content is best supplemented by information that can only be found in the server logs, such as how frequently a person visits the platform, how much time they spend, and the proportion of time spent consuming content versus producing content.
One way to increase the volume of research in this area is to develop a social media research infrastructure that allows users (data subjects) to opt in to a program that makes some subset of their social media content and the accompanying server logs available to researchers. The result would be a large-scale, rich dataset that would empower researchers to generate varied and reproducible research. Social media platforms might participate in a data release program with varying options. For example, one successful implementation of the program might include a predefined set of user data and data from server logs, and a feature that allows researchers to contact participants for supplementary data or follow up surveys. It might also include a portal with educational content for individuals to visit to help them understand the information they've chosen to donate, to see how researchers are using it, and to gauge the long-term benefits of participation.

The incentive for the Study Participants and Social Media Providers is to act for the public good. The risk for the Study Participants is that they might experience negative effects as a result of contributing their data to the general dataset. The data exchanged may contain several features of data known to be personally identifying or sensitive in nature including race, sexual preferences, gender choice, and political views. The data exchanged could also be used for making unexpected inferences that the participant was unaware of at the time of consent.
As a high-level overview, the program would be initialized by the Social Media Provider. The social media provider would advertise the opt-in research program to users (potential participants), give an overview of the structure of the program, the risks, and the benefits and present the choices that represent how a user might participate. This information would include the main features of the program: the basic set of information that is required to participate; additional optional fields that the participant may choose to include; and the features that would allow a researcher to contact a user for additional information.
The participant will have granular opt-in choices for sharing a subset of their personal data. For example, some basic (static) fields are included in the set such as birth month and year, current city, school history, job history, etc. The participant is also given the ability to contribute dynamic streams of their data, including photos, posts, comments, and interests.
The information will clearly describe the policies that researchers will be held to, while making it clear that the dataset is not believed to be anonymous or de-identified in a robust manner.
3.3 Stakeholders and Interactions
Social media providers are the data collectors and would initially serve as the data platform providers.
Social media users are the data subjects and are asked to provide informed consent for the data to be transmitted by the social media provider to the researcher for purposes of research study.
Researchers are data analysts and receive data from data collectors (social media providers) by permission of the data subjects (social media users). The researchers become data curators of the data that they receive at the time of receipt and any derivative data that is produced as a result of the research activities.
The data collectors (social media providers) remain data curators for the underlying data of all social media users that they continue to maintain.
Social media users will continue to interact with the social media platform to generate new content.
Researchers might contact social media users to collect additional data to supplement the social media data.
Social media users will contact the social media provider if they experience issues or have concerns about the overall program. Users will expect that the social media provider is ultimately responsible for ensuring a positive experience.

Researchers would provide information to the data subjects about the research that results from using the data subjects' data.
3.3.1 Data
Examples of the data that could be made available:
- Posts: photos, status updates, location check-ins, etc.
- Comments and the number of likes on individual posts
- Education history
- Hometown
- Current city
- Religious and political views
- Information about the friend network: summary statistics like count, breakdown by age range, current city (location), gender, political views, and education level, etc.
For the dynamic fields, the informed consent dialog might offer the ability to contribute any of the following. Audience, keyword, tags, or some other mechanism could define the exceptions.
- All historical data
- All historical data with some exceptions
- Only future data
- Only future data with exceptions
- Historical and future data
- Historical and future data with exceptions
Making the data available:
Option 1. Social media provider generates data slices:
- On a monthly/quarterly/annual basis, the Social Media Provider would create a new data slice for all active participants in the program.
- Participants would be able to opt out of the program, but they would not be able to remove their data from the datasets that had already been published. This is mainly because no practical guarantees could be made about deletion requests once the data has been released to researchers.
- Researchers would conduct queries on the available datasets, or download the entire available set for a given time period.
Option 2. Social media platform provides an API specifically for this program.
3.4 Systems
Legal systems: The privacy policy, or data use policy, currently governs how data can be used.
Social systems: What are the existing expectations around who owns shared content? Social media data sometimes involves more than one data subject. Consider for example a Facebook status update with a set of comments and "Likes." The simple text of the post belongs to the original poster (the person we consider the data subject throughout this scenario). But the post might also include "tags" to other people. These structured references to other users represent other individuals. What's the best way to handle providing this information in the dataset? Similarly, on Facebook, comments on a post are stored with the account of the post author rather than the commenter. Who does this content belong to? The comments are relevant to the context of the post, but are generated by other people. Is consent required to know which users "liked" a post? Do we limit the data so that only the number of likes is available?
Business systems: Human subjects research requires the approval of an ethics committee if the Common Rule applies. [4]
Technical systems: informed consent, a permission-based system to allow the user to participate in a way they feel comfortable, transparency and control over how data is shared, deletion protocols, de-identification of data to protect individuals when it is aggregated, and auditable systems to understand who has access.
3.5 Analyze the Scenario
3.5.1 Goals
- The participants benefit from contributing to a general body of knowledge and perhaps they will learn something about themselves on an individual basis too.
- Researchers have access to a dataset that was previously unavailable.
- The social media provider gains insights about the user base and contributes to the general body of knowledge.
- The Researchers may be acting for the public good, or they may be acting to develop their own careers.
3.5.2 Risks
- Participants agree to participate in the program and then later experience an unexpected harm, due to an unexpected inference that arises from the research.
- Participants agree to participate in the program and then later experience an unexpected harm, due to modification of the site based on those inferences, or as a part of the experiment itself.
- The dataset would be a valuable resource for researchers, but it would be difficult to quantify the bias introduced to the dataset based on the characteristics of the people who decide to opt in to the program.
- Researchers identify a correlation in the study population that can be extrapolated to the general population, greater than the pool of the participants who opted in.
- Deletion requests: Is it reasonable to design the program such that people can opt in or choose to opt out, but cannot remove their data from the already-released data slices? If not, then how would deletion be handled when the data slices have already been released?
- Lack of control on the downstream use of the data, or derived data: What are expectations and commitments to the people who opt in on downstream uses of the data? When new insights emerge, how do you ensure that the inferences/derivative data have been created in a way that is consistent with an individual's expectations? How would we detect a misuse of the data? How would we tag derivative data to understand where it came from and understand the original policy in order to determine whether the action and the future uses are policy compliant?
- The data copy may be disposed of by the Researchers after the study, or may be retained in a corpus for further study. The data copy must be held securely and the Researchers are liable for a breach. However, the Social Media Provider may be liable if they have not assured that the researchers are acting properly, and also may risk collateral damage in the case of a breach, even if the proper processes have been followed. A variety of technologies and systems will be used to store and transmit the data, including internet links and various databases. The data must be held according to the various data protection regulations in the territory that the data has been exported to, provided the export is legal in the first place.

Footnote 4: The Common Rule is the name of the U.S. federal policy on the ethics of use of human subjects in biomedical and behavioral research. For more detail see http://www.hhs.gov/ohrp/humansubjects/commonrule/index.html

3.5.3 Rules
- Terms and Conditions of the social media provider
- The social media platform's existing audience controls for content
- Notice and consent when the user opts in to the program
- FTC Section 5
- For the Researchers: applicable human subjects research protections (e.g., The Belmont Report or The Common Rule)
- The policies of publication venues

3.5.4 Time
Roughly two to four years.

3.5.5 Existing Relevant Best Practices

Human subjects review committee: Where The Common Rule applies, an ethics committee would be required to give approval for human subjects research, and an appropriate risk assessment would be undertaken to validate the arrangements that have been put in place to manage the data security and disclosure.

OAuth2 for enabling access to authorized users: Once the data subject has provided the click-based grant of authorization, the researcher could be granted an OAuth2 token to request and receive that individual's data via the API. The data would then be transferred to a research platform and database to conduct the analysis. The OAuth2 token would be provisioned with a scope of access that corresponds to the personal data that the data subject agreed to provide.
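The scope-limited access described above can be illustrated with a small sketch. It is not an implementation of any provider's API: the scope names, the ResearchToken type, and the issue_token and fetch_profile functions are all hypothetical, and a real deployment would rely on the provider's own OAuth2 authorization server. The sketch only shows the idea that a token carries the scopes the data subject consented to, and that the API releases nothing outside them.

```python
from dataclasses import dataclass

# Hypothetical mapping from research scopes to profile fields.
# A real provider would define its own scope names and field lists.
RESEARCH_SCOPES = {
    "research:posts": ["posts"],
    "research:education": ["education_history"],
    "research:location": ["hometown", "current_city"],
}


@dataclass(frozen=True)
class ResearchToken:
    """Stand-in for an OAuth2 access token bound to one data subject."""
    subject_id: str
    scopes: frozenset


def issue_token(subject_id, consented_scopes):
    """Provision a token limited to the scopes the subject agreed to."""
    unknown = set(consented_scopes) - set(RESEARCH_SCOPES)
    if unknown:
        raise ValueError(f"unknown scopes: {sorted(unknown)}")
    return ResearchToken(subject_id, frozenset(consented_scopes))


def fetch_profile(token, profile):
    """Return only the profile fields covered by the token's scopes."""
    allowed = {f for scope in token.scopes for f in RESEARCH_SCOPES[scope]}
    return {k: v for k, v in profile.items() if k in allowed}
```

In this sketch, a subject who consents only to the posts and location scopes would never have their religious or political views returned, even if those fields exist in the stored profile.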
In the UK, organizations like the UK Data Archive can be consulted to manage the privacy processes and publication of results without breaching privacy.

3.5.6 Gaps

The above description includes a few caveats that are based on the limitations of our technical abilities. For example, it is important that the participants understand that researchers would agree to a policy that prohibited attempts to re-identify participants within the data set, but it would be difficult to make any guarantees along those lines given today's technical solutions. Similarly, there could be contractual limitations in place around deletion and retention; however, we are lacking technical systems to enforce the policies.
- The management of access to data and the risks associated with publication present an impediment to the use of social media data.
- Gathering informed consent from social media users is particularly problematic.
- To enable research of this kind, we need to streamline these processes and provide automatic verification of the safety of disclosures.

3.6 Innovation Ideas and Opportunities

3.6.1 Looking at 3-5 Years: Opportunities and Challenges

One of the main opportunities lies in the ability to combine social data from different sources in order to conduct more insightful research and to enable reproducibility of research. This will require technology to allow for privacy preservation, or the application of rules as the data is combined with other data sets.

How do we develop legislation, if it is not already in place, to set up a baseline that is not country-specific? Country-specific rules make it difficult for social media providers to comply with multiple forms of legislation. Ideally, there will be a mechanism for allowing social science research to be conducted on a global scale.

Computational social science may become more common and "normal", compared to the niche role that computation currently has in the social sciences. A true limitation of the research area now is that only social media platforms have easy access to large-scale data sets. Most academics who work in the space have partnerships with corporate entities to acquire large data sets. How will the research community change when large-scale data sets are available to all social computing researchers?

Shifting norms are expected to continue even beyond the 3-5 year horizon, and this means that we expect continued deep uncertainty.

3.6.2 Open Questions
- What if we developed a "Common Program & Protocol" for infrastructure-level services to enable population-wide living labs social media research?
- What if Facebook supported a feature for users to "opt in" for participation in pre-qualified research studies and we modeled/tested that as a common service available to any approved MIT Living Lab application? In theory, this sort of capability could enable reusable or easy update of consent across simultaneous research studies and for future studies. This type of service could comprise fundamental capabilities that are now missing for operationalizing fair permission-based use of personal data in big data contexts.
- An OAuth2 scope type developed for research content could be a model for other social networks to use. One of the best aspects of the Facebook and general Web 2.0 design pattern with OAuth2 is that the authorizations can be seen on a dashboard and individually modified or revoked according to the agreements, potentially at any time.
- How could a common service type and interface specification be used by researchers to enable other social media providers (e.g. LinkedIn, Google Plus, Twitter) to provide consent-based data using interoperable programs and according to the standard protocol developed by MIT and Facebook? What issues of scaling, cost/risk management, business value, and usability would need to be addressed, and at what phase of design, development, testing, iteration, and deployment (alpha, beta, v1, v2)?
- Could MIT Living Labs partner with Facebook to test a model OpenPDS (Personal Data Store) deployment that further developed infrastructure-grade service interfaces, pipes, and gauges? Would, or should, it matter if OpenPDS was situated at the research institution (e.g. MIT for MIT Living Labs), or at a third party provider?

3.6.3 Alternative A: Interactions of People

The participant has an account with Social Media Provider, provides Informed Consent to Participate in the Study and, within the scope of the study, provides authorization to Social Media Company to release personal data to Researchers via their applications.

A laboratory has an approved research study, has received the informed consent of individual participants, has registered an application with a social media provider, and has selected the OAuth2 scopes for grant of authorized access that correspond to the personal data used to conduct the research. Once the individual has provided the click-based grant of authorization, the lab's app uses an OAuth2 token to request and receive that individual's personal data via their app and into a research platform and database used to conduct the analysis.

The Social Media Provider provides an account to the individual under its terms and conditions and provides a developer account to the lab under another set of terms and conditions. It also provides the personal data authorized by the individual for sharing with the application of the lab upon permission of the individual user.

3.6.4 Data

All past and current available data during the course of participation in the study that is available by OAuth2 individual consent from included social media providers.
3.7 Notes on Scenario

This example is based on a study that is currently happening at the Technical University of Denmark in collaboration with the MIT Human Dynamics Lab. However, references to potential downstream sharing arrangements by participants and researchers represent prospective future phase research and assume a future state of perhaps 1-3 years from now.

3.8 References

Related to applicable rules

When Facebook has the data, these terms apply:
- Platform Policy (applies via Researcher's registered Client App/Service): https://developers.facebook.com/policy
- Statement of Rights and Responsibilities: https://www.facebook.com/legal/terms
- Data Use Policy: https://www.facebook.com/about/privacy
- Facebook Community Standards: https://www.facebook.com/communitystandards
- Facebook Principles: https://www.facebook.com/principles.php

When the Researchers receive the data:
- SensibleDTU Example Computational Social Science Research Study: https://www.sensible.dtu.dk/?page_id=89

When the Participants share downstream via Personal Data Services:
- MIT Human Dynamics Lab Model Personal Data System Rules: https://github.com/humandynamics/systemrules/blob/master/model_personal_data_system_rules.md
- Draft Data Rights Services Agreement: https://github.com/humandynamics/legalagreements/blob/master/datarightsservicesAgreement.md

4 Data for Good: Public Good and Public Policy Research Using Sensor Data/Mobile Devices

Team: Jake Kendall (Gates Foundation), Yves-Alexandre de Montjoye (MIT), Cameron Kerry (MIT)

4.1 Abstract

There is little doubt that the capacity to collect and analyze mobile phone data at large scale has great potential for good [UN] [D4D]. There are, however, numerous barriers that need to be overcome before this data can be broadly used by non-governmental organizations (NGOs) and researchers:
- The data is generated by the carriers' infrastructure and belongs to them,
- The infrastructure to manage and analyze this data at scale for good has to be developed,
- Data-science skills are needed within NGOs to fully take advantage of the data,
- These data are highly sensitive and personal: simply anonymized mobile phone metadata has been shown to be re-identifiable [unique], and
- The legal and regulatory environment is at best uncertain and may prevent certain uses of the data.

This group is studying the technical and legal solutions that could make this data available in an operational context. We first focus our analysis on two scenarios inspired by the available academic literature. We then sketch proposed practical implementations to operationalize these scenarios and analyze them from a privacy angle, focusing on re-identification, and from a legal perspective, with a focus on African countries.

4.2 Scenario Development

After considering a number of different scenarios, we focused on two that contrast scope and purpose:

Scenario 1: Tracking population mobility within and across borders to model epidemic spread.

Scenario 2: Micro-targeting behavior change interventions to individuals or specific sub-sets of the population.
Scenario 1 is modeled on the use of location data coming from mobile phones in order to better understand and quantify the spread of malaria. The location of users is recorded at the antenna level every time a user interacts with their phone (phone call, text, or Internet session). This location data is used to estimate each user's migrations between a set of predefined regions, for example from Nairobi to Lake Victoria, as well as the total number of nights spent by every user in every region. The main expected outcomes of this work are two matrices that show the average monthly parasite importation by returning residents and by visitors. In the scenario we consider, such matrices would be computed on a monthly basis and shared with local CDCs, ministries of health, and NGOs. We also consider a case where data from multiple operators across neighboring countries would be used to estimate the monthly parasite importations per region. While this scenario has a clear public purpose, the sensitivity, re-identifiability, and potential for misuse of fine-grained location data, such as targeting of individuals or groups for malicious purposes, have to be considered.

Scenario 2, inspired by [bigdatadriven], uses mobile phone metadata to micro-target people for specific behavior change purposes: agriculture techniques and health-seeking behaviors, for example. In this case, location data at the antenna level, as well as other metadata fields, such as anonymized call and text logs (excluding content) and recharge information, are used to estimate an individual's status (farmer, other socio-economic status) and/or propensity to change behavior. In this scenario, mobile phone metadata are used by machine-learning algorithms through a set of pre-computed metrics (e.g. daily distance traveled, recharging behavior, time it takes to answer a text). Users can then be targeted for various behavior change or informational campaigns through text messages or phone calls sent by the carrier, or by a third party. While computing the metrics requires a rich set of data, this scenario aims at emphasizing the challenges associated with micro-targeting individuals and at introducing an element of intrusiveness that is not present in Scenario 1, while involving the same public purposes.

4.3 Operation of Scenarios

For each scenario, we propose two potential implementations. We will subsequently analyze these four implementations from a privacy angle and a legal perspective.
4.3.1 Scenario 1

In Scenario 1 implementation A, the different mobile network operators (MNOs) involved would share simply anonymized individual mobility data with one third party. To limit the risks of re-identification, the data would be coarsened spatially and temporally. Matching the study [quantifying], the spatial resolution of the data would be at a predefined regional level, or approximately 1000 km² (692 settlements for the 581,309 km² of Kenya). Similarly, given the importance of nights for malaria infections (mosquito bites), the temporal resolution of the data would be 12 h (e.g. 6am-6pm). Finally, as malaria symptoms may take up to 30 days to manifest themselves, we work under the assumption that three months of such mobility data are needed to estimate the impact of human mobility on malaria. Different MNOs would hash a salted version of the mobile phone number of the subscribers to allow the third party to reconcile the data. Scenario 1 implementation A is represented below.

[Figure: Scenario 1, implementation A]
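The coarsening and pseudonymization steps of implementation A can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout, the antenna-to-region lookup, and the function names are hypothetical, and the sketch uses HMAC-SHA256 keyed with a secret as one possible way to realize the "salted hash" (it assumes all participating MNOs share the same secret, which is what would let the third party reconcile records across operators).

```python
import hashlib
import hmac
from datetime import datetime, timedelta


def pseudonymize(phone_number: str, shared_salt: bytes) -> str:
    """Keyed hash of a phone number (assumption: HMAC-SHA256 as the salted hash)."""
    return hmac.new(shared_salt, phone_number.encode("utf-8"), hashlib.sha256).hexdigest()


def coarsen_time(ts: datetime) -> str:
    """Map a timestamp to a 12-hour window: 6am-6pm is 'day', 6pm-6am is 'night'."""
    if 6 <= ts.hour < 18:
        return f"{ts.date().isoformat()}/day"
    # Hours after midnight belong to the night that started the previous evening.
    night_date = ts.date() if ts.hour >= 18 else ts.date() - timedelta(days=1)
    return f"{night_date.isoformat()}/night"


def coarsen_record(record: dict, antenna_to_region: dict, shared_salt: bytes) -> dict:
    """Replace phone, antenna, and timestamp by pseudonym, region, and 12h window."""
    return {
        "user": pseudonymize(record["phone"], shared_salt),
        "region": antenna_to_region[record["antenna"]],
        "window": coarsen_time(record["timestamp"]),
    }
```

Note that a keyed hash of this kind is pseudonymization rather than anonymization: anyone holding the secret can recompute the mapping, so the secret would have to stay with the MNOs.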

Contrary to implementation A, in implementation B, MNOs only share aggregated information with third parties. In this implementation, every MNO will provide a modified version of the mobility matrices developed by [quantifying] to the third party. Using three months of data, every MNO will assign each of its users to one region. This region will be the user's home location. The MNO will provide the third party with a region-by-region matrix containing how much time users whose home is in region i have spent in region j. For example, the row corresponding to region i will look like the following:

  i-2    i-1    i      i+1    i+2
  1%     2%     87%    0.5%   2%

This reads that all the users whose home location is in region i have spent 87% of their time (e.g. hours or nights) in region i, 2% in region i-1, and 1% in region i-2 over the course of three months.

Each MNO will also provide the third party with the number of its subscribers who have been assigned to each region.
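The aggregation an MNO would perform in implementation B can be sketched as follows. The input format (one (user, region) pair per observed time unit, such as a night) and the rule for choosing a home region (the most frequently observed region, ties broken alphabetically) are assumptions made for illustration; the source does not specify how the home location is assigned.

```python
from collections import Counter, defaultdict


def home_regions(observations):
    """Assign each user the region where they were observed most often."""
    per_user = defaultdict(Counter)
    for user, region in observations:
        per_user[user][region] += 1
    # Ties are broken by region name so the result is deterministic.
    return {u: min(c, key=lambda r: (-c[r], r)) for u, c in per_user.items()}


def mobility_matrix(observations):
    """Return (matrix, subscribers).

    matrix[i][j] is the share of observed time that users whose home is
    region i spent in region j; subscribers[i] is the number of users
    assigned to home region i.
    """
    homes = home_regions(observations)
    counts = defaultdict(Counter)
    for user, region in observations:
        counts[homes[user]][region] += 1
    matrix = {}
    for home, row in counts.items():
        total = sum(row.values())
        matrix[home] = {region: n / total for region, n in row.items()}
    return matrix, dict(Counter(homes.values()))
```

Only the matrix and the per-region subscriber counts leave the operator in this sketch; the individual observations never do.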

4.3.2 Scenario 2

Here we will also consider a third party platform provider, although the architecture is fairly similar if there is only the MNO involved. The issue is only that the end users would have to take it upon themselves to link to multiple MNOs if they wanted to be able to target clients of each.

Here the analytic transformation of the data conducted by the service provider would select a set of unique users (not identified by name or other PII, but by encrypted key or other anonymous unique identifier), based on their usage patterns and inferences about their social status or other traits. They would then pass the unique IDs to the MNO, who would be able to match them to the corresponding phone numbers for re-contact with an SMS or automated voice message encouraging program participation.
- Case 1: Third parties may analyze anonymous data to select individuals, but the mobile operator is the only one in touch with targets and they are not identified to third parties. Third parties may pass back an encrypted key or other identifier to trigger sending a message.
- Case 2: A third party is put directly in touch with the targets, or can identify them itself.

4.4 Regulatory Environment

Review of online sources on data privacy laws in Africa indicates a landscape that is evolving along two lines. Francophone countries in West Africa and North Africa that reflect the French civil code system have tended to adopt privacy frameworks modeled on the 1995 European Privacy Directive, supervised by data protection authorities. English-speaking countries with common law systems have less defined privacy laws.

Thus, data protection authorities in a number of French-speaking countries around the world have united in an association under the leadership of the French CNIL, and at least Benin, Burkina Faso, Gabon, Ivory Coast, Senegal, Madagascar, Mali, Mauritius, and Morocco have such privacy regimes in place, with new laws expected in Mauritania and Niger. Many countries (e.g., Côte d'Ivoire) in both categories do not have any data protection laws, but do appear to have constitutional provisions for a right to privacy that provides at least some authority for protection.

In the English-speaking countries, the systems are less developed. South Africa recently adopted legislation, the Protection of Personal Information Bill, that adopts privacy principles to be enforced by a data protection authority; it takes effect at the end of this year. Both Nigeria and Kenya are considering broader bills that resemble each other.

Based on this framework, we will use the European Privacy Directive (EPD) as a benchmark for civil code countries. We will also look to the [Consumer Privacy Bill of Rights] as a way of exploring its application and developing an alternative framework (footnote 5).

4.5 Data Utility

4.5.1 Scenario 1

Implementation A: In this case, the utility seems close to the situation of having access to the full raw data. Data preprocessing and cleaning is harder to do on coarsened data, as unusual behavior might be hidden by the coarsening (e.g. an unusually high number of phone calls).

Implementation B: In this case, the aggregation that is done at MNO level decreases the utility of the data. Considerations include tracking people across borders, removing dual-SIM users, and taking specific periods of time into account.

4.6 Privacy

Implementation A: There exists a risk of re-identification even when the data is coarsened. We will look at the number of antennas over several regions to match to the unicity formula on spatial resolution. Similarly, the temporal resolution here would be twelve hours. This should allow a very rough estimate of the likelihood of re-identification given x points.
Implementation B: When data is aggregated, the risk of re-identification is lower; the edge cases would be very small regions that have been assigned as home regions to very few people. The risk to consider here would be at the group level, e.g. people from one region that only go to another region (of the same ethnic group, for example). A counterpoint would be people who spend too much time in another region. This goes beyond pure privacy as risk of re-identification, and many other cases should be considered.

Footnote 5: Craig Mundie, in a recent Foreign Affairs article, suggests a new model where governance and regulations should not be focused as much at the point of collection and storage of personal data, but rather on how that personal data is used and retained. The President's Council of Advisers on Science & Technology (of which Craig Mundie is a member) echoed many of the recommendations and thoughts in their document, Big Data: Seizing Opportunities, Preserving Value, in particular the belief that regulating use cases and enforcing privacy with stiff contractual obligations and deterrents may be needed to extract value while maintaining data security and privacy.

4.7 Critical Issues
- Business case for mobile carriers. Mobile carriers are not in the business of conducting social science or public health research. NGOs will need to develop a business plan that makes data sharing interesting and worthwhile from the carriers' perspective. Support of governments (e.g., health ministries and communications regulators) will be pivotal.
- Scenario 1 presents technical issues of de-identification. The spatial and temporal coarsening of call detail records (CDRs) substantially mitigates privacy risks and, if strong enough, can sidestep the application of the EU Privacy Directive. However, it can also limit the reliability and utility of the data.
- In Scenario 2, de-identification, at least for significant applications, is not an option, because interventions will be targeted to specific individuals. This scenario will require engagement of governments to enable the data use and identification; without affirmative support by relevant health and data protection authorities, this scenario may be impossible. The involvement of governments will also require careful development of mechanisms to avoid misuse and unwanted identification.
- Further development of specific practices and technical methods is needed to manage privacy protection in accordance with various principles of the EU Privacy Directive and the Consumer Privacy Bill of Rights (e.g. data retention, accountability).

4.8 Promising Paths Forward

Across both of these scenarios there are promising paths forward in terms of employing different technical architectures and practices to meet data protection needs, while still extracting value from the data.

4.8.1 Scenario 1

In this case, there are already private sector companies that obtain mobility data from mobile operators and sell it without user permission (i.e., based on ostensibly achieving anonymity).
Airsage is an example in the U.S. that demonstrates a number of innovative approaches to sharing anonymous mobility data. They improve the quality of the position signal over what a CDR would be able to provide through triangulation, which they achieve by upgrading the base station software of the MNO. They then install software within the MNO firewall that anonymizes the data by stripping it down to just mobility patterns and aggregates the output to a minimum of seven mobile traces per observation. Hence, if two people moved from A to B in a given time period, they would report that "less than seven people moved." The fact that they do their anonymization within the firewall removes the need to share raw data.

A company called Grandata in Mexico uses a form of differential privacy algorithm to add some random noise and limit the fidelity of queries on their mobility data that they sell to retail marketers.

Other techniques to explore further would include emerging differential privacy approaches, as well as synthetic data set generation via modeling methodologies (e.g. DP-WHERE).
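The two ideas mentioned above, a minimum count per reported observation and random noise on query results, can be sketched generically. This is not a description of how Airsage or Grandata implement their systems; the function names are hypothetical, the threshold of seven is taken from the text, and the noise step is the textbook Laplace mechanism for a counting query, under the assumption that each person contributes at most one trip to any count (sensitivity 1).

```python
import random
from collections import Counter

MIN_COUNT = 7  # minimum number of traces per reported observation (from the text)


def aggregate_flows(trips, min_count=MIN_COUNT):
    """Count trips per (origin, destination); mask counts below the threshold."""
    counts = Counter(trips)
    return {od: (n if n >= min_count else f"<{min_count}") for od, n in counts.items()}


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential variables."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query, assuming sensitivity 1."""
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

For instance, two trips from A to B would be reported as "<7" by aggregate_flows, mirroring the example in the text. A smaller epsilon in noisy_count adds more noise, which trades query fidelity for stronger protection.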

4.8.2 Scenario 2

Because decisions are being made about actions involving individuals or small groups in this scenario, and because individual-level data (rather than aggregate) are being used, the fact that data is anonymized by being stripped of PII does not fully ameliorate privacy concerns.

Some approaches to investigate here are:
- ID key encryption schemes and anonymization approaches that go as far as possible to protect individual identity.
- Some form of regulatory exception (e.g. specific legal authorization or public policy exception) might also be in order, since even fully anonymized data would still refer to individuals.
- Development of ethical principles to make sure that decisions being made about individuals are fair and do not explicitly disadvantage anyone.
- This requires careful thinking about the user experience: SMS or calls that are clearly targeting the person might feel "creepy", and care should be taken not to make data subjects feel uncomfortable or targeted in any way.
- The development of trust frameworks to manage the data and verify the legitimacy of its uses.

4.9 References

[D4D] http://arxiv.org/abs/1407.4885
[UN] http://www.unglobalpulse.org/Mobile_Phone_Network_Data-for-Dev
[unique] http://www.nature.com/srep/2013/130325/srep01376/full/srep01376.html
[quantifying] http://www.sciencemag.org/content/338/6104/267.abstract
[bigdatadriven] http://web.media.mit.edu/~yva/papers/sundsoy2014big.pdf

4.9.1 Overview of African Privacy Regulation
https://docs.google.com/document/d/1tsjsadw41ymvhajqb9hcgc1s7kntab7p_jhejtepvas/edit?usp=sharing

4.9.2 Scenario development document
https://docs.google.com/document/d/1yg6w5althppw8koeigti_sr9lrotzbnw8w9eulTkP20/edit#heading=h.gjdgxs

5 Additional Use Cases

Summarized by Karen Sollins (MIT)

In addition to the three scenarios developed above, four other groups provided briefer reports. They are summarized here, in order to further broaden our understanding of the breadth of the problem domain of privacy in the world of big data. These additional topics are: (1) Privacy in Aggregated Diverse Data Sets, (2) Creation, Management, Application, and Auditing of Consent on Personal Data, (3) Consumer Privacy/Retail Marketing, and (4) Genomics and Health.

5.1 Privacy in Aggregated Diverse Data Sets

Team: Evelyne Viegas (Microsoft), Micah Altman (MIT), Yves-Alexandre de Montjoye (MIT), Elizabeth Bruce (MIT)

Overview

Microsoft is working with the research community on developing an open source platform for hosting data sets and code for the machine learning research community. CodaLab is a Machine Learning Service that allows researchers to share and browse code and data, and to create and share experiments and workflows. CodaLab helps nurture an environment of scientific rigor and open up new avenues for collaboration between researchers.

The characteristics of data that are submitted might vary widely. Such data includes well-known, previously published data, such as that from official statistics and community-managed data obtained from third parties; data collected by the authors of the submission, generally for their research; and derivative data sets prepared specifically for a publication, which may integrate, correct, annotate, and recode data from multiple sources.

The emerging challenges in this area are related to the variety of data and the limited resources that are available for vetting it. Owners of community repositories are particularly concerned with developing policies that 1) are strong enough to strengthen replicability, 2) can be applied without intense case-specific scrutiny, and 3) recognize common disclosure threats, while still permitting posting and access.
Stakeholders
- Data collector (wide range): any party that collects original data; no direct interaction with service or main scenarios; may have set terms under which data was originally collected
- Service host: provides CodaLab service and hosts storage; may impose restrictions on use
- Data subjects (wide range): no direct interaction with service or main scenarios
- Data curator: curators create "competitions" on the site, provide data to the service, set terms of use that are presented to competitors, and (optionally) vet competitors
- Data analyst: entrants in a particular competition, typically researchers who aim to develop or tune algorithms or models to optimize some quantitative competition criteria, such as % correctly predicted or mean-squared error (MSE)
- Data users: synonymous with data analysts

Questions and challenges

Key goals:
- Share research showing advancement in field (not just incremental advances)
- Find experts who can work on a (societal) problem

Key risks:
- Re-identification attacks
- Inadvertent disclosure of personal information

Identified challenges:
- What is the data life cycle?
- How does a service owner manage privacy-related risks resulting from running a service that accepts data from curators?
- Low-effort methods: must apply to many different data sets of heterogeneous types without expert analysis of each database
- Reuse across challenges: most competitions do not support reuse across challenges, or long-term access. In contrast, a goal of CodaLab is to contribute to a long-term evidence base for research in this area.
- Automatic or guided identification of PII/data currently focuses on medical/health data cases and may not be appropriate to the range of data being considered in this use case.
- How do we measure tradeoffs between utility and privacy in this use case?
- Are there automated techniques for identifying potential PII in data sets being submitted by researchers?

5.2 Creation, Management, Application and Auditing of Consent on Personal Data

Team: Simon Thompson (BT), Karen Sollins (MIT), Arnie Rosenthal (Mitre)

Overview

Personal data has many stakeholders. This scenario focuses on the ability of the subject, as an important stakeholder, to influence how their data is treated (collected, shared, used, and protected), and the ability of the controllers of personal data to abide by these preferences. Patients and other stakeholders must have incentives to share (and minimal disincentives), and to trust others to behave as they say they will. Otherwise, patients may withhold data from clinicians and record holders will resist forwarding data to others, harming patients' health, increasing costs, and slowing operational improvements and research progress.

Personal data is of many kinds, often requiring different policies. These distinctions in kind are multi-dimensional, and no single distinction dominates. We note that audit metadata and the subject's own consent specifications are themselves personal data. They do not require fundamentally different treatment, but may have some specific policies attached.

This scenario is relevant to many important verticals, including several each in Healthcare, Education, and Commerce, but what is central to this scenario is the interplay among stakeholders' wishes. These depend on the kind of information involved. In particular, the subject may have different rights with regard to different kinds of data, and especially in terms of medical content.

A critical aspect of this arena is that stakeholders, especially the subjects, deserve appropriate controls, but can rarely handle the technical complexity of specifying them. They need a way to customize behavior to be approximately correct. The regulatory framework may need to allow for situations where the user did not specify or understand all behavioral details (just as it allows sign-off on legalese that few citizens understand).

Stakeholders:

The key stakeholders considered in this review are:
- Data Subjects: those described by the data
- Record holders: the collectors and repository
- Recipients: those who may receive the data, including, for example with medical records, caregivers, payers, researchers, marketers, or legal authorities, who then may become record holders.

Questions, challenges, and observations:

Key Goals:
- Provide subjects with appropriate (to themselves) understanding and control (user preferences) over privacy policies of information about themselves.
- Balance the interplay between interests and responsibilities of different stakeholders, for example the subject, regulators, caregivers, insurance companies, etc.
- Tagging or other labeling and governance of data in order to enable application of policies.
- Certifying and maintaining the quality of the data.

Key Challenges:
- Preference data is itself metadata about the subject: Consumer preference data must be tagged by what content the preference itself reveals. A patient preference about releasing abortion data should itself be tagged as abortion-related, and cannot be shared with all record holders. It is an open question how best to combine confidentiality and usability for such data.
- Standards for composition when global standards are impossible: Global standards, globally complied with across all industries, are unlikely, especially as one adds more and more details. (A few basic practices might be standardized and complied with, but not the diversity in a modern economy.) How should stakeholders express policies that are robust, even when some information is absent?
- The diversity of enforcement mechanisms will complicate implementation: Techniques for a major corporation may be inappropriate for a small business, and techniques suitable for managing large documents may be inappropriate for millions of values in a database. For example, omitting a document differs from redacting a database value (whose absence may be noted).
- Trust: To provide an effective privacy management mechanism, the privacy metadata of personal information must be trusted, and used by trusted components, i.e., one needs an effective trust network that assures that everyone will behave appropriately.

5.3 Consumer Privacy/Retail Marketing

Team: John Ellenberger (SAP), Ilaria Liccardi (MIT), Dazza Greenwood (MIT)

Overview:

This group considered a specific example in marketing, a customer loyalty program in a brick and mortar retailer. They envisioned a system with three elements: (1) the customer's smartphone, (2) a cloud-based intermediary service, and (3) the retailer's backend. The intermediary service provides the service for communication with the smartphone, both collecting data and pushing offers. The retailer's backend collects, manages, and utilizes the customer data, and as part of that provides the support for any privacy policies and meets any legal requirements for privacy.

As an extension to this, the group also considered a case where third-party data may become available to the backend service. The group considered the problem of mapping between the "identified" data collected by the retailer and the potentially anonymized data from a third-party marketing firm.

Stakeholders:
- Subject
- Cloud service provider
- Retailer running the backend data collection, management, and analysis services
- Possible third-party marketing data source

Key goals:
- Improve the customer experience in the store
- Increase the retailer's market share
- To the extent there are regulatory requirements on privacy policy enforcement, comply with the law

Key challenges:
- Fusion of identified data, legitimately collected by the retailer, with third-party marketing data. Simply fusing these correctly is extremely difficult.
- To the extent that merging data may create "new data" about the subject, this is subject to regulations, especially in Europe, and it will require permissions from the subject.
- Inference of other facts about a subject from the base level information. For example, it is well understood that patterns of "likes" may be a good predictor of preferences not directly exposed and therefore subject to privacy policies. The demonstrated example is prediction of sexual preferences.

More broadly, this group did not consider the ethics of these approaches.

5.4 Genomics and Health

Team: James Williams (Google/University of Toronto), Michael Power (Osgoode Hall Law School)

Overview:

This scenario focuses on sharing health information (including genetic information) for both health-related research and personalized medicine. The scenario involves numerous health care providers (e.g., hospitals) and research groups (e.g., universities) collaborating to exchange information for a variety of purposes, including the provision of care. As a result, it is inherently complex; not only are there numerous organizations involved, but each of these may be subject to different legal requirements based on the jurisdiction (e.g., country, state, province) in which they operate.

While advances in genomic research methods have major ramifications for the biological sciences in general, they are particularly interesting from the standpoint of health-related research. In fact, some researchers have argued that the analysis of large genomic databases (i.e., containing millions of samples, as opposed to thousands) may be the key to unlocking new discoveries related to human health. To name but two advantages: 1) larger data sets empower researchers by supporting a wider range of queries and observations, and 2) the use of modern, distributed computing infrastructure supports interactive modes of research that offer major advantages over traditional approaches.

The situation becomes even more pressing when one realizes that many research problems can only be answered by combining genotype and phenotype data. In practice, this means the merging of genomic repositories with electronic medical records (EMRs). Indeed, the emerging field of personalized medicine is based on the ability to correlate information between these two domains. Given the multitude of health-related issues facing human populations, and the promise of genomic research and personalized medicine to address a significant number of them, it is important to develop tools and methods for fostering the sharing of genetic and phenotypic information for research purposes.
Of course, privacy is one of the most commonly cited concerns that arise when individuals are surveyed about their attitudes towards sharing health information. It is vital that such data sharing be accomplished in a manner that minimizes risks to privacy. As part of respecting privacy, individuals must be provided with the ability to control the use of their information, including withdrawing consent.

While informational privacy concerns are explicitly addressed in data protection law, fair information practices, and data sharing agreements, it is an open question as to whether we can design better mechanisms to give effect to these norms.

Stakeholders:
- Patients, subjects of the data
- Clinicians including both physicians and allied health professionals
- Researchers
- Healthcare service providers
- Institutional Review Boards (IRBs) or Research Ethics Boards (REBs)
- Regulators

Key Goals:
- Delivery of timely and effective healthcare (patients, clinicians)
- Participate in research (patients, possibly clinicians, researchers)
- Act in accordance with "fiduciary" responsibility (clinicians)
- Obtain and utilize large genomic datasets (researchers)
- Obtain and utilize large clinical (i.e. phenotype data) datasets (researchers)
- Integrate across these two types of datasets (researchers)
- Maximize efficiency of healthcare delivery (healthcare service providers)
- Utilize (and profit from) intellectual property inherent in patient records (healthcare service providers)
- Maintain security of records systems (healthcare service providers)
- Minimize privacy risks (regulators)
- Provide recourse for privacy violations (regulators)

Key Challenges:
- At present, integration is almost impossible. Most dataset access is restricted to people within the organization collecting the data.
- Integration across different regulatory authorities is poorly understood.
- The trade-offs between privacy and utility in the context of technical privacy preservation mechanisms are particularly acute in the case of genomic research.
- There is also a tension between the ability of patients (data subjects) to control the use of their information, and the ability of researchers to accumulate stable data sets for research purposes. For instance, dynamic consent mechanisms give patients control of data at the expense of researchers, whose activities may be interdicted by requests to remove data from their corpus.
- Enabling inter-jurisdictional transfer of data may require the harmonization of regulatory regimes, as well as the adoption of common standards.
- The current transaction costs for data sharing agreements are onerous for many organizations, creating a landscape of 'silos' of health information that have great utility, but which cannot be accessed.
- Existing approaches to sharing health data between organizations rely heavily upon bilateral data sharing agreements. This approach scales poorly when there are multiple organizations that wish to jointly share data.
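
The scaling concern in the last challenge can be made concrete with a small calculation: if every pair among n organizations must negotiate its own agreement, the number of agreements is n(n-1)/2. The following is a minimal illustrative sketch. The function names, and the comparison with a single shared framework that each organization signs once, are assumptions added here for illustration and are not a proposal made by the scenario team.

```python
def bilateral_agreements(n: int) -> int:
    """Pairwise agreements needed if every pair of n organizations signs one."""
    if n < 0:
        raise ValueError("number of organizations must be non-negative")
    return n * (n - 1) // 2


def shared_framework_signatures(n: int) -> int:
    """Hypothetical alternative: each organization signs one common agreement."""
    if n < 0:
        raise ValueError("number of organizations must be non-negative")
    return n


if __name__ == "__main__":
    # Compare how the two counts grow as more organizations join.
    for n in (2, 5, 10, 50):
        print(n, bilateral_agreements(n), shared_framework_signatures(n))
```

Under these assumptions the bilateral count grows quadratically while the shared-framework count grows linearly, which is one way to read the statement that the bilateral approach scales poorly.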

6 Conclusions
Karen Sollins (MIT)

We generalize three sets of conclusions from the review of the scenarios described above in sections 2 through 5. The first is a set of overarching challenges derived from the systemic approaches taken across these big data scenarios in consideration of privacy. The second is a common although not universal set of types of stakeholders in handling both the big data itself and in support of the application of privacy policies. Finally, we observe a number of key open questions, raised by the set of scenarios.

We observe five key challenges from the scenarios:
- Scale: Not only are we observing increasing sizes of datasets, but also those increases in size will lead to increases in size of the accompanying meta-data that is critical to the support of privacy. Without significant improvements in efficiency, the growth in both data and meta-data will lead to untenable processing times, but this must be achieved without cost to privacy.
- Diversity: With increasing dataset sizes will also come an increase in interests and types of responsibilities. This increase is likely to lead to increased probability of non-aligned interests. This diversity of objectives and interest will lead to at least a divergence of privacy policies and more likely to increased incompatibility of privacy policies. Capabilities for both observing and handling such differences will become increasingly important.
- Integration: In addition to the points above of scale and diversity, services increasingly support the integration of previously independent datasets. At a minimum this can lead to surprising or unintended inferences across these newly integrated datasets, resulting in previously unknown facts about subjects. Thus a new challenge arises from this integration in terms of privacy policies for these newly discovered facts or data.
- Impact on secondary participants: Although data may itself have a primary subject, increasingly there will also be secondary participants or subjects, such as friends, parents, guardians, or bystanders, also reflected in the data. Providing privacy through privacy policies for these secondary participants may be even more challenging than for the primary subjects of data.
- Need for emergent privacy policies for emergent data: Integration may lead to emergent, or previously unobservable data about subjects. This newly observable data will also require privacy policies, and it is not clear that those new policies will simply be a derivative of the policies applicable to the underlying original data. It is likely that new, emergent privacy policies will be needed, and the challenge is how those new policies will be created, by whom and under what conditions.

The second set of key observations we derive from these scenarios is a list of types of stakeholders, who play a role in setting, enforcing and mitigating the failure of application of privacy policies. We begin with the subjects of the data itself. In some cases, but not all, they play a role in determining applicable privacy policies. Additionally, a decision-maker, who decides what data to collect and how to handle it, may play a significant or central role in setting privacy policies. From there we move to the "handlers" of the data. That data will be collected by some party, and may be separately curated for completeness, accuracy, and so forth by a curator. The data may then be stored, managed, and made available by a data platform provider. It will then be used by a data analyst. All of these last four have access to the data in one form or another. We have then also identified two additional types of stakeholders, whose roles focus on enforcement of privacy policies and recording or auditing of usage of the data. These two final roles are distinct from each other. It is possible to have auditing without enforcement, for either legal or mitigation reasons, if a policy is violated. Enforcement benefits significantly from auditing, but is not dependent on it.

Finally, we recognize that there are many open questions. We highlight four here:
- Novelty: Although we identified a number of challenges above, there remains a question of whether big data leads to new and unique challenges in the provision of privacy, or whether these challenges are only more obvious in the big data arena.
- Tradeoff: Each of the scenarios presents a significant benefit. These may be economic, social, medical, and so forth. In addition, each presents risks to privacy, both inherently and perhaps because the situation is still new and not well understood. We must ask how to evaluate the tradeoffs between benefits and risks, specifically to privacy. At this point, we do not even have a metric or spectrum along which to consider this tradeoff, and it is not clear that a single one exists.
- Harm: The risk to privacy mentioned above is neither binary nor necessarily stable. This leads to a question of whether and how to evaluate the harm that may result from different choices in the tradeoff space between benefits and risks.
- Trust: Trust reflects a willingness among stakeholders to accept vulnerabilities. Thus, we must ask how it is that stakeholders determine their level of trust or mistrust in other stakeholders, with respect to the applicability of privacy policies. This includes both the stakeholders' models of trust, how those relate to people's perceptions of each other, as well as what mechanisms and technologies can provide in support of those levels of trust. Furthermore, one must ask how such trust evolves with time and how that might be supported technically.
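
The separation of the auditing and enforcement roles noted in this section can be sketched in code. The following is a minimal sketch only, not a design drawn from the scenarios: the `PolicyGate` class, its fields, and the purpose-based policy check are hypothetical names and simplifications introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class PolicyGate:
    """Hypothetical access gate with independently switchable audit and enforcement."""
    allowed_purposes: Set[str]
    enforce: bool = True   # if False, non-compliant requests are still granted
    audit: bool = True     # if False, nothing is recorded
    log: List[Tuple[str, str, bool, bool]] = field(default_factory=list)

    def request(self, user: str, purpose: str) -> bool:
        # A request is compliant when its stated purpose is permitted by policy.
        compliant = purpose in self.allowed_purposes
        # Enforcement decides the outcome; auditing only records it.
        granted = compliant or not self.enforce
        if self.audit:
            self.log.append((user, purpose, compliant, granted))
        return granted


# Auditing without enforcement: violations are permitted but leave a record.
audit_only = PolicyGate({"research"}, enforce=False, audit=True)
# Enforcement without auditing: violations are blocked but leave no record.
enforce_only = PolicyGate({"research"}, enforce=True, audit=False)
```

In this sketch a request for a non-permitted purpose succeeds on `audit_only` and is logged as non-compliant, while the same request on `enforce_only` is refused and leaves no trace, which mirrors the observation that enforcement benefits from auditing but does not depend on it.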
It is important to recognize that our observations here are limited. They are based on this limited set of scenarios, and even in that context, may be incomplete. They are presented to give the reader a clearer sense of the sorts of challenges and questions that arise from the intersection of big data and privacy.

A. Appendix: Privacy Scenario Template
Team: Simon Thompson (BT) & Dazza Greenwood (MIT Media Lab)

Elements of Big Data scenario
- People/Stakeholders? (i.e., Who are the parties, their respective roles and relationships? Who is data owner (data controller)? Who is using the data and what is the intended purpose? Who are the data subjects? Who is doing the data analytics?)
- Interactions? (i.e., What transactions or other exchanges between Actors?) (What is the power dynamic?)
- Data
  - What kind of personal data?*
  - What type of big data models, analytics, or other outputs result from this scenario?
  - How is the data used?
  - What's the data lifecycle?
- Systems? (i.e., What business, legal, technical, or social systems matter most?)
  - Business Systems (Ethics committees, sign-off by authorized officers, record keeping, audit)
  - Legal Systems (Contracts, Employee rules/procedures, certification/accreditations, compliance reviews, insurance/bonding requirements, industry standard policy/guidelines, etc.)
  - Technical Systems (System permissions and security, alarms & automated detection of PAI, automatic anonymization of data, cryptography, etc.)
  - Social Systems (What social systems and context exist?)

Analysis of scenario
- Goals (i.e., What are the incentives and the benefits driving the Actors? Who benefits? What are financial incentives?)
- Rules: (i.e., What are the relevant laws and regulations, other enforceable rules)
  - Are there existing statutes, contractual agreements or other commitments associated with the data?
    i. Rules about retention,
    ii. Liability for breach?
    iii. Accuracy?
    iv. Others...
  - If there are not statutory or other binding rules, how would the principles from the Consumer Privacy Bill of Rights guide the development of rules?
    i. INDIVIDUAL CONTROL: Consumers have a right to exercise control over what personal data companies collect from them and how they use it.
    ii. TRANSPARENCY: Consumers have a right to easily understandable and accessible information about privacy and security practices.
    iii. RESPECT FOR CONTEXT: Consumers have a right to expect that companies will collect, use, and disclose personal data in ways that are consistent with the context in which consumers provide the data.
    iv. SECURITY: Consumers have a right to secure and responsible handling of personal data.
    v. ACCESS AND ACCURACY: Consumers have a right to access and correct personal data in usable formats, in a manner that is appropriate to the sensitivity of the data and the risk of adverse consequences to consumers if the data is inaccurate.
    vi. FOCUSED COLLECTION: Consumers have a right to reasonable limits on the personal data that companies collect and retain.
    vii. ACCOUNTABILITY: Consumers have a right to have personal data handled by companies with appropriate measures in place to assure they adhere to the Consumer Privacy Bill of Rights.
- Risks: What are the potential harms? What are the risks of those harms occurring? To whom? If the risk is an externality, how might it be mitigated?

Assessment of scenario
- Existing or related best practices for context of this scenario
  - What business, legal, and/or technical best practices?
- Gap
  - Issues Not Addressed by Existing Practices and Solutions
    - Business Systems
    - Legal Systems
    - Technical Systems
    - Social Systems
  - Shortfall Between Current and Needed Practices and Solutions

Key outcomes for each scenario
- Promising best practices
- Gaps that need to be filled with new tech solutions or policy approaches

* Personal Data is defined broadly, as follows, from the Consumer Privacy Bill of Rights. "This term refers to any data, including aggregations of data, which is linkable to a specific individual. Personal data may include data that is linked to a specific computer or other device. For example, an identifier on a smartphone or family computer that is used to build a usage profile is personal data. This definition provides the flexibility that is necessary to capture the many kinds of data about consumers that commercial entities collect, use, and disclose."

B. Appendix: Stakeholders
Elizabeth Bruce (MIT), Karen Sollins (MIT)

Data Stakeholders: Description/Examples

"data collector"
  Party that collects the "raw" or original data from the data subjects.

"data subject(s)"
  A person (e.g. a patient, student, customer) or group of people (or entity) that data is being collected from; this is the group of data providers or participants. Subjects may be contributing data with informed consent (e.g. by opting in to a research study); or data may be collected indirectly or in aggregate. Data may be generated by:
  - an individual/consumer (e.g. taking an online class, a customer at a bank)
  - the interactions of a group of individuals (e.g. peer to peer interactions; social network graphs)
  - combining/aggregating data over a group/population of subjects.

"data curator" (also: controller, provider or caretaker)
  Party that stores and manages the data and is responsible for granting/controlling access to the data; the data curator is often the stakeholder that requires others to formally submit to a policy (or data use agreement) in order to access the data. There may be more than one data curator:
  - original data curator
  - third-party data curators

"data analyst" (also: data scientist)
  Party doing the analytics on the data; may use many different types of tools, software etc. for analysis, exploration and visualization ("relying party").

"decision maker"
  The stakeholder(s) that benefits from the data; a decision maker that ultimately derives new insights and value from the data analysis; this stakeholder will ultimately make decisions based on the data and may or may not take action for some purpose. This purpose or use of the data may be for: personal benefit; for-profit or commercial use; or societal benefit (e.g. NGOs/government). Data beneficiary may be:
  - an individual
  - a group of individuals
  - an institution or organization (private; commercial; government; non-profit)
  - a content provider
  - a service provider

"data platform provider"
  The party that builds the system(s) for data collection and provides a service. Platform provider and data collector may or may not be the same entity/organization. In the case that they are different, the platform provider may have its own data use policy separate from the data collector.

"data regulator(s)"
  An arbiter that sets policies; the governing regulatory body that develops policies that control data collection, sharing and use among stakeholders; could be at the local, state, federal, or international level (e.g. HIPAA, FERPA etc.).

"data auditor"
  The enforcing body responsible for ensuring the policies and regulations are enforced. May require audit logging and documentation to ensure policies are enforced and data is managed as required.

C. Appendix: Stakeholder Data from MOOCs and Online Learning Environments (OLEs)
Elizabeth Bruce (MIT)

Data Stakeholder: Example

Type of Data
  All clickstream data capturing interactions between student and content, including when students watch videos/lessons, quiz answers, text from discussion forums, etc. Use of videos and other e-resources, such as digitized reference material, wikis, and forums. Assessment behavior: attempts, correctness, use of immediate feedback. May include PII (name, email, address) depending on what information is required when registering for the course. Self-reported background, pre- and post-test surveys.

Data Subjects
  Students who take the online course, complete assignments and receive credit. Students have zero or little access to their data beyond official grade/records created for their education purposes.

Data Platform Provider
  Coursera, EdX, Udacity, Stanford U, etc.

Content Provider
  Individual Content Providers include faculty, teachers, and staff who provide the teaching content and material (videos, lessons, quizzes, etc.), support discussions, interact with students (the data subjects) directly, and are responsible for grading/credit.
  Institutional Content Providers include institutions and organizations that are behind the teaching content (i.e. MIT, Harvard, or an individual private enterprise).

Data Collector
  Data Platform Providers and Institutional Content Providers

Data Curator
  Data Platform Providers and Institutional Content Providers

Data Scientist
  Analysts include researchers, their students (if the researchers are academics), and education technologists. Teaching staff, platform providers, and institutional content providers may also act as analysts.

Decision Maker
  Typically the Data Platform Providers and Institutional Content Providers, sometimes the Individual Content Providers (i.e. the teachers)

Data Auditor (and Compliance)
  Government

Data Regulator
  Government; FERPA policies