Computer Science and Artificial Intelligence Laboratory
Technical Report MIT-CSAIL-TR-2015-030
October 1, 2015

Big Data Privacy Scenarios
Elizabeth Bruce, Karen Sollins, Mona Vernon, and Danny Weitzner

Massachusetts Institute of Technology, Cambridge, MA 02139 USA
www.csail.mit.edu
Big Data Privacy Scenarios
Big Data Privacy Working Group
September 2015

Big Data Privacy Working Group Chairs:
Elizabeth Bruce (MIT)
Karen Sollins (MIT)
Mona Vernon (Thomson Reuters)
Danny Weitzner (MIT)
Acknowledgements

We gratefully acknowledge the many contributors to this Scenario Working Document. This includes all of the Big Data Privacy Working Group leaders, team members, and guides for their thoughtful efforts. A special thank you to Dazza Greenwood of MIT Media Lab and Simon Thompson from BT for creating the original template for the scenario summaries.

Big Data Privacy Scenario Contributors/Teams: Micah Altman (MIT), Elizabeth Bruce (MIT), David Dietrich (EMC), John Ellenberger (SAP), Dazza Greenwood (MIT), Maritza Johnson (Facebook), Lalana Kagal (MIT), Jake Kendall (Gates Foundation), Cameron Kerry (MIT), Ilaria Liccardi (MIT), Yves-Alexandre de Montjoye (MIT), Una-May O'Reilly (MIT), Michael Power (Osgoode Hall Law School), Arnie Rosenthal (Mitre), Karen Sollins (MIT), Simon Thompson (BT), Mona Vernon (Thomson Reuters), Evelyne Viegas (Microsoft), and James Williams (Google/University of Toronto)

Big Data Privacy Working Group Editor: Barbara Mack (Pingry Hill Enterprises, Inc.)
Table of Contents

Executive Summary
  Use Case: Massive Open Online Courses (MOOCs) and Online Learning Environments (OLEs)
  Use Case: Research Infrastructure for Social Media
  Use Case: Data for Good: Public Good and Public Policy Research Using Sensor Data/Mobile Devices
  Other Use Cases
  Conclusions
1 Introduction
  1.1 Overarching Observations
  1.2 Stakeholders
  1.3 Open Questions and Issues
  1.4 Remainder of This Document
2 Privacy Issues for Data Collected from MOOCs and Online Learning Environments
  2.1 Abstract
  2.2 Detailed Narrative
  2.3 Privacy Impact Assessment - The Specific Context of Scenario 1
  2.4 Goals of OLEs
  2.5 Data
  2.6 Systems
  2.7 Risks
  2.8 Rules/Regulations
  2.9 Technologies
  2.10 Privacy Constraints
  2.11 Technology Informing and Supporting OLE Data Privacy and Confidentiality Policy
3 Research Infrastructure for Social Media
  3.1 Abstract
  3.2 Scenario Introduction
  3.3 Stakeholders and Interactions
  3.4 Systems
  3.5 Analyze the Scenario
  3.6 Innovation Ideas and Opportunities
  3.7 Notes on Scenario
  3.8 References
4 Data for Good: Public Good and Public Policy Research Using Sensor Data/Mobile Devices
  4.1 Abstract
  4.2 Scenario Development
  4.3 Operation of Scenarios
  4.4 Regulatory Environment
  4.5 Data Utility
  4.6 Privacy
  4.7 Critical Issues
  4.8 Promising Paths Forward
  4.9 References
5 Additional Use Cases
  5.1 Privacy in Aggregated Diverse Data Sets
  5.2 Creation, Management, Application and Auditing of Consent on Personal Data
  5.3 Consumer Privacy/Retail Marketing
  5.4 Genomics and Health
6 Conclusions
A. Appendix: Privacy Scenario Template
B. Appendix: Stakeholders
C. Appendix: Stakeholder Data from MOOCs and Online Learning Environments (OLEs)
Executive Summary
Karen Sollins (MIT)

The MIT Big Data Privacy Working Group launched a series of workshops beginning in 2013 to explore the challenges and possible technological solutions to elements of those challenges. As a successor to those workshops, the working group began to focus on a collection of real-world scenarios and use cases, to illuminate the challenges more concretely.

The deeper question explored by this exercise is what is distinctive about privacy in the context of big data. Although privacy as a general issue in computing and communications remains a topic of significant attention and disagreement, in this effort we narrow our attention to the "Big Data" context, to understand more clearly the particular challenges and possible approaches that derive from the collection, pooling, and combination of vast amounts of data, specifically about people. This focus on people as the subjects of attention in the big data context is central to the definition of privacy, which itself focuses on control of data, information and inferences about people and how they can or should be used, exposed, or otherwise made available.

We summarize here an initial list of issues for privacy that derive specifically from the nature of big data. These derive from observations across the real-world scenarios and use cases explored in this project as well as wider reading and discussions.
- Scale: The sheer size of the datasets leads to challenges in creating, managing and applying privacy policies.
- Diversity: The increased likelihood of more and more diverse participants in Big Data collection, management, and use leads to differing agendas and objectives. By nature, this is likely to lead to contradictory agendas and objectives.
- Integration: With increased data management technologies (e.g. cloud services, data lakes, and so forth) comes integration across datasets, with new and often surprising opportunities for cross-product inferences, and with it new "information" about individuals and their behaviors.
- Impact on secondary participants: Because many pieces of information are reflective of not only the targeted subject, but secondary, often unattended, participants, the inferences and resulting information will increasingly be reflective of other people, not originally considered as the subject of privacy concerns and approaches.
- Need for emergent policies for emergent information: As inferences over merged datasets occur, emergent information or understanding will occur. Although each unique dataset may have existing privacy policies and enforcement mechanisms, it is not clear that it is possible to develop the requisite and appropriate emergent privacy policies and appropriate enforcement of them automatically.

The primary content of this report is a number of real-world scenarios, resulting from discussion and then subgroup efforts within the privacy working group. Each case was analyzed along a collection of axes: key stakeholders, data lifecycle, key systems, potential privacy risks, and existing best practices within the context of that scenario. The template was laid out initially by Dazza Greenwood of the MIT Media Lab and Simon Thompson of BT and can be found in Appendix A.
As a result of collating these scenarios, two kinds of points emerged across them. The first is a small set of common questions. The second is a list of categories of stakeholders. We summarize those here.

The key questions that arose are:
- What new/unique challenges emerge when it comes to managing privacy in the context of big data?
- How do we assess benefit vs. risk?
- How do we evaluate "harm"? Given that harm is subjective, difficult to quantify, and falls on a spectrum from inappropriate online advertisements to discrimination in setting insurance rates to life or death medical intervention, is it possible to evaluate harm uniformly and if so, how would one do that?
- How can we establish and assess trust among the stakeholders? What mechanisms/models do we have for understanding trust?

A table of the categories of stakeholders derived from the scenarios can be found in Appendix B. In addition, Appendix C demonstrates an application of these stakeholder categories to the first use scenario on MOOCs and OLEs.

The initial list of categories of stakeholders includes:
- Data subject(s)
- Decision-maker
- Data collector
- Data curator
- Data analyst
- Data platform provider
- Policy enforcer
- Auditor

Both of these sets of points are discussed in more detail in the companion technology-mapping document, and are provided here to identify crosscutting observations from the various scenarios. Although the currently identified set of potential stakeholders is listed here, it is important to recognize that privacy is a much more complex problem that concerns more than the stakeholders alone.

The Working Group explored seven use cases. This report presents three in their complete forms in Sections 2-4; those three cases are described briefly in the executive summary. In addition, Section 5 presents summaries of the additional four cases, which were studied in less detail.
Use Case: Massive Open Online Courses (MOOCs) and Online Learning Environments (OLEs)

Any online learning situation provides an opportunity to record all the activities of everyone involved in the teaching experience, primarily but not exclusively students and teaching staff. MOOCs as a subset of online learning take this to new scales and often to new levels of automation as well as expanding roles in the collection of, responsibility for, and use of the data that derives from those teaching experiences.
In focusing on privacy in this context, one is concentrating on questions of which behaviors and information about individuals may be exposed in ways that they may find contradict their models of privacy. The challenges arise at least in part from the new opportunities that MOOCs provide to collect, merge, and reason over educational data at a scale and with an ease not previously possible. The data may now be used in novel ways and involve new stakeholders including data curators, data platform providers, researchers, and those interested in novel approaches to pedagogy. The challenge is to achieve that in ways that respect the privacy of the individual student, perhaps the teaching staff, and possibly secondary people as well, such as parents and guardians, especially in the face of asymmetric power relationships. One aspect of the challenge is to understand the implications of privacy "violations" in this context. They may arise not only from the direct exposure of information about the individual that was neither intended nor desired, but also from more subtle concerns over discrimination, harassment, inaccessibility, or violation of other civil and human rights.
The contribution identifies a number of key insights into privacy challenges that arise in the MOOC and OLE arenas, including:
- The nature of the information being collected, including clickstreams, contributions to online discussions, forums, and questionnaires, as well as behaviors with respect to both accessing and submitting content (reading, watching online lectures or videos, attempts at doing homework, etc.);
- Tools and norms for expression of privacy policies, including current, future, aggregation, and integration with other data;
- The tussles in objectives among students, teaching staff, owners of the educational content, crowd or student provision of contributions (through grading or social networking facilities) to the experience of other students, institutional hosts, educational systems (such as municipal school systems or state university systems), researchers and analysts, and service providers such as data curators, data storage and analysis services;
- The nature of the potential privacy violation harms to the various stakeholders;
- Translation of the Family Educational Rights and Privacy Act (FERPA) into this increasingly rich, complex, growing, and evolving domain in which educational data is collected, curated, collated and perhaps integrated;
- The fact that this is previously uncharted territory with social, legal, and moral challenges as yet not clearly identified, which is also evolving due to increased technological capabilities, often independently of privacy objectives and interests.

Use Case: Research Infrastructure for Social Media

The behaviors of individuals and groups online can provide the basis for significantly deeper understanding and prediction of human behaviors and interests. The kinds of data that can be useful in gaining that increased "social" understanding range from the various contributions made by individuals, such as text, photos, various kinds of streaming media and other information relating to the participants, to logged information such as clickstreams, frequency and other patterns of access, etc. At present, access to such social media information is primarily restricted to in-house analysis by social media organizations.
The question explored by this group is whether and how one might provide a "privacy framework" for such information, giving the subjects opt-in control of which information about themselves can be made available for broader studies and wider availability of the information. The intention is that permission for use remains with the subjects, but by giving them the opportunities to share, richer and larger studies can occur, with all the potential societal benefits that those studies might entail. The subject must be given control over both the granularity and types of the data, including both static data such as birthdate, address, job history and so forth, and dynamic data such as ongoing posts in various media.

In terms of the stakeholders, there are three key participants: 1) the subjects themselves, 2) the social media organizations, who will play the role of data collectors, often data curators, and data platform providers, and 3) the data analysts, who may also play the role of data curators, if they provide added understanding (curation) over the datasets. There are two general approaches to making the data available. The first is to generate slices, on some regular basis, of the data that is to be exposed and deliver them to the analysts. The alternative is to retain all data on a controlled service with a clearly defined API, providing only constrained access to the data. The first gives the analyst more freedom to explore, but reduces the subject's ability to retain control, especially with respect to withdrawing from a study retroactively.

There are at least four contexts in which such a system must operate: legal, social, business, and technical. The challenge is that privacy must be respected in the context of all of these domains simultaneously.
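The difference between the two access approaches described above (delivered slices versus a controlled service with an API) can be illustrated with a small sketch. This is only an illustration under stated assumptions: the class `ControlledDataService`, its methods, and the sample records are hypothetical and are not taken from the study. The sketch shows why a withdrawal can take effect for later queries against the service but cannot reach a slice that has already been delivered.

```python
from dataclasses import dataclass, field


@dataclass
class ControlledDataService:
    """Hypothetical controlled service: data stays here, analysts only query."""

    # subject id -> {field name -> value}; all names are illustrative
    records: dict = field(default_factory=dict)
    # subject id -> set of field names the subject has opted in to share
    consents: dict = field(default_factory=dict)

    def opt_in(self, subject_id, fields):
        """Record the subject's opt-in for specific fields (granular control)."""
        self.consents.setdefault(subject_id, set()).update(fields)

    def withdraw(self, subject_id):
        """Remove all of a subject's consents; affects every later query."""
        self.consents.pop(subject_id, None)

    def query(self, field_name):
        """API approach: consent is checked at query time."""
        return [
            rec[field_name]
            for sid, rec in self.records.items()
            if field_name in rec and field_name in self.consents.get(sid, set())
        ]

    def export_slice(self, field_name):
        """Slice approach: a copy leaves the service and escapes later control."""
        return list(self.query(field_name))


service = ControlledDataService(
    records={
        "s1": {"birth_year": 1980, "posts": 12},
        "s2": {"birth_year": 1990, "posts": 3},
    }
)
service.opt_in("s1", ["birth_year", "posts"])
service.opt_in("s2", ["birth_year"])  # s2 shares birth year but not posts

delivered_slice = service.export_slice("birth_year")  # handed to an analyst
service.withdraw("s2")                                # s2 leaves the study

after_withdrawal = service.query("birth_year")  # no longer includes s2
posts_visible = service.query("posts")          # only s1 ever opted in
```

After the withdrawal, `after_withdrawal` no longer contains the second subject's value, while `delivered_slice` still does. A real system would also need authentication, auditing, and a policy for results already derived from earlier queries, none of which is modeled here.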
The study group identified a list of risks or challenges to privacy that must be considered in such a scenario, including:
- Unexpected inference resulting from the analysis;
- Unexpected harm due to modifications of the data platform, due to inferences, or to the nature of the research itself;
- Unpredictable bias in the resulting research based on bias in the self-selecting nature of participation;
- Unexpected correlation between the study subject population and the general population;
- Removal from studies after agreeing to participate;
- Control of downstream use of the data, beyond the original analyst agreement. This raises questions ranging from provenance (who has touched the data and how might they have modified it), to how to enforce policies beyond the bounds of pairwise agreements, to identification and recourse for misuse, for starters;
- Responsibility for data breaches both by the social media provider acting as repository and curator and by the researchers and analysts;
- Finding the balance between privacy and publication of results;
- Management of informed consents;
- Automation of as much of this as possible, while understanding the risks that may be introduced through such automation.

The study also identified some key technologies that exist and some places where technologies are needed, but not yet available.

The scenario is based on a current collaborative study involving the Technical University of Denmark and the MIT Human Dynamics Laboratory.
Use Case: Data for Good: Public Good and Public Policy Research Using Sensor Data/Mobile Devices

The challenge faced in this scenario is to take advantage of mobile phone data (mobility data) with one of two possible objectives. The first is to model and predict outbreaks of epidemics and the second is to enable micro-targeting of individuals or groups of people with interventions in order to reduce or prevent outbreaks of epidemics. The geographic region of focus in this work is Africa. Of particular interest are people moving between areas where an epidemic may be more prevalent and those where it may be less so.

In addition to the two kinds of objectives, the study examines two distinct system designs or implementations. In all cases, the original data is collected by the mobile network operators (MNOs). In one implementation, each MNO anonymizes and coarsens the data both spatially and temporally. Thus, for example, the time may be reported in 12-hour blocks representing day and night and location may be represented as particular regions where malaria is prevalent or not. The individuality of each record is retained. This enables the targeting of individuals through one of two means. The anonymized identifier is presented to the MNO, which in turn either provides access information to the analyst or acts as an intermediary conveying information between the analyst and subject. In the other implementation design, data is merged on a regional basis before being aggregated, so for example, the MNO might report that a specific percentage of the residents of one area spent a different specific percentage of nights in a different target area. This second design significantly increases the subject's privacy and reduces the possibility of re-identification or exposure, but it also reduces the accuracy and potential utility of the data.
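The two system designs described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the study's implementation: the tower-to-region table, the salted-hash pseudonym, the 06:00-18:00 definition of "day", and the sample records are all hypothetical. A salted hash is used only to show that the MNO can map an identifier back to a subscriber; on its own it should not be read as sufficient anonymization. The aggregate in the second function (share of night-time observations per region, grouped by home region) is a simplified stand-in for the percentages mentioned in the text.

```python
import hashlib
from collections import defaultdict
from datetime import datetime

# Hypothetical mapping from cell tower to a coarse region label.
TOWER_REGION = {
    "t1": "high_prevalence",
    "t2": "high_prevalence",
    "t3": "low_prevalence",
}
SALT = b"operator-secret"  # assumed to be known only to the MNO


def pseudonym(phone_number):
    """Stable identifier that only the MNO can link back to a subscriber."""
    return hashlib.sha256(SALT + phone_number.encode()).hexdigest()[:12]


def is_night(timestamp):
    """Coarsen time into two 12-hour blocks (day is assumed to be 06:00-18:00)."""
    hour = datetime.fromisoformat(timestamp).hour
    return not (6 <= hour < 18)


def coarsen(record):
    """Design 1: one row per observation, coarsened in space and time."""
    phone, tower, timestamp = record
    block = "night" if is_night(timestamp) else "day"
    return (pseudonym(phone), TOWER_REGION[tower], block)


def night_shares_by_home(records, home_region):
    """Design 2: per home region, share of night observations in each region."""
    counts = defaultdict(lambda: defaultdict(int))
    for phone, tower, timestamp in records:
        if is_night(timestamp):
            counts[home_region[phone]][TOWER_REGION[tower]] += 1
    return {
        home: {region: n / sum(by_region.values()) for region, n in by_region.items()}
        for home, by_region in counts.items()
    }


# Hypothetical call records: (phone number, tower, ISO timestamp)
records = [
    ("555-0001", "t1", "2015-03-01T23:00:00"),
    ("555-0001", "t3", "2015-03-02T23:00:00"),
    ("555-0002", "t3", "2015-03-01T22:30:00"),
    ("555-0002", "t3", "2015-03-02T12:00:00"),
]
home_region = {"555-0001": "low_prevalence", "555-0002": "low_prevalence"}

individual_rows = [coarsen(r) for r in records]              # design 1 output
regional_shares = night_shares_by_home(records, home_region)  # design 2 output
```

In the first output each row still belongs to one pseudonymous person, which is what makes targeting through the MNO possible. In the second output only regional proportions remain, so no individual row exists to re-identify, at the cost of detail for the analyst.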
This study identified a number of challenges:
- The scenario exposes a direct tradeoff between health risks (and possible mitigation) for the individual and personal privacy;
- The scenario also exposes a direct tradeoff between analysis capabilities and personal privacy;
- MNOs are generally not in the business of anonymizing, curating and providing data to other entities. In these cases, the analyst role is often taken on by national health ministries;
- The legal bases for privacy in Africa are complex and generally based in historical tradition from the countries that colonized them in previous centuries. Those in Western and Northern Africa mostly derive from the French civil code, with explicit privacy frameworks, and are closely related to the European privacy directive. Those such as South Africa that derive from the English common law tradition have much less concrete policies with respect to privacy. To add to this, as populations move from one country to another, they may also be moving from one privacy policy model to another.

The intention of this use-case study was to allow the group to elicit commonalities and distinctions among the cases that might allow us to generalize. That in turn also has provided the basis for a companion paper, which concentrates on current and near-term future tools to improve the possibility of providing privacy, while continuing to allow for Big Data analysis and the benefits that accrue from that.
Other Use Cases

The report concludes with a brief summary of the additional four use cases examined by the working group. These include privacy under conditions of integrating over diverse datasets, the creation and management of user consent over exposure and use of personal data, consumer privacy and retail marketing, and genomics and health.

Conclusions

From these scenarios we draw three categories of conclusions. The first is a set of common overarching challenges. In order of increasing complexity these are:
- Scale: The sheer size of both the data itself and the accompanying meta-data that is necessary to manage it and provide privacy policies is increasing.
- Diversity: With growth, we also see an increase in the types of data, interests of analysts or users of the data, and richness of privacy policies in these new scenarios.
- Integration: There is increasing pressure and opportunity to merge or cross-fertilize among these diverse datasets. This leads to results that may have previously been inaccessible, but that are exposed through perhaps differing integrated observations of the individual.
- Secondary subjects: Although much data is based on primary subjects, it may also, perhaps inadvertently, reflect on secondary subjects. Handling privacy policies for this more integrated situation is significantly more complex than the policies applicable to a single subject.
- Emergent privacy policies: With both the integration of datasets and the increasing capture of data about secondary subjects, there is also a need for privacy policies to reflect this emergent data. The challenge of how these new policies come into existence will play an increasingly important role.
These scenarios have provided us with a basis for an initial observation about the differing stakeholders involved in the handling of big data and the privacy policies applicable to them. We begin with the subjects themselves, perhaps both primary and secondary, and the decision-makers who set out to have the data collected and made available. We then identify a set of different stakeholders having to do with the collection, management and provision of the data. This includes the actual data collector, the data curator, and the data platform provider. We then identify three kinds of stakeholders involved in the activities of usage of the data: the data analyst, the privacy policy enforcer, and the data access auditor.

With these challenges and observations in mind, we also recognize that there are a number of open questions. These questions revolve around several key elements. The first is whether or not big data brings new challenges to the provision of privacy or whether it exposes existing problems perhaps more clearly. More important are questions of risk vs. benefit tradeoffs. One of the challenges one faces here is that privacy, and the risk of violation of privacy, is not binary and perhaps not even measurable. Thus, one is then led to ask about the harms that may result from different levels of privacy policies and/or the violations of those privacies. Finally, we are left with a set of questions related to trust: how it comes into existence, how it may evolve, how humans' trust can be modeled, and how trust may be supported technically.
We note that this set of observations, challenges and questions is only representative of what one might draw even from this limited set of scenarios. A broader study might lead to yet more challenges and questions.
1 Introduction
Karen Sollins (MIT)

The vast amounts of diverse data that are now being called Big Data present society with an extremely interesting set of challenges, beginning with how to use any one such dataset for a wide and increasing set of opportunities. These may range from improved product recommendations to improved modeling of human mobility in regions of infectious diseases to many other points in between. But big data presents additional opportunities that include a broader and deeper understanding across such datasets. If one can merge mobility data with medical histories, for example, one might provide a much more accurate model of potential epidemics, depending on both mobility and prior epidemics of diseases to which immunities are developed.

At the same time, societies and communities are becoming increasingly concerned over the questions of who knows what about them and whether or not they have control over those data collectors and analyzers knowing things about them. The concern is captured in the word "privacy." The "problem of privacy" is in fact a complex and subtle one, with many challenges and often too few solutions to those challenges. One must ask questions such as, "Who is the subject of the data?" There may be a primary subject, but data about interactions may have multiple primary subjects. There may be secondary subjects, such as the parents or legal guardians of a child who happens to be the subject of the data. In addition, one can ask questions about who else is involved with the data in various ways, such as collecting or storing it, protecting it, "curating" it for accuracy and completeness, analyzing it, and so forth. One can also ask what policies should be applied to the data for controlling access to it, to meet any privacy constraints from a legitimate policy source. Or, how might that policy be enforced? Or how can one be confident (trust) that the policy is either being defined by a legitimate policy source or being enforced by a trustworthy enforcer? And so forth. The questions of what is meant by privacy, who can define appropriate privacy and how that might be implemented are only now beginning to be examined, with significant progress in some areas and less advancement in others.
The challenge we face in the Big Data arena is at the intersection of these two driving forces: big data itself and all that it has the potential to provide, and privacy, as it becomes increasingly well-understood to be a design-driver for systems in the cyber-age.

The MIT Big Data Privacy Working Group concentrates on this problem domain. To that end, several workshops were organized by and held at MIT. [1] In addition, the Working Group took on two initial agenda items: 1) documentation of a set of scenarios in order to better illuminate some of the central challenges to providing privacy in a "Big Data" world; 2) roadmapping of current and near-term future technologies that have promise of addressing parts of the privacy in big data challenge. This document is the first of these.

[1] See workshop reports:
1. Big Data Privacy: Exploring the Future Role of Technology in Protecting Privacy, June 19, 2013. Available at: http://bigdata.csail.mit.edu/sites/bigdata/files/u9/mitbigdataprivacy_wkshp_2013_finalvweb.pdf
2. MIT White House Big Data Privacy Workshop: Advancing the State of the Art in Technology and Practice, March 3, 2014. Available at: http://web.mit.edu/bigdata-priv/images/mitbigdataprivacyworkshop2014_final05142014.pdf

In the remainder of this section we summarize a number of conclusions we draw from the scenarios. These take three forms. The first is a set of issues that derive from the larger challenge. The second is a set of categories of stakeholders we extract from the scenarios. Finally, we conclude the introduction with a set of questions, which remain unanswered, but appear to be central to the problem domain.

1.1 Overarching Observations

In examining the use scenarios here, we can identify an initial set of significant issues on the consideration of privacy, which derive specifically from the nature of big data. These are also informed by wider reading and discussions on the topic:
- Scale: The sheer size of the datasets leads to challenges in creating, managing and applying privacy policies. Because the datasets themselves are of such increasing size, the management of the meta-data that reflects privacy policies about them will incur parallel growth. One of the challenges is that as datasets grow, efficiency will play an increasing role. That will also be true of the privacy policy management associated with the growing data.
- Diversity: As datasets become "big data," it will be increasingly likely that more and more diverse stakeholders will be involved. Each may come to the effort with his or her own agenda. With an increasing number of stakeholders with different responsibilities will also come an increased probability that their interests, agendas and objectives will be less aligned with each other and hence their approaches to privacy policies will also be more divergent and possibly conflicting. Thus, privacy policy conflict resolution will play an increasingly important role.
- Integration: With increased data management technologies (e.g. cloud services, data lakes, and so forth) comes integration across datasets, with new and often surprising opportunities for cross-product inferences, and with it new "information" about individuals and their behaviors. The challenge is that reasoning, inference and other analysis tools will allow for the recognition or discovery of hitherto hidden facts (data) about the subjects. This raises a question of how to create and enforce privacy policies on this new "data."
- Impact on secondary participants: Much data about individual subjects tends to reflect on other people as well. This may range from people who "liked" a post to people who are mentioned in email or posts, to true secondary participants, such as family members or co-workers. One question that will become increasingly important is how to observe the privacy rights of these other people, who are not the primary subject of the data and may not be available to apply a privacy policy when that is possible. Even if these secondary people are available, it is not clear how to handle conflicting privacy policies in this domain.
- Need for emergent policies for emergent information: As inferences over merged datasets occur, emergent information or understanding will occur; this will be based, as mentioned above, both on simply merging datasets and, perhaps more importantly, on the exposure of previously hidden data that becomes visible only when datasets are merged. Although each unique dataset may have existing privacy policies and enforcement mechanisms, it is not clear that it is possible to automatically develop the requisite and appropriate emergent privacy policies and appropriate enforcement of them.
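The integration and emergent-information issues above can be made concrete with a toy linkage example. This is a hedged illustration only: both datasets, every name in them, and the helper `link` are invented for this sketch and do not come from any scenario in the report. It shows how two datasets that each look harmless under their own policy (a pseudonymous course log and a public profile list) can yield a new fact about a named person once they are joined on shared attributes.

```python
# Hypothetical pseudonymous course log: no names, governed by policy A.
course_log = [
    {"pseudonym": "u17", "zip": "02139", "birth_year": 1984, "failed_quizzes": 5},
    {"pseudonym": "u42", "zip": "02139", "birth_year": 1990, "failed_quizzes": 0},
    {"pseudonym": "u58", "zip": "02142", "birth_year": 1990, "failed_quizzes": 2},
]

# Hypothetical public profiles: names, but no course data, governed by policy B.
public_profiles = [
    {"name": "Alice Example", "zip": "02139", "birth_year": 1984},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1990},
    {"name": "Carol Example", "zip": "02142", "birth_year": 1990},
    {"name": "Dan Example", "zip": "02142", "birth_year": 1990},
]


def link(left, right, keys):
    """Join two datasets on shared attributes, keeping only unique matches."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    linked = []
    for row in left:
        matches = index.get(tuple(row[k] for k in keys), [])
        if len(matches) == 1:  # exactly one candidate: the link is unambiguous
            linked.append({**matches[0], **row})
    return linked


# Emergent information: names attached to course performance, which neither
# dataset contained on its own and neither policy anticipated.
emergent = link(course_log, public_profiles, ["zip", "birth_year"])
named_results = [(row["name"], row["failed_quizzes"]) for row in emergent]
```

Here two of the three pseudonymous learners are linked to a name, while the third stays ambiguous because two profiles share the same attributes. The open question raised in the text is which policy should govern `named_results`, since it belongs to neither source dataset.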
1.2 Stakeholders

As the reader will see in the scenarios themselves, there are a number of key stakeholder categories that appear repeatedly. Not all cases will include all of these stakeholders. In some cases, individuals may play more than one stakeholder role. Thus, for example, the data collector and the data curator may be the same, or the data platform provider, the policy enforcer and the auditor might be the same. But other combinations are likely to be found as well. It is also important to remember that the privacy policies for a dataset may be defined by people in different roles in different situations and, in some cases, the policies may be defined by outsiders on behalf of one or more of these stakeholders, as for example may be true under a regulatory regime. Thus, it may be that on behalf of the data subject, the government requires certain privacy policies.
- Data subject(s)
- Decision-maker
- Data collector
- Data curator
- Data analyst
- Data platform provider
- Policy enforcer
- Auditor

This list was drawn from the scenarios and should only be considered representative rather than complete. Appendix B includes a table with definitions of each of these stakeholder roles. The list is also considered at greater length in the companion paper on technologies. Appendix C demonstrates an application of these definitions to the first scenario on MOOCs.

1.3 Open Questions and Issues

In studying these scenarios, we are left with a number of challenging questions and issues:
- Novelty: What new/unique challenges emerge when it comes to managing privacy in the context of big data?
- Tradeoff: How do we assess benefit vs. risk? Part of the challenge in these domains is that the risks and tradeoffs need to be evaluated, to the extent that they can be evaluated by metrics, both by different metrics and at different timescales. As an extremely simple example, the benefits of MOOC analysis may be to future students, while the risks may be to the subjects of the data, the students about whom data has been collected. A keystroke logging system may help current students if the teaching staff can get immediate feedback on how long it takes each individual student to complete a particular exercise, but it may be that systematic changes only occur on a longer term basis than the period during which a particular student is involved with a particular course. At the same time, to the extent that the data can profile individual students in numerous ways both in real time and perhaps over the longer lifetime of the dataset, and perhaps in conjunction with the data from other courses the student has taken, their risks of violation of privacy may continue to grow, and definitely are unrelated to the benefits for future students. One of the challenges in this domain of metrics is that privacy is not binary. In part because it is contextual and in part simply because the privacy of some information is more critical than other information, this question of the tradeoff or balance between benefit and risk is both complex at any instant and a moving target.
- Harm: How do we evaluate "harm"? As mentioned above, the risk to privacy is neither binary nor necessarily stable. The deeper challenge is to understand the potential harm that may accrue from potential risks. In fact, we may need to turn this issue around. The question we may need to ask is, "Which harms are important to the individuals and in what contexts?" Thus, harms could be imagined on a spectrum from inappropriate online advertising to discrimination in setting insurance rates to something that is a life or death matter in terms of medical intervention. From that we might be able to consider whether there is some metric for evaluating harm generically, or whether any comparative evaluation can only be done in terms of specific harms. In turn, from the identification of harms, we may also be able to identify the risks that would lead to those harms. This is another way of talking about the related topic from the security community: threats.
- Trust: How can we establish and assess trust among the stakeholders? What does it mean for the various stakeholders to trust or mistrust each other or sets of others? What models do we have for understanding trust? What are the current and predictable future mechanisms and technologies for establishing trust and how do they relate to the models in people's minds and perception? How is trust established and maintained? How does it evolve over time?

With all these questions and issues in mind, the remainder of this report presents the scenario analysis done by various subgroups of the Big Data Privacy Working Group from which we drew these observations, thoughts and questions.
1.4 Remainder of This Document

The remainder of the document focuses on descriptions of the scenarios as outlined by subgroups of the larger working group. The first focuses on MOOCs (Massive Open Online Courses) and OLEs (Online Learning Environments). The second addresses the challenges in using social networking data for research. The third considers the use of mobile cellphone data to reflect human mobility into and out of regions of highly infectious diseases, especially in developing parts of the world. The final section of the paper summarizes a number of additional scenarios addressed by the group, but in less depth. They illuminate more of the breadth of the problem domain. The paper concludes with three appendices: A) the template developed by the group for organizing the individual scenarios, B) a more in-depth table of the stakeholder categories, and C) an application of the stakeholder analysis to the first scenario about MOOCs and OLEs, as an example.
2 Privacy Issues for Data Collected from MOOCs and Online Learning Environments
Team: Una-May O'Reilly (MIT), David Dietrich (EMC), Lalana Kagal (MIT)

2.1 Abstract

MOOCs (Massive Open Online Courses) represent a specific type of Online Learning Environment (OLE), which can be deployed on Internet-served platforms that collect large volumes of granular behavioral information about students' learning activities. Some data reveal each individual student's detailed study behavior such as video usage, consultation of text or learning tools, and the sequence in which material was navigated. Other data include assessments, grades, and social interactions and communication on forums within the platform. Collectively the data can be linked to auxiliary demographic information such as age, sex, and socioeconomic status. It can also be linked, if not anonymized, to public online behavior. A general set of legitimate uses of this data includes education research, examination, and analyses that directly or indirectly help instructors teach and conduct student assessments. Some, but not all, of these use cases have commercializable models for parties beyond the platform provider.

2.1.1 Definition of a MOOC and the Scope of OLE and MOOC in this document

MOOC is an acronym (Massive Open Online Course) originating in 2012. The acronym has been short-lived, as MOOC has evolved into a noun with meanings falling outside the acronym. For example, today we see MOOCs that are not open to all comers and MOOCs that are only partially online, because they are integrated into blended learning or flipped classroom models. [2] MOOCs share history with ITS (Intelligent Tutoring Systems) and other learning management systems, such as Blackboard and Moodle.

We are focusing on data and its related privacy and confidentiality issues in this document. No two OLE platforms collect exactly the same data, but wherever it is largely unimportant to differentiate each platform by its specific name, we will refer to them all as OLEs.

2.1.2 State of Data Privacy Organization
OLEs, and MOOCs in particular, at their current scale are relatively recent, so data privacy and access policies are emergent and dynamic. Policy makers range in governance scale from the federal government to platform providers, and further to institutional and independent content providers. Both de facto policies and interim policies, which have been necessary to cover fast-paced OLE activity, exist. Furthermore, existing policies on data privacy have been interpreted in new circumstances. Policy committees and meetings abound. Policy making is at the information collecting, option drafting, and revision stages. There is a potential to leverage the experience from many other data domains and shape a strong national example. This will require input from data stakeholders, the legal community, and technology experts. The latter are important because they can advise on technical risks of privacy and confidentiality breaches, while also indicating the capabilities and potential power of new technologies.

[2] Given this fluidity of the meaning of MOOC, some people reasonably dispute the origin of the widely recognized first MOOC, believing large scale online courses at the college level preceding Ng's or Thrun's at Stanford in 2012 to be valid examples. It is arguable that Coursera and MITx/edX examples are more precisely called "xMOOCs," while previous online learning courses, which are generally much more fluid in nature in terms of content deployment, are more precisely called "connectivist" or "cMOOCs."

2.2 Detailed Narrative
The Online Learning Environment (OLE) data privacy scenario is relatively straightforward compared to some other domains, such as health records or personal genotyping, for several reasons:
• Because OLEs are recent, there are few data legacy complexities.
• Because the number of platforms is modest right now, the kinds of data are enumerable and their formats are known. However, this will change.
• Because there are enumerable classes of stakeholders in the space and policy precedents in related domains, there is generally less divergence and/or disagreement on what a policy should cover and what the principles should be.

2.2.1 Open Issues
• Recognizing the dynamic nature of control of the data and acknowledging that the circumstances around that control may change. The data is replicated and passed by the platform provider to the institution of the content provider. At this point, two parties have control. Hereafter, designated controllers may expand in number, or the control may be passed from party to party in stages. Different controllers have different interests in the data and allow various parties to access it under a diverse set of goals and agreements. There is no uniformity to institutional practices across the country. If a broader policy and set of practices were to be developed by government, their interpretation might still result in heterogeneous local practices.
• Defining and determining legitimate uses of the data and how these uses should be controlled in a clear, specific, and open-ended manner.
• Setting guidelines or stated policies related to the sale, trade, or sharing of this data in OLEs and MOOCs.
• Defining and determining the legitimate commercial use of the data, if any.
• Defining the role of technology in aiding the drafting and governance of policy.
• Anticipating commercial and educational activities around OLE data, as well as potential malicious activities, and considering what technology can do to support them (or prevent them), as necessary.
The tradeoffs for policy around data control and access include:
• Students' right to confidentiality, privacy, and access to their own data.
• Institutions' and content providers' right to access because of content provision.
• Platform providers' right to access because of service provision.
• The benefit of research, the research-motivated right to access, and the countervailing risk of identification.
• The potential linking of anonymized data with outside data.
• Commercialization opportunities that may be unforeseen or unanticipated by students who grant permission to collect and control their data.
• The reasonable limits of technology for privacy and confidentiality policy support.

2.2.2 Additional Privacy Concerns
• Forum discussions and data linkability.
• One common way to grade assignments in MOOCs is via peer grading, which may create power relationships and opportunities for misuse.
• Power dynamics may not respect basic rights, as they relate to the linked data or the textual information from the discussion forums. In addition, MOOCs can present asymmetrical power dynamics. Consider the case of children and prisoners, where people within a system (educational, correctional) may be required to do things as part of that system, or in this case, the MOOC, and they may be influenced to bend the rules, given the existing power dynamics. Therefore, this area needs additional protection, since MOOCs have the potential to enable coercion and power imbalance. There are free MOOCs and MOOCs focused on certifications and jobs. There is an asymmetric power relationship in some situations and when this exists, there should be separate regulations governing these MOOCs to ensure that the dynamics are fair and there is free will and clear consent.

2.3 Privacy Impact Assessment - The Specific Context of Scenario 1
2.3.1 Actors
Students: Users who take the course, complete the assignments, and receive a grade.
Teaching content providers: Faculty and teaching staff that provide the teaching material, monitor and support the discussions, and handle the grading.
Crowd Participants: At-large parties who might volunteer to grade or offer feedback on assessments, programming assignments, and so forth, but who are not students or core teaching staff.
Peer Graders: A specific case of students, in which students are expected to grade each other's work in order to manage the grading at large scales, as occurs in some MOOC contexts.
Institutional content provider: The institution behind the teaching content providers. Examples include an enterprise offering an in-house learning platform, a university offering a MOOC, or an enterprise offering product education for clients or the general public.
Platform provider (e.g. Coursera, edX, Stanford U): A party that deploys the course on the Web via a platform. In some cases, the same party develops and maintains the platform. For example, edX is a not-for-profit organization that develops, maintains, and deploys a MOOC platform as a service with a consortium of university partners, including MIT and Harvard. Coursera is a commercial entity and has different university relationships. Open edX is an open source platform that any content provider can adopt and use for content deployment.
Analyst: A party who examines the data collected from OLEs. Analysts include researchers, their students (if the researchers are academics), and education technologists. Teaching staff, platform providers, and institutional content providers may also act as analysts.
Data controllers: Data control of OLE data is not always centralized or stationary. Examples of data controllers include the platform provider and institutional content provider. Within each of these institutions, there could be multiple controllers. They may control the data at different times, or they may control it concurrently. For example, at MIT, the Office of Digital Learning receives the data, controls its distribution at one point, and then later passes this role on to the Institute Registrar.

2.3.2 Actors and Relationships
Analysts interact with data controllers to gain access to the data. The data controller often asks the analysts to formally submit to a policy. Eventually analysts will transform source data by linking and interpretation into more abstract representations of student behavior, e.g. variables for modeling, all the while trying to enforce student anonymity. Analysts will interact with data controllers to work out how to more widely share such variables and to evaluate the risk of re-identification that they, and models using them, present.
Data controllers interact with each other to pass or share the data.
Students interact with the platform provider and the teaching content provider. They register with the platform provider to gain entry to the platform and course. They provide background information, participate in the course, including its forums and assessments, and provide survey information. As data controllers, both providers will access this information. It should be noted that students often confuse the platform and content providers. A student is shown a privacy and access policy by the platform when he or she registers. A student agrees to a platform use policy when registering. For example, edX's use policy stipulates no scraping.
Students indirectly interact via the OLE with the Institutional Content Provider when they have grades placed in their academic records, or when they receive credit or proficiency certificates.
Students indirectly interact with analysts. They gain a benefit from assistance that could be founded on the researchers' analysis of their data, both a student's individual data and the data of other students in aggregate.
Students interact with other students, generating data of great interest. These interactions frequently take place on forums within the platform. Importantly, for data privacy reasons, they may take place outside the platform, informally arising, rather than being organized by the course structure. Examples of digital records of these interactions are Facebook or LinkedIn groups. Sometimes students assess the work of other students in peer-to-peer relationships. Students may also work in groups on projects or homework.
Students interact with Crowd Participants when they receive feedback from them. For example, one course at MIT invites alumni to comment on student software designs.
Students rarely interact with data controllers at this time and have little or no access to their data beyond official records created for their education purposes.
Institutional content providers employ the teaching staff, i.e. teaching content providers, and have agreements with them regarding intellectual property related to the course, and remuneration for instruction. The institution is usually the data controller, rather than the teaching content provider. In fact, the latter party may need to seek permission for data access to the very courses he or she has taught.
Teaching content providers interact with Crowd Participants to provide guidelines on grading and get feedback on student performance and interest in the course.
Teaching content providers provide feedback to institutional content providers and platform providers on usability, additional features, and student performance, for example.
Teaching content providers may interact with analysts to understand how students learn and interact with their teaching content in order to improve that content.
Teaching content providers may interact with data controllers to get access to data about their course in order to analyze it and to improve the teaching content.
Institutional content providers interact with the platform providers to ensure that the courses are supported properly and provide feedback on additional features.
Institutional content providers interact with data controllers to identify and/or specify the policies that they wish to enforce and to discuss enforcement mechanisms.
The core interaction is the student learning via an OLE. Around this point, students interact with each other and teaching staff. In terms of privacy, students are identified by their login ID on the OLE platform. They may also reveal their "offline" identity to each other and staff in the content of their discussion posts. Students agree to a platform use agreement that implies that they accept the platform's data use policy. During the learning process, the platform provider captures clickstream, assessment, discussion, and wiki data. In real time, or at longer intervals, the platform provider aggregates this data from many students' interactions. The platform and the institutional content providers control these data. They are generally not accessible to the student, but they are accessible to teaching content providers and analysts. Institutions controlling the data are responsible for meeting FERPA requirements and pseudo-anonymizing data to which they will link and provide access. They also develop and provide technical support for data access policies. Analysts transform source data in the course of their modeling activities. They may combine low-level observations (e.g. mouse click activity) into variables (e.g. referrals to text during problem solving) and compile large datasets of them. These datasets describe student behavior at a recognizable level of human activity. They are destined to become the data "currency" of analytic research. How to handle the control and privacy protection of such secondary data (i.e. with whom it can be shared, given the potential for student re-identification) remains to be resolved.

2.4 Goals of OLEs
General: To educate. With college OLEs, the education could have (secondary) outreach and accessibility goals. With corporate OLEs, the education could have (secondary) product adoption, sentiment, and publicity goals. In addition, goals specific to actors are:
Teaching content providers: Providing teaching materials, job tasks for an employer.
Crowd Participants: Altruistic or professional education goals.
Peer Graders: Evaluate other student work in an appropriate, objective manner.
Institutional content provider: Sometimes generating revenue, directly or indirectly; reputation.
Platform provider: Revenue streams via advertising, signature tracks, recruiting. Possible cross-selling to steer people toward formal degree programs at universities that provide content. Owning the ecosystem, as they own the actual platform and access the data.
Analysts: Research into education, improvement of the OLE experience for students and teachers by interpreting historical data. Inevitably, financial profit could be a goal for this kind of actor.
Data controllers: These are the data gatekeepers. They regulate access to the data at the moment for analysts and other potential controllers. Their goal is to ensure that the privacy and confidentiality policies governing the data are respected, while providing access to appropriate analysts.
There is a lurking unnamed adversarial goal/actor in this space: Those exploiting the data for commercial or hacking purposes, outside the realm of educational use, i.e. to identify someone and target her or him specifically for revelations or for profit-based activities. For example, there is a significant potential for targeted advertising.

2.5 Data
MOOCs offer a potential social science laboratory or study setting where students' behavior and interaction with course content can be almost microscopically observed. Technology allows us to capture a tremendous amount of detailed data, including:
• Click-stream interactions between a student and content.
• Use of videos and other e-resources, such as digitized reference material, wikis, and forums.
• Assessment behavior: attempts, correctness, use of immediate feedback.
• Self-reported background, pre- and post-test surveys.
• More data than in a residential setting, but with less contextual information accompanying it.
This data can be segmented in several ways, as outlined below.

2.5.1 Course-related
• Course content from content provider.
• Data exhaust from platform, as students interact with Web servers. This is often called clickstream data. For edX, it is JSON logs of every GET/POST of data to the Web site.
• Student input to the OLE via wiki and discussion forum entries, questionnaires, and self-reporting surveys.
• Assessments: both grades and responses; certificate achievement.

2.5.2 Institution or Platform-related
• Curricular data related to courses taken, timing, and learning paths.
• Registration data, such as profile information about students.
• Payment data perhaps (e.g., Coursera Signature Tracks, other third parties).
• Certificate data.
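
As an illustration of how the clickstream data exhaust listed in Section 2.5.1 might be rolled up into student-level variables, the following is a minimal sketch. The log fields (`user`, `event`, `time`) and the event names are hypothetical and are not taken from the actual edX log schema; the sample records are invented for illustration only.

```python
import json
from collections import Counter, defaultdict

# Hypothetical JSON log lines; field and event names are illustrative only.
SAMPLE_LOG = """\
{"user": "u1", "event": "play_video", "time": "2015-03-01T10:00:00"}
{"user": "u1", "event": "problem_check", "time": "2015-03-01T10:05:00"}
{"user": "u2", "event": "play_video", "time": "2015-03-01T10:06:00"}
{"user": "u1", "event": "play_video", "time": "2015-03-01T10:09:00"}
"""


def events_per_student(log_text):
    """Count each event type per student (a student-oriented view)."""
    counts = defaultdict(Counter)
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        counts[record["user"]][record["event"]] += 1
    return {user: dict(c) for user, c in counts.items()}


summary = events_per_student(SAMPLE_LOG)
```

Even a simple roll-up like this yields behavioral variables keyed by student, which is why such derived data carries the same re-identification concerns as the source logs.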

The data listed in Sections 2.5.1 and 2.5.2 are in diverse formats and can be linked to form student-oriented or time-oriented descriptions (the former being more actionable) of learning activity within the platform. One such open organization of MOOC platform data is MoocDB within the MoocDB project. MoocDB is a platform-agnostic functional data model for data exhaust from MOOCs. The MoocDB project will provide open source software of MOOC tools and frameworks.

2.6 Systems
Business systems. As an example, Coursera is a for-profit organization, providing an online service. In the past, Coursera offered a "freemium" model in the marketplace, and has evolved to offer low cost courses and specializations. Signature tracking verifies student authenticity, recruiters are in the model and serve as a revenue source, and lifelong learners take courses well beyond the typical student years. In the case of a corporate MOOC, HR learning systems are part of this picture as well.

2.7 Risks
The biggest data risk is that someone in the data is identified and this causes harm to them. Data has to be pseudo-anonymized before release, but that does not assure with 100% confidence that re-identification will be impossible. Re-identification can take place in at least three ways:
• Pseudo-randomized data has confidential cross-reference tables to true identity. These tables, if not adequately protected, could be compromised.
• Some reference in the content of the data, for example free text posts in discussions or timestamps, will directly or indirectly allow cross-referencing to public data that reveals identity.
• A previously compromised dataset can potentially be used to learn the behavior of a student and this behavior pattern can then be applied to new datasets to identify the student.
Several additional risks exist:
• Data control is not in the hands of the data providers, i.e. the students. Therefore, there is a risk that the data can be used in a way that the data provider did not anticipate, or for a reason that they do not approve.
• Data released for research purposes will be used for commercial purposes.
• Data will be used to evaluate the teaching ability of the teaching content provider and to compare teaching content across different institutional content providers without explicit consent related to individual data sharing.
• Students may not understand the privacy policy that they have agreed to at sign-up, and their personal data gets shared or monetized without their informed consent.

2.8 Rules/Regulations
In the United States much of the regulation of academic data comes from the Family Educational Rights and Privacy Act (1974), [3] which defines the rights of parents and guardians to access, and to exercise some control over who has access to, information about children under 18 years old. It also defines the rights of students over 18, such as students in college. It is important to recognize that there may be a number of non-FERPA regulations with respect to the privacy of information about students. An example of this is the U.S. Health Insurance Portability and Accountability Act (HIPAA), but there are others as well. This group did not discuss the relationships among these various different factors in the privacy of educational data, but just noted that such differences and possible conflicts exist.

[3] See http://www2.ed.gov/policy/gen/guid/fpco/ferpa/students.html for general information about FERPA.

2.9 Technologies
• Learning platforms (using this broadly to refer to platforms such as edX, Coursera, Udacity, and other MOOC providers, as well as more traditional Learning Management Systems (LMS) such as Blackboard);
• Software frameworks for processing large datasets, such as Hadoop, and data lakes that store a combination of structured and unstructured data;
• Web browsers and front-end tools;
• Analytical tools;
• Cloud computing platforms (e.g., Amazon Web Services and others);
• Code on different systems;
• Mobile devices.

2.10 Privacy Constraints
Privacy constraints in a MOOC are very different from those of a physical classroom experience. There is a perspective that since MOOCs are much more open, students are more vulnerable online, compared with a traditional classroom setting.

2.11 Technology Informing and Supporting OLE Data Privacy and Confidentiality Policy
2.11.1 What tools and approaches can (new) technology provide?
Some possible technologies:
• Differential privacy.
• Analysis is carried out on encrypted data, so even the platform provider does not see the data (homomorphic encryption).
• The analyst uses a trusted and privacy-aware API to write up their analysis and submit their code to the data controller; the API prevents the abuse of data.
• Store extensive audit logs about analyst access to ensure that the analyst is not able to chain queries in order to gain access to inappropriate data.
• Privacy-aware analysis framework that helps the analyst be policy compliant.
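
To make the first item in the list above more concrete, the following is a minimal sketch of one standard differential privacy technique, the Laplace mechanism applied to a counting query. It is an illustration under stated assumptions, not a method proposed by the working group: the function names, the sample data, and the choice of epsilon are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng=None):
    """Return a noisy count. A counting query has sensitivity 1,
    so the Laplace noise scale is 1 / epsilon."""
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical per-student counts of video views.
video_views = [0, 3, 7, 1, 12, 5, 0, 9]
noisy = private_count(video_views, lambda n: n >= 5, epsilon=0.5,
                      rng=random.Random(0))
```

A smaller epsilon adds more noise and gives stronger protection, at the cost of less accurate answers; repeated queries consume the privacy budget cumulatively.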

Some initial thinking has been given to managing MOOC data via decision and policy engines based on heuristics. This approach would require separating the databases and using different access controls.

2.11.2 Risks
What risks are there to even the new technology?
• Differential privacy only works within a closed dataset; privacy breaches are possible when external datasets are linked.
• Encryption acts like access control and is useful when the platform provider is untrusted.
• A restricted API acts like an access control combined with audit.
• Auditing can handle post facto problems.
• The analyst platform provides a holistic approach to access control, privacy awareness, and ensuring policy compliance. However, it restricts the analyst to a single platform.
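
The idea that a restricted API behaves like access control combined with audit can be sketched as follows. This is a hypothetical illustration only: the class name `RestrictedGateway`, the minimum group size rule, and the record layout are assumptions made for the example and do not describe any existing platform.

```python
from datetime import datetime, timezone


class RestrictedGateway:
    """Hypothetical gateway: aggregate-only access plus an audit trail."""

    def __init__(self, records, min_group_size=5):
        self._records = records              # raw rows never leave the gateway
        self._min_group_size = min_group_size
        self.audit_log = []                  # one entry per request

    def mean_grade(self, analyst, course):
        grades = [r["grade"] for r in self._records if r["course"] == course]
        allowed = len(grades) >= self._min_group_size
        # Every request is logged, whether or not it is allowed.
        self.audit_log.append({
            "analyst": analyst,
            "query": f"mean_grade({course})",
            "allowed": allowed,
            "time": datetime.now(timezone.utc).isoformat(),
        })
        if not allowed:
            raise PermissionError("group too small to release an aggregate")
        return sum(grades) / len(grades)


records = [{"course": "c1", "grade": g} for g in (70, 80, 90, 60, 100)]
records.append({"course": "c2", "grade": 55})
gateway = RestrictedGateway(records)
c1_mean = gateway.mean_grade("analyst_a", "c1")
```

The log by itself does not stop an analyst from chaining queries; it only gives the data controller the record needed to detect such patterns after the fact, which matches the post facto role of auditing noted above.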

3 Research Infrastructure for Social Media
Team: Maritza Johnson (Facebook), Dazza Greenwood (MIT), Mona Vernon (Thomson Reuters)

3.1 Abstract
Most social media platforms provide at least two basic features: the ability to share user-generated content and the ability to connect with an audience. Different social media platforms make it possible for users to share a range of content types and some allow the user to selectively choose the audience for individual pieces of content. On Facebook, for example, the user could share text-based status updates, photos, or website URLs. The user is also able to comment on content posted by other users, install applications that utilize the Facebook API, or communicate with others within a self-organized group of people. Between the user-generated content and the server logs that capture how and when people interact with the platform, these services are an invaluable source of information about human behavior at the individual, group, and even country levels.
The goal of this scenario is to evaluate technical solutions that would open this data up to researchers while offering data subjects informed consent and control over their data. Studies of social media to date have provided insights on topics as wide-ranging as social capital, social influence, meme evolution, emotional contagion, mobility, and politics. For a variety of reasons, much of this research is currently limited to employees of social media companies.

3.2 Scenario Introduction
Studies of social media to date have provided insights on topics as wide-ranging as social capital, social influence, meme evolution, emotional contagion, mobility, and politics. Unfortunately, much of this research is currently limited to in-house researchers at social media companies. Academics and other researchers have, in some cases, leveraged publicly available content or APIs, when they are available, but there are notable limitations to collecting data through these channels. In some cases, studying a group of people yields the most interesting insights but this requires that a critical mass of the population opts in to a research program. In other cases, the user-generated content is best supplemented by information that can only be found in the server logs, such as how frequently a person visits the platform, how much time they spend, and the proportion of time spent consuming content versus producing content.
One way to increase the volume of research in this area is to develop a social media research infrastructure that allows users (data subjects) to opt in to a program that makes some subset of their social media content and the accompanying server logs available to researchers. The result would be a large-scale, rich dataset that would empower researchers to generate varied and reproducible research. Social media platforms might participate in a data release program with varying options. For example, one successful implementation of the program might include a predefined set of user data and data from server logs, and a feature that allows researchers to contact participants for supplementary data or follow-up surveys. It might also include a portal with educational content for individuals to visit to help them understand the information they've chosen to donate, to see how researchers are using it, and to gauge the long-term benefits of participation.
The incentive for the Study Participants and Social Media Providers is to act for the public good. The risk for the Study Participants is that they might experience negative effects as a result of contributing their data to the general dataset. The data exchanged may contain several features of data known to be personally identifying or sensitive in nature including race, sexual preferences, gender choice, and political views. The data exchanged could also be used for making unexpected inferences that the participant was unaware of at the time of consent.
As a high-level overview, the program would be initialized by the Social Media Provider. The social media provider would advertise the opt-in research program to users (potential participants), give an overview of the structure of the program, the risks, and the benefits, and present the choices that represent how a user might participate. This information would include the main features of the program: the basic set of information that is required to participate; additional optional fields that the participant may choose to include; and the features that would allow a researcher to contact a user for additional information.
The participant will have granular opt-in choices for sharing a subset of their personal data. For example, some basic (static) fields are included in the set, such as birth month and year, current city, school history, job history, etc. The participant is also given the ability to contribute dynamic streams of their data, including photos, posts, comments, and interests.
The information will clearly describe the policies that researchers will be held to, while making it clear that the dataset is not believed to be anonymous or de-identified in a robust manner.

3.3 Stakeholders and Interactions
Social media providers are the data collectors and would initially serve as the data platform providers.
Social media users are the data subjects and are asked to provide informed consent for the data to be transmitted by the social media provider to the researcher for purposes of a research study.
Researchers are data analysts and receive data from data collectors (social media providers) by permission of the data subjects (social media users). The researchers become data curators of the data that they receive, at the time of receipt, and of any derivative data that is produced as a result of the research activities.
The data collectors (social media providers) remain data curators for the underlying data of all social media users that they continue to maintain.
Social media users will continue to interact with the social media platform to generate new content.
Researchers might contact social media users to collect additional data to supplement the social media data.
Social media users will contact the social media provider if they experience issues or have concerns about the overall program. Users will expect that the social media provider is ultimately responsible for ensuring a positive experience.
Researchers would provide information to the data subjects about the research that results from using the data subjects' data.

3.3.1 Data
Examples of the data that could be made available:
• Posts: photos, status updates, location check-ins, etc.
• Comments and the number of likes on individual posts
• Education history
• Hometown
• Current city
• Religious and political views
• Information about the friend network: summary statistics like count, breakdown by age range, current city (location), gender, political views, and education level, etc.
For the dynamic fields, the informed consent dialog might offer the ability to contribute:
• All historical data
• All historical data with some exceptions
• Only future data
• Only future data with exceptions
• Historical and future data
• Historical and future data with exceptions
Audience, keyword, tags, or some other mechanism could define the exceptions.
Making the data available:
Option 1: Social media provider generates data slices:
• On a monthly/quarterly/annual basis, the Social Media Provider would create a new data slice for all active participants in the program.
• Participants would be able to opt out of the program, but they would not be able to remove their data from the datasets that had already been published. This is mainly because no practical guarantees could be made about deletion requests once the data has been released to researchers.
• Researchers would conduct queries on the available datasets, or download the entire available set for a given time period.
Option 2: Social media platform provides an API specifically for this program.

3.4 Systems
Legal systems: The privacy policy, or data use policy, currently governs how data can be used.
Social systems: What are the existing expectations around who owns shared content? Social media data sometimes involves more than one data subject. Consider for example a Facebook status update with a set of comments and "Likes." The simple text of the post belongs to the original poster (the person we consider the data subject throughout this scenario). But the post might also include "tags" to other people. These structured references to other users represent other individuals. What's the best way to handle providing this information in the dataset? Similarly, on Facebook, comments on a post are stored with the account of the post author rather than the commenter. Who does this content belong to? The comments are relevant to the context of the post, but are generated by other people. Is consent required to know which users "liked" a post? Do we limit the data so that only the number of likes is available?
Business systems: Human subjects research requires the approval of an ethics committee if the Common Rule [4] applies.
Technical systems: informed consent, a permission-based system to allow the user to participate in a way they feel comfortable, transparency and control over how data is shared, deletion protocols, de-identification of data to protect individuals when it is aggregated, and auditable systems to understand who has access.

3.5 Analyze the Scenario
3.5.1 Goals
• The participants benefit from contributing to a general body of knowledge and perhaps they will learn something about themselves on an individual basis too.
• Researchers have access to a dataset that was previously unavailable.
• The social media provider gains insights about the user base and contributes to the general body of knowledge.
• The Researchers may be acting for the public good, or they may be acting to develop their own careers.
3.5.2 Risks
• Participants agree to participate in the program and then later experience an unexpected harm, due to an unexpected inference that arises from the research.
• Participants agree to participate in the program and then later experience an unexpected harm, due to modification of the site based on those inferences, or as a part of the experiment itself.
• The dataset would be a valuable resource for researchers, but it would be difficult to quantify the bias introduced to the dataset based on the characteristics of the people who decide to opt in to the program.
• Researchers identify a correlation in the study population that can be extrapolated to the general population, greater than the pool of the participants who opted in.
• Deletion requests: is it reasonable to design the program such that people can opt in or choose to opt out, but cannot remove their data from the already-released data slices? If not, then how would deletion be handled when the data slices have already been released?
• Lack of control on the downstream use of the data, or derived data: what are expectations and commitments to the people who opt in on downstream uses of the data? When new insights emerge, how do you ensure that the inferences/derivative data have been created in a way that is consistent with an individual's expectations? How would we detect a misuse of the data? How would we tag derivative data to understand where it came from and understand the original policy in order to determine whether the action and the future uses are policy compliant?
• The data copy may be disposed of by the Researchers after the study, or may be retained in a corpus for further study. The data copy must be held securely and the Researchers are liable for a breach. However, the Social Media Provider may be liable if they have not assured that the researchers are acting properly and also may risk collateral damage in the case of a breach, even if the proper processes have been followed. A variety of technologies and systems will be used to store and transmit the data, including Internet links and various databases. The data must be held according to the various data protection regulations in the territory that the data has been exported to, provided the export is legal in the first place.

[4] The Common Rule is the name of the U.S. federal policy on the ethics of use of human subjects in biomedical and behavioral research. For more detail see http://www.hhs.gov/ohrp/humansubjects/commonrule/index.html

3.5.3 Rules
• Terms and Conditions of the social media provider
• The social media platform's existing audience controls for content
• Notice and consent when the user opts in to the program
• FTC Section 5
• For the Researchers: applicable human subjects research protections (e.g., The Belmont Report or The Common Rule)
• The policies of publication venues
3.5.4 Time
Roughly two to four years.
3.5.5 Existing Relevant Best Practices
Human subjects review committee: Where The Common Rule applies, an ethics committee would be required to give approval for human subjects research and an appropriate risk assessment would be undertaken to validate the arrangements that have been put in place to manage the data security and disclosure.
OAuth2 for enabling access to authorized users: Once the data subject has provided the click-based grant of authorization, the researcher could be granted an OAuth2 token to request and receive that individual's data via the API. The data would then be transferred to a research platform and database to conduct the analysis. The OAuth2 token would be provisioned with a scope of access that corresponds to the personal data that the data subject agreed to provide.
In the UK, organizations like the UK Data Archive can be consulted to manage the privacy processes and publication of results without breaching privacy.

3.5.6 Gaps
The above description includes a few caveats that are based on the limitations of our technical abilities. For example, it's important that the participants understand that researchers would agree to a policy that prohibited attempts to re-identify participants within the dataset, but it would be difficult to make any guarantees along those lines given today's technical solutions. Similarly, there could be contractual limitations in place around deletion and retention; however, we are lacking technical systems to enforce the policies.
• The management of access to data and the risks associated with publication present an impediment to the use of social media data.
• Gathering informed consent from social media users is particularly problematic.
• To enable research of this kind, we need to streamline these processes and provide automatic verification of the safety of disclosures.

3.6 Innovation Ideas and Opportunities
3.6.1 Looking at 3-5 years opportunities and challenges
One of the main opportunities lies in the ability to combine social data from different sources in order to conduct more insightful research and enable reproducibility of research. This will require technology to allow for privacy preservation, or the application of rules as the data is combined with other datasets.
How do we develop legislation, if it is not already in place, to set up a baseline that is not country-specific, so that social media providers do not face the difficulty of complying with multiple forms of legislation? Ideally, there will be a mechanism for allowing social science research to be conducted on a global scale.
Computational social science may become more common and "normal," compared to the niche role that computation currently has in the social sciences. A true limitation of the research area now is that only social media platforms have easy access to large-scale datasets. Most academics who work in the space have partnerships with corporate entities to acquire large datasets. How will the research community change when large-scale datasets are available to all social computing researchers?
Shifting norms are expected to continue even beyond the 3-5 year horizon and this means that we expect continued deep uncertainty.

3.6.2 Open Questions
• What if we developed a "Common Program & Protocol" for infrastructure-level services to enable population-wide living labs social media research?
- What if Facebook supported a feature for users to "opt-in" for participation in pre-qualified research studies and we modeled/tested that as a common service available to any approved MIT Living Lab application? In theory, this sort of capability could enable re-usable or easy update of consent across simultaneous research studies and for future studies. This type of service could comprise fundamental capabilities that are now missing for operationalizing fair permission-based use of personal data in big data contexts.
- An OAuth2 scope type developed for research content could be a model for other social networks to use. One of the best aspects of the Facebook and general Web 2.0 design pattern with OAuth2 is that the authorizations can be seen on a dashboard and individually modified or revoked according to the agreements, potentially at any time.
- How could a common service type and interface specification be used by researchers to enable other social media providers (e.g. LinkedIn, Google Plus, Twitter) to provide consent-based data using interoperable programs and
according to the standard protocol developed by MIT and Facebook? What issues of scaling, cost/risk management, business value, and usability would need to be addressed, and at what phase of design, development, testing, iteration, and deployment (alpha, beta, v1, v2)?
- Could MIT Living Labs partner with Facebook to test a model OpenPDS (Personal Data Store) deployment that further developed infrastructure-grade service interfaces, pipes, and gauges? Would, or should, it matter if OpenPDS was situated at the research institution (e.g. MIT for MIT Living Labs), or at a third party provider?

3.6.3 Alternative A: Interactions of People

The participant has an account with the Social Media Provider, provides Informed Consent to Participate in the Study and, within the scope of the study, provides authorization to the Social Media Company to release personal data to Researchers via their applications.

A laboratory has an approved research study, has received the informed consent of individual participants, has registered an application with a social media provider, and has selected the OAuth2 scopes for grant of authorized access that correspond to the personal data used to conduct the research. Once the individual has provided the click-based grant of authorization, the lab's app uses an OAuth2 token to request and receive that individual's personal data via their app and into a research platform and database used to conduct the analysis.

The Social Media Provider provides an account to the individual under its terms and conditions and provides a developer account to the lab under another set of terms and conditions. It also provides the personal data authorized by the individual for sharing with the application of the lab upon permission of the individual user.

3.6.4 Data

All past and current data available during the course of participation in the study that is accessible by OAuth2 individual consent from the included social media providers.
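The authorization flow of 3.6.3, together with the dashboard-style revocation noted in 3.6.2, can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `ConsentRegistry`, `Grant`, and scope names such as `research:posts` are hypothetical, and no real provider API is called. It shows the one property the text relies on, namely that data is released only while a participant's grant is live and covers the requested scope.

```python
from dataclasses import dataclass, field
import secrets


@dataclass
class Grant:
    """One participant's authorization for one registered research app."""
    participant_id: str
    app_id: str
    scopes: frozenset
    revoked: bool = False


@dataclass
class ConsentRegistry:
    """Hypothetical provider-side store mapping access tokens to grants."""
    grants: dict = field(default_factory=dict)

    def authorize(self, participant_id, app_id, scopes):
        # The participant's click-based grant yields an opaque bearer token.
        token = secrets.token_urlsafe(16)
        self.grants[token] = Grant(participant_id, app_id, frozenset(scopes))
        return token

    def revoke(self, token):
        # Participants can withdraw an existing authorization at any time.
        self.grants[token].revoked = True

    def may_release(self, token, scope):
        # Release data only for a known, unrevoked grant covering the scope.
        grant = self.grants.get(token)
        return grant is not None and not grant.revoked and scope in grant.scopes
```

In such a design the lab's application never sees more than the scopes it registered for, and a revocation takes effect on the next request without any change on the researcher's side.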
3.7 Notes on Scenario

This example is based on a study that is currently happening at the Technical University of Denmark in collaboration with the MIT Human Dynamics Lab. However, references to potential downstream sharing arrangements by participants and researchers represent prospective future phase research and assume a future state of perhaps 1-3 years from now.

3.8 References

Related to applicable rules

* When Facebook has the data, these terms apply:

Platform Policy (Applies via Researcher's registered "Client" App/Service)
https://developers.facebook.com/policy
Statement of Rights and Responsibilities
https://www.facebook.com/legal/terms

Data Use Policy
https://www.facebook.com/about/privacy

Facebook Community Standards
https://www.facebook.com/communitystandards

Facebook Principles
https://www.facebook.com/principles.php

* When the Researchers Receive the Data

SensibleDTU Example Computational Social Science Research Study
https://www.sensible.dtu.dk/?page_id=89

* When the Participants Share Downstream Via Personal Data Services

MIT Human Dynamics Lab Model Personal Data System Rules
https://github.com/humandynamics/systemrules/blob/master/model_personal_data_system_rules.md

Draft Data Rights Services Agreement
https://github.com/humandynamics/legalagreements/blob/master/datarightsservicesAgreement.md
4 Data for Good: Public Good and Public Policy Research Using Sensor Data/Mobile Devices

Team: Jake Kendall (Gates Foundation), Yves-Alexandre de Montjoye (MIT), Cameron Kerry (MIT)

4.1 Abstract

There is little doubt that the capacity to collect and analyze mobile phone data at large scale has great potential for good [UN][D4D]. There are, however, numerous barriers that need to be overcome before this data can be broadly used by non-governmental organizations (NGOs) and researchers:

- The data is generated by the carriers' infrastructure and belongs to them,
- The infrastructure to manage and analyze this data at scale for good has to be developed,
- Data-science skills are needed within NGOs to fully take advantage of the data,
- These data are highly sensitive and personal: simply anonymized mobile phone metadata has been shown to be re-identifiable [unique], and
- The legal and regulatory environment is at best uncertain and may prevent certain uses of the data.

This group is studying the technical and legal solutions that could make this data available in an operational context. We first focus our analysis on two scenarios inspired by the available academic literature. We then sketch proposed practical implementations to operationalize these scenarios and analyze them from a privacy angle, focusing on re-identification, and a legal perspective, with a focus on African countries.

4.2 Scenario Development

After considering a number of different scenarios, we focused on two that contrast scope and purpose:

Scenario 1: Tracking population mobility within and across borders to model epidemic spread

Scenario 2: Micro-targeting behavior change interventions to individuals or specific subsets of the population.
Scenario 1 is modeled on the use of location data coming from mobile phones in order to better understand and quantify the spread of malaria. The location of users is recorded at the antenna level every time a user interacts with their phone (phone call, text, or Internet session). This location data is used to estimate each user's migrations between a set of predefined regions, for example from Nairobi to Lake Victoria, as well as the total number of nights spent by every user in every region. The main expected outcomes of this work are two matrices that show the average monthly parasite importation by returning residents and by visitors. In the scenario we consider, such matrices would be computed on a monthly basis and shared with local CDCs, ministries of health, and NGOs. We also consider a case where data from multiple operators across neighboring countries would be used to estimate the monthly parasite importations per region. While this scenario has a clear public purpose, the sensitivity, re-identifiability, and potential for misuse of fine-grained
location data, such as targeting of individuals or groups for malicious purposes, have to be considered.

Scenario 2, inspired by [bigdatadriven], uses mobile phone metadata to micro-target people for specific behavior change purposes: agriculture techniques and health-seeking behaviors, for example. In this case, location data at the antenna level, as well as other metadata fields, such as anonymized call and text logs (excluding content) and recharge information, are used to estimate an individual's status (farmer, other socio-economic status) and/or propensity to change behavior. In this scenario, mobile phone metadata are used by machine-learning algorithms through a set of pre-computed metrics (e.g. daily distance traveled, recharging behavior, time it takes to answer a text). Users can then be targeted for various behavior change or informational campaigns through text messages or phone calls sent by the carrier or by a third party. While computing the metrics requires a rich set of data, this scenario aims to emphasize the challenges associated with micro-targeting individuals and introduces an element of intrusiveness that is not present in Scenario 1, although it involves the same public purposes.

4.3 Operation of Scenarios

For each scenario, we propose two potential implementations. We will subsequently analyze these four implementations from a privacy angle and a legal perspective.
4.3.1 Scenario 1

In Scenario 1 implementation A, the different mobile network operators (MNOs) involved would share simply anonymized individual mobility data with one third party. To limit the risks of re-identification, the data would be coarsened spatially and temporally. Matching the study [quantifying], the spatial resolution of the data would be at a predefined regional level, or approximately 1000 km² (692 settlements for the 581,309 km² of Kenya). Similarly, given the importance of nights for malaria infections (mosquito bites), the temporal resolution of the data would be 12 hours (e.g. 6am-6pm). Finally, as malaria symptoms may take up to 30 days to manifest themselves, we work under the assumption that three months of such mobility data are needed to estimate the impact of human mobility on malaria. The different MNOs would hash a salted version of the mobile phone number of the subscribers to allow the third party to reconcile the data. Scenario 1 implementation A is represented in the accompanying diagram.
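The coarsening and pseudonymization steps of implementation A might look like the following minimal sketch. Several things here are our assumptions rather than part of the scenario: the record fields (`phone`, `antenna`, `timestamp`), the `antenna_to_region` lookup, the choice of SHA-256, and the premise that all participating MNOs share the same secret salt so that the third party can reconcile pseudonyms across operators.

```python
import hashlib
from datetime import datetime


def pseudonymize(phone_number: str, salt: bytes) -> str:
    """Return a salted SHA-256 digest of a phone number."""
    return hashlib.sha256(salt + phone_number.encode("utf-8")).hexdigest()


def coarsen(record: dict, antenna_to_region: dict, salt: bytes) -> dict:
    """Map one call detail record to (pseudonym, region, 12-hour window)."""
    ts = record["timestamp"]
    # Two windows per day: "day" covers 06:00-17:59, "night" everything else.
    # Simplification: a night record after midnight keeps its calendar date.
    window = "day" if 6 <= ts.hour < 18 else "night"
    return {
        "user": pseudonymize(record["phone"], salt),
        "region": antenna_to_region[record["antenna"]],
        "date": ts.date().isoformat(),
        "window": window,
    }
```

Note that the phone number and antenna identifier do not appear in the output. A salted hash of a phone number is only as strong as the secrecy of the salt, since the space of valid numbers is small enough to enumerate, which is consistent with the text's description of this data as "simply anonymized."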
In contrast to implementation A, in implementation B, MNOs only share aggregated information with third parties. In this implementation, every MNO will provide a modified version of the mobility matrices developed by [quantifying] to the third party. Using three months of data, every MNO will assign each of its users to one region. This region will be the user's home location. The MNO will provide the third party with a region-region matrix containing how much time users whose home is in region i have been spending in region j. For example, the row corresponding to region i will look like the following:

i-2     i-1     i       i+1     i+2
1%      2%      87%     0.5%    2%

This reads that all the users whose home location is in region i have been spending 87% of their time (e.g. hourly or nights) in region i, 2% in region i-1, and 1% in region i-2 over the course of three months.

Each MNO will also provide the third party with the number of its subscribers who have been assigned to each region.
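The aggregation an MNO would perform in implementation B can be sketched as follows. The text does not say how the home region is chosen, so assigning each user to the region where they spent the most time is our assumption, as are the function names and the input layout (a mapping from user id to time spent per region).

```python
from collections import Counter, defaultdict


def home_region(time_by_region: dict) -> str:
    """Assign a user to the region where they spent the most time."""
    return max(time_by_region, key=time_by_region.get)


def mobility_matrix(users: dict) -> tuple:
    """Aggregate per-user time into a home-region x visited-region matrix.

    `users` maps a user id to {region: units of time spent there}.
    Returns (matrix, subscribers), where matrix[i][j] is the share of time
    that users with home region i spent in region j, and subscribers[i] is
    the number of users assigned to home region i.
    """
    totals = defaultdict(Counter)
    subscribers = Counter()
    for time_by_region in users.values():
        home = home_region(time_by_region)
        subscribers[home] += 1
        totals[home].update(time_by_region)
    matrix = {}
    for home, counts in totals.items():
        total = sum(counts.values())
        matrix[home] = {region: n / total for region, n in counts.items()}
    return matrix, dict(subscribers)
```

Only the matrix and the subscriber counts would leave the operator; no per-user record is shared, which is the property that distinguishes implementation B from implementation A.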
4.3.2 Scenario 2

Here we will also consider a third party platform provider, although the architecture is fairly similar if there is only the MNO involved. The issue is only that the end users would have to take it upon themselves to link to multiple MNOs if they wanted to be able to target clients of each.

Here the analytic transformation of the data conducted by the service provider would select a set of unique users (not identified by name or other PII, but by encrypted key or other anonymous unique identifier), based on their usage patterns and inferences about their social status or other traits. They would then pass the unique IDs to the MNO, who would be able to match them to the corresponding phone numbers for re-contact with an SMS or automated voice message encouraging program participation.

Case 1: Third parties may analyze anonymous data to select individuals, but the mobile operator is the only one in touch with targets and they are not identified to third parties. Third parties may pass back an encrypted key or other identifier to trigger sending a message.

Case 2: A third party is put directly in touch with the targets, or can identify them itself.

4.4 Regulatory Environment

Review of online sources on data privacy laws in Africa indicates a landscape that is evolving along two lines. Francophone countries in West Africa and North Africa that reflect the French civil code system have tended to adopt privacy frameworks modeled on the 1995 European Privacy Directive, supervised by data protection authorities. English-speaking countries with common law systems have less defined privacy laws.

Thus, data protection authorities in a number of French-speaking countries around the world have united in an association under the leadership of the French CNIL, and at least
Benin, Burkina Faso, Gabon, Ivory Coast, Senegal, Madagascar, Mali, Mauritius, and Morocco have such privacy regimes in place, with new laws expected in Mauritania and Niger. Many countries (e.g., Côte d'Ivoire) in both categories do not have any data protection laws, but do appear to have constitutional provisions for a right to privacy that provide at least some authority for protection.

In the English-speaking countries, the systems are less developed. South Africa recently adopted legislation, the Protection of Personal Information Bill, that adopts privacy principles to be enforced by a data protection authority; it takes effect at the end of this year. Both Nigeria and Kenya are considering broader bills that resemble each other.

Based on this framework, we will use the European Privacy Directive (EPD) as a benchmark for civil code countries. We will also look to the [Consumer Privacy Bill of Rights] as a way of exploring its application and developing an alternative framework. [5]

4.5 Data Utility

4.5.1 Scenario 1

Implementation A: In this case, the utility seems close to the situation of having access to the full raw data. Data preprocessing and cleaning are harder to do on coarsened data, as unusual behavior might be hidden by the coarsening (e.g. an unusually high number of phone calls).

Implementation B: In this case, the aggregation that is done at MNO level decreases the utility of the data. Considerations include tracking people across borders, removing dual-SIM users, and taking specific periods of time into account.

4.6 Privacy

Implementation A: There exists a risk of re-identification even when the data is coarsened. We will look at the number of antennas over several regions to match to the unicity formula on spatial resolution. Similarly, the temporal resolution here would be twelve hours. This should allow a very rough estimate of the likelihood of re-identification given x points.
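Where a coarsened data set is actually at hand, the likelihood of re-identification given x points can also be estimated empirically rather than through the closed-form unicity formula mentioned above. The sketch below is such an empirical estimate and is not the formula from [unique]: it measures the fraction of users for whom p randomly chosen points of their own trace match no other user's trace. The data layout, a mapping from user id to a set of (region, time window) points, is our assumption.

```python
import random


def unicity(traces: dict, p: int, rng: random.Random) -> float:
    """Fraction of users uniquely identified by p points of their own trace.

    `traces` maps a user id to a set of (region, time_window) points.
    Users with fewer than p points are skipped.
    """
    eligible = [u for u, pts in traces.items() if len(pts) >= p]
    if not eligible:
        return 0.0
    unique = 0
    for user in eligible:
        # Points an adversary is assumed to know about this user.
        known = set(rng.sample(sorted(traces[user]), p))
        # Every trace that contains all of the known points is a candidate.
        matches = [u for u, pts in traces.items() if known <= pts]
        if matches == [user]:
            unique += 1
    return unique / len(eligible)
```

Running this at several spatial and temporal resolutions would give a rough picture of how much protection each level of coarsening buys, at the cost of requiring access to the individual-level data.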
Implementation B: When data is aggregated, the risk of re-identification is lower; the edge cases would be very small regions that have been assigned as home regions to very few people. The risk to consider here would be at the group level, e.g. people from one region that only go to another region (of the same ethnic group, for example). A counterpoint would be people who spend too much time in another region. This goes beyond pure privacy as risk of re-identification, and many other cases should be considered.

[5] Craig Mundie, in a recent Foreign Affairs article, suggests a new model where governance and regulations should not be focused as much at the point of collection and storage of personal data, but rather on how that personal data is used and retained. The President's Council of Advisers on Science & Technology (of which Craig Mundie is a member) echoed many of the recommendations and thoughts in their document, Big Data: Seizing Opportunities, Preserving Value, in particular the belief that regulating use cases and enforcing privacy with stiff contractual obligations and deterrents may be needed to extract value while maintaining data security and privacy.
4.7 Critical Issues

- Business case for mobile carriers. Mobile carriers are not in the business of conducting social science or public health research. NGOs will need to develop a business plan that makes data sharing interesting and worthwhile from the carriers' perspective. Support of governments (e.g., health ministries and communications regulators) will be pivotal.
- Scenario 1 presents technical issues of de-identification. The spatial and temporal coarsening of call detail records (CDRs) substantially mitigates privacy risks and, if strong enough, can sidestep the application of the EU Privacy Directive. However, it can also limit the reliability and utility of the data.
- In Scenario 2, de-identification, at least for significant applications, is not an option, because interventions will be targeted to specific individuals. This scenario will require engagement of governments to enable the data use and identification; without affirmative support by relevant health and data protection authorities, this scenario may be impossible. The involvement of governments will also require careful development of mechanisms to avoid misuse and unwanted identification.
- Further development of specific practices and technical methods is needed to manage privacy protection in accordance with various principles of the EU Privacy Directive and the Consumer Privacy Bill of Rights (e.g. data retention, accountability).

4.8 Promising Paths Forward

Across both of these scenarios there are promising paths forward in terms of employing different technical architectures and practices to meet data protection needs, while still extracting value from the data.

4.8.1 Scenario 1

In this case, there are already private sector companies that grab mobility data from mobile operators and sell it without user permission (i.e., based on ostensibly achieving anonymity).
Airsage is an example in the U.S. that demonstrates a number of innovative approaches to sharing anonymous mobility data. They improve the quality of the position signal over what a CDR would be able to provide through triangulation, which they achieve by upgrading the base station software of the MNO. They then install software within the MNO firewall that anonymizes the data by stripping it down to just mobility patterns and aggregates the output to a minimum of seven mobile traces per observation. Hence, if two people moved from A to B in a given time period, they would report that "less than seven people moved." The fact that they do their anonymization within the firewall removes the need to share raw data.

A company called Grandata in Mexico uses a form of differential privacy algorithm to add some random noise and limit the fidelity of queries on their mobility data that they sell to retail marketers.

Other techniques to explore further would include emerging differential privacy approaches, as well as synthetic data set generation via modeling methodologies (e.g. DP-WHERE).
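Two of the ideas above, a minimum count per observation and noise added to query results, can be illustrated with a short sketch. Neither function is taken from the companies named: the suppression rule only mirrors the "minimum of seven" threshold described in the text, and the noise function is the textbook Laplace mechanism for a counting query, assuming each person contributes at most one to the count (sensitivity 1). All function and parameter names are ours.

```python
import random


def suppress_small_counts(flows: dict, k: int = 7) -> dict:
    """Replace any origin-destination count below k with a '<k' marker."""
    return {od: (n if n >= k else f"<{k}") for od, n in flows.items()}


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a zero-mean Laplace distribution."""
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a zero-mean Laplace distribution with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Laplace mechanism for a counting query with sensitivity 1."""
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A smaller `epsilon` means more noise and stronger protection, which is one concrete form of the utility and privacy tradeoff discussed for this scenario.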
4.8.2 Scenario 2

Because decisions are being made about actions involving individuals or small groups in this scenario, and because individual-level data (rather than aggregate) are being used, the fact that data is anonymized by being stripped of PII does not fully ameliorate privacy concerns.

Some approaches to investigate here are:

- ID key encryption schemes and anonymization approaches that go as far as possible to protect individual identity.
- Some form of regulatory exception (e.g. specific legal authorization or public policy exception) might also be in order, since even fully anonymized data would still refer to individuals.
- Development of ethical principles to make sure that decisions being made about individuals are fair and do not explicitly disadvantage anyone.
- This requires careful thinking about the user experience: SMS or calls that are clearly targeting the person might feel "creepy," and care should be taken not to make data subjects feel uncomfortable or targeted in any way.
- The development of trust frameworks to manage the data and verify the legitimacy of its uses.

4.9 References

4.9.1 Overview of African Privacy Regulation

[D4D] http://arxiv.org/abs/1407.4885
[UN] http://www.unglobalpulse.org/Mobile_Phone_Network_Data-for-Dev
[unique] http://www.nature.com/srep/2013/130325/srep01376/full/srep01376.html
[quantifying] http://www.sciencemag.org/content/338/6104/267.abstract
[bigdatadriven] http://web.media.mit.edu/~yva/papers/sundsoy2014big.pdf

https://docs.google.com/document/d/1tsjsadw41ymvhajqb9hcgc1s7kntab7p_jhejtepvas/edit?usp=sharing

4.9.2 Scenario development document

https://docs.google.com/document/d/1yg6w5althppw8koeigti_sr9lrotzbnw8w9eulTkP20/edit#heading=h.gjdgxs
5 Additional Use Cases

Summarized by Karen Sollins (MIT)

In addition to the three scenarios developed above, four other groups provided briefer reports. They are summarized here in order to further broaden our understanding of the breadth of the problem domain of privacy in the world of big data. These additional topics are: (1) Privacy in Aggregated Diverse Data Sets, (2) Creation, Management, Application, and Auditing of Consent on Personal Data, (3) Consumer Privacy/Retail Marketing, and (4) Genomics and Health.

5.1 Privacy in Aggregated Diverse Data Sets

Team: Evelyne Viegas (Microsoft), Micah Altman (MIT), Yves-Alexandre de Montjoye (MIT), Elizabeth Bruce (MIT)

Overview

Microsoft is working with the research community on developing an open source platform for hosting data sets and code for the machine learning research community. CodaLab is a Machine Learning Service that allows researchers to share and browse code and data, and to create and share experiments and workflows. CodaLab helps nurture an environment of scientific rigor and opens up new avenues for collaboration between researchers.

The characteristics of data that are submitted might vary widely. Such data include well-known, previously published data, such as that from official statistics and community-managed data obtained from third parties; data collected by the authors of the submission, generally for their research; and derivative data sets prepared specifically for a publication, which may integrate, correct, annotate, and recode data from multiple sources.

The emerging challenges in this area are related to the variety of data and the limited resources that are available for vetting it. Owners of community repositories are particularly concerned with developing policies that 1) are strong enough to strengthen replicability, 2) can be applied without intense case-specific scrutiny, and 3) recognize common threats of disclosure, while still permitting posting and access.
Stakeholders

- Data collector: wide range; any party that collects original data; no direct interaction with service or main scenarios; may have set terms under which data was originally collected
- Service host: provides CodaLab service and hosts storage; may impose restrictions on use
- Data subjects: wide range; no direct interaction with service or main scenarios
- Data curator: curators create "competitions" on the site, provide data to the service, set terms of use that are presented to competitors, and (optionally) vet competitors
- Data analyst: entrants in a particular competition, typically researchers who aim to develop or tune algorithms or models to optimize some quantitative competition criteria, such as % correctly predicted or mean-squared error (MSE)
- Data users: synonymous with data analysts
Questions and challenges

Key goals:
- Share research showing advancement in field (not just incremental advances)
- Find experts who can work on a (societal) problem

Key risks:
- Re-identification attacks
- Inadvertent disclosure of personal information

Identified challenges:
- What is the data life cycle?
- How does a service owner manage privacy-related risks resulting from running a service that accepts data from curators?
- Low-effort methods: these must apply to many different data sets of heterogeneous types without expert analysis of each database.
- Reuse across challenges: most competitions do not support reuse across challenges, or long-term access. In contrast, a goal of CodaLab is to contribute to a long-term evidence base for research in this area.
- Automatic or guided identification of PII/data currently focuses on medical/health data cases and may not be appropriate to the range of data being considered in this use case.
- How do we measure tradeoffs between utility and privacy in this use case?
- Are there automated techniques for identifying potential PII in data sets being submitted by researchers?

5.2 Creation, Management, Application and Auditing of Consent on Personal Data

Team: Simon Thompson (BT), Karen Sollins (MIT), Arnie Rosenthal (Mitre)

Overview

Personal data has many stakeholders. This scenario focuses on the ability of the subject, as an important stakeholder, to influence how their data is treated (collected, shared, used, and protected), and the ability of the controllers of personal data to abide by these preferences. Patients and other stakeholders must have incentives to share (and minimal disincentives), and to trust others to behave as they say they will. Otherwise, patients may withhold data from clinicians, and record holders will resist forwarding data to others, harming patients' health, increasing costs, and slowing operational improvements and research progress.

Personal data is of many kinds, often requiring different policies. These distinctions in kind are multi-dimensional, and no single distinction dominates. We note that audit metadata and the subject's own consent specifications are themselves personal data. They do not require fundamentally different treatment, but may have some specific policies attached.
This scenario is relevant to many important verticals, including several each in Healthcare, Education, and Commerce, but what is central to this scenario is the interplay among stakeholders' wishes. These depend on the kind of information involved. In particular, the subject may have different rights with regard to different kinds of data, especially in terms of medical content.

A critical aspect of this arena is that stakeholders, especially the subjects, deserve appropriate controls, but can rarely handle the technical complexity of specifying them. They need a way to customize behavior to be approximately correct. The regulatory framework may need to allow for situations where the user did not specify or understand all behavioral details (just as it allows sign-off on legalese that few citizens understand).

Stakeholders:

The key stakeholders considered in this review are:
- Data Subjects: those described by the data
- Record holders: the collectors and repository
- Recipients: those who may receive the data, including, for example with medical records, caregivers, payers, researchers, marketers, or legal authorities, who then may become record holders.

Questions, challenges, and observations:

Key Goals:
- Provide subjects with appropriate (to themselves) understanding and control (user preferences) over privacy policies of information about themselves.
- Balance the interplay between interests and responsibilities of different stakeholders, for example the subject, regulators, caregivers, insurance companies, etc.
- Tagging or other labeling and governance of data in order to enable application of policies.
- Certifying and maintaining the quality of the data.

Key Challenges:
- Preference data is itself metadata about the subject: Consumer preference data must be tagged by what content the preference itself reveals. A patient preference about releasing abortion data should itself be tagged as abortion-related, and cannot be shared with all record holders. It is an open question how best to combine confidentiality and usability for such data.
- Standards for composition when global standards are impossible: Global standards, globally complied with across all industries, are unlikely, especially as one adds more and more details. (A few basic practices might be standardized and complied with, but not the diversity in a modern economy.) How should stakeholders express policies that are robust, even when some information is absent?
- The diversity of enforcement mechanisms will complicate implementation: Techniques for a major corporation may be inappropriate for a small business, and techniques suitable for managing large documents may be inappropriate for
millions of values in a database. For example, omitting a document differs from redacting a database value (whose absence may be noted).
- Trust: To provide an effective privacy management mechanism, the privacy metadata of personal information must be trusted, and used by trusted components, i.e., one needs an effective trust network that assures that everyone will behave appropriately.

5.3 Consumer Privacy/Retail Marketing

Team: John Ellenberger (SAP), Ilaria Liccardi (MIT), Dazza Greenwood (MIT)

Overview:

This group considered a specific example in marketing, a customer loyalty program in a brick and mortar retailer. They envisioned a system with three elements: (1) the customer's smartphone, (2) a cloud-based intermediary service, and (3) the retailer's backend. The intermediary service provides the service for communication with the smartphone, both collecting data and pushing offers. The retailer's backend collects, manages, and utilizes the customer data and as part of that provides the support for any privacy policies and meets any legal requirements for privacy.

As an extension to this, the group also considered a case where third-party data may become available to the backend service. The group considered the problem of mapping between the "identified" data collected by the retailer and the potentially anonymized data from a third-party marketing firm.

Stakeholders:
- Subject
- Cloud service provider
- Retailer running the backend data collection, management, and analysis services
- Possible third-party marketing data source

Key goals:
- Improve the customer experience in the store
- Increase the retailer's market share
- To the extent there are regulatory requirements on privacy policy enforcement, comply with the law

Key challenges:
- Fusion of identified data, legitimately collected by the retailer, with third-party marketing data. Simply fusing these correctly is extremely difficult.
- To the extent that merging data may create "new data" about the subject, this is subject to regulations, especially in Europe, and it will require permissions from the subject.
- Inference of other facts about a subject from the base-level information. For example, it is well understood that patterns of "likes" may be a good predictor of preferences not directly exposed and therefore subject to privacy policies. The demonstrated example is prediction of sexual preferences.

More broadly this group did not consider the ethics of these approaches.
5.4 Genomics and Health

Team: James Williams (Google/University of Toronto), Michael Power (Osgoode Hall Law School)

Overview:

This scenario focuses on sharing health information (including genetic information) for both health-related research and personalized medicine. The scenario involves numerous health care providers (e.g., hospitals) and research groups (e.g., universities) collaborating to exchange information for a variety of purposes, including the provision of care. As a result, it is inherently complex; not only are there numerous organizations involved, but each of these may be subject to different legal requirements based on the jurisdiction (e.g., country, state, province) in which they operate.

While advances in genomic research methods have major ramifications for the biological sciences in general, they are particularly interesting from the standpoint of health-related research. In fact, some researchers have argued that the analysis of large genomic databases (i.e., containing millions of samples, as opposed to thousands) may be the key to unlocking new discoveries related to human health. To name but two advantages: 1) larger data sets empower researchers by supporting a wider range of queries and observations, and 2) the use of modern, distributed computing infrastructure supports interactive modes of research that offer major advantages over traditional approaches.

The situation becomes even more pressing when one realizes that many research problems can only be answered by combining genotype and phenotype data. In practice, this means the merging of genomic repositories with electronic medical records (EMRs). Indeed, the emerging field of personalized medicine is based on the ability to correlate information between these two domains. Given the multitudes of health-related issues facing human populations, and the promise of genomic research and personalized medicine to address a significant number of them, it is important to develop tools and methods for fostering the sharing of genetic and phenotypic information for research purposes.
Of course, privacy is one of the most commonly cited concerns that arise when individuals are surveyed about their attitudes towards sharing health information. It is vital that such data sharing be accomplished in a manner that minimizes risks to privacy. As part of respecting privacy, individuals must be provided with the ability to control the use of their information, including withdrawing consent.

While informational privacy concerns are explicitly addressed in data protection law, fair information practices, and data sharing agreements, it is an open question as to whether we can design better mechanisms to give effect to these norms.

Stakeholders:
- Patients, subjects of the data
- Clinicians, including both physicians and allied health professionals
- Researchers
- Health care service providers
- Institutional Review Boards (IRBs) or Research Ethics Boards (REBs)
- Regulators
Key Goals:
- Delivery of timely and effective health care (patients, clinicians)
- Participate in research (patients, possibly clinicians, researchers)
- Act in accordance with "fiduciary" responsibility (clinicians)
- Obtain and utilize large genomic data sets (researchers)
- Obtain and utilize large clinical (i.e. phenotype data) data sets (researchers)
- Integrate across these two types of data sets (researchers)
- Maximize efficiency of health care delivery (health care service providers)
- Utilize (and profit from) intellectual property inherent in patient records (health care service providers)
- Maintain security of records systems (health care service providers)
- Minimize privacy risks (regulators)
- Provide recourse for privacy violations (regulators)

Key Challenges:
- At present, integration is almost impossible. Most data set access is restricted to people within the organization collecting the data.
- Integration across different regulatory authorities is poorly understood.
- The trade-offs between privacy and utility in the context of technical privacy preservation mechanisms are particularly acute in the case of genomic research.
- There is also a tension between the ability of patients (data subjects) to control the use of their information, and the ability of researchers to accumulate stable data sets for research purposes. For instance, dynamic consent mechanisms give patients control of data at the expense of researchers, whose activities may be interdicted by requests to remove data from their corpus.
- Enabling inter-jurisdictional transfer of data may require the harmonization of regulatory regimes, as well as the adoption of common standards.
- The current transaction costs for data sharing agreements are onerous for many organizations, creating a landscape of 'silos' of health information that have great utility, but which cannot be accessed.
- Existing approaches to sharing health data between organizations rely heavily upon bilateral data sharing agreements. This approach scales poorly when there are multiple organizations that wish to jointly share data.
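The poor scaling of bilateral agreements noted in the last challenge can be made concrete with a simple count. As a rough illustration, assume (our simplification, not a claim made by the scenario team) that every pair among n organizations wishing to share data needs its own agreement, and compare that with a hypothetical multilateral framework in which each organization signs one common agreement:

```latex
\[
A_{\text{bilateral}}(n) = \binom{n}{2} = \frac{n(n-1)}{2},
\qquad
A_{\text{multilateral}}(n) = n .
\]
\[
A_{\text{bilateral}}(10) = \frac{10 \cdot 9}{2} = 45,
\qquad
A_{\text{bilateral}}(50) = \frac{50 \cdot 49}{2} = 1225 .
\]
```

Under this assumption the number of agreements grows quadratically with the number of participants, whereas a common framework grows only linearly, which is one way to read the call for common standards above.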
6 Conclusions

Karen Sollins (MIT)

We generalize three sets of conclusions from the review of the scenarios described above in Sections 2 through 5. The first is a set of overarching challenges derived from the systemic approaches taken across these big data scenarios in consideration of privacy. The second is a common, although not universal, set of types of stakeholders in handling both the big data itself and in support of the application of privacy policies. Finally, we observe a number of key open questions raised by the set of scenarios.

We observe five key challenges from the scenarios:

- Scale: Not only are we observing increasing sizes of data sets, but those increases in size will also lead to increases in the size of the accompanying meta-data that is critical to the support of privacy. Without significant improvements in efficiency, the growth in both data and meta-data will lead to untenable processing times, but such improvements must be achieved without cost to privacy.
- Diversity: With increasing data set sizes will also come an increase in interests and types of responsibilities. This increase is likely to lead to an increased probability of non-aligned interests. This diversity of objectives and interests will lead to at least a divergence of privacy policies and more likely to increased incompatibility of privacy policies. Capabilities for both observing and handling such differences will become increasingly important.
- Integration: In addition to the points above of scale and diversity, services increasingly support the integration of previously independent data sets. At a minimum this can lead to surprising or unintended inferences across these newly integrated data sets, resulting in previously unknown facts about subjects. Thus a new challenge arises from this integration in terms of privacy policies for these newly discovered facts or data.
- Impact on secondary participants: Although data may itself have a primary subject, increasingly there will also be secondary participants or subjects, such as friends, parents, guardians, or bystanders, also reflected in the data. Providing privacy through privacy policies for these secondary participants may be even more challenging than for the primary subjects of data.
- Need for emergent privacy policies for emergent data: Integration may lead to emergent, or previously unobservable data about subjects. This newly observable data will also require privacy policies, and it is not clear that those new policies will simply be a derivative of the policies applicable to the underlying original data. It is likely that new, emergent privacy policies will be needed, and the challenge is how those new policies will be created, by whom and under what conditions.

The second set of key observations we derive from these scenarios is a list of types of stakeholders, who play a role in setting, enforcing and mitigating the failure of application of privacy policies. We begin with the subjects of the data itself. In some cases, but not all, they play a role in determining applicable privacy policies. Additionally, a decision-maker, who decides what data to collect and how to handle it, may play a significant or central role in setting privacy policies. From there we move to the "handlers" of the data. That data will be collected by some party, and may be separately curated for completeness, accuracy, and so forth by a curator. The data may then be stored, managed, and made available by a data platform provider. It will then be used by a data analyst. All of these last four have access to the data in one form or another. We have then also identified two additional types of stakeholders, whose roles focus on enforcement of privacy policies and recording or auditing of usage of the data. These two final roles are distinct from each other. It is possible to have auditing without enforcement, for either legal or mitigation reasons, if a policy is violated. Enforcement benefits significantly from auditing, but is not dependent on it.

Finally, we recognize that there are many open questions. We highlight four here:
- Novelty: Although we identified a number of challenges above, there remains a question of whether big data leads to new and unique challenges in the provision of privacy, or whether these challenges are only more obvious in the big data arena.
- Tradeoff: Each of the scenarios presents a significant benefit. These may be economic, social, medical, and so forth. In addition, each presents risks to privacy, both inherently and perhaps because the situation is still new and not well understood. We must ask how to evaluate the tradeoffs between benefits and risks, specifically to privacy. At this point, we do not even have a metric or spectrum along which to consider this tradeoff, and it is not clear that a single one exists.
- Harm: The risk to privacy mentioned above is neither binary nor necessarily stable. This leads to a question of whether and how to evaluate the harm that may result from different choices in the tradeoff space between benefits and risks.
- Trust: Trust reflects a willingness among stakeholders to accept vulnerabilities. Thus, we must ask how it is that stakeholders determine their level of trust or mistrust in other stakeholders, with respect to the applicability of privacy policies. This includes both the stakeholders' models of trust, how those relate to people's perceptions of each other, as well as what mechanisms and technologies can provide in support of those levels of trust. Furthermore, one must ask how such trust evolves with time and how that might be supported technically.

It is important to recognize that our observations here are limited. They are based on this limited set of scenarios, and even in that context, may be incomplete. They are presented to give the reader a clearer sense of the sorts of challenges and questions that arise from the intersection of big data and privacy.

A. Appendix: Privacy Scenario Template
Team: Simon Thompson (BT) & Dazza Greenwood (MIT Media Lab)

Elements of Big Data scenario
- People/Stakeholders? (i.e., Who are the parties, their respective roles and relationships? Who is data owner (data controller)? Who is using the data and what is the intended purpose? Who are the data subjects? Who is doing the data analytics?)
- Interactions? (i.e., What transactions or other exchanges between Actors?) (What is the power dynamic?)
- Data
  - What kind of personal data?*
  - What type of big data models, analytics, or other outputs result from this scenario?
  - How is the data used?
  - What's the data lifecycle?
- Systems? (i.e., What business, legal, technical, or social systems matter most?)
  - Business Systems (Ethics committees, sign-off by authorized officers, record keeping, audit)
  - Legal Systems (Contracts, Employee rules/procedures, certification/accreditations, compliance reviews, insurance/bonding requirements, industry standard policy/guidelines, etc.)
  - Technical Systems (System permissions and security, alarms & automated detection of PAI, automatic anonymization of data, cryptography, etc.)
  - Social Systems (What social systems and context exists?)

Analysis of scenario
- Goals (i.e., What are the incentives and the benefits driving the Actors? Who benefits? What are financial incentives?)
- Rules: (i.e., What are the relevant laws and regulations, other enforceable rules)
  - Are there existing statutes, contractual agreements or other commitments associated with the data?
    i. Rules about retention,
    ii. Liability for breach?
    iii. Accuracy?
    iv. Others...
  - If there are not statutory or other binding rules, how would the principles from the Consumer Privacy Bill of Rights guide the development of rules?
    i. INDIVIDUAL CONTROL: Consumers have a right to exercise control over what personal data companies collect from them and how they use it.
    ii. TRANSPARENCY: Consumers have a right to easily understandable and accessible information about privacy and security practices.
    iii. RESPECT FOR CONTEXT: Consumers have a right to expect that companies will collect, use, and disclose personal data in ways that are consistent with the context in which consumers provide the data.
    iv. SECURITY: Consumers have a right to secure and responsible handling of personal data.
    v. ACCESS AND ACCURACY: Consumers have a right to access and correct personal data in usable formats, in a manner that is appropriate to the sensitivity of the data and the risk of adverse consequences to consumers if the data is inaccurate.
    vi. FOCUSED COLLECTION: Consumers have a right to reasonable limits on the personal data that companies collect and retain.
    vii. ACCOUNTABILITY: Consumers have a right to have personal data handled by companies with appropriate measures in place to assure they adhere to the Consumer Privacy Bill of Rights.
- Risks: What are the potential harms? What are the risks of those harms occurring? To whom? If the risk is an externality, how might it be mitigated?

Assessment of scenario
- Existing or related best practices for context of this scenario
  - What business, legal, and/or technical best practices?
- Gap: Issues Not Addressed by Existing Practices and Solutions
  - Business Systems
  - Legal Systems
  - Technical Systems
  - Social Systems
- Shortfall Between Current and Needed Practices and Solutions

Key outcomes for each scenario
- Promising best practices
- Gaps that need to be filled with new tech solutions or policy approaches

* Personal Data is defined broadly, as follows, from the Consumer Privacy Bill of Rights: "This term refers to any data, including aggregations of data, which is linkable to a specific individual. Personal data may include data that is linked to a specific computer or other device. For example, an identifier on a smartphone or family computer that is used to build a usage profile is personal data. This definition provides the flexibility that is necessary to capture the many kinds of data about consumers that commercial entities collect, use, and disclose."

B. Appendix: Stakeholders
Elizabeth Bruce (MIT), Karen Sollins (MIT)

Data Stakeholders: Description/Examples

"data collector"
  Party that collects the "raw" or original data from the data subjects.

"data subject(s)"
  A person (e.g. a patient, student, customer) or group of people (or entity) that data is being collected from; this is the group of data providers or participants. Subjects may be contributing data with informed consent (e.g. by opting in to a research study); or data may be collected indirectly or in aggregate. Data may be generated by:
  - an individual/consumer (e.g. taking an online class, a customer at a bank)
  - the interactions of a group of individuals (e.g. peer to peer interactions; social network graphs)
  - combining/aggregating data over a group/population of subjects.

"data curator" (also: controller, provider or caretaker)
  Party that stores and manages the data and is responsible for granting/controlling access to the data; the data curator is often the stakeholder that requires others to formally submit to a policy (or data use agreement) in order to access the data. There may be more than one data curator:
  - original data curator
  - third-party data curators

"data analyst" (also: data scientist)
  Party doing the analytics on the data; may use many different types of tools, software, etc. for analysis, exploration and visualization ("relying party").

"decision maker"
  The stakeholder(s) that benefits from the data; a decision maker that ultimately derives new insights and value from the data analysis; this stakeholder will ultimately make decisions based on the data and may or may not take action for some purpose. This purpose or use of the data may be for: personal benefit; for-profit or commercial use; or societal benefit (e.g. NGOs/government). Data beneficiary may be:
  - an individual
  - a group of individuals
  - an institution or organization (private; commercial; government; non-profit)
  - a content provider
  - a service provider

"data platform provider"
  The party that builds the system(s) for data collection and provides a service. Platform provider and data collector may or may not be the same entity/organization. In the case that they are different, the platform provider may have its own data use policy separate from the data collector.

"data regulator(s)"
  An arbiter that sets policies; the governing regulatory body that develops policies that control data collection, sharing and use among stakeholders; could be at the local, state, federal, or international level (e.g. HIPAA, FERPA, etc.).

"data auditor"
  The enforcing body responsible for ensuring the policies and regulations are enforced. May require audit logging and documentation to ensure policies are enforced and data is managed as required.
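
The roles in this table, together with the observation in Section 6 that auditing and enforcement are distinct, can be sketched as a small data structure. This is a minimal illustration, not a design taken from the report: the class and field names are hypothetical, and the access policy (only the four "handler" roles may read the data) is an assumption made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class Role(Enum):
    DATA_SUBJECT = "data subject"
    DATA_COLLECTOR = "data collector"
    DATA_CURATOR = "data curator"
    DATA_PLATFORM_PROVIDER = "data platform provider"
    DATA_ANALYST = "data analyst"
    DECISION_MAKER = "decision maker"
    DATA_REGULATOR = "data regulator"
    DATA_AUDITOR = "data auditor"


# Assumed policy for this sketch: only the four handler roles may read the data.
HANDLER_ROLES = frozenset({
    Role.DATA_COLLECTOR,
    Role.DATA_CURATOR,
    Role.DATA_PLATFORM_PROVIDER,
    Role.DATA_ANALYST,
})


@dataclass
class AccessGate:
    """Records every access request; blocks disallowed ones only if enforce is True."""
    enforce: bool = True
    audit_log: List[Tuple[str, str, bool]] = field(default_factory=list)

    def request(self, party: str, role: Role) -> bool:
        permitted = role in HANDLER_ROLES
        # Auditing: the request is logged whether or not it is permitted.
        self.audit_log.append((party, role.value, permitted))
        # Enforcement: an optional, separate step.
        return permitted if self.enforce else True


# Auditing without enforcement: access is granted, but the violation is recorded.
audit_only = AccessGate(enforce=False)
granted = audit_only.request("marketing team", Role.DECISION_MAKER)
violations = [entry for entry in audit_only.audit_log if not entry[2]]
```

With `enforce=False` the gate never blocks a request, yet the log still supports after-the-fact review, which mirrors the case of auditing without enforcement described in the conclusions.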

C. Appendix: Stakeholder Data from MOOCs and Online Learning Environments (OLEs)
Elizabeth Bruce (MIT)

Data Stakeholder: Example

Type of Data
  - All clickstream data capturing interactions between student and content, including when students watch videos/lessons, quiz answers, text from discussion forums, etc. Use of videos and other e-resources, such as digitized reference material, wikis, and forums.
  - Assessment behavior: attempts, correctness, use of immediate feedback.
  - May include PII (name, email, address) depending on what information is required when registering for a course.
  - Self-reported background, pre- and post-test surveys.

Data Subjects
  - Students who take the online course, complete assignments and receive credit
  - Students have zero or little access to their data beyond official grade/records created for their education purposes

Data Platform Provider
  Coursera, EdX, Udacity, Stanford U, etc.

Content Provider
  - Individual Content Providers include faculty, teachers, and staff who provide the teaching content and material (videos, lessons, quizzes, etc.), support discussions, interact with students (the data subjects) directly, and are responsible for grading/credit
  - Institutional Content Providers include institutions and organizations that are behind the teaching content (i.e. MIT, Harvard, or an individual private enterprise)

Data Collector
  Data Platform Providers and Institutional Content Providers

Data Curator
  Data Platform Providers and Institutional Content Providers

Data Scientist
  Analysts include researchers, their students (if the researchers are academics), and education technologists. Teaching staff, platform providers, and institutional content providers may also act as analysts.

Decision Maker
  Typically the Data Platform Providers and Institutional Content Providers, sometimes the Individual Content Providers (i.e. the teachers)

Data Auditor (and Compliance)
  Government

Data Regulator
  Government FERPA policies