Core Technology Development Team Meeting

Agenda
- Presentation: User Needs Survey/Analysis (Todd/Vidya)
- PP integration
- Brief updates from others

Supported by the NIH grant 1U24 AI117966-01 to the University of California, San Diego

bioCADDIE Interview Summary
Vidya Narayana

Total interviews conducted as of 06/22/2015: 10

User profiles:
- 1 radiation oncologist
- 1 public health Ph.D. student
- 1 anesthesiologist
- 1 assistant professor, molecular diagnostic laboratory
- 1 associate professor, Department of Chemistry
- 1 graduate student, molecular biology
- 1 researcher, National Cancer Institute of NIH
- 1 professor of Physiology, Medicine and Cardiology
- 1 project scientist, Department of Physiology, Medicine and Anesthesiology, NHLBI proteomics center
- 1 assistant research scientist, comparative genomics

Themes of questions asked:
1) Datasets involved/working on
2) Searching strategies
3) Preferred data formats, problems associated with them, and the most suitable format in their field
4) Data visualization strategies preferred, problems and opinions
5) Access rights
6) Current trending problems in their respective field
7) Opinion on how to make data more discoverable

Datasets involved/working on
All of the interviewees gave detailed descriptions of the datasets that they are currently involved in (I can provide a detailed list). In 1 case, the interviewee mentioned the time gap involved in working with a particular dataset.

Searching strategies
All of the interviewees said that they knew exactly where they wanted to go and did not have any problem searching for data. However, 1 interviewee, although not having a problem finding data, expressed the need for a list of databases with descriptions of what can be found in each and quick links to them.

Preferred data formats

All of the interviewees gave a list of formats they work with (I can provide a detailed list here too); however, most of them (7/10) preferred to download the raw txt file and convert it to a format of their choice. 1 person used a step-by-step protocol to curate and wrangle the data for class purposes; however, she wished the data were available in the form she required, because her students were spending more time curating the data than making sense of it.

Access rights
None of the interviewees faced any problems in gaining access to their data. However, 1 person expressed the need to shorten the process, and reduce the number of people involved, in gaining access to clinical patient data.

Trending problems in their respective field
This question had to be rephrased in several ways (with the aim kept constant), keeping in mind the experience level of the interviewee. Answers to this question varied from individual to individual. Some of them were:
- 1 person had a lot of data available to her in the form of pictures, stored in the database that way over many years. When she wanted to use the data, she had to go through the files individually to get clinical details from them. She calls this a failure of the system. She wishes big datasets captured the fine details of individual case files. In her opinion, big data is not just the accumulation of a lot of data, but should be powerful enough to capture very minute information even from a small amount of data.
- The same person looked for a particular syndrome among children in the hospital's database for research. She could not search for it because the database did not give an option to search by a certain factor. She then recruited a graduate student to look through 3000+ papers.
- 1 person described the lack of description of data at an individual level. According to her, each data element and its connection with the main purpose must be described.
- 1 person felt that there is a lot of useful data collected by independent researchers. She personally knows many people who collect data but are not willing to share it. She feels there should be incentives of some kind (not necessarily monetary) to encourage people to release their datasets. She also expressed concern that a monetary incentive might be misused by many people.
- 1 person said that the metadata in big data is itself the biggest problem. He said that almost no one uses a controlled vocabulary, and that metadata is highly contextually variable. However, his biggest problem with databases is that most of them do not support querying into the data (metadata vs. content).
- 1 person said that there is a learning/education gap between communities that have embraced Hadoop and communities that use pipelines as the means of constructing their big data. He feels that it is difficult to transition from the pipeline world to the big data world, partly because there is not enough mature expertise available at the moment (he meant that it is hard to get sufficient training for people). He summarized training and tools as the big problems.
- 2 people said that discoverability of data is the biggest problem because of the lack of indexing. Although they had not faced this problem themselves, they said they are aware of people who do.

In general, the people whom I am interviewing from the BD2K centers are able to give a holistic view of the problems, and the people whom I interviewed in TMC are able to express problems at their individual level. This seems like a fair combination.

Opinion on making data more discoverable
This question was also rephrased (with the aim kept constant) according to the person being interviewed. Some opinions were:
- Referencing the dataset in the abstract of a paper so that it comes up faster in a search
- Collaborating with the popular data hosting websites

The rest of the opinions were all associated with specific examples. I can provide an exhaustive summary after all the interviews.

Number of upcoming interviews: 8

Other issues
- Any other issues?
- Thank you