Balancing Data Safety and Utility in Large De-Identified Patient Datasets: A Summer Research Experience

by Edgar Robitaille

Introduction

This summer, I had the privilege of working with Dr. Christopher Chute at the School of Medicine on a computational research project evaluating the balance between data safety and utility in large de-identified patient datasets. This was part of the Bloomberg Distinguished Professor (BDP) summer program, in which each student is paired with a distinguished Hopkins faculty member who mentors them on a research project. More specifically, my project involved measuring a dataset metric called the “k-level” across various subsets of de-identified patient data. This value indicates how “safe” the dataset is from a linking attack, in which a dataset might be re-identified by matching de-identified records with identifiable external data sources, putting patient privacy at risk.
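As a rough illustration of the idea, here is a minimal sketch that assumes the k-level corresponds to the k of k-anonymity: the size of the smallest group of records that share the same values for a chosen set of quasi-identifiers. The function name, the column names, and the toy records are all hypothetical and are not taken from the project's actual code or data.

```python
from collections import Counter


def k_level(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    identical values for every quasi-identifier column.

    Assumption: "k-level" is treated here as the k of k-anonymity.
    """
    if not records:
        raise ValueError("k-level is undefined for an empty dataset")
    # Count how many records fall into each quasi-identifier combination.
    groups = Counter(
        tuple(record[column] for column in quasi_identifiers)
        for record in records
    )
    # The least-populated combination determines the k-level.
    return min(groups.values())


# Hypothetical toy records; real patient data would look very different.
records = [
    {"age_band": "30-39", "zip3": "212", "sex": "F"},
    {"age_band": "30-39", "zip3": "212", "sex": "F"},
    {"age_band": "30-39", "zip3": "212", "sex": "M"},
    {"age_band": "40-49", "zip3": "212", "sex": "M"},
    {"age_band": "40-49", "zip3": "212", "sex": "M"},
]

k_full = k_level(records, ["age_band", "zip3", "sex"])
k_coarse = k_level(records, ["age_band", "zip3"])
print(k_full, k_coarse)
```

In this toy example, dropping the sex column merges the one-record group into a larger one, so the k-level rises. That is the safety and utility trade-off in miniature: fewer or coarser columns make records harder to single out, but leave less detail for analysis.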

Key Experiences/Challenges

My project had two major components: first, figuring out how to preprocess the data into a usable format, and second, running the k-level analysis on the preprocessed data. The main challenges in preprocessing came from the sheer size of the dataset and the fact that it contained protected health information. The first platform I was given was completely new (I was the first user), and it was unable to handle a dataset that large. The second platform had difficulty connecting to the database, both because it was also new and because healthcare data is inherently difficult to access. I spent hours working with IT to get access to the data, and worked alongside people familiar with the dataset to come up with a way to compress the data into a format that was easy to work with. Eventually, I figured this out and was able to write code to run the k-level analysis. This was a challenge in itself, but it went more smoothly than the data preprocessing because there were no immediate bottlenecks.

Skills and Knowledge Gained

It was my first time working with this much data and with a de-identified dataset. I learned how to use various Python libraries both to connect to the data and to analyze it, something I had not done before at this scale. I had worked with libraries meant for smaller datasets, but they don’t work as well at this scale, so I had to use a new library better suited to large datasets. In addition to these technical skills, I was also able to develop soft skills in working with my mentors. Each week, I presented my work and had to decide exactly how to articulate what I had been working on or what I had accomplished that week. I also learned how to ask better questions, when to listen, and when to share my own input. I learned that my input matters too, and to stand firm in my opinions and thoughts.

Impact of OKRs

Setting OKRs was a great way to write down a goal and really stick to it. Sometimes, if I don’t write things down concretely, I find myself ignoring them later on. The OKRs allowed me to visualize my goals for the summer as a whole and created a checklist from which I could satisfyingly cross items off as I completed them. They allowed me to set small goals to motivate myself to continue working, and small goals build up to large goals. They helped me stay organized and motivated, and gave me a sense of achievement as I progressed.

Lessons Learned

The main thing I learned during the experience is that things don’t always go as planned. Even outside my project, I had other goals this summer that I may have reached, but not in the way I originally thought I would. Within the project, I thought that the preprocessing would be relatively quick and easy, but later found that was not the case! It may even have been one of the hardest aspects of the project. I learned not to take things at face value and to realize that sometimes goals can be reshaped or reformed in new ways. I also learned how to act professionally and how to properly present information and completed work in weekly meetings.

Future Applications

Through this summer and my project, I believe I have set an example for the types of projects or problems I may want to solve in the future. I really enjoy tackling medical problems from a computational approach, and I thought that this project nicely intertwined both of my undergraduate majors, BME and CS. I think the project has solidified my interest in both of these fields and has inspired me to further pursue research and opportunities that combine them. I am applying to med school, so it’s likely I won’t always have the opportunity (or the need) to apply engineering or computer science skills to the new problems I face. However, in the future, I would like to continue finding interesting ways to apply my unique background to these emerging problems, much as I did in this summer project.

By Life Design Lab