Deidentification of Patient Data

November 1, 2021

In the United States, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) governs the use and disclosure of protected health information (PHI). Under HIPAA, entities like healthcare providers must carefully protect PHI from disclosure. However, HIPAA does not bar providers from sharing every kind of health information. It also sets out standards for releasing health information, which facilitate data sharing, commerce, and research. If identifying information is removed from patient data through a process called deidentification, HIPAA’s protections no longer apply, and entities are free to use or disclose the deidentified health information.

Drawing on statistics, algorithms, and machine learning, academic researchers have devised many methods for deidentifying patient data. For example, some methods use manually crafted patterns to classify information as “PHI-like” or “not PHI-like”. Other methods use supervised machine learning to learn automatically which patterns indicate PHI [1]. Regardless, the end goal is the same: deidentified data must resist reverse-engineering (called reidentification) that could compromise patient privacy. Under HIPAA, providers are liable if they fail to take reasonable deidentification measures [2]. At the same time, HIPAA standards strive to minimize the risk of reidentification without unduly burdening providers.
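To make the pattern-based approach concrete, here is a minimal sketch in Python. The three regular expressions and the `scrub` function are illustrative assumptions, not patterns taken from any published system, and they cover only a narrow slice of US-style identifiers. A practical tool would need far broader coverage, including names, dates, and addresses.

```python
import re

# Hypothetical, deliberately narrow patterns for a few US-style identifiers.
# These are illustrative assumptions, not a complete or validated PHI detector.
PHI_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def scrub(text: str) -> str:
    """Replace every span matching a pattern with a bracketed label."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Pt reachable at 555-867-5309 or jane.doe@example.org; SSN 123-45-6789."
print(scrub(note))
```

Hand-written patterns like these are easy to audit but miss anything their author did not anticipate, which is one reason the supervised methods surveyed in [1] learn the patterns from labeled examples instead.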

Providers can deidentify data using what HIPAA calls the expert determination method. Under this method, a person with sufficient experience analyzes the data to determine whether the risk of reidentification is “very small” [3]. The expert does not need a specific educational credential or professional certification, but they should have experience with statistics, mathematics, data science, or a similar field. To improve accountability, the expert must document their methods and results. By hiring an expert to make these determinations, healthcare providers can have greater assurance that their handling of patient data is HIPAA-compliant.
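One way an expert might begin to quantify reidentification risk is to count how many records share the same combination of indirectly identifying attributes, a measure related to k-anonymity in the privacy literature. The sketch below is an assumption for illustration only: HIPAA does not prescribe this measure or any numeric threshold for “very small”, and the field names and records are hypothetical.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    if not records:
        raise ValueError("records must not be empty")
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def worst_case_risk(records, quasi_identifiers):
    """Chance of picking the right record for the most exposed patient (1 / smallest group)."""
    return 1 / smallest_group_size(records, quasi_identifiers)


# Hypothetical toy records; the field names are assumptions, not a standard schema.
records = [
    {"birth_year": 1980, "zip3": "021", "sex": "F"},
    {"birth_year": 1980, "zip3": "021", "sex": "F"},
    {"birth_year": 1980, "zip3": "021", "sex": "M"},
    {"birth_year": 1975, "zip3": "100", "sex": "M"},
    {"birth_year": 1975, "zip3": "100", "sex": "M"},
]

print(worst_case_risk(records, ["birth_year", "zip3", "sex"]))
print(worst_case_risk(records, ["birth_year", "zip3"]))
```

In this toy dataset, one patient is unique on birth year, ZIP prefix, and sex, so the worst-case risk is 1.0. Dropping the sex column leaves every patient in a group of at least two and halves that figure. A real determination would also weigh who is likely to receive the data and what outside information they could plausibly hold.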

Providers without access to a qualified expert can instead deidentify data through the safe harbor method. In this case, the provider must remove eighteen types of data that could identify a patient [3]. These data include names, street addresses, telephone numbers, email addresses, Social Security numbers, full-face photographs, and others. HIPAA contains specific provisions for handling certain kinds of data, like birthdates and ZIP codes. Providers must also remove any remaining information that they know could identify a patient [3]. For example, if one patient has a specific and unique occupation title, the provider should not include occupation data in their dataset. If healthcare providers cannot readily hire an expert, performing the safe harbor method in-house should ensure HIPAA compliance.
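The safe harbor steps can be sketched for a single structured record as follows. This is a simplified illustration under several assumptions: the field names are hypothetical, only a handful of the eighteen identifier types are covered, and the set of low-population ZIP prefixes is a placeholder that would have to come from Census Bureau data. The ZIP and date handling follows the HHS guidance [3] as I understand it: keep only the first three ZIP digits (replaced with 000 where the area covered by that prefix has 20,000 or fewer people), keep only the year of a date, and group ages over 89 into a single category.

```python
from datetime import date

# Hypothetical field names for direct identifiers; a real schema will differ,
# and the safe harbor list has eighteen categories, not just these.
DIRECT_IDENTIFIER_FIELDS = {"name", "street_address", "phone", "email", "ssn", "photo"}


def safe_harbor_sketch(record, low_population_zip3, as_of_year):
    """Return a copy of `record` with direct identifiers dropped and
    ZIP code and birthdate generalized. Illustrative only."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIER_FIELDS}

    # ZIP codes: keep the first three digits unless the prefix is low-population.
    zip_code = out.pop("zip", None)
    if zip_code is not None:
        zip3 = zip_code[:3]
        out["zip3"] = "000" if zip3 in low_population_zip3 else zip3

    # Birthdates: keep the year only; conservatively treat anyone who turns
    # 90 or more during `as_of_year` as "90 or older" and withhold the year.
    birth_date = out.pop("birth_date", None)
    if birth_date is not None:
        is_90_or_older = as_of_year - birth_date.year >= 90
        out["birth_year"] = None if is_90_or_older else birth_date.year
        out["age_90_or_older"] = is_90_or_older

    return out


patient = {
    "name": "Jane Doe",
    "ssn": "123-45-6789",
    "zip": "02139",
    "birth_date": date(1980, 5, 17),
    "diagnosis": "asthma",
}
# {"999"} is a placeholder, not real census data.
print(safe_harbor_sketch(patient, low_population_zip3={"999"}, as_of_year=2021))
```

Field-level rules like these do not address the actual-knowledge requirement. A column such as occupation still needs human review for values that single out one patient.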

In summary, US federal law standardizes the handling of protected health information and deidentified health information. Yet questions remain about whether HIPAA’s standards for the deidentification of patient data are sufficient. In the statistics literature, reidentification has been performed on many kinds of datasets outside the healthcare space. For example, researchers reidentified subscribers in an anonymized Netflix dataset by cross-referencing its rating records with public IMDb movie/television ratings [4]. Researchers have not systematically studied whether current reidentification methods are effective on patient data, and further research is needed to assist expert determinations and healthcare policymakers [5]. Future research may inform new standards for improving patient privacy or new paradigms for handling patient data.
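The cross-referencing idea behind such attacks can be illustrated with a toy linkage sketch. This is not the algorithm from [4], which uses a statistical scoring scheme that tolerates noisy and partial matches. It is a deliberately simple stand-in that counts exact overlaps, and all names, titles, and ratings below are made up.

```python
def link_records(anonymous_rows, public_rows, min_overlap):
    """Match each pseudonymous ID to the named profile that shares the most
    (title, rating) pairs, if the overlap reaches `min_overlap`."""
    matches = {}
    for anon_id, anon_items in anonymous_rows.items():
        best_name, best_score = None, 0
        for name, public_items in public_rows.items():
            score = len(anon_items & public_items)  # shared (title, rating) pairs
            if score > best_score:
                best_name, best_score = name, score
        if best_score >= min_overlap:
            matches[anon_id] = best_name
    return matches


# Hypothetical "anonymized" release: pseudonymous IDs with rated titles.
anonymous_rows = {
    "user_17": {("Film A", 5), ("Film B", 2), ("Film C", 4)},
    "user_42": {("Film D", 3), ("Film E", 1)},
}
# Hypothetical public profiles that carry real names.
public_rows = {
    "alice": {("Film A", 5), ("Film C", 4), ("Film F", 3)},
    "bob": {("Film D", 3), ("Film G", 5)},
}

print(link_records(anonymous_rows, public_rows, min_overlap=2))
```

Here `user_17` is linked to `alice` because two ratings coincide, while `user_42` shares only one rating with `bob` and stays unmatched. The same logic applies to patient data whenever a released dataset keeps attributes that also appear in some public, named source.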



[1] S. M. Meystre, et al. Automatic De-identification of Textual Documents in the Electronic Health Record: A Review of Recent Research. BMC Medical Research Methodology 2010; 10: 70. DOI: 10.1186/1471-2288-10-70.  

[2] US Department of Health and Human Services. Special Topics: Research. 2018. URL: 

[3] US Department of Health and Human Services. Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. 2015. URL: 

[4] A. Narayanan and V. Shmatikov. How To Break Anonymity of the Netflix Prize Dataset. 2006. ArXiv:cs/0610105.  

[5] K. El Emam, et al. A Systematic Review of Re-Identification Attacks on Health Data. PLOS One 2011. DOI: 10.1371/journal.pone.0028071.