Project 3: De-identification of health care data

Background and Objectives:

Over the past two decades, an increasing amount of medical health data (MHD) has been collected for secondary purposes. MHD contains information such as patients’ demographics, diagnoses, medication history and, in some cases, family history. MHD is normally stored in databases that are accessible to medical researchers, which allows them to conduct research on epidemiology, the quality of novel treatments, register-based cohort studies, etc. (Gkoulalas-Divanis & Loukides, 2015). A large amount of data has been collected systematically in databases for secondary use, for example for millions of American patients (Schneble et al., 2020). This has, however, also increased the risk of reidentification (RR) attacks (El Emam et al., 2011). A systematic review by Khaled El Emam and colleagues reported that 34% of reidentification attacks on medical data were successful (El Emam et al., 2011). Although that review was limited to datasets with relatively small sample sizes, it indicates that RR may be a significant threat. To minimize the risk of reidentification from systematic cyber attacks on MHD, researchers have developed sophisticated techniques and algorithms that anonymize data sufficiently for it to be used for secondary purposes while preserving patients’ anonymity (Langarizadeh et al., 2018). If the data is sufficiently anonymized in compliance with ethical guidelines, written patient consent is not required to use the data for secondary purposes. This avoids the bias that arises when only a fraction of patients, rather than the entire population, give consent (El Emam & Arbuckle, 2014). Anonymization therefore offers an efficient and less arduous way of obtaining data and applying it in relevant analyses (El Emam & Arbuckle, 2014).
What makes anonymization difficult is the delicate balance that must be maintained between data utility and privacy (El Emam & Arbuckle, 2014; Sánchez et al., 2014). If the data is anonymized to such an extent that it provides no useful information about patients, it is rendered useless; conversely, if data utility is kept very high, the risk of reidentification grows substantially (Sánchez et al., 2014). Datafly is one of the earliest anonymization programs; it applies generalization, insertion, substitution and removal of information to de-identify data (Sweeney, 1998). One of the biggest limitations of Datafly is the guesswork involved in profiling sensitive fields: if a specific field is needed to link the released database to another database held by the recipient, and the recipient does not set the linking likelihood to 1, less secure MHD will be released (Sweeney, 1998). One of the widely used de-identification methods is Optimal Lattice Anonymization (OLA), which is based on k-anonymity and primarily de-identifies quasi-identifiers (El Emam et al., 2009). OLA provides an optimal solution for de-identification, and the anonymized records have higher utility than those produced by some earlier techniques such as Datafly. One limitation of OLA is that the information loss metric it uses has to be monotonic, whereas other methods such as Incognito have no such constraint; nonetheless, OLA is much faster than Incognito (El Emam et al., 2009). Another, more recent anonymization method is Utility-Preserving Anonymization for Privacy Preserving Data Publishing (PPDP). This method also applies the k-anonymity technique and comprises three parts: a utility-preserving model, counterfeit record insertion and a catalog of counterfeit records (Sánchez et al., 2014). This approach produced significantly better results than OLA (Sánchez et al., 2014).
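
The trade-off between privacy and utility under k-anonymity can be illustrated with a minimal sketch. It is not the Datafly or OLA algorithm; it only shows how generalizing quasi-identifiers enlarges equivalence classes. The records, the field names, the bucket widths and the helper names (generalize, k_anonymity, max_reidentification_risk) are all assumptions made for illustration, and the 1/k risk figure assumes an attacker who already knows that the target is in the dataset.

```python
from collections import Counter

# Toy records; every value here is invented for illustration only.
RECORDS = [
    {"age": 34, "zip": "10115", "diagnosis": "asthma"},
    {"age": 36, "zip": "10117", "diagnosis": "diabetes"},
    {"age": 38, "zip": "10119", "diagnosis": "asthma"},
    {"age": 52, "zip": "20095", "diagnosis": "hypertension"},
    {"age": 57, "zip": "20097", "diagnosis": "asthma"},
]
# Assumed quasi-identifiers: fields that could be linked to outside data.
QUASI_IDENTIFIERS = ("age", "zip")


def generalize(record, age_width=10, zip_digits=3):
    """Coarsen the quasi-identifiers: bucket the age, mask the ZIP code tail."""
    low = (record["age"] // age_width) * age_width
    out = dict(record)
    out["age"] = f"{low}-{low + age_width - 1}"
    masked = "*" * (len(record["zip"]) - zip_digits)
    out["zip"] = record["zip"][:zip_digits] + masked
    return out


def k_anonymity(records, quasi_identifiers=QUASI_IDENTIFIERS):
    """Return k: the size of the smallest equivalence class."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())


def max_reidentification_risk(records, quasi_identifiers=QUASI_IDENTIFIERS):
    """Worst-case risk of singling out one record, taken as 1 / k."""
    return 1 / k_anonymity(records, quasi_identifiers)


generalized = [generalize(r) for r in RECORDS]
print("k before generalization:", k_anonymity(RECORDS))
print("k after generalization:", k_anonymity(generalized))
print("worst-case risk after:", max_reidentification_risk(generalized))
```

In this toy example every raw record is unique on age and ZIP code, so k is 1. After generalization the smallest class holds two records, at the cost of losing the exact age and ZIP code, which is the utility loss the methods above try to minimize.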
Nonetheless, all methods have strengths and weaknesses with regard to the risk of reidentification and data utility (El Emam & Arbuckle, 2014; Langarizadeh et al., 2018). The aim of this study is to systematically review articles that investigate the strengths and weaknesses of different anonymization approaches and computer programs for anonymizing MHD with regard to the risk of reidentification and data utility. A secondary objective is to systematically review studies that present new anonymization approaches and computer programs, even if they do not assess the data utility and reidentification risk of the approach.


This systematic review will be conducted in accordance with the PRISMA guidelines for systematic reviews (Moher et al., 2009). The following databases will be searched systematically: PubMed, EBSCOhost, ACM Digital Library, Medline, IEEE, Embase, Web of Science Core Collection and Scopus. Additionally, a manual search will be conducted in the following journals: Studies in Health Technology and Informatics, International Journal of e-Healthcare Information Systems and Journal of Biomedical Informatics. Finally, ProQuest Dissertations & Theses Global will be searched to include as many eligible studies as possible. In addition to the above-mentioned resources, the reference lists of papers on the subject will be searched manually, experts in bioinformatics will be contacted, and a campaign using the Twitter and LinkedIn accounts of the #OpenSourceResearch collaboration (reference:) was launched to collect information about any other algorithms or software, in order to give as complete an overview of the subject as possible. The following keywords will be used:







· Medical health data
· Medical health records
· Electronic health data
· Electronic medical records
· Digital health records
· Digital medical data
· Data utility
· Data usefulness
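
As a sketch of how these keywords might be combined into a single Boolean search string, the example below joins synonyms with OR and the two concept groups with AND. The split into a "data" group and a "utility" group, the helper names (or_group, build_query) and the quoting style are assumptions; the exact syntax has to be adapted to each database.

```python
# Assumed grouping of the protocol's keywords into two concepts.
DATA_TERMS = [
    "Medical health data",
    "Medical health records",
    "Electronic health data",
    "Electronic medical records",
    "Digital health records",
    "Digital medical data",
]
UTILITY_TERMS = ["Data utility", "Data usefulness"]


def or_group(terms):
    """Quote each phrase and join the synonyms with OR."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(*groups):
    """Require one match from every concept group by joining them with AND."""
    return " AND ".join(or_group(group) for group in groups)


query = build_query(DATA_TERMS, UTILITY_TERMS)
print(query)
```

Keeping the terms in lists makes it straightforward to document the exact string used for each database in the review.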

Inclusion and exclusion criteria:

The following inclusion and exclusion criteria will be applied to select eligible studies:

Inclusion criteria

· Anonymized medical records for which the risk of reidentification and data utility have been assessed.
· Studies that applied appropriate de-identification techniques and relevant algorithms to anonymize structured medical data and assessed the risk of reidentification and data utility.
· No restriction on national setting is applied.
· Medical data anonymized specifically for secondary use.
· All types of peer-reviewed articles, doctoral theses, and relevant books.

Exclusion criteria

· Journal articles that are not peer-reviewed.
· Full text is not available.
· Published before 2000.
· Books, newspaper articles, posters, letters to the editor.

Data extraction:

The following parameters will be extracted: author, year of publication, sample size, the methods and computer programs applied to anonymize the data, and a summary of outcomes.

Data extraction will be performed independently by three authors (V.A., A.S., O.A.). Any disagreement will be resolved by discussion or by involving the senior author (A.E.).

Displaying of data:

The systematic search and the final number of studies included in the review will be presented in a PRISMA flow chart. Additionally, a table will display the characteristics of the included studies, using the parameters listed in the data extraction section.


El Emam, K., & Arbuckle, L. (2014). Anonymizing Health Data (A. Oram & A. MacDonald, Eds.). O’Reilly Media, Inc.

El Emam, K., Dankar, F. K., Issa, R., Jonker, E., Amyot, D., Cogo, E., Corriveau, J. P., Walker, M., Chowdhury, S., Vaillancourt, R., Roffey, T., & Bottomley, J. (2009). A Globally Optimal k-Anonymity Method for the De-Identification of Health Data. Journal of the American Medical Informatics Association, 16(5), 670–682.

El Emam, K., Jonker, E., Arbuckle, L., & Malin, B. (2011). A systematic review of re-identification attacks on health data. PLoS ONE, 6(12).

Gkoulalas-Divanis, A., & Loukides, G. (2015). Medical Data Privacy Handbook.

Langarizadeh, M., Orooji, A., & Sheikhtaheri, A. (2018). Effectiveness of anonymization methods in preserving patients’ privacy: A systematic literature review. Studies in Health Technology and Informatics, 248(6), 80–87.

Moher, D., Liberati, A., Tetzlaff, J., & Altman, D. G. (2009). Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. BMJ (Online), 339(7716), 332–336.

MSKTC. (2019). A Guide for Developing a Protocol for Conducting Literature Reviews. 90, 7.

Sánchez, D., Batet, M., & Viejo, A. (2014). Utility-preserving privacy protection of textual healthcare documents. Journal of Biomedical Informatics, 52, 189–198.

Schneble, C. O., Elger, B. S., & Shaw, D. M. (2020). Google’s Project Nightingale highlights the necessity of data science ethics review. EMBO Molecular Medicine, 12(3), 3–4.

Sweeney, L. (1998). Datafly: a system for providing anonymity in medical data. 356–381.