Distance-based Linkage of Personal Microbiome Records for Identification and its Privacy Implications
In this paper, we explore privacy attacks that link samples of the human microbiome, and we extend the current threat landscape with a more effective linkage attack. We also discuss mitigation measures. The paper was published in “Computers & Security”, a Q1-ranked journal published by Elsevier.
Title
Distance-based Linkage of Personal Microbiome Records for Identification and its Privacy Implications
Authors
Rudolf Mayer, Markus Hittmeir, Andreas Ekelhart
Journal
Computers & Security
Abstract
Due to its high potential for analysis in clinical settings, research on the human microbiome has been flourishing for several years. As an increasing amount of data on the microbiome is gathered and stored, analysing the temporal and individual stability of microbiome readings, and the resulting privacy risks, has gained importance. In 2015, Franzosa et al. demonstrated the feasibility of matching and linking individuals in microbiome-based datasets from the Human Microbiome Project, which could lead to re-identification of individuals and thus has privacy implications for microbiome study designs. Their technique is based on the construction of body site-specific metagenomic codes that maintain a certain stability over time.
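To make the idea of a metagenomic code more concrete: it can be understood as a small set of microbial features that, taken together, occur in only one individual of a cohort. The following Python sketch builds such a code greedily from presence/absence profiles. It is an illustration under simplifying assumptions, not the implementation of Franzosa et al.; the function names `greedy_code` and `matches`, the feature labels, and the toy profiles are hypothetical.

```python
def greedy_code(target, others):
    """Greedily pick features of `target` until no other individual has all of them.

    `target` and every entry of `others` are sets of feature labels.
    Returns the code as a set, or None if no unique code exists.
    """
    code = set()
    remaining = list(others)  # individuals not yet excluded by the code
    while remaining:
        candidates = sorted(set(target) - code)
        if not candidates:
            return None
        # Choose the feature that is absent from the most remaining individuals.
        best = max(candidates, key=lambda f: sum(f not in o for o in remaining))
        if all(best in o for o in remaining):
            return None  # no feature separates the target from the rest
        code.add(best)
        remaining = [o for o in remaining if best in o]
    return code


def matches(code, sample):
    """A later sample matches a code if it still contains every code feature."""
    return code is not None and code <= set(sample)


# Toy first-visit profiles (hypothetical feature labels).
profiles = {
    "A": {"f1", "f2", "f3"},
    "B": {"f1", "f2"},
    "C": {"f2", "f3", "f4"},
}
codes = {
    name: greedy_code(feats, [o for n, o in profiles.items() if n != name])
    for name, feats in profiles.items()
}

# A second-visit sample of individual A that lost f2 but kept the code features.
later_sample_a = {"f1", "f3", "f5"}
a_is_relinked = matches(codes["A"], later_sample_a)
```

In this toy cohort, individual B has no unique code because its features are a subset of A's. This shows why code-based linkage can fail for some individuals even before temporal drift is considered.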
In this paper, we establish a distance-based technique for personal microbiome identification, which is combined with a solution for avoiding spurious, false positive matches. In a direct comparison with the approach from Franzosa et al., which assumes that information is available as microbial records, rather than at the more detailed (but less likely to be shared) nucleic acid level, our method improves upon the identification results on most of the considered datasets. Our main finding is a 30% increase in the average percentage of true positive identifications on the widely studied microbiome of the gastrointestinal tract. While we particularly recommend our method for application to the gut microbiome, we also observed substantial identification success on other body sites. Our results demonstrate the potential for privacy threats in microbiome data gathering, storage, sharing, and analysis, and thus underline the need for solutions to protect the microbiome as personal and sensitive medical data. We also show that the method is robust to various hyper-parameter settings.
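A minimal sketch of what distance-based linkage with rejection of spurious matches might look like is given below. The abstract does not specify the distance measure or the rejection rule, so both are assumptions here: the sketch uses the Manhattan distance between relative-abundance profiles and rejects a match when the nearest candidate is not clearly closer than the second nearest. The names `link_samples` and `max_ratio` and the toy data are hypothetical.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equally long abundance profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def link_samples(reference, queries, max_ratio=0.8):
    """Link each query profile to its nearest reference profile.

    A link is rejected (None) when the best distance is not clearly smaller
    than the runner-up distance, which is an assumed heuristic for avoiding
    spurious matches. Requires at least two reference profiles.
    """
    links = []
    for query in queries:
        ranked = sorted(
            (manhattan(query, ref), idx) for idx, ref in enumerate(reference)
        )
        (best, best_idx), (runner_up, _) = ranked[0], ranked[1]
        if runner_up == 0 or best / runner_up > max_ratio:
            links.append(None)  # ambiguous: do not claim an identification
        else:
            links.append(best_idx)
    return links


# Toy relative abundances over three taxa (first visit = reference).
reference = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.30, 0.30, 0.40],
]
# Second-visit samples; the last one is equally far from two references.
queries = [
    [0.65, 0.25, 0.10],
    [0.12, 0.75, 0.13],
    [0.45, 0.40, 0.15],
]
links = link_samples(reference, queries)
```

The first two queries are linked to reference profiles 0 and 1, while the third is rejected because two candidates are about equally close. Lowering `max_ratio` trades true positives for fewer false positives.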
Based on our observations, we further identify challenges in personal microbiome identification research, specifically the scarcity of benchmark data and associated data analysis tasks. Drawing on this experience, we propose solutions for a more systematic and comparable evaluation, which also take into account the costs entailed by applying privacy-preserving methods.