Guide to the De-Identification of Personal Health Information: Second Edition

Description

The growing availability of health data has created unprecedented opportunities for research, public health, quality improvement, and data-driven innovation. At the same time, health information is among the most sensitive forms of personal data, and its use raises legal, ethical and regulatory challenges.

This book helps the reader understand, assess and mitigate privacy risks to facilitate responsible health data use and sharing. It presents a comprehensive, risk-based de-identification methodology. Rather than treating de-identification as a simple removal of identifiers, it conceptualizes it as a structured risk-management process that reduces the likelihood of re-identification to a very low level within a given data-sharing context. The methodology integrates legal, technical, and organizational perspectives and aligns with international standards and regulatory guidance. It is also based on more than two decades of performing risk assessments and de-identification of health data globally.

Foundational chapters examine the life cycle of health data, data subject perspectives, and the legal frameworks governing the use and disclosure of personal health information across jurisdictions. Building on this context, the book develops a conceptual model of risk that incorporates adversaries, attack scenarios, and different forms of disclosure, including identity, attribute, and membership disclosure. It reviews transformation methods that reduce data vulnerability, including emerging technologies such as synthetic data, as well as algorithmic approaches that optimize the balance between privacy protection and analytical utility.

A central contribution of the book is its detailed methodology for quantitatively measuring risk. It introduces formal models for threat modelling including adversary motives, capacity and background knowledge, along with statistical metrics for estimating vulnerability at a dataset level. Such a quantitative model supports standardized and repeatable risk assessments with transparent decision rules and risk thresholds that guide whether and how data should be transformed before release.

Case studies, assessment instruments, and operational strategies support organizations in applying these techniques consistently and defensibly.

Designed for a broad audience including data custodians, researchers, privacy professionals, regulators, and health data practitioners, this book provides both the conceptual foundations and practical tools needed to enable responsible secondary use of health data while preserving privacy and maintaining the value of data.

Table of Contents

Chapter 1: Introduction

Section I: The Case for De-Identifying Personal Health Information

Chapter 2: Common Data Flows of Health Data (Data Life Cycle, Collection, Use and Disclosure of Health Data, More Complex Real-World Data Flows, Key Takeaways, References)
Chapter 3: Privacy from the Data Subject Perspective (Health Information Privacy Concerns, Privacy Protective Behavior, Antecedents of Privacy Concerns, Key Takeaways, References)
Chapter 4: Privacy from the Legal Perspective (Legal Definitions of Personal Health Information, Use and Disclosure of Personal Health Information, Lawful Bases for the Discretionary Use and Disclosure of PHI, The Lawful Basis of De-Identification, Key Takeaways, References)
Chapter 5: The Need for De-Identification (Lawful Use and Disclosure of Data, Building Trust with Data Subjects, Addressing Real-World Risks, Data Breaches in the Healthcare Sector, Historical Misuse of Data, Key Takeaways, References)
Chapter 6: Alternatives and Complementary Approaches to De-Identification (Consent as Lawful Basis for Use and Disclosure, Privacy-Enhancing Technologies, Remote or On-Site Access, Homomorphic Encryption, Secure Multiparty Computation, Trusted Execution Environments, Remote Execution, Remote Queries, Differential Privacy, Federated Learning, Key Takeaways, References)

Section II: Understanding Disclosure Risks

Chapter 7: Basic De-Identification Concepts (The De-Identification Process, Metrics and Measurements, Original Data and DFs, Unit of Analysis, Aggregate Data As The Unit of Analysis, Organization of Tabular Data, Relational Databases, Flat-File Databases, Types of Tabular Data, Cross-Sectional Data, Transactional Data, Sequential Data, Longitudinal Data, Trajectory Data, Types of Variables, Directly Identifying Variables, Indirectly Identifying Variables (Quasi-Identifiers), Sensitive Variables, Key Takeaways, References)
Chapter 8: Defining Re-Identification Threats and Attacks (Types of Threats, Motives and Capacity of Adversaries, Why Adversaries Deliberately Attack, Capacity and Constraints on Adversaries, Background Knowledge as a Precondition for Re-Identification, Key Takeaways, References)
Chapter 9: Types of Disclosures (Identity Disclosure, The Case of William Weld, Re-Identification of the Chicago Homicide Victims Dataset, Risks From GPS Trajectory App Data, An Attack Based on Semantic Information, Other Commonly Cited Examples, Attribute and Membership Disclosure, Information Gain in Disclosure, Identifiability, De-Identification, and Legal Interpretations, Key Takeaways, References)
Chapter 10: FAQs About De-Identification (Is a Dataset De-Identified After Removing Direct Identifiers?, Can We Have Zero Re-Identification Risk?, Can Datasets De-Identified Now Be Re-Identified in the Future?, Is A Dataset Identifiable If A Person Can Find Their Own Record?, Are Incorrect Claims of Successful Re-Identification Considered in the Risk Assessment?, Is a Dataset De-Identified in One Scenario Also De-Identified in Another?, Is Risk Assessment for Public and Non-Public Use and Disclosure Scenarios the Same?, How Can De-Identified Data Comply to Requirements in Different Jurisdictions?, Should I Document the De-Identification Process for Each Dataset?, Is a Risk-Based Approach Better than a Simpler Safe Harbor Approach?, Are Machine Learning Models Free of Risk?, Key Takeaways, References)
Chapter 11: A Methodology for Managing Disclosure Risk – De-Identification (Disclosure Risk and Vulnerability, Risk Threshold for Effective Risk Management, Managing Disclosure Risk, Key Takeaways, References)

Section III: Measuring Disclosure Risks

Chapter 12: Risk and Decision Rules (Components of Risk Measurement, Decision Rules, Key Takeaways, References)
Chapter 13: Probability of Attack (Plausible Attacks, Non-Public Release Scenario, Internal Deliberate Attack, Internal Inadvertent Attack, Data Breach, Public Release Scenario, Key Takeaways, References)
Chapter 14: Modeling The Adversary (Determining Quasi-Identifiers, Correlated Quasi-Identifiers, Adversary Power, Defining The Target Population, Assuming Membership Knowledge, Key Takeaways, References)
Chapter 15: Measuring Identity Disclosure Vulnerability (Equivalence Classes, Direction of Attack, Attack-Based Vulnerability, With Membership Knowledge, Without Membership Knowledge, Uniqueness as Vulnerability, With Membership Knowledge, Without Membership Knowledge, Identity Disclosure in Synthetic Data, Key Takeaways, References)
Chapter 16: Derived Identity Disclosure Vulnerability Metrics (Derived Vulnerability Metrics, Maximum Vulnerability, Average Vulnerability, Proportion of Uniqueness, Strict-Average Vulnerability, Choice of Dataset-Level Metric, Key Takeaways, References)
Chapter 17: Identity Disclosure Risk Assessment, Thresholds and Decision Rules (Identity Disclosure Risk in a Public Release, With Membership Knowledge, Without Membership Knowledge, Identity Disclosure Risk in a Non-Public Release, With Membership Knowledge, Without Membership Knowledge, Identity Disclosure Decision Rules and Thresholds, Choosing Among Thresholds, Choosing The Threshold, Choosing The Threshold, Choosing The Threshold, Key Takeaways, References)
Chapter 18: Measuring Membership Disclosure Vulnerability (Current Regulatory Expectations and Best Practice, Membership Inference Without Access to The Dataset, Membership Disclosure Vulnerability, Adversarial Attack Models, The Partitioning Method, Choosing the Membership Disclosure Threshold, From Vulnerability to Risk, Key Takeaways, References)
Chapter 19: Measuring Attribute Disclosure Vulnerability (Current Regulatory Expectations and Best Practice, Disclosure-Independent Attribute Inference, Attribute Disclosure Vulnerability, Derived Metrics, Choosing the Attribute Disclosure Threshold, From Vulnerability to Risk, Key Takeaways, References)

Section IV: Practical Methods

Chapter 20: Pseudonymization Techniques (Variable Suppression, Randomization, Coding, Irreversible Coding, Reversible Coding, Hashing and Encryption, Tokenization, Other Techniques That Do Not Work Well, Constraining Names, Character Masking and Truncation, Encoding, Key Takeaways, References)
Chapter 21: Data Transformation Beyond Pseudonymization (Traditional Data Transformation, Generalization, Suppression, Noise Addition, Synthetic Data Generation, Key Takeaways, References)
Chapter 22: Algorithmic De-Identification – Tools and Case Study (Optimization Algorithms, Optimal Lattice Anonymization (OLA), Fast Local Cell Suppression, Available Software Tools, Case Study: De-Identification of BORN Data, Parameters for De-Identification, Deliberate Attack (T1), Inadvertent Attack (T2), Breach (T3), Summary, Key Takeaways, References)
Chapter 23: Practical Considerations for Risk Assessment (Using a Subsample for Risk Assessment, Knowing a Target's Membership in Cohorts of a Registry, Cohort Defined on Quasi-Identifiers Only, Cohort Defined on Non-Quasi-Identifiers Only, Cohort Defined on Non-Quasi-Identifiers and Quasi-Identifiers, Impact of Data Quality, The Granularity of Adversarial Knowledge, Estimating Equivalence Class Sizes, Limitations of Average Metrics for Cohort Construction, Key Takeaways, References)
Chapter 24: Strategies for Operationalizing De-Identification (Disclosed Files Should Be Samples, Disclosing Multiple Samples, Linking a DF to Other Datasets, Publicizing Re-Identification Risk Assessment, Creating Function De-Identified Data, De-Identification in the Context of a Data Warehouse, De-Identification at Scale, Key Takeaways, References)

Section V: End Matter

Chapter 25: Assessing the Motives and Capacity Construct (Dimensions, Assessment of Motives and Capacity, Motives to Re-Identify the Data, Capacity to Re-Identify the Data, References)
Chapter 26: Assessing the Mitigating Controls Construct (Who Can Do the Assessment?, What Should be Assessed?, Remediation, Practice, Implementation and Audits, Assessment of Mitigating Controls, Controlling Access, Disclosure, Retention and Disposition of the Data, Safeguarding the Data, Ensuring Accountability and Transparency in the Management of the Data, References)
Chapter 27: Assessing the Invasion of Privacy Construct (Assessment of the Invasion of Privacy, Sensitivity of the Data, Potential Harm to Data Subjects, References)
Chapter 28: How Many Friends Do We Have? (Dunbar's Number, Is Dunbar's Number Still Valid?, References)
Chapter 29: Further Identity Disclosure Vulnerability Metrics (Proportion of Highly Vulnerable Records, With Membership Knowledge, Without Membership Knowledge, The and Threshold, Uniqueness Metrics, Proportion of Sample Uniques That Are Population Uniques, Probabilistic Vulnerability of Sample Uniques, References)
Chapter 30: Precedents for Identity Disclosure Risk Thresholds (The Notion of Thresholds in Law, Translating Probability Words Into Quantitative Values, References)
Chapter 31: An Analysis of Historical Breach Notification Trends (Introduction, Methods, Definitions, Breach Lists, Sponsors of Lists, Data Quality, Breach Categories, Estimating the Number of Disclosed Breaches, Results, References)
Chapter 32: Who Else Could Be The Targets (Targets From a Superset of the Dataset, Targets With a Partial Overlap With the Dataset, Targets From the Same Population as The Dataset, References)

Author(s)

Biography

Dr. Lisa Pilgram is a Postdoctoral Fellow at the School of Epidemiology and Public Health at the University of Ottawa, Head of Operations at the Ottawa Medical AI Research Institute, and a Clinician Scientist at Charité – Universitätsmedizin Berlin. She conducts research within the Electronic Health Information Laboratory on de-identification methods, including synthetic data generation, and the application of machine learning and generative AI to health data.

Lisa received her MD in virology and immunobiology from the University of Wuerzburg, Germany, in 2020. As a researcher at the Goethe University in Frankfurt/Main, Germany, she contributed to establishing national and international collaborative research platforms during the COVID-19 pandemic. She subsequently trained clinically in the Department of Nephrology and Medical Intensive Care at Charité – Universitätsmedizin Berlin.

Her work centers on translating real-world healthcare data into actionable insights using AI-based methods to support clinical research and decision-making.

Dr. Khaled El Emam is the Canada Research Chair (Tier 1) in Medical AI at the University of Ottawa, where he is a Professor in the School of Epidemiology and Public Health and Director of Medical AI at the Faculty of Medicine. He is also a Senior Scientist at the Children’s Hospital of Eastern Ontario Research Institute and Director of the multi-disciplinary Electronic Health Information Laboratory, where he conducts research on privacy-enhancing technologies that enable the sharing of health data for secondary purposes, and on methodology and applied machine learning on health data. Khaled also recently completed a one-year term as Scholar-in-Residence at the Office of the Information and Privacy Commissioner of Ontario (IPC).

Khaled has founded or co-founded eight product and services companies involved with data management and data analytics, with some having successful exits. Prior to his academic roles, he was a Senior Research Officer at the National Research Council of Canada. He also served as the head of the Quantitative Methods Group at the Fraunhofer Institute in Kaiserslautern, Germany.

He serves on a number of committees, including the European Medicines Agency Technical Anonymization Group, the Panel on Research Ethics advising on the TCPS, and the Strategic Advisory Council of the Office of the Information and Privacy Commissioner of Ontario. He is also co-editor-in-chief of the JMIR AI journal.

In 2003 and 2004, he was ranked as the top systems and software engineering scholar worldwide by the Journal of Systems and Software based on his research on measurement and quality evaluation and improvement. He held the Canada Research Chair in Electronic Health Information at the University of Ottawa from 2005 to 2015. Khaled has a PhD from the Department of Electrical and Electronics Engineering, King’s College, at the University of London, England.
