Requirements and Mistakes: HIPAA De-Identification

Practical guidance for healthcare teams and business associates


HIPAA De-Identification Requirements

Many groups assume data is de-identified once the obvious items are removed. That does not meet the HIPAA standard. Deleting the patient's name is not enough, and neither is removing the medical record number.

Even removing a whole list of obvious identifiers can fall short if the remaining data can still be tied back to the person. HIPAA recognizes two de-identification methods, and both demand more care than most teams expect.

This is one of the easiest areas to get wrong while still believing you are being careful.

If your team uses data for:

  • Vendor troubleshooting
  • Analytics
  • Marketing
  • Model training

weak de-identification steps create false comfort. They give the feel of "anonymous" data that is not truly anonymous.

Why De-Identification Matters

Data that is truly de-identified is no longer PHI under HIPAA. That changes how you can use the data and lowers privacy risk. But the benefit applies only if the data is truly de-identified. Groups run into trouble for one of three reasons:

  • They remove too little
  • They overestimate how anonymous the data is
  • They apply de-identification practices inconsistently across teams and vendors

The result is avoidable risk, especially once data leaves its original setting. In practice, weak de-identification often goes hand in hand with weak minimum necessary controls and broad vendor access.

The Two HIPAA Methods

HIPAA allows two ways to de-identify data:

  • Safe Harbor
  • Expert Determination

They are different methods, and teams should stop treating them as interchangeable.

Method 1: Safe Harbor

Safe Harbor is the more checklist-based path. Under this method, the group removes specified categories of identifiers relating to the individual and to the individual's relatives, employers, and household members. The group must also have no actual knowledge that the remaining data could be used to identify the person. This is the method most people have heard of, and it is easier to explain. It is also the method most often oversimplified.

What Safe Harbor Requires

Safe Harbor calls for the removal of 18 types of data, such as:

  • names
  • geographic areas smaller than a state, with limited ZIP code exceptions
  • all date details tied to the person (except year in most cases)
  • phone numbers
  • email addresses
  • Social Security numbers
  • medical record numbers
  • account numbers
  • certificate and license numbers
  • vehicle IDs
  • device IDs and serial numbers
  • URLs
  • IP addresses
  • biometric IDs
  • full-face images
  • any other unique number, trait, or code that could ID a person

That last item matters. Teams often learn the list but miss the bigger point. Safe Harbor is not about checking off fields. It is about the data that remains. Can it still point back to a person in the real world?

The 18 Safe Harbor Identifiers: Full Reference Table

OCR’s guidance lays out all 18 types in detail. Each one has edge cases that trip teams up in practice.

| # | Identifier Category | What It Includes and Common Edge Cases |
|---|---|---|
| 1 | Names | First, last, middle, maiden, initials. Nicknames count if they can be linked. |
| 2 | Geographic data smaller than a state | Street address, city, county, ZIP code, geocodes. Three-digit ZIP codes may be kept in some cases (see the ZIP rule below). |
| 3 | Dates (except year) related to the individual | Admission date, discharge date, birth date, death date, surgery date. Ages over 89 must be grouped into a “90 or older” bucket. |
| 4 | Phone numbers | All phone numbers: cell, home, and work lines. |
| 5 | Fax numbers | Listed apart from phone numbers in the rule. |
| 6 | Email addresses | Personal and work email. A generic office email is not a patient identifier; a patient-linked one is. |
| 7 | Social Security numbers | Full or partial SSN. Even the last four digits must be removed. |
| 8 | Medical record numbers | EHR-assigned IDs, chart numbers, registration numbers. |
| 9 | Health plan beneficiary numbers | Insurance member IDs, Medicare/Medicaid member IDs. |
| 10 | Account numbers | Patient billing account numbers, financial account IDs. |
| 11 | Certificate and license numbers | Driver’s license, professional license, DEA number when tied to a patient. |
| 12 | Vehicle identifiers and serial numbers | VIN, license plate numbers, vehicle registration data. |
| 13 | Device identifiers and serial numbers | Medical device serial numbers, implant IDs, wearable device IDs. |
| 14 | Web URLs | Patient-linked web addresses, portal login URLs. |
| 15 | IP addresses | Patient device IP addresses found in logs or portal access records. |
| 16 | Biometric identifiers | Fingerprints, voiceprints, retinal scans, facial geometry data. |
| 17 | Full-face photographs and comparable images | Clinical photos, ID photos, images where the face is visible and can be linked to a person. |
| 18 | Any other unique identifying number, characteristic, or code | The catch-all. If a piece of data can single out one person in a dataset, it fits here even if not named above. |
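As a rough illustration of why a field checklist is only a starting point, the sketch below drops columns that map to listed categories and reports what it removed. The column names and the mapping are hypothetical; a real schema will differ and needs its own review.

```python
from typing import Any

# Hypothetical column names mapped to Safe Harbor categories (assumption:
# a real schema will use different names and more columns).
SAFE_HARBOR_COLUMNS: dict[str, str] = {
    "first_name": "names",
    "last_name": "names",
    "street_address": "geographic data smaller than a state",
    "city": "geographic data smaller than a state",
    "phone": "phone numbers",
    "fax": "fax numbers",
    "email": "email addresses",
    "ssn": "Social Security numbers",
    "mrn": "medical record numbers",
    "member_id": "health plan beneficiary numbers",
    "billing_account": "account numbers",
    "device_serial": "device identifiers and serial numbers",
    "portal_url": "web URLs",
    "ip_address": "IP addresses",
}


def strip_listed_columns(record: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Return a copy of the record without listed columns, plus the removed column names."""
    kept = {k: v for k, v in record.items() if k not in SAFE_HARBOR_COLUMNS}
    removed = sorted(k for k in record if k in SAFE_HARBOR_COLUMNS)
    return kept, removed


# Fabricated example record.
record = {
    "first_name": "Ana",
    "mrn": "A-1001",
    "state": "MT",
    "diagnosis": "J45",
    "notes": "Pt Ana called from 555-0100",
}
kept, removed = strip_listed_columns(record)
print(removed)  # columns that were dropped
print(kept)     # the free-text "notes" column survives and still names the patient
```

The surviving `notes` column is the point: dropping named columns does nothing about identifiers buried in free text or about the catch-all category, so the output is a candidate for review, not a finished Safe Harbor dataset.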
Three-Digit ZIP Code Rule: When Geography Can Stay

ZIP codes are not always off-limits under Safe Harbor. The rule depends on how many people live in the area, based on current Census Bureau data:

  • If all ZIP codes sharing the same first three digits cover a geographic area containing more than 20,000 people, the three-digit prefix may remain in the dataset.
  • If the three-digit ZIP area covers 20,000 or fewer people, the prefix must be changed to 000.

This matters most in rural areas. A three-digit ZIP prefix that covers a small rural county may need to be zeroed out. Urban ZIP prefixes that cover dense metro areas often pass Safe Harbor. Teams working with any geographic data below the state level should run the population check before calling the data clean.
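The population check can be sketched in a few lines. This is a minimal example, assuming you have already built a lookup of population per three-digit prefix from Census data; the figures in `PREFIX_POPULATION` below are made up for illustration.

```python
POPULATION_THRESHOLD = 20_000

# Hypothetical populations per three-digit ZIP prefix (fabricated numbers;
# a real table would come from current Census Bureau data).
PREFIX_POPULATION = {"100": 1_500_000, "594": 18_000, "123": 20_000}


def generalize_zip(zip_code: str, prefix_population: dict[str, int]) -> str:
    """Return the three-digit prefix if its area has more than 20,000 people, else '000'."""
    prefix = zip_code.strip()[:3]
    if len(prefix) < 3 or not prefix.isdigit():
        return "000"  # malformed input: fail closed
    # Unknown prefixes are treated as too small rather than assumed safe.
    population = prefix_population.get(prefix, 0)
    return prefix if population > POPULATION_THRESHOLD else "000"


print(generalize_zip("10027", PREFIX_POPULATION))  # large area: prefix kept
print(generalize_zip("59401", PREFIX_POPULATION))  # small area: zeroed out
```

Treating unknown or malformed values as "000" is a design choice in this sketch: when the population cannot be confirmed, the safer default is to drop the geography.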

Date Handling: Ages, Years, and the 90-Plus Rule

Dates are the most often misread Safe Harbor item. The rule allows keeping the year alone, not the month or day. All other date elements tied to the person must go.

The second trap is age. Safe Harbor allows age as a value, with one hard cutoff: anyone aged 90 or older must have their age listed as “90 or older.” The same applies to any date element, including a year, that reveals such an age, such as a birth year. The reason is simple. Very old patients in small populations are easy to re-identify from age alone. A 97-year-old patient in a rural area who had a rare treatment can be picked out with little or no other data.

In practice, any dataset that keeps exact ages above 89 does not pass Safe Harbor, no matter what other steps were taken.
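A minimal sketch of the two date rules, assuming a record holds a birth date and an admission date. The function and field names are illustrative, not taken from any particular system.

```python
from datetime import date


def age_on(birth: date, event: date) -> int:
    """Whole years between birth and the event date."""
    years = event.year - birth.year
    if (event.month, event.day) < (birth.month, birth.day):
        years -= 1  # birthday has not happened yet in the event year
    return years


def top_code_age(age: int) -> str:
    """Collapse every age of 90 or more into a single category."""
    return "90 or older" if age >= 90 else str(age)


def safe_harbor_view(birth: date, admission: date) -> dict:
    """Keep only the admission year and a top-coded age; the birth date itself is dropped."""
    return {
        "admission_year": admission.year,
        "age": top_code_age(age_on(birth, admission)),
    }


print(safe_harbor_view(date(1930, 5, 10), date(2024, 3, 2)))
print(safe_harbor_view(date(1980, 5, 10), date(2024, 3, 2)))
```

Note that the sketch never emits the birth year. For a patient in the 90-plus group, a birth year next to an admission year would reveal the exact age again.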

Dates and Geography Cause More Problems Than People Expect

Staff often assume data is de-identified once names are gone. The problem is that the dataset still holds:

  • exact admission dates
  • discharge dates
  • surgery dates
  • city-level location data
  • combinations of age, rare diagnosis, and event timing

Those details can point to a person quickly, especially in small towns or for unusual clinical events. A dataset that shows a 92-year-old patient in one small town who had a rare event on a specific date is easy to re-identify even without a name attached.

Method 2: Expert Determination

Expert Determination is more flexible, but also harder. Under this method, a qualified expert applies statistical and scientific methods to show that the risk of re-identification is very small. This path works when a team needs to keep more data than Safe Harbor allows. For example:

  • research-support datasets
  • daily analytics
  • product or model development
  • quality projects that need more time-based or location detail

But Expert Determination is not “our data scientist looked at it” or “IT said it seems fine.” It needs a rigorous expert process and clear records.

Who Qualifies as an Expert?

OCR does not certify experts for this method. The rule calls for a person with appropriate knowledge of and experience with generally accepted statistical and scientific methods for rendering information not individually identifiable. In practice this means:

  • A statistician with privacy or health data experience
  • A biostatistician who knows the re-identification research
  • A data scientist with published work in health data masking
  • An academic or consultant who can write up their method and stand behind it in an audit

An internal IT analyst or compliance officer does not qualify unless they can show that background. The expert designation must be defensible: if OCR asked, you could produce credentials, methods, and a signed report.

What the Expert’s Report Must Contain

There is no required template, but a solid Expert Determination report will cover:

  • The expert’s qualifications and relevant background
  • What data was reviewed and how it will be used
  • The statistical methods used to assess re-identification risk
  • The risk threshold applied and why (HIPAA requires “very small” risk but sets no number, and OCR guidance does not name one; some practitioners use benchmarks such as 0.09, roughly a 1-in-11 chance of re-identification)
  • Any residual risks found and the reason for accepting them
  • A signed statement that re-identification risk is very small

Without a written report, Expert Determination is just an informal opinion. OCR expects records that can hold up to review after the fact.

Statistical Methods Used in Expert Determination

Common approaches include:

  • K-anonymity: Each record is indistinguishable from at least k-1 other records on a set of quasi-identifiers. A dataset with k=5 means each person shares all quasi-identifier values with at least four others.
  • L-diversity and T-closeness: These build on k-anonymity. They control how sensitive values are distributed within each group, which limits attribute inference even when a person’s identity is hidden.
  • Differential privacy: Adds calibrated noise to query results so that the presence of any single record cannot be inferred from the output. Major health analytics tools use this method. OCR has begun to cite it in informal guidance.
  • Risk-based modeling: Estimates the real-world chance of re-identification using population-level data, how unique records are, and known outside data sources (e.g., voter rolls, social media).

The choice of method depends on the data type, its planned use, and the downstream risk. A report that explains why a method was chosen and how it was applied is far stronger than one that just states a result.
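To make the first two measures concrete, here is a small sketch that computes k (the smallest group of records sharing the same quasi-identifier values) and l (the smallest number of distinct sensitive values within any group). The rows and column names are fabricated, and this is an illustration of the definitions, not a substitute for an expert's analysis.

```python
from collections import defaultdict


def equivalence_classes(rows: list[dict], quasi_identifiers: list[str]) -> dict:
    """Group rows that share the same values on every quasi-identifier."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].append(row)
    return classes


def k_anonymity(rows: list[dict], quasi_identifiers: list[str]) -> int:
    """Size of the smallest equivalence class."""
    if not rows:
        raise ValueError("dataset is empty")
    classes = equivalence_classes(rows, quasi_identifiers)
    return min(len(members) for members in classes.values())


def l_diversity(rows: list[dict], quasi_identifiers: list[str], sensitive: str) -> int:
    """Smallest count of distinct sensitive values within any equivalence class."""
    if not rows:
        raise ValueError("dataset is empty")
    classes = equivalence_classes(rows, quasi_identifiers)
    return min(len({m[sensitive] for m in members}) for members in classes.values())


# Fabricated rows: age band and ZIP prefix are the quasi-identifiers.
rows = [
    {"age_band": "60-69", "zip3": "100", "dx": "asthma"},
    {"age_band": "60-69", "zip3": "100", "dx": "diabetes"},
    {"age_band": "60-69", "zip3": "100", "dx": "asthma"},
    {"age_band": "70-79", "zip3": "100", "dx": "copd"},
    {"age_band": "70-79", "zip3": "100", "dx": "copd"},
]
qi = ["age_band", "zip3"]
print(k_anonymity(rows, qi))        # smallest group has 2 records
print(l_diversity(rows, qi, "dx"))  # one group has a single diagnosis
```

In this toy data the 70-79 group satisfies k=2 but has only one diagnosis, so anyone known to be in that group has their diagnosis revealed. That is the gap l-diversity is meant to expose.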

When Expert Determination Is Better Than Safe Harbor

Expert Determination is the right choice when the data loses key research or operational value under Safe Harbor’s strict removal rules. Common cases:

  • Research datasets that need monthly admission trends, not just year-only dates
  • Health analytics that need location detail finer than Safe Harbor’s three-digit ZIP rule allows
  • AI and machine learning training sets where time or location detail makes the model more accurate
  • Quality projects that need links across data points that Safe Harbor would strip

The tradeoff is cost and time. Expert Determination means hiring a qualified expert, paying for their review, and keeping their report as a compliance record. For datasets that pass Safe Harbor cleanly, that extra work is not worth it.

Safe Harbor vs. Expert Determination

The practical tradeoff is simple:

  • Safe Harbor is more rigid but easier to explain
  • Expert Determination is more flexible but requires stronger expertise and records

Groups often choose Safe Harbor when they want a more standard compliance path. They choose Expert Determination when data loses too much value under Safe Harbor. The mistake is claiming Expert Determination on the strength of a rough internal review.

The Biggest De-Identification Mistakes

These are the failures that show up most often:

1. Removing Names and Stopping There

This is the classic error. Teams strip the obvious direct identifiers and think the dataset is now anonymous. It often is not.

2. Ignoring Combinations of Data Points

Single fields may seem harmless, but combinations create risk. Examples:

  • rare condition plus exact date
  • age over 89 plus location
  • very specific service line plus event timing
  • small staff or patient group plus internal operational data

3. Reusing Internal Data for New Purposes

Data that was fine inside one workflow may not be de-identified for other uses like:

  • External sharing
  • Vendor use
  • Marketing analysis
  • Testing

Context changes the risk.

4. Sending Live PHI to Vendors for Troubleshooting

This happens all the time. Teams send screenshots, exports, or logs. They use real data with vendors to skip the hassle of preparing sample data. That is not a de-identification plan. It is shortcut culture.
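One way to reduce what reaches a vendor is to mask obvious patterns in log text before sharing it. The sketch below is a deliberately simple example; the patterns and placeholder labels are assumptions, and pattern matching alone does not de-identify anything (it misses names, dates, and free-text clues).

```python
import re

# Illustrative patterns only. Real logs need patterns tuned to their own formats.
PATTERNS = [
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")),
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[MRN]", re.compile(r"\bMRN[:#]?\s*\d+\b", re.IGNORECASE)),
]


def mask_text(text: str) -> str:
    """Replace each matched pattern with its placeholder label."""
    for label, pattern in PATTERNS:
        text = pattern.sub(label, text)
    return text


# Fabricated log line.
line = "Patient MRN: 448812 called from 406-555-0142, email jo@example.org, SSN 123-45-6789"
print(mask_text(line))
```

Masking like this is a floor, not a ceiling. For routine troubleshooting, a small synthetic dataset that never contained real patient data avoids the question entirely.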

5. No Documentation of the Method Used

If someone asks how the data was de-identified, there should be a clear answer:

  • which method was used
  • which identifiers were removed
  • who approved the process
  • what residual risk was weighed

If nobody can answer those questions, the process is weak.

"Actual Knowledge" Still Matters

Safe Harbor is not just about deleting items. It also requires that the group have no actual knowledge that the remaining information could identify a person. You cannot strip the listed data and then ignore what your team already knows. If everyone on the team can tell who the patient is from the remaining dataset, the data is not de-identified.

De-Identification vs. Limited Data Sets

This is another common point of confusion.

A limited data set is not the same as de-identified data. A limited data set may still hold some dates and limited location details, and it requires a data use agreement. Teams sometimes apply limited-data-set logic while calling the result de-identified. That is the wrong label, and it matters. If the data is only partly stripped and still depends on use limits, you may not be in de-identified territory at all.

What a Limited Data Set May Contain

A limited data set strips direct identifiers but keeps data that Safe Harbor deliberately removes. It may retain:

  • Town or city, state, and five-digit ZIP code (Safe Harbor removes everything below the state level apart from qualifying three-digit ZIP prefixes)
  • Date details, such as admission, discharge, and service dates, that Safe Harbor limits to year only
  • Ages, including ages over 89 that Safe Harbor requires grouping

What it must remove: names, postal address (street), phone numbers, fax numbers, email addresses, Social Security numbers, medical record numbers, health plan beneficiary numbers, account numbers, certificate and license numbers, vehicle IDs, device IDs, web URLs, IP addresses, biometric identifiers, and full-face photos.

Data Use Agreement Requirements

A limited data set can only be shared with a recipient who has signed a Data Use Agreement (DUA). The DUA must state that the recipient will:

  • Use or share the limited data set only for the purposes named in the agreement
  • Not try to re-identify or contact the people in the data
  • Use appropriate safeguards to prevent uses or sharing not allowed by the agreement
  • Report any use or sharing that breaks the agreement
  • Make sure any agents or subcontractors handling the data agree to the same rules

A DUA is not the same as a Business Associate Agreement, though some groups combine them. The DUA is specific to limited data sets. It governs how the receiving party handles data that is not fully de-identified.
| Feature | De-Identified Data (Safe Harbor) | Limited Data Set |
|---|---|---|
| PHI status under HIPAA | Not PHI; HIPAA rules do not apply | Still PHI; HIPAA rules apply |
| Agreement required | None | Data Use Agreement (DUA) |
| Dates retained | Year only | Full dates permissible |
| Geographic detail | State-level or limited 3-digit ZIP | City, ZIP, county permissible |
| Ages over 89 | Must be grouped as “90 or older” | Exact age permissible |
| Permitted uses | No HIPAA limits; no longer PHI | Research, public health, healthcare operations only |
| Re-identification risk | Reduced by removing all 18 identifier types, with no actual knowledge that the remainder identifies anyone | Handled by DUA limits, not removed |

The bottom line: if the data must keep dates and city-level details for research or analytics, a limited data set with a proper DUA is often the right path. If the planned use falls outside research, public health, or healthcare operations (such as general business analytics or public release), full de-identification is needed.
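The labeling problem can be sketched as a simple triage over which kinds of fields a dataset still contains. The category labels below are hypothetical shorthand for the rows discussed above; the result is only a first-pass label, since it cannot check the catch-all identifier category or what your team actually knows.

```python
# Hypothetical category labels for fields still present in a dataset.
DIRECT_IDENTIFIERS = {
    "name", "street_address", "phone", "fax", "email", "ssn", "mrn",
    "health_plan_id", "account_number", "license_number", "vehicle_id",
    "device_id", "url", "ip_address", "biometric", "full_face_photo",
}
# Fields a limited data set may keep but Safe Harbor does not allow.
LIMITED_DATA_SET_EXTRAS = {"full_dates", "city", "county", "zip5", "exact_age_over_89"}


def candidate_path(categories_present: set[str]) -> str:
    """Give a first-pass label based only on which field categories remain."""
    if categories_present & DIRECT_IDENTIFIERS:
        return "identifiable PHI"
    if categories_present & LIMITED_DATA_SET_EXTRAS:
        return "limited data set candidate (still PHI, needs a DUA)"
    return "Safe Harbor candidate (still needs catch-all and actual-knowledge review)"


print(candidate_path({"mrn", "full_dates", "diagnosis"}))
print(candidate_path({"full_dates", "zip5", "diagnosis"}))
print(candidate_path({"year", "state", "diagnosis"}))
```

The middle case is the one teams mislabel most often: no direct identifiers remain, yet full dates and five-digit ZIP codes mean the data is a limited data set at best.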

Common De-Identification Use Cases

Knowing where de-identification is used in practice helps teams set up the right steps before data leaves a controlled setting.

Clinical Research and IRB Studies

Academic medical centers and health systems often de-identify patient records for research studies. IRBs often accept Safe Harbor as enough to waive consent, but they need proof of the method used. Expert Determination is used when the study needs date or location detail that Safe Harbor strips.

Population Health Analytics

Health systems and payers use de-identified claims and clinical data to track disease rates, find care gaps, and model risk at the local level. Safe Harbor often works here. The exception is when the analysis covers small areas or rare conditions. Those cases may need Expert Determination to avoid re-identification through a combination of small groups and retained data points.

AI and Machine Learning Training Data

This is where teams most often miss the re-identification risk. Training a clinical model on de-identified patient data sounds simple. But large language models and neural nets can memorize patterns that allow people to be re-identified from model outputs. This risk goes beyond whether the input data was cleaned. If you use patient data for AI work, treat model outputs as a possible re-identification path, not just the training set.

Healthcare Marketing and Analytics

Using de-identified patient data for marketing analytics is allowed under HIPAA, but only if the data is truly de-identified and not just stripped of names. Ad platforms that receive patient data for audience modeling are a common source of risk, especially when teams confuse limited data sets with de-identified data. The minimum necessary rule applies to what data reaches marketing workflows in the first place.

Re-Identification Risk: What “Very Small” Actually Means

HIPAA’s Expert Determination standard says re-identification risk must be “very small.” The rule does not set a number, and OCR’s guidance does not name one either; the expert is expected to justify a threshold for the specific dataset and recipient. One benchmark that appears in practice is a re-identification probability below roughly 0.09, or about a 1-in-11 chance that a given record could be correctly re-identified using realistic methods.

The threshold matters because re-identification attacks have become much easier. Research has shown that:

  • An estimated 87% of Americans can be singled out using only five-digit ZIP code, full birth date, and sex, three fields many teams see as harmless
  • Combining a rare diagnosis with age, location, and approximate event date can single out a patient in a small population with no other data
  • Public data sources such as voter rolls, social media, and property records can be joined to “anonymous” datasets to re-identify people at scale

Under Safe Harbor, you manage re-identification risk by removing all 18 types and confirming no one has actual knowledge that the remaining data points to a person. Under Expert Determination, it takes formal analysis and records. Either way, the standard is not an aspiration. It is a requirement that can be tested and enforced.

Operational Uses That Deserve Review

You should review de-identification steps if your group uses data for:

  • vendor troubleshooting
  • software testing
  • marketing analytics
  • AI or model training
  • internal dashboards
  • quality reporting
  • case studies
  • public presentations

These are the settings where teams often move fast. They assume the data is harmless because it has no patient names. That assumption is what creates later exposure with vendors, contractors, and internal teams.

A Practical De-Identification Review Checklist

  • Which method are we using: Safe Harbor or Expert Determination?
  • Have all required data types been removed if we claim Safe Harbor?
  • Does the remaining dataset still create clear re-identification risk in context?
  • Are dates, location, age, and rarity creating a combined exposure?
  • Are vendors receiving data that should be masked or de-identified first?
  • Is the method documented well enough to explain later?

If those questions produce vague answers, the process needs work.

Final Takeaway

HIPAA de-identification rules are not met by surface-level edits to a dataset. The real question is not whether the most obvious identifiers are gone. The question is whether the remaining information can be connected back to a person with reasonable effort.

That is why disciplined de-identification matters:

  • choose the right method
  • document the process
  • think about combinations, not just fields
  • do not use live PHI when masked or synthetic data would work

Groups that get this right reduce risk without pretending data is safer than it is.

If your current process depends on:

  • Ad hoc redaction
  • Screenshots
  • Loose judgment calls

review it before it turns into a privacy problem.


Documentation Requirements for Both Methods

HIPAA does not have a single template for de-identification records. But both methods need files that hold up in an OCR audit or legal challenge. At minimum, a group should be able to produce:

  • Method chosen: Safe Harbor or Expert Determination, and the reason for the choice
  • Dataset description: What data was processed, from what source, and for what intended use
  • Items removed (Safe Harbor): A record showing all 18 types were reviewed and handled, including how edge cases (ZIP codes, ages over 89, dates) were treated
  • Expert’s report (Expert Determination): A signed report from a qualified expert stating that re-identification risk is very small, with the method described
  • Approval chain: Who approved the de-identification process and who reviewed the output before it was shared
  • Ongoing review: Whether the de-identification process gets reviewed again as the data setting or use case changes

Teams that treat de-identification as a one-time technical step rather than a documented process create a gap that shows up under audit. Good records also protect you if a downstream party misuses data. Proof that the method was applied correctly helps place responsibility where it belongs.
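One possible way to keep these records consistent is a small structured record with a self-check for obvious gaps. The field names and checks below are assumptions made for illustration; HIPAA does not prescribe this format.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DeidentificationRecord:
    """Illustrative record of one de-identification run (not a mandated format)."""
    method: str                      # "safe_harbor" or "expert_determination"
    rationale: str
    dataset_description: str
    intended_use: str
    approved_by: str
    reviewed_by: str
    completed_on: date
    identifier_types_reviewed: list[str] = field(default_factory=list)
    edge_case_notes: str = ""        # ZIP codes, ages over 89, dates
    expert_report_ref: Optional[str] = None
    next_review_on: Optional[date] = None

    def gaps(self) -> list[str]:
        """List the obvious holes an auditor would ask about."""
        problems = []
        if self.method not in {"safe_harbor", "expert_determination"}:
            problems.append("method not recognized")
        if self.method == "safe_harbor" and len(set(self.identifier_types_reviewed)) < 18:
            problems.append("fewer than 18 identifier types recorded as reviewed")
        if self.method == "expert_determination" and not self.expert_report_ref:
            problems.append("no signed expert report on file")
        if self.next_review_on is None:
            problems.append("no follow-up review scheduled")
        return problems


record = DeidentificationRecord(
    method="expert_determination",
    rationale="Monthly admission trends are needed",
    dataset_description="Fabricated example: inpatient encounters",
    intended_use="Quality analytics",
    approved_by="Privacy officer",
    reviewed_by="Data steward",
    completed_on=date(2024, 1, 15),
)
print(record.gaps())
```

A record like this does not make a dataset compliant. It only makes missing pieces visible before someone outside the team asks for them.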


Frequently Asked Questions

What are the 18 identifiers that must be removed for HIPAA Safe Harbor de-identification?

The 18 identifiers are: (1) names, (2) geographic data smaller than state level, (3) dates except year related to the individual, (4) phone numbers, (5) fax numbers, (6) email addresses, (7) Social Security numbers, (8) medical record numbers, (9) health plan beneficiary numbers, (10) account numbers, (11) certificate and license numbers, (12) vehicle identifiers, (13) device identifiers, (14) web URLs, (15) IP addresses, (16) biometric identifiers, (17) full-face photographs, and (18) any other unique identifying number, characteristic, or code. Ages over 89 must also be aggregated into a “90 or older” category.

What is the difference between Safe Harbor and Expert Determination under HIPAA?

Safe Harbor means removing all 18 listed identifiers, a checklist any compliance team can follow, and having no actual knowledge that the remaining data could identify someone. Expert Determination lets you keep more detailed data such as monthly dates or sub-state geography, but a qualified statistician or privacy expert must formally determine that re-identification risk is “very small.” Expert Determination is more flexible but costs more and needs stronger records.

Can ZIP codes be included in de-identified data?

It depends on population size. Under Safe Harbor, you may retain the first three digits of a ZIP code only if the geographic area formed by all ZIP codes sharing those three digits contains more than 20,000 people. If that area covers 20,000 or fewer people, the ZIP must be recoded as 000. Rural ZIP prefixes often fail this threshold.

Does de-identified data still fall under HIPAA?

No. Once data is properly de-identified using either Safe Harbor or Expert Determination, it is no longer PHI and is not subject to HIPAA’s Privacy Rule. This is the core value of de-identification: it allows data to be used for research, analytics, or marketing without HIPAA restrictions. A limited data set, by contrast, remains PHI and still requires a Data Use Agreement.

What is a limited data set under HIPAA?

A limited data set is a middle-ground option. It strips direct identifiers like name, street address, and SSN, but may keep geographic data at the city and ZIP level plus full dates. It can be shared for research, public health, or healthcare operations under a Data Use Agreement, without full de-identification. It is still PHI; HIPAA still applies.

What are the two HIPAA de-identification methods?

They are Safe Harbor and Expert Determination. Safe Harbor removes specific identifiers from a defined list of 18 categories. Expert Determination relies on a qualified expert applying statistical principles to determine that re-identification risk is very small.

Is deleting the patient's name enough to de-identify data?

No. Removing a name alone does not de-identify data under HIPAA. If the remaining data, such as dates, ZIP codes, age, or rare diagnoses, can still identify a person through combination or context, the data is not de-identified.

What is the difference between de-identified data and a limited data set?

A limited data set is not de-identified. It can still contain certain dates and limited geographic details such as city and ZIP code. It also requires a Data Use Agreement and remains subject to HIPAA. Fully de-identified data carries none of those restrictions.

Can I send partially redacted patient data to a vendor for troubleshooting?

Not safely by default. If the data is still identifiable, you may still be sending PHI, which requires a Business Associate Agreement and appropriate safeguards. The better approach is masking, de-identifying, or using a small synthetic dataset that contains no real patient data.
