HIPAA De-Identification: Requirements and Mistakes

Practical guidance for healthcare teams and business associates

HIPAA De-Identification Requirements

Many groups assume data is de-identified once the obvious items are gone. That does not meet the HIPAA standard. Deleting the patient's name is not enough, and neither is removing the medical record number.

Even removing a whole list of obvious items can fall short if the remaining data can still be tied back to a person. HIPAA recognizes two de-identification methods, and both need more care than most teams expect.

This is one of the easiest areas to get wrong while still believing you are being careful.

If your team uses data for:

- Vendor troubleshooting
- Analytics
- Marketing
- Model training

Weak de-identification steps create a false comfort. They give the feel of "anonymous" data that is not truly anonymous.

Why De-Identification Matters

Data that is truly de-identified is no longer PHI under HIPAA. That changes how you can use the data and lowers privacy risk. But the benefit only applies if the data really meets the standard. Groups run into trouble for one of three reasons:

- They remove too little
- They overestimate how anonymous the data is
- They use de-identified data practices inconsistently across teams and vendors

The result is avoidable risk, especially once data leaves its original setting. In practice, weak de-identification often goes hand in hand with weak minimum necessary controls and broad vendor access.

The Two HIPAA Methods

HIPAA allows two ways to de-identify data:

- Safe Harbor
- Expert Determination

They are different methods. Teams should stop treating them as the same thing.

Method 1: Safe Harbor

Safe Harbor is the more checklist-based path. Under this method, the group removes specified categories of identifiers. These cover identifiers of the person and of the person's relatives, employers, and household members. The group must also have no actual knowledge that the remaining data could be used to identify the person. Most people have heard of this method because it is easier to explain. It is also the method most often oversimplified.

What Safe Harbor Requires

Safe Harbor calls for the removal of 18 types of data, such as:

- names
- geographic areas smaller than a state, with limited ZIP code exceptions
- all date details tied to the person (except year in most cases)
- phone numbers
- email addresses
- Social Security numbers
- medical record numbers
- account numbers
- certificate and license numbers
- vehicle IDs
- device IDs and serial numbers
- URLs
- IP addresses
- biometric IDs
- full-face images
- any other unique number, trait, or code that could ID a person

That last item matters. Teams often memorize the list but miss the bigger point. Safe Harbor is not about checking off fields. It is about the data that remains: can it still point back to a person in the real world?

The 18 Safe Harbor Identifiers: Full Reference Table

OCR’s guidance lays out all 18 types in detail. Each one has edge cases that trip teams up in practice.

| # | Identifier Category | What It Includes and Common Edge Cases |
|---|---|---|
| 1 | Names | First, last, middle, maiden, initials. Nicknames count if they can be linked. |
| 2 | Geographic data smaller than a state | Street address, city, county, ZIP code, geocodes. Three-digit ZIP codes may be kept in some cases (see the ZIP rule below). |
| 3 | Dates (except year) related to the individual | Admission date, discharge date, birth date, death date, surgery date. Ages over 89 must be grouped into a “90 or older” bucket. |
| 4 | Phone numbers | All phone numbers: cell, home, and work lines. |
| 5 | Fax numbers | Listed apart from phone numbers in the rule. |
| 6 | Email addresses | Personal and work email. A generic office email is not a patient ID item; a patient-linked one is. |
| 7 | Social Security numbers | Full or partial SSN. Even the last four digits must be removed. |
| 8 | Medical record numbers | EHR-assigned IDs, chart numbers, registration numbers. |
| 9 | Health plan beneficiary numbers | Insurance member IDs, Medicare/Medicaid member IDs. |
| 10 | Account numbers | Patient billing account numbers, financial account IDs. |
| 11 | Certificate and license numbers | Driver’s license, professional license, DEA number when tied to a patient. |
| 12 | Vehicle identifiers and serial numbers | VIN, license plate numbers, vehicle registration data. |
| 13 | Device identifiers and serial numbers | Medical device serial numbers, implant IDs, wearable device IDs. |
| 14 | Web URLs | Patient-linked web addresses, portal login URLs. |
| 15 | IP addresses | Patient device IP addresses found in logs or portal access records. |
| 16 | Biometric identifiers | Fingerprints, voiceprints, retinal scans, facial geometry data. |
| 17 | Full-face photographs and comparable images | Clinical photos, ID photos, images where the face is visible and can be linked to a person. |
| 18 | Any other unique identifying number, characteristic, or code | The catch-all. If a piece of data can single out one person in a dataset, it fits here even if not named above. |

Three-Digit ZIP Code Rule: When Geography Can Stay

ZIP codes are not always off-limits under Safe Harbor. The rule turns on how many people live in the area, measured with current publicly available Census Bureau data:

- If all ZIP codes sharing the same first three digits together cover an area with more than 20,000 people, the three-digit prefix may remain in the dataset.
- If that combined area has 20,000 or fewer people, the prefix must be changed to 000.

This matters most in rural areas. A three-digit ZIP prefix that covers a small rural county may need to be zeroed out. Urban ZIP prefixes that cover dense metro areas often pass Safe Harbor. Teams working with any geographic data below the state level should run the population check before calling the data clean.
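
The population test can be sketched in a few lines of Python. This is a minimal illustration, not a compliance tool: the function name `generalize_zip` and the `zip3_population` lookup are hypothetical, and the population figures are made up for the example. Real counts would have to come from current Census Bureau data.

```python
from typing import Mapping

# Safe Harbor keeps a 3-digit prefix only if its area has MORE than this many people.
SAFE_HARBOR_MIN_POPULATION = 20_000


def generalize_zip(zip_code: str, zip3_population: Mapping[str, int]) -> str:
    """Return the 3-digit ZIP prefix if it passes the population test, else "000"."""
    prefix = zip_code.strip()[:3]
    if len(prefix) < 3 or not (prefix.isascii() and prefix.isdigit()):
        return "000"  # assumption: unparseable values are suppressed
    # Assumption: a prefix missing from the lookup is treated as too small.
    population = zip3_population.get(prefix, 0)
    return prefix if population > SAFE_HARBOR_MIN_POPULATION else "000"


# Made-up population figures, for illustration only.
example_population = {"123": 250_000, "456": 12_000, "789": 20_000}

print(generalize_zip("12345", example_population))       # 123
print(generalize_zip("45678-0001", example_population))  # 000
print(generalize_zip("78901", example_population))       # 000 (exactly 20,000 fails)
```

Defaulting unknown or malformed values to "000" is a design choice in this sketch: when the population cannot be confirmed, it errs toward suppression.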

Date Handling: Ages, Years, and the 90-Plus Rule

Dates are the most often misread Safe Harbor item. The rule allows keeping the year alone, not the month and not the day. All other date details tied to the person must go.

The second trap is age. Safe Harbor allows age as a value, with one hard cutoff: anyone aged 90 or older must have their age listed as “90 or older.” Any date element that reveals such an age, including a birth year, has to be folded into the same category. The reason is simple. Very old patients in small groups are easy to re-identify from age alone. A 97-year-old patient in a rural area who had a rare treatment can be picked out with no other data point.

In practice, any dataset that keeps exact ages above 89 does not pass Safe Harbor, no matter what other steps were taken.
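
A short Python sketch of the date and age handling, under stated assumptions: the function names are hypothetical, ages are computed in whole years as of a chosen reference date, and a birth year is suppressed whenever it would reveal an age of 90 or older. It illustrates the rule rather than implementing a full pipeline.

```python
from datetime import date
from typing import Optional

TOP_CODE_AGE = 90
TOP_CODE_LABEL = "90 or older"


def age_on(birth: date, as_of: date) -> int:
    """Whole years between birth and the reference date."""
    had_birthday = (as_of.month, as_of.day) >= (birth.month, birth.day)
    return as_of.year - birth.year - (0 if had_birthday else 1)


def safe_harbor_age(age: int) -> str:
    """Exact age below 90, a single bucket at 90 and above."""
    return TOP_CODE_LABEL if age >= TOP_CODE_AGE else str(age)


def safe_harbor_event_year(event: date) -> int:
    """Keep only the year of an admission, discharge, or service date."""
    return event.year


def safe_harbor_birth_year(birth: date, as_of: date) -> Optional[int]:
    """Birth year, or None when it would reveal an age of 90 or older."""
    return None if age_on(birth, as_of) >= TOP_CODE_AGE else birth.year


# Fictional patient, for illustration only.
birth, today = date(1931, 3, 10), date(2024, 3, 9)
print(safe_harbor_age(age_on(birth, today)))       # 90 or older
print(safe_harbor_birth_year(birth, today))        # None
print(safe_harbor_event_year(date(2024, 2, 14)))   # 2024
```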

Dates and Geography Cause More Problems Than People Expect

Staff often think data is de-identified once names are gone. The problem is that the dataset still holds:

- exact admission dates
- discharge dates
- surgery dates
- city-level location data
- combinations of age, rare diagnosis, and event timing

Those details can point to a person fast, especially in small towns or for unusual clinical events. A dataset that shows:

- a 92-year-old patient
- in one small town
- who had a rare event
- on a specific date

is easy to re-identify even without a name attached.
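
One way to surface combinations like this is to count how many records share each combination of values. The Python sketch below does that for a toy dataset. The field names, the records, and the `min_group_size` cutoff are all assumptions chosen for illustration; HIPAA does not prescribe a group-size number.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Sequence, Tuple

Combo = Tuple[object, ...]


def group_sizes(records: Iterable[Mapping[str, object]],
                fields: Sequence[str]) -> Counter:
    """Count records sharing each combination of values in the given fields."""
    return Counter(tuple(record[f] for f in fields) for record in records)


def risky_combinations(records: Iterable[Mapping[str, object]],
                       fields: Sequence[str],
                       min_group_size: int = 5) -> Dict[Combo, int]:
    """Return combinations shared by fewer than min_group_size records."""
    sizes = group_sizes(records, fields)
    return {combo: n for combo, n in sizes.items() if n < min_group_size}


# Fictional records, for illustration only.
records = [
    {"age_band": "90 or older", "town": "Smallville", "event": "rare", "year": 2023},
    {"age_band": "60-69", "town": "Metro City", "event": "common", "year": 2023},
    {"age_band": "60-69", "town": "Metro City", "event": "common", "year": 2023},
    {"age_band": "60-69", "town": "Metro City", "event": "common", "year": 2023},
]
fields = ["age_band", "town", "event", "year"]

print(risky_combinations(records, fields, min_group_size=2))
# {('90 or older', 'Smallville', 'rare', 2023): 1}
print(min(group_sizes(records, fields).values()))  # smallest group size: 1
```

A group of size 1 means one record is unique on those fields, even though no name appears anywhere in the data.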

Method 2: Expert Determination

Expert Determination is more flexible, but also harder. Under this method, a qualified expert applies generally accepted statistical and scientific methods and determines that the risk of re-identification is very small. This path works when a team needs to keep more data than Safe Harbor allows. For example:

- research-support datasets
- operational analytics
- product or model development
- quality projects that need more time-based or location detail

But Expert Determination is not “our data scientist looked at it” or “IT said it seems fine.” It needs a rigorous expert process and clear records.

Who Qualifies as an Expert?

OCR does not certify experts for this method. The rule calls for a person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable. In practice this means:

- A statistician with privacy or health data experience
- A biostatistician who knows re-identification research
- A data scientist with published work in health data masking
- An academic or consultant who can document their method and defend it in an audit

An internal IT analyst or compliance officer does not qualify unless they can show that background. The expert designation must be defensible: if OCR asked, you could produce credentials, methods, and a signed report.

What the Expert’s Report Must Contain

There is no required template, but a solid Expert Determination report will cover:

- The expert’s skills and relevant background
- What data was reviewed and how it will be used
- The statistical methods used to assess re-identification risk
- The risk threshold applied (the rule says “very small” and OCR guidance sets no universal number; 0.09, roughly a 1-in-11 chance of re-identification, is one figure cited in practice)
- Any leftover risks found and the reason for accepting them
- A signed statement that re-identification risk is very small

Without a written report, Expert Determination is just an informal opinion. OCR expects records that can hold up to review after the fact.

Statistical Methods Used in Expert Determination

Common approaches include:

- K-anonymity: Each record is indistinguishable from at least k-1 other records on a set of quasi-identifiers. With k=5, every person shares all quasi-identifier values with at least four others.
- L-diversity and t-closeness: These build on k-anonymity by controlling how sensitive values are distributed within each group. That limits inference attacks even when a person’s identity is hidden.
- Differential privacy: Adds calibrated noise to query results so that the presence of any single record cannot be inferred from the output. Some health analytics tools use this method. OCR has begun to cite it in informal guidance.
- Risk-based modeling: Estimates the real-world chance of re-identification using population-level data, record uniqueness, and known outside data sources (e.g., voter rolls, social media).

The choice of method depends on the data type, its planned use, and the downstream risk. A report that explains why a method was chosen and how it was applied is far stronger than one that just states a result.
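
To make the differential privacy item concrete, here is a minimal Python sketch of the Laplace mechanism applied to a single count. The function names and the `epsilon` values are assumptions for illustration. The sketch assumes each person contributes at most one record to the count (sensitivity 1) and does no privacy-budget accounting, so it is not a production mechanism.

```python
import random
from typing import Optional


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_count(true_count: int, epsilon: float,
                rng: Optional[random.Random] = None) -> float:
    """Release a count with Laplace noise of scale 1/epsilon (sensitivity 1)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    return true_count + laplace_noise(scale=1.0 / epsilon, rng=rng)


# Smaller epsilon means more noise and stronger privacy.
rng = random.Random(2024)  # seeded only to make the demo repeatable
for eps in (0.1, 1.0, 10.0):
    print(eps, round(noisy_count(42, eps, rng), 2))
```

The noise averages out to zero over many releases, which is why real systems also track how many queries are answered. That accounting is outside the scope of this sketch.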

When Expert Determination Is Better Than Safe Harbor

Expert Determination is the right choice when the data loses key research or operational value under Safe Harbor’s strict removal rules. Common cases:

- Research datasets that need monthly admission trends, not just year-only dates
- Health analytics that need location detail finer than the Safe Harbor ZIP rule allows
- AI and machine learning training sets where time or location detail makes the model more accurate
- Quality projects that need links across data points that Safe Harbor would strip

The tradeoff is cost and time. Expert Determination means hiring a qualified expert, paying for their review, and keeping their report as a compliance record. For datasets that pass Safe Harbor cleanly, that extra work is not worth it.

Safe Harbor vs. Expert Determination

The practical tradeoff is simple:

- Safe Harbor is more rigid but easier to explain
- Expert Determination is more flexible but requires stronger expertise and records

Groups often choose Safe Harbor when they want a more standard compliance path. They choose Expert Determination when data loses too much value under Safe Harbor. The mistake is claiming Expert Determination on the strength of a rough internal review.

The Biggest De-Identification Mistakes

These are the failures that show up most often:

1. Removing Names and Stopping There

This is the classic error. Teams strip the obvious direct IDs and think the dataset is now anonymous. It often is not.

2. Ignoring Combinations of Data Points

Individual fields may seem harmless, but combinations create risk. Examples:

- rare condition plus exact date
- age over 89 plus location
- very specific service line plus event timing
- small staff or patient group plus internal operational data

3. Reusing Internal Data for New Purposes

Data that was fine inside one workflow may not be de-identified for other uses like:

- External sharing
- Vendor use
- Marketing analysis
- Testing

Context changes the risk.

4. Sending Live PHI to Vendors for Troubleshooting

This happens all the time. Teams send screenshots, exports, or logs. They hand vendors real data to skip the hassle of building sample data. That is not a de-identification plan. It is shortcut culture.
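
As a small illustration of masking before sharing, the Python sketch below replaces a few obvious patterns in a log line. The patterns, labels, and sample text are assumptions made up for the example. Pattern matching like this catches only well-formatted values and misses names, record numbers, and free-text clues, so its output should not be treated as de-identified data.

```python
import re

# Illustrative patterns only; real logs contain many formats these will miss.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each pattern match with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Fictional log line, for illustration only.
line = ("ERROR lookup failed for patient Jane Doe (jane.doe@example.com), "
        "SSN 123-45-6789, callback 555-010-1234")
print(redact(line))
# ERROR lookup failed for patient Jane Doe ([EMAIL]), SSN [SSN], callback [PHONE]
```

The patient's name survives in the output, which is the point: a regex pass reduces exposure but does not replace a documented de-identification method or a synthetic test dataset.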

5. No Documentation of the Method Used

If someone asks how the data was de-identified, there should be a clear answer:

- which method was used
- which identifiers were removed
- who approved the process
- what leftover risk was weighed

If nobody can answer those questions, the process is weak.

"Actual Knowledge" Still Matters

Safe Harbor is not just about deleting items. It also requires that the group have no actual knowledge that the remaining information could identify a person. You cannot strip the listed data and then ignore clues your team already knows. If everyone on the team can tell who the patient is from the remaining dataset, the data is not de-identified.

De-Identification vs. Limited Data Sets

This is another common confusion point.

A limited data set is not the same as de-identified data. A limited data set may still hold some dates and limited location details. It needs a data use agreement. Teams sometimes apply limited-data-set logic while calling the result de-identified. That is the wrong label, and it matters. If the data is only partly stripped and still depends on use limits, you may not be in de-identified space at all.

What a Limited Data Set May Contain

A limited data set strips direct IDs but keeps data that Safe Harbor would remove on purpose. It may retain:

- Town, city, state, and five-digit ZIP codes (all but state are items Safe Harbor removes)
- Date details, such as admission, discharge, and service dates, that Safe Harbor limits to year only
- Ages, including ages over 89 that Safe Harbor requires grouping

What it must remove: names, postal address (street), phone numbers, fax numbers, email addresses, Social Security numbers, medical record numbers, health plan member numbers, account numbers, certificate and license numbers, vehicle IDs, device IDs, web URLs, IP addresses, biometric identifiers, and full-face photos.

Data Use Agreement Requirements

A limited data set can only be shared with someone who has signed a Data Use Agreement (DUA). The DUA must state that the person or group will:

- Use or share the limited data set only for the purposes named in the agreement
- Not try to re-identify or contact the people in the data
- Use proper safeguards to stop uses or sharing not allowed by the agreement
- Report any use or sharing that breaks the agreement
- Make sure any agents or subcontractors handling the data agree to the same rules

A DUA is not the same as a Business Associate Agreement, though some groups combine them. The DUA is specific to limited data sets. It governs how the receiving party handles data that is not fully de-identified.

| Feature | De-Identified Data (Safe Harbor) | Limited Data Set |
|---|---|---|
| PHI status under HIPAA | Not PHI; HIPAA rules do not apply | Still PHI; HIPAA rules apply |
| Agreement required | None | Data Use Agreement (DUA) |
| Dates retained | Year only | Full dates permissible |
| Geographic detail | State-level or limited 3-digit ZIP | City, ZIP, county permissible |
| Ages over 89 | Must be grouped as “90 or older” | Exact age permissible |
| Permitted uses | No HIPAA limits; no longer PHI | Research, public health, healthcare operations only |
| Re-identification risk | Addressed by removing all 18 identifier types, with no actual knowledge that the rest identifies anyone | Handled by DUA limits, not removed |

The bottom line: if the data must keep dates and city-level details for research or analytics, a limited data set with a proper DUA is often the right path. If the planned use falls outside research, public health, and healthcare operations (such as general business analytics or public release), full de-identification is needed.

Common De-Identification Use Cases

Knowing where de-identification is used in the real world helps teams set up the right steps before data leaves a controlled setting.

Clinical Research and IRB Studies

Academic medical centers and health systems often de-identify patient records for research studies. IRBs often accept Safe Harbor as enough to waive consent, but they need proof of the method used. Expert Determination is used when the study needs date or location detail that Safe Harbor strips.

Population Health Analytics

Health systems and payers use de-identified claims and clinical data to track disease rates, find care gaps, and model risk at the local level. Safe Harbor often works here. The exception is when the analysis covers small areas or rare conditions. Those cases may need Expert Determination to avoid re-identification through a combination of small groups and retained data points.

AI and Machine Learning Training Data

This is where teams most often miss the re-identification risk. Training a clinical model on de-identified patient data sounds simple. But large language models and neural nets can learn patterns that re-identify people from outputs. This risk goes beyond whether the input data was cleaned. If you use patient data for AI work, treat model outputs as a possible re-identification path, not just the training set.

Healthcare Marketing and Analytics

Using de-identified patient data for marketing analytics is allowed under HIPAA, but only if the data is truly de-identified and not just stripped of names. Ad platforms that get patient data for audience modeling are a common source of risk, especially when teams confuse limited data sets with de-identified data. The minimum necessary rule applies to what data reaches marketing workflows in the first place.

Re-Identification Risk: What “Very Small” Actually Means

HIPAA’s Expert Determination standard says re-identification risk must be “very small.” The rule does not set a number, and OCR’s guidance says no single figure fits every case. One benchmark cited in practice is a re-identification probability below roughly 0.09, meaning a record has less than about a 1-in-11 chance of being correctly re-identified using real-world methods.

Any threshold matters because re-identification attacks have become much easier. Research has shown that:

- 87% of Americans can be singled out using only five-digit ZIP code, full birth date, and sex, three fields many teams see as harmless
- Combining a rare diagnosis with age, location, and rough event date can single out a patient in a small group with no other data
- Public data sources (voter rolls, social media, property records) can be joined to “anonymous” datasets to re-identify people at scale

Under Safe Harbor, you manage re-identification risk by removing all 18 types and confirming no one has actual knowledge that the leftover data points to a person. Under Expert Determination, it takes formal testing and records. Either way, the standard is not an aspiration. It is a requirement that can be tested and enforced.
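
The join described in the last bullet can be sketched with two toy tables in Python. Everything here is fictional and hypothetical: the field names, the people, and the `link_records` helper. The sketch only shows the mechanics of a linkage on ZIP code, birth date, and sex, and why a unique match amounts to a re-identification.

```python
from typing import Dict, List, Mapping, Sequence, Tuple

Key = Tuple[object, ...]


def link_records(released: Sequence[Mapping[str, object]],
                 public: Sequence[Mapping[str, object]],
                 keys: Sequence[str],
                 name_field: str = "name") -> List[Tuple[Mapping[str, object], str]]:
    """Pair each released row with a name when exactly one public record matches."""
    index: Dict[Key, List[str]] = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(str(person[name_field]))
    matches = []
    for row in released:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match is a re-identification
            matches.append((row, candidates[0]))
    return matches


# Fictional "anonymous" release and fictional public roll, for illustration only.
released = [
    {"zip": "00001", "birth_date": "1950-01-02", "sex": "F", "diagnosis": "condition A"},
    {"zip": "00002", "birth_date": "1980-05-06", "sex": "M", "diagnosis": "condition B"},
]
public = [
    {"name": "Person One", "zip": "00001", "birth_date": "1950-01-02", "sex": "F"},
    {"name": "Person Two", "zip": "00002", "birth_date": "1980-05-06", "sex": "M"},
    {"name": "Person Three", "zip": "00002", "birth_date": "1980-05-06", "sex": "M"},
]

hits = link_records(released, public, keys=["zip", "birth_date", "sex"])
print([(name, row["diagnosis"]) for row, name in hits])
# [('Person One', 'condition A')]
```

The second released row is not linked because two public records share its values. Generalizing fields (year instead of full date, three-digit ZIP) works by making more rows look like that second one.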

Operational Uses That Deserve Review

You should review de-identification steps if your group uses data for:

- vendor troubleshooting
- software testing
- marketing analytics
- AI or model training
- internal dashboards
- quality reporting
- case studies
- public presentations

These are the settings where teams often move fast. They assume the data is harmless because it has no patient names. That belief is what creates later exposure with vendors, contractors, and internal teams.

A Practical De-Identification Review Checklist

- Which method are we using: Safe Harbor or Expert Determination?
- Have all required data types been removed if we claim Safe Harbor?
- Does the leftover dataset still create clear re-identification risk in context?
- Are dates, location, age, and rarity creating a combined exposure?
- Are vendors receiving data that should be masked or de-identified first?
- Is the method documented well enough to explain later?

If those questions produce fuzzy answers, the process needs work.

Final Takeaway

HIPAA de-identification rules are not met by surface-level edits to a dataset. The real question is not whether the most obvious IDs are gone. The question is: can the leftover information be connected back to a person with reasonable effort?

That is why disciplined de-identification matters:

- choose the right method
- document the process
- think about combinations, not just fields
- do not use live PHI when masked or fake data would work

Groups that get this right reduce risk. They do so without pretending data is safer than it is.

If your current process depends on:

- Ad hoc redaction
- Screenshots
- Loose judgment calls

review it before it turns into a privacy problem.

Documentation Requirements for Both Methods

HIPAA does not have a single template for de-identification records. But both methods need files that hold up in an OCR audit or legal challenge. At minimum, a group should be able to produce:

- Method chosen: Safe Harbor or Expert Determination, and the reason for the choice
- Dataset description: What data was processed, from what source, and for what intended use
- Items removed (Safe Harbor): A record showing all 18 types were reviewed and handled, including how edge cases (ZIP codes, ages over 89, dates) were treated
- Expert’s report (Expert Determination): A signed report from a qualified expert stating that re-identification risk is very small, with the method described
- Approval chain: Who approved the de-identification process and who reviewed the output before it was shared
- Ongoing review: Whether the de-identification process gets reviewed again as the data setting or use case changes

Teams that treat de-identification as a one-time technical step rather than a recorded process create a gap that shows up under audit. Good records also protect you if a later party misuses data. Proof that the method was applied correctly puts responsibility where it belongs.
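
One way to keep those items together is a small structured record per dataset. The Python sketch below is a hypothetical shape, not a required HIPAA format: the class name, field names, and the checks in `gaps()` are assumptions drawn from the list above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

VALID_METHODS = ("Safe Harbor", "Expert Determination")


@dataclass
class DeidentificationRecord:
    method: str
    rationale: str
    dataset_description: str
    source_system: str
    intended_use: str
    approved_by: str
    reviewed_by: str
    completed_on: date
    edge_case_notes: List[str] = field(default_factory=list)  # ZIP, age, date handling
    expert_report_ref: Optional[str] = None  # pointer to the signed report
    next_review_due: Optional[date] = None

    def gaps(self) -> List[str]:
        """List documentation items that are still missing."""
        problems = []
        if self.method not in VALID_METHODS:
            problems.append("method must be Safe Harbor or Expert Determination")
        if self.method == "Expert Determination" and not self.expert_report_ref:
            problems.append("missing signed expert report")
        if self.method == "Safe Harbor" and not self.edge_case_notes:
            problems.append("no notes on ZIP, age, and date handling")
        if not self.approved_by or not self.reviewed_by:
            problems.append("approval chain incomplete")
        if self.next_review_due is None:
            problems.append("no ongoing review scheduled")
        return problems


# Fictional example, for illustration only.
record = DeidentificationRecord(
    method="Safe Harbor",
    rationale="Year-level dates are enough for the planned analysis",
    dataset_description="Encounter extract, one row per visit",
    source_system="example-ehr",
    intended_use="Vendor performance testing",
    approved_by="Privacy Officer",
    reviewed_by="",
    completed_on=date(2024, 1, 15),
)
print(record.gaps())
```

For this example the output lists three gaps: no edge-case notes, an incomplete approval chain, and no scheduled review.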

If your group needs help building a strong de-identification framework, HIPAA consulting support can speed up the process without starting from scratch.

Frequently Asked Questions

What are the 18 identifiers that must be removed for HIPAA Safe Harbor de-identification?
The 18 identifiers are: (1) names, (2) geographic data smaller than state level, (3) dates except year related to the individual, (4) phone numbers, (5) fax numbers, (6) email addresses, (7) Social Security numbers, (8) medical record numbers, (9) health plan beneficiary numbers, (10) account numbers, (11) certificate and license numbers, (12) vehicle identifiers, (13) device identifiers, (14) web URLs, (15) IP addresses, (16) biometric identifiers, (17) full-face photographs, and (18) any other unique identifying number or code. Ages over 89 must also be aggregated into a “90 or older” category.

What is the difference between Safe Harbor and Expert Determination under HIPAA?
Safe Harbor means removing all 18 listed identifiers, a checklist any compliance team can follow. Expert Determination lets you keep more detailed data such as monthly dates or sub-state geography, but a qualified statistician or privacy expert must formally certify that re-identification risk is “very small.” Expert Determination is more flexible but costs more and needs stronger records.

Can ZIP codes be included in de-identified data?
It depends on population size. Under Safe Harbor, you may retain the first three digits of a ZIP code only if the geographic area formed by all ZIP codes sharing those three digits contains more than 20,000 people. If that area covers 20,000 or fewer people, the ZIP must be recoded as 000. Rural ZIP prefixes often fail this threshold.

Does de-identified data still fall under HIPAA?
No. Once data is properly de-identified using either Safe Harbor or Expert Determination, it is no longer PHI and is not subject to HIPAA’s Privacy Rule. This is the core value of de-identification: it allows data to be used for research, analytics, or marketing without HIPAA restrictions. A limited data set, by contrast, remains PHI and still requires a Data Use Agreement.

What is a limited data set under HIPAA?
A limited data set is a middle-ground option. It strips direct identifiers like name, address, and SSN, but may keep geographic data at the city and ZIP level plus full dates. It can be shared for research, public health, or healthcare operations under a Data Use Agreement, without full de-identification. It is still PHI; HIPAA still applies.

What are the two HIPAA de-identification methods?
They are Safe Harbor and Expert Determination. Safe Harbor removes specific identifiers from a defined list of 18 categories. Expert Determination relies on a qualified expert applying statistical principles to certify that re-identification risk is very small.

Is deleting the patient's name enough to de-identify data?
No. Removing a name alone does not de-identify someone under HIPAA. If the remaining data, such as dates, ZIP codes, age, or rare diagnoses, can still identify a person through combination or context, the data is not de-identified.

What is the difference between de-identified data and a limited data set?
A limited data set is not de-identified. It can still contain certain dates and limited geographic details such as city and ZIP code. It also requires a Data Use Agreement and remains subject to HIPAA. Fully de-identified data carries none of those restrictions.

Can I send partially redacted patient data to a vendor for troubleshooting?
Not safely by default. If the data is still identifiable, you may still be sending PHI, which requires a Business Associate Agreement and appropriate safeguards. The better approach is masking, de-identifying, or using a small synthetic dataset that contains no real patient data.

Related Reading

- HIPAA Minimum Necessary Rule - Why reducing data scope matters before anything leaves the workflow
- Your Vendor Got Hacked: Now What? - Why vendor data handling shortcuts become major exposure during incidents
- HIPAA Authorization Form Requirements - A separate but related records area. Teams often assume broad permission.