The Basics of k-Anonymity: Making Individuals Harder to Identify

Swetha Srihari
Jun 5

Updated: Jun 5

Part 2 of the Data Privacy Series

If you're new to data privacy concepts, you may find it helpful to start with my previous article, The Privacy Challenge in Machine Learning.

Figure 1. Without k-Anonymity, a unique individual may stand out within a dataset, increasing the risk of re-identification. After applying k-Anonymity, records become part of a larger group, making it more difficult to distinguish any single individual.

Suppose you are given the following information about a person:

Age	Gender	ZIP Code
29	Female	97205

There is no name, address, Social Security number, or any other direct identifier in the table.

Here’s the thing — ZIP code 97205 is a fairly specific area of Portland, Oregon. If you cross-reference publicly available data, the number of 29-year-old women living there might be surprisingly small. In some cases, small enough to make a confident guess about who this person is.

That’s the problem k-Anonymity was built to solve. Removing names isn’t enough when the remaining attributes are specific enough to act as a fingerprint. k-Anonymity tries to blur that fingerprint by making each record look like several others.
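To make the cross-referencing step concrete, here is a minimal sketch of a linkage check. The public roster, its placeholder names, and the `link` helper are all hypothetical and invented for illustration; only the 29-year-old record comes from the table above.

```python
# The "anonymous" record from the table above.
released = {"age": 29, "gender": "Female", "zip": "97205"}

# Hypothetical public roster (made-up placeholder entries, not real data).
public_roster = [
    {"name": "Person A", "age": 29, "gender": "Female", "zip": "97205"},
    {"name": "Person B", "age": 41, "gender": "Male", "zip": "97205"},
    {"name": "Person C", "age": 29, "gender": "Female", "zip": "97210"},
]


def link(record, roster, keys=("age", "gender", "zip")):
    """Return the names of roster entries that match the record on every key."""
    return [p["name"] for p in roster if all(p[k] == record[k] for k in keys)]


matches = link(released, public_roster)
print(matches)  # ['Person A']
```

When the match list has exactly one entry, the record has effectively been re-identified even though it never contained a name.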

What Is k-Anonymity?

The core idea is simple: each record in a dataset should be indistinguishable from at least k − 1 other records with respect to a chosen set of attributes (the quasi-identifiers, such as age, gender, and ZIP code).

If k = 5, that means for any given record, there are at least four others that look identical. An attacker who knows certain details about a specific person can’t pinpoint their record — they’re just looking at a group. Here’s what that looks like in practice.

Before applying k-Anonymity, each record is unique:

Age	Gender	ZIP Code
24	Female	97205
26	Female	97202
29	Female	97208
22	Female	97201
28	Female	97209

Table 1. Original records containing unique combinations of attributes.

After applying generalization to achieve 5-anonymity:

Age Range	Gender	ZIP Code
20–30	Female	972**
20–30	Female	972**
20–30	Female	972**
20–30	Female	972**
20–30	Female	972**

Table 2. Records after applying generalization to achieve 5-anonymity.

The five individuals still have different ages and live in different ZIP codes — but you can’t tell that from the table anymore. That’s the point. Each person is now hidden inside a group of five, and that’s what 5-anonymity means.
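The definition translates almost directly into a check: group the records by their attribute values and look at the smallest group. The sketch below uses only the standard library. The function names and the dictionary layout are my own assumptions; the data mirrors Tables 1 and 2.

```python
from collections import Counter

# Attributes used for grouping (an assumption matching the tables above).
COLUMNS = ("age", "gender", "zip")


def smallest_group(records, columns=COLUMNS):
    """Size of the smallest set of records sharing the same values (0 if empty)."""
    groups = Counter(tuple(r[c] for c in columns) for r in records)
    return min(groups.values()) if groups else 0


def is_k_anonymous(records, k, columns=COLUMNS):
    """True when every record shares its values with at least k - 1 others."""
    return smallest_group(records, columns) >= k


table_1 = [
    {"age": 24, "gender": "Female", "zip": "97205"},
    {"age": 26, "gender": "Female", "zip": "97202"},
    {"age": 29, "gender": "Female", "zip": "97208"},
    {"age": 22, "gender": "Female", "zip": "97201"},
    {"age": 28, "gender": "Female", "zip": "97209"},
]
table_2 = [{"age": "20-30", "gender": "Female", "zip": "972**"} for _ in range(5)]

print(is_k_anonymous(table_1, 5))  # False
print(is_k_anonymous(table_2, 5))  # True
```

Note that this only verifies the property. It says nothing about how to generalize the data to reach it.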

Identifiers, Quasi-Identifiers, and Sensitive Attributes

Before going further, it’s worth understanding how different types of data contribute to privacy risk. Not all attributes are equal.

Attribute Type	Examples	Description
Identifiers	Name, Social Security Number, Email Address	Directly identify an individual.
Quasi-Identifiers	Age, Gender, ZIP Code	May identify an individual when combined with other information.
Sensitive Attributes	Medical Diagnosis, Salary, Political Affiliation	Information that should remain private and protected from disclosure.

Table 3. Types of attributes commonly found in datasets and their roles in privacy protection.

Identifiers are the obvious ones — names, SSNs, email addresses. These get stripped first, and most people know to remove them.

Quasi-identifiers are trickier. Age, gender, ZIP code — none of these identify you alone. But combined? They can. This is where most re-identification attacks happen, and it’s what k-Anonymity is specifically designed to address.

Sensitive attributes are the things people actually care about keeping private: medical diagnoses, income, political views. k-Anonymity doesn’t hide these directly — it just tries to prevent them from being linked to a specific person. That distinction matters, and we’ll come back to it when we talk about the technique’s limitations.

How Does k-Anonymity Work?

There are two main tools: generalization and suppression. In practice, you usually need both.

Generalization replaces specific values with broader ones. Age 29 becomes “20–30.” ZIP code 97205 becomes “972**.” You keep the rough shape of the data while reducing its uniqueness.
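As a rough sketch, generalization can be written as two small helpers. The names `generalize_age` and `generalize_zip` are hypothetical. One deliberate difference from the tables in this article: the age bins are labelled "20-29" rather than "20–30", so that an age of exactly 30 lands in only one bin. That labelling is my assumption, not a standard.

```python
def generalize_age(age, width=10):
    """Map an exact age to a non-overlapping range label, e.g. 29 -> '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_zip(zip_code, keep=3):
    """Keep the first `keep` characters and mask the rest with '*'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


print(generalize_age(29))       # 20-29
print(generalize_zip("97205"))  # 972**
```

Widening the age bins or keeping fewer ZIP digits makes the groups larger, at the cost of detail.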

Suppression drops values entirely when generalization alone isn’t enough. A ZIP code becomes “*”. A gender field goes blank. It’s a blunter instrument than generalization, but sometimes it’s necessary.

Age Range	Gender	ZIP Code
20–30	Female	*
20–30	Female	*
20–30	Female	*
20–30	Female	*
20–30	Female	*

Table 4. Example of suppression applied to a quasi-identifier.

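In real datasets, you’ll almost always use both together. Here’s an example where age is generalized, ZIP codes are partially masked, and one gender value is fully suppressed. Read strictly, the suppressed row no longer matches the other four on every attribute, so treat this table as an illustration of the two operations rather than a finished 5-anonymous release: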

Age Range	Gender	ZIP Code
20–30	Female	972**
20–30	*	972**
20–30	Female	972**
20–30	Female	972**
20–30	Female	972**

Table 5. Example of combining generalization and suppression to reduce re-identification risk.

The goal in all cases is the same: make each record look like enough other records that no individual stands out.
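Here is a small sketch of suppression stepping in when generalization alone falls short. The ZIP prefixes "970**" and "971**" are hypothetical values added so that the generalized rows do not yet form a single group. The `suppress` helper blanks a whole column, as in Table 4; real tools often suppress individual cells or entire records instead.

```python
from collections import Counter

COLUMNS = ("age", "gender", "zip")


def smallest_group(records, columns=COLUMNS):
    """Size of the smallest set of records sharing the same values."""
    counts = Counter(tuple(r[c] for c in columns) for r in records)
    return min(counts.values())


def suppress(records, column):
    """Return copies of the records with one column replaced by '*'."""
    return [{**r, column: "*"} for r in records]


# Generalized rows whose masked ZIP codes still differ (hypothetical data).
generalized = [
    {"age": "20-30", "gender": "Female", "zip": "972**"},
    {"age": "20-30", "gender": "Female", "zip": "972**"},
    {"age": "20-30", "gender": "Female", "zip": "972**"},
    {"age": "20-30", "gender": "Female", "zip": "970**"},
    {"age": "20-30", "gender": "Female", "zip": "971**"},
]

print(smallest_group(generalized))                   # 1
print(smallest_group(suppress(generalized, "zip")))  # 5
```

The original list is left untouched, which makes it easy to compare the table before and after suppression.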

Choosing the Value of k

This is where judgment comes in. The value of k sets the size of the anonymity group, and there’s a direct trade-off: bigger groups mean stronger privacy, but also more generalization, which means less useful data.

k Value	Privacy Protection	Data Utility
k = 2	Low	High
k = 5	Moderate	Moderate
k = 10	High	Low
k = 20	Very High	Very Low

Table 6. Illustrative relationship between k values, privacy protection, and data utility.

I’ve seen people default to k = 5 without much thought, and in many cases that’s reasonable. But the right answer really does depend on context. How sensitive is the data? Who are the likely attackers? How precise does the analysis need to be?

A healthcare dataset containing HIV diagnoses warrants a much higher k than a dataset of general shopping preferences. The cost of getting this wrong isn’t symmetric — too low and you expose people; too high and you destroy the data’s usefulness.
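One way to feel the trade-off is to ask how much masking a given k forces on you. The sketch below looks only at ZIP codes and finds the fewest trailing digits that must be masked so that every masked value appears at least k times. The function names are hypothetical, the list combines the ZIP codes from Table 1 with a few made-up ones, and all ZIP codes are assumed to have the same length.

```python
from collections import Counter


def mask_zip(zip_code, masked_digits):
    """Replace the last `masked_digits` characters with '*'."""
    keep = len(zip_code) - masked_digits
    return zip_code[:keep] + "*" * masked_digits


def min_masking_for_k(zip_codes, k):
    """Fewest trailing digits to mask so each masked ZIP occurs >= k times.

    Returns None when even full masking cannot reach k (too few records).
    """
    if not zip_codes:
        return None
    for masked in range(len(zip_codes[0]) + 1):
        counts = Counter(mask_zip(z, masked) for z in zip_codes)
        if min(counts.values()) >= k:
            return masked
    return None


# Table 1 ZIP codes plus hypothetical extras.
zips = ["97205", "97205", "97202", "97208", "97201", "97209", "97301", "97302"]

print(min_masking_for_k(zips, 2))   # 1
print(min_masking_for_k(zips, 5))   # 3
print(min_masking_for_k(zips, 10))  # None
```

On this toy list, moving from k = 2 to k = 5 costs two more digits of geographic detail, and k = 10 is out of reach with only eight records.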

Limitations of k-Anonymity

Here’s where I think a lot of introductions to k-Anonymity fall short: they mention the limitations but don’t quite convey how serious they can be. Let me try to be more direct.

k-Anonymity protects identity. It doesn’t protect sensitive information. That distinction can bite you in two specific ways.

Homogeneity Attack

This happens when everyone in an anonymity group shares the same sensitive value. The group hides which record is yours — but it doesn’t matter, because every record says the same thing.

Age Range	Gender	ZIP Code	Disease
20–30	Female	972**	HIV
20–30	Female	972**	HIV
20–30	Female	972**	HIV
20–30	Female	972**	HIV
20–30	Female	972**	HIV

Table 7. Example of a homogeneity attack.

If an attacker knows you’re in this group, they immediately know your diagnosis. k-Anonymity did its job — you can’t be identified — but the sensitive information leaked anyway. That’s a problem.
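Checking for this is straightforward: count the distinct sensitive values inside each group. The sketch below flags any group with only one. The helper name and column names are assumptions; the data mirrors Table 7.

```python
from collections import defaultdict


def homogeneous_groups(records, quasi_identifiers, sensitive):
    """Return the quasi-identifier groups whose records all share one sensitive value."""
    values_by_group = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        values_by_group[key].add(r[sensitive])
    return [key for key, values in values_by_group.items() if len(values) == 1]


table_7 = [
    {"age": "20-30", "gender": "Female", "zip": "972**", "disease": "HIV"}
    for _ in range(5)
]

flagged = homogeneous_groups(table_7, ("age", "gender", "zip"), "disease")
print(flagged)  # [('20-30', 'Female', '972**')]
```

A group that shows up in this list is 5-anonymous on paper and still reveals everyone's diagnosis.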

Background Knowledge Attack

This one is subtler. Even when a group contains diverse sensitive values, an attacker who knows something about you from outside the dataset might be able to narrow it down.

Age Range	Gender	ZIP Code	Disease
20–30	Female	972**	HIV
20–30	Female	972**	HIV
20–30	Female	972**	Flu
20–30	Female	972**	Cancer
20–30	Female	972**	Flu

Table 8. Example of a background knowledge attack.

Suppose an attacker knows — from a news article, a social media post, a hospital visit logged elsewhere — that the person they’re looking for recently underwent cancer treatment. Suddenly the “Cancer” record in that group is the obvious match. The k-Anonymity held up technically. The privacy didn’t.
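The example above has the attacker knowing the diagnosis outright. A closely related form, assumed here for illustration, is an attacker who can only rule diagnoses out. The sketch below shows how each piece of outside knowledge shrinks the set of possibilities in Table 8. The helper name and the "ruled out" framing are mine.

```python
from collections import Counter

# Sensitive values of the five records in Table 8 (same quasi-identifiers).
table_8 = [
    {"disease": "HIV"},
    {"disease": "HIV"},
    {"disease": "Flu"},
    {"disease": "Cancer"},
    {"disease": "Flu"},
]


def remaining_diagnoses(group, ruled_out):
    """Count the diagnoses still possible after excluding the ruled-out ones."""
    return Counter(r["disease"] for r in group if r["disease"] not in ruled_out)


print(dict(remaining_diagnoses(table_8, set())))           # {'HIV': 2, 'Flu': 2, 'Cancer': 1}
print(dict(remaining_diagnoses(table_8, {"Flu"})))         # {'HIV': 2, 'Cancer': 1}
print(dict(remaining_diagnoses(table_8, {"Flu", "HIV"})))  # {'Cancer': 1}
```

The group never stops being 5-anonymous, yet two facts from outside the dataset are enough to leave a single candidate diagnosis.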

The key difference between these two attacks: homogeneity is a problem with the data itself; background knowledge is a problem with the world outside the data. k-Anonymity can’t fully protect against either one, which is why more advanced techniques like l-diversity and differential privacy were developed to address these gaps.

Real-World Applications of k-Anonymity

Despite its limitations, k-Anonymity is still genuinely useful — especially as a first layer of protection or in contexts where the privacy stakes are moderate.

Healthcare: Patient datasets shared for research or public health analysis are often anonymized using k-Anonymity before release. It’s not perfect, but it’s better than publishing raw records.

Government and Census Data: Demographic data published for policymaking and research needs to be detailed enough to be useful but not so specific that individuals can be identified. k-Anonymity helps strike that balance.

Academic Research: Many research datasets contain personal information. Applying k-Anonymity lets organizations share data with researchers while reducing exposure for study participants.

Business Analytics: Before sharing customer data internally or with partners, companies often apply anonymization. k-Anonymity is a common starting point.

Conclusion

k-Anonymity is one of those ideas that’s elegant in its simplicity and genuinely useful in practice — but it’s easy to over-rely on it. It does one thing well: it makes individuals harder to pick out of a crowd. What it can’t do is guarantee that sensitive information stays hidden once you’ve found the crowd.

That’s not a reason to dismiss it. It’s a reason to understand it clearly and use it as part of a broader privacy strategy rather than a complete solution on its own. The techniques that came after it — l-diversity, t-closeness, differential privacy — were all built to address the gaps k-Anonymity left open. Understanding those gaps is, I’d argue, the most important takeaway from this article.

Next in the series: Differential Privacy — and why adding noise to data can actually be a feature, not a bug.

References

Sweeney, L. (2002). k-Anonymity: A Model for Protecting Privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557–570.
Samarati, P., & Sweeney, L. (1998). Protecting Privacy When Disclosing Information: k-Anonymity and Its Enforcement Through Generalization and Suppression. Proceedings of the IEEE Symposium on Research in Security and Privacy.
Machanavajjhala, A., Gehrke, J., Kifer, D., & Venkitasubramaniam, M. (2007). l-Diversity: Privacy Beyond k-Anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1).
Dwork, C. (2006). Differential Privacy. Proceedings of the 33rd International Conference on Automata, Languages and Programming (ICALP).
