- K-anonymity:
This approach is quite different from the one that I described earlier. With
k-anonymity, we’re not aiming to ‘hide’ any data, but rather to softly ‘mask’
the QI values. The most popular techniques used in k-anonymity are purging and
generalization. Purging simply replaces QI values with placeholder strings like
‘-’ (the idea is similar to suppression). Generalization doesn’t remove QI
values completely but replaces exact values with ranges (e.g. 20-30 years old).
The main goal of k-anonymity is to guarantee that no query on the QI values of
a large dataset can narrow a group down to fewer than ‘k’ individuals. Strictly
speaking, k-anonymity ensures that every equivalence group of a dataset has at
least ‘k’ records (an equivalence group is a subset of the dataset whose
records share the same values for the QIs). For instance, a 3-anonymous dataset
ensures that every query a potential attacker can run on the QI values matches
at least 3 individuals, who cannot be distinguished from one another based on
those values.
- l-diversity:
Unfortunately, k-anonymized datasets may still be subject to attacks, usually
because an equivalence group lacks diversity in its sensitive attribute. The
extreme case is when all records of an equivalence group share the same
sensitive value, which lets the attacker infer that value for anyone known to
be in the group. l-diversity makes sure that each equivalence group contains at
least ‘l’ well-represented (in the simplest form, distinct) sensitive values.
- t-closeness:
Purging and generalization also change how values are distributed within each
equivalence group. t-closeness requires that the distribution of the sensitive
attribute in every equivalence group be similar to its distribution in the
whole dataset. Specifically, the distance between the two should not exceed a
pre-specified threshold ‘t’. Earth Mover’s distance is used to measure the
distance between the distributions.
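The three checks in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy records, the `diagnosis` column used as the sensitive attribute, and all helper names are hypothetical. l-diversity is taken in its simplest "distinct values" form, and the Earth Mover's distance is computed for a categorical attribute where every pair of categories is equally far apart, in which case it reduces to half the sum of absolute differences between the two distributions.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (age, zip_code, diagnosis).
# Age and ZIP code act as QIs; diagnosis is the assumed sensitive attribute.
RECORDS = [
    (23, "47677", "flu"),
    (27, "47602", "flu"),
    (29, "47678", "flu"),
    (41, "47905", "cancer"),
    (45, "47909", "flu"),
    (48, "47906", "asthma"),
]

def generalize(record):
    """Generalize age to a 10-year range and purge the last two ZIP digits."""
    age, zip_code, diagnosis = record
    low = age // 10 * 10
    return (f"{low}-{low + 10}", zip_code[:3] + "--", diagnosis)

def equivalence_groups(rows):
    """Map each QI tuple to the list of sensitive values that share it."""
    groups = defaultdict(list)
    for *qi, sensitive in rows:
        groups[tuple(qi)].append(sensitive)
    return dict(groups)

def k_anonymity(groups):
    """Size of the smallest equivalence group."""
    return min(len(values) for values in groups.values())

def l_diversity(groups):
    """Smallest number of distinct sensitive values in any group."""
    return min(len(set(values)) for values in groups.values())

def distribution(values):
    """Relative frequency of each value."""
    counts = Counter(values)
    return {value: n / len(values) for value, n in counts.items()}

def t_closeness(groups):
    """Largest distance between a group's distribution and the overall one.

    Assumes equal ground distance between categories, so the Earth Mover's
    distance equals half the L1 distance between the distributions.
    """
    overall = distribution([v for values in groups.values() for v in values])
    worst = 0.0
    for values in groups.values():
        local = distribution(values)
        distance = 0.5 * sum(abs(local.get(c, 0.0) - p) for c, p in overall.items())
        worst = max(worst, distance)
    return worst

groups = equivalence_groups([generalize(r) for r in RECORDS])
print(k_anonymity(groups), l_diversity(groups), round(t_closeness(groups), 3))
```

On this toy data the generalized table is 3-anonymous, yet one group contains only ‘flu’, so it is merely 1-diverse. That is exactly the situation l-diversity is meant to rule out.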
Satisfying all of the rules defined by k-anonymity, l-diversity and t-closeness
at once can turn into a complex combinatorial problem. At this point, machine
learning techniques become quite useful, since they can operate on data in
separate hyperplanes and perform computations there, which is a very complex
task for the approaches described earlier.
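To give a feel for the combinatorial side, here is a minimal brute-force sketch that tries every combination of generalization levels for two QI columns and keeps the least aggressive one that still reaches k-anonymity. The data, the generalization ladders and the function names are all hypothetical, and the "cost" of a combination is simply assumed to be the sum of its levels. With `c` QI columns and `m` levels each, this loop visits `m ** c` combinations, before l-diversity and t-closeness constraints are even considered.

```python
from collections import Counter
from itertools import product

# Hypothetical toy QI columns: (age, ZIP code).
ROWS = [(23, "47677"), (27, "47602"), (29, "47678"),
        (41, "47905"), (45, "47909"), (48, "47906")]

# Hypothetical generalization ladders: level 0 keeps the value,
# higher levels are progressively coarser, the last level purges it.
AGE_LEVELS = [
    lambda age: str(age),
    lambda age: f"{age // 10 * 10}-{age // 10 * 10 + 10}",
    lambda age: "-",
]
ZIP_LEVELS = [
    lambda z: z,
    lambda z: z[:3] + "--",
    lambda z: "-",
]

def smallest_group(age_level, zip_level):
    """Size of the smallest equivalence group at the given levels."""
    counts = Counter(
        (AGE_LEVELS[age_level](age), ZIP_LEVELS[zip_level](z)) for age, z in ROWS
    )
    return min(counts.values())

def cheapest_k_anonymous(k):
    """Exhaustively search all level combinations for the cheapest one reaching k.

    Raises ValueError if no combination reaches k (k larger than the dataset).
    """
    candidates = [
        levels
        for levels in product(range(len(AGE_LEVELS)), range(len(ZIP_LEVELS)))
        if smallest_group(*levels) >= k
    ]
    # Assumed cost model: lower total level means less information lost.
    return min(candidates, key=lambda levels: (sum(levels), levels))

print(cheapest_k_anonymous(3))
```

Exhaustive search is fine for two columns and three levels, but it stops being practical quickly as columns and levels are added, which is the motivation for the smarter techniques mentioned above.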
This is the fourth part of a five-part series about machine learning methodologies for de-identifying and securing personal data by 1touch.io.
*** This is a Security Bloggers Network syndicated blog from 1touch.io authored by Halyna Oliinyk. Read the original post at: https://1touch.io/part-4-process-datasets-qi-values/

