Part 4: Standard Ways to Process Datasets with QI Values

  • k-anonymity:
    This approach is quite different from the one that I described earlier. With
    k-anonymity, we’re not aiming to ‘hide’ any data, but rather are softly
    ‘masking’ the QI values. The most popular techniques used in k-anonymity are
    purging and generalization. Purging simply replaces QI values with a
    placeholder string like ‘-’ (the idea is similar to suppression).
    Generalization doesn’t remove QI values completely but replaces exact values
    with ranges (e.g. 20-30 years old). The main goal of k-anonymity is to
    guarantee that no query on the QI values of a large dataset can narrow a
    group down below a threshold of ‘k’ individuals. Strictly speaking,
    ‘k-anonymity’ ensures that every equivalence group of a dataset has at least
    ‘k’ records (an equivalence group is a subset of the dataset whose records
    share the same values for all of the QIs). For instance, a 3-anonymous
    dataset ensures that each query a potential attacker can perform on the QI
    values matches at least 3 individuals, who cannot be distinguished from one
    another based on those values.
  • l-diversity:
    Unfortunately, k-anonymized datasets may still be subject to attacks, usually
    because an equivalence group lacks diversity in its sensitive attribute (the
    non-QI value an attacker wants to learn, e.g. a diagnosis). The extreme case
    is when all records of an equivalence group share the same sensitive value,
    enabling the attacker to easily infer it for everyone in the group.
    l-diversity makes sure that each equivalence group contains at least ‘l’
    distinct sensitive values.
  • t-closeness:
    Purging and generalization change how values are distributed within the
    equivalence groups, so it is worth noting that the distribution of the
    sensitive attribute in each equivalence group should be similar to its
    distribution in the whole dataset. Specifically, the distance between the two
    should not be bigger than the pre-specified value ‘t’. Earth Mover’s Distance
    is used to measure the distance between the distributions.
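
To make the k-anonymity and l-diversity definitions more concrete, here is a minimal Python sketch. The toy records, the column names (`age`, `zip`, `diagnosis`) and the helper functions are all hypothetical and chosen only for illustration; the generalization rule (10-year age ranges, last two zip digits masked) is one possible choice, not a prescribed one.

```python
from collections import defaultdict

# Hypothetical toy data: "age" and "zip" are QIs, "diagnosis" is the sensitive attribute.
RECORDS = [
    {"age": 23, "zip": "10115", "diagnosis": "flu"},
    {"age": 27, "zip": "10117", "diagnosis": "asthma"},
    {"age": 29, "zip": "10119", "diagnosis": "flu"},
    {"age": 34, "zip": "20095", "diagnosis": "diabetes"},
    {"age": 36, "zip": "20097", "diagnosis": "diabetes"},
    {"age": 38, "zip": "20099", "diagnosis": "diabetes"},
]
QI_COLUMNS = ["age", "zip"]


def generalize(record):
    """Replace the exact age with a 10-year range and mask the last two zip digits."""
    low = record["age"] // 10 * 10
    return {
        "age": f"{low}-{low + 9}",
        "zip": record["zip"][:3] + "**",
        "diagnosis": record["diagnosis"],
    }


def equivalence_groups(records, qi_columns):
    """Group records that share the same values for all QI columns."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[column] for column in qi_columns)
        groups[key].append(record)
    return groups


def k_of(records, qi_columns):
    """Largest k for which the records are k-anonymous (smallest group size)."""
    groups = equivalence_groups(records, qi_columns)
    return min(len(group) for group in groups.values())


def l_of(records, qi_columns, sensitive):
    """Smallest number of distinct sensitive values found in any equivalence group."""
    groups = equivalence_groups(records, qi_columns)
    return min(len({r[sensitive] for r in group}) for group in groups.values())


GENERALIZED = [generalize(record) for record in RECORDS]
```

With this toy data, `k_of(RECORDS, QI_COLUMNS)` is 1 because every raw record is unique, while `k_of(GENERALIZED, QI_COLUMNS)` is 3. However, `l_of(GENERALIZED, QI_COLUMNS, "diagnosis")` is still 1: everyone in the 30-39 group has the same diagnosis, which is exactly the weakness that l-diversity is meant to catch.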
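
The t-closeness check can be sketched in a similar way. The example below assumes a categorical sensitive attribute where any two distinct values are equally far apart; under that assumption, Earth Mover’s Distance reduces to half the sum of absolute differences between the two distributions. The data and function names are hypothetical, and ordered or hierarchical attributes would need a different ground distance.

```python
from collections import Counter, defaultdict

# Hypothetical, already generalized records: "age" is the QI, "diagnosis" is sensitive.
ROWS = [
    {"age": "20-29", "diagnosis": "flu"},
    {"age": "20-29", "diagnosis": "asthma"},
    {"age": "20-29", "diagnosis": "flu"},
    {"age": "30-39", "diagnosis": "diabetes"},
    {"age": "30-39", "diagnosis": "diabetes"},
    {"age": "30-39", "diagnosis": "diabetes"},
]


def distribution(values, domain):
    """Relative frequency of each domain value within `values`."""
    counts = Counter(values)
    return [counts[value] / len(values) for value in domain]


def emd_equal_distance(p, q):
    """EMD for categorical values, assuming every pair of distinct values is at distance 1."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def t_of(rows, qi_columns, sensitive):
    """Largest distance between any group's sensitive distribution and the overall one."""
    domain = sorted({row[sensitive] for row in rows})
    overall = distribution([row[sensitive] for row in rows], domain)
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in qi_columns)
        groups[key].append(row[sensitive])
    return max(
        emd_equal_distance(distribution(values, domain), overall)
        for values in groups.values()
    )


worst_distance = t_of(ROWS, ["age"], "diagnosis")
```

For this toy table `worst_distance` comes out at about 0.5, so the data would satisfy t-closeness only for a threshold ‘t’ of at least 0.5. A stricter threshold would require further generalization or regrouping of the records.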

Preserving all of the rules defined by k-anonymity, l-diversity and t-closeness
at the same time can lead to complex combinatorial problems. At this point,
machine learning techniques become quite useful, because they can operate on
data in separate hyperplanes and perform computations there, which can be a very
complex task for the approaches described earlier.

This is the fourth part of a five-part series about machine learning methodologies for de-identifying and securing personal data by 1touch.io.

*** This is a Security Bloggers Network syndicated blog from 1touch.io authored by Halyna Oliinyk. Read the original post at: https://1touch.io/part-4-process-datasets-qi-values/