Part 4: Standard Ways to Process Datasets with QI Values

by Halyna Oliinyk on July 31, 2019

k-anonymity.
This approach is quite different from the one that I described earlier. With
k-anonymity, we’re not aiming to ‘hide’ any data, but rather are softly
‘masking’ the QI values. The most popular techniques used in k-anonymity are
purging and generalization. Purging simply replaces QI values with a fixed
placeholder string like ‘-’ (the idea is similar to suppression). Generalization
doesn’t remove QI values completely but replaces exact values with ranges
(e.g. 20-30 years old). The main goal of k-anonymity is to guarantee that no
query on the QI values of a large dataset can narrow a group down below a
threshold of ‘k’ individuals. Strictly speaking, k-anonymity ensures that every
equivalence group of a dataset has at least ‘k’ records (an equivalence group
is a subset of records that share the same combination of QI values). For
instance, a 3-anonymous dataset ensures that every QI-based query a potential
attacker can perform returns at least 3 individuals who cannot be
distinguished from one another based on the QI values.
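
To make the idea concrete, here is a minimal sketch in Python. The records, the field names (age, zip, diagnosis) and the helper functions are all invented for illustration and do not come from any particular library. It generalizes ages into ranges, partially masks a ZIP code, and then reports the size of the smallest equivalence group, which is the ‘k’ that the dataset actually achieves.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Replace an exact age with a non-overlapping range such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def mask_zip(zip_code, keep=3):
    """Keep the first `keep` characters and replace the rest with '*'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def k_of(records, qi_fields):
    """Return the size of the smallest equivalence group over the QI fields."""
    groups = Counter(tuple(r[f] for f in qi_fields) for r in records)
    return min(groups.values())

# Hypothetical toy data: 'age' and 'zip' are treated as QIs.
raw = [
    {"age": 23, "zip": "10115", "diagnosis": "flu"},
    {"age": 27, "zip": "10117", "diagnosis": "asthma"},
    {"age": 21, "zip": "10119", "diagnosis": "flu"},
    {"age": 34, "zip": "20095", "diagnosis": "diabetes"},
    {"age": 38, "zip": "20097", "diagnosis": "flu"},
    {"age": 31, "zip": "20099", "diagnosis": "asthma"},
]

masked = [
    {**r, "age": generalize_age(r["age"]), "zip": mask_zip(r["zip"])}
    for r in raw
]

QI = ("age", "zip")
print("k before masking:", k_of(raw, QI))
print("k after masking:", k_of(masked, QI))
```

In this toy table every raw record is unique on its QI values, while the masked version falls into two groups of three records each. Real datasets usually need a search over many possible generalizations to reach a target ‘k’ without discarding too much detail.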
l-diversity.
Unfortunately, k-anonymized datasets may still be subject to attacks, usually
because an equivalence group may lack diversity in its sensitive attribute.
The extreme case is when all records of an equivalence group have the same
sensitive value, enabling the attacker to easily make an inference about anyone
known to be in that group. l-diversity makes sure that each of the possible
equivalence groups contains at least ‘l’ different values of the sensitive
attribute.
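
A small sketch of this check, again in Python with invented data: the "diagnosis" column stands in for a sensitive (non-QI) attribute, and distinct_l is a hypothetical helper that implements only the simplest variant, which counts distinct values per group.

```python
from collections import defaultdict

def distinct_l(records, qi_fields, sensitive):
    """Return the smallest number of distinct sensitive values in any group."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[f] for f in qi_fields)
        groups[key].add(r[sensitive])
    return min(len(values) for values in groups.values())

# Hypothetical 3-anonymous table: both groups contain three records.
table = [
    {"age": "20-29", "zip": "101**", "diagnosis": "flu"},
    {"age": "20-29", "zip": "101**", "diagnosis": "flu"},
    {"age": "20-29", "zip": "101**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "200**", "diagnosis": "diabetes"},
    {"age": "30-39", "zip": "200**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "200**", "diagnosis": "asthma"},
]

l_value = distinct_l(table, ("age", "zip"), "diagnosis")
print("achieved l:", l_value)
```

Here the first group holds only one diagnosis, so the table is 3-anonymous yet only 1-diverse: anyone known to be aged 20-29 in that ZIP area can be linked to the diagnosis directly.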
t-closeness.
Purging and generalization change how values are distributed across the
equivalence groups, so it is worth noting that the distribution of the
sensitive attribute in each equivalence group should be similar to its
distribution in the whole dataset. Specifically, the distance between the two
should not be bigger than the pre-specified value ‘t’. Earth Mover’s distance
is used to measure the distance between the distributions.
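
The check can be sketched as follows, under a simplifying assumption: the sensitive attribute is categorical and every pair of distinct values is treated as equally far apart. In that special case the Earth Mover’s distance reduces to half the sum of absolute differences between the two distributions. The data and function names are hypothetical, and numeric or hierarchical attributes would need a different ground distance.

```python
from collections import Counter, defaultdict

def distribution(values):
    """Turn a list of values into a {value: probability} mapping."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}

def emd_equal_distance(p, q):
    """Earth Mover's distance when all distinct categories are equally far apart."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def max_group_distance(records, qi_fields, sensitive):
    """Largest distance between any group's distribution and the overall one."""
    overall = distribution([r[sensitive] for r in records])
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[f] for f in qi_fields)].append(r[sensitive])
    return max(
        emd_equal_distance(distribution(values), overall)
        for values in groups.values()
    )

# Hypothetical table with two equivalence groups of three records each.
table = [
    {"age": "20-29", "diagnosis": "flu"},
    {"age": "20-29", "diagnosis": "flu"},
    {"age": "20-29", "diagnosis": "flu"},
    {"age": "30-39", "diagnosis": "flu"},
    {"age": "30-39", "diagnosis": "asthma"},
    {"age": "30-39", "diagnosis": "diabetes"},
]

worst = max_group_distance(table, ("age",), "diagnosis")
t = 0.4  # illustrative threshold, not a recommended value
print("largest distance:", round(worst, 3), "satisfies t:", worst <= t)
```

The dataset satisfies t-closeness for a given ‘t’ only if the largest per-group distance stays at or below that threshold.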

Satisfying all of the rules defined by k-anonymity, l-diversity and
t-closeness at once can lead to complex combinatorial problems. At this point,
machine learning techniques become quite useful, since they can operate on data
in separate hyperplanes and perform computations there, which can be a very
complex task for the approaches described earlier.

This is the fourth part of a five-part series about machine learning methodologies for de-identifying and securing personal data by 1touch.io.

The post Part 4: Standard Ways to Process Datasets with QI Values appeared first on 1touch.io.