- Differential Privacy (DP): This mathematical framework makes it possible to control the extent to which a model ‘remembers’ and ‘forgets’ potentially sensitive data, which is its main advantage. The best-known DP mechanism is ‘noisy counting’, in which samples drawn from a Laplace distribution are added to query results, so that the dataset exposes perturbed values rather than the real ones. The main disadvantage of Differential Privacy, however, is that an attacker may be able to estimate the actual value from repeated queries: predictions made over differentially private datasets remain accurate enough, but with each new query the attacker makes, a little more sensitive information is released.
- Federated
Learning: The core idea of federated learning is very similar to that of distributed learning: instead of training the model on all of the data at once, we train it on subsets of the data held on separate devices. This is a powerful method as long as we can effectively train and gradually improve the model across those devices, each of which holds a different subset of the data.
- ‘Private
Aggregation of Teacher Ensembles’ (PATE): This framework borrows from differential privacy the idea of storing personal/sensitive data in a way that does not reveal any individual’s personal information. The core idea of PATE is that if two models trained on separate data agree on some outcome, sharing that outcome with the consumer is less likely to leak sensitive data about a specific user. The training methodology is quite similar to federated learning (and, of course, to bagging): we first split the dataset into smaller subsets and then train a different ‘teacher’ model on each of them. Predictions are made by aggregating the predictions of all the teacher models and injecting noise into the aggregate.
Another important feature of PATE is that a downstream ‘student’ model is continuously trained on this ‘noisy’ aggregated data, and it is the ‘student’ model, rather than the ‘teachers’, that is ultimately exposed to the user, which ensures that sensitive/personal data is not revealed during the inference phase.
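To make the three mechanisms above concrete, here is a minimal sketch in Python using NumPy. The function names, vote counts, and epsilon values are illustrative assumptions for this post, not part of any of these frameworks’ reference implementations:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_count(true_count, epsilon=1.0):
    """DP 'noisy counting': release the true count plus Laplace noise
    with scale 1/epsilon (query sensitivity assumed to be 1)."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

def fed_avg(client_weights):
    """One federated-averaging step: average the model parameters
    trained on separate clients (equal weighting for simplicity)."""
    return np.mean(np.stack(client_weights), axis=0)

def pate_aggregate(teacher_votes, num_classes, epsilon=1.0):
    """PATE noisy aggregation: count the teachers' votes per class,
    add Laplace noise to each count, and return the winning label."""
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    counts += rng.laplace(loc=0.0, scale=1.0 / epsilon, size=num_classes)
    return int(np.argmax(counts))

# Illustrative usage with made-up numbers:
c = noisy_count(100, epsilon=1.0)  # a perturbed version of the true count 100
w = fed_avg([np.array([1.0, 2.0]), np.array([3.0, 4.0])])  # averaged weights
votes = np.array([0, 0, 0, 0, 0, 0, 0, 1, 2, 0])  # 10 teachers, 3 classes
label = pate_aggregate(votes, num_classes=3, epsilon=5.0)  # almost surely class 0
```

Smaller epsilon values inject more noise, trading accuracy for stronger privacy; this is exactly the tension behind the repeated-query attack described in the DP bullet above.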
We would love to hear your thoughts on this series. Please feel free to respond here with comments/questions/other feedback.
This is the fifth part of a five-part series about machine learning methodologies for de-identifying and securing personal data by 1touch.io. For part one, click here. For part two, click here. For part three, click here. For part four, click here.
*** This is a Security Bloggers Network syndicated blog from 1touch.io authored by Halyna Oliinyk. Read the original post at: https://1touch.io/part-5-machine-process-datasets-qi-values/

