Part 5: Machine Learning Methods to Process Datasets With QI Values

  • Differential
    Privacy (DP): This mathematical framework makes it possible to control how much
    the model ‘remembers’ or ‘forgets’ about potentially sensitive data, which is
    its big advantage. The most popular DP mechanism is ‘noisy counting’: samples
    drawn from the Laplace distribution are added to query results, so the dataset
    is represented by perturbed values rather than the real ones. The main
    disadvantage of Differential Privacy, however, is that an attacker can estimate
    the actual value from repeated queries. Predictions made with differentially
    private datasets are accurate enough, but with each new query the attacker
    makes, more sensitive information is released.
  • Federated
    Learning: The core idea of federated learning is very similar to distributed
    learning: rather than training the model on all of our data at once, we train
    it on subsets of the data held on separate devices. This is quite a powerful
    method, because we can effectively train and gradually improve the model on
    those devices without the data subsets ever leaving them.
  • ‘Private
    Aggregation of Teacher Ensembles’ (PATE): This framework borrows from
    differential privacy the idea of storing personal/sensitive data in a way
    that doesn’t reveal any individual’s personal information. The core idea of
    PATE is that if two models trained on separate data agree on some outcome,
    it is less likely that sharing that outcome with the consumer will leak
    sensitive data about a specific user. The training methodology is quite
    similar to federated learning (and to bagging techniques, of course): first
    we split our dataset into smaller disjoint subsets, then we train a
    different model on each of them. Predictions are made by aggregating the
    predictions of the different models and injecting noise into the aggregate.
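To make the first bullet concrete, here is a minimal sketch of ‘noisy counting’ with NumPy. It assumes a simple counting query (sensitivity 1); the count of 1,000 records, the epsilon values, and the number of queries are all illustrative:

```python
import numpy as np

def noisy_count(true_count, epsilon, rng=None):
    """Answer a counting query with epsilon-differential privacy.

    A counting query has sensitivity 1, so adding Laplace(0, 1/epsilon)
    noise to the true count satisfies epsilon-DP.
    """
    if rng is None:
        rng = np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(0)
true_count = 1000  # e.g. number of records matching some QI value

# A single answer is perturbed, hiding any one individual's contribution.
one_answer = noisy_count(true_count, epsilon=0.1, rng=rng)

# The weakness described above: averaging many repeated answers lets an
# attacker recover the true count, because the zero-mean noise cancels out.
answers = [noisy_count(true_count, epsilon=0.1, rng=rng) for _ in range(10_000)]
attacker_estimate = np.mean(answers)  # converges toward the true count
```

Smaller epsilon means more noise per answer and stronger privacy, but the averaging attack still works given enough queries, which is why practical DP systems track a total privacy budget across queries.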
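The federated training loop described in the second bullet can be sketched roughly as below, using federated averaging over a toy linear-regression model. The client count, the local learning rate, and the synthetic data are assumptions for illustration:

```python
import numpy as np

def local_update(w, X, y, lr=0.1):
    """One gradient step of linear regression (MSE loss) on a client's
    private shard; only the updated weights leave the device."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def federated_round(w, clients, lr=0.1):
    """FedAvg-style round: broadcast weights, train locally on each
    client, then average the updates weighted by shard size."""
    updates = [local_update(w, X, y, lr) for X, y in clients]
    sizes = [len(y) for _, y in clients]
    return np.average(updates, axis=0, weights=sizes)

rng = np.random.default_rng(42)
true_w = np.array([2.0, -1.0])

# Five hypothetical devices, each holding its own data subset.
clients = []
for _ in range(5):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w))

w = np.zeros(2)
for _ in range(100):
    w = federated_round(w, clients)  # w approaches true_w; raw data is never pooled
```

Only model parameters cross the network in each round, which is what lets the global model improve while every data subset stays on its own device.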

Another important feature of PATE is that we continuously train a downstream ‘student’ model on this ‘noisy’ labeled data, and in the end we expose to the user not the ‘teacher’ models but the ‘student’ one, which ensures that sensitive/personal data is not revealed during the inference phase.
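A rough sketch of PATE's noisy aggregation step follows. The ten teacher votes and the three-class setup are hypothetical; in the full framework, labels produced this way would supervise the downstream ‘student’ model:

```python
import numpy as np

def noisy_argmax(teacher_preds, num_classes, epsilon, rng):
    """PATE-style aggregation: add Laplace noise to each class's vote
    count and release only the winning class, so no single teacher's
    vote (and hence no single data subset) is exposed."""
    votes = np.bincount(teacher_preds, minlength=num_classes).astype(float)
    votes += rng.laplace(scale=1.0 / epsilon, size=num_classes)
    return int(np.argmax(votes))

rng = np.random.default_rng(0)

# Hypothetical votes from 10 teachers, each trained on a disjoint subset,
# classifying one unlabeled example; 9 of 10 agree on class 2.
teacher_preds = np.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 0])
label = noisy_argmax(teacher_preds, num_classes=3, epsilon=1.0, rng=rng)

# A strong consensus survives the noise with high probability, which is
# exactly the "models that agree leak less" intuition described above.
```

Because only these noisy labels ever reach the student, and only the student is served to users, an individual record in any one teacher's subset has little influence on what the user can observe.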

We would love to hear your thoughts on this series. Please feel free to respond here with comments/questions/other feedback.

This is the fifth part of a five-part series about machine learning methodologies for de-identifying and securing personal data by 1touch.io. For part one, click here. For part two, click here. For part three, click here. For part four, click here.



*** This is a Security Bloggers Network syndicated blog from 1touch.io authored by Halyna Oliinyk. Read the original post at: https://1touch.io/part-5-machine-process-datasets-qi-values/