
Big Data Security Series Part 3: How to Run Analytics on Protected Data

In the past two posts, we discussed how data is everywhere but security isn’t, and we explored why it’s so tough to protect big data analytics (BDA) environments. So how can we protect data in these complex environments and stay compliant? And how can we still get full value from the data in the face of challenges like these?
One word: tokenization. Tokenization provides strong data protection that enables you to run analytics on protected data.

As I pointed out in the previous post, perimeter defenses just don’t cut it anymore, which is why more and more organizations are moving from an infrastructure focus to a data-centric focus. Data-centric security allows you to check off all the boxes:

  • Does it offer strong protection? ✔
  • Does it support regulatory compliance? ✔
  • Can I still use the data without compromising on security? ✔
  • Can I get it up and running with little to no source code changes? ✔

Ye Data Protection Methods of Olde

There are many methods that can be used to protect data, each with advantages and disadvantages that make it ideal in some scenarios but impractical in others, especially when it comes to data analytics. Here are just a few examples:

Volume Level Encryption

Volume Level Encryption (VLE) is a method many companies use to protect data by encrypting the hard disk. This might sound good, until you ask what happens to the data on the disk. The answer is nothing. Nothing happens to the data. It sits locked away on the hard disk, and as soon as you want to use it, you have to decrypt the drive and put the data back in the clear.

Now, don’t get me wrong, VLE is great in certain situations, for instance, if your laptop is stolen or you forget it in a cab and you don’t want anyone to be able to take out the hard drive and access your data. For a data center however, the only thing VLE is protecting you from is someone crawling in through the heating ducts, then rappelling down Mission Impossible style and stealing your hardware.


Classic Encryption

Classic encryption protects the data itself rather than the storage device, but it still has the disadvantage that the data needs to be decrypted every time it’s used. Every stage where the data sits decrypted is an additional attack vector. Classic encryption also isn’t format-preserving, meaning it’s incompatible with systems that expect a specific number or type of characters in a string of data. And it burdens you with key management, which further complicates things and presents yet another security risk if mistakes are made.
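
To make the format problem concrete, here is a minimal sketch (assuming the widely used third-party Python cryptography package, which the post itself doesn’t prescribe): encrypting a 16-digit card number with AES-GCM produces binary output that no longer fits a field expecting exactly 16 digits.

```python
# Illustrative only: why classic encryption is not format-preserving.
# Assumes the third-party "cryptography" package (pip install cryptography);
# the post itself does not prescribe any particular library.
import base64
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # must be stored and managed securely
nonce = os.urandom(12)                     # must be unique per encryption

card_number = "4111111111111111"           # 16 numeric characters
ciphertext = AESGCM(key).encrypt(nonce, card_number.encode(), None)

print(len(card_number))   # 16
print(len(ciphertext))    # 32 -- the payload plus a 16-byte authentication tag
print(base64.b64encode(nonce + ciphertext).decode())
# Binary/base64 output: it no longer fits a field that expects exactly 16 digits,
# so every downstream system that touches the value has to change.
```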

Masking

Masking has its applications, but in the context of analytics it significantly limits your ability to understand your data: once data has been masked, the original values are gone, and it’s very difficult to run meaningful analytics.

Tokenization is the answer to big data security

With the methods listed above, protecting the data creates a dilemma: either you keep the data protected, or you de-protect it to allow for analysis; you can’t perform analytics on data while it’s in a protected state. Fortunately, there is an alternative called tokenization.

With tokenization, enterprises can protect sensitive information within big data environments without impacting the ability to use the data in existing applications and systems. It also allows you to comply with regulatory requirements without prohibiting or restricting access to data sets that contain sensitive information. Furthermore, it prevents costly and reputation-damaging data breaches by converting sensitive data into a state that is useless to attackers. And best of all, it allows you to gain insights from and monetize your data without compromising on security.

How tokenization works

Tokenization replaces clear-text values with tokens, and only the sensitive data elements are replaced. The protection is format-preserving, so the data retains its referential integrity and can still be used for analytics. Let’s take the name “Clint Eastwood” as an example:

  • Clear text: Clint Eastwood
  • Encrypted: aK#3âfHs*43^sF%421J&a7F_2;b)*6&a2%3;oF{_*$()#f$w(*&
  • Masked: Clint E*******
  • Tokenized: Clint Etafcajf

With tokenization, you have a one-to-one correspondence between the clear-text value and its token.
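
As an illustration only, here is what a deterministic, format-preserving token generator might look like. This is a toy sketch, not comforte’s algorithm or any production scheme: the token keeps the length and character classes of the original, and the same input always yields the same token, so joins, group-bys and counts on tokenized columns still work.

```python
# Toy sketch of deterministic, format-preserving tokenization -- for
# illustration only; not comforte's algorithm or a production scheme.
import hashlib
import hmac
import string

SECRET = b"demo-tokenization-key"  # hypothetical key; real systems manage keys securely

def tokenize(value: str, keep_prefix: int = 1) -> str:
    """Replace a sensitive value with a token of the same length and format."""
    digest = hmac.new(SECRET, value.encode(), hashlib.sha256).digest()
    out = []
    for i, ch in enumerate(value):
        if i < keep_prefix or not ch.isalnum():
            out.append(ch)  # keep the leading character(s), spaces and punctuation
        elif ch.isdigit():
            out.append(string.digits[digest[i % len(digest)] % 10])
        elif ch.isupper():
            out.append(string.ascii_uppercase[digest[i % len(digest)] % 26])
        else:
            out.append(string.ascii_lowercase[digest[i % len(digest)] % 26])
    return "".join(out)

print(tokenize("Eastwood"))            # keeps the 'E' and the 8-letter shape of the name
print(tokenize("Eastwood"))            # identical token every time: a 1-to-1 mapping,
                                       # so joins, group-bys and counts still work
print(tokenize("371449635398431", 0))  # a card-style number stays all digits, same length
```

Real tokenization products rely on vetted format-preserving schemes, token vaults or vaultless designs, and proper key management; the sketch above only shows the shape of the idea: same format, same length, one-to-one.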

Data-centric security is guided by two principles: protect sensitive data as soon as it’s created, and only de-protect it when absolutely necessary. This ensures the highest level of security and minimizes the risk of accidental exposure. Combined with good IAM integration, you can decide on a case-by-case basis who has the right to access data in the clear, who has access to tokens, and who has no access at all.
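
Here is a sketch of what that case-by-case decision might look like in code; the role names and policy table below are hypothetical, not taken from any particular IAM product.

```python
# Minimal sketch of case-by-case, role-based de-protection. The role names and
# policy table are hypothetical, not taken from any specific IAM product.
from enum import Enum
from typing import Callable, Optional

class Access(Enum):
    CLEAR = "clear"   # may de-tokenize and see the real value
    TOKEN = "token"   # may work with tokens only (enough for most analytics)
    NONE = "none"     # no access at all

POLICY = {
    "fraud-investigator": Access.CLEAR,
    "data-scientist":     Access.TOKEN,
    "contractor":         Access.NONE,
}

def resolve(role: str, token: str, detokenize: Callable[[str], str]) -> Optional[str]:
    """Return what this role is allowed to see for a given token."""
    access = POLICY.get(role, Access.NONE)  # unknown roles get nothing (default deny)
    if access is Access.CLEAR:
        return detokenize(token)            # de-protect only when absolutely necessary
    if access is Access.TOKEN:
        return token
    return None

# Analysts see only tokens; the real value never leaves the protection layer.
print(resolve("data-scientist", "Etafcajf", detokenize=lambda t: "Eastwood"))  # Etafcajf
print(resolve("contractor", "Etafcajf", detokenize=lambda t: "Eastwood"))      # None
```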

With this kind of solution, every stakeholder in your big data project wins. Analysts are happy because the data is still meaningful and relevant. IT is happy because the solution can be implemented with little to no changes to source code. Data security teams are happy because the Hadoop cluster doesn’t contain sensitive data, only tokens. And compliance and risk management are happy because real data can’t be exposed.

Protect sensitive data wherever it goes

Protecting data as far upstream as possible gives you the strongest protection, because the data can then travel anywhere in your enterprise and still be protected. Point solutions, vendor-specific solutions, and network- or storage-layer protection can’t offer that level of security. With tokenization, over 90% of workloads can run on protected data without revealing any sensitive information, giving you the ability to protect the entire enterprise.


*** This is a Security Bloggers Network syndicated blog from comforte Insights authored by Felix Rosbach. Read the original post at: https://insights.comforte.com/big-data-security-series-part-3-how-to-run-analytics-on-protected-data