Part I: Unmasking the Real Value of Masked Data

by Delphix on September 12, 2019

Data security is critical to business survival, and data masking is a key component. Some enterprises protect data by simply redacting sensitive values; others employ fake data or write their own homegrown tools or scripts to mask certain environments.

We need to protect data in a way that keeps sensitive information safe and preserves our capacity to test effectively, maintain rapid feature delivery, and draw business insights from data.

Steering Clear of Data Mismatch

In our app-driven world, there are always concerns over test coverage, testing velocity, and tester productivity. When you test against datasets that originate from different points in time or you run a second test without rolling data back, you create the conditions for data mismatch.

Suppose you have dataset A from January 1 and dataset B from February 1. Even if both datasets contain “good” test data, testing against them together can yield bad results because dataset B reflects a month of changes that dataset A does not. The same can happen if you rerun a test against a “dirty” dataset because resetting or re-masking it costs too much. The resulting test failures are notoriously hard to trace and correct because the state of the data is fluid or its characteristics no longer match the original dataset.

Data’s Slippery Slope

Well-masked production data preserves existing implicit data relationships across the enterprise. For example, just as with your real data, a newborn probably shouldn’t have an AARP enrollment date. In addition, making sure key data elements mask the same way in disparate systems is crucial for testing. If you mask system A and convert Jason to Bob, but in system B you convert Jason to Chris, matching up test results becomes difficult and labor-intensive.
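One common way to get the same masked value in every system is to derive it deterministically from the original value and a shared secret. The following is a minimal sketch of that idea, not a description of any particular product's algorithm. The function name `mask_name`, the tiny `FAKE_NAMES` pool, and the key value are all hypothetical, and the sketch assumes every system masks with the same key.

```python
import hashlib
import hmac

# Hypothetical pool of replacement names; a real tool would use a far larger one
# to reduce collisions between different original names.
FAKE_NAMES = ["Bob", "Chris", "Dana", "Elena", "Farid", "Grace", "Hiro", "Ines"]


def mask_name(name: str, secret_key: bytes) -> str:
    """Map a real name to a fictitious one, deterministically for a given key."""
    # Normalize so "Jason" and " jason " are treated as the same person.
    normalized = name.strip().lower().encode("utf-8")
    # A keyed hash makes the mapping repeatable for key holders, while the
    # original value cannot be read back from the masked value alone.
    digest = hmac.new(secret_key, normalized, hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(FAKE_NAMES)
    return FAKE_NAMES[index]


# Assumption: both systems are masked with the same key.
shared_key = b"example-masking-key"
masked_in_system_a = mask_name("Jason", shared_key)
masked_in_system_b = mask_name(" jason ", shared_key)

# Jason becomes the same fictitious person in both systems.
assert masked_in_system_a == masked_in_system_b
```

Because the mapping depends only on the input value and the key, rerunning the masking job on refreshed data gives the same result for unchanged names, so test results from different systems can still be matched up.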

We could say this is not a problem for unstructured data, but the reality is that unstructured data almost always requires validation or is exchanged with applications that do have structured data. In most cases, production data has solved the implicit relationship problems at scale as well as validated that implausible or impossible values don’t exist.

When you mask data consistently, you can preserve implicit data relationships and beneficial data characteristics without complications. Other approaches, however, have to intuit or declare these data rules for every system that’s added. It can become very expensive to declare all the rules needed for the data to reflect the characteristics it ought to have, and that expense often rises in proportion to the number of systems tested together.

Uncovering the Business Value of Masked Data

Data masking, also referred to as de-identification or obfuscation, is a method of protecting sensitive data by replacing the original value with a fictitious but realistic equivalent that is valuable to testers, developers, and data scientists. But what makes masked data both protected and useful? While there are plenty of masking technologies on the market, here are the three must-have characteristics of a solution that will preserve the data’s business value and ultimately enable quicker testing and better insights for the enterprise.

Referentially consistent – It maintains referential integrity at scale (within and across applications) in a manner consistent with the unmasked data, so that you find real errors instead of having to build new data that has integrity or chasing ghost errors caused by mismatched data.
Representative – It shares many characteristics with the original cleartext, including producing similar test results and data patterns that can yield insight.
Realistic – The values in masked fields are fictitious but plausible, meaning that the values reflect real life scenarios and the relationships are consistent, but the referent (e.g., the customer whose name you retrieved) doesn’t exist. (So realistic, in fact, that a data thief wouldn’t even know that the data is masked just by looking.)
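To make the first characteristic concrete, here is a small sketch of a referential consistency check. It assumes two hypothetical tables, `customers` and `orders`, joined on a `customer_id` column, and a hypothetical helper `mask_id` that masks the key with a keyed hash. Only the key column is masked here to keep the example short.

```python
import hashlib
import hmac


def mask_id(customer_id: str, secret_key: bytes) -> str:
    """Replace an identifier with a deterministic, fictitious one."""
    digest = hmac.new(secret_key, customer_id.encode("utf-8"), hashlib.sha256)
    return "C" + digest.hexdigest()[:12]


def orphaned_orders(customers: list, orders: list) -> list:
    """Return orders whose customer_id matches no row in customers."""
    known_ids = {row["customer_id"] for row in customers}
    return [row for row in orders if row["customer_id"] not in known_ids]


# Hypothetical unmasked data: every order points at an existing customer.
customers = [{"customer_id": "1001"}, {"customer_id": "1002"}]
orders = [
    {"order_id": "A1", "customer_id": "1001"},
    {"order_id": "A2", "customer_id": "1002"},
    {"order_id": "A3", "customer_id": "1001"},
]

key = b"example-masking-key"  # assumption: one key shared by both tables
masked_customers = [{**c, "customer_id": mask_id(c["customer_id"], key)} for c in customers]
masked_orders = [{**o, "customer_id": mask_id(o["customer_id"], key)} for o in orders]

# The join survives masking: no order lost its customer.
assert orphaned_orders(customers, orders) == []
assert orphaned_orders(masked_customers, masked_orders) == []
```

If the two tables were masked independently, for example with different keys, the same check would report orphaned orders. Those are the ghost errors described above: failures caused by the masking process rather than by the application under test.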

On a more general note, masking is typically:

Irreversible – The original protected data is not recoverable from the masked data.

Repeatable – It can be done again (and on command) as data and metadata evolve.

Making Data Your Innovation Superpower

When the cycle to get freshly masked data takes too long, you slow down development by forcing teams to work with stale data, which hurts productivity and produces ghost errors. Even if you can get masked data fast, ghost errors still occur when that data is not referentially consistent or synchronous. Similarly, insight drawn from masked data can suffer if the process doesn’t maintain distributed referential consistency. If you’re forced to redact data with dummy values, it may be perfectly “valid” but worthless for insight.

In short, masked data with business value yields better testing and better insights on the data. In a world where every company is becoming a data company, you need a platform-based approach, like DataOps, that can bring enormous business value by accelerating velocity (in both the feature delivery pipeline and the masked data delivery pipeline), lowering the cost of change and error, and making business insights readily available, all while implementing the data protection you need. In the next installment, we’ll explore how a DataOps strategy is critical for laying the foundation for agility, security, and transformational change in enterprise organizations.