Synthetic Data Removes Data Privacy Risks

The idea that data has value certainly isn’t new. It’s been called the new oil, the new gold – in fact, insert any rare commodity, and someone has probably created an analogy! Ironically, though, now that this value is almost universally recognized, it has become far harder, unlike with any of those other commodities, for companies and investors to actually monetize that data. Across every industry, the ability – or lack thereof – to leverage data and analytics is quickly separating the winners from the losers.

The regulatory environment, first and foremost, has created a sense of uncertainty and risk that has limited how organizations seek to use data. Everyone is familiar with GDPR, of course. But in the C-suite, executives are probably more familiar with the penalties for noncompliance than with where the guardrails actually sit in terms of what the regulations allow. And the penalties – up to four percent of annual global revenue – can overshadow the anticipated payoff of most base-case projections associated with any data project. The passage of the California Consumer Privacy Act in 2018 only added another set of rules around how organizations can collect, process, store and ultimately monetize data.

What is Synthetic Data?

There is a solution, one that many C-suite executives aren’t familiar with – synthetic data. Synthetic data is produced by applying an algorithm that converts a real data set into a re-fabricated one that retains the statistical properties of the original database, without any of the personal or financial information that is so sensitive to regulators and clients. If organizations were previously paralyzed because they weren’t quite sure where the boundaries were in terms of what data they could use and in what circumstances, synthetic data helps to eliminate the uncertainty.

The first question posed by most CIOs is generally, “How is synthetic data any different than anonymized data?” There’s a technical (read: more complicated) answer and a simpler answer. The simpler answer is that anonymized and pseudonymized data can be reverse engineered to reveal sensitive information. That’s not possible with synthetic data.
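To make the simpler answer concrete, here is a minimal sketch of one way pseudonymized data can be reversed. It assumes a hypothetical data set in which a five-digit customer number was "anonymized" with an unsalted SHA-256 hash; the function names and the customer number are illustrative, not taken from any real system. Because the space of possible values is small, an attacker can simply hash every candidate and look for a match.

```python
import hashlib


def pseudonymize(customer_number: str) -> str:
    """Replace an identifier with its unsalted SHA-256 hash (hypothetical scheme)."""
    return hashlib.sha256(customer_number.encode("utf-8")).hexdigest()


def reverse_by_enumeration(target_hash: str, candidates) -> "str | None":
    """Hash every candidate value and return the one that matches, if any."""
    for candidate in candidates:
        if pseudonymize(candidate) == target_hash:
            return candidate
    return None


# The value that appears in the released, "anonymized" data set.
released_value = pseudonymize("04217")

# The attacker only needs to know the identifier is five digits long.
all_five_digit_numbers = (f"{i:05d}" for i in range(100_000))
recovered = reverse_by_enumeration(released_value, all_five_digit_numbers)

print(recovered)
```

Salting the hash or using a keyed hash makes this particular attack harder, but the pseudonymized record still maps one-to-one onto a real person, which is the property the article argues synthetic data avoids.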

The technical, more complicated answer is it’s not just about obfuscating data through hash functions; synthetic data also injects mathematical noise into the transaction fields, using multiple algorithms selected at random for each and every data record. From a security perspective, this translates into the inability to do exact and fuzzy matching, while still maintaining 99.9% statistical accuracy with the original data set.
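The general idea described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method used by any particular vendor: the record layout (`card_id`, `amount`), the salt, and the three noise functions are all hypothetical, and the noise levels were picked arbitrarily. Each record's identifier is replaced with a salted hash, and its amount is perturbed by a zero-mean noise function chosen at random for that record, so individual values no longer match the originals while aggregates stay close.

```python
import hashlib
import random
import statistics


def hash_id(value: str, salt: str) -> str:
    """Salted one-way hash so the raw identifier never appears in the output."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


# Hypothetical zero-mean noise functions; the scales are arbitrary choices.
def gaussian_noise(amount: float, rng: random.Random) -> float:
    return amount + rng.gauss(0.0, 0.05 * amount)


def uniform_noise(amount: float, rng: random.Random) -> float:
    return amount * rng.uniform(0.95, 1.05)


def triangular_noise(amount: float, rng: random.Random) -> float:
    return amount * rng.triangular(0.9, 1.1, 1.0)


NOISE_FUNCTIONS = [gaussian_noise, uniform_noise, triangular_noise]


def synthesize(records, salt: str, seed=None):
    """Hash each identifier and perturb each amount with a randomly chosen noise function."""
    rng = random.Random(seed)
    synthetic = []
    for record in records:
        noise = rng.choice(NOISE_FUNCTIONS)  # a different algorithm may be picked per record
        synthetic.append({
            "card_id": hash_id(record["card_id"], salt),
            "amount": round(noise(record["amount"], rng), 2),
        })
    return synthetic


# Made-up transactions for demonstration only.
data_rng = random.Random(42)
original = [
    {"card_id": f"card-{i}", "amount": round(data_rng.uniform(5, 200), 2)}
    for i in range(5000)
]
synthetic = synthesize(original, salt="example-salt", seed=7)

print("original mean: ", statistics.mean(r["amount"] for r in original))
print("synthetic mean:", statistics.mean(r["amount"] for r in synthetic))
```

Because every noise function here has an expected effect of zero, the per-record errors largely cancel in aggregate, which is why the two means printed at the end should land close together. How close a production system gets, and whether it reaches a figure like 99.9%, depends on the algorithms and data involved.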

Making the Business Case(s)

Anytime a synthetic data set can be used, personal privacy is protected. This may not seem like a big deal, but from a use-case perspective, it truly democratizes data across an enterprise. These protections allow organizations to limit any legal exposure related to data security. More importantly, though, in terms of being able to leverage and monetize data, synthetic data falls outside of the scope of GDPR and other regulations. This eliminates the biggest sticking point, which opens up a whole world of possibilities.

Simply letting different groups use the data internally, regardless of permissions or clearances, creates a whole array of use cases. In a retail setting, for instance, business analysts can suddenly analyze data that would otherwise be too sensitive, and this informs a whole range of business decisions, from merchandising strategies to operations. Without concerns about running afoul of regulators, a mental shift generally occurs that translates into more creative analysis and deeper conviction in the results. Real estate decisions, for instance, may be informed by analysis of average ticket sizes at nearby stores. Consultants and other data vendors, meanwhile, can monetize this data more directly, or use synthetic data to fill in any holes and create more complete data sets.

Another consideration, particularly in financial services and among fintechs, is that so many of these organizations haven’t really been able to leverage the cloud due to data security concerns. They’ve had to invest heavily to build out on-premises data warehouses, but this isn’t an ideal environment for data scientists to engage in deep learning using GPUs that aren’t found in most legacy data centers. Synthetic data allows these organizations to take this sensitive data and put it into the cloud. This is a game changer for many organizations.

Beyond just the opportunities available to organizations, an overlooked benefit is that synthetic data also helps improve data governance and mitigates the risk of a data breach or enforcement action. At the same time that it unlocks the value of an organization’s data, synthetic data also protects that data, expanding the opportunity set while reducing the risk.


Lorn Davis

Lorn Davis leads corporate and product strategy at Facteus. Lorn brings in-depth experience in strategy development and product management from his work at SeraCare Life Sciences, where he was the Senior Director of Data Products & Analytics and a member of the Executive Leadership Team that led the business to a successful exit at the end of 2018. Prior to joining SeraCare, he worked as a consultant leading data analytics teams engaged with large government and private organizations. Lorn earned his BA in Liberal Arts from St. John’s College (NM) and his MBA from Harvard Business School with a focus on technology and entrepreneurship.
