Securing Data Catalog Implementation
If your data engineering team has reached out to you for security approval of a particular data catalog vendor, and you are wondering what a data catalog solution can do, what its purpose is and how to securely integrate it into your data stack workflow, then you are in the right place. In this article I would like to briefly go over what a data catalog is, some existing vendors in the market, its use cases and how to secure its implementation.
What is a Data Catalog?
A data catalog is a one-stop solution for all your data-related questions: it gathers metadata about your data assets, data infrastructure, data quality, metrics etc., and you can tailor it to your team's and organization's needs. A data catalog can help you contextualize your data, which in turn can help in making business-related decisions. For example, it can help you determine the number of active users of your product for the last quarter, providing data points that help you drive more user traction.
What Can a Data Catalog Help You Answer?
At any point in time, a data catalog helps you determine whether columns have been changed or deleted in your dataset, whether a specific data source is using a lot of processing power, or whether your dataset contains sensitive information that requires masking.
Data Catalog Solutions Available in the Market and In-House Solutions
There are many data catalog vendors available in the market, such as Atlan, Informatica, Collibra, Alation and Alteryx Connect, to name a few. Many of these vendors offer their solution as SaaS, avoiding the overhead of setting up and managing your own infrastructure for another data tool. Some companies, such as Netflix, Uber, Airbnb and LinkedIn, have developed their own in-house data catalogs customized to their needs.
Is a Data Catalog Just Another Data Tool in Your Data Stack With Management and Cost Overhead, or Does It Solve Use Cases for You?
In this section, let's discuss some real-world use cases of a data catalog:
Integrations: A data catalog can be integrated with your entire data stack, such as data warehouses, dbt, ETL tools, dashboards, Slack, Jira etc. The data catalog curates data from various sources, so the metadata that has been cataloged no longer sits in one place unused, but flows into your workflow to define the purpose of the data.
Avoid wasting time looking for the appropriate dataset for your needs: For instance, if you manage a company that has an app for ordering food and you want to see how many orders were placed in the past month, you'll need to find the orders dataset among other datasets, such as partner restaurants, sales data, user information, etc. Similarly, you can accurately discover who owns a dataset, whether it is actively being used, which tables are used, and so on.
When the business team asks for specific information, the metadata captured by the data catalog solution helps you provide answers in minutes, as opposed to the hours or days it used to take.
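To make this concrete, here is a minimal sketch of what dataset discovery could look like against a catalog's search API. The endpoint, parameters and response fields here are hypothetical placeholders; each vendor (Atlan, Alation, etc.) exposes its own API, so consult your vendor's documentation.

```python
import requests

# Hypothetical catalog search endpoint -- consult your vendor's API docs
# for the real URL, authentication scheme and response schema.
CATALOG_URL = "https://catalog.internal.example.com/api/v1/search"

def find_datasets(keyword: str, token: str) -> list[dict]:
    """Search the catalog for datasets matching a business term."""
    resp = requests.get(
        CATALOG_URL,
        params={"query": keyword, "type": "dataset"},
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["results"]

# e.g. surface owner and usage info for the "orders" dataset
for ds in find_datasets("orders", token="<api-token>"):
    print(ds["name"], ds["owner"], ds["last_accessed"])
```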
Collaborative workspaces: A data catalog creates common ground for different teams in your organization to come together, experiment on a business strategy, preview the data and create something useful from their collective datasets.
Data Lineage: It is crucial to understand a dataset's source and origin, how it has been transformed over time, and its usage and application in different scenarios to determine whether the data is suitable for business-related decision-making. Data catalogs provide a visual representation of the data lifecycle from inception to its current state.
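Conceptually, lineage is a directed graph that runs from raw sources through transformations to consumers. Below is a minimal illustrative sketch using the networkx library; the table, job and dashboard names are made up:

```python
import networkx as nx

# Lineage as a directed graph: nodes are data assets and jobs,
# edges point from upstream to downstream (illustrative names).
lineage = nx.DiGraph()
lineage.add_edges_from([
    ("raw.orders", "dbt.stg_orders"),
    ("raw.restaurants", "dbt.stg_restaurants"),
    ("dbt.stg_orders", "mart.monthly_orders"),
    ("dbt.stg_restaurants", "mart.monthly_orders"),
    ("mart.monthly_orders", "dashboard.kpi_orders"),
])

# Everything upstream of the dashboard -- what it ultimately depends on.
print(nx.ancestors(lineage, "dashboard.kpi_orders"))

# Everything downstream of a raw table -- what breaks if its schema changes.
print(nx.descendants(lineage, "raw.orders"))
```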
Data Governance: At their core, data catalog features build a foundation for an organization-wide data governance framework. Data governance establishes policies and procedures on how to securely collect, store, manage, use and delete data, and it also ensures compliance with regulatory standards.
Not all solutions available in the market may be suitable for your use cases; it is better to demo a platform before bringing it on board and integrating it into your data stack.
How do Data Catalogs Extract the Metadata From Your Data Sources?
Data catalogs crawl through the connected data sources to discover metadata, using AI and machine learning capabilities to fetch it. These capabilities not only help in extracting tags and metadata; they can also discover new datasets added to your stack and enable natural language search using non-technical or business terms, making data easier to find and access.
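Under the hood, a basic crawler is often little more than queries against a source's system catalog, with ML-based tagging and search layered on top. Here is a minimal sketch of crawling MySQL's information_schema; the host and credentials are placeholders, and a real crawler would pull them from a secrets manager (see the recommendations below):

```python
import mysql.connector  # pip install mysql-connector-python

# Placeholder read-only credentials; a real crawler would fetch these
# from a secrets manager rather than hard-coding them.
conn = mysql.connector.connect(
    host="mysql.internal.example.com",
    user="catalog_reader",
    password="<from-secrets-manager>",
    database="information_schema",
)

cur = conn.cursor()
# Pull the table- and column-level metadata the catalog will index.
cur.execute(
    """
    SELECT table_schema, table_name, column_name, data_type
    FROM columns
    WHERE table_schema NOT IN ('mysql', 'sys', 'performance_schema')
    ORDER BY table_schema, table_name, ordinal_position
    """
)
for schema, table, column, dtype in cur.fetchall():
    print(f"{schema}.{table}.{column}: {dtype}")
conn.close()
```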
When You Decide to Bring Another Tool Onboard, You Need to Ensure That It's Securely Integrated and Implemented
Some recommendations for securing the data catalog implementation:
Note: the following recommendations generally apply to both vendor and in-house data catalog solutions.
Secure the instance: Ensure the data catalog instance can only be accessed over your company network. Make sure that the instance and its URL are not discoverable over the internet.
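As an illustration, if the catalog runs on AWS you could restrict inbound HTTPS to your corporate network's CIDR range with a security group rule. A sketch using boto3, where the security group ID and CIDR are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# Allow HTTPS only from the corporate network; with no 0.0.0.0/0 rules,
# the instance is not reachable from the public internet.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder security group ID
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{
            "CidrIp": "10.0.0.0/8",  # placeholder corporate CIDR range
            "Description": "Corporate network only",
        }],
    }],
)
```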
Authentication and authorization: If you're using username-password login or SAML 2.0-based SSO to log in to your data catalog, ensure password strength meets NIST standards and enforce two-factor authentication and role-based authorization.
- Define policies to ensure only authorized users can access datasets with sensitive or confidential data.
- Make sure user information such as roles, policies and user-role mappings is stored securely in the backend, and that passwords are hashed and stored securely (see the sketch after this list).
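If you do run local username-password accounts, salted password hashing and a role-based access check might look like the following sketch. It uses Python's standard hashlib; the role-to-dataset policy model is purely illustrative:

```python
import hashlib
import hmac
import os

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Hash a password with a per-user random salt (scrypt, stdlib)."""
    salt = os.urandom(16)
    digest = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    """Constant-time comparison against the stored hash."""
    candidate = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return hmac.compare_digest(candidate, digest)

# Illustrative role-based check: only data stewards may open PII datasets.
ROLE_POLICIES = {"data_steward": {"pii_datasets"}, "analyst": set()}

def can_access(role: str, resource_tag: str) -> bool:
    return resource_tag in ROLE_POLICIES.get(role, set())

assert can_access("data_steward", "pii_datasets")
assert not can_access("analyst", "pii_datasets")
```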
Secure the integrations: Create a separate set of connection strings for each of your data sources (such as MySQL, Snowflake and your data lake) that are used to connect to the data catalog. Ensure connection strings/credentials are stored in a secrets manager or vault, with an IAM role assigned to the data catalog for secure access before establishing a connection, as sketched after this list.
- Assign only read permissions to the connection strings for the data sources unless otherwise required.
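As a concrete example, with AWS Secrets Manager the catalog's integration code would fetch credentials at connection time instead of reading them from a config file. The secret name below is a placeholder, and Vault or other secret stores offer equivalent APIs:

```python
import json

import boto3

secrets = boto3.client("secretsmanager")

def get_connection_credentials(secret_name: str) -> dict:
    """Fetch a data source's credentials at connection time.

    The catalog's IAM role needs secretsmanager:GetSecretValue on this
    secret only -- nothing is hard-coded in configuration files.
    """
    resp = secrets.get_secret_value(SecretId=secret_name)
    return json.loads(resp["SecretString"])

# Placeholder secret name for the Snowflake read-only connection.
creds = get_connection_credentials("data-catalog/snowflake-readonly")
```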
Rotate Secrets: Ensure the secrets/credentials configured for integrations with the data catalog are rotated periodically, typically every 90 days.
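With AWS Secrets Manager, for instance, a rotation schedule can enforce the 90-day policy automatically. In this sketch, the secret name and the ARN of the Lambda function that performs the rotation are placeholders:

```python
import boto3

secrets = boto3.client("secretsmanager")

# Rotate the catalog's Snowflake credentials every 90 days via a
# rotation Lambda (placeholder ARN) that updates both the data source
# and the stored secret.
secrets.rotate_secret(
    SecretId="data-catalog/snowflake-readonly",
    RotationLambdaARN=(
        "arn:aws:lambda:us-east-1:123456789012:function:rotate-snowflake"
    ),
    RotationRules={"AutomaticallyAfterDays": 90},
)
```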
Encryption in transit: When the data catalog connects to your data sources, make sure that the data and metadata transmitted over the network are TLS-encrypted (HTTPS). If you are connecting sensitive datasets, it is recommended to use a private link connection instead of the public internet for secure data transfer.
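On the connection side, it is safer to require and verify TLS explicitly rather than rely on driver defaults. Here is a sketch for a MySQL source; the host and CA bundle path are placeholders (Snowflake's connector, by contrast, enforces TLS by default):

```python
import mysql.connector

# Require TLS and verify the server certificate so metadata never
# travels in cleartext; host, credentials and CA path are placeholders.
conn = mysql.connector.connect(
    host="mysql.internal.example.com",
    user="catalog_reader",
    password="<from-secrets-manager>",
    ssl_ca="/etc/ssl/certs/corp-ca.pem",
    ssl_verify_cert=True,
)
```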
Encryption at rest: Ensure that the metadata and user information captured by the data catalog are stored encrypted at rest at all times in the underlying infrastructure.
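If, for example, the catalog persists its metadata in S3, you can verify that default server-side encryption is enabled on the bucket. The bucket name below is a placeholder:

```python
import boto3

s3 = boto3.client("s3")

# Confirm default server-side encryption is configured on the bucket
# backing the catalog's metadata store (placeholder bucket name).
enc = s3.get_bucket_encryption(Bucket="data-catalog-metadata")
for rule in enc["ServerSideEncryptionConfiguration"]["Rules"]:
    print(rule["ApplyServerSideEncryptionByDefault"]["SSEAlgorithm"])
```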
Logging and Monitoring: Enable logging and monitoring for all user activities within the platform to detect any malicious or anomalous activity.
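Whatever backend you use, structured audit events make anomalous-activity detection practical. Here is a minimal sketch of the kind of record worth emitting for every catalog action; the field names are illustrative, and in practice these events would be shipped to your SIEM:

```python
import json
import logging
from datetime import datetime, timezone

audit = logging.getLogger("catalog.audit")
logging.basicConfig(level=logging.INFO)

def log_event(user: str, action: str, resource: str, allowed: bool) -> None:
    """Emit one structured audit record per user action; alert on
    patterns such as repeated denials or off-hours access."""
    audit.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "resource": resource,
        "allowed": allowed,
    }))

log_event("jdoe", "preview", "mart.pii_customers", allowed=False)
```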
Infrastructure Security: Ensure all the underlying infrastructure is secured from unauthorized access. Even if you choose a vendor-based solution, ensure that proper controls are in place for infrastructure security, vulnerability management, periodic application security testing and patch management.