Discover Module
The Discover module automates the discovery and tagging of data across your data platform. It covers the identification and classification of data using frameworks.
Requirements
- Native SDD enabled
- Frameworks enabled
- Registered Snowflake, Databricks, Redshift, or Starburst (Trino) data sources
Components
The Immuta UI has separate sections for identification frameworks and classification frameworks. Both framework types are made of rules, criteria, and resulting tags, but the criteria types differ for each framework type. Identification frameworks use competitive pattern analysis and column name matching to discover data types and tag them. Classification frameworks use tags on the column, neighboring columns, and data source for context and then tag the columns based on that context. Find more information about each framework type below.
Identification frameworks
Identification frameworks run with sensitive data discovery (SDD). They use data patterns to discover data and tag it based on what the data is.
Supported criteria and pattern types
- Competitive pattern analysis: This criteria reviews all the regex and dictionary patterns within the rules of the framework and searches for the pattern with the best fit. In this review, the competitive pattern analysis criteria in the framework compete against one another to find the best and most specific pattern that fits the data. The resulting tags for the best pattern's rule are then applied to the column.
  - Regex pattern: This pattern contains a case-insensitive regular expression that searches for matches against column values. Create a regex pattern in the UI or with the sdd/classifier endpoint.
  - Dictionary pattern: This pattern contains a list of words and phrases to match against column values. Create a dictionary pattern in the UI or with the sdd/classifier endpoint.
- Column name: This criteria matches a column name pattern to the column names in the data sources. The rule's resulting tags will be applied to the column where the name is found.
  - Column name pattern: This pattern includes a case-insensitive regular expression matched against column names, not against the values in the column. Create a column name pattern in the UI or with the sdd/classifier endpoint.
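The two identification criteria above can be illustrated with a minimal Python sketch. It is not Immuta's implementation: the Rule class, the scoring by share of matching sample values, the min_score threshold, the full-value regex match, the case-insensitive dictionary comparison, and all rule and tag names are assumptions made for illustration. Ties go to the first rule listed, so pattern specificity is not modeled here.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """Hypothetical identification rule: one pattern plus its resulting tags."""
    name: str
    tags: list
    regex: Optional[str] = None              # regex pattern, matched against values
    dictionary: frozenset = frozenset()      # dictionary pattern, matched against values
    column_name_regex: Optional[str] = None  # column name pattern

    def value_matches(self, value: str) -> bool:
        if self.regex is not None:
            # Assumption: the whole value must match; regexes are case-insensitive.
            return re.fullmatch(self.regex, value, re.IGNORECASE) is not None
        # Assumption: dictionary entries are compared case-insensitively.
        return value.casefold() in {w.casefold() for w in self.dictionary}


def best_fit(rules, sample_values, min_score=0.5):
    """Let value-based rules compete; return the one matching the most samples."""
    best, best_score = None, 0.0
    for rule in rules:
        if rule.regex is None and not rule.dictionary:
            continue  # column-name-only rules do not take part in the competition
        hits = sum(rule.value_matches(v) for v in sample_values)
        score = hits / len(sample_values) if sample_values else 0.0
        if score >= min_score and score > best_score:
            best, best_score = rule, score
    return best


def column_name_tags(rules, column_name):
    """Collect resulting tags of every rule whose name pattern matches the column name."""
    tags = []
    for rule in rules:
        if rule.column_name_regex and re.search(
            rule.column_name_regex, column_name, re.IGNORECASE
        ):
            tags.extend(rule.tags)
    return tags


# Hypothetical rules and tag names, for illustration only.
rules = [
    Rule("us_phone", ["Example.Phone"], regex=r"\d{3}-\d{3}-\d{4}"),
    Rule("us_ssn", ["Example.SSN"], regex=r"\d{3}-\d{2}-\d{4}"),
    Rule("country", ["Example.Country"], dictionary=frozenset({"France", "Japan"})),
    Rule("email_column", ["Example.Email"], column_name_regex=r"e[-_]?mail"),
]
samples = ["555-123-4567", "555-987-6543", "123-45-6789"]

winner = best_fit(rules, samples)
name_tags = column_name_tags(rules, "Customer_EMAIL")
print(winner.name, winner.tags, name_tags)
```

In this sketch the phone rule matches two of the three sample values and the SSN rule matches one, so only the phone rule's tags would be applied to the column. The column name rule is evaluated separately and never looks at values.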
Related guides
- To start using identification frameworks in the UI, see the Getting started guide.
- To manage identification frameworks with the API, see the /sdd/template endpoint reference guide.
Classification frameworks
Classification frameworks run with the classify service. They evaluate each rule's criteria against the tags already present on the column, on neighboring columns, and on the data source, and then tag data based on that context.
Supported criteria
- Match column tag: This criteria applies resulting tags based on specific tags already on the column.
- Match neighboring column tag: This criteria applies resulting tags based on specific tags on neighboring columns.
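As a rough illustration of how these two criteria could combine, the sketch below applies a rule's resulting tags to a column only when every criterion set on the rule is satisfied. This is an assumption-laden model, not the classify service's actual logic: the ClassificationRule class, the treatment of every other column in the same data source as a "neighbor", the requirement that all criteria match, and the tag names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClassificationRule:
    """Hypothetical classification rule with up to two criteria."""
    resulting_tags: list
    column_tag: Optional[str] = None    # "match column tag" criteria
    neighbor_tag: Optional[str] = None  # "match neighboring column tag" criteria


def classify(columns, rules):
    """columns maps column name -> set of tags already on that column.

    Returns the tags each column would gain. Assumption: a neighbor is any
    other column of the same data source, and all set criteria must match.
    """
    added = {name: set() for name in columns}
    for name, tags in columns.items():
        neighbor_tags = set().union(
            *(t for other, t in columns.items() if other != name)
        )
        for rule in rules:
            if rule.column_tag is None and rule.neighbor_tag is None:
                continue  # a rule without criteria matches nothing in this sketch
            if rule.column_tag is not None and rule.column_tag not in tags:
                continue
            if rule.neighbor_tag is not None and rule.neighbor_tag not in neighbor_tags:
                continue
            added[name].update(rule.resulting_tags)
    return added


# Hypothetical tags: a person name next to health data becomes more sensitive.
columns = {
    "patient_name": {"Example.PersonName"},
    "diagnosis": {"Example.Health"},
    "zip": set(),
}
rules = [
    ClassificationRule(
        resulting_tags=["Example.Framework.PHI"],
        column_tag="Example.PersonName",
        neighbor_tag="Example.Health",
    )
]
new_tags = classify(columns, rules)
print(new_tags)
```

Only patient_name satisfies both criteria here, so it is the only column that gains the resulting tag.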
Related guides
- To manage classification frameworks in the UI, see the Activate frameworks guide.
- To create a classification framework with the API, see the /frameworks endpoint reference guide.
Data inventory dashboard
Private Preview
This feature is only available to select accounts.
The data inventory dashboard visualizes information about your organization's data. It presents your entire data corpus in the context of the frameworks that are actively tagging your data, with details such as when your data was last scanned and how much of the scanned data is relevant to your active frameworks.
In the data inventory dashboard, you will see tiles for scanned coverage and the percent of data scanned within a specific time frame. These tiles reflect data scanned by an identification framework with SDD. To increase the number of data sources that have been scanned, run SDD.
The next section of the dashboard shows tiles for the compliance frameworks. Each graph shows the split between columns found to contain the data important to the compliance framework and columns found not to contain it. These graphs update every time classification runs; the events that trigger classification are listed under Frequency.
For information on the frameworks visualized in the dashboard, see the Immuta frameworks reference guide.
Workflow
The Discover workflow involves both identification with SDD and classification:
- A user with the GOVERNANCE permission enables SDD and activates classification frameworks.
- Users register data in Immuta.
- SDD runs:
  - Immuta generates a SQL query using the identification framework's rules.
  - That query is executed in the native database.
  - Immuta receives the query results containing the column name and the matching rules but no raw data values.
  - SDD applies the resulting tags to the relevant columns.
- Classification runs:
  - The data source's current tags are checked against the framework's rules.
  - When a matching rule is found, the resulting tags are applied to the relevant columns.
- Users with the GOVERNANCE permission or data owners can view the data inventory dashboard with visualizations of their scanned data.
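The SDD steps above (generate a query from the rules, run it where the data lives, and get back only column names and matching rules) can be sketched as follows. The SQL that Immuta actually generates is not shown in this guide, so the query shape, the build_scan_query helper, the 0.8 match threshold, and the table, column, and rule names are all assumptions. An in-memory SQLite database stands in for the native database, with a user-defined REGEXP function registered because SQLite does not ship one.

```python
import re
import sqlite3


def regexp(pattern, value):
    # SQLite evaluates "X REGEXP Y" by calling regexp(Y, X).
    if value is None:
        return 0
    return int(re.fullmatch(pattern, str(value), re.IGNORECASE) is not None)


def build_scan_query(table, columns, rules):
    """Build one aggregate SELECT per (column, rule) pair, joined by UNION ALL.

    Only counts leave the database; no raw column values are selected.
    Identifiers cannot be bound as parameters, so this sketch assumes the
    table and column names are trusted.
    """
    parts, params = [], []
    for col in columns:
        for rule_name, pattern in rules.items():
            parts.append(
                "SELECT ? AS column_name, ? AS rule_name, "
                f'SUM(CASE WHEN "{col}" REGEXP ? THEN 1 ELSE 0 END) AS hits, '
                f'COUNT(*) AS total FROM "{table}"'
            )
            params += [col, rule_name, pattern]
    return " UNION ALL ".join(parts), params


# Stand-in for the native database, with hypothetical data.
conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE customers (phone TEXT, note TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("555-123-4567", "call back"), ("555-987-6543", "vip")],
)

rules = {"us_phone": r"\d{3}-\d{3}-\d{4}"}  # hypothetical rule name and pattern
sql, params = build_scan_query("customers", ["phone", "note"], rules)
results = conn.execute(sql, params).fetchall()

# Keep (column, rule) pairs where most values matched; 0.8 is an assumed cutoff.
matched = [
    (col, rule) for col, rule, hits, total in results
    if total and hits / total >= 0.8
]
print(matched)
conn.close()
```

The point of the design is that the result set carries only column names, rule names, and counts, which is consistent with the statement that Immuta receives no raw data values.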
Frequency
This workflow will run when a new data source is manually registered in Immuta or discovered through schema monitoring. Additionally, SDD alone will run on the following events:
- A new data source is created.
- Schema monitoring is enabled, and a new data source is detected.
- Column detection is enabled, and new columns are detected. Here, SDD will only run on new columns, and no existing tags will be removed or changed.
- A user manually triggers it from the data source health check menu.
- A user manually triggers it from the identification frameworks page.
- A user manually triggers it through the API.
Classification will run on the following events:
- A framework gets created, updated, or deleted.
- A tag gets added to or removed from a column manually or by SDD.
- A tag gets added to a data source.
- A user manually triggers it from the data source health check menu.
- A user manually triggers it through the API.
Caveat
- Customizing classification frameworks currently requires users to use the Immuta API.