In our previous article for this series, Purview Part 2: Data Catalog, we examined the part of the end-user experience where people will spend most of their time. But how does that Data Catalog get populated? It is populated by the Scanning and Classification features of Purview, which are the focus of this article.
Before you can use the Scanning and Classification features, a few prerequisites must be in place. You will need an Azure subscription, an Azure Purview account, Azure Key Vault (for managing data source credentials), and the appropriate Azure Purview roles.
Within your Azure Subscription, you will need administrative access permissions and the ability to create resources. The administrative access is required because you will have to register some Resource Providers if they do not already exist. Those resource providers are:
Azure Purview Account
Once your Azure subscription has been configured, you will need a Purview account. While you can have multiple Purview accounts (three max per tenant), you can only add one Purview account at a time. Part of creating the account is selecting its location (Azure region). The right location depends on your situation, but usually you want the region closest to where your data resides or, if you are primarily an on-premises organization, closest to your users. All Purview accounts are created with a default Data Map size of 1 capacity unit (CU), where 1 CU supports up to 25 data map operations per second and includes up to 2 GB of storage for your metadata. The Data Map is elastic, which means it automatically scales with the request load up to a maximum of 100 CUs. By default, scaling is capped at 10 times the steady-state capacity in order to control costs. For more detailed information about the cost and the Elastic Data Map, see the Purview Pricing page and the quotas for resources from Microsoft. Azure Purview accounts can be created in the Azure portal via your browser, but if you prefer to do it programmatically, they can also be created using Azure PowerShell, the .NET SDK, or Python.
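The capacity figures above lend themselves to a quick back-of-the-envelope check. The following is a minimal sketch that uses only the numbers quoted in this article (25 operations per second and 2 GB per CU, with a 100 CU ceiling). The function name and the example workload are hypothetical, and it assumes the required size is simply whichever of throughput or storage demands more CUs; Microsoft's actual scaling and billing logic may differ.

```python
import math

OPS_PER_CU = 25        # data map operations per second per CU (from the article)
STORAGE_GB_PER_CU = 2  # metadata storage in GB per CU (from the article)
MAX_CUS = 100          # elastic scaling ceiling (from the article)


def estimate_capacity_units(ops_per_second: float, metadata_gb: float) -> int:
    """Estimate the CUs needed to cover both throughput and storage."""
    by_throughput = math.ceil(ops_per_second / OPS_PER_CU)
    by_storage = math.ceil(metadata_gb / STORAGE_GB_PER_CU)
    needed = max(1, by_throughput, by_storage)  # every account starts at 1 CU
    if needed > MAX_CUS:
        raise ValueError(
            f"workload needs {needed} CUs, above the {MAX_CUS} CU maximum"
        )
    return needed


# Hypothetical workload: 60 operations per second and 3 GB of metadata.
print(estimate_capacity_units(60, 3))  # 3 (throughput is the limiting factor)
```

An estimate like this is only useful for rough sizing; the Purview Pricing page remains the authority on what a given load will cost.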
Azure Key Vault
Azure Key Vault is Microsoft’s cloud service for securely storing keys, secrets, and certificates. Purview uses Azure Key Vault to securely store your data source credentials. There are currently three supported authentication methods that use Azure Key Vault: Account Key, SQL Authentication, and Service Principal. You can also use Azure Purview Managed Identity, which does not require creating credentials in Azure Key Vault. Your authentication method will be determined by data source type and networking requirements. Microsoft wrote a great article on Credentials for source Authentication in Azure Purview to help you decide what authentication method to use.
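As a small illustration of the split described above, the sketch below records which of the four authentication methods need a credential stored in Azure Key Vault. The mapping comes straight from this article; the dictionary and function names are hypothetical and are not part of any Purview SDK.

```python
# Authentication methods named in the article, mapped to whether each one
# requires a credential stored in Azure Key Vault.
AUTH_METHODS = {
    "Account Key": True,
    "SQL Authentication": True,
    "Service Principal": True,
    "Managed Identity": False,  # no Key Vault credential required
}


def needs_key_vault(method: str) -> bool:
    """Return True when the method relies on a Key Vault credential."""
    try:
        return AUTH_METHODS[method]
    except KeyError:
        raise ValueError(f"unsupported authentication method: {method!r}") from None


print(needs_key_vault("Service Principal"))  # True
print(needs_key_vault("Managed Identity"))   # False
```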
Azure Purview Roles
Azure Purview Roles determine who can do what in Purview. In order to scan your data sources, one or more security principals need to be added to one of the predefined Data Plane roles: Purview Data Reader, Purview Data Curator, or Purview Data Source Administrator. Azure Purview roles support individual users, Azure Active Directory Groups, and Service Principals. By default, the creator of the Azure Purview Account will be treated as if they are in both the Purview Data Curator and Purview Data Source Administrator roles.
- Purview Data Reader: access to Purview Portal; read all content except scan bindings
- Purview Data Curator: access to Purview Portal
- Purview Data Source Administrator: no access to Purview Portal
Register your Data Sources
Now that the prerequisites have been met, the final step before scanning is to register your data sources. Registering can be done manually via the Purview portal or programmatically via the REST API. Supported data sources currently range from on-premises systems like SQL Server and Oracle DB, to SaaS sources like Power BI and SAP HANA, to cloud providers like Azure and Amazon S3. Check out the Supported Data Stores page from Microsoft for a complete and up-to-date list. If you have multiple data sources in Azure, Amazon, or Azure Synapse Analytics, you can register them in a single effort using the Multiple feature.
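For the programmatic route, the sketch below shows roughly what a registration request could look like. It only builds the HTTP request and never sends it. The URL shape, the API version string, and the payload fields are assumptions based on the preview scanning REST API and should be checked against Microsoft's current REST reference; the account name, data source name, storage endpoint, and token are hypothetical placeholders.

```python
import json
import urllib.parse
import urllib.request


def build_register_request(account: str, source_name: str, kind: str,
                           endpoint: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a PUT request that registers a data source."""
    # Assumed URL shape and API version; verify against the REST reference.
    url = (
        f"https://{account}.scan.purview.azure.com/datasources/"
        f"{urllib.parse.quote(source_name, safe='')}"
        "?api-version=2018-12-01-preview"
    )
    # Assumed minimal payload: the source kind plus its endpoint.
    body = {"kind": kind, "properties": {"endpoint": endpoint}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


req = build_register_request(
    account="contoso-purview",    # hypothetical Purview account name
    source_name="sales-storage",  # hypothetical data source name
    kind="AzureStorage",
    endpoint="https://contososales.blob.core.windows.net/",  # hypothetical
    token="<access-token>",       # placeholder; acquire a real token separately
)
print(req.get_method(), req.full_url)
# Sending is deliberately left out: urllib.request.urlopen(req) would make the call.
```

Building the request separately from sending it makes it easy to inspect or log exactly what would be registered before touching a live account.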
Scanning is when the Purview engine connects to your data source and starts collecting its metadata. What metadata is it collecting? Well, that depends on your data source type. For example, when scanning a SQL Server database, schema names, table names, view names, column names, and their data types are collected. In addition to the metadata about your data source, you can specify Scan Rules, which will assist with classification efforts. There are some out-of-the-box scan rules, called System Scan Rules, to get you started. However, if those don’t provide all the information you need, you can always create custom Scan Rule Sets specific to your organization.
For example, the System Scan Rule set for AzureStorage scans the following file types: CSV, JSON, PSV, SSV, TSV, GZIP, TXT, XML, PARQUET, AVRO, ORC, DOC, DOCM, DOCX, DOT, ODP, ODS, ODT, PDF, POT, PPS, PPSX, PPT, PPTM, PPTX, XLC, XLS, XLSB, XLSM, XLSX, XLT, and includes 206 classification rules. If you want to exclude a file type or modify the classification rules, you can create a Custom Scan Rule set that will only scan the file types you specify and use the classification rules you select.
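The idea behind a custom scan rule set can be sketched as a simple filter over the system list. This is only an illustration of the concept, not how Purview stores rule sets; the function name and the excluded types are hypothetical, while the file type list is the one quoted above.

```python
# File types covered by the system scan rule set for AzureStorage,
# as listed in the article.
SYSTEM_FILE_TYPES = [
    "CSV", "JSON", "PSV", "SSV", "TSV", "GZIP", "TXT", "XML", "PARQUET",
    "AVRO", "ORC", "DOC", "DOCM", "DOCX", "DOT", "ODP", "ODS", "ODT", "PDF",
    "POT", "PPS", "PPSX", "PPT", "PPTM", "PPTX", "XLC", "XLS", "XLSB",
    "XLSM", "XLSX", "XLT",
]


def custom_file_types(exclude):
    """Return the system file types minus the ones to exclude."""
    excluded = {ext.upper() for ext in exclude}
    unknown = excluded - set(SYSTEM_FILE_TYPES)
    if unknown:
        raise ValueError(f"not in the system rule set: {sorted(unknown)}")
    return [ext for ext in SYSTEM_FILE_TYPES if ext not in excluded]


# Hypothetical custom rule set that skips PDF and plain-text files.
custom = custom_file_types(["pdf", "txt"])
print("PDF" in custom, "CSV" in custom)  # False True
```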
Scanning your on-premises data sources deserves a special call out. To scan your on-premises data sources, you will need to install the latest self-hosted integration runtime (IR). The self-hosted integration runtime is the compute infrastructure that Azure Data Factory uses to provide data collection abilities across different network environments. This is how Purview will communicate with your data source and import the metadata about your data source. It is best practice to install the self-hosted integration runtime on its own machine, which can be either a physical or virtual machine. Depending on your data source type, you may also need to install JDK 11, the Visual C++ Redistributable 2012 Update 4, and any necessary data source drivers on the same machine where the self-hosted integration runtime is running.
Classifications help you identify what types of data you have in your data estate. Currently there are five categories of classifications: Government, Financial, Personal, Security, and Miscellaneous. A few examples of classifications, grouped by category:
- various National Identification Numbers, Passport numbers, and Taxpayer Identification Numbers for the Government category
- ABA routing numbers, Credit Card Numbers, and various national Bank Account Numbers for the Financial category
- email address, date of birth, and phone number for the Personal category
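To give a feel for how a column could be recognized as one of the examples above, here is a minimal sketch of pattern-based classification. It is not Purview's actual classification engine: the regular expression, the 60% match threshold, the label, and the function name are all illustrative assumptions.

```python
import re

# Deliberately simple email pattern for illustration only.
EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def classify_column(values, pattern=EMAIL_PATTERN, label="Email Address",
                    threshold=0.6):
    """Return the label if enough non-empty sample values match the pattern."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    matches = sum(1 for v in non_empty if pattern.match(v))
    return label if matches / len(non_empty) >= threshold else None


sample = ["ana@example.com", "li@example.org", "n/a", "omar@example.net"]
print(classify_column(sample))  # Email Address (3 of 4 values match)
```

A threshold rather than an all-or-nothing match keeps a few dirty values, such as the "n/a" above, from hiding an otherwise obvious classification.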
Classifications can be applied manually or automatically through the scan rule sets. Classifications can be applied at the resource set level (manually), table level (manually) or column level (automatically). Once classifications have been applied, re-scanning will not overwrite the assigned classifications, but new classifications will be added if they are detected. You can only remove classifications manually via the Purview portal or programmatically via the REST API.
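The re-scan behavior described above amounts to a set union: existing classifications are kept and newly detected ones are added, while removal is always a separate, deliberate step. The sketch below models that with plain Python sets; the function names and labels are hypothetical.

```python
def apply_rescan(existing: set, detected: set) -> set:
    """Re-scan: keep everything already assigned, add anything newly detected."""
    return existing | detected


def remove_classification(existing: set, label: str) -> set:
    """Manual removal (portal or REST API in Purview); a scan never does this."""
    return existing - {label}


column = {"Email Address"}
# The re-scan detects only a phone number this time, yet the email
# classification assigned earlier is preserved.
column = apply_rescan(column, {"Phone Number"})
print(sorted(column))  # ['Email Address', 'Phone Number']

column = remove_classification(column, "Email Address")
print(sorted(column))  # ['Phone Number']
```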
That’s it for the Scanning and Classification installment of our Azure Purview series. If you missed the other posts in this series, they can be found here:
Azure Purview Series - Part 1: An Overview
Azure Purview Series - Part 2: Data Catalog
Azure Purview Series - Part 4: Data Map
Note: Azure Purview is now Generally Available as of 9/28/21.