Challenge
A large payment technology company sources hundreds of data products from different data vendors. These data products arrive through a diverse set of channels: direct files, AWS Data Exchange (ADX), Databricks Marketplace, and others. Managing access to these data products, and even identifying which ones were available, created significant operational overhead. Inefficiencies in data sourcing delayed business units' access to requested data or resulted in duplicate licensing with the same vendor.
Our Approach
Rearc partnered with Databricks to implement a workflow utilizing Databricks Unity Catalog to meet the needs of the customer. Rearc designed an ingestion process powered by Databricks Workflows to ingest marketplace data from an external S3 bucket or a Delta Share.
Solution
A central ingestion workflow moves data products from an S3 bucket into Unity Catalog. This process relies on a configuration file for each data product, called an ingestion manifest, which provides the attributes and mappings needed to onboard the data product into Unity Catalog. Data sources such as ADX are auto-exported into S3 through native AWS workflows. Direct files are first uploaded into S3 for processing by the central workflow. Data products from Databricks Marketplace are already available in Unity Catalog. In those cases, the workflow applies the appropriate tags and permissions from the ingestion manifest without any data movement.
The configuration-driven approach, built on ingestion manifests, allows many data products to be managed through a single workflow. This reduces sprawl, improves consistency, and lowers the technical resources required to onboard new data products.
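To make the manifest-driven flow more concrete, the following is a minimal sketch of how a single workflow step might turn one ingestion manifest into the statements it needs to run. It is illustrative only and not the customer's implementation: the `IngestionManifest` class, its field names, the `plan_statements` helper, and the example bucket, table, and group names are all hypothetical. Using `COPY INTO` to load files from S3 is likewise an assumption, since the actual load mechanism is not described here. The sketch only builds SQL strings, so it can be run without a Databricks workspace.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class IngestionManifest:
    """Hypothetical manifest for one data product (all field names are assumptions)."""
    name: str
    source_type: str                 # "s3" (direct files, ADX exports) or "marketplace"
    target_table: str                # three-level Unity Catalog name: catalog.schema.table
    s3_path: Optional[str] = None    # only needed when source_type == "s3"
    file_format: str = "PARQUET"
    tags: Dict[str, str] = field(default_factory=dict)
    reader_groups: List[str] = field(default_factory=list)


def plan_statements(manifest: IngestionManifest) -> List[str]:
    """Return the SQL statements the workflow would run for one manifest."""
    statements: List[str] = []

    if manifest.source_type == "s3":
        # Files landed in S3 need a load step (COPY INTO is an assumed choice).
        if not manifest.s3_path:
            raise ValueError(f"{manifest.name}: s3 sources require s3_path")
        statements.append(
            f"COPY INTO {manifest.target_table} "
            f"FROM '{manifest.s3_path}' FILEFORMAT = {manifest.file_format}"
        )
    elif manifest.source_type != "marketplace":
        raise ValueError(f"{manifest.name}: unknown source_type {manifest.source_type!r}")
    # Marketplace products are already in Unity Catalog, so no load step is added.

    if manifest.tags:
        tag_list = ", ".join(f"'{k}' = '{v}'" for k, v in sorted(manifest.tags.items()))
        statements.append(f"ALTER TABLE {manifest.target_table} SET TAGS ({tag_list})")

    for group in manifest.reader_groups:
        statements.append(f"GRANT SELECT ON TABLE {manifest.target_table} TO `{group}`")

    return statements


# Example: one workflow handles both kinds of source from a list of manifests.
manifests = [
    IngestionManifest(
        name="vendor_prices",
        source_type="s3",
        target_table="vendors.acme.prices",
        s3_path="s3://example-landing-bucket/acme/prices/",
        tags={"vendor": "acme"},
        reader_groups=["pricing-analysts"],
    ),
    IngestionManifest(
        name="market_signals",
        source_type="marketplace",
        target_table="shared.signals.daily",
        tags={"vendor": "example-provider"},
        reader_groups=["research"],
    ),
]

for m in manifests:
    for statement in plan_statements(m):
        print(statement)
```

In a real workflow, each returned statement would be executed against the workspace rather than printed. Keeping the source-specific branching in one function is what lets a new data product be onboarded by adding a manifest rather than writing a new pipeline.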
Outcome
The customer has successfully onboarded an initial group of data products using the workflow. Business analysts within the company's data sourcing team were able to onboard new data products into Unity Catalog without technical assistance. Teams were able to easily discover data sets and be granted access, all within one platform, through Databricks Unity Catalog governance.