Rearc's Publisher Coordinator for AWS Data Exchange

We recently open sourced the Publisher Coordinator for AWS Data Exchange (ADX) to benefit the larger community of data providers, and we are happy to share more details in this blog post. Rearc’s ADX Publisher Coordinator offers scalable, cloud-based infrastructure for the data publishing process on ADX. If you are a data provider with valuable data offerings that you are excited to share with the world and monetize, keep reading!

Data is an essential component for survival and growth of organizations and businesses. The advent of data marketplaces is a natural response from cloud service providers to address the growing need for high quality data that is easy to find and access.

AWS Data Exchange (ADX) is one of the major data marketplaces which bridges the gap between data providers and data subscribers by facilitating data discovery, subscription, storage, ownership, delivery, and billing. Rearc is one of the largest data providers on ADX. We offer data products in various verticals including financial services, healthcare, sustainability, and public sector data, as well as many others.

We have published more than 400 datasets with different sizes, sourcing methods, preprocessing workflows, and compliance requirements to ADX. This has put us in the unique position to experience first-hand the challenges and limitations involved in publishing and maintaining high quality data products. We have been solving the challenges of the publishing process and constantly improving our data infrastructure throughout our journey as a data provider on ADX.

Rearc's ADX Publisher Coordinator automates the process of publishing your data products to the AWS Data Exchange marketplace, abstracting away the engineering complexities involved to achieve a reliable and scalable workflow for data providers. Our solution is recommended by the AWS Data Exchange team to data providers.

What's So Hard About Publishing to ADX, Anyway?

Let's go over some of the challenges you may experience with publishing your data products on ADX:

  • Errors from adding too many assets (more than 10,000) to a new revision
  • Errors from triggering too many (more than 10) concurrent import jobs to import assets from S3 to ADX
  • Errors from your publishing pipeline timing out if you use AWS Lambda for compute
  • Duplicated code across the automation pipelines for your ADX data products, which makes adapting to ADX API changes a maintenance burden

Rearc's Solution

All these challenges are addressed by the AWS Data Exchange Publisher Coordinator. It enables data providers to focus on their data quality rather than ADX publishing details. Providers can source their datasets to an S3 bucket. Once the data is ready, the provider can simply upload a JSON manifest file to a second S3 bucket. This will trigger and run the publishing lifecycle of dataset revisions using the infrastructure set up by the ADX Publisher Coordinator.

Each manifest file can reference an arbitrary number of assets, and the ADX Publisher Coordinator will take care of chunking the assets into bundles that fit within the ADX API limits. Current quota limits for the ADX API include a maximum of 100 assets per import job, a maximum of 10 concurrently running import jobs, and a maximum of 10,000 assets per dataset revision.
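To make the quota arithmetic concrete, here is a minimal Python sketch that estimates how many revisions and import jobs a manifest of a given size would need. The function name `plan_publish` and the constants are illustrative assumptions, not part of the Publisher Coordinator's code, and the numbers are simply the quotas quoted above.

```python
import math

# Quotas as quoted in this post; treat them as configurable, since they may change.
MAX_ASSETS_PER_JOB = 100
MAX_ASSETS_PER_REVISION = 10_000


def plan_publish(asset_count: int) -> dict:
    """Estimate the revisions and import jobs needed for `asset_count` assets."""
    if asset_count < 0:
        raise ValueError("asset_count must be non-negative")
    revisions = math.ceil(asset_count / MAX_ASSETS_PER_REVISION)
    jobs = 0
    remaining = asset_count
    # Count jobs revision by revision: a batch of assets never spans two revisions.
    for _ in range(revisions):
        in_revision = min(remaining, MAX_ASSETS_PER_REVISION)
        jobs += math.ceil(in_revision / MAX_ASSETS_PER_JOB)
        remaining -= in_revision
    return {"revisions": revisions, "import_jobs": jobs}


print(plan_publish(25_050))  # {'revisions': 3, 'import_jobs': 251}
```

With 25,050 assets, two full revisions take 100 import jobs each and the remaining 5,050 assets take 51 more, which is why a single manifest can turn into several revisions.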

The publisher coordinator will create and publish one or more dataset revisions based on the ADX API quota and the number of assets specified in the manifest file. New dataset revisions will be automatically distributed to subscribers through a scalable, consistent, and reliable process without the need to worry about ADX API quotas.

There are a number of benefits to using Rearc's ADX Publisher Coordinator:

  • Achieve operational efficiency in your publishing pipelines by eliminating manual processes
  • Save development time and bypass the engineering complexity of a custom automation
  • Ensure scalability and reliability of your data publishing process
  • Reduce errors caused by hitting service limits of the AWS Data Exchange API
  • Deploy a scalable infrastructure for publishing dataset revisions in minutes
  • Easily publish multiple new revisions with an arbitrary number of assets to your existing data products by uploading a JSON manifest file to S3

Architecture

The following infrastructure will be provisioned in your provider AWS account when you deploy this solution.

ADX Publisher Coordinator Architecture

Components

This solution uses three S3 buckets: an asset bucket, a manifest bucket, and a manifest logging bucket. The asset S3 bucket and the manifest logging S3 bucket should exist prior to deployment. The CloudFormation template creates the following resources in your AWS account:

  • A manifest S3 bucket, where manifests should be uploaded every time a dataset update is available
  • A starter Lambda function that starts an execution of the Step Functions workflow
  • A Step Functions workflow that runs the publishing process. Each execution of the workflow will start three or more Lambda functions, depending on the number of assets specified in the manifest file
  • The IAM (AWS Identity and Access Management) roles that give the above resources the permissions they need
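The starter Lambda function listed above reacts to S3 upload events on the manifest bucket. As a rough sketch only, this is how a handler might pull the manifest location out of a standard S3 event notification. The helper name `manifest_locations` and the sample bucket and key are hypothetical, and the actual starter function in the repository may be organized differently.

```python
from urllib.parse import unquote_plus


def manifest_locations(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for every .json object in an S3 event."""
    locations = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # S3 URL-encodes object keys in event notifications (spaces arrive as "+").
        key = unquote_plus(s3["object"]["key"])
        # Only keys ending in .json are treated as manifests.
        if key.endswith(".json"):
            locations.append((bucket, key))
    return locations


# Hypothetical event with one manifest and one unrelated object.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-manifest-bucket"},
                "object": {"key": "updates/my+manifest.json"}}},
        {"s3": {"bucket": {"name": "my-manifest-bucket"},
                "object": {"key": "notes/readme.txt"}}},
    ]
}
print(manifest_locations(sample_event))
# [('my-manifest-bucket', 'updates/my manifest.json')]
```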

How does it work?

For every dataset update, you need to upload your data assets into the asset bucket and then upload a manifest file into the manifest bucket. The manifest's object key must end with .json, and the file must contain valid JSON of the form:

{
  "product_id": "<string>",
  "dataset_arn": "<string>",
  "asset_list": [{
      "Bucket": "<asset_s3_bucket>",
      "Key": "<s3_object_key_1>"
    },
    {
      "Bucket": "<asset_s3_bucket>",
      "Key": "<s3_object_key_2>"
    },
    ...
  ]
}
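If you generate manifests from a script, a small helper can build the structure above and catch obvious mistakes before upload. The following is a sketch under assumptions: the function names `build_manifest` and `check_manifest` and all sample values are made up for illustration, and the checks cover only the fields shown in this post.

```python
import json


def build_manifest(product_id: str, dataset_arn: str, bucket: str, keys: list[str]) -> str:
    """Serialize a manifest in the shape shown above."""
    manifest = {
        "product_id": product_id,
        "dataset_arn": dataset_arn,
        "asset_list": [{"Bucket": bucket, "Key": key} for key in keys],
    }
    return json.dumps(manifest, indent=2)


def check_manifest(object_key: str, body: str) -> list[str]:
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    if not object_key.endswith(".json"):
        problems.append("manifest key must end with .json")
    try:
        manifest = json.loads(body)
    except json.JSONDecodeError as exc:
        return problems + [f"not valid JSON: {exc.msg}"]
    if not isinstance(manifest, dict):
        return problems + ["top-level value must be a JSON object"]
    for field in ("product_id", "dataset_arn", "asset_list"):
        if field not in manifest:
            problems.append(f"missing field: {field}")
    assets = manifest.get("asset_list")
    if isinstance(assets, list):
        for index, asset in enumerate(assets):
            if not isinstance(asset, dict) or not {"Bucket", "Key"} <= asset.keys():
                problems.append(f"asset_list[{index}] needs Bucket and Key")
    elif "asset_list" in manifest:
        problems.append("asset_list must be a list")
    return problems


# Placeholder identifiers, not real ADX values.
body = build_manifest("example-product-id", "example-dataset-arn",
                      "my-asset-bucket", ["2024/a.csv", "2024/b.csv"])
print(check_manifest("updates/2024-01.json", body))  # []
```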

Once a manifest is uploaded to the manifest bucket, an S3 upload event triggers the starter Lambda. The starter Lambda restructures the contents of the uploaded manifest into a new manifest that accounts for ADX service quotas. It then uploads this new manifest to the manifest bucket and passes the manifest bucket name and the new manifest's key as inputs to a new execution of the Step Functions workflow. Here's how the new manifest file is structured:

{
  "product_id": "<string>",
  "dataset_arn": "<string>",
  "asset_lists_10k": [
    [
      [{
          "Bucket": "<asset_s3_bucket>",
          "Key": "<s3_object_key_1>"
        },
        {
          "Bucket": "<asset_s3_bucket>",
          "Key": "<s3_object_key_2>"
        },
        {
          "Bucket": "<asset_s3_bucket>",
          "Key": "<s3_object_key_100>"
        }
      ]
    ],
    [
      [
        [
          < up to 100 s3 object keys >
        ],
        ...
      ],
      ...
    ],
    [
      <up to 10000 s3 object keys total>
    ],
    ...
  ]
}
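The restructuring step can be pictured as two rounds of chunking. The sketch below assumes a two-level layout, where each entry of `asset_lists_10k` is one revision of at most 10,000 assets, itself split into batches of at most 100 assets per import job. That reading of the structure, and the helper names `chunk` and `restructure`, are assumptions for illustration; the starter Lambda in the repository is the authority on the exact format.

```python
def chunk(items: list, size: int) -> list[list]:
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def restructure(manifest: dict) -> dict:
    """Group assets into revisions (<= 10,000) and import-job batches (<= 100)."""
    revisions = chunk(manifest["asset_list"], 10_000)
    return {
        "product_id": manifest["product_id"],
        "dataset_arn": manifest["dataset_arn"],
        "asset_lists_10k": [chunk(revision, 100) for revision in revisions],
    }


# Placeholder manifest with 10,250 assets.
uploaded = {
    "product_id": "example-product-id",
    "dataset_arn": "example-dataset-arn",
    "asset_list": [{"Bucket": "my-asset-bucket", "Key": f"data/{n}.csv"}
                   for n in range(10_250)],
}
nested = restructure(uploaded)
print([len(revision) for revision in nested["asset_lists_10k"]])  # [100, 3]
```

Here the first revision holds 100 batches of 100 assets, and the second holds the remaining 250 assets in three batches.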

The step function workflow looks like:

Details of the Step Functions Workflow

Tasks in the AWS Step Functions workflow use S3 Select to query the manifest file for the number of asset objects referenced. Two nested Map states ensure the calls to the ADX API stay within the quota limits. The inner Map state imports all the assets referenced in the manifest from S3 to ADX, creates new revisions, finalizes each revision, and finally assigns each finalized revision to the product referenced in the manifest file. Upon completion of the workflow, your subscribers will have access to the new dataset revisions.
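As a local analogy only, the effect of capping concurrent import jobs can be sketched with a bounded thread pool. Step Functions enforces this inside the service rather than with threads, and how the workflow caps concurrency is not described here, so everything below (`publish_revision`, `ConcurrencyProbe`, the fake import) is a hypothetical stand-in and makes no ADX API calls.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_IMPORT_JOBS = 10  # quota quoted in this post


def publish_revision(batches, import_batch):
    """Run `import_batch` over every batch, at most 10 at a time."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_IMPORT_JOBS) as pool:
        # list() waits for all batches and re-raises any exception from a batch.
        return list(pool.map(import_batch, batches))


class ConcurrencyProbe:
    """Fake import job that records how many calls overlap."""

    def __init__(self):
        self._lock = threading.Lock()
        self.running = 0
        self.peak = 0

    def __call__(self, batch):
        with self._lock:
            self.running += 1
            self.peak = max(self.peak, self.running)
        time.sleep(0.01)  # pretend the import takes a moment
        with self._lock:
            self.running -= 1
        return len(batch)


probe = ConcurrencyProbe()
batches = [[f"key_{n}"] * 100 for n in range(35)]
imported = publish_revision(batches, probe)
print(sum(imported), "assets imported; peak concurrency", probe.peak)
```

The peak never exceeds 10 because the pool has only 10 workers, which mirrors how a concurrency cap keeps the workflow under the import job quota.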

What do I need to deploy this solution?

We use CloudFormation to simplify the deployment process. You can deploy this solution to your AWS provider account using the AWS Command Line Interface, the AWS SDK, or the AWS Console. You can use this open source version yourself by following the steps discussed in this blog post and in the project's GitHub repository. If you would rather have us host it for you, reach out to our data team at data@rearc.io. Rearc offers a data platform, a managed end-to-end solution for data providers, which you can learn more about in this blog post.

Prior to deployment, you need to have the following in place:

Approved ADX provider account

To publish file-based data products through the AWS Marketplace using AWS Data Exchange, you must register as a provider and publish a product. For detailed instructions on becoming a provider and publishing a product, refer to Providing Data Products on AWS Data Exchange in the AWS Data Exchange User Guide.

Existing ADX product

This solution requires an existing ADX product. In order to create one, the first step is to create a dataset with at least one finalized revision. Then, you can use that dataset to publish a new product through the ADX console. Record your product ID and dataset ID, as you will need to include both of them in your manifest when it is time to publish.

IAM Permissions

The IAM user or role you use to deploy this solution should have permissions to create and execute CloudFormation stacks which include IAM roles and policies, Lambda functions, Step Functions, and Amazon S3 buckets. You can review the security permissions of the CloudFormation resources here.

S3 buckets

This solution uses three S3 buckets: an asset bucket, a manifest bucket, and a manifest logging bucket. You only need to create the asset S3 bucket and the manifest logging S3 bucket prior to deployment of this solution. Please refer to the instructions in our GitHub repository for details on how to create those buckets. The manifest bucket is created by the CloudFormation stack during the deployment.

Source Code

You can access the source code for this solution in the project's GitHub repository.
