AWS S3 Legacy

Updated by Patrick Madsen

The AWS S3 Legacy destination is a secure way to have a dump of the Data Platform delivered to an S3 bucket in your own AWS organization, using a dedicated IAM role.

The Data Platform gives you access to a raw dump of the database tables that Dreamdata uses to build all its insights. With access to the raw data, you can load it into any existing database platform that can read from S3, such as Redshift or Snowflake, and create insights tailored to your organization's specific needs.

The guide below describes how to set up an S3 bucket and create an IAM role that grants access to that bucket and that Dreamdata can assume, so that we can send data to your organization.

Note: The AWS S3 Legacy destination only supports Schema v2. You can find Schema v2 here.

Guide

The following steps describe how to create a role that will allow Dreamdata to push data to an S3 bucket in your AWS organization in a safe and secure manner.

This guide will cover the following 4 steps:

  1. Create or select an S3 bucket as the destination of the data.
  2. Create a new role dedicated to this purpose and give it an identifying name, e.g. dreamdata-data-platform, that Dreamdata will assume when copying data.
  3. Create a trust relationship policy for the dreamdata-data-platform role that allows Dreamdata to call sts:AssumeRole on it and act on its behalf, given a specified External ID.
  4. Create a policy for the dreamdata-data-platform role that grants access to the bucket with permissions to create folders and write files.

Permissions

The person performing these steps must have the necessary permissions in your own AWS organization to:

  1. Create buckets and edit bucket policies.
  2. Create roles in your AWS organization.
  3. Edit trust and permission policies on roles.

1. Create a new S3 bucket

Create a new bucket inside your AWS organization (or select an already existing bucket) and copy the name, not the ARN, of the bucket. Bucket names are global, so be sure to choose a globally unique name when creating it, or creation will fail. Here is an official guide on how to do just that.

Dreamdata does not delete previous data dumps, so we recommend putting a storage lifecycle policy in place to make sure that the folder does not grow indefinitely. An expiration of 7 days is a sensible default, but the right value depends on your use case.
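As a sketch, a 7-day expiration rule could look like the following. The rule ID and empty prefix are illustrative; the rule can be applied via the console, the AWS CLI, or an SDK such as boto3:

```python
# Sketch: an S3 lifecycle configuration that expires objects after 7 days.
# The rule ID is illustrative; an empty prefix applies the rule to the
# whole bucket. Adjust both to your setup.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "expire-dreamdata-dumps",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply to the whole bucket
            "Expiration": {"Days": 7},
        }
    ]
}

# With boto3 (assumed installed and configured), this could be applied as:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="<S3_BUCKET>",
#     LifecycleConfiguration=lifecycle_configuration,
# )
```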

Throughout the rest of this guide, we will refer to the bucket created in this step using the <S3_BUCKET> placeholder.

2. Create a new role

Dreamdata uses a single dedicated role to assume control of external systems via sts:AssumeRole. The ARN of this role is:
- arn:aws:iam::485801740390:role/dreamdata_destination_s3

Under IAM > Roles, click Create role and perform the following sub-steps:

  1. Trusted Entity Type: select "AWS account".
  2. An AWS account: select "Another AWS account" and enter 485801740390 (the Dreamdata AWS account ID).
  3. Options: select "Require external ID (Best practice when a third party will assume this role)" and enter your Dreamdata External ID, which can be found in Data Platform -> Data Access -> AWS S3 Legacy in the Dreamdata App.
    1. Using an External ID guards against the confused deputy problem.
  4. In the next step, create the following policy and select it in the list - remember to replace <S3_BUCKET> with the actual bucket created in Step 1:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "DreamdataDataPlatform",
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:PutObjectAcl"
          ],
          "Resource": [
            "arn:aws:s3:::<S3_BUCKET>/*",
            "arn:aws:s3:::<S3_BUCKET>"
          ]
        }
      ]
    }
  5. Click the Next button.
  6. On the next screen, specify the role name and optionally a description, before clicking Create.
  7. Now that the role is created, we need to configure it. Find and select it, before performing the following step(s):
    1. Required step: We need to make sure that only the dedicated Dreamdata role arn:aws:iam::485801740390:role/dreamdata_destination_s3 can assume this role, by further restricting the trust policy. To do that, update the Principal in the policy like so, replacing <customer-dreamdata-account-id> with your Account ID, found in the Dreamdata App:

      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "DreamdataStsPolicy",
            "Effect": "Allow",
            "Principal": {
              "AWS": "arn:aws:iam::485801740390:role/dreamdata_destination_s3"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
              "StringEquals": {
                "sts:ExternalId": "<customer-dreamdata-account-id>"
              }
            }
          }
        ]
      }
    2. Conditional step: If the bucket uses KMS encryption, the role also needs the GenerateDataKey and Decrypt permissions on the bucket's KMS key. Either add these permissions to the above policy, or attach a new policy to the role containing the following:

      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "KMSEncryption",
            "Effect": "Allow",
            "Action": [
              "kms:GenerateDataKey",
              "kms:Decrypt"
            ],
            "Resource": "<YOUR_KMS_KEY_ARN>"
          }
        ]
      }

These are all the required steps.

To finalize the setup, paste these values into the fields on the page Data Platform -> Data Access -> AWS S3 Legacy in the Dreamdata App:

  • the bucket name of the bucket created in Step 1 (not the ARN).
  • the Role ARN that was created in Step 2.

How the Data looks

  • The different tables and their schemas are documented on dbdocs.io.
  • Each folder contains a complete dump of the table in the .parquet format.

Data will appear in the bucket using the structure shown below. If a Folder path is optionally specified in the Dreamdata App, all files will be nested under it.

The following examples assume that no Folder is configured, with the files being placed at the root level of the bucket:

receipt.json
2023-01-02T15:04/companies/companies_*.parquet.gz
2023-01-02T15:04/contacts/contacts_*.parquet.gz
2023-01-02T15:04/events/events_*.parquet.gz
2023-01-02T15:04/revenue/revenue_*.parquet.gz
2023-01-02T15:04/revenue_attribution/revenue_attribution_*.parquet.gz
2023-01-02T15:04/paid_ads/paid_ads_*.parquet.gz

Inside each folder are one or more gzip-compressed Parquet files. Here, the files inside the companies folder are shown:

2023-01-02T15:04/companies/companies_000000000000.parquet.gz
2023-01-02T15:04/companies/companies_000000000001.parquet.gz
...

A receipt.json file is created/updated upon every successful data dump, containing a description of all dumped data, including a timestamp, table names and their respective folder names and file counts. An S3 trigger can be set up to fire whenever this file is updated.

Here is a sample receipt.json file:

{
  "timestamp": "2023-03-14T04:03:07.963883Z",
  "tables": {
    "companies": {
      "folder": "2023-03-14T04:03/companies",
      "total_file_count": 58
    },
    "contacts": {
      "folder": "2023-03-14T04:03/contacts",
      "total_file_count": 58
    },
    "events": {
      "folder": "2023-03-14T04:03/events",
      "total_file_count": 64
    },
    "paid_ads": {
      "folder": "2023-03-14T04:03/paid_ads",
      "total_file_count": 51
    },
    "revenue": {
      "folder": "2023-03-14T04:03/revenue",
      "total_file_count": 51
    },
    "revenue_attribution": {
      "folder": "2023-03-14T04:03/revenue_attribution",
      "total_file_count": 61
    }
  }
}

Each entry in the tables object can be used to automate loading of the data, for example with an AWS Lambda function or similar, by iterating over the entries and performing a load operation on each folder value.
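As a sketch of that iteration, the following parses a (shortened) receipt.json and maps each table name to the folder holding its latest dump. The load operation itself is left out, since it depends on your target platform:

```python
import json

# Sketch: iterate over the tables in a receipt.json and collect the
# folder to load for each table. The receipt below is a shortened
# example; in practice it would be read from the S3 bucket.
receipt_json = """
{
  "timestamp": "2023-03-14T04:03:07.963883Z",
  "tables": {
    "companies": {"folder": "2023-03-14T04:03/companies", "total_file_count": 58},
    "events": {"folder": "2023-03-14T04:03/events", "total_file_count": 64}
  }
}
"""

def folders_to_load(receipt: dict) -> dict:
    """Map each table name to the S3 folder that holds its latest dump."""
    return {name: info["folder"] for name, info in receipt["tables"].items()}

receipt = json.loads(receipt_json)
print(folders_to_load(receipt))
# {'companies': '2023-03-14T04:03/companies', 'events': '2023-03-14T04:03/events'}
```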

Schedule

A full dump of each data platform table is created after each successful Data Modelling run.
