Destination AWS S3

Destination AWS S3 is a secure way to have a dump of the Data Platform delivered to an S3 bucket in your own AWS organization, using a dedicated IAM role.

The Data Platform gives you access to a raw dump of the database tables that Dreamdata uses to build all its insights. Having access to the raw data lets you load it into any existing database platform that can read from S3, such as Redshift or Snowflake, and also create insights tailored to the specific needs of your organization.

The guide below helps you set up an `S3 bucket` and describes how to create an IAM role that grants access to that bucket and allows Dreamdata to assume the role so we can send the data to your organization.

Note: The AWS S3 destination only supports schema v2. You can find schema v2 here.

Guide

The following steps describe how to create a role that allows Dreamdata to push data to an S3 bucket in your AWS organization in a safe and secure manner.

This guide covers the following 4 steps:

  1. create or select an S3 bucket as the destination of the data
  2. create a new role dedicated to this purpose and give it an identifying name, e.g. dreamdata-data-platform, that Dreamdata will assume when copying data
  3. create a trust relationship policy for the dreamdata-data-platform role that allows Dreamdata to assume it (sts:AssumeRole) and act on its behalf, given a specified external ID
  4. create a policy for the dreamdata-data-platform role that grants access to the bucket and permissions to create folders and write files

Permissions

The person performing these steps must be able to perform the following actions in your own AWS organization:

  1. create buckets and edit bucket policies
  2. create roles in your AWS organization
  3. edit trust and permission policies on roles

1. create an S3 bucket

Create a bucket inside your AWS organization (or select an existing bucket) and copy the name, not the ARN, of the bucket. Bucket names are global, so be sure to pick a unique name when creating it, or creation will fail. Here is an official guide on how to do just that.

Dreamdata does not delete previous data dumps, so we recommend putting a storage lifecycle policy in place to limit the data size and make sure the folder does not grow indefinitely. A retention period of 7 days is a sensible value, but it can vary depending on your use case.

We will use the name <S3_BUCKET> for the rest of this guide as a placeholder for the bucket created in this step.
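
If you prefer to script this step, here is a minimal sketch using boto3 (the AWS SDK for Python); the bucket name and region are placeholders you should replace, and the 7-day expiration rule matches the lifecycle recommendation above:

import boto3

BUCKET_NAME = "<S3_BUCKET>"  # placeholder - replace with your own, globally unique bucket name
REGION = "eu-west-1"         # placeholder - use your own region

s3 = boto3.client("s3", region_name=REGION)

# Create the bucket (the CreateBucketConfiguration block is required outside us-east-1).
s3.create_bucket(
    Bucket=BUCKET_NAME,
    CreateBucketConfiguration={"LocationConstraint": REGION},
)

# Expire objects after 7 days so old dumps do not accumulate indefinitely.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET_NAME,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-dreamdata-dumps",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Expiration": {"Days": 7},
            }
        ]
    },
)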

2. create a new role

Dreamdata uses a single dedicated role to assume control of external systems via sts:AssumeRole. The ARN of this role is:
- arn:aws:iam::485801740390:role/dreamdata_destination_s3
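
For context, assuming a role with an external ID works roughly like the following illustrative boto3 sketch; this is not Dreamdata's actual implementation, and the role ARN and external ID are placeholders for the values you create later in this guide:

import boto3

sts = boto3.client("sts")

# Assume the customer role; the request only succeeds if the external ID
# matches the condition in the role's trust policy.
credentials = sts.assume_role(
    RoleArn="arn:aws:iam::<YOUR_AWS_ACCOUNT_ID>:role/dreamdata-data-platform",  # placeholder
    RoleSessionName="dreamdata-data-platform",
    ExternalId="<customer-dreamdata-account-id>",  # your Dreamdata account ID
)["Credentials"]

# The temporary credentials are then used to write files to the bucket.
s3 = boto3.client(
    "s3",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)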

Under IAM > Roles, click Create role and perform the following steps (a scripted equivalent of steps 2-7 is sketched after the list):

  1. Trusted Entity Type: select "AWS account"
  2. An AWS account: select "Another AWS account" and enter 485801740390 (the Dreamdata AWS account ID)
  3. Options: select `"Require external ID (Best practice when a third party will assume this role)"` and enter your Dreamdata Account ID
    1. using an External ID guards against the confused deputy problem
  4. In the next step, create the following policy and select it in the list - remember to replace <S3_BUCKET> with the actual bucket from step 1:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "DreamdataDataPlatform",
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:PutObjectAcl"
          ],
          "Resource": [
            "arn:aws:s3:::<S3_BUCKET>/*",
            "arn:aws:s3:::<S3_BUCKET>"
          ]
        }
      ]
    }
  5. Click the Next button
  6. Specify the name of the role on the next screen (e.g. dreamdata-data-platform), optionally add a description, and click Create role
  7. The role is now created. Find the newly created role and select it. We now want to restrict the trust policy even further, by making sure that only the dedicated Dreamdata role arn:aws:iam::485801740390:role/dreamdata_destination_s3 can assume this role. To do that, update the Principal in the trust policy like so (replace <customer-dreamdata-account-id> with your account ID in the Dreamdata system):
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "DreamdataStsPolicy",
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::485801740390:role/dreamdata_destination_s3"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "<customer-dreamdata-account-id>"
            }
          }
        }
      ]
    }
  8. Conditional step: when using KMS encryption on the bucket, the role also needs kms:GenerateDataKey and kms:Decrypt permissions on the KMS key used by the bucket. Either add these permissions to the above policy, or attach a new policy to the role containing the following:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "KMSEncryption",
          "Effect": "Allow",
          "Action": [
            "kms:GenerateDataKey",
            "kms:Decrypt"
          ],
          "Resource": "<YOUR_KMS_KEY_ARN>"
        }
      ]
    }

These are all the required steps.
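
If you prefer to create the role programmatically instead of through the console, the following rough boto3 sketch covers steps 2-7; the role name is only a suggestion, and the placeholders must be replaced with your own values:

import json
import boto3

ROLE_NAME = "dreamdata-data-platform"                # suggested name from step 2
EXTERNAL_ID = "<customer-dreamdata-account-id>"      # your Dreamdata account ID
BUCKET = "<S3_BUCKET>"                               # bucket from step 1
DREAMDATA_ROLE = "arn:aws:iam::485801740390:role/dreamdata_destination_s3"

# Trust policy from step 7: only the dedicated Dreamdata role may assume this
# role, and only with the correct external ID.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DreamdataStsPolicy",
        "Effect": "Allow",
        "Principal": {"AWS": DREAMDATA_ROLE},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": EXTERNAL_ID}},
    }],
}

# Bucket permissions from step 4.
permission_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DreamdataDataPlatform",
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:PutObjectAcl"],
        "Resource": [f"arn:aws:s3:::{BUCKET}/*", f"arn:aws:s3:::{BUCKET}"],
    }],
}

iam = boto3.client("iam")

# Create the role with the restricted trust policy.
iam.create_role(
    RoleName=ROLE_NAME,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Allows Dreamdata to deliver Data Platform dumps to S3",
)

# Attach the bucket permissions as an inline policy.
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="DreamdataDataPlatform",
    PolicyDocument=json.dumps(permission_policy),
)

# Print the Role ARN needed in the Dreamdata app.
print(iam.get_role(RoleName=ROLE_NAME)["Role"]["Arn"])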

To finalise the setup, paste these values into the fields on the page Data Platform -> Data Access -> AWS S3:

  • the bucket name of the bucket created in step 1 (not the ARN)
  • the Role ARN that was created in step 2

How the Data looks

  • The different tables and their schemas are documented on dbdocs.io.
  • Each folder contains a complete dump of the table as gzip-compressed .parquet files.

The data will appear in the bucket using the following structure. If a sub-folder is specified, we will nest the files below under that folder; the following examples assume files are placed at the root level of the bucket:

receipt.json
2023-01-02T15:04/companies/companies_*.parquet.gz
2023-01-02T15:04/contacts/contacts_*.parquet.gz
2023-01-02T15:04/events/events_*.parquet.gz
2023-01-02T15:04/revenue/revenue_*.parquet.gz
2023-01-02T15:04/revenue_attribution/revenue_attribution_*.parquet.gz
2023-01-02T15:04/paid_ads/paid_ads_*.parquet.gz

Example:

2023-01-02T15:04/companies/companies_000000000000.parquet.gz
2023-01-02T15:04/companies/companies_000000000001.parquet.gz
...
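
To see which files a given dump produced for a table, you can list the objects under the table folder. A small sketch using boto3, where the bucket and folder are placeholders taken from the structure above:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# List every parquet file for one table in one dump folder.
prefix = "2023-01-02T15:04/companies/"  # placeholder folder from the example above
for page in paginator.paginate(Bucket="<S3_BUCKET>", Prefix=prefix):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])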

The timestamp indicates when the dump was performed. Following a complete dump, receipt.json is updated with the new data, so an S3 trigger can be configured on that file to fire on every update. The receipt.json file contains the following information:

{
  "timestamp": "2023-03-14T04:03:07.963883Z",
  "tables": {
    "companies": {
      "folder": "2023-03-14T04:03/companies",
      "total_file_count": 58
    },
    "contacts": {
      "folder": "2023-03-14T04:03/contacts",
      "total_file_count": 58
    },
    "events": {
      "folder": "2023-03-14T04:03/events",
      "total_file_count": 64
    },
    "paid_ads": {
      "folder": "2023-03-14T04:03/paid_ads",
      "total_file_count": 51
    },
    "revenue": {
      "folder": "2023-03-14T04:03/revenue",
      "total_file_count": 51
    },
    "revenue_attribution": {
      "folder": "2023-03-14T04:03/revenue_attribution",
      "total_file_count": 61
    }
  }
}

Each entry under tables can be used to automate loading of the data, for example with a Lambda function (or similar) that iterates over the entries and performs a load operation on each folder value, as sketched below.
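
As a rough illustration of that automation (the load step is a placeholder, since the actual load mechanism depends on your target warehouse), an S3-triggered Lambda reading receipt.json could look like this:

import json
import boto3

s3 = boto3.client("s3")


def load_folder(bucket, folder, table):
    # Placeholder: start a load of all parquet files under `folder` into the
    # corresponding table in your own warehouse (e.g. Redshift COPY or
    # Snowflake COPY INTO).
    print(f"loading {table} from s3://{bucket}/{folder}/")


def handler(event, context):
    # The S3 trigger is configured on receipt.json, so the bucket and key
    # arrive in the event payload.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]  # receipt.json

    receipt = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

    # Iterate over the tables listed in the receipt and load each folder.
    for table, info in receipt["tables"].items():
        load_folder(bucket, info["folder"], table)

    return {"tables_loaded": list(receipt["tables"])}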

Schedule

A full dump of each Data Platform table is created after each successful data modeling run, which is scheduled differently for paid and free accounts:

  • 4:00 AM UTC
  • 3:00 PM UTC

