Destination AWS S3
Destination AWS S3 is a secure way to receive a dump of the Data Platform in an S3 bucket in your own organization, using a dedicated IAM Role.
The Data Platform gives you access to a raw dump of the database tables that Dreamdata uses to build all of its insights. Having access to the raw data enables you to load it into any existing database platform that can read data from S3, such as Redshift or Snowflake, and also to create insights tailored to the specific needs of your organization.
The guide below helps you set up an `S3 bucket` and describes how to create an IAM Role that grants access to that bucket and allows Dreamdata to assume the role, so we can send the data to your organization.
Guide
The following steps describe how to create a role that allows Dreamdata to push data to an S3 bucket in your AWS organization in a safe and secure manner.
This guide covers the following 4 steps:
- create or select an S3 bucket as the destination of the data
- create a new role dedicated to this purpose and give it an identifying name, e.g. `dreamdata-data-platform`, that Dreamdata will assume control of when copying data
- create a trust relationship policy for the `dreamdata-data-platform` role that allows Dreamdata to `sts:AssumeRole` it and act on its behalf given a specified `externalId`
- create a policy for the `dreamdata-data-platform` role that grants access to the bucket and permissions to create folders and write files
Permissions
The person performing these steps must be able to perform the following actions in your AWS organization:
- create buckets and edit bucket policies
- create roles in your AWS organization
- edit trust and permission policies on roles
1. create an S3 bucket
Create a bucket inside your AWS organization (or select an existing bucket) and copy the name of the bucket, not the ARN. Bucket names are global, so be sure to give it a unique name when creating it, or creation might fail. Here is an official guide on how to do just that.
Dreamdata does not delete previous data dumps, so we recommend putting a storage lifecycle policy in place to limit the data size and make sure that the folder does not grow indefinitely. A value of 7 days is sensible, but it can vary depending on your use case.
We will use the name <S3_BUCKET>
for the rest of this guide as a placeholder for the bucket created in this step.
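As a sketch, the 7-day lifecycle recommendation above could be expressed with the following S3 lifecycle configuration (the rule ID is a placeholder, and the empty prefix applies the rule to the whole bucket - narrow it if you share the bucket with other data):
{
    "Rules": [
        {
            "ID": "expire-dreamdata-dumps",
            "Status": "Enabled",
            "Filter": { "Prefix": "" },
            "Expiration": { "Days": 7 }
        }
    ]
}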
2. create a new role
Dreamdata uses a dedicated role to perform the `sts:AssumeRole` call. The ARN of this role is:
arn:aws:iam::485801740390:role/dreamdata_destination_s3
Under IAM > Roles, click the Create role button and perform the following steps:
- Trusted Entity Type: select "AWS account"
- An AWS account: select "Another AWS account" and enter 485801740390 (the Dreamdata organization ID)
- Options: select `"Require external ID (Best practice when a third party will assume this role)"` and enter your Dreamdata Account ID; using an External ID guards against the confused deputy problem
- In the next step, create the following policy and select it in the list - remember to replace <S3_BUCKET> with the actual bucket from step 1:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DreamdataDataPlatform",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::<S3_BUCKET>/*",
                "arn:aws:s3:::<S3_BUCKET>"
            ]
        }
    ]
}
- Click the Next button
- Specify the name of the role on the next screen and optionally a description and click Create
- The role is now created. Find the newly created role and select it. We now want to limit the trust policy even further, by making sure that only the dedicated Dreamdata role arn:aws:iam::485801740390:role/dreamdata_destination_s3 can assume this role. To do that, we update the Principal in the policy like so:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DreamdataStsPolicy",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::485801740390:role/dreamdata_destination_s3"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
                    "sts:ExternalId": "<customer-dreamdata-account-id>"
}
}
}
]
}
Replace <customer-dreamdata-account-id> with your account ID in the Dreamdata system.
- Conditional step: when using KMS encryption on the bucket, the role also needs permissions to GenerateDataKey and Decrypt on the bucket's KMS key. Either add these permissions to the above policy, or attach a new policy to the role containing the following:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "KMSEncryption",
"Effect": "Allow",
"Action": [
"kms:GenerateDataKey",
"kms:Decrypt"
],
"Resource": "<YOUR_KMS_KEY_ARN>"
}
]
}
These are all the required steps.
To finalise the setup, paste these values into the fields on the page Data Platform → Data Access → AWS S3:
- the bucket name of the bucket created in step 1 (not the ARN)
- the Role ARN that was created in step 2
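As a hypothetical local sanity check (not part of the official setup), you can verify that the two values have the expected shape before pasting them - bucket names must follow S3's naming rules, and the role ARN follows the standard IAM role ARN format:

```python
import re

# S3 bucket names: 3-63 chars, lowercase letters, digits, dots and hyphens,
# starting and ending with a letter or digit (a simplified version of the rules).
BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")

# IAM role ARN: arn:aws:iam::<12-digit-account-id>:role/<role-name>
ROLE_ARN_RE = re.compile(r"^arn:aws:iam::\d{12}:role/[\w+=,.@-]+$")

def looks_like_bucket_name(value: str) -> bool:
    return bool(BUCKET_RE.match(value))

def looks_like_role_arn(value: str) -> bool:
    return bool(ROLE_ARN_RE.match(value))
```

For example, `looks_like_role_arn("arn:aws:iam::123456789012:role/dreamdata-data-platform")` returns `True`, while passing a bucket ARN by mistake returns `False`.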
How the Data looks
- The different tables and their schemas are documented on dbdocs.io.
- Each folder contains a complete dump of the table in the .parquet format.
The data will appear in the bucket using the following structure. If a sub-folder is specified, the files below are nested under that folder; the following examples assume files are placed at the root level of the bucket:
receipt.json
2023-01-02T15:04/companies/companies_*.parquet.gz
2023-01-02T15:04/contacts/contacts_*.parquet.gz
2023-01-02T15:04/events/events_*.parquet.gz
2023-01-02T15:04/revenue/revenue_*.parquet.gz
2023-01-02T15:04/revenue_attribution/revenue_attribution_*.parquet.gz
2023-01-02T15:04/paid_ads/paid_ads_*.parquet.gz
Example:
2023-01-02T15:04/companies/companies_000000000000.parquet.gz
2023-01-02T15:04/companies/companies_000000000001.parquet.gz
...
The timestamp indicates when the dump was performed. Following a complete dump, the receipt.json is updated with the new data, so that an S3 trigger can be configured to fire on every update of that file. The receipt.json file contains the following information:
{
"timestamp": "2023-03-14T04:03:07.963883Z",
"tables": {
"companies": {
"folder": "2023-03-14T04:03/companies",
"total_file_count": 58
},
"contacts": {
"folder": "2023-03-14T04:03/contacts",
"total_file_count": 58
},
"events": {
"folder": "2023-03-14T04:03/events",
"total_file_count": 64
},
"paid_ads": {
"folder": "2023-03-14T04:03/paid_ads",
"total_file_count": 51
},
"revenue": {
"folder": "2023-03-14T04:03/revenue",
"total_file_count": 51
},
"revenue_attribution": {
"folder": "2023-03-14T04:03/revenue_attribution",
"total_file_count": 61
}
}
}
Each entry in the tables object can be used to automate the loading of the data, e.g. with a Lambda function or similar that iterates over the entries and performs a load operation on each folder value.
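As a minimal sketch of that iteration (the load step is left as a comment, since it depends entirely on your target platform):

```python
import json

def iter_table_folders(receipt_text: str):
    """Yield (table_name, folder, total_file_count) for each table in a receipt.json."""
    receipt = json.loads(receipt_text)
    for table, info in receipt["tables"].items():
        yield table, info["folder"], info["total_file_count"]

# Example with a trimmed-down receipt:
receipt_text = """
{
  "timestamp": "2023-03-14T04:03:07.963883Z",
  "tables": {
    "companies": {"folder": "2023-03-14T04:03/companies", "total_file_count": 58},
    "events": {"folder": "2023-03-14T04:03/events", "total_file_count": 64}
  }
}
"""

for table, folder, count in iter_table_folders(receipt_text):
    # A real loader would run e.g. a COPY from s3://<S3_BUCKET>/<folder> here.
    print(f"{table}: {count} files under {folder}")
```

The same function body would work inside a Lambda handler triggered by the receipt.json update, after fetching the object from S3.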
Schedule
A full dump of each Data Platform table is created after each successful data modeling run, which is scheduled differently for paid and free accounts:
- 4:00 AM UTC
- 3:00 PM UTC