Connect your Dreamdata data to Amazon Redshift using AWS Glue
# Integrate Dreamdata with AWS Redshift
This guide shows how to integrate Dreamdata with your AWS Redshift cluster. The goal is to be able to query Dreamdata's data directly from your AWS Redshift cluster.
- [Integration with AWS Glue](#integration-with-aws-glue)
## Pre-requisites
AWS:
- AWS account ID
- AWS redshift cluster
GCP:
- GCP service account in a project with Cloud Billing enabled
Dreamdata:
- Dreamdata [GCS](https://cloud.google.com/storage) enabled ([How to](https://docs.dreamdata.io/article/xyxuyo080a-google-cloud-storage))
- Give your GCP service account permissions to access Dreamdata's bucket
Reminder: don't forget to sign in to AWS and GCP from your terminal:
```bash
aws configure
gcloud auth login --update-adc
```
If you have all the pre-requisites in place, you should be able to fill in the following variables, which are used throughout the Redshift integration:
```bash
aws_account_id=
redshift_cluster_id=
service_account=
gcs_name=
s3_name= # name of the s3 bucket you will create below
```
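Before moving on, a quick sanity check can save a confusing failure later. A minimal sketch (`require_vars` is a hypothetical helper, not part of any CLI):

```shell
# Fail fast if any of the required variables is still empty.
require_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      return 1
    fi
  done
}

require_vars aws_account_id redshift_cluster_id service_account gcs_name \
  || echo "some variables are still empty"
```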
### Integration with AWS Glue
This solution relies on [AWS Glue](https://docs.aws.amazon.com/glue/index.html). AWS Glue is a service that can act as a middle layer between an [AWS s3](https://docs.aws.amazon.com/s3/index.html) bucket and your AWS Redshift cluster.
#### Steps
- [Pre-requisites](#pre-requisites)
- [Transfer to s3 bucket](#transfer-to-s3-bucket)
- [Configure AWS Glue](#configure-aws-glue)
- [Run AWS Glue crawler](#run-aws-glue-crawler)
- [Configure AWS Redshift](#configure-aws-redshift)
- [Query from AWS Redshift](#query-from-aws-redshift)
#### Transfer to s3 bucket
Let's create an AWS s3 bucket that will store all the data currently hosted in Dreamdata's GCS bucket.
Create your s3 bucket (note that outside `us-east-1` you must also pass `--create-bucket-configuration LocationConstraint=<your-region>`):
```bash
aws s3api create-bucket --bucket ${s3_name}
```
Transfer all the data from Dreamdata's GCS bucket to your newly created s3 bucket:
```bash
gsutil -i ${service_account} -m rsync -rd "gs://${gcs_name}" "s3://${s3_name}"
```
For the transfer to work, the `service_account` in question must:
- Have access to dreamdata's bucket ([more info here](https://docs.dreamdata.io/article/xyxuyo080a-google-cloud-storage))
- Be attached to a GCP Project with Cloud Billing enabled
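In addition, `gsutil` needs AWS credentials to write to the `s3://` destination; it typically reads them from your boto configuration file (e.g. `~/.boto`). A minimal fragment (the key values are placeholders):
```
[Credentials]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>
```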
#### Configure AWS Glue
We will use [AWS Glue](https://docs.aws.amazon.com/glue/index.html) as a layer between the `s3 bucket` hosting the data and your `redshift cluster`.
Firstly, create a role that can be assumed by AWS Glue:
glue-role.json
```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "glue.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
```
```bash
aws iam create-role \
--role-name dd-glue \
--assume-role-policy-document file://glue-role.json
```
Then, attach the policies:
```bash
aws iam attach-role-policy \
--role-name dd-glue \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
aws iam attach-role-policy \
--role-name dd-glue \
--policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
```
With the role in place, create an `AWS Glue database`:
```bash
aws glue create-database \
--database-input "{\"Name\":\"dd-glue-database\"}"
```
Finally, create the `AWS Glue crawler`, which will populate the database with tables inferred from the data in the s3 bucket.
Note that the crawler needs to know which tables you want to extract from the data in your s3 bucket. The example below creates a crawler that extracts the tables `contacts`, `companies`, and `events`.
```bash
aws glue create-crawler \
--name dd-crawler \
--role dd-glue \
--database-name dd-glue-database \
--targets "{\"S3Targets\": [{\"Path\": \"s3://${s3_name}/contacts\"}, {\"Path\": \"s3://${s3_name}/companies\"}, {\"Path\": \"s3://${s3_name}/events\"}]}"
```
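If the escaped inline JSON above is awkward to maintain, the same `--targets` payload can be built up in the shell first. A sketch, using a placeholder bucket name and the three example tables:

```shell
# Build the S3Targets JSON for the example tables.
s3_name="my-dd-bucket"   # placeholder; use your own bucket name

targets='{"S3Targets": ['
for table in contacts companies events; do
  targets="${targets}{\"Path\": \"s3://${s3_name}/${table}\"}, "
done
targets="${targets%, }]}"   # drop the trailing comma and close the JSON

echo "$targets"
```

The result can then be passed as `--targets "$targets"` to `aws glue create-crawler`.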
#### Run AWS Glue crawler
At this point, AWS Glue is ready to crawl the data hosted in your s3 bucket and catalog it in its database. That same database is then going to be queried by AWS Redshift.
To populate the newly created AWS Glue database with fresh data, do:
```bash
aws glue start-crawler \
--name dd-crawler
```
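A crawler run is asynchronous, so you may want to wait until it finishes before querying. A small polling sketch (`wait_for_crawler` is a hypothetical helper; the `aws glue get-crawler` call in the comment is the real CLI command that prints the crawler state):

```shell
# Poll until the crawler reports READY again (it is RUNNING while crawling).
# $1: any command that prints the crawler state on stdout.
wait_for_crawler() {
  while state=$($1); [ "$state" != "READY" ]; do
    echo "crawler state: ${state}, waiting..."
    sleep 10
  done
}

# Real usage would be something like:
# wait_for_crawler "aws glue get-crawler --name dd-crawler --query Crawler.State --output text"
```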
Remember that you'll need to run the command above every time you want fresh data in the AWS Glue database. To automate this, see the [scheduling AWS Glue crawlers](https://docs.aws.amazon.com/glue/latest/dg/monitor-data-warehouse-schedule.html) documentation.
#### Configure AWS Redshift
To query the data now located in the AWS Glue database from your AWS Redshift cluster, we will use [AWS Redshift Spectrum](https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html). Note that AWS Redshift Spectrum comes out of the box with AWS Redshift.
Let's add the right permissions.
Create a role that will be assumed by your AWS Redshift cluster:
redshift-role.json
```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "redshift.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
```
```bash
aws iam create-role \
--role-name dd-redshift \
--assume-role-policy-document file://redshift-role.json
```
Attach [minimum needed policies](https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-iam-policies.html#spectrum-iam-policies-s3),
spectrum-policy.json
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListMultipartUploadParts",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": [
        "arn:aws:s3:::${s3_name}",
        "arn:aws:s3:::${s3_name}/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateDatabase",
        "glue:DeleteDatabase",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:UpdateDatabase",
        "glue:CreateTable",
        "glue:DeleteTable",
        "glue:BatchDeleteTable",
        "glue:UpdateTable",
        "glue:GetTable",
        "glue:GetTables",
        "glue:BatchCreatePartition",
        "glue:CreatePartition",
        "glue:DeletePartition",
        "glue:BatchDeletePartition",
        "glue:UpdatePartition",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```
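Note that the shell does not expand `${s3_name}` inside a hand-written JSON file, so the `Resource` entries above must contain your actual bucket name. One way is to write the file from a heredoc, which does expand variables; a minimal sketch showing only the `Resource` value (`my-dd-bucket` is a placeholder):

```shell
# Heredocs expand shell variables, unlike a plain .json file on disk.
s3_name="my-dd-bucket"   # placeholder; use your own bucket name

cat > s3-resource.json <<EOF
{"Resource": ["arn:aws:s3:::${s3_name}", "arn:aws:s3:::${s3_name}/*"]}
EOF
```

The same approach can wrap the full `spectrum-policy.json` above.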
```bash
aws iam create-policy \
--policy-name dd-spectrum-redshift \
--policy-document file://spectrum-policy.json
aws iam attach-role-policy \
--role-name dd-redshift \
--policy-arn "arn:aws:iam::${aws_account_id}:policy/dd-spectrum-redshift"
```
And, finally, add the new role to the list of IAM roles that can be used within your cluster:
```bash
aws redshift modify-cluster-iam-roles \
--cluster-identifier ${redshift_cluster_id} \
--add-iam-roles "arn:aws:iam::${aws_account_id}:role/dd-redshift"
```
#### Query from AWS Redshift
We are in a good place now. The only thing left is to actually query the data from your AWS Redshift cluster. To do so, open your preferred AWS Redshift query editor.
Create an [external schema for Amazon Redshift Spectrum](https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-schemas.html),
```sql
create external schema spectrum_schema from data catalog
database 'dd-glue-database'
iam_role 'arn:aws:iam::${aws_account_id}:role/dd-redshift'
create external database if not exists;
```
Finally, run your queries. As an example, let's count the `contacts` we have in the `dd-glue-database`:
```sql
SELECT count(*) FROM spectrum_schema.contacts;
```