Connect your Dreamdata data to Amazon Redshift using AWS Glue

Steen Voersaa Updated by Steen Voersaa

What

# Integrate Dreamdata with AWS Redshift

This report shows how to integrate our Dreamdata product with your AWS Redshift cluster. The goal is to be able to query Dreamdata's data from your AWS Redshift cluster.

- [Integration with AWS Glue](#Integration-with-AWS-Glue)

## Pre-requisites

AWS:

- AWS account ID

- AWS redshift cluster

GCP:

- GCP service account in a project with Cloud Billing enbaled

Dreamdata:

- Dreamdata [GCS](https://cloud.google.com/storage) enabled ([How to](https://docs.dreamdata.io/article/xyxuyo080a-google-cloud-storage))

- Give your GCP service account permissions to access dreamdata's bucket

Reminder, don't forget to sign in to AWS and GCP form your terminal

```

aws configure

gcloud auth login --update-adc

```

If you have all the pre-requisites done, you should be able to fill the following variables (which will be used throughout the redshift integration)

```bash

aws_account_id=

redshift_cluster_id=

service_account=

gcs_name=

```

### Integration with AWS Glue

This solution relies on [AWS Glue](https://docs.aws.amazon.com/glue/index.html). AWS Glue is a service that can act as a middle layer between an [AWS s3](https://docs.aws.amazon.com/s3/index.html) bucket and your AWS Redshift cluster.

#### Steps

- [Pre-requisites](#pre-requisites)

- [Transfer bucket](#transfer-to-s3-bucket)

- [Configure AWS Glue](#configure-aws-glue)

- [Run AWS Glue crawler](#run-aws-glue-crawler)

- [Configure AWS Redshift](#configure-aws-redshift)

- [Query from AWS Redshift](#query-from-aws-redshift)

#### Transfer to s3 bucket

Let's create an AWS s3 bucket that will store all the data that is currently located in Dreamdata's premises.

Create your s3 bucket:

```bash

aws s3api create-bucket --bucket ${s3_name}

```

Transfer all the data from dreamdata GCS to your newly created s3 bucket.

```

gsutil -i ${service_account} -m rsync -rd "gs://${gcs_name}" "s3://${s3_name}"

```

Notice that in order for the transfer to work, the `service_account` in question must:

- Have access to dreamdata's bucket ([more info here](https://docs.dreamdata.io/article/xyxuyo080a-google-cloud-storage))

- Be attached to a GCP Project with Cloud Billing enabled

#### Configure AWS Glue

We will use [AWS Glue](https://docs.aws.amazon.com/glue/index.html) as a layer in between your `s3 bucket` hosting the data and your `redshift cluster`.

Firstly, create a role that can be assumed by AWS Glue:

glue-role.json

```json

{

"Version": "2012-10-17",

"Statement": [{

"Effect": "Allow",

"Principal": {

"Service": "glue.amazonaws.com"

},

"Action": "sts:AssumeRole"

}]

}

```

```bash

aws iam create-role \

--role-name dd-glue \

--assume-role-policy-document file://glue-role.json

```

Then, attach the policies

```

aws iam attach-role-policy \

--role-name dd-glue \

--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

aws iam attach-role-policy \

--role-name dd-glue \

--policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess

```

Having the role populated with the right permissions, create an `AWS Glue database`

```bash

aws glue create-database \

--database-input "{\"Name\":\"dd-glue-database\"}"

```

Finally, the `AWS Glue crawler` which will populate the database with the data from the s3 bucket.

*Note that the crawler needs to know which tables you want to extract from the data that is currently located in your s3 bucket. In this example, find a crawler that extracts the tables `contacts`, `companies`, and `events`.

```bash

aws glue create-crawler \

--name dd-crawler \

--role dd-glue \

--database-name dd-glue-database \

--targets "{\"S3Targets\": [{\"Path\": \"s3://${s3_name}/contacts\"}, {\"Path\": \"s3://${s3_name}/companies\"}, {\"Path\": \"s3://${s3_name}/events\"}]}"

```

#### Run AWS Glue crawler

At this point, AWS Glue is ready to deposit the data that is hosted in your s3 bucket in its database. This same database is then going to be queried by AWS Redshift.

To populate the newly created AWS Glue database with fresh data, do:

```bash

aws glue start-crawler \

--name dd-crawler

```

Remember that you'll need to trigger the command just above every time you want to have fresh data in the AWS Glue database. Otherwise, see [trigger crawlers with crons](https://docs.aws.amazon.com/glue/latest/dg/monitor-data-warehouse-schedule.html) documentation.

#### Configure AWS Redshift

In order to do queries from your AWS Redshift cluster to the data now located in AWS Glue database, we will use [AWS Redshift Spectrum](https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html). Note that AWS Redshift Spectrum does come out of the box with AWS Redshift.

Let's add the right permissions.

Create a role that will be assumed by your AWS Redshift cluster:

redshift-role.json

```json

{

"Version": "2012-10-17",

"Statement": [{

"Effect": "Allow",

"Principal": {

"Service": "redshift.amazonaws.com"

},

"Action": "sts:AssumeRole"

}]

}

```

```bash

aws iam create-role \

--role-name dd-redshift \

--assume-role-policy-document file://redshift-role.json

```

Attach [minimum needed policies](https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-iam-policies.html#spectrum-iam-policies-s3),

spectrum-policy.json

```json

{

"Version": "2012-10-17",

"Statement": [

{

"Effect": "Allow",

"Action": [

"s3:GetBucketLocation",

"s3:GetObject",

"s3:ListMultipartUploadParts",

"s3:ListBucket",

"s3:ListBucketMultipartUploads"

],

"Resource": [

"arn:aws:s3:::${s3_name}"

]

},

{

"Effect": "Allow",

"Action": [

"glue:CreateDatabase",

"glue:DeleteDatabase",

"glue:GetDatabase",

"glue:GetDatabases",

"glue:UpdateDatabase",

"glue:CreateTable",

"glue:DeleteTable",

"glue:BatchDeleteTable",

"glue:UpdateTable",

"glue:GetTable",

"glue:GetTables",

"glue:BatchCreatePartition",

"glue:CreatePartition",

"glue:DeletePartition",

"glue:BatchDeletePartition",

"glue:UpdatePartition",

"glue:GetPartition",

"glue:GetPartitions",

"glue:BatchGetPartition"

],

"Resource": [

"*"

]

}

]

}

```

```bash

aws iam create-policy \

--policy-name dd-spectrum-redshift \

--policy-document file://spectrum-policy.json

aws iam attach-role-policy \

--role-name dd-redshift \

--policy-arn "arn:aws:iam::${aws_account_id}:policy/dd-spectrum-redshift"

```

And, finally, add the new role to the list of roles that can be used within your cluster

```bash

aws redshift modify-cluster-iam-roles \

--cluster-identifier ${redshift_cluster_id} \

--add-iam-roles "arn:aws:iam::${aws_account_id}:role/dd-redshift"

```

#### Query from AWS Redshift

We are in a good place now. The only thing left is to atually query the data form your AWS Redshift Cluster. To do so, open your preferred AWS Redshift query editor.

Create an [external schema for Amazon Redshift Spectrum](https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-schemas.html),

```sql

create external schema spectrum_schema from data catalog

database 'dd-redshift-database'

iam_role 'arn:aws:iam::${aws_account_id}:role/dd-redshift'

create external database if not exists;

```

Finally, do your queries. As an example, let's count the `contacts` we have in the `dd-redshift-database`

```sql

SELECT count(*) FROM spectrum_schema.contacts;

```

How did we do?

Contact