Connect to AWS Redshift using AWS Glue

This document describes how to integrate your Dreamdata data with your AWS Redshift cluster. Once connected, you can run your own queries on our data models, as well as copy, manipulate, join and use the data within other tools connected to Redshift.

This solution relies on AWS Glue, a service that acts as a middle layer between an AWS S3 bucket and your AWS Redshift cluster.

Steps

  • Prerequisites
  • Transfer to S3 bucket
  • Configure AWS Glue
  • Run AWS Glue crawler
  • Configure AWS Redshift
  • Query from AWS Redshift

Prerequisites

AWS

  • AWS account ID
  • AWS Redshift cluster

Google Cloud Platform

  • GCP Service Account in a project with Cloud Billing enabled

Dreamdata

  • Dreamdata Google Cloud Storage destination enabled
  • Grant your GCP service account permission to access Dreamdata's Google Cloud Storage bucket. Learn more here.

With all prerequisites in place, you should be able to fill in the following variables, which will be used throughout the integration (s3_name is the name of the S3 bucket you will create in the next step)

aws_account_id=
redshift_cluster_id=
service_account=
gcs_name=
s3_name=
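
For example, with purely hypothetical values (yours will differ):

aws_account_id=123456789012
redshift_cluster_id=my-redshift-cluster
service_account=dreamdata-reader@my-project.iam.gserviceaccount.com
gcs_name=dreamdata-gcs-export
s3_name=my-dreamdata-data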

Don't forget to sign in to AWS and GCP from your terminal
aws configure
gcloud auth login --update-adc

Transfer to S3 bucket

First, create an AWS S3 bucket to store the data, which Dreamdata currently hosts on BigQuery and exports to Google Cloud Storage.

Create your S3 bucket

aws s3api create-bucket --bucket ${s3_name}
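
If your bucket should live outside us-east-1, s3api requires an explicit location constraint. A sketch, assuming the eu-west-1 region (swap in your own):

aws s3api create-bucket \
  --bucket ${s3_name} \
  --region eu-west-1 \
  --create-bucket-configuration LocationConstraint=eu-west-1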

Transfer all the data from Dreamdata's Google Cloud Storage to your newly created S3 bucket

gsutil -i ${service_account} -m rsync -rd "gs://${gcs_name}" "s3://${s3_name}"

Note that for the transfer to work, the service account in question must have access to Dreamdata's bucket (see the prerequisites) and be attached to a Google Cloud Platform project with Cloud Billing enabled.
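
Note also that gsutil authenticates to the s3:// side through its boto configuration rather than the AWS CLI credentials file. A minimal sketch of the relevant ~/.boto section, with placeholder values:

# ~/.boto: placeholder AWS keys that gsutil uses for the s3:// side
[Credentials]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>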

Configure AWS Glue

AWS Glue will act as a layer between your AWS S3 bucket, which currently hosts the data, and your AWS Redshift cluster. We will define an AWS Glue database that can be queried from AWS Redshift, and, to move the data from the S3 bucket into that database, we will use an AWS Glue crawler.

First, create a role that can be assumed by AWS Glue

glue-role.json

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "glue.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

aws iam create-role \
  --role-name dd-glue \
  --assume-role-policy-document file://glue-role.json

Then, attach the following policies

aws iam attach-role-policy \
  --role-name dd-glue \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

aws iam attach-role-policy \
  --role-name dd-glue \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
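
Optionally, verify that both policies landed on the role:

aws iam list-attached-role-policies --role-name dd-glue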

Create an AWS Glue database

aws glue create-database \
  --database-input "{\"Name\":\"dd-glue-database\"}"
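
Optionally, confirm the database now exists in the Glue catalog:

aws glue get-database --name dd-glue-database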

Finally, create an AWS Glue crawler

aws glue create-crawler \
  --name dd-crawler \
  --role dd-glue \
  --database-name dd-glue-database \
  --targets "{\"S3Targets\": [{\"Path\": \"s3://${s3_name}/contacts\"}, {\"Path\": \"s3://${s3_name}/companies\"}, {\"Path\": \"s3://${s3_name}/events\"}]}"

Note that the crawler needs to know which tables to extract from the data in your S3 bucket. The crawler above extracts the tables contacts, companies, and events.

Run AWS Glue crawler

At this point, the AWS Glue crawler is ready to catalogue the data hosted in your S3 bucket into the AWS Glue database.

To populate the newly created AWS Glue database with fresh data

aws glue start-crawler \
  --name dd-crawler

Remember that you'll need to run the command above every time you want fresh data in the AWS Glue database. Alternatively, see the AWS documentation on scheduling crawlers with cron expressions.
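
As a sketch of the scheduled route, Glue crawlers accept a cron expression directly; the schedule below (daily at 06:00 UTC) is just an example:

aws glue update-crawler \
  --name dd-crawler \
  --schedule "cron(0 6 * * ? *)"

# Check the crawler's state (READY, RUNNING or STOPPING)
aws glue get-crawler --name dd-crawler --query Crawler.State

# List the tables catalogued so far
aws glue get-tables --database-name dd-glue-database --query "TableList[].Name"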

Configure AWS Redshift

In order to run queries from your AWS Redshift cluster against the data now catalogued in the AWS Glue database, you will use Amazon Redshift Spectrum, which is already part of your AWS Redshift cluster.

Create a role that will be assumed by your AWS Redshift cluster

redshift-role.json

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "redshift.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

aws iam create-role \
  --role-name dd-redshift \
  --assume-role-policy-document file://redshift-role.json

Attach the minimum required policies

spectrum-policy.json

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListMultipartUploadParts",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": [
        "arn:aws:s3:::${s3_name}",
        "arn:aws:s3:::${s3_name}/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateDatabase",
        "glue:DeleteDatabase",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:UpdateDatabase",
        "glue:CreateTable",
        "glue:DeleteTable",
        "glue:BatchDeleteTable",
        "glue:UpdateTable",
        "glue:GetTable",
        "glue:GetTables",
        "glue:BatchCreatePartition",
        "glue:CreatePartition",
        "glue:DeletePartition",
        "glue:BatchDeletePartition",
        "glue:UpdatePartition",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
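
One caveat: the AWS CLI does not expand ${s3_name} inside a static JSON file, so substitute your bucket name before creating the policy. A minimal sketch using envsubst from GNU gettext, assuming s3_name is set in your shell:

# Render the shell variable into the policy document in place
export s3_name
envsubst '${s3_name}' < spectrum-policy.json > spectrum-policy.tmp && mv spectrum-policy.tmp spectrum-policy.json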

aws iam create-policy \
  --policy-name dd-spectrum-redshift \
  --policy-document file://spectrum-policy.json

aws iam attach-role-policy \
  --role-name dd-redshift \
  --policy-arn "arn:aws:iam::${aws_account_id}:policy/dd-spectrum-redshift"

And, finally, add the new role to the list of roles that can be used within your cluster

aws redshift modify-cluster-iam-roles \
  --cluster-identifier ${redshift_cluster_id} \
  --add-iam-roles "arn:aws:iam::${aws_account_id}:role/dd-redshift"
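
Optionally, confirm the role is now associated with the cluster:

aws redshift describe-clusters \
  --cluster-identifier ${redshift_cluster_id} \
  --query "Clusters[0].IamRoles"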

Query from AWS Redshift

Almost done. The only thing left before you can query the data from your AWS Redshift cluster is to create an external schema for Amazon Redshift Spectrum. To do so, open your preferred AWS Redshift query editor.

Create an external schema for Amazon Redshift Spectrum

create external schema spectrum_schema from data catalog
database 'dd-glue-database'
iam_role 'arn:aws:iam::${aws_account_id}:role/dd-redshift'
create external database if not exists;

Note that the database name must match the AWS Glue database created earlier, and that the query editor will not expand ${aws_account_id}, so replace it with your actual AWS account ID.
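
To sanity-check the schema, you can list the external tables Spectrum now sees via the svv_external_tables system view:

select schemaname, tablename
from svv_external_tables
where schemaname = 'spectrum_schema';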

You are now ready to run queries. As an example, try counting the contacts

SELECT count(*) FROM spectrum_schema.contacts;
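
To see which columns a table exposes before writing richer queries, query the svv_external_columns system view, for example for contacts:

select columnname, external_type
from svv_external_columns
where schemaname = 'spectrum_schema'
  and tablename = 'contacts';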

