Connect to AWS Redshift using AWS Glue
This document describes how to integrate your Dreamdata data with your AWS Redshift cluster. Once connected, you can run your own queries on our data models, as well as copy, manipulate, join and use the data within other tools connected to Redshift.
This solution relies on AWS Glue. AWS Glue is a service that can act as a middle layer between an AWS s3 bucket and your AWS Redshift cluster.
Steps
- Pre-requisites
- Transfer to s3 bucket
- Configure AWS Glue
- Run AWS Glue crawler
- Configure AWS Redshift
- Query from AWS Redshift
Pre-requisites
AWS
- AWS account ID
- AWS Redshift cluster
Google Cloud Platform
- GCP Service Account in a project with Cloud Billing enabled
Dreamdata
- Dreamdata Google Cloud Storage destination enabled
- Give your Google Cloud Storage service account permissions to access Dreamdata's Google Cloud Storage bucket. Learn more here.
With all pre-requisites in place, you should be able to fill in the following variables, which will be used throughout the integration:
aws_account_id=
redshift_cluster_id=
service_account=
gcs_name=
s3_name=
Then authenticate both the AWS and Google Cloud CLIs:
aws configure
gcloud auth login --update-adc
Transfer to s3 bucket
First, create an AWS s3 bucket to store the data, which currently resides in Dreamdata's Google Cloud Storage.
Create your s3 bucket
aws s3api create-bucket --bucket ${s3_name}
Transfer all the data from Dreamdata's Google Cloud Storage to your newly created s3 bucket
gsutil -i ${service_account} -m rsync -rd "gs://${gcs_name}" "s3://${s3_name}"
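As an optional sanity check, you can compare the object counts on both sides after the sync; this is a sketch using standard `gsutil` and AWS CLI listing commands (both counts should match):

```
gsutil ls -r "gs://${gcs_name}/**" | wc -l
aws s3 ls "s3://${s3_name}" --recursive | wc -l
```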
The service_account in question must have access to Dreamdata's bucket (see Pre-requisites) and be attached to a Google Cloud Platform project with Cloud Billing enabled.
Configure AWS Glue
AWS Glue will act as a layer between your AWS s3 bucket, which currently hosts the data, and your AWS Redshift cluster. We will define an AWS Glue database that can be queried from AWS Redshift. To move the data from the s3 bucket into the newly created AWS Glue database, we will use an AWS Glue crawler.
First, create a role that can be assumed by AWS Glue
glue-role.json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "glue.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
aws iam create-role \
--role-name dd-glue \
--assume-role-policy-document file://glue-role.json
Then, attach the following policies
aws iam attach-role-policy \
--role-name dd-glue \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
aws iam attach-role-policy \
--role-name dd-glue \
--policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
Create an AWS Glue database
aws glue create-database \
--database-input "{\"Name\":\"dd-glue-database\"}"
Finally, create an AWS Glue crawler
aws glue create-crawler \
--name dd-crawler \
--role dd-glue \
--database-name dd-glue-database \
--targets "{\"S3Targets\": [{\"Path\": \"s3://${s3_name}/contacts\"}, {\"Path\": \"s3://${s3_name}/companies\"}, {\"Path\": \"s3://${s3_name}/events\"}]}"
The crawler targets the contacts, companies, and events paths in your s3 bucket; each becomes a table in the AWS Glue database.
Run AWS Glue crawler
At this point, AWS Glue is ready to catalog the data hosted in your s3 bucket into the AWS Glue database.
To populate the newly created AWS Glue database with fresh data, start the crawler
aws glue start-crawler \
--name dd-crawler
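The crawler runs asynchronously. If you want to wait for it to finish before moving on, one option (a sketch using the standard `get-crawler` and `get-tables` Glue CLI commands) is to poll its state until it returns to READY, then list the tables it created:

```
while [ "$(aws glue get-crawler --name dd-crawler \
            --query 'Crawler.State' --output text)" != "READY" ]; do
  sleep 15
done
aws glue get-tables --database-name dd-glue-database \
  --query 'TableList[].Name' --output text
```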
Configure AWS Redshift
In order to run queries from your AWS Redshift cluster against the data now located in the AWS Glue database, you will use AWS Redshift Spectrum, which is already part of your AWS Redshift cluster.
Create a role that will be assumed by your AWS Redshift cluster
redshift-role.json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "redshift.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
aws iam create-role \
--role-name dd-redshift \
--assume-role-policy-document file://redshift-role.json
Attach the minimum required policies,
spectrum-policy.json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListMultipartUploadParts",
"s3:ListBucket",
"s3:ListBucketMultipartUploads"
],
"Resource": [
"arn:aws:s3:::${s3_name}",
"arn:aws:s3:::${s3_name}/*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:CreateDatabase",
"glue:DeleteDatabase",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:UpdateDatabase",
"glue:CreateTable",
"glue:DeleteTable",
"glue:BatchDeleteTable",
"glue:UpdateTable",
"glue:GetTable",
"glue:GetTables",
"glue:BatchCreatePartition",
"glue:CreatePartition",
"glue:DeletePartition",
"glue:BatchDeletePartition",
"glue:UpdatePartition",
"glue:GetPartition",
"glue:GetPartitions",
"glue:BatchGetPartition"
],
"Resource": [
"*"
]
}
]
}
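Note that `${s3_name}` inside spectrum-policy.json is not expanded by the shell when the file is passed with file:// — the AWS CLI sends the file contents verbatim. Render the template into a concrete policy document first; a minimal sketch with sed (the bucket name and file paths here are placeholders):

```shell
# ${s3_name} inside a JSON policy file is NOT expanded by the AWS CLI;
# substitute it yourself before passing the file with file://.
s3_name=my-dd-bucket   # placeholder: use your actual bucket name

# A tiny stand-in template for spectrum-policy.json ...
cat > /tmp/spectrum-policy.tmpl.json <<'EOF'
{ "Resource": ["arn:aws:s3:::${s3_name}", "arn:aws:s3:::${s3_name}/*"] }
EOF

# ... rendered by replacing the literal ${s3_name} token with the real name.
sed "s/\${s3_name}/${s3_name}/g" /tmp/spectrum-policy.tmpl.json > /tmp/spectrum-policy.json
cat /tmp/spectrum-policy.json
```

Then pass the rendered file to `aws iam create-policy` instead of the raw template.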
aws iam create-policy \
--policy-name dd-spectrum-redshift \
--policy-document file://spectrum-policy.json
aws iam attach-role-policy \
--role-name dd-redshift \
--policy-arn "arn:aws:iam::${aws_account_id}:policy/dd-spectrum-redshift"
And, finally, add the new role to the list of roles that can be used within your cluster
aws redshift modify-cluster-iam-roles \
--cluster-identifier ${redshift_cluster_id} \
--add-iam-roles "arn:aws:iam::${aws_account_id}:role/dd-redshift"
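To confirm the role is now attached to the cluster, you can inspect the cluster's IAM roles; a sketch using the standard `describe-clusters` command:

```
aws redshift describe-clusters \
  --cluster-identifier ${redshift_cluster_id} \
  --query 'Clusters[0].IamRoles' --output json
```

The output should list the dd-redshift role ARN with an "in-sync" status before you proceed.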
Query from AWS Redshift
Almost done. The only thing left to query data from your AWS Redshift cluster is to create an external schema for AWS Redshift Spectrum. To do so, open your preferred AWS Redshift query editor.
Create an external schema for Amazon Redshift Spectrum
create external schema spectrum_schema from data catalog
database 'dd-glue-database'
iam_role 'arn:aws:iam::${aws_account_id}:role/dd-redshift'
create external database if not exists;
You are now ready to run queries. As an example, try counting the rows in the contacts table
SELECT count(*) FROM spectrum_schema.contacts;
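The data in your s3 bucket is a point-in-time copy, so re-run the sync and the crawler periodically to keep the tables fresh. A hypothetical cron sketch (the schedule is an assumption; adjust to your needs):

```
# Re-sync from Dreamdata's GCS at 02:00, then re-crawl at 03:00, daily.
0 2 * * * gsutil -i "${service_account}" -m rsync -rd "gs://${gcs_name}" "s3://${s3_name}"
0 3 * * * aws glue start-crawler --name dd-crawler
```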