Using AWS Data Pipeline to Export Microsoft SQL Server RDS data to an S3 Bucket

I spent the day figuring out how to export data from an AWS RDS instance running Microsoft SQL Server to an S3 bucket. Once it's in the S3 bucket, the data will go through Elastic MapReduce (EMR). The Data Pipeline templates provided by Amazon don't cover SQL Server, and there's a tricky part when creating the pipeline in Architect.

First, let's recap what Pipeline is:

> AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premise data sources, at specified intervals.

This saves me from writing separate applications to move all the pieces around, schedule them, handle error logging, and so on.

### Prepare the Data Sources

  1. First, we need a SQL Server instance running on RDS:
    Microsoft SQL Server Instance on RDS

  2. Download the Microsoft JDBC driver, as we'll need it to connect to the server.

  3. Set up your S3 bucket. You'll need a location for logging and one for the actual output.

### AWS Data Pipeline Edit in Architect

The Edit in Architect button appears when creating a custom pipeline. The pipeline can also be exported, which gives you a JSON definition of your pipeline that you can keep in version control. It looks similar to a CloudFormation template.
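For a rough idea of what that export looks like, here's a minimal sketch of the JSON format: every pipeline component becomes an entry in an `objects` array (the log URI and role names below are placeholders, not values from my pipeline):

```json
{
  "objects": [
    {
      "id": "Default",
      "name": "Default",
      "failureAndRerunMode": "CASCADE",
      "pipelineLogUri": "s3://my-bucket/logs/",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole"
    }
  ]
}
```

Each data node, activity, and resource you add in Architect shows up as another object in that array, referencing the others by `id`.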

Here's what my pipeline looks like:
SQL Server RDS To S3 Pipeline

  • user_data is the RDS SQL Server
  • export_data is an activity which does the exporting. An activity is a piece of work or processing.
  • emr_input is the S3 bucket

Notice on the right side that the two data sources are defined under DataNodes. We can name them whatever we like here; the names don't have to match the underlying tables. Notice that user_data has a select statement: this is the data that we want exported. By default, the activity will export to a CSV file. To get the Select Query option, you'll have to add it by clicking on Add an optional field...

The destination location is just a path to your S3 bucket and that's all that's needed.
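In the exported definition, that destination comes out as an S3DataNode carrying little more than the path (the bucket name here is a placeholder):

```json
{
  "id": "emr_input",
  "name": "emr_input",
  "type": "S3DataNode",
  "directoryPath": "s3://my-bucket/emr-input/"
}
```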

I just wanted to point out that the Table field for user_data should match the select query. You can either replace movie_rating in the select query with the ${Table} parameter, or just make the query match the Table field.
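Put together, the user_data node in the exported JSON looks roughly like this (rds_sqlserver is a placeholder id for the database object that holds the JDBC connection details):

```json
{
  "id": "user_data",
  "name": "user_data",
  "type": "SqlDataNode",
  "table": "movie_rating",
  "selectQuery": "select * from movie_rating",
  "database": { "ref": "rds_sqlserver" }
}
```

Note how the table name in the query agrees with the `table` field, as described above.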

After adding the user_data source, we need to tell Pipeline to use the Microsoft JDBC drivers to connect to it when running the query:

AWS Data Pipeline SQL Server JDBC configuration

The username and password are the master credentials you set when creating the SQL Server instance in RDS.

Important Note: Do not add the table name or the database as optional fields here.

The Amazon examples, such as the Redshift template, show that you can tack the database name onto the end of the JDBC connection string. The Microsoft JDBC driver uses a different style of connection string: semicolon-delimited properties that explicitly state database=yourdatabase within the string.

So if you think you can add another field with the database name and Pipeline will figure it out for you, you'd be wrong. If you add a Database field, Pipeline simply appends it to the end of your connection string, which makes the string invalid for the Microsoft driver.
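To make the difference concrete, here's a sketch of a JdbcDatabase object for the Microsoft driver (host, database, credentials, and the driver JAR location are all placeholders; the leading asterisk on `*password` is Data Pipeline's marker for a field it encrypts). The database is baked into the semicolon-delimited connection string itself, with no separate Database field:

```json
{
  "id": "rds_sqlserver",
  "name": "rds_sqlserver",
  "type": "JdbcDatabase",
  "connectionString": "jdbc:sqlserver://myinstance.rds.amazonaws.com:1433;database=mydatabase",
  "jdbcDriverClass": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "jdbcDriverJarUri": "s3://my-bucket/sqljdbc4.jar",
  "username": "myuser",
  "*password": "mypassword"
}
```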

### Define the Exporting Activity

When you add the Activity, you can choose the source and destination of the data flow. The arrows are drawn automatically in Architect.

An activity is really just an application in this case, and that application needs to run on a resource. We'll define that resource as an EC2 instance.

AWS Activity Resource EC2 instance
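In JSON form, that resource is an Ec2Resource along these lines (instance type, timeout, and role names are placeholders; `terminateAfter` keeps the instance from running forever if something hangs):

```json
{
  "id": "export_resource",
  "name": "export_resource",
  "type": "Ec2Resource",
  "instanceType": "t1.micro",
  "terminateAfter": "1 Hour",
  "role": "DataPipelineDefaultRole",
  "resourceRole": "DataPipelineDefaultResourceRole"
}
```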

And the Activity definition itself:

AWS Activity Configuration

The CopyActivity tells Pipeline that it's going to be copying data. Notice that we set the user_data RDS database as the input and the S3 bucket as the output.
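The corresponding JSON is short: the CopyActivity just wires the two data nodes together and names the EC2 resource it runs on (export_resource is a placeholder id for whatever you called the EC2 resource; adjust the refs to match your own node names):

```json
{
  "id": "export_data",
  "name": "export_data",
  "type": "CopyActivity",
  "input": { "ref": "user_data" },
  "output": { "ref": "emr_input" },
  "runsOn": { "ref": "export_resource" }
}
```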

That should be it. Activate the Pipeline, schedule it to run after activation, and you'll see the EC2 instances being created and proper logging of the tasks in the Pipeline.

Once the Pipeline has finished running, you'll see a CSV file with a random unique string for a name appear in the destination bucket.