Amazon S3 is a popular object storage service used to store both structured and unstructured data. You can leverage S3 to securely and cost-effectively build a data lake of any size or scale. With an S3-powered data lake, you can easily use the native AWS services for data processing, analytics, machine learning, and more.
RudderStack lets you configure S3 data lake as a destination to which you can send your event data seamlessly.
Configuring S3 Data Lake destination in RudderStack
To send event data to S3 Data Lake, you first need to add it as a destination in RudderStack and connect it to your data source. Once the destination is enabled, events will automatically start flowing to S3 Data Lake via RudderStack.
To configure S3 Data Lake as a destination in RudderStack, follow these steps:
- In your RudderStack dashboard, set up the data source. Then, select S3 Data Lake from the list of destinations.
- Assign a name to your destination and then click on Next.
Connection settings
Enter the following credentials on the Connection Credentials page:
- S3 Storage Bucket Name: The name of the S3 bucket that will be used to store the data before loading it into the S3 data lake.
- Register schema on AWS Glue: If you enable this option, RudderStack registers the schema of your incoming data on AWS Glue's data catalog.
- AWS Glue Region: Your AWS Glue region. For example, for N. Virginia, it would be us-east-1.
- S3 Prefix: If specified, RudderStack creates a folder in the bucket with this prefix and pushes all the data within that folder.
- Namespace: If specified, all the data for the destination will be pushed to s3://<bucketName>/<prefix>/rudder-datalake/<namespace>. If you don't specify a namespace in the settings, it is set to the source name by default.
- AWS Access Key ID: Your AWS access key ID.
- AWS Secret Access Key: Your AWS secret access key.
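If you enable the Register schema on AWS Glue option, make sure the AWS credentials have the following permissions:
- glue:CreateTable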
- glue:UpdateTable
- glue:CreateDatabase
- glue:GetTables
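As a rough illustration, the Glue permissions listed above could be collected into an IAM policy document like the one built below. This is only a sketch: the function name `build_glue_policy` is hypothetical, and the `"*"` resource scope is an assumption that you would likely narrow to your own Glue catalog, database, and table ARNs. It covers only the Glue actions named in this guide and does not address S3 bucket access.

```python
import json

# The Glue actions named in this guide for schema registration.
GLUE_ACTIONS = [
    "glue:CreateTable",
    "glue:UpdateTable",
    "glue:CreateDatabase",
    "glue:GetTables",
]


def build_glue_policy(resource: str = "*") -> dict:
    """Return an IAM policy document that allows the Glue actions above.

    The default resource "*" is an assumption made for brevity; restrict it
    to specific ARNs in a real setup.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(GLUE_ACTIONS),
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    # Print the policy as JSON so it can be pasted into the IAM console.
    print(json.dumps(build_glue_policy(), indent=2))
```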
Finding your data in S3 data lake
RudderStack converts your events into Parquet files and dumps them to the configured S3 bucket. Before dumping the events, RudderStack groups them by event name and by the UTC hour in which they were received.
The folder structure is shown below:
s3://<bucketName>/<prefix>/rudder-datalake/<namespace>/<tableName>/YYYY/MM/DD/HH
As mentioned in the Connection Settings section:
- prefix: This is the S3 prefix in the destination settings. If not specified, RudderStack will omit this value.
- namespace: The namespace specified in the destination settings. If not specified, RudderStack sets this field to the source name by default.
- tableName: RudderStack sets this to the event name.
- YYYY, MM, DD, and HH are replaced by actual time values. A combination of these values represents the UTC time.
Suppose that RudderStack tracks the following two events:
Event name | Timestamp |
---|---|
Product Purchased | "2019-10-12T08:40:50.52Z" UTC |
Cart Viewed | "2019-11-12T09:34:50.52Z" UTC |
RudderStack will convert these events into Parquet files and dump them into the following folders:
Event Name | Folder Location |
---|---|
Product Purchased | s3://<bucketName>/<prefix>/rudder-datalake/<namespace>/product_purchased/2019/10/12/08 |
Cart Viewed | s3://<bucketName>/<prefix>/rudder-datalake/<namespace>/cart_viewed/2019/11/12/09 |
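The mapping from an event to its folder can be sketched in a few lines of Python. The helper names `table_name` and `event_folder` are hypothetical, and the table-name rule used here (lowercase, words joined by underscores) is an assumption inferred from the two examples above, not a statement of RudderStack's exact naming logic.

```python
from datetime import datetime, timezone


def table_name(event_name: str) -> str:
    # Assumption: lowercase the event name and join its words with
    # underscores, which matches the two examples in this guide.
    return "_".join(event_name.lower().split())


def event_folder(bucket: str, namespace: str, event_name: str,
                 received_at: str, prefix: str = "") -> str:
    """Build the S3 folder for an event received at a UTC timestamp.

    `received_at` is expected in the form "2019-10-12T08:40:50.52Z".
    """
    ts = datetime.strptime(received_at, "%Y-%m-%dT%H:%M:%S.%fZ")
    ts = ts.replace(tzinfo=timezone.utc)

    parts = [bucket]
    if prefix:
        # The prefix segment is omitted entirely when none is configured.
        parts.append(prefix.strip("/"))
    parts += [
        "rudder-datalake",
        namespace,
        table_name(event_name),
        ts.strftime("%Y/%m/%d/%H"),
    ]
    return "s3://" + "/".join(parts)


if __name__ == "__main__":
    print(event_folder("testBucket", "testNameSpace", "Product Purchased",
                       "2019-10-12T08:40:50.52Z", prefix="testPrefix"))
```

Keeping the prefix optional mirrors the destination settings, where the prefix segment is dropped from the path when it is not specified.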
If AWS Glue is enabled, all the table definitions are created in a database with the name set to the namespace specified in the destination settings.
Creating a crawler
If you have not enabled AWS Glue in the destination settings, you can create a crawler to go through your data and create table definitions from it.
Follow these steps to create a crawler:
- Go to your AWS Glue console and select Crawler from the left pane.
- Select Add Crawler.
- Specify a name for your crawler and click Next.
- Next, under the Crawler source type section, choose Data stores.
- Configure the Repeat crawls of S3 data stores setting based on your requirements.
- Then, under the Data store section, select S3 from the dropdown for the Choose a data store setting.
- For the Crawl data in setting, choose the option Specified path in my account.
- In the Include path setting, enter the S3 URI of your configured bucket followed by the suffix /<prefix>/rudder-datalake/<namespace>/. For example, if your bucket name is testBucket and the configured prefix and namespace are testPrefix and testNameSpace respectively, then your path should be s3://testBucket/testPrefix/rudder-datalake/testNameSpace/. If no prefix is configured, the path should be s3://testBucket/rudder-datalake/testNameSpace/.
- Then, under the Add another data store setting, select No.
- In the IAM Role section, configure a suitable IAM role.
- In the Schedule section, select the frequency of your crawler from the dropdown options.
- In the Output section, configure the database that stores all the tables. Under the Grouping behavior for S3 data section, enable the Create a single schema for each S3 path option.
- Specify the Table level as 5 or 4. This value indicates the absolute level of the table location in the bucket. For example, for the path mydataset/a/b, if the level is set to 3, the table will be created at the location mydataset/a/b. Similarly, if the level is set to 2, the table will be created at the location mydataset/a. For the include path s3://testBucket/<prefix>/rudder-datalake/<namespace>/, make sure the table level is set to:
  - 5: If a prefix is configured.
  - 4: If a prefix is not configured.
- Review your crawler configuration.
- Click on Finish to confirm the configuration.
- Finally, click on your crawler and run it. Wait for the process to finish - you should see some tables created in your configured database.
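If you prefer to script the crawler instead of using the console, the settings from the steps above can be gathered into a request for the AWS Glue `CreateCrawler` API. The sketch below only builds the request dictionary; the function name `crawler_request` and all argument values are hypothetical. It assumes that the console option Create a single schema for each S3 path corresponds to the `CombineCompatibleSchemas` grouping policy and that the Table level corresponds to `TableLevelConfiguration` in the crawler configuration JSON.

```python
import json
from typing import Optional


def crawler_request(name: str, role_arn: str, database: str, bucket: str,
                    namespace: str, prefix: str = "",
                    schedule: Optional[str] = None) -> dict:
    """Build keyword arguments for a Glue CreateCrawler call."""
    parts = [bucket]
    if prefix:
        parts.append(prefix.strip("/"))
    parts += ["rudder-datalake", namespace]
    include_path = "s3://" + "/".join(parts) + "/"

    # The bucket counts as level 1, so tables sit at level 5 when a
    # prefix is configured and at level 4 when it is not.
    table_level = 5 if prefix else 4

    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": include_path}]},
        "Configuration": json.dumps({
            "Version": 1.0,
            "Grouping": {
                "TableGroupingPolicy": "CombineCompatibleSchemas",
                "TableLevelConfiguration": table_level,
            },
        }),
    }
    if schedule:
        # Optional cron expression, for example "cron(0 * * * ? *)".
        request["Schedule"] = schedule
    return request


if __name__ == "__main__":
    req = crawler_request(
        name="rudder-datalake-crawler",                       # hypothetical
        role_arn="arn:aws:iam::123456789012:role/GlueRole",   # hypothetical
        database="testNameSpace",
        bucket="testBucket",
        namespace="testNameSpace",
        prefix="testPrefix",
    )
    print(json.dumps(req, indent=2))
```

With `boto3` installed and credentials configured, a dictionary like this could be passed as `boto3.client("glue").create_crawler(**req)`. That call is left out of the sketch so the example runs without AWS access.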
Querying data using AWS Athena
You can query your S3 data using a tool like AWS Athena, which lets you run SQL queries on S3.
Follow these steps to start querying your data on S3:
- Open your AWS Athena console. Then, go to the same AWS region that you used to configure AWS Glue.
- In the left pane, select AwsDataCatalog as your data source.
- Select your configured namespace (or the database you specified while configuring the crawler) from the database dropdown menu.
By default, the namespace is set to your source name if you did not specify it in the destination settings.
- You should see some tables already created under the Tables section in the left pane.
- You can preview the data by clicking on the three dots next to the table and selecting the Preview Data option. Alternatively, you can run your own SQL queries in the workspace on the right.
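For reference, a preview-style query for one of the generated tables might look like the string produced below. The helper `preview_query` is hypothetical, and the database and table names are the example values used earlier in this guide; substitute your own namespace and event table.

```python
def preview_query(database: str, table: str, limit: int = 10) -> str:
    """Return a SQL statement that selects the first rows of a table."""

    def quote(identifier: str) -> str:
        # Wrap the identifier in double quotes, doubling any embedded quote.
        return '"' + identifier.replace('"', '""') + '"'

    return f"SELECT * FROM {quote(database)}.{quote(table)} LIMIT {int(limit)};"


if __name__ == "__main__":
    print(preview_query("testNameSpace", "product_purchased"))
```

You can paste the resulting statement into the Athena query editor, or adapt it by adding a WHERE clause on the columns your events carry.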
IPs to be allowlisted
To enable network access to RudderStack, you will need to allowlist the following RudderStack IPs:
- 3.216.35.97
- 34.198.90.241
- 54.147.40.62
- 23.20.96.9
- 18.214.35.254
- 35.83.226.133
- 52.41.61.208
- 44.227.140.138
- 54.245.141.180
- 3.66.99.198
- 3.64.201.167
Contact us
For queries on any of the sections covered in this guide, you can contact us or start a conversation in our Slack community.