I recently needed to move live data (over 6 terabytes) from an Azure data warehouse to another Azure data warehouse in a different tenant while keeping track of any ongoing data modifications at the source. My secondary objective was to avoid data corruption, so I figured I needed a CDC system.
At the time of writing, Azure SQL Database and Azure SQL Data Warehouse do not support the SQL Server change data capture (CDC) feature.
What is CDC?
From the Microsoft docs: change data capture records insert, update, and delete activity applied to a SQL Server table, and makes the details of those changes available in an easily consumed relational format.
Why does it matter?
Certain scenarios require changes to be captured and acted upon. A data warehouse is by no means an OLTP system, but ETL is not always a straight path: it is not uncommon, for example, to capture changes from an OLTP system and load them into the warehouse. So surely CDC tools must exist. I quickly found out there isn't much out there beyond guidelines on how it should be done. I read about Attunity but did not get a chance to try it this time; I'll blog about that product in the future.
How it was done
In short, this turned out to be a journey of learning, teamwork, and leveraging Azure services such as the data factory.
In this setup, the data factory is an orchestrator. The following activities were used:
- Copy activity, for moving data between source and destination
- Stored procedure activity, for calling the get and set procedures
- Webhook activity, for executing Azure runbooks that perform the pre- and post-configuration tasks
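The sequence these activities run in can be sketched as follows. This is an illustrative Python sketch only, not the actual pipeline definition; the activity, runbook, and procedure names (`Resume-And-Scale-Up`, `usp_get_changes`, and so on) are hypothetical placeholders, not names from the real system.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Pipeline:
    """Stand-in for the data factory orchestrator; records each activity it runs."""
    log: List[str] = field(default_factory=list)

    def webhook(self, runbook: str):
        # Webhook activity: execute an Azure runbook (pre/post configuration)
        self.log.append(f"webhook:{runbook}")

    def stored_procedure(self, name: str):
        # Stored procedure activity: call the get/set procedures
        self.log.append(f"proc:{name}")

    def copy(self, source: str, dest: str):
        # Copy activity: move data between source and destination
        self.log.append(f"copy:{source}->{dest}")

def run_sync(p: Pipeline):
    p.webhook("Resume-And-Scale-Up")        # pre-configuration task
    p.stored_procedure("usp_get_changes")   # stage tracked changes at the source
    p.copy("source_dw", "destination_dw")   # move the cached change set
    p.stored_procedure("usp_set_watermark") # record the sync point
    p.webhook("Scale-Down-Or-Pause")        # post-configuration task
```

The point of the sketch is simply the ordering: configuration runbooks bracket the copy, and the stored procedure activity brackets it on the data side.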
Pre and post configuration tasks include:
- Resuming / pausing the data warehouse
- Scaling it up and down
It is worth pointing out that the data factory and the destination Azure data warehouse are in the same tenant, which gives us more flexibility in authentication methods, i.e. we don't have to use SQL authentication when we can use a managed service identity (MSI).
The source system is a simple implementation of SQL Server CDC. At the source, changes (INSERT, UPDATE, DELETE) are tracked and cached. The diagram below shows the change identification, tracking and caching processes. The identification phase applies a unique, sortable column to each table being tracked; in this case an identity column is added. The tracking phase compares the ID-enhanced table with a static reference dataset containing a snapshot taken during the last sync. In the final phase, the change types are separated, applied to the static reference data, then tagged and saved in cache.
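To make the tracking phase concrete, here is a minimal Python sketch of the comparison (illustrative only; the actual logic was written in T-SQL). Both inputs are keyed by the synthetic identity column, and each difference between the last-sync snapshot and the current table is tagged as an INSERT, UPDATE, or DELETE instruction ready for caching.

```python
def tag_changes(snapshot: dict, current: dict) -> list:
    """Compare the last-sync snapshot with the current table.
    Both dicts are keyed by the identity column; values are row payloads.
    Returns tagged change instructions ready to be cached."""
    changes = []
    for row_id, row in current.items():
        if row_id not in snapshot:
            changes.append(("INSERT", row_id, row))   # new since last sync
        elif snapshot[row_id] != row:
            changes.append(("UPDATE", row_id, row))   # payload differs
    for row_id in snapshot:
        if row_id not in current:
            changes.append(("DELETE", row_id, None))  # gone since last sync
    return changes

# Example: row 2 was updated, row 4 inserted, row 3 deleted
snapshot = {1: "a", 2: "b", 3: "c"}
current = {1: "a", 2: "B", 4: "d"}
tag_changes(snapshot, current)
```

In the real system the same comparison is a set-based T-SQL operation over the ID-enhanced table and the static reference dataset, but the classification logic is the same.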
At the destination, the tracked data is received and verified, then the instructions are applied to the cached data. The destination table is truncated and loaded with the processed data.
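The destination side can be sketched the same way (again in Python for illustration; the real implementation is T-SQL). Assuming change instructions tagged as INSERT, UPDATE, or DELETE and keyed by the identity column, applying them to the cached reference data yields the full dataset that replaces the destination table in the truncate-and-load step.

```python
def apply_changes(reference: dict, changes: list) -> dict:
    """Apply tagged change instructions to the cached reference snapshot,
    yielding the dataset used for the truncate-and-load step."""
    result = dict(reference)
    for op, row_id, row in changes:
        if op == "DELETE":
            result.pop(row_id, None)   # drop rows deleted at the source
        else:
            result[row_id] = row       # INSERT and UPDATE both upsert the row
    return result

reference = {1: "a", 2: "b", 3: "c"}
changes = [("UPDATE", 2, "B"), ("INSERT", 4, "d"), ("DELETE", 3, None)]
apply_changes(reference, changes)
```

Truncate-and-load of the result, rather than row-by-row modification of the live table, is what keeps the destination consistent even if a sync has to be retried.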
Other capabilities of this system include:
- Recover from failures such as loss of connectivity
- Scale performance / throughput
- Logic written in T-SQL
I made the following assumptions about the data and the system. I don't expect a severe impact if these assumptions change, but they are worth mentioning:
- The IDENTITY column was not the primary key on the table
- Tables with CLUSTERED COLUMNSTORE indexes always use ascending sort order