In this post we explore data warehouse performance, taking into account the new Gen2 tier.
A key goal when optimising request (data warehouse speak for query) performance is to reduce any costly DMS operation, specifically shufflemove and broadcastmove. Performing the basic steps outlined in part 1, will in most cases reduce the impact of the costly operations. However to completely eliminate these costly operations we can start by choosing:
- An optimal distribution type
- The appropriate distribution key
The Azure Data warehouse supports 3 types of distributions:
- Round robin
With replicate distribution, a copy of the entire table is copied to each data distribution (60 in total). This distribution type is suited to small tables or tables that contain only static dataset, i.e. does not change frequently. With the other two distribution types, data is distributed across the distributions. In a round robin distribution, the rows in the table are spread evenly across the distributions. The hash distribution on the other hand, is slightly different where we have more control on how the data is distributed using a distribution key.
I’ll focus on the other two distribution types from here on since in most cases we would be dealing with large changing data sets (> 60 Million rows).
Follow the robin
When a suitable distribution key cannot be easily found one can use round robin. In fact when we don’t care about request performance such as during the ETL phase of our process pipeline the round robin distribution is a great candidate to stage data for quick loads. Alas we are here to optimise queries so I would not follow the robin any further.
So that leaves us with our good mate, the hash.
Before we talk about hash distribution, we have to look ahead and envision the type of request patterns which would be executed in the data warehouse. In most cases this is an easy process because most cases the fact table will have a primary key which all the dimension tables will depend on, if this is the case then this is a good place to start.
If you are unfortunate and don’t have a fact table in the mart then we have to review all the queries especially those with joins and identify the commonly used predicates. Once the predicates have been identified we select a key based on the the following criteria:
- Predicate frequency: Commonly used predicate – dominant predicate
- Distribution skew: Key with the lowest distribution skew
Identify distribution skew
- To identify distribution skew, create multiple copies of the table with the different distribution keys using CTAS
- Execute DBCC PDW_SHOWSPACEUSED which will provide the distribution skew.
Living with distribution Skew
In some cases if you cannot avoid distribution skew or find it difficult choosing a key with the least distribution skew, then I would suggest using temporary tables. I have previously written about my love for temporary tables – breaking down complex queries in the data warehouse.
Assuming I have 2 tables, FactTable1 and DimentionTable1, which are distributed on column A and column B respectively. If the predicate is on column B, then a temporary table can be of great help to transform FactTable1 into a new temporary table (FactTable2) distributed on column B which can then be joined to DimentionTable1. Now all tables have the same distribution keys and the predicate is the same.
It is worth nothing that the new Gen2 data warehouse performance tier uses NvMe SSD for tempdb. Further performance results to be explored in future posts.