Azure, data warehouse, SQL

Azure Data Warehouse Performance Tuning 102

In this post we explore data warehouse performance, taking into account the new Gen2 tier.

A key goal when optimising request (data warehouse speak for query) performance is to reduce costly DMS operations, specifically ShuffleMove and BroadcastMove. Performing the basic steps outlined in part 1 will in most cases reduce the impact of these operations. However, to eliminate them completely we can start by choosing:

  1. An optimal distribution type
  2. The appropriate distribution key

Distribution Types

The Azure Data Warehouse supports three distribution types:

  1. Replicate
  2. Hash
  3. Round robin

With a replicated table, a full copy of the table is cached on each Compute node. This distribution type is suited to small tables or tables that hold a largely static dataset, i.e. one that does not change frequently. With the other two distribution types, the data is spread across 60 distributions. In a round-robin distribution, the rows of the table are spread evenly across the distributions. The hash distribution, on the other hand, is slightly different in that we have more control over how the data is distributed, using a distribution key.
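
To make the options concrete, here is a minimal DDL sketch of the three distribution types; the table and column names (DimProduct, StageSales, FactSales and so on) are made up for illustration:

-- Replicate: a full copy on each Compute node, good for small, static tables
CREATE TABLE dbo.DimProduct
(
    ProductId   INT NOT NULL,
    ProductName VARCHAR(100)
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);

-- Round robin: rows spread evenly across the 60 distributions
CREATE TABLE dbo.StageSales
(
    OrderId INT NOT NULL,
    Amount  DECIMAL(18,2)
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Hash: rows placed by hashing the chosen distribution key
CREATE TABLE dbo.FactSales
(
    OrderId   INT NOT NULL,
    ProductId INT NOT NULL,
    Amount    DECIMAL(18,2)
)
WITH (DISTRIBUTION = HASH(ProductId), CLUSTERED COLUMNSTORE INDEX);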

I’ll focus on the other two distribution types from here on, since in most cases we are dealing with large, changing datasets (> 60 million rows).

Follow the robin

When a suitable distribution key cannot easily be found, one can use round robin. In fact, when we don’t care about request performance, such as during the ETL phase of our processing pipeline, the round-robin distribution is a great candidate for staging data for quick loads. Alas, we are here to optimise queries, so I will not follow the robin any further.
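
As a rough sketch of that staging pattern (the table names are assumptions, not from a real schema), a round-robin heap makes a quick CTAS target during loads:

-- Round-robin heap staging table: rows are spread evenly with no hashing
-- and no columnstore compression on the way in, which keeps loads fast.
CREATE TABLE stage.Sales_Load
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT *
FROM ext.Sales; -- hypothetical external (PolyBase) source table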

Hash mate

So that leaves us with our good mate, the hash.

Before we talk about hash distribution, we have to look ahead and envision the type of request patterns that will be executed in the data warehouse. In most cases this is an easy exercise because the fact table will have a key that all the dimension tables join on; if that is the case, it is a good place to start.

If you are unfortunate and don’t have a fact table in the mart, then we have to review all the queries, especially those with joins, and identify the commonly used predicates. Once the predicates have been identified, we select a key based on the following criteria:

  • Predicate frequency: the most commonly used (dominant) predicate
  • Distribution skew: the key with the lowest distribution skew

Identify distribution skew

  1. To identify distribution skew, create multiple copies of the table with different distribution keys using CTAS.
  2. Execute DBCC PDW_SHOWSPACEUSED, which reports the rows and space used per distribution and therefore the skew (see the sketch after this list).
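
A minimal sketch of those two steps, assuming a hypothetical dbo.FactSales table with two candidate keys, OrderId and CustomerId:

-- Step 1: CTAS a copy of the table for each candidate distribution key
CREATE TABLE dbo.FactSales_HashOrderId
WITH (DISTRIBUTION = HASH(OrderId), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.FactSales;

CREATE TABLE dbo.FactSales_HashCustomerId
WITH (DISTRIBUTION = HASH(CustomerId), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.FactSales;

-- Step 2: compare rows and space per distribution; a wide spread means skew
DBCC PDW_SHOWSPACEUSED("dbo.FactSales_HashOrderId");
DBCC PDW_SHOWSPACEUSED("dbo.FactSales_HashCustomerId");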

Living with distribution skew

In some cases, if you cannot avoid distribution skew or find it difficult to choose a key with the least skew, then I would suggest using temporary tables. I have previously written about my love for temporary tables in breaking down complex queries in the data warehouse.

Assume I have two tables, FactTable1 and DimensionTable1, distributed on column A and column B respectively. If the predicate is on column B, then a temporary table can be of great help: transform FactTable1 into a new temporary table (FactTable2) distributed on column B, which can then be joined to DimensionTable1. Now both tables share the same distribution key and the predicate matches it.
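
Roughly, and assuming the columns are called ColumnA and ColumnB, the pattern looks like this; the SELECT at the end is only a stand-in for the real request:

-- Re-distribute the fact table on ColumnB via a temporary table (CTAS)
CREATE TABLE #FactTable2
WITH (DISTRIBUTION = HASH(ColumnB), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.FactTable1;

-- The join is now distribution compatible, so no ShuffleMove is required
SELECT d.ColumnB, COUNT(*) AS row_count
FROM #FactTable2 f
INNER JOIN dbo.DimensionTable1 d ON f.ColumnB = d.ColumnB
GROUP BY d.ColumnB;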

Finally

It is worth noting that the new Gen2 data warehouse performance tier uses NVMe SSDs for tempdb. Further performance results will be explored in future posts.

Azure, data warehouse, Random Inno, SQL

Upgrading The Azure Data Warehouse To Gen2

The data warehouse has recently been upgraded to the next version, called Gen2. The good news is that the upgrade process is relatively seamless.

Upgrading is supposed to be a one-step process, but I have decided to add two extra steps:

  1. Backup / Take a snapshot
  2. Upgrade the data warehouse
  3. Rebuild all indexes

Backup / Take a snapshot

PS /Users/piro> New-AzureRmSqlDatabaseRestorePoint -ResourceGroupName "ResourceGroup01" -ServerName "Server01" -DatabaseName "Database01" -RestorePointLabel "RestorePoint01"

For more information refer to New-AzureRmSqlDatabaseRestorePoint.

Upgrade the data warehouse

At present the only way to upgrade is through the portal. To upgrade, log in to the portal and navigate to the data warehouse database. A new option to upgrade will be available in the database overview blade.

[Screenshot: the upgrade option on the database overview blade]

Rebuild all indexes


IF OBJECT_ID('tempdb..#tables') IS NOT NULL DROP TABLE #tables

-- Collect every non-external table into a temporary work list
CREATE TABLE #tables
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT s.name AS table_schema, t.name AS table_name
FROM [sys].[tables] t
INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
WHERE t.[is_external] = 0;

DECLARE @tableName VARCHAR(MAX)
DECLARE @schemaName VARCHAR(MAX)
DECLARE @sql VARCHAR(MAX)

-- Rebuild all indexes on each table in the work list, one table at a time
WHILE EXISTS (SELECT 1 FROM #tables)
BEGIN
    SET @tableName = (SELECT TOP(1) table_name FROM #tables ORDER BY table_schema ASC, table_name ASC)
    SET @schemaName = (SELECT TOP(1) table_schema FROM #tables ORDER BY table_schema ASC, table_name ASC)
    SET @sql = 'ALTER INDEX ALL ON [' + @schemaName + '].[' + @tableName + '] REBUILD'
    EXEC(@sql)
    -- Remove the table we have just rebuilt from the work list
    DELETE #tables WHERE table_name = @tableName AND table_schema = @schemaName
END
IF OBJECT_ID('tempdb..#tables') IS NOT NULL DROP TABLE #tables


Link to the code on GitHub: RebuildIndex.sql

For more information, see Optimize performance by upgrading SQL Data Warehouse.