Azure, data warehouse


Last year the Azure DW team announced the availability of added another hotly requested feature, auto creation of statistics.

Auto creating statistics has many known benefits which impacts DevOps, as database deployment pipelines don’t have to include the statistics creating queried, DBA’s who don’t have to worry about the lack of statistics on objects and business users who normally suffer from poorly perming reports and queries.

The Azure DW blogpost describes the impact well.

The Peril, yes there is one – is that auto created stats objects will sometimes prevent columns from being dropped from tables. The following error can sometimes be observed:

SQL exception has occured in script: '0000001-dropColum.sql'
Script block number: 0; Block line 1; Message:
System.Data.SqlClient.SqlException (0x80131904): The statistics 'Stat_b71bb60085b5420bbd7b95c46e578bb6' is dependent on column 'CostValue'.

Should this error occur, then the best option would be to disable auto create stats, and after the db deploy / code has been run we can enable it:



Top tip: disable at the start of your database deployment and then enable at the end. MS Docs

Azure, data warehouse, SQL

Azure Data warehouse Performance Tuning 102

In this post we explore data warehouse performance, taking into account the new Gen2 tier.

A key goal when optimising request (data warehouse speak for query) performance is to reduce any costly DMS operation, specifically shufflemove and broadcastmove. Performing the basic steps outlined in part 1, will in most cases reduce the impact of the costly operations. However to completely eliminate these costly operations we can start by choosing:

  1. An optimal distribution type
  2. The appropriate distribution key

Distribution Types

The Azure Data warehouse supports 3 types of distributions:

  1. Replicate
  2. Hash
  3. Round robin

With replicate distribution, a copy of the entire table is copied to each data distribution (60 in total). This distribution type is suited to small tables or tables that contain only static dataset, i.e. does not change frequently. With the other two distribution types, data is distributed across the distributions. In a round robin distribution, the rows in the table are spread evenly across the distributions. The hash distribution on the other hand, is slightly different where we have more control on how the data is distributed using a distribution key.

I’ll focus on the other two distribution types from here on since in most cases we would be dealing with large changing data sets (> 60 Million rows).

Follow the robin

When a suitable distribution key cannot be easily found one can use round robin. In fact when we don’t care about request performance such as during the ETL phase of our process pipeline the round robin distribution is a great candidate to stage data for quick loads. Alas we are here to optimise queries so I would not follow the robin any further.

Hash mate

So that leaves us with our good mate, the hash.

Before we talk about hash distribution, we have to look ahead and envision the type of request patterns which would be executed in the data warehouse. In most cases this is an  easy process because most cases the fact table will have a primary key which all the dimension tables will depend on, if this is the case then this is a good place to start.

If you are unfortunate and don’t have a fact table in the mart then we have to review all the queries especially those with joins and identify the commonly used predicates. Once the predicates have been identified we select a key based on the the following criteria:

  • Predicate frequency: Commonly used predicate – dominant predicate
  • Distribution skew: Key with the lowest distribution skew

Identify distribution skew

  1. To identify distribution skew, create multiple copies of the table with the different distribution keys using CTAS
  2. Execute DBCC PDW_SHOWSPACEUSED which will provide the distribution skew.

Living with distribution Skew

In some cases if you cannot avoid distribution skew or find it difficult choosing a key with the least distribution skew, then I would suggest using temporary tables. I have previously written about my love for temporary tables – breaking down complex queries in the data warehouse.

Assuming I have 2 tables, FactTable1 and DimentionTable1, which are distributed on column A and column B respectively. If the predicate is on column B, then a temporary table can be of great help to transform FactTable1 into a new temporary table (FactTable2) distributed on column B which can then be joined to DimentionTable1. Now all tables have the same distribution keys and the predicate is the same.


It is worth nothing that the new Gen2 data warehouse performance tier uses NvMe SSD for tempdb. Further performance results to be explored in future posts.

Azure, Random Inno

Capture Telemetry – Azure SQL Data Warehouse

A short article about capturing transient telemetry.

Most DMVs in the Azure SQL data warehouse are transient, which means data stored in the table is not persisted usually after:

  1. A Pause / Resume / Scale event
  2. After a max rollover threshold has been reached (10000 rows)

For example, the data in the sys.dm_pdw_request_steps dmv – a very important dmv which provides a simple query plan for executing queries, will be lost if the events above are encountered.

There are many patterns to persist such data but a simple pattern is shown below, where the telemetry data is copied to a persisted table in the data warehouse and then can be used in reporting or for troubleshooting at a later time in an Azure SQL database.

Screen Shot 2018-10-28 at 23.51.43

In my solution below I have created a function app which triggers the copy from dmv to permanent table frequently. An external table is created in the Azure SQL db to the Azure SQL DW to view the data. One can also copy the data from the external table into a local table in the Azure SQL db for performance improvements say in reports.

Next post I’ll provide the code.

Azure, data warehouse, Random Inno, SQL

Upgrading The Azure Data warehouse To Gen2

The Data warehouse has recently been upgraded to the next version called Gen2. The good news is that the upgrade process is relatively seamless.

Upgrading is supposed to be a one step process but I have decided to add 2 extra steps:

  1. Backup / Take a snapshot
  2. Upgrade the data warehouse
  3. Rebuild all indexes

Backup / Take s snapshot

PS /Users/piro>New-AzureRmSqlDatabaseRestorePoint -ResourceGroupName "ResourceGroup01" -ServerName "Server01" -DatabaseName "Database01" -RestorePointLabel "RestorePoint01"

For more information refer to New-AzureRmSqlDatabaseRestorePoint.

Upgrade the data warehouse

At present the only way to upgrade is to use the portal. To upgrade login into the portal and navigate the Azure SQL database. A new option to upgrade will be available in the database overview blade.

Screen Shot 2018-09-20 at 22.44.59.png

Rebuild all indexes

IF OBJECT_ID('tempdb..#tables') IS NOT NULL DROP TABLE #tables
SELECT AS table_schema, AS table_name FROM [sys].[tables] t
INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
WHERE t.[is_external] = 0;


SET @tableName = (SELECT TOP(1) table_name FROM #tables ORDER BY table_schema ASC, table_name ASC)
SET @schemaName = (SELECT TOP(1) table_schema FROM #tables ORDER BY table_schema ASC, table_name ASC)
SET @sql = 'ALTER INDEX ALL ON [' + @schemaName + '].['+ @tableName + '] REBUILD'
DELETE #tables WHERE table_name = @tableName and table_schema = @schemaName
IF OBJECT_ID('tempdb..#tables') IS NOT NULL DROP TABLE #tables


Link to GitHub Code, RebuildIndex.sql

More information here, Optimize performance by upgrading SQL Data Warehouse

Random Inno

Azure Data warehouse Performance Tuning 101

When it comes to performance tuning or troubleshooting, for a standard SQL server many tools and techniques exist. For example query execution plans can be analysed, profiler exists, and a host of other dmvs which a DBA cannot live without.

In an Azure Data warehouse there are similar dmvs. However some basic principles can be applied to guarantee optimal performance:

  1. Create statistics on every column
  2. Use indexes on join predicates. Non clustered indexes for small tables or tables with  a column store index or instead of a clustered index
  3. Use a consistent distribution key for all dimension and fact tables
  4. Use column store indexes for extra large tables
  5. Use clustered indexes for large tables
  6. Rebuild indexes regularly as part of a inter-day or weekly maintenance process
  7. Update statistics regularly as part of a intra-day or inter-day maintenance process
  8. Limit concurrency utilisation using resource classes
  9. Choose the correct distribution type to avoid expensive DMS operations
Random Inno

Azure MSI: Connect Using PowerShell Or .NET?

Managed Service Identity (MSI) was introduced last year. Since then quite a few articles have been written about it:

  1. Use a Windows VM Managed Service Identity (MSI) to access Azure Key Vault:
  2. Azure SQL authentication with a Managed Service Identity

MSI gives your code an automatically managed identity for authenticating to Azure services, so that you can keep credentials out of your code.

To enable MSI, for most services it can be done using PowerShell / ARM templates / Portal.

Enabling and configuring MSI is usually performed in 3 steps

  1. Enable MSI for the source resource
  2. Grant the application spn access to another target resource
  3. Add MSI authentication to the code hosted on the source resource

Example using Azure functions can be found in github.

I was surprised though to find out that connecting to Azure SQL using PowerShell with MSI does not work when hosted in a function app.

Also included in the visual studio solution is the function app running in PowerShell. It fails to log into Azure SQL server with the following error:

Exception while executing function: Functions.dm_pdw_exec_sessions. Microsoft.Azure.WebJobs.Script: PowerShell script error. System.Management.Automation: Exception calling "Open" with "0" argument(s): "Login failed for user 'NT AUTHORITY\ANONYMOUS LOGON'.". .Net SqlClient Data Provider: Login failed for user 'NT AUTHORITY\ANONYMOUS LOGON

At this point I suspect impersonation is working correctly with IIS hosting the function app