Azure, data warehouse, SQL

Azure Data warehouse Performance Tuning 102

In this post we explore data warehouse performance, taking into account the new Gen2 tier.

A key goal when optimising request (data warehouse speak for query) performance is to reduce costly DMS operations, specifically ShuffleMove and BroadcastMove. Performing the basic steps outlined in part 1 will in most cases reduce the impact of these costly operations. However, to eliminate them completely we can start by choosing:

  1. An optimal distribution type
  2. The appropriate distribution key

Distribution Types

The Azure Data warehouse supports three distribution types:

  1. Replicate
  2. Hash
  3. Round robin

With replicate distribution, a full copy of the table is maintained on each compute node. This distribution type is suited to small tables or tables that hold a largely static dataset, i.e. one that does not change frequently. With the other two distribution types, data is spread across the 60 distributions. In a round robin distribution, the rows in the table are spread evenly across the distributions. The hash distribution, on the other hand, gives us more control over how the data is distributed through the choice of a distribution key.
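
To make the syntax concrete, here is a minimal sketch of the three options; the table and column names (DimRegion, StageSales, FactSales and so on) are assumptions, not tables from a real schema:

-- Replicate: a full copy of the (small, fairly static) table on each compute node
CREATE TABLE dbo.DimRegion
( RegionKey INT NOT NULL, RegionName NVARCHAR(50) NOT NULL )
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED INDEX (RegionKey) );

-- Round robin: rows are spread evenly across the 60 distributions
CREATE TABLE dbo.StageSales
( SaleId BIGINT NOT NULL, RegionKey INT NOT NULL, Amount DECIMAL(18,2) NOT NULL )
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );

-- Hash: each row is placed on a distribution based on a hash of the distribution key
CREATE TABLE dbo.FactSales
( SaleId BIGINT NOT NULL, RegionKey INT NOT NULL, Amount DECIMAL(18,2) NOT NULL )
WITH ( DISTRIBUTION = HASH(SaleId), CLUSTERED COLUMNSTORE INDEX );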

I’ll focus on the other two distribution types from here on, since in most cases we will be dealing with large, changing data sets (> 60 million rows).

Follow the robin

When a suitable distribution key cannot easily be found, one can use round robin. In fact, when we don’t care about request performance, such as during the ETL phase of our process pipeline, the round robin distribution is a great candidate for staging data for quick loads, as sketched below. Alas, we are here to optimise queries, so I will not follow the robin any further.
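
As a rough sketch, a round robin staging table can be created straight from a load query with CTAS; the stage schema, table and source names here are assumptions:

-- Stage the raw rows quickly; no distribution key decision is needed at load time
CREATE TABLE stage.SalesLoad
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP )
AS
SELECT SaleId, RegionKey, Amount
FROM ext.SalesExternal;   -- e.g. an external (PolyBase) table over the source files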

Hash mate

So that leaves us with our good mate, the hash.

Before we talk about hash distribution, we have to look ahead and envision the type of request patterns that will be executed in the data warehouse. In most cases this is an easy process, because the fact table will usually have a primary key which all the dimension tables depend on; if this is the case, then that key is a good place to start, as sketched below.
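
A minimal sketch of that starting point, assuming a hypothetical FactOrders / DimCustomer pair joined on CustomerKey: distributing both tables on the join key keeps the join distribution-local and avoids a ShuffleMove.

CREATE TABLE dbo.FactOrders
( OrderId BIGINT NOT NULL, CustomerKey INT NOT NULL, Amount DECIMAL(18,2) NOT NULL )
WITH ( DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX );

CREATE TABLE dbo.DimCustomer
( CustomerKey INT NOT NULL, CustomerName NVARCHAR(100) NOT NULL )
WITH ( DISTRIBUTION = HASH(CustomerKey), CLUSTERED INDEX (CustomerKey) );

-- Joins on CustomerKey can now be satisfied within each distribution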

If you are unfortunate and don’t have a fact table in the mart, then we have to review all the queries, especially those with joins, and identify the commonly used predicates. Once the predicates have been identified, we select a key based on the following criteria:

  • Predicate frequency: Commonly used predicate – dominant predicate
  • Distribution skew: Key with the lowest distribution skew

Identify distribution skew

  1. To identify distribution skew, create multiple copies of the table, each with a different distribution key, using CTAS.
  2. Execute DBCC PDW_SHOWSPACEUSED against each copy, which will show the distribution skew (see the sketch below).
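
A hedged sketch of those two steps, assuming a hypothetical FactOrders table and two candidate keys (CustomerKey and ProductKey):

-- Step 1: one CTAS copy per candidate distribution key
CREATE TABLE dbo.FactOrders_CustomerKey
WITH ( DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX )
AS SELECT * FROM dbo.FactOrders;

CREATE TABLE dbo.FactOrders_ProductKey
WITH ( DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX )
AS SELECT * FROM dbo.FactOrders;

-- Step 2: compare rows and space per distribution; a wide spread indicates skew
DBCC PDW_SHOWSPACEUSED("dbo.FactOrders_CustomerKey");
DBCC PDW_SHOWSPACEUSED("dbo.FactOrders_ProductKey");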

Living with distribution skew

In some cases, if you cannot avoid distribution skew or find it difficult to choose a key with the least distribution skew, then I would suggest using temporary tables. I have previously written about my love for temporary tables – breaking down complex queries in the data warehouse.

Assume I have two tables, FactTable1 and DimensionTable1, distributed on column A and column B respectively. If the predicate is on column B, then a temporary table can be of great help: transform FactTable1 into a new temporary table (FactTable2) distributed on column B, which can then be joined to DimensionTable1. Now both tables share the same distribution key and the predicate is on that key.
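
A rough sketch of that transform; ColumnA / ColumnB are assumed column names standing in for column A and column B:

-- Redistribute FactTable1 on column B so it lines up with DimensionTable1
CREATE TABLE #FactTable2
WITH ( DISTRIBUTION = HASH(ColumnB), HEAP )
AS
SELECT *
FROM dbo.FactTable1;

-- Both sides of the join are now distributed on ColumnB, so no ShuffleMove is needed
SELECT f.*, d.*
FROM #FactTable2 AS f
INNER JOIN dbo.DimensionTable1 AS d ON f.ColumnB = d.ColumnB
WHERE d.ColumnB = 42;   -- the predicate column is also the distribution key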

Finally

It is worth noting that the new Gen2 data warehouse performance tier uses NVMe SSDs for tempdb. Further performance results will be explored in future posts.

Azure, Random Inno

Capture Telemetry – Azure SQL Data Warehouse

A short article about capturing transient telemetry.

Most DMVs in the Azure SQL data warehouse are transient, which means the data they hold is usually not persisted after:

  1. A Pause / Resume / Scale event
  2. The maximum rollover threshold has been reached (10,000 rows)

For example, the data in the sys.dm_pdw_request_steps DMV – a very important DMV which provides a simple query plan for executing requests – will be lost if any of the events above occur.

There are many patterns for persisting such data, but a simple one is to copy the telemetry data to a persisted table in the data warehouse, from where it can later be used for reporting or troubleshooting in an Azure SQL database.
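
A minimal sketch of the copy step, assuming a persisted table called dbo.RequestStepsHistory (the column list is trimmed and the exact types are assumptions):

-- One-off setup: a persisted landing table for the DMV data
CREATE TABLE dbo.RequestStepsHistory
(
    captured_at        DATETIME2      NOT NULL,
    request_id         NVARCHAR(32)   NOT NULL,
    step_index         INT            NOT NULL,
    operation_type     NVARCHAR(35)   NULL,
    status             NVARCHAR(32)   NULL,
    total_elapsed_time INT            NULL,
    row_count          BIGINT         NULL,
    command            NVARCHAR(4000) NULL
)
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP );

-- Run frequently: append any steps that have not been persisted yet
INSERT INTO dbo.RequestStepsHistory
SELECT SYSDATETIME(), s.request_id, s.step_index, s.operation_type, s.status,
       s.total_elapsed_time, s.row_count, s.command
FROM sys.dm_pdw_request_steps AS s
LEFT JOIN dbo.RequestStepsHistory AS h
    ON h.request_id = s.request_id
   AND h.step_index = s.step_index
WHERE h.request_id IS NULL;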


In my solution I have created a function app which frequently triggers the copy from the DMV to a permanent table. An external table is then created in the Azure SQL database, pointing at the Azure SQL DW, to view the data. One can also copy the data from the external table into a local table in the Azure SQL database for better performance, say in reports.
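
A hedged sketch of the external table setup in the Azure SQL database, assuming elastic query (TYPE = RDBMS) is used; the server, database and credential names and the secrets are placeholders:

-- Run in the Azure SQL database, not in the data warehouse
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

CREATE DATABASE SCOPED CREDENTIAL DwTelemetryCredential
WITH IDENTITY = '<sql login>', SECRET = '<password>';

CREATE EXTERNAL DATA SOURCE DwTelemetry
WITH ( TYPE = RDBMS,
       LOCATION = '<server>.database.windows.net',
       DATABASE_NAME = '<data warehouse database>',
       CREDENTIAL = DwTelemetryCredential );

-- Shape must match the persisted table in the data warehouse
CREATE EXTERNAL TABLE dbo.RequestStepsHistory
(
    captured_at        DATETIME2      NOT NULL,
    request_id         NVARCHAR(32)   NOT NULL,
    step_index         INT            NOT NULL,
    operation_type     NVARCHAR(35)   NULL,
    status             NVARCHAR(32)   NULL,
    total_elapsed_time INT            NULL,
    row_count          BIGINT         NULL,
    command            NVARCHAR(4000) NULL
)
WITH ( DATA_SOURCE = DwTelemetry );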

Next post I’ll provide the code.

Azure, data warehouse, Random Inno, SQL

Upgrading The Azure Data warehouse To Gen2

The Data warehouse has recently been upgraded to the next version called Gen2. The good news is that the upgrade process is relatively seamless.

Upgrading is supposed to be a one-step process, but I have decided to add two extra steps:

  1. Backup / Take a snapshot
  2. Upgrade the data warehouse
  3. Rebuild all indexes

Backup / Take a snapshot

PS /Users/piro>New-AzureRmSqlDatabaseRestorePoint -ResourceGroupName "ResourceGroup01" -ServerName "Server01" -DatabaseName "Database01" -RestorePointLabel "RestorePoint01"

For more information refer to New-AzureRmSqlDatabaseRestorePoint.

Upgrade the data warehouse

At present the only way to upgrade is through the portal. Log in to the portal and navigate to the Azure SQL data warehouse database; a new option to upgrade will be available in the database overview blade.


Rebuild all indexes


-- Build a work list of all user tables (excluding external tables) using CTAS
IF OBJECT_ID('tempdb..#tables') IS NOT NULL DROP TABLE #tables
CREATE TABLE #tables
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT s.name AS table_schema, t.name AS table_name
FROM [sys].[tables] t
INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
WHERE t.[is_external] = 0;

DECLARE @tableName VARCHAR(MAX)
DECLARE @schemaName VARCHAR(MAX)
DECLARE @sql VARCHAR(MAX)

-- Work through the list, rebuilding all indexes on each table in turn
WHILE EXISTS (SELECT 1 FROM #tables)
BEGIN
    SET @tableName = (SELECT TOP(1) table_name FROM #tables ORDER BY table_schema ASC, table_name ASC)
    SET @schemaName = (SELECT TOP(1) table_schema FROM #tables ORDER BY table_schema ASC, table_name ASC)
    SET @sql = 'ALTER INDEX ALL ON [' + @schemaName + '].[' + @tableName + '] REBUILD'
    EXEC(@sql)

    -- Remove the processed table from the work list
    DELETE #tables WHERE table_name = @tableName AND table_schema = @schemaName
END
IF OBJECT_ID('tempdb..#tables') IS NOT NULL DROP TABLE #tables

 

Link to GitHub Code, RebuildIndex.sql

More information here, Optimize performance by upgrading SQL Data Warehouse

Random Inno

Azure Data warehouse Performance Tuning 101

When it comes to performance tuning or troubleshooting a standard SQL Server, many tools and techniques exist. For example, query execution plans can be analysed, Profiler is available, and there is a host of other DMVs which a DBA cannot live without.

In an Azure Data warehouse there are similar DMVs. However, some basic principles can be applied to guarantee optimal performance:

  1. Create statistics on every column (see the sketch after this list)
  2. Use indexes on join predicates; use non-clustered indexes for small tables, or for tables that already have a columnstore index, instead of a clustered index
  3. Use a consistent distribution key for all dimension and fact tables
  4. Use columnstore indexes for extra large tables
  5. Use clustered indexes for large tables
  6. Rebuild indexes regularly as part of an inter-day or weekly maintenance process
  7. Update statistics regularly as part of an intra-day or inter-day maintenance process
  8. Limit concurrency utilisation using resource classes
  9. Choose the correct distribution type to avoid expensive DMS operations
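
A short sketch of points 1, 6 and 7; the table and column names are assumptions:

-- 1. Statistics on the columns used in joins, filters and group-bys
CREATE STATISTICS stat_FactOrders_CustomerKey ON dbo.FactOrders (CustomerKey);

-- 6. Rebuild indexes as part of inter-day or weekly maintenance
ALTER INDEX ALL ON dbo.FactOrders REBUILD;

-- 7. Refresh statistics as part of intra-day or inter-day maintenance
UPDATE STATISTICS dbo.FactOrders;
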
Random Inno

Azure MSI: Connect Using PowerShell Or .NET?

Managed Service Identity (MSI) was introduced last year. Since then quite a few articles have been written about it:

  1. Use a Windows VM Managed Service Identity (MSI) to access Azure Key Vault
  2. Azure SQL authentication with a Managed Service Identity

MSI gives your code an automatically managed identity for authenticating to Azure services, so that you can keep credentials out of your code.

For most services, MSI can be enabled using PowerShell, ARM templates or the portal.

Enabling and configuring MSI is usually performed in three steps:

  1. Enable MSI for the source resource
  2. Grant the application's service principal (SPN) access to the target resource (see the sketch after this list)
  3. Add MSI authentication to the code hosted on the source resource
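
For step 2, when the target resource is an Azure SQL database or data warehouse, the grant is typically done with T-SQL. A minimal sketch, assuming the source resource is a function app named my-function-app (a hypothetical name) and the statements are run by an Azure AD admin of the target database:

-- Create a contained database user for the function app's managed identity
CREATE USER [my-function-app] FROM EXTERNAL PROVIDER;

-- Grant only the role the code actually needs (read-only here, as an example)
ALTER ROLE db_datareader ADD MEMBER [my-function-app];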

An example using Azure Functions can be found on GitHub.

I was surprised though to find out that connecting to Azure SQL using PowerShell with MSI does not work when hosted in a function app.

Also included in the Visual Studio solution is the function app running in PowerShell. It fails to log into the Azure SQL server with the following error:

Exception while executing function: Functions.dm_pdw_exec_sessions. Microsoft.Azure.WebJobs.Script: PowerShell script error. System.Management.Automation: Exception calling "Open" with "0" argument(s): "Login failed for user 'NT AUTHORITY\ANONYMOUS LOGON'.". .Net SqlClient Data Provider: Login failed for user 'NT AUTHORITY\ANONYMOUS LOGON

At this point I suspect impersonation is not working correctly with IIS hosting the function app.

Azure, data warehouse, Microsoft, Random Inno

Azure Parallel Data Warehouse (PDW) Errors #1: cancelled transactions

Microsoft’s Azure data warehouse is a cloud-hosted PaaS offering, which implies compute and storage resources are managed by Microsoft / the Azure fabric.

Scheduled and unscheduled maintenance activities occur which are supposed to be transparent, i.e. they should not impact the platform. This is a false assumption.

On some occasions maintenance will impact the platform. For example, back-end “node movements” – rotating compute / storage resources – will cause errors on one of the compute / data nodes to bubble up into a running transaction, causing it to enter a cancelled state.

Protect your transactions

The best way to protect a transaction when dealing with the Azure DW is to integrate retry logic into your code.

# Retry the operation up to 5 times before giving up
$retry = 5
while ($retry -gt 0)
{
    try
    {
        # Connect here and perform SQL / ETL operation ...

        # Success - end the loop
        $retry = 0
    }
    catch
    {
        # Count down the remaining attempts and back off before retrying
        $retry -= 1
        if ($retry -gt 0)
        {
            Start-Sleep -Seconds 30
        }
    }
}

Quant

PrincessTenko’s delayed Stock transactions

I have been getting questions about her transactions (please post them on my blog as comments), so I thought I’d clarify that the transaction section is delayed.

This means that there is a lag between the time the trade occurred  and when it is printed, i.e. a few periods ago PrincessTenko dabbled in that stock.

See her in action (and possibly her arch rival soon) here: quant.peterirojah.com