Azure, data warehouse, Random Inno, SQL

Restoring Azure SQL Data Warehouse

Microsoft recently added snapshots to the Azure SQL data warehouse recovery arsenal. What do they do? A snapshot is like a point-in-time marker, which allows you to restore a data warehouse to the state recorded at the marker.

Restore points are useful in more than just the traditional BCP (business continuity planning) scenarios, such as disaster recovery. They can also be used to refresh environments or to move point-in-time data from one server to another in a different subscription.

Automatic Azure data warehouse backups occur on a schedule, roughly every 8 hours. Using snapshots we don't have to wait for the next backup to occur: we can restore our own point-in-time snapshot into a different server and subscription.

First, we use the New-AzSqlDatabaseRestorePoint PowerShell cmdlet (in the Az module; the old AzureRm cmdlet works too) to create a snapshot:

$RestorePointLabel = (New-Guid).Guid
$SrcRestorePoint = New-AzSqlDatabaseRestorePoint -ResourceGroupName $SrcResourceGroup `
-ServerName $SrcServerName -DatabaseName $SrcDbName `
-RestorePointLabel $RestorePointLabel

Next we can choose to restore into a server in the same subscription or into another subscription. If the target server is in the same subscription as the source, then we can use the Restore-AzSqlDatabase cmdlet:


# Get the source database object (Az module, consistent with the other cmdlets)
$Database = Get-AzSqlDatabase -ResourceGroupName $SrcResourceGroup `
-ServerName $SrcServerName -DatabaseName $SrcDbName
# Restore the database from the restore point created earlier
Restore-AzSqlDatabase -FromPointInTimeBackup `
-PointInTime $SrcRestorePoint.RestorePointCreationDate `
-ServerName $TrgSqlServerName -TargetDatabaseName $TrgDatabaseName `
-ResourceId $Database.ResourceId -ResourceGroupName $TrgResourceGroupName

In the event that we want to restore into a server in a different subscription, we have to:

  • create a temporary server in the source subscription
  • restore the snapshot from the source server into the temporary server
  • move the temporary server into the target subscription
  • create a new snapshot in the temporary server
  • restore the new snapshot into the target server in the same subscription

Code for this can be found on GitHub.

Azure, data warehouse

Azure DW, The Peril of AUTO_CREATE_STATISTICS

Last year the Azure DW team announced the availability of another hotly requested feature: auto creation of statistics.

Auto-creating statistics has many known benefits. It helps DevOps, as database deployment pipelines don’t have to include the statistics-creating queries; DBAs, who don’t have to worry about missing statistics on objects; and business users, who normally suffer from poorly performing reports and queries.

The Azure DW blogpost describes the impact well.

The peril (yes, there is one) is that auto-created stats objects will sometimes prevent columns from being dropped from tables. The following error can sometimes be observed:

SQL exception has occured in script: '0000001-dropColum.sql'
Script block number: 0; Block line 1; Message:
System.Data.SqlClient.SqlException (0x80131904): The statistics 'Stat_b71bb60085b5420bbd7b95c46e578bb6' is dependent on column 'CostValue'.

Should this error occur, the best option is to disable auto create stats, and to enable it again after the database deployment or code has run:

Disable
ALTER DATABASE SalesDW SET AUTO_CREATE_STATISTICS OFF;

Enable
ALTER DATABASE SalesDW SET AUTO_CREATE_STATISTICS ON;

Top tip: disable it at the start of your database deployment and enable it again at the end. See MS Docs.
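When the error does appear, it can also help to see exactly which statistics objects sit on the column. The following is a minimal sketch, assuming the catalog views sys.stats, sys.stats_columns and sys.columns behave as they do in SQL Server. The table name dbo.FactSales is hypothetical; the column and statistics names are taken from the error message above.

```sql
-- Check whether auto create statistics is currently on for the database
SELECT name, is_auto_create_stats_on
FROM sys.databases
WHERE name = 'SalesDW';

-- List the statistics objects that depend on the column we want to drop
SELECT s.name AS stats_name,
       s.auto_created,
       t.name AS table_name,
       c.name AS column_name
FROM sys.stats s
INNER JOIN sys.stats_columns sc
        ON s.object_id = sc.object_id AND s.stats_id = sc.stats_id
INNER JOIN sys.columns c
        ON sc.object_id = c.object_id AND sc.column_id = c.column_id
INNER JOIN sys.tables t
        ON s.object_id = t.object_id
WHERE c.name = 'CostValue';

-- Drop the blocking statistics object, then drop the column.
-- dbo.FactSales is an assumed table name for illustration only.
DROP STATISTICS dbo.FactSales.Stat_b71bb60085b5420bbd7b95c46e578bb6;
ALTER TABLE dbo.FactSales DROP COLUMN CostValue;
```

Dropping the statistics object by hand only helps if auto create stats is off at that moment; otherwise a query running between the two statements could recreate it.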

Azure, data warehouse, SQL

Azure Data warehouse Performance Tuning 102

In this post we explore data warehouse performance, taking into account the new Gen2 tier.

A key goal when optimising request (data warehouse speak for query) performance is to reduce costly DMS (Data Movement Service) operations, specifically ShuffleMove and BroadcastMove. Performing the basic steps outlined in part 1 will, in most cases, reduce the impact of these operations. However, to eliminate them completely we can start by choosing:

  1. An optimal distribution type
  2. The appropriate distribution key
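To check whether a given request actually incurs these DMS operations, the steps of the request can be inspected. A small sketch, assuming the operation type names reported by sys.dm_pdw_request_steps are 'ShuffleMoveOperation' and 'BroadcastMoveOperation'; the request id 'QID1234' is a placeholder:

```sql
-- Find the data movement steps of a single request.
-- 'QID1234' is a placeholder; look up a real id in sys.dm_pdw_exec_requests.
SELECT request_id,
       step_index,
       operation_type,
       row_count,
       total_elapsed_time
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID1234'
  AND operation_type IN ('ShuffleMoveOperation', 'BroadcastMoveOperation')
ORDER BY step_index;
```

Steps with a large row_count or total_elapsed_time are the ones worth removing first.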

Distribution Types

The Azure Data warehouse supports 3 types of distributions:

  1. Replicate
  2. Hash
  3. Round robin

With replicate distribution, a full copy of the table is kept on each compute node. This distribution type is suited to small tables or tables that contain a static dataset, i.e. one that does not change frequently. With the other two distribution types, the rows are divided across the distributions (60 in total). In a round robin distribution, the rows in the table are spread evenly across the distributions. The hash distribution, on the other hand, gives us more control over how the data is distributed, by means of a distribution key.
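As a rough illustration, the distribution type is chosen in the WITH clause when the table is created. All table and column names below are hypothetical:

```sql
-- Replicate: small, rarely changing table
CREATE TABLE dbo.DimRegion
(
    RegionKey  INT          NOT NULL,
    RegionName NVARCHAR(50) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED INDEX (RegionKey));

-- Round robin: rows are spread evenly, no distribution key needed
CREATE TABLE dbo.StageSales
(
    SaleKey     BIGINT         NOT NULL,
    CustomerKey INT            NOT NULL,
    CostValue   DECIMAL(18, 2) NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Hash: rows are placed according to the hash of the distribution key
CREATE TABLE dbo.FactSales
(
    SaleKey     BIGINT         NOT NULL,
    CustomerKey INT            NOT NULL,
    CostValue   DECIMAL(18, 2) NULL
)
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);
```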

I’ll focus on the other two distribution types from here on, since in most cases we would be dealing with large, changing datasets (> 60 million rows).

Follow the robin

When a suitable distribution key cannot easily be found, one can use round robin. In fact, when we don’t care about request performance, such as during the ETL phase of our process pipeline, the round robin distribution is a great candidate for staging data with quick loads. Alas, we are here to optimise queries, so I will not follow the robin any further.

Hash mate

So that leaves us with our good mate, the hash.

Before we talk about hash distribution, we have to look ahead and envision the type of request patterns which will be executed in the data warehouse. In most cases this is an easy process, because the fact table will usually have a primary key which all the dimension tables depend on; if this is the case, that key is a good place to start.

If you are unfortunate and don’t have a fact table in the mart, then we have to review all the queries, especially those with joins, and identify the commonly used predicates. Once the predicates have been identified, we select a key based on the following criteria:

  • Predicate frequency: Commonly used predicate – dominant predicate
  • Distribution skew: Key with the lowest distribution skew

Identify distribution skew

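  1. To identify distribution skew, create multiple copies of the table with the different candidate distribution keys using CTAS (CREATE TABLE AS SELECT)
  2. Execute DBCC PDW_SHOWSPACEUSED against each copy. It reports the rows and space used per distribution, and large differences between distributions indicate skew.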
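The two steps above can be sketched as follows. The table dbo.FactSales and the candidate keys CustomerKey and ProductKey are hypothetical:

```sql
-- Step 1: one copy of the table per candidate distribution key
CREATE TABLE dbo.FactSales_ByCustomer
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.FactSales;

CREATE TABLE dbo.FactSales_ByProduct
WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.FactSales;

-- Step 2: compare the row counts per distribution for each copy
DBCC PDW_SHOWSPACEUSED('dbo.FactSales_ByCustomer');
DBCC PDW_SHOWSPACEUSED('dbo.FactSales_ByProduct');

-- Remove the copies once a key has been chosen
DROP TABLE dbo.FactSales_ByCustomer;
DROP TABLE dbo.FactSales_ByProduct;
```

The copy whose row counts are closest to each other across the distributions has the least skew.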

Living with distribution skew

In some cases if you cannot avoid distribution skew or find it difficult choosing a key with the least distribution skew, then I would suggest using temporary tables. I have previously written about my love for temporary tables – breaking down complex queries in the data warehouse.

Assume I have 2 tables, FactTable1 and DimensionTable1, which are distributed on column A and column B respectively. If the join predicate is on column B, then a temporary table can be of great help: FactTable1 is copied into a new temporary table (FactTable2) distributed on column B, which can then be joined to DimensionTable1. Both tables now share the same distribution key as the join predicate, so the join no longer needs to move data between distributions.
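A minimal sketch of that pattern, assuming both tables live in the dbo schema and that column B has the same data type in both; the final aggregate is only there to give the join something to do:

```sql
-- FactTable1 is distributed on column A; build a copy distributed on column B
CREATE TABLE #FactTable2
WITH (DISTRIBUTION = HASH(B), HEAP)
AS
SELECT * FROM dbo.FactTable1;

-- Both sides of the join are now distributed on B
SELECT d.B, COUNT(*) AS row_count
FROM #FactTable2 f
INNER JOIN dbo.DimensionTable1 d ON f.B = d.B
GROUP BY d.B;

DROP TABLE #FactTable2;
```

The CTAS itself still moves the data once, so the pattern pays off mainly when the temporary table is reused by several joins.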

Finally

It is worth noting that the new Gen2 data warehouse performance tier uses NVMe SSD for tempdb. Further performance results to be explored in future posts.

Azure, Random Inno

Capture Telemetry – Azure SQL Data Warehouse

A short article about capturing transient telemetry.

Most DMVs in the Azure SQL data warehouse are transient, which means that the data they hold is usually lost after:

  1. A pause / resume / scale event
  2. The max rollover threshold being reached (10000 rows)

For example, the data in the sys.dm_pdw_request_steps DMV, a very important DMV which provides a simple query plan for each query, will be lost if either of the events above occurs.

There are many patterns for persisting such data, but a simple pattern is shown below: the telemetry data is copied to a persisted table in the data warehouse, and can then be used for reporting or troubleshooting at a later time from an Azure SQL database.

[Image: diagram of the telemetry capture pattern]

In my solution I have created a function app which frequently triggers the copy from the DMV to a permanent table. An external table pointing at the Azure SQL DW is created in the Azure SQL db to view the data. One can also copy the data from the external table into a local table in the Azure SQL db to improve performance, say in reports.
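The copy step itself could look roughly like the sketch below. This is not the code from the solution: the table name dbo.RequestStepsHistory is hypothetical, the column list is a subset of sys.dm_pdw_request_steps, and it assumes that CTAS and INSERT ... SELECT are allowed to read from the DMV.

```sql
-- One-off: create the permanent table from the DMV columns we want to keep
CREATE TABLE dbo.RequestStepsHistory
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT request_id, step_index, operation_type, status,
       start_time, end_time, total_elapsed_time, row_count, command
FROM sys.dm_pdw_request_steps
WHERE end_time IS NOT NULL;

-- Scheduled (e.g. from the function app): append steps finished since the last capture
DECLARE @lastCaptured DATETIME;
SET @lastCaptured = (SELECT MAX(end_time) FROM dbo.RequestStepsHistory);

INSERT INTO dbo.RequestStepsHistory
SELECT request_id, step_index, operation_type, status,
       start_time, end_time, total_elapsed_time, row_count, command
FROM sys.dm_pdw_request_steps
WHERE end_time > ISNULL(@lastCaptured, '19000101');
```

Using end_time as a watermark keeps the statement simple, but steps that finish at exactly the same time as the last captured one could be missed; a capture interval well below the time it takes to reach the rollover threshold limits the exposure.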

Next post I’ll provide the code.

Azure, data warehouse, Random Inno, SQL

Upgrading The Azure Data warehouse To Gen2

The Data warehouse has recently been upgraded to the next version called Gen2. The good news is that the upgrade process is relatively seamless.

Upgrading is supposed to be a one step process but I have decided to add 2 extra steps:

  1. Backup / Take a snapshot
  2. Upgrade the data warehouse
  3. Rebuild all indexes

Backup / Take a snapshot

PS /Users/piro>New-AzureRmSqlDatabaseRestorePoint -ResourceGroupName "ResourceGroup01" -ServerName "Server01" -DatabaseName "Database01" -RestorePointLabel "RestorePoint01"

For more information refer to New-AzureRmSqlDatabaseRestorePoint.

Upgrade the data warehouse

At present the only way to upgrade is to use the portal. To upgrade, log in to the portal and navigate to the Azure SQL database. A new option to upgrade will be available in the database overview blade.

[Image: upgrade option in the database overview blade]

Rebuild all indexes


-- Drop the work table if it is left over from a previous run
IF OBJECT_ID('tempdb..#tables') IS NOT NULL DROP TABLE #tables
-- Collect every non-external table in the data warehouse
CREATE TABLE #tables
WITH(DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT s.name AS table_schema, t.name AS table_name FROM [sys].[tables] t
INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
WHERE t.[is_external] = 0;

DECLARE @tableName VARCHAR(MAX)
DECLARE @schemaName VARCHAR(MAX)
DECLARE @sql VARCHAR(MAX)

-- Rebuild the indexes one table at a time until the work table is empty
WHILE EXISTS (SELECT 1 FROM #tables)
BEGIN
    -- Both lookups use the same ordering, so they pick the same row
    SET @tableName = (SELECT TOP(1) table_name FROM #tables ORDER BY table_schema ASC, table_name ASC)
    SET @schemaName = (SELECT TOP(1) table_schema FROM #tables ORDER BY table_schema ASC, table_name ASC)
    SET @sql = 'ALTER INDEX ALL ON [' + @schemaName + '].['+ @tableName + '] REBUILD'
    EXEC(@sql)
    -- Remove the table that has just been processed
    DELETE FROM #tables WHERE table_name = @tableName AND table_schema = @schemaName
END
-- Clean up the work table
IF OBJECT_ID('tempdb..#tables') IS NOT NULL DROP TABLE #tables

 

Link to GitHub Code, RebuildIndex.sql

More information here, Optimize performance by upgrading SQL Data Warehouse

Random Inno

Azure Data warehouse Performance Tuning 101

When it comes to performance tuning or troubleshooting a standard SQL Server, many tools and techniques exist. For example, query execution plans can be analysed, there is Profiler, and there is a host of DMVs which a DBA cannot live without.

In an Azure Data warehouse there are similar DMVs. However, some basic principles can be applied up front to get good performance:

  1. Create statistics on every column
  2. Use indexes on join predicates: non-clustered indexes for small tables, for tables with a column store index, or in place of a clustered index
  3. Use a consistent distribution key for all dimension and fact tables
  4. Use column store indexes for extra large tables
  5. Use clustered indexes for large tables
  6. Rebuild indexes regularly as part of an inter-day or weekly maintenance process
  7. Update statistics regularly as part of an intra-day or inter-day maintenance process
  8. Limit concurrency utilisation using resource classes
  9. Choose the correct distribution type to avoid expensive DMS operations
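A few of the principles above translate directly into T-SQL. The statements below are a sketch only; the tables dbo.FactSales and dbo.DimCustomer, the column CustomerKey and the user LoadUser are all hypothetical:

```sql
-- 1. Create statistics on a column used in joins and filters
CREATE STATISTICS stat_FactSales_CustomerKey ON dbo.FactSales (CustomerKey);

-- 2. Non-clustered index on a join predicate of a small table
CREATE INDEX ix_DimCustomer_CustomerKey ON dbo.DimCustomer (CustomerKey);

-- 6. Rebuild all indexes on a table as part of regular maintenance
ALTER INDEX ALL ON dbo.FactSales REBUILD;

-- 7. Refresh all statistics on a table after data loads
UPDATE STATISTICS dbo.FactSales;

-- 8. Assign a user to a larger resource class, which gives each of its
--    requests more memory and so lowers the number that can run concurrently
EXEC sp_addrolemember 'largerc', 'LoadUser';
```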