
Monday, January 15, 2018

Distribution Cleanup Job Blocking

The distribution cleanup job may block the log reader agent for several minutes if there is a large number of rows to keep and a "delete TOP N" statement ends up deleting only N-1 rows.

The Delete TOP N Behavior

  • When there are N or more rows in the range, delete TOP N stops scanning once N rows are found.
  • When there are only N-1 rows, delete TOP N scans the entire range. If the range contains a very large number of rows, the scan may take a long time, and the locks are held until the scan finishes, which can cause blocking.
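
To see this behavior in isolation, here is a minimal repro sketch (hypothetical table and row counts, separate from the cleanup procedures). With 1,000,000 rows and only 1,999 qualifying rows (N-1 for N = 2,000), the delete reads the entire table

use tempdb
go
create table delete_top_demo (id int not null primary key clustered, flag bit not null)
insert delete_top_demo (id, flag)
select n, case when n <= 1999 then 1 else 0 end
from (select top (1000000) n = row_number() over (order by (select null))
      from sys.all_columns a cross join sys.all_columns b) t
go
set statistics io on
-- only 1,999 rows match flag = 1, so the scan cannot stop early and reads all 1,000,000 rows
delete top (2000) from delete_top_demo where flag = 1
set statistics io off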

For example, the distribution cleanup job calls the delete statements in a "WHILE 1 = 1" loop from the following stored procedures

-- sp_MSdelete_publisherdb_trans line 202:
delete TOP(2000) MSrepl_commands WITH (PAGLOCK) from MSrepl_commands with (INDEX(ucMSrepl_commands))

-- sp_MSdelete_dodelete line 38:
delete TOP(5000) MSrepl_transactions WITH (PAGLOCK) from MSrepl_transactions with (INDEX(ucMSrepl_transactions))

The following extended event session can be used to capture the query plans for lines 202 and 38 of the respective stored procedures

CREATE EVENT SESSION [distribution_cleanup] ON SERVER 
ADD EVENT sqlserver.query_post_execution_showplan(
    ACTION(sqlserver.query_plan_hash,sqlserver.tsql_stack)
    WHERE (([object_name]=N'sp_MSdelete_publisherdb_trans' 
 AND [sqlserver].[like_i_sql_unicode_string]([sqlserver].[tsql_stack],N'%frame level=''1''%line=''202''%frame level=''2''%') 
 OR ([object_name]=N'sp_MSdelete_dodelete'
 AND [sqlserver].[like_i_sql_unicode_string]([sqlserver].[tsql_stack],N'%frame level=''1''%line=''38''%frame level=''2''%'))))),
ADD EVENT sqlserver.sp_statement_completed(SET collect_object_name=(1),collect_statement=(1)
    ACTION(sqlserver.query_plan_hash,sqlserver.tsql_stack)
    WHERE (([object_name]=N'sp_MSdelete_publisherdb_trans' 
 AND [sqlserver].[like_i_sql_unicode_string]([sqlserver].[tsql_stack],N'%frame level=''1''%line=''202''%frame level=''2''%') 
 OR ([object_name]=N'sp_MSdelete_dodelete'
 AND [sqlserver].[like_i_sql_unicode_string]([sqlserver].[tsql_stack],N'%frame level=''1''%line=''38''%frame level=''2''%')))))
ADD TARGET package0.event_file(SET filename=N'c:\temp\distribution_cleanup.xel')
WITH (TRACK_CAUSALITY=ON)
GO

In this particular case, the table MSrepl_commands had about 600 million rows. Each iteration deleted 2,000 rows in a few milliseconds, but the last iteration deleted 1,352 rows and took 19 minutes


The following query can be used to extract the query plans from the .xel file, so we can use SSMS to visualize them

SELECT event_data = CONVERT(XML, event_data)
INTO #tmp
FROM sys.fn_xe_file_target_read_file( N'C:\temp\distribution_cleanup_0_131593922293630000*.xel', NULL, NULL, NULL);

SELECT 
t.event_data.value('(/event/@timestamp)[1]', 'datetime2') AS event_time
,t.event_data.value('(/event/data[@name="duration"]/value)[1]', 'bigint') AS duration
,t.event_data.value('(/event/data[@name="object_name"]/value)[1]', 'varchar(255)') AS object_name
,actual_plan = x.xml_fragment.query('.')
FROM #tmp AS t
CROSS APPLY t.event_data.nodes(N'/event/data[@name="showplan_xml"]/value/*')
AS x(xml_fragment)
order by event_time 


Comparing the query plan that deletes 2,000 rows vs. the plan that deletes 1,352 rows, the former reads 2,000 rows while the latter reads more than 200 million rows due to the residual predicate


This explains why the lock is held for a long time, and the blocking is explained by the clustered index definition shown below

The Clustered Index

Both tables, MSrepl_commands and MSrepl_transactions, have a similar clustered index with publisher_database_id as the leading key column. Thus, rows are stored in order by publisher_database_id. An example for MSrepl_transactions is shown below
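
For reference, the index keys can be listed with sp_helpindex. A sketch, with the key columns as they appeared in a typical distribution database

use distribution
exec sp_helpindex 'MSrepl_transactions' -- ucMSrepl_transactions: unique clustered on (publisher_database_id, xact_seqno)
exec sp_helpindex 'MSrepl_commands'     -- ucMSrepl_commands: unique clustered on (publisher_database_id, xact_seqno, command_id)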


If there is more than one publisher_database_id, the last page of the first publisher_database_id range will contain not only the newest rows for the first publisher_database_id but also the oldest rows for the second publisher_database_id. So the delete with (PAGLOCK) of the oldest rows for the second publisher_database_id will block the insert of new rows by the log reader for the first publisher_database_id.

Example

In my test environment, two publisher databases, pub1 and pub2, have publisher_database_id = 3 and publisher_database_id = 4, respectively.

-- create databases
use master
create database pub1
create database pub2
go

-- create tables
use pub1
create table test1(a int not null identity, b datetime, c varchar(255),
constraint test1_pk primary key (a)
)
use pub2
create table test2(a int not null identity, b datetime, c varchar(255),
constraint test2_pk primary key (a)
)
GO

-- Adding the transactional publication with immediate synch
use [pub1]
exec sp_replicationdboption @dbname = N'pub1', @optname = N'publish', @value = N'true'
exec sp_addpublication @publication = N'test_pub1', @allow_anonymous = N'true', @immediate_sync = N'true', @independent_agent = N'true', @status = N'active'
exec sp_addpublication_snapshot @publication = N'test_pub1', @frequency_type = 1
exec sp_addarticle @publication = N'test_pub1', @article = N'test1', @source_owner = N'dbo', @source_object = N'test1'
GO

-- Adding the transactional publication with immediate synch
use [pub2]
exec sp_replicationdboption @dbname = N'pub2', @optname = N'publish', @value = N'true'
exec sp_addpublication @publication = N'test_pub2', @allow_anonymous = N'true', @immediate_sync = N'true', @independent_agent = N'true', @status = N'active'
exec sp_addpublication_snapshot @publication = N'test_pub2', @frequency_type = 1
exec sp_addarticle @publication = N'test_pub2', @article = N'test2', @source_owner = N'dbo', @source_object = N'test2'
GO

-- disable distribution cleanup job
exec msdb..sp_update_job @job_name = 'Distribution clean up: distribution', @enabled = 0

-- add 5999 rows at 2:43 PM
use pub2
insert test2(b,c)
values (getdate(),replicate('x',200))
go 5999

-- add 21000 rows at 3:44 PM (one hour later)
use pub1
insert test1(b,c)
values (getdate(),replicate('x',200))
go 1000

use pub2
insert test2(b,c)
values (getdate(),replicate('x',200))
go 20000

-- verify there are 26999 rows
use distribution
select count(*) from MSrepl_transactions
select count(*) from MSrepl_commands

-- get root page
select allocated_page_file_id, allocated_page_page_id, next_page_page_id, previous_page_page_id, page_level 
from sys.dm_db_database_page_allocations(db_id(),object_id('MSrepl_transactions'),1,null,'DETAILED') 
where page_type_desc is not null and page_type_desc = 'INDEX_PAGE'


DBCC TRACEON(3604)
DBCC PAGE (distribution, 1, 72305, 3);
DBCC TRACEOFF(3604)



The last page for the publisher_database_id = 3 range is 72702 and the first page for the publisher_database_id = 4 range is 72629.

To simulate a long duration, we can execute the distribution cleanup proc with an open transaction so the locks are held

use distribution
begin tran
EXEC dbo.sp_MSdistribution_cleanup @min_distretention = 0, @max_distretention = 1

Removed 5999 replicated transactions consisting of 5999 statements in 1 seconds (11998 rows/sec).

The cleanup job removed the 5,999 rows we added 1 hour ago for publisher_database_id = 4 and held an X lock on page 72702 that contains the oldest rows for publisher_database_id = 4. We can see the locks using this query

select * from sys.dm_tran_locks where request_session_id = 68 and request_mode = 'X'



In another session, we can execute the following script to add new rows to publisher_database_id = 3 (pub1)

use pub1
insert test1(b,c)
values (getdate(),replicate('x',200))
go 

The log reader agent for publisher_database_id = 3 is blocked by the delete for publisher_database_id = 4, as shown in the sp_WhoIsActive output, because page 72702 also contains the newest rows for publisher_database_id = 3


Workaround

We can avoid having "delete TOP N" delete the last N-1 rows, and prevent it from reading millions of rows, if we modify the delete loop in the sp_MSdelete_publisherdb_trans stored procedure to iterate a precomputed number of times, deleting a multiple of N rows in total.

Code snippet for the MSrepl_commands table

declare @i int = 1, @max int -- new variables used by the modified loop

select @max = count(*)/2000 from MSrepl_commands with (INDEX(ucMSrepl_commands), nolock) where
 publisher_database_id = @publisher_database_id and
 xact_seqno <= @max_xact_seqno and
 (type & ~@snapshot_bit) not in (@directory_type, @alt_directory_type) and
 (type & ~@replpost_bit) <> @scriptexec_type
 OPTION (MAXDOP 4)

WHILE @i <= @max
BEGIN

 DELETE TOP(2000) MSrepl_commands WITH (PAGLOCK) from MSrepl_commands with (INDEX(ucMSrepl_commands)) where
  publisher_database_id = @publisher_database_id and
  xact_seqno <= @max_xact_seqno and
  (type & ~@snapshot_bit) not in (@directory_type, @alt_directory_type) and
  (type & ~@replpost_bit) <> @scriptexec_type
  OPTION (MAXDOP 1)
  
 select @row_count = @@rowcount
 -- Update output parameter
 select @num_commands = @num_commands + @row_count
    
 set @i=@i+1
END

Code snippet for the MSrepl_transactions table

declare @i int = 1, @max int -- new variables used by the modified loop

select @max = count(*)/5000 from MSrepl_transactions with (INDEX(ucMSrepl_transactions), nolock) where
 publisher_database_id = @publisher_database_id and
 xact_seqno <= @max_xact_seqno and
 xact_seqno <> @last_xact_seqno and
 xact_seqno <> @last_log_xact_seqno
 OPTION (MAXDOP 4)

WHILE @i <= @max
BEGIN
 exec dbo.sp_MSdelete_dodelete @publisher_database_id, 
  @max_xact_seqno, 
  @last_xact_seqno, 
  @last_log_xact_seqno,
  @has_immediate_sync


 select @row_count = @@rowcount

 -- Update output parameter
 select @num_transactions = @num_transactions + @row_count
   
 set @i=@i+1
END

Note that Microsoft doesn't support modifying system stored procedures, so test this workaround at your own risk.



Wednesday, June 14, 2017

Version Ghost Records Causing Replication Latency

Enabling snapshot isolation on the subscriber database is a common practice to avoid user queries blocking transactional replication threads.

However, a side effect may be an increase in replication latency during long-running user queries, because the version ghost records in the MSreplication_subscriptions table cannot be cleaned up until those long-running queries complete.

MSreplication_subscriptions 

There is one MSreplication_subscriptions table in each subscriber database. It contains one row per subscription and is used to keep track of the last transaction delivered to the subscriber.
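A quick way to look at the tracking rows (the column list below comes from the update statement captured later in this post)

select publisher, publisher_db, publication, subscription_type, transaction_timestamp, [time]
from MSreplication_subscriptions

Using extended events filtered by the session id of the distribution agent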

CREATE EVENT SESSION [subscription_thread_trace] ON SERVER 
ADD EVENT sqlserver.prepare_sql(
    WHERE ([sqlserver].[session_id]=(59))),
ADD EVENT sqlserver.rpc_starting(
    WHERE ([sqlserver].[session_id]=(59))),
ADD EVENT sqlserver.sql_batch_starting(
    WHERE ([sqlserver].[session_id]=(59))),
ADD EVENT sqlserver.sql_statement_starting(
    WHERE ([sqlserver].[session_id]=(59)))
WITH (TRACK_CAUSALITY=ON)
GO

We can see that the table is updated after each transaction has been committed to the subscriber database



If there are thousands of transactions replicated, this table will be updated thousands of times too

Row Versioning 

If the subscriber database has row versioning enabled (snapshot isolation or read committed snapshot), for example

ALTER DATABASE [subscriber] SET READ_COMMITTED_SNAPSHOT ON 

then each update on the MSreplication_subscriptions table will mark the old row on the page as a ghost, add a pointer to the row version in the version store in tempdb, and add the new row to the page.

Ghost records cannot be cleaned up if they point to a row version in the version store, and a row version cannot be cleaned up while there is a query using the version store.

So a long-running query will cause MSreplication_subscriptions to grow full of version ghost records.

Example 

To simulate a long-running query in a test environment, we can leave a transaction open on any table in the subscriber database

USE [subscriber_db]
BEGIN TRAN
DELETE TOP(1) [table]

Execute the statement obtained from the extended event session

exec sp_executesql N'update MSreplication_subscriptions 
set transaction_timestamp = cast(@P1 as binary(15)) + cast(substring(transaction_timestamp, 16, 1) as binary(1))
, "time" = @P2 
where UPPER(publisher) = UPPER(@P3) 
and publisher_db = @P4 
and publication = @P5 
and subscription_type = 0 
and (substring(transaction_timestamp, 16, 1) = 0 or datalength(transaction_timestamp) < 16)'
,N'@P1 varbinary(14)
,@P2 datetime
,@P3 nvarchar(24)
,@P4 nvarchar(8)
,@P5 nvarchar(5)'
,0x0
,'2017-06-13 15:03:47'
,N'subscriber'
,N'subscriber_db'
,N'publication'
go 

Find the data page id

DBCC IND ('subscriber_db', 'MSreplication_subscriptions', 1);



Examine the page content

DBCC TRACEON(3604)
DBCC PAGE (subscriber_db, 1, 1767872, 3);
DBCC TRACEOFF(3604)

After the update, the old row on the page is marked as a ghost in slot 0, and it points to the version in the version store in tempdb. The new row is added in slot 1, so internally the update behaves as a delete and insert pair, as shown below (some output has been omitted).

Slot 0 Offset 0x18e Length 302

Record Type = GHOST_DATA_RECORD     Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS VERSIONING_INFO
Record Size = 302                   
Memory Dump @0x000000F2084FA18E

0000000000000000:   7c003e00 00000000 2d880391 a7000000 80440d57  |.>.....-..Ԥ....D.W
0000000000000014:   fd7f0000 d820e17f c2027dae 62de7eb6 754898b0  ý...Ø á.Â.}®bÞ~¶uH.°
0000000000000028:   f2bb6c6c 0b260000 05000000 00000000 01000000  ò»ll.&..............
000000000000003C:   fd7f0e00 00150500 58006800 9800a800 20017400  ý.......X.h...¨. .t.
0000000000000050:   65007300 74003100 74007000 63006300 5f007300  e.s.t.1.t.p.c.c._.s.
0000000000000064:   73006400 42004f00 57004e00 56004400 49004d00  s.d.B.O.W.N.V.D.I.M.
0000000000000078:   4d003100 5c004200 4f004d00 53005300 4e005600  M.1.\.B.O.M.S.S.N.V.
000000000000008C:   44004900 4d004d00 30003100 00000000 00000000  D.I.M.M.0.1.........
00000000000000A0:   00000000 00000000 42004f00 57004e00 56004400  ........B.O.W.N.V.D.
00000000000000B4:   49004d00 4d003100 5c004200 4f004d00 53005300  I.M.M.1.\.B.O.M.S.S.
00000000000000C8:   4e005600 44004900 4d002d00 74007000 63006300  N.V.D.I.M.-.t.p.c.c.
00000000000000DC:   5f007300 73006400 2d007400 65007300 74003100  _.s.s.d.-.t.e.s.t.1.
00000000000000F0:   2d004200 4f005700 53005300 44005400 45005300  -.B.O.W.S.S.D.T.E.S.
0000000000000104:   54003000 33005c00 42004f00 4d005300 53005300  T.0.3.\.B.O.M.S.S.S.
0000000000000118:   53004400 2d003100 f01e0200 05000100 745a2f07  S.D.-.1.ð.......tZ/.
000000000000012C:   0000                                          ..     

Version Information = 
 Transaction Timestamp: 120543860
 Version Pointer: (file 5 page 138992 currentSlotId 1)

Slot 1 Offset 0x2bc Length 302

Record Type = PRIMARY_RECORD        Record Attributes =  NULL_BITMAP VARIABLE_COLUMNS VERSIONING_INFO
Record Size = 302                   
Memory Dump @0x000000F2084FA2BC

0000000000000000:   70003e00 00000000 01880391 a7000000 00100000  p.>........Ԥ.......
0000000000000014:   20000000 00000001 00e07dae 62de7eb6 754898b0   ........à}®bÞ~¶uH.°
0000000000000028:   f2bb6c6c 0b260000 f0bceb56 fd7f0000 d7fa7cfd  ò»ll.&..ð¼ëVý...×ú|ý
000000000000003C:   7f000e00 00150500 58006800 9800a800 20017400  ........X.h...¨. .t.
0000000000000050:   65007300 74003100 74007000 63006300 5f007300  e.s.t.1.t.p.c.c._.s.
0000000000000064:   73006400 42004f00 57004e00 56004400 49004d00  s.d.B.O.W.N.V.D.I.M.
0000000000000078:   4d003100 5c004200 4f004d00 53005300 4e005600  M.1.\.B.O.M.S.S.N.V.
000000000000008C:   44004900 4d004d00 30003100 01000000 00000000  D.I.M.M.0.1.........
00000000000000A0:   00000000 00000000 42004f00 57004e00 56004400  ........B.O.W.N.V.D.
00000000000000B4:   49004d00 4d003100 5c004200 4f004d00 53005300  I.M.M.1.\.B.O.M.S.S.
00000000000000C8:   4e005600 44004900 4d002d00 74007000 63006300  N.V.D.I.M.-.t.p.c.c.
00000000000000DC:   5f007300 73006400 2d007400 65007300 74003100  _.s.s.d.-.t.e.s.t.1.
00000000000000F0:   2d004200 4f005700 53005300 44005400 45005300  -.B.O.W.S.S.D.T.E.S.
0000000000000104:   54003000 33005c00 42004f00 4d005300 53005300  T.0.3.\.B.O.M.S.S.S.
0000000000000118:   53004400 2d003100 00000000 00000000 745a2f07  S.D.-.1.........tZ/.
000000000000012C:   0000                                          ..     

Version Information = 
 Transaction Timestamp: 120543860
 Version Pointer: Null

Eventually, after thousands of updates, this table will have a high number of version ghost records. In the following example, the table had 17,085 version ghost records and grew to 1,315 pages instead of 1 page

SELECT page_count, record_count, ghost_record_count, version_ghost_record_count 
FROM sys.dm_db_index_physical_stats(db_id('tpcc_ssd'), object_id('MSreplication_subscriptions'), NULL, NULL , 'SAMPLED');


Replication Latency 

The problem of having MSreplication_subscriptions with a high number of ghost records is that the update to this table takes longer and longer. For example, we may see that the logical reads are 1,362 pages

set statistics IO on
go
exec sp_executesql N'update MSreplication_subscriptions 
set transaction_timestamp = cast(@P1 as binary(15)) + cast(substring(transaction_timestamp, 16, 1) as binary(1))
, "time" = @P2 
where UPPER(publisher) = UPPER(@P3) 
and publisher_db = @P4 
and publication = @P5 
and subscription_type = 0 
and (substring(transaction_timestamp, 16, 1) = 0 or datalength(transaction_timestamp) < 16)'
,N'@P1 varbinary(14)
,@P2 datetime
,@P3 nvarchar(24)
,@P4 nvarchar(8)
,@P5 nvarchar(5)'
,0x000009B600088BB0001D00000000
,'2017-06-13 15:03:47'
,N'subscriber'
,N'subscriber_db'
,N'publication'
go 
set statistics IO off
go

Table 'MSreplication_subscriptions'. Scan count 1, logical reads 1362, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)


Because the WHERE clause of the update is not sargable (the UPPER() function applied to the publisher column prevents an index seek), the query plan will show a clustered index scan


Workaround 

If we find that MSreplication_subscriptions has a high number of version ghost records

SELECT page_count, record_count, ghost_record_count, version_ghost_record_count 
FROM sys.dm_db_index_physical_stats(db_id('tpcc_ssd'), object_id('MSreplication_subscriptions'), NULL, NULL , 'SAMPLED');

then it means there is a long-running query using the version store that is preventing the ghost cleanup.
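To find the culprit, we can list the active snapshot transactions and how long they have been running (a sketch)

select session_id, transaction_id, elapsed_time_seconds
from sys.dm_tran_active_snapshot_database_transactions
order by elapsed_time_seconds desc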
One way to force the ghost cleanup is to rebuild the index

ALTER INDEX uc1MSReplication_subscriptions on MSreplication_subscriptions REBUILD WITH(ONLINE=ON)

The updates will now scan one page, and the replication latency should come down.

Thursday, March 23, 2017

Replication Commands That May Cause Blocking

Besides the replication snapshot agent, there are other replication commands that require a schema modification (SCH-M) lock for a very short duration on the article(s) of the publication, and they may cause blocking on very busy publishers.

To find out which commands take a SCH-M lock, we can use the lock_acquired extended event, filtered by the object id of the article (table) in the publication database and by lock mode = SCH-M, while executing the most common commands

CREATE EVENT SESSION [replication_sch_m_locks] ON SERVER 
ADD EVENT sqlserver.lock_acquired(
    ACTION(sqlserver.client_app_name,sqlserver.sql_text)
    WHERE ([mode]=(2) AND [package0].[equal_int64]([object_id],(222623836))))
ADD TARGET package0.event_file(SET filename=N'c:\temp\replication_sch_m_locks',max_file_size=(100),max_rollover_files=(10))
WITH (TRACK_CAUSALITY=ON)
GO

Below are the commands that take a SCH-M lock on articles (tables)
  • sp_addarticle @article = 'article'
  • sp_changearticle @article = 'article'
  • sp_dropsubscription @article = 'article'
  • sp_dropsubscription @article = 'all'
  • Initializing the articles being added with a snapshot
For example, when creating a publication we use sp_addarticle, which requires a SCH-M lock on each article. If we initialize using the snapshot agent, it will require SCH-M locks on all articles of the publication.

To add articles to an existing publication, we also use sp_addarticle, which requires a SCH-M lock only on the articles being added. If we initialize them using the snapshot agent, it will require SCH-M locks only on the articles being added.
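
While one of these commands is running, we can also spot the SCH-M request and anything waiting behind it with a quick query (a sketch; dbo.article is a placeholder for the published table)

select request_session_id, request_mode, request_status
from sys.dm_tran_locks
where resource_type = 'OBJECT'
and resource_database_id = db_id()
and resource_associated_entity_id = object_id(N'dbo.article')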

Wednesday, March 22, 2017

Duplicate Key or Row Not Found Errors

For example, we will get this error in the distribution agent

Violation of PRIMARY KEY constraint 'PK_Sales_OrderLines'. Cannot insert duplicate key in object 'Sales.OrderLines'. The duplicate key value is (2). (Source: MSSQLServer, Error number: 2627)

Or

The row was not found at the Subscriber when applying the replicated DELETE command for Table '[Sales].[OrderLines]' with Primary Key(s): [OrderLineID] = 2 (Source: MSSQLServer, Error number: 20568)

The most common reasons I have found for these errors are a duplicate article subscription, or another process modifying the data in the subscriber table.

Duplicate article subscription

To find out if it is due to a duplicate article subscription, we can run the following stored procedure in the publisher database (e.g. WideWorldImporters), providing the subscriber instance name and the article name from the error message above (e.g. OrderLines)

exec sp_helpsubscription @subscriber = 'InstanceName', @article = 'OrderLines'


We will see that we are subscribing to the same article OrderLines from two publications, BigTables and SmallTables

Thus, a single insert into the table OrderLines in the publisher will be replicated twice to the subscriber, generating the duplicate key error. In a similar manner, a single delete in the table OrderLines in the publisher will be replicated twice to the subscriber, generating the row not found error

To fix it, we can drop the duplicate article from one of the two publications so that each command is delivered to the subscriber only once.
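A sketch, assuming we keep OrderLines in BigTables and remove it from SmallTables (names taken from the sp_helpsubscription output above)

use [WideWorldImporters]
exec sp_dropsubscription @publication = N'SmallTables', @article = N'OrderLines', @subscriber = N'all'
exec sp_droparticle @publication = N'SmallTables', @article = N'OrderLines'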



Another Process Modifying the Data

To find out if it is due to another process modifying the data, we can create a SQL Audit on the subscriber table (e.g. OrderLines). We can filter by the accounts that have access, or use public to show all accounts, as shown below.
USE [master]
GO

CREATE SERVER AUDIT [Audit-OrderLines]
TO FILE 
( FILEPATH = N'c:\temp\'
 ,MAXSIZE = 1024 MB
 ,MAX_FILES = 10
 ,RESERVE_DISK_SPACE = OFF
)
WITH
( QUEUE_DELAY = 1000
 ,ON_FAILURE = CONTINUE
 ,AUDIT_GUID = '146fab87-58c8-4521-b4b5-9ffb175f739a'
)
ALTER SERVER AUDIT [Audit-OrderLines] WITH (STATE = ON)
GO

USE [WideWorldImportersDSS]
GO

CREATE DATABASE AUDIT SPECIFICATION [DatabaseAuditSpecification-OrderLines]
FOR SERVER AUDIT [Audit-OrderLines]
ADD (DELETE ON OBJECT::[Sales].[OrderLines] BY [public]),
ADD (INSERT ON OBJECT::[Sales].[OrderLines] BY [public]),
ADD (UPDATE ON OBJECT::[Sales].[OrderLines] BY [public])
WITH (STATE = ON)
GO
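
We can read the captured events from the audit files with sys.fn_get_audit_file. A minimal sketch, where the wildcard picks up the rollover files

SELECT event_time, server_principal_name, object_name, statement
FROM sys.fn_get_audit_file(N'c:\temp\*.sqlaudit', default, default)
ORDER BY event_time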

The SQL Audit file will show who modified records in the table, when, and how. In the example below, the first line was a trigger, the second was a stored procedure, the third was ad-hoc SQL, and the rest of the lines were done by the replication account


To fix it, we need to stop the other processes that are modifying the data and then resynchronize the data

Friday, December 30, 2016

Querying the Data Validation Results

Typically, I use Replication Monitor to validate subscriptions, and we can see the results in the subscription details window, as shown below. But if the publication has many tables, we can use a query instead to easily spot the table that is out of synchronization.


We can run this query on the distribution database, providing the publisher database (e.g. WideWorldImporters) and the publication (e.g. TransactionTables)

 
select 
a.name
,h.comments
from MSdistribution_history h with (nolock)
join MSdistribution_agents a with (nolock)
on h.agent_id = a.id
where a.publisher_db = 'WideWorldImporters'
and a.publication = 'TransactionTables'
and h.comments like 'Table% out of synchronization%'
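
The validation itself can also be started from T-SQL instead of Replication Monitor. A minimal sketch, run in the publisher database (parameter values are assumptions, not from the original post)

use WideWorldImporters
exec sp_publication_validation @publication = N'TransactionTables'
 , @rowcount_only = 2 -- rowcount and binary checksum
 , @full_or_fast = 2  -- fast validation, falling back to full when needed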


Friday, November 18, 2016

The Power of Window Functions

Window functions are the most efficient way to calculate certain aggregates such as running totals.

For example, we could use the CROSS APPLY operator to calculate the running total, as shown below, using the sample database Wide World Importers
 
SELECT t1.CustomerID, t1.InvoiceID, t1.TransactionAmount
,c.RunningTotal
FROM [WideWorldImporters].[Sales].[CustomerTransactions] t1
CROSS APPLY(
SELECT SUM(t2.TransactionAmount) RunningTotal
FROM [WideWorldImporters].[Sales].[CustomerTransactions] t2
WHERE t2.InvoiceID is not null
AND t2.InvoiceID <= t1.InvoiceID
AND t1.CustomerID = t2.CustomerID
) c
WHERE t1.InvoiceID is not null
ORDER BY t1.CustomerID, t1.InvoiceID

We could refactor this query to use window functions, as shown below
 
SELECT t1.CustomerID, t1.InvoiceID, t1.TransactionAmount,
SUM(t1.TransactionAmount) OVER(PARTITION BY t1.CustomerID
ORDER BY t1.InvoiceID
ROWS BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) RunningTotal
FROM [WideWorldImporters].[Sales].[CustomerTransactions] t1
WHERE t1.InvoiceID is not null
ORDER BY t1.CustomerID, t1.InvoiceID

Both queries will get the same result


However, the performance difference is enormous


As shown above, the CROSS APPLY query (first statement) takes about 5 minutes and 2 million reads, while the window function (second statement) takes 264 milliseconds and about a thousand reads to complete. The CROSS APPLY recomputes the running sum from the start of each customer's invoices for every row, while the window aggregate makes a single ordered pass.

Enjoy the power.

Thursday, August 4, 2016

Replication Snapshot Agent Blocking

The snapshot agent is used to initialize the subscription in transactional replication, and it requires a very short duration schema modification (SCH-M) lock on all the articles of the publication. If the articles are being queried thousands of times per second or by long-running queries, those queries will block the snapshot agent. Because the SCH-M lock is not compatible with any other lock, once the SCH-M lock request is waiting, everybody else will wait behind it.

If blocking for a few seconds is an issue in your system, you can initialize when there is low activity or initialize without a snapshot, as documented in Books Online.
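For example, a minimal sketch of the no-snapshot option (placeholder names; the schema and data must already exist at the subscriber)

exec sp_addsubscription @publication = N'publication'
 , @subscriber = N'(subscriber)'
 , @destination_db = N'subscriber_db'
 , @subscription_type = N'Push'
 , @sync_type = N'replication support only'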

Below I will show you how to reproduce this blocking scenario using a long-running query.
 

-- on publisher: create publication database

use master
create database publisher_test_snapshot_agent_blocking
go

use publisher_test_snapshot_agent_blocking
create table paul_test1(a int primary key, b int)
create table paul_test2(a int primary key, b int)

-- on subscriber: create subscription database

use master
create database subscriber_test_snapshot_agent_blocking
go

-- on publisher: create publication and subscription

use publisher_test_snapshot_agent_blocking
exec sp_replicationdboption @dbname = N'publisher_test_snapshot_agent_blocking', @optname = N'publish', @value = N'true'
exec sp_addpublication @publication = N'test_snapshot_agent_blocking', @description = N'test', @sync_method = N'concurrent', @retention = 0, @allow_push = N'true', @allow_pull = N'true', @allow_anonymous = N'false', @enabled_for_internet = N'false', @snapshot_in_defaultfolder = N'true', @compress_snapshot = N'false', @ftp_port = 21, @ftp_login = N'anonymous', @allow_subscription_copy = N'false', @add_to_active_directory = N'false', @repl_freq = N'continuous', @status = N'active', @independent_agent = N'true', @immediate_sync = N'false', @allow_sync_tran = N'false', @autogen_sync_procs = N'false', @allow_queued_tran = N'false', @allow_dts = N'false', @replicate_ddl = 1, @allow_initialize_from_backup = N'false', @enabled_for_p2p = N'false', @enabled_for_het_sub = N'false'
exec sp_addpublication_snapshot @publication = N'test_snapshot_agent_blocking', @frequency_type = 1, @frequency_interval = 0, @frequency_relative_interval = 0, @frequency_recurrence_factor = 0, @frequency_subday = 0, @frequency_subday_interval = 0, @active_start_time_of_day = 0, @active_end_time_of_day = 235959, @active_start_date = 0, @active_end_date = 0, @job_login = null, @job_password = null, @publisher_security_mode = 1
exec sp_addarticle @publication = N'test_snapshot_agent_blocking', @article = N'paul_test1', @source_owner = N'dbo', @source_object = N'paul_test1', @type = N'logbased', @description = N'', @creation_script = N'', @pre_creation_cmd = N'drop', @schema_option = 0x000000000803509F, @identityrangemanagementoption = N'none', @destination_table = N'paul_test1', @destination_owner = N'dbo', @status = 24, @vertical_partition = N'false', @ins_cmd = N'CALL [sp_MSins_dbopaul_test1]', @del_cmd = N'CALL [sp_MSdel_dbopaul_test1]', @upd_cmd = N'SCALL [sp_MSupd_dbopaul_test1]'
exec sp_addarticle @publication = N'test_snapshot_agent_blocking', @article = N'paul_test2', @source_owner = N'dbo', @source_object = N'paul_test2', @type = N'logbased', @description = N'', @creation_script = N'', @pre_creation_cmd = N'drop', @schema_option = 0x000000000803509F, @identityrangemanagementoption = N'none', @destination_table = N'paul_test2', @destination_owner = N'dbo', @status = 24, @vertical_partition = N'false', @ins_cmd = N'CALL [sp_MSins_dbopaul_test2]', @del_cmd = N'CALL [sp_MSdel_dbopaul_test2]', @upd_cmd = N'SCALL [sp_MSupd_dbopaul_test2]'
exec sp_addsubscription @publication = N'test_snapshot_agent_blocking', @subscriber = N'(subscriber)', @destination_db = N'subscriber_test_snapshot_agent_blocking', @subscription_type = N'Push', @sync_type = N'automatic', @article = N'all', @update_mode = N'read only', @subscriber_type = 0
GO

-- on a new query window: run this long running select on the publisher

use publisher_test_snapshot_agent_blocking
select *
from master..spt_values a
cross join master..spt_values b
cross join paul_test1

-- on a new query window: start the snapshot agent  on the distributor

use distribution
exec sp_MSstartsnapshot_agent @publisher = N'(publisher)', @publisher_db = N'publisher_test_snapshot_agent_blocking', @publication = N'test_snapshot_agent_blocking'

-- on a new query window: run this long running select on the publisher

use publisher_test_snapshot_agent_blocking
select *
from master..spt_values a
cross join master..spt_values b
cross join paul_test1

-- on a new query window: run this long running select on the publisher

use publisher_test_snapshot_agent_blocking
select *
from master..spt_values a
cross join master..spt_values b
cross join paul_test1

-- on a new query window: run sp_WhoIsActive

exec sp_WhoIsActive 
@get_outer_command = 1
,@get_plans = 1
,@get_transaction_info = 1
,@find_block_leaders = 1
,@get_locks = 1

You will see that the long-running query (session id = 88) is blocking the snapshot agent (session id = 95), which is requesting SCH-M locks, and which in turn is blocking the other two queries (session id = 52 and 86) requesting Intent Shared (IS) locks.