The distribution cleanup job may block the Log Reader Agent for several minutes when there is a large number of rows to keep and the last "delete TOP N" iteration deletes fewer than N rows.
The Delete TOP N Behavior
When N or more rows qualify, delete TOP N stops scanning the range as soon as N rows have been deleted.
When only N-1 rows qualify, delete TOP N scans the entire range looking for an Nth row. If the range contains a very large number of rows, the scan can take a long time, and the locks are held until the scan finishes, which can cause blocking.
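The difference is easy to reproduce on any table. The sketch below uses a hypothetical table (not part of the replication schema) and SET STATISTICS IO to compare the logical reads of both cases:

```sql
-- Hypothetical repro table; names are illustrative only.
create table dbo.delete_top_demo
(
    id  int not null primary key clustered,
    val int not null
)

-- 100,000 rows; only the first 5 have val = 1.
insert dbo.delete_top_demo (id, val)
select n, case when n <= 5 then 1 else 2 end
from (select top (100000) row_number() over (order by (select null)) as n
      from sys.all_objects as a cross join sys.all_objects as b) as nums

set statistics io on

-- 5 qualifying rows: delete TOP (3) stops after the 3rd match (few logical reads).
delete top (3) from dbo.delete_top_demo where val = 1

-- Only 2 qualifying rows remain: delete TOP (3) never reaches 3 matches,
-- so it scans all 100,000 rows looking for a 3rd one (many logical reads).
delete top (3) from dbo.delete_top_demo where val = 1
```

The first delete stops after the third matching row; the second never finds a third match, so it reads the whole table before giving up, and it holds its locks for the entire scan.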
For example, the distribution cleanup job calls the delete statements in a "WHILE 1 = 1" loop from the following stored procedures:
-- sp_MSdelete_publisherdb_trans line 202:
delete TOP(2000) MSrepl_commands WITH (PAGLOCK) from MSrepl_commands with (INDEX(ucMSrepl_commands))
-- sp_MSdelete_dodelete line 38:
delete TOP(5000) MSrepl_transactions WITH (PAGLOCK) from MSrepl_transactions with (INDEX(ucMSrepl_transactions))
The following extended event session can be used to capture the query plans for line 202 and line 38 of the respective stored procedures:
CREATE EVENT SESSION [distribution_cleanup] ON SERVER
ADD EVENT sqlserver.query_post_execution_showplan(
ACTION(sqlserver.query_plan_hash,sqlserver.tsql_stack)
WHERE (([object_name]=N'sp_MSdelete_publisherdb_trans'
AND [sqlserver].[like_i_sql_unicode_string]([sqlserver].[tsql_stack],N'%frame level=''1''%line=''202''%frame level=''2''%')
OR ([object_name]=N'sp_MSdelete_dodelete'
AND [sqlserver].[like_i_sql_unicode_string]([sqlserver].[tsql_stack],N'%frame level=''1''%line=''38''%frame level=''2''%'))))),
ADD EVENT sqlserver.sp_statement_completed(SET collect_object_name=(1),collect_statement=(1)
ACTION(sqlserver.query_plan_hash,sqlserver.tsql_stack)
WHERE (([object_name]=N'sp_MSdelete_publisherdb_trans'
AND [sqlserver].[like_i_sql_unicode_string]([sqlserver].[tsql_stack],N'%frame level=''1''%line=''202''%frame level=''2''%')
OR ([object_name]=N'sp_MSdelete_dodelete'
AND [sqlserver].[like_i_sql_unicode_string]([sqlserver].[tsql_stack],N'%frame level=''1''%line=''38''%frame level=''2''%')))))
ADD TARGET package0.event_file(SET filename=N'c:\temp\distribution_cleanup.xel')
WITH (TRACK_CAUSALITY=ON)
GO
In this particular case, the table MSrepl_commands had about 600 million rows. Each iteration deleted 2,000 rows in a few milliseconds, but the last iteration deleted only 1,352 rows and took 19 minutes.
The following query extracts the query plans from the .xel file so we can use SSMS to visualize them:
SELECT event_data = CONVERT(XML, event_data)
INTO #tmp
FROM sys.fn_xe_file_target_read_file( N'C:\temp\distribution_cleanup_0_131593922293630000*.xel', NULL, NULL, NULL);
SELECT
t.event_data.value('(/event/@timestamp)[1]', 'datetime2') AS event_time
,t.event_data.value('(/event/data[@name="duration"]/value)[1]', 'bigint') AS duration
,t.event_data.value('(/event/data[@name="object_name"]/value)[1]', 'varchar(255)') AS object_name
,actual_plan = x.xml_fragment.query('.')
FROM #tmp AS t
CROSS APPLY t.event_data.nodes(N'/event/data[@name="showplan_xml"]/value/*')
AS x(xml_fragment)
order by event_time
Comparing the query plan of an iteration that deletes 2,000 rows with the plan of the last iteration that deletes 1,352 rows, the former reads 2,000 rows while the latter reads more than 200 million rows because of the residual predicate.
This explains why the lock is held for a long time; the blocking itself is explained by the clustered index definition described in the next section.
The Clustered Index
Both tables, MSrepl_commands and MSrepl_transactions, have a similar clustered index with publisher_database_id as the leading key column, so rows are stored in order of publisher_database_id. An example for MSrepl_transactions is shown below.
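A reconstruction of that index for illustration (verify against your own distribution database, since exact definitions can vary by version):

```sql
-- Reconstructed for illustration; verify with:
--   exec sp_helpindex 'dbo.MSrepl_transactions'
CREATE UNIQUE CLUSTERED INDEX ucMSrepl_transactions
    ON dbo.MSrepl_transactions (publisher_database_id, xact_seqno)
```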
If there is more than one publisher_database_id, the last page of the first publisher_database_id range contains not only the newest rows for the first publisher_database_id but also the oldest rows for the second one. As a result, a delete with (PAGLOCK) of the oldest rows for the second publisher_database_id will block the log reader's insert of new rows for the first publisher_database_id.
Example
In my test environment, two publisher databases, pub1 and pub2, have publisher_database_id = 3 and publisher_database_id = 4 respectively.
-- create databases
use master
create database pub1
create database pub2
go
-- create tables
use pub1
create table test1(a int not null identity, b datetime, c varchar(255)
constraint test1_pk primary key (a)
)
use pub2
create table test2(a int not null identity, b datetime, c varchar(255)
constraint test2_pk primary key (a)
)
GO
-- Adding the transactional publication with immediate synch
use [pub1]
exec sp_replicationdboption @dbname = N'pub1', @optname = N'publish', @value = N'true'
exec sp_addpublication @publication = N'test_pub1', @allow_anonymous = N'true', @immediate_sync = N'true', @independent_agent = N'true', @status = N'active'
exec sp_addpublication_snapshot @publication = N'test_pub1', @frequency_type = 1
exec sp_addarticle @publication = N'test_pub1', @article = N'test1', @source_owner = N'dbo', @source_object = N'test1'
GO
-- Adding the transactional publication with immediate synch
use [pub2]
exec sp_replicationdboption @dbname = N'pub2', @optname = N'publish', @value = N'true'
exec sp_addpublication @publication = N'test_pub2', @allow_anonymous = N'true', @immediate_sync = N'true', @independent_agent = N'true', @status = N'active'
exec sp_addpublication_snapshot @publication = N'test_pub2', @frequency_type = 1
exec sp_addarticle @publication = N'test_pub2', @article = N'test2', @source_owner = N'dbo', @source_object = N'test2'
GO
-- disable distribution cleanup job
exec msdb..sp_update_job @job_name = 'Distribution clean up: distribution', @enabled = 0
-- add 5999 rows at 2:43 PM
use pub2
insert test2(b,c)
values (getdate(),replicate('x',200))
go 5999
-- add 21000 rows at 3:44 PM (one hour later)
use pub1
insert test1(b,c)
values (getdate(),replicate('x',200))
go 1000
use pub2
insert test2(b,c)
values (getdate(),replicate('x',200))
go 20000
-- verify there are 26,999 rows
use distribution
select count(*) from MSrepl_transactions
select count(*) from MSrepl_commands
-- get the index pages (the root page has the highest page_level)
select allocated_page_file_id, allocated_page_page_id, next_page_page_id, previous_page_page_id, page_level
from sys.dm_db_database_page_allocations(db_id(),object_id('MSrepl_transactions'),1,null,'DETAILED')
where page_type_desc = 'INDEX_PAGE'
order by page_level desc
DBCC TRACEON(3604)
DBCC PAGE (distribution, 1, 72305, 3);
DBCC TRACEOFF(3604)
From the DBCC PAGE output, the last page of the publisher_database_id = 3 range is 72702, and the first page whose first key belongs to publisher_database_id = 4 is 72629; the boundary page 72702 also holds the oldest rows for publisher_database_id = 4.
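As an alternative to DBCC PAGE, the undocumented %%physloc%% virtual column together with sys.fn_PhysLocFormatter (available since SQL Server 2008) can show which page each row lives on. This is a sketch, not part of the original steps:

```sql
use distribution
-- Newest rows for publisher_database_id = 3 ...
select top (3) publisher_database_id, xact_seqno,
       sys.fn_PhysLocFormatter(%%physloc%%) as file_page_slot
from dbo.MSrepl_transactions
where publisher_database_id = 3
order by xact_seqno desc

-- ... and oldest rows for publisher_database_id = 4.
-- If the boundary page is shared, both queries report the same page id.
select top (3) publisher_database_id, xact_seqno,
       sys.fn_PhysLocFormatter(%%physloc%%) as file_page_slot
from dbo.MSrepl_transactions
where publisher_database_id = 4
order by xact_seqno asc
```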
To simulate a long-running delete, we can execute the distribution cleanup procedure inside an open transaction:
use distribution
begin tran
EXEC dbo.sp_MSdistribution_cleanup @min_distretention = 0, @max_distretention = 1
Removed 5999 replicated transactions consisting of 5999 statements in 1 seconds (11998 rows/sec).
The cleanup job removed the 5,999 rows we added 1 hour ago for publisher_database_id = 4 and held an X lock on page 72702, which contains the oldest rows for publisher_database_id = 4. We can see the locks with the following query (session 68 is the session running the cleanup):
select * from sys.dm_tran_locks where request_session_id = 68 and request_mode = 'X'
In another session, we can execute the following script to add new rows for publisher_database_id = 3 (pub1):
use pub1
insert test1(b,c)
values (getdate(),replicate('x',200))
go
The log reader agent for publisher_database_id = 3 is blocked by the delete of the publisher_database_id = 4 rows, as shown in the sp_WhoIsActive output, because page 72702 also contains the newest rows for publisher_database_id = 3.
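If sp_WhoIsActive is not installed, a plain DMV query (a minimal sketch) shows the same blocking chain:

```sql
-- Blocked requests, their blockers, and the wait resource (the page, in our case)
select r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_resource,
       t.text as sql_text
from sys.dm_exec_requests as r
cross apply sys.dm_exec_sql_text(r.sql_handle) as t
where r.blocking_session_id <> 0
```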
Workaround
We can avoid the case where "delete TOP N" finds fewer than N rows, and so prevent it from reading millions of rows, by modifying the delete loop in the sp_MSdelete_publisherdb_trans stored procedure to iterate only while full batches of N rows remain.
Code snippet for the MSrepl_commands table:
declare @i int = 1, @max int
select @max = count(*)/2000 from MSrepl_commands with (INDEX(ucMSrepl_commands), nolock) where
publisher_database_id = @publisher_database_id and
xact_seqno <= @max_xact_seqno and
(type & ~@snapshot_bit) not in (@directory_type, @alt_directory_type) and
(type & ~@replpost_bit) <> @scriptexec_type
OPTION (MAXDOP 4)
WHILE @i <= @max
BEGIN
DELETE TOP(2000) MSrepl_commands WITH (PAGLOCK) from MSrepl_commands with (INDEX(ucMSrepl_commands)) where
publisher_database_id = @publisher_database_id and
xact_seqno <= @max_xact_seqno and
(type & ~@snapshot_bit) not in (@directory_type, @alt_directory_type) and
(type & ~@replpost_bit) <> @scriptexec_type
OPTION (MAXDOP 1)
select @row_count = @@rowcount
-- Update output parameter
select @num_commands = @num_commands + @row_count
set @i=@i+1
END
Code snippet for the MSrepl_transactions table:
declare @i int = 1, @max int
select @max = count(*)/5000 from MSrepl_transactions with (INDEX(ucMSrepl_transactions), nolock) where
publisher_database_id = @publisher_database_id and
xact_seqno <= @max_xact_seqno and
xact_seqno <> @last_xact_seqno and
xact_seqno <> @last_log_xact_seqno
OPTION (MAXDOP 4)
WHILE @i <= @max
BEGIN
exec dbo.sp_MSdelete_dodelete @publisher_database_id,
@max_xact_seqno,
@last_xact_seqno,
@last_log_xact_seqno,
@has_immediate_sync
select @row_count = @@rowcount
-- Update output parameter
select @num_transactions = @num_transactions + @row_count
set @i=@i+1
END
Note that Microsoft does not support modifying system stored procedures, so apply this workaround at your own risk and test it thoroughly.