Pages

Friday, August 24, 2018

Adding the Distributor to the AG Publisher May Remove Replication

If we have an Availability Group (AG) publisher and we plan to rebuild the secondary without impacting transactional replication, then we may see unexpected results if the secondary was the original publisher. Because adding the distributor will delete the metadata in the remote distributor, and replication will need to be rebuild.

Scenario 

Replication was created when HostA\InstanceA was the primary so it is the original publisher, as shown below.

Primary: HostA\InstanceA (Original Publisher)
Secondary: HostB\InstanceB
Distributor: HostR\InstanceR

After fail over HostA\InstanceA is the secondary, as shown below.

Secondary: HostA\InstanceA (Original Publisher)
Primary: HostB\InstanceB
Distributor: HostR\InstanceR

Then HostA\InstanceA is rebuild on new hardware without production impact, as shown below.

Secondary: HostA\InstanceA (Original Publisher Rebuild)
Primary: HostB\InstanceB
Distributor: HostR\InstanceR

But now we need to run sp_addistributor 'HostR\InstanceR' in the newly built secondary as outlined here

When we run sp_addistributor in the secondary, it executes the following in the distributor, passing the name of the secondary that was the original publisher

repl_distributor.distribution_db.sys.sp_MSdistpublisher_cleanup @@servername

So it will delete all the metadata and replication will stop working, then we will need to run sp_removereplication in the primary and rebuild replication

Workaround 

Since sp_adddistributor passes @@servername, we can rename the secondary to HostANew\InstanceA using sp_dropserver/sp_addserver as outlined here, then sp_MSdistpublisher_cleanup @@servername won't find any matching entries of the original publisher in the distributor and replication will continue to work as expected.

Wednesday, August 15, 2018

The Nulls in the Infamous Error "The Row Was Not Found ..."

While I was troubleshooting an issue, I wondered why there were nulls on the error:

"The row was not found at the Subscriber when applying the replicated (null) command for Table '(null)' with Primary Key(s): (null) (Source: MSSQLServer, Error number: 20598)"

After a little research I found out that since SQL 2012 SP4, sp_MSreplraiserror and sp_MSdel_dbo% have 3 additional parameters used in the raise error 20598 command, and they will show null depending on the SQL version of the subscriber or if the subscriber was upgraded in place

sp_MSreplraiserror

It now expects 3 parameters in the raise error 20598 command, as shown below. It gets called in the subscriber by the delete stored procedure for the respective table, e.g. sp_MSdel_dbotb1. Thus, if the publisher is SQL 2016 SP2 and the subscriber is SQL 2012 SP3, then the error message will show nulls

Before

create procedure sys.sp_MSreplraiserror @errorid int, @param1 sysname = null, @param2 sysname= null
as
    if @errorid = 20508 raiserror (20508, 11, 1)
    ...
    else if @errorid = 20598 raiserror (20598, 16, 1)

After

create procedure sys.sp_MSreplraiserror @errorid int, @param1 sysname = null, @param2 sysname= null, @param3 int = null
as
    if @errorid = 20508 raiserror (20508, 11, 1)
    ...
    else if @errorid = 20598 raiserror (20598, 16, 1, @param3, @param1, @param2)

Workaround

Since it is a system stored procedure there is no other option than upgrading the subscriber

sp_MSdel_dbo%

It now passes values to the 3 parameters in sp_MSreplraiserror, as shown below. Thus, if the subscriber was upgraded in place e.g from SQL 2012 SP3 to SQL 2016 SP2, then the error message with show nulls

Before

create procedure [dbo].[sp_MSdel_dbotb1] (@pkc1 int)
as
begin
delete [dbo].[tb1]
where [sno] = @pkc1
if @@rowcount = 0
    if @@microsoftversion>0x07320000
        exec sp_MSreplraiserror @errorid=20598
end

After

create procedure [dbo].[sp_MSdel_dbotb1] (@pkc1 int)
as
begin
declare @primarykey_text nvarchar(100) = ''
delete [dbo].[tb1]
where [sno] = @pkc1
if @@rowcount = 0
    if @@microsoftversion>0x07320000
    Begin
        if exists (Select * from sys.all_parameters where object_id = OBJECT_ID('sp_MSreplraiserror') and [name] = '@param3')
        Begin
        set @primarykey_text = @primarykey_text + '[sno] = ' + convert(nvarchar(100),@pkc1,1)
        exec sp_MSreplraiserror @errorid=20598, @param1=N'[dbo].[tb1]', @param2=@primarykey_text, @param3=13234
        End
        Else
        exec sp_MSreplraiserror @errorid=20598
    End
end

Workaround

Redeploy the replication stored procedures using sp_scriptpublicationcustomprocs

If the 20598 error message shows nulls, we can still use the transaction sequence number as outlined here to get more info. But it is nice to have the values in the error message to easily find the culprit, example:

"The row was not found at the Subscriber when applying the replicated DELETE command for Table '[dbo].[tb1]' with Primary Key(s): [sno] = 3 (Source: MSSQLServer, Error number: 20598)"