Saturday, October 31, 2015

Analyzing Replication Performance Statistics

As mentioned in this KB article, useful data to troubleshoot replication performance is stored in the MSlogreader_history and MSdistribution_history tables in the distribution database.

I was troubleshooting latency from the distribution to the subscriber database. So here is the query to get the info from MSdistribution_history for a particular distribution agent id. Similar query can be done for the log reader.
declare @agent_id int = 662;

with a as(
,cast(h.comments as xml) comments
from MSdistribution_history h with (nolock)     
h.[comments] like '<stats state="1"%'  
and h.agent_id = @agent_id
,c.value('(/stats/@state)[1]', 'int') AS [state]      
,c.value('(/stats/@work)[1]', 'int') AS [work]      
,c.value('(/stats/@idle)[1]', 'int') AS [idle]            
,c.value('(/stats/reader/@fetch)[1]', 'int') AS [reader_fetch]      
,c.value('(/stats/reader/@wait)[1]', 'int')  AS [reader_wait]            
,c.value('(/stats/writer/@write)[1]', 'int') AS [writer_write]      
,c.value('(/stats/writer/@wait)[1]', 'int')  AS [writer_wait]            
,c.value('(/stats/sincelaststats/@elapsedtime)[1]', 'int') AS [sincelaststats_elapsedtime]      
,c.value('(/stats/sincelaststats/@work)[1]', 'int') AS [sincelaststats_work]      
,c.value('(/stats/sincelaststats/@cmds)[1]', 'int') AS [sincelaststats_cmds]      
,c.value('(/stats/sincelaststats/@cmdspersec)[1]', 'float') AS [sincelaststats_cmdspersec]       
,c.value('(/stats/sincelaststats/reader/@fetch)[1]', 'int') AS [sincelaststats_reader_fetch]      
,c.value('(/stats/sincelaststats/reader/@wait)[1]', 'int')  AS [sincelaststats_reader_wait]            
,c.value('(/stats/sincelaststats/writer/@write)[1]', 'int') AS [sincelaststats_writer_write]      
,c.value('(/stats/sincelaststats/writer/@wait)[1]', 'int')  AS [sincelaststats_writer_wait]    
from a
cross apply a.comments.nodes('/stats[@state="1"]') b(c)
order by time;

Replication uses a memory buffer structure to queue up the rows between the reader thread and the writer thread. Thus a high “since last stats reader wait” means it is waiting from the writer thread to consume the rows from the buffer and deliver them to the subscriber, while a high “since last stats writer wait” means it is waiting for the reader thread to produce rows from the distribution to the buffer.

Next, you can use your favorite tool (I used SSRS) to create a graph so we can visualize the information better. In the graph below, we can see there is a correlation between the number of commands replicated and the replication latency. So we can either decrease the number of commands replicated or increase the performance of the subscriber. I this particular case we focused on the subscriber side and found that IO was the bottleneck.