Intro
Tungsten Replicator extracts from MySQL by reading the binary logs, and writing transaction data in our own format, which we call THL (Transaction History Log). You can read about THL here. THL is our portable format that also contains additional metadata, and allows us to apply events consistently to replicas, as well as non-MySQL targets and even to different MySQL versions.
Occasionally, we need to reposition the primary extractor, for either of the following reasons:
- The Primary replicator is in an ERROR or SUSPECT state. This is usually because of external issues, like running out of disk space.
-
A Replica has an external issue, causing it to go offline. Compounded by a loss of THL, the Replica requests THL from the Primary extractor but that THL no longer exists on the Primary, and thus we see a message such as:
pendingExceptionMessage: Client handshake failure: Client response validation failed: Master log does not contain requested transaction: master source ID=db1 client source ID=db2 requested seqno=4 client epoch number=0 master min seqno=8 master max seqno=8
These are two different scenarios, but the resolution is the same: We need to extract events from the MySQL binary log starting from the position that the replicas are requesting.
Determining the Correct Position
In both scenarios, we need to determine the correct position for extraction. The goal is to start extraction from the correct binary log position so that the replicas can continue to apply transactions without missing any events. If we position the replicator too far forward, transactions will be lost. If we position the replicator too far back, we will spend time extracting events that are not needed and will waste resources processing these events.
In both scenarios, we query the Replicas to see the last sequence number applied, and then take the lowest sequence number as the starting point of extraction from the Primary. As an example, we can use trepctl qs
to quickly see this info on both Replicas:
[tungsten@db2 ~]$ trepctl qs
State: alpha Faulty (Offline) for 164.747s
Error Reason: SEQNO -1 did not apply
Error: Client handshake failure: Client response validation failed: Master log does not contain requested transaction: master source ID=db1 client source ID=db2 requested seqno=96740 client epoch number=96739 master min seqno=96745 master max seqno=96745
[tungsten@db3 ~]$ trepctl qs
State: alpha Faulty (Offline) for 148.624s
Error Reason: SEQNO -1 did not apply
Error: Client handshake failure: Client response validation failed: Master log does not contain requested transaction: master source ID=db1 client source ID=db3 requested seqno=96739 client epoch number=96739 master min seqno=96745 master max seqno=96745
From this we see db2 needs to start from sequence 96740, and db3 needs to start from 96739. We need the lesser sequence number, so we will position the primary replicator from 96739. We will do this with the dsctl
command. dsctl
requires the MySQL binary log position, epoch number, and sequence number. Fortunately, we can get all of that information from the Replica, in this case, db3:
[tungsten@db3 ~]$ dsctl get -ascmd
dsctl set -seqno 96739 -epoch 96739 -event-id "mysql-bin.000033:0000000000000983;87" -source-id "db1"
We use the argument get
to query the Tungsten Replicator tracking schema, and the option -ascmd
to print the results as a command that we can use directly. Notice the sequence number is 96739, which matches the output of trepctl qs
.
Now we can set the position of the Primary replicator, making sure to take it offline first. Note: If this is a Tungsten Cluster, be sure to do cctrl> set policy maintenance
first:
[tungsten@db1 ~]$ trepctl offline
# paste the dsctl command we got from db3:
[tungsten@db1 ~]$ dsctl set -seqno 96739 -epoch 96739 -event-id "mysql-bin.000033:0000000000000983;87" -source-id "db1" -reset
Service "alpha" datasource "global" catalog information cleared
Service "alpha" datasource "global" position was set to: seqno=96739 epoch_number=96739 eventid=mysql-bin.000033:0000000000000983;87 source_id=db1
# we added the -reset option (above) to dsctl to clear the old tracking schema
[tungsten@db1 ~]$ thl purge -y
# remove old THL
[tungsten@db1 ~]$ trepctl online
[tungsten@db1 ~]$ thl info
log directory = /opt/continuent/thl/alpha/
log files = 1
logs size = 0.01 MB
min seq# = 96740
max seq# = 96746
Now the THL should be available to our replicas, so let’s bring them online, using the -force
argument as needed:
[tungsten@db2 ~]$ trepctl online
[tungsten@db2 ~]$ trepctl qs
State: alpha Faulty (Offline) for 7.374s
Error Reason: SEQNO -1 did not apply
Error: Client handshake failure: Client response validation failed: Log epoch numbers do not match: master source ID=db1 client source ID=db2 seqno=96740 server epoch number=96740 client epoch number=96739
# db2 has different epoch number, so let's override this check
[tungsten@db2 ~]$ trepctl online -force
[tungsten@db2 ~]$ trepctl qs
State: alpha Online for 5.197s, running for 1664469.821s
Latency: 186.828s from source DB commit time on thls://db1:2112/ into target database
191.907s since last source commit
Sequence: 96746 last applied, 0 transactions behind (71381-96746 stored) estimate 0.00s before synchronization
[tungsten@db3 ~]$ trepctl online
[tungsten@db3 ~]$ trepctl qs
State: alpha Online for 4.164s, running for 6010.756s
Latency: 295.826s from source DB commit time on thls://db1:2112/ into target database
299.785s since last source commit
Sequence: 96746 last applied, 0 transactions behind (40593-96746 stored) estimate 0.00s before synchronization
Both db2 and db3 are back online. If this were a Tungsten cluster, we can recover or welcome the nodes as needed through cctrl
.
When Parallel Apply is in Use
When using parallel apply, the steps for determining the starting sequence numbers are slightly different. As a reminder, parallel apply creates n apply threads that run in parallel, with each having their own sequence number. The replicator tools typically show the sequence number from just 1 thread, but it may not be the thread that is most behind. We need to find the lowest sequence of all threads and use that. Again, we use the dsctl
command:
[tungsten@db2 ~]$ dsctl get -ascmd
dsctl set -seqno 96746 -epoch 96740 -event-id "mysql-bin.000033:0000000000003619;110759" -source-id "db1"
dsctl set -seqno 96736 -epoch 96740 -event-id "mysql-bin.000033:0000000000000619;110759" -source-id "db1"
dsctl set -seqno 96716 -epoch 96740 -event-id "mysql-bin.000032:0000000000000619;110759" -source-id "db1"
dsctl set -seqno 96726 -epoch 96740 -event-id "mysql-bin.000033:0000000000000219;110759" -source-id "db1"
dsctl set -seqno 96732 -epoch 96740 -event-id "mysql-bin.000033:0000000000000609;110759" -source-id "db1"
We see 96716 is the lowest sequence number on this host, so that is what we will use.
(We can make this easier to read by using sort
):
[tungsten@db2 ~]$ dsctl get -ascmd |sort
dsctl set -seqno 96716 -epoch 96740 -event-id "mysql-bin.000032:0000000000000619;110759" -source-id "db1"
dsctl set -seqno 96726 -epoch 96740 -event-id "mysql-bin.000033:0000000000000219;110759" -source-id "db1"
dsctl set -seqno 96732 -epoch 96740 -event-id "mysql-bin.000033:0000000000000609;110759" -source-id "db1"
dsctl set -seqno 96736 -epoch 96740 -event-id "mysql-bin.000033:0000000000000619;110759" -source-id "db1"
dsctl set -seqno 96746 -epoch 96740 -event-id "mysql-bin.000033:0000000000003619;110759" -source-id "db1"
Once we determine the lowest sequence on each replica that is offline, the procedure is exactly the same as the above: take the dsctl
command and run it on the primary.
Conclusion
Resetting the primary replicator is not difficult, and the important step is to determine the starting point to make THL available to the downstream Replicas. One item to note: this article assumes that the binary logs are available to extract from. If the binary logs have been purged or are otherwise missing, we cannot extract, and you will have to reprovision the Replica(s) in this case. But we have a tool for that — tprovision
! Read this blog to learn even more about tprovision
.
Comments
Add new comment