Summary
In this blog, part of a series of “Recently a Customer Asked” posts for Tungsten University, we explore the reason for the occasional Tungsten Replicator OFFLINE:ERROR state when applying to AWS Redshift, along with possible steps to compensate for the issue.
The Question
How do we handle the occasional Tungsten Replicator OFFLINE:ERROR state when applying to AWS Redshift? We see “Errno 104 Connection reset by peer” messages.
The Reason
It appears that there was a transient error with s3cmd, which is used to upload files to AWS S3 as part of the process to load data into Redshift. For this specific customer, the issue only happened when deleting files in S3.
According to the INI file, this Replicator did have auto recovery enabled, which is a good thing. The setting was configured to wait 5 minutes before attempting recovery, which is probably why this customer noticed the outage:
auto-recovery-delay-interval=300s
The Solution
The interval can be made shorter so that the Replicator attempts to go back online more quickly after an error. Change the delay from 300s to 30s in tungsten.ini as follows, and then run tpm update:
auto-recovery-delay-interval=30s
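As a minimal sketch of that edit, the shell snippet below rewrites the delay line with sed and keeps a backup. It works on a scratch copy by default so it can be tried safely. The INI variable, the scratch file, and the suggested /etc/tungsten/tungsten.ini location are assumptions for illustration, so adjust them to match your installation.

```shell
#!/bin/sh
# Sketch only: shorten the auto-recovery delay in a tungsten.ini file.
# INI defaults to a scratch file; set INI to your real file to use it
# (often /etc/tungsten/tungsten.ini, which is an assumption here).
INI="${INI:-$(mktemp)}"

# Seed the scratch file with the original 300s setting if it is empty.
if [ ! -s "$INI" ]; then
    printf '[defaults]\nauto-recovery-delay-interval=300s\n' > "$INI"
fi

# Keep a backup, then rewrite only the delay-interval line.
cp "$INI" "$INI.bak"
sed 's/^auto-recovery-delay-interval=.*/auto-recovery-delay-interval=30s/' "$INI.bak" > "$INI"

# Show the resulting setting.
grep '^auto-recovery-delay-interval=' "$INI"

# On a real installation, apply the change afterwards with:
#   tpm update
```

The backup copy makes it easy to compare or roll back before running tpm update.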
Deep Dive
To enable Auto-Recovery in the replicator, there are three separate options to configure in the [defaults] section of your tungsten.ini file. Here they are with the default values:
auto-recovery-delay-interval=300s
auto-recovery-reset-interval=15s
auto-recovery-max-attempts=3
auto-recovery-delay-interval - The delay between the replicator identifying that autorecovery is needed, and autorecovery being attempted. For busy MySQL installations, larger numbers may be needed to allow time for MySQL servers to restart or recover from their failure.
auto-recovery-max-attempts - Specifies the number of attempts the replicator will make to go back online. When the number of attempts has been reached, the replicator will remain in the OFFLINE state.
Autorecovery is not enabled until auto-recovery-max-attempts is set to a non-zero value. The state of autorecovery can be determined using the autoRecoveryEnabled status parameter. The number of attempts made to autorecover can be tracked using the autoRecoveryTotal status parameter.
auto-recovery-reset-interval - The time in ONLINE state that indicates to the replicator that the autorecovery procedure has succeeded. For servers with very large transactions, this value should be increased to allow the transaction to be successfully applied.
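To check the two status parameters mentioned above, one option is to filter the output of trepctl status. The sketch below assumes the usual "name : value" layout of that output. The show_autorecovery helper is a hypothetical name, and the sample text piped into it is made up for illustration, not captured from a real replicator.

```shell
#!/bin/sh
# Sketch only: pick the auto-recovery fields out of replicator status text.
# show_autorecovery is a hypothetical helper; it reads status text on stdin.
show_autorecovery() {
    grep -E '^autoRecovery(Enabled|Total)[[:space:]]*:'
}

# Typical use on a replicator host (assumes trepctl is on PATH):
#   trepctl status | show_autorecovery

# Demonstration with made-up sample text in "name : value" form.
printf 'appliedLastSeqno : 10\nautoRecoveryEnabled : true\nautoRecoveryTotal : 1\nstate : ONLINE\n' \
    | show_autorecovery
```

If autoRecoveryTotal keeps climbing, the underlying error is likely recurring and worth investigating rather than relying on recovery alone.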
Wrap-Up
In this blog, part of a series of “Recently a Customer Asked” posts for Tungsten University, we explored the reason for the occasional Tungsten Replicator OFFLINE:ERROR state when applying to AWS Redshift, along with possible steps to compensate for the issue.
Smooth sailing!