Check Out Part 1
In a previous blog article, we saw how to use rsync
to provision a database within a Tungsten Cluster. As a reminder, rsync
is useful for syncing files or directories that differ from source to target. scp
can also be used for this purpose, however scp
will send all files, whether they differ or not. For databases, if the target already has an older version of a database (or even a corrupt database), in theory we could save a lot of time reprovisioning the database if we could send only the changes, rather than the entire database files.
Adding the Options
rsync
normally copies whole files, based on if the file timestamps differ from source to target. For large databases, however, this can be quite inefficient. Consider a large, active database: during the course of a day, the database is updated, however at the file level, only a fraction of the blocks on the filesystem have changed. Therefore, we’d like to just transfer the changes, instead of the whole file, since the files are quite large.
In Part 1, we have the following rsync
command:
db2 shell$ sudo rsync -avz --progress --delete db1:/volumes/data/mysql/ /volumes/data/mysql/
As a review, this command compares the timestamps and sizes between the files in /volumes/data/mysql
between db1 and db2 and will copy files that have timestamps newer on db1. When doing the actual copy, rsync
copies the file to a temporary hidden file in the target directory. When the copy is complete, the old file is removed and replaced with the hidden file.
If the target already contains a database, we can simply transfer the changes from the source. This works even if the target database is corrupt. We are simply going to retain the database files that are already there and update the file blocks where necessary.
Let’s add some additional options to make this more database friendly:
-c |
When comparing the actual files, compare based on file checksum rather than file mod date and file size. Although this is a slower comparison, database file mod times and sizes are not always updated even when actively being written to. |
--inplace |
When copying, do not create a temporary file, copy the source file directly to the target file. |
--no-whole-file |
Do not copy the whole file, copy only the changed blocks in the file. |
With the above options, provided we already have a database on the target, even corrupt, that needs to be reprovisioned, we transfer only the changed blocks from source to target. Consider a large, active database - the delta between the database files between source and target is a small percentage.
Our new command line now looks like this:
db2 shell$ sudo rsync -ravc --inplace --no-whole-file --progress --delete db1:/volumes/data/mysql/ /volumes/data/mysql/
Seeding a Database
Suppose we have db1 and db2, where db2 is a replica of db1. During operations, db2 somehow gets corrupted and is no longer viable. We may be tempted to take a maintenance window, remove db2, and then copy db1 from db2. However, using the method above, we can still make use of the files on db2 and simply copy the changed blocks from db1.
We still have to take a maintenance window even by using the method above, although it should be much smaller than copying the entire database. We can further reduce this window by “seeding” the database on db2: Throughout the day, we can run the rsync
command even while db1 is running. This of course will not create a working database on db2, but it will send incremental changes from db1 to db2. We can do this a number of times throughout the day, and when we finally take our maintenance window, there will be even fewer changes to send from db1 to db2 because we have already seeded db2 earlier.
Multiple Threads
Comparing checksums on large files can be time-consuming. In this case, can we create multiple rsync
threads to make the provision process more robust. Here’s one way to accomplish this from db1:
db1 shell$ sudo "cd /volumes/data/mysql/; find . -type f |xargs -I {} -P 4 -n 1 rsync -ravc --inplace --no-whole-file --progress --delete {} db2:/volumes/data/mysql/"
In the above command, we do a find
on the MySQL data directory to enumerate all of the files that make up the database. We pipe that into xargs
to rsync
each file, using a number of concurrent threads. In the example, we use 4 threads (as defined by the -P
parameter).
Conclusion
Provisioning a database can be extremely time-consuming, and typically requires a long maintenance window. Using the techniques above, we can leverage the existing files of a failed or corrupted database to make the provisioning much quicker. In the latest version of Tungsten Cluster, the utility tprovision
utilizes these techniques when using rsync
as the provision method.
Comments
Add new comment