Replication Topologies

Key Value Points

  • Real-time Data Loading - Transactions are replicated from MySQL to Hadoop in real time, producing carbon-copy tables and enabling analytic views to be executed on the Hadoop side within multiple environments, including Hive.
  • High-performance, Low-impact - Flexible topologies enable high-speed, low-impact transfer from multiple upstream systems into Hadoop, with no application changes.
  • Integration with Hive - Support for Hive's preferred data formats and for schema generation.
  • Integration with Sqoop - Sqoop can be used to provision the initial data, with Tungsten Replicator then replicating transactions incrementally in real time (see the sketch after this list).
  • Open Source - A low-cost alternative to existing commercial solutions. Tungsten Replicator, released under a GPL V2 license, is an established open source solution, backed by expert 24x7 support and consulting from Continuent.
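
As a sketch of the Sqoop integration, an existing table can be bulk-provisioned into HDFS once, after which Tungsten Replicator picks up the incremental changes from the binary log. The hostnames, credentials, and paths below are illustrative only:

    # One-time bulk provisioning of an existing MySQL table into HDFS.
    sqoop import \
      --connect jdbc:mysql://db1.example.com/sales \
      --username tungsten \
      --password-file /user/hadoop/.mysql-pass \
      --table orders \
      --target-dir /user/tungsten/staging/sales/orders \
      --fields-terminated-by ','

    # Once the bulk load completes, incremental changes flow in from the
    # MySQL binary log via Tungsten Replicator; no further full exports
    # are needed.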

Overview

Tungsten Replicator uses the MySQL binary log to extract and identify data changes within the core MySQL database. It places this change information into the Transaction History Log (THL), together with a single, identifiable sequence number (transaction ID) that allows the state of replication to be determined easily. The sequence number also provides a convenient start/stop reference.
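
The sequence numbers can be inspected directly. As a brief illustration (the exact output format varies between Tungsten Replicator releases), the thl utility lists events by sequence number, and trepctl reports the last sequence number applied by the local service:

    # List a range of THL events by sequence number.
    thl list -low 100 -high 102

    # Report the last sequence number applied by the local service; this
    # is the start/stop reference described above.
    trepctl status | grep appliedLastSeqno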

The core of Tungsten Replicator relies on two key services:

  • the extractor, which extracts data from a supported database (MySQL, Oracle), and
  • the applier, which writes data into a supported database (e.g. MySQL, Oracle, MongoDB).

The data transfer between the two systems is based on the THL format and the flexible information and structure it supports. The two sides of the system work together to provide replication between different databases, even though the underlying database technologies may differ and have different functionality and data transaction expectations.
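
As an illustration of how the two services are wired together, the sketch below configures an extractor on the MySQL host and a batch applier on the Hadoop side using the tpm tool. The flag names follow Tungsten Replicator 3.x conventions and should be treated as assumptions; consult the documentation for the exact options in your release:

    # Extractor service on the MySQL host (host1), reading the binary
    # log. Run from the staging directory on the relevant host.
    ./tools/tpm configure alpha \
      --install-directory=/opt/continuent \
      --master=host1 --members=host1 \
      --replication-user=tungsten --replication-password=secret \
      --enable-heterogeneous-service=true

    # Applier service on the Hadoop-side host (host2), writing batched
    # CSV change data into HDFS for later merging.
    ./tools/tpm configure alpha \
      --install-directory=/opt/continuent \
      --master=host1 --members=host2 \
      --batch-enabled=true \
      --batch-load-template=hadoop

    ./tools/tpm install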

The Hadoop applier operates within this model, taking the raw THL data, applying it into Hadoop, and then processing and merging the change information within Hadoop. Because the information is taken from the live stream of changes in the binary log, the replication of data is in real time, with changes generated within MySQL transferred instantly through the binary log and THL to the target database server. Transactions written to the binary log are transaction-safe, and that safety is preserved as changes are applied in sequence to the target database, without risk of corruption or inconsistency.
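
On the Hadoop side, the staged change data is typically written as CSV files in HDFS, and a Hive external table can expose it for querying and merging. The HDFS path, the column list, and the tungsten_* metadata columns below illustrate the general pattern rather than a fixed schema:

    # Expose the replicated staging files to Hive as an external table.
    hive -e "
    CREATE EXTERNAL TABLE staging_orders (
      tungsten_opcode STRING,
      tungsten_seqno  INT,
      tungsten_row_id INT,
      id              INT,
      amount          DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/user/tungsten/staging/alpha/sales/orders';"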

Tungsten Replicator Topologies

[Figure: hadoop-topo1]

Tungsten Replicator supports a number of different topologies, and these can be exploited when combined with the Hadoop applier to enable a range of different data replication structures.

A standard replication service could replicate data from multiple MySQL servers into a single Hadoop cluster, using Hadoop to provide the consolidation of information. This can be useful when you have multiple MySQL servers providing a shared data architecture, but want to perform analytics on the full body of data across all servers.
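
In this consolidation model, each upstream MySQL server typically feeds its own replication service on the Hadoop-side replicator, and each service can be monitored independently. The service names below (alpha, beta) are illustrative:

    # List the replication services running on the Hadoop-side host.
    trepctl services

    # Check replication progress for each upstream service individually.
    trepctl -service alpha status | grep appliedLastSeqno
    trepctl -service beta status | grep appliedLastSeqno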

[Figure: hadoop-topo2]
Alternatively, multiple MySQL servers can be combined into a single MySQL server using the fan-in support provided within Tungsten Replicator. With this method, multiple servers merge their data into a single server, and filters can be used to select, rename, or ignore individual tables. A major benefit of this model is that it allows the DDL structure to be adjusted and the data re-filtered before it is replicated into Hadoop, producing a single stream of changes into the Hadoop cluster. This consolidates the data and its structure, which improves the distribution of data within larger Hadoop clusters and helps speed up data processing and the materialisation of the data tables.
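
A minimal sketch of such a fan-in configuration is shown below, combining two masters into a single target and ignoring one table before the data reaches Hadoop. The topology, filter, and property names follow Tungsten Replicator 3.x conventions and are assumptions to be checked against the documentation for your release:

    # Fan-in service combining two MySQL masters into a single target.
    ./tools/tpm configure gamma \
      --topology=fan-in \
      --master=host1,host2 \
      --members=host1,host2,host3 \
      --master-services=alpha,beta \
      --install-directory=/opt/continuent \
      --replication-user=tungsten --replication-password=secret \
      --svc-applier-filters=replicate \
      --property=replicator.filter.replicate.ignore=sales.audit_log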

[Figure: hadoop-topo3]
Different replicators and heterogeneous targets can also be combined, reading from the same Tungsten Replicator master and applying to multiple different heterogeneous targets. For example, Tungsten Replicator could be configured to read information from the master and replicate some of it into a MongoDB server, giving the application servers document-based access to the content. Simultaneously, the same data could be written into a Hadoop cluster for analytics across the entire database or the change stream, as sketched below.
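
As a sketch, two applier services could be configured against the same master service, one targeting MongoDB and one targeting Hadoop; the host names and options are illustrative:

    # MongoDB applier on host3, for document-based access by the
    # application servers.
    ./tools/tpm configure alpha \
      --install-directory=/opt/continuent \
      --master=host1 --members=host3 \
      --datasource-type=mongodb

    # The Hadoop batch applier from the earlier sketch runs alongside it
    # on host2, reading the same master's THL stream.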

For more information, please download this white paper.