Data Integration - Synchronization

1 - About

duplicate of Concurrency - Synchronization of Data Processing - Replication ?

Synchronization ensures that all instances of a repository (database, file system, …) contain the same data. It's a complex subject and not a trivial task when the data is volatile.

  • Replication is the process of copying data
  • Data synchronization is the process of establishing consistency among data from a source to a target data storage and vice versa and the continuous harmonization of the data over time.

3 - Inconsistencies

Replicating data can introduce inconsistencies.

When you modify data, the same modification must be made to all other copies of that data, and this process may take some time. Fully transactional systems implement procedures that lock all copies of a data item before changing them and release the lock only when the update has been successfully applied across all instances. However, in a globally distributed system such an approach is impractical due to the inherent latency of the network (i.e. the Internet), so most systems that implement replication update each site individually. After an update, different sites may see different data, but the system becomes “eventually consistent” as the synchronization process ripples the data updates out across all sites.
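
The behaviour described above can be illustrated with a minimal sketch, assuming two in-memory sites and a queue that stands in for the network. The `Site` and `Replicator` classes and the site names are hypothetical and only serve the illustration; a real system would also have to deal with failures and ordering.

```python
from collections import deque
from typing import Any, Deque, Dict, List, Tuple


class Site:
    """One copy of the data (hypothetical stand-in for a remote site)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, Any] = {}


class Replicator:
    """Applies a write to one site immediately and queues it for the others."""

    def __init__(self, sites: List[Site]) -> None:
        self.sites = sites
        self.pending: Deque[Tuple[Site, str, Any]] = deque()

    def write(self, origin: Site, key: str, value: Any) -> None:
        origin.data[key] = value  # visible at the origin right away
        for site in self.sites:
            if site is not origin:
                self.pending.append((site, key, value))

    def propagate(self) -> None:
        # Ripple the queued updates out to the other sites, oldest first
        while self.pending:
            site, key, value = self.pending.popleft()
            site.data[key] = value


paris, tokyo = Site("paris"), Site("tokyo")
replicator = Replicator([paris, tokyo])
replicator.write(paris, "price", 42)
before = tokyo.data.get("price")  # tokyo has not received the update yet
replicator.propagate()
after = tokyo.data.get("price")   # both sites now hold the same value
```

Between `write` and `propagate`, the two sites disagree; once the queue has drained they hold the same data again.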

4 - Properties

Direction:

  • one-way synchronization
  • bidirectional

Timing:

  • asynchronous
  • synchronous

Execution:

  • serial
  • parallel

5 - Implementation

Two main issues:

  • Which replication topology should you use?
  • Which synchronization strategy should you implement?

5.1 - Basic Copy

Copies all data (from a data source) to all other instances.

For instance, a client has lost synchronization, either through a backup/restore or because of a bug. In this case, the client needs to get the current state from the server without going through the deltas. This is a full copy from master to detail, deltas and performance be damned. It's a one-time thing; the client is broken, so don't try to optimize this, just implement a reliable copy.

5.2 - Batch of updates

Synchronizing data can be expensive in terms of network bandwidth requirements, and it may be necessary to implement the synchronization process as a periodic task that performs a batch of updates.

Synchronize changes. A change log (or delta history) works well for this. Clients send their deltas to the server (via a subscribe or push mechanism); the server consolidates and distributes the deltas to the clients. This is the typical case. Databases call this “transaction replication”.

You should follow the database (and SVN) design pattern of sequentially numbering every change. That way a client can make a trivial request (“What revision should I have?”) before attempting to synchronize. And even then, the query (“All deltas since 2149”) is delightfully simple for the client and server to process.
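
A minimal sketch of such a sequentially numbered change log, assuming an in-memory key/value store. The `ChangeLog` class, its methods and the `synchronize` function are hypothetical names chosen for illustration, not an existing API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

Delta = Tuple[int, str, Any]  # (revision, key, new value)


@dataclass
class ChangeLog:
    """Server-side log in which every change gets the next revision number."""

    deltas: List[Delta] = field(default_factory=list)

    def append(self, key: str, value: Any) -> int:
        revision = len(self.deltas) + 1  # revisions start at 1
        self.deltas.append((revision, key, value))
        return revision

    def current_revision(self) -> int:
        # Answers "What revision should I have?"
        return len(self.deltas)

    def deltas_since(self, revision: int) -> List[Delta]:
        # Answers "All deltas since <revision>"; revision N sits at index N - 1
        return self.deltas[revision:]


def synchronize(state: Dict[str, Any], client_revision: int, log: ChangeLog) -> int:
    """Apply the missing deltas in order and return the client's new revision."""
    if client_revision == log.current_revision():
        return client_revision  # already up to date, nothing to transfer
    for revision, key, value in log.deltas_since(client_revision):
        state[key] = value
        client_revision = revision
    return client_revision


log = ChangeLog()
log.append("a", 1)
log.append("b", 2)
client_state: Dict[str, Any] = {}
client_rev = synchronize(client_state, 0, log)
log.append("a", 3)
client_rev = synchronize(client_state, client_rev, log)
```

Because revisions are dense and ordered, the client only has to remember one integer between two synchronizations.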

5.2.1 - Conflict

  • How should synchronization conflicts be handled when the synchronization is bidirectional?
  • If the data is partitioned (the data for an entity lives only in one place) or if it's a one-way sync direction, there is no possibility of conflicts.

5.2.2 - Data

  • Stale data: can your applications and services live with potentially stale data?
  • Read-only data ?
  • Synchronization volume
  • Transactional integrity needed? (If so, then replication might not be the most appropriate solution)
  • Data security: access authorization

5.2.3 - Change Capture

  • Trigger
  • Database Log
  • Timestamp
  • Offload to a structured file (such as CSV)
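
As an illustration of the timestamp option, here is a minimal sketch that assumes every row carries an `updated_at` column maintained by the application. The `customer` table, its columns and the `capture_changes` function are hypothetical; SQLite is used only because it runs in memory.

```python
import sqlite3
from typing import List, Tuple


def capture_changes(conn: sqlite3.Connection, last_sync: int) -> List[Tuple]:
    """Return the rows modified after last_sync, oldest first."""
    cursor = conn.execute(
        "SELECT id, name, updated_at FROM customer "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_sync,),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "alice", 100), (2, "bob", 200), (3, "carol", 300)],
)

last_sync = 150
changes = capture_changes(conn, last_sync)
# The high-water mark for the next run is the largest timestamp seen so far
next_sync = max(row[2] for row in changes) if changes else last_sync
```

Note that a timestamp column alone does not reveal deleted rows; those need another mechanism, such as a trigger or the database log.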

5.3 - Comparison

The client is suspect. In this case, you need to compare the client against the server to determine whether the client is up to date or needs any deltas.
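
One possible way to run such a comparison without shipping the full data set is to exchange per-record digests. The sketch below assumes both sides hold JSON-serializable records keyed by an identifier; the `digest`, `fingerprint` and `keys_to_repair` functions are hypothetical names, and hashing is an assumption of this example rather than something the text above prescribes.

```python
import hashlib
import json
from typing import Any, Dict, Set


def digest(record: Any) -> str:
    """Stable SHA-256 digest of a JSON-serializable record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def fingerprint(store: Dict[str, Any]) -> Dict[str, str]:
    """Map every key of a store to the digest of its record."""
    return {key: digest(value) for key, value in store.items()}


def keys_to_repair(server_fp: Dict[str, str], client_fp: Dict[str, str]) -> Set[str]:
    """Keys that are missing, stale, or extra on the client."""
    all_keys = server_fp.keys() | client_fp.keys()
    return {key for key in all_keys if server_fp.get(key) != client_fp.get(key)}


server = {"a": {"n": 1}, "b": {"n": 2}, "c": {"n": 3}}
client = {"a": {"n": 1}, "b": {"n": 99}, "d": {"n": 4}}
stale = keys_to_repair(fingerprint(server), fingerprint(client))
```

An empty result means the client is up to date; otherwise only the listed keys need to be fetched or removed.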

6 - Documentation / Reference

data/processing/synchronization.txt · Last modified: 2018/10/21 21:38 by gerardnico