Handling Update Errors During a Cluster Repartition

When using a cluster of Diffusion servers with topic replication enabled, each server contains multiple subsets of the topic tree data, which are called partitions. (This use of the term “partition” is not to be confused with a network partition, where servers in a cluster stop being able to communicate with each other due to a network fault). When the number of servers in a cluster changes, Diffusion changes which partitions are stored on each server to optimize performance & ensure high availability. This process is called repartition. Repartition occurs when:

  • You add servers to an existing cluster
  • You remove servers from an existing cluster
  • Servers start for the first time and the cluster is built
  • All servers in a cluster restart
  • A cluster heals from a network partition

During repartition, it is not possible to forward updates between servers. If a value update is attempted during repartition, a ClusterRepartitionException is thrown and the value update stream is invalidated.

Diffusion does not automatically retry the failed value update, because the cluster has no way to determine whether or not the value from a failed update is still valid, or is now outdated. Instead, your client code must handle this situation by creating a new value stream and reapplying topic updates.

The best strategy to use when retrying updates will vary depending on your application. 

For example, suppose that during a repartition, your client code sent three failed updates to the same topic with different values, and only the value in the final update is now valid. Your client should not retry the first two updates. It should only retry the update with the current valid value.