Replication with Hazelcast™

Hazelcast™ is used by Diffusion™ to replicate sessions and topics in order to provide additional support for high availability (HA). It also allows failover of active update sources.

  • Session replication allows sessions to failover to a different server in the event that a server fails. When a client reconnects after a failover the client will maintain its original client ID. Session replication cannot be used for control clients.
  • Topic replication makes it possible for topics to be created, updated, and managed across an entire cluster from a single node.
  • Failover of active update sources is used to ensure that when a server that is the active update source for a section of the topic tree becomes unavailable, another server is assigned to be the active update source for that section of the topic tree.

Failover appears to the client as a disconnection and subsequent reconnection. To take advantage of high server availability, clients must implement a reconnect process.

Note: Session properties are not currently replicated through session replication.

Configuration

Replication must be enabled, which is done in the Replication.xml configuration file. When configuring topic replication the topic paths have to be specified. There is no limit on the topic tree depth during topic replication, all topics under the topicPath will be replicated.

<replication enabled="true">
<provider>HAZELCAST</provider>
<sessionReplication enabled="true" />
<topicReplication enabled="true">
<topics>
<topicPath>foo/bar</topicPath>
</topics>
</topicReplication>
</replication>

In the above example both session and topic replication have been enabled. In order to define which Hazelcast nodes can communicate with each other a hazelcast.xml configuration file needs to be created.

<hazelcast xsi:schemaLocation="http://www.hazelcast.com/schema/config
http://www.hazelcast.com/schema/config/hazelcast-config-3.0.xsd""
xmlns="http://www.hazelcast.com/schema/config"" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">;
<network>

<join>
<!-- <multicast enabled="true" /> -->
<tcp-ip enabled="true">
<members>
<member>node1.example.com</member>
<member>203.0.113.1</member>
<member>203.0.113.2:5757</member>
<member>203.0.113.3-7</member>
</members>
</tcp-ip>
</join>

</network>
</hazelcast>

Ensure that the hazelcast.xml file is on the Diffusion server classpath. For example by putting the file in the diffusion_installation/etc directory. Restart the Diffusion server to load the configuration. By default a Hazelcast node multicasts to all other Hazelcast nodes in the network. In the above example this multicast capability has been disabled.

Diagnosing problems with Hazelcast

By enabling logging for Hazelcast you can use the generated log files to diagnose problems. To enable logging, include the following line in your hazelcast.xml file:

<property name="hazelcast.logging.type">slf4j</property>

You can also enable logging by starting the Diffusion server that contains the node with the following parameter – hazelcast.logging.type=slf4j You can enable JMX for your Hazelcast nodes and use a JMX tool to examine the Mbeans. To enable JMX for a Hazelcast node, include the following line in your hazelcast.xml file:

<property name="hazelcast.jmx">true</property>

Considerations

Session replication

  • Replication of session information into the datagrid is not automatic. It must be configured at the server.
  • Messages in transit are not preserved. Use acks to ascertain whether or not messages have been received.
  • When a classic client session reconnects it must be authenticated again. Ensure that all Diffusion servers in your solution have access to the same authentication methods.
  • The failover appears to the client as a disconnection and subsequent reconnection. To take advantage of high server availability, clients must implement a reconnect process.
  • The Diffusion server that a client reconnection attempt is forwarded to depends on your load balancer configuration. Sticky load balancing can be turned on to take advantage of reconnection or turned off to rely on session replication and failover.

Topic replication

  • Only publishing topics can be replicated.
  • Replication is supported only for the following types of topic data: No topic data, Custom topic data, Protocol buffer topic data, Record topic data, Single value topic data
  • Replication is not supported for paged topics.
  • Only topic-wide messages are replicated. Messages sent to a single client or to all clients except one are not replicated.
  • Replication of topic information into the datagrid is not automatic. It must be configured at the server. This gives a performance advantage, as you can choose which parts of your topic tree to replicate.
  • Replication of topic data can impact performance.
  • Do not use topic replication on sections of the topic tree that are owned and updated by publishers. Publishers can make updates to topics that are not replicated or that supersede replicated data. If you use topic replication with topics updated by publishers, this can cause the data on the replicated topics to become unsynchronized.
  • Avoid including replicated topics in REMOVE_TOPICS session wills. When a replicated topic is removed from a server as a result of a session will, it is removed from all other servers that replicate that topic. If all sessions on a Diffusion™ server that have a remove topics removal request for a branch of the topic tree close, the topics are removed even if that topic is replicated and sessions on other Diffusion servers have removal requests registered against that part of the tree. When the topics are removed on the server, that change is replicated to all other servers that participate in replication for these topics.

Failover of active update sources

  • If the topic paths that the control client uses to register an update source do not match the topic paths configured in the Replication.xml configuration file of the server, unexpected behavior can occur.
  • Failover of active update sources is not automatic. It must be configured at the server.
  • The mechanism that provides failover of active update sources assumes that all servers have the same configuration and that all control clients implement the same behavior as part of a scalable and highly available deployment. If this is not the case, unexpected behavior can occur.
  • Do not use topic replication and failover of active update sources on sections of the topic tree that are owned and updated by publishers. Topic updates sent by publishers are not replicated.

Removal requests and topic replication

  • If all sessions on a Diffusion server that have a remove topics removal request for a branch of the topic tree close, the topics are removed even if that topic is replicated and sessions on other Diffusion servers have removal requests registered against that part of the tree. When the topics are removed on the server, that change is replicated to all other servers that participate in replication for these topics.

Common Issues

Error: Unable to register above or below existing sources

This can occur when a control client tries to register as a topic source that is not defined in Replication.xml e.g. Trying to register as a topic source for A/B/C when the Replication.xml contains <topicPath>A</topicPath> instead of <topicPath>A/B/C</topicPath>

Diffusion instance takes too long to go live after failover

<property name="hazelcast.max.no.heartbeat.seconds">value</property>

can be set within hazelcast.xml to shorten timeouts on heartbeats to other nodes. This ensures that a node is declared dead within the specified time.