There is a single Configuration node for the whole network, so it should be replicated (for instance with standard Redis replication): it is the only single point of failure of the system.
When the Configuration node fails the cluster does not stop operating, but it is not
able to recover if there is some exceptional condition to handle, like a Data
node going offline or the addition of a new Data node to the cluster.
The Configuration node is a standard Redis server, like every other Data node.
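The exact layout of the configuration data is not specified here. Purely as an illustration, assuming the redis-py client and made-up key names and addresses, the slot table and the list of known Data nodes could be stored in the Configuration node like this:

    # Hypothetical layout of the cluster configuration inside the
    # Configuration node (key names and addresses are illustrative).
    import redis

    conf = redis.Redis(host='192.168.1.100', port=6379)  # Configuration node

    # One hash field per hash slot, listing the Data nodes serving it.
    conf.hset('cluster:slots', 10, '192.168.1.19:6379,192.168.1.200:6379')

    # The set of Data nodes currently part of the cluster.
    conf.sadd('cluster:datanodes', '192.168.1.19:6379', '192.168.1.200:6379')

    print(conf.hget('cluster:slots', 10))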
Data Nodes
==========
Data nodes just hold data, and are normal Redis processes. No configuration is stored on Data nodes, nor are the nodes "active" in the cluster: they just receive normal Redis commands.
Proxy Nodes
===========
Proxy nodes receive requests from clients and route these requests to the right Data nodes.
Proxy nodes keep persistent connections to all the Data nodes and to the
Configuration node. These connections are kept alive with periodic PING requests
when there is no traffic, so that a Proxy node can detect as soon as possible
that a Data node or the Configuration node is having problems.
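A minimal sketch of such a keepalive cycle, assuming the redis-py client and made-up node addresses (a real Proxy would be event-driven rather than a blocking loop):

    import time
    import redis

    # Persistent connections to the Configuration node and the Data nodes
    # (addresses are illustrative).
    nodes = {
        'config': redis.Redis(host='192.168.1.100', port=6379),
        'data-1': redis.Redis(host='192.168.1.19', port=6379),
        'data-2': redis.Redis(host='192.168.1.200', port=6379),
    }

    def keepalive_cycle():
        # Send a PING on every connection: a failing PING means the node
        # is unreachable and the Proxy can react immediately.
        for name, conn in nodes.items():
            try:
                conn.ping()
            except redis.ConnectionError:
                print('node %s looks down' % name)

    while True:
        keepalive_cycle()
        time.sleep(5)  # only really needed when there is no other traffic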
When a Proxy node is started it needs to know the address of the Configuration node in order to load the information about the Data nodes and the mapping between the key space and the nodes.
On startup a Proxy node will also register itself in the Configuration node, and will make sure to refresh its registration every N seconds (via an EXPIREing key) so that it is possible to detect when a Proxy node fails.
When a new Data node joins or leaves the cluster, and in general when the cluster configuration changes, all the Proxy nodes will receive a notification and will reload the configuration from the Configuration node.
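A sketch of the registration and refresh logic, again assuming redis-py; the key name, the TTL and the way the Proxy identifies itself are assumptions used only to make the example concrete:

    import socket
    import time
    import redis

    conf = redis.Redis(host='192.168.1.100', port=6379)  # Configuration node
    proxy_id = '%s:%d' % (socket.gethostname(), 7000)     # hypothetical identity
    REFRESH_SECONDS = 10

    def register_and_refresh():
        # The registration key expires after a few refresh periods: if the
        # Proxy dies the key disappears and the failure can be detected.
        conf.setex('proxy:' + proxy_id, REFRESH_SECONDS * 3, 'alive')

    def load_configuration():
        # Load the Data node list and the slot mapping (same hypothetical
        # key layout used in the sketch of the Configuration node above).
        datanodes = conf.smembers('cluster:datanodes')
        slots = conf.hgetall('cluster:slots')
        return datanodes, slots

    datanodes, slots = load_configuration()  # initial configuration
    while True:
        register_and_refresh()
        time.sleep(REFRESH_SECONDS)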
This is how a request is served:

1) A client sends a query to a Proxy node, using the Redis protocol as if it were a plain Redis node.
2) The Proxy node inspects the command arguments to detect the key. The key is hashed to obtain a hash slot. The Proxy node has the table mapping every hash slot to M Data nodes, and persistent connections to all of them.
At this point the process is different for read and write queries:
WRITE QUERY:
3a) The Proxy Node forwards the query to M Data Nodes at the same time, waiting for replies.
3b) Once all the replies are received the Proxy node checks that the replies are consistent: for instance all the M nodes need to reply with OK, and so forth. If the query fails on a subset of nodes but succeeds on others, the failing nodes are considered unreliable and are taken offline, notifying the Configuration node.
3c) The reply is transferred back to the client.
READ QUERY:
3d) The Proxy node forwards the query to a single random Data node among the M nodes mapped to the key's slot, passing the reply back to the client.
For example, assume that slot 10 is mapped (with M=2) to the two Data nodes 192.168.1.19:6379 and 192.168.1.200:6379: keys hashing to slot 10 will be saved in both Data nodes.
When a client performs a read operation (via a Proxy node), the proxy will contact a random Data node among the Data nodes in charge of the given slot.
For instance a client can send the following command to a given Proxy node:
GET mykey
"mykey" hashes to (for instance) slot 10, so the Proxy will forward the request to either Data node 192.168.1.19:6379 or 192.168.1.200:6379, and then forward the reply back to the client.
When a write operation is performed, it is instead forwarded to both the Data nodes of the example (and in general to all the M Data nodes mapped to the slot).
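A minimal sketch of this routing logic, assuming redis-py; the hash function is an assumption (this document does not fix one), and for simplicity every slot is mapped to the same pair of Data nodes used in the example above:

    import random
    import zlib
    import redis

    HASH_SLOTS = 1024

    def key_slot(key):
        # Any stable hash works; CRC32 modulo 1024 is just one choice.
        return zlib.crc32(key.encode()) % HASH_SLOTS

    # Persistent connections to the M = 2 Data nodes of the example.
    replicas = [redis.Redis('192.168.1.19', 6379),
                redis.Redis('192.168.1.200', 6379)]
    slot_table = {slot: replicas for slot in range(HASH_SLOTS)}

    def proxy_write(key, *command):
        # Forward to all the M Data nodes mapped to the key's slot and
        # check that the replies agree; a disagreeing node would be
        # reported to the Configuration node as unreliable.
        nodes = slot_table[key_slot(key)]
        replies = [n.execute_command(*command) for n in nodes]
        if any(r != replies[0] for r in replies):
            raise RuntimeError('inconsistent replies: some node must go offline')
        return replies[0]

    def proxy_read(key, *command):
        # Forward to a single random Data node among those serving the slot.
        node = random.choice(slot_table[key_slot(key)])
        return node.execute_command(*command)

    proxy_write('mykey', 'SET', 'mykey', 'somevalue')
    print(proxy_read('mykey', 'GET', 'mykey'))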
Adding or removing a node
=========================
When a Data node is added to the cluster, it is added via an LPUSH operation into a Redis list representing a queue of Data nodes that are ready to enter the cluster. This list is held by the Configuration node of course, and entries can be added manually or via a configuration utility:
LPUSH newnodes 192.168.1.55:6379
The Handling node will check from time to time for new elements in the "newnodes" list. If there are new nodes pending to enter the cluster, they are processed one after the other in this way:
For instance let's assume there are already two Data nodes in the cluster:
192.168.1.1:6379
192.168.1.2:6379
We add a new node 192.168.1.3:6379 via the LPUSH operation.
We can imagine that the 1024 hash slots are assigned evenly among the two initial nodes. In order to add the new (third) node, what we have to do is to incrementally move about 341 slots from the two old servers to the new one.
For now we can think of every hash slot as being stored on a single server; the idea is generalized later.
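To make the arithmetic concrete (nothing here is mandated by the design, it is just the 1024/3 computation spelled out):

    HASH_SLOTS = 1024

    def target_slots_per_node(num_nodes):
        # Ideal share of the key space for each Data node.
        return HASH_SLOTS // num_nodes

    print(target_slots_per_node(2))  # 512: initial layout with two nodes
    print(target_slots_per_node(3))  # 341: after the third node joins, so
                                     # roughly 341 slots move to the new node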
In order to simplify the implementation, every slot is moved from one Data node to another in a blocking way: read operations continue to be served for all the 1024 slots, but write operations against the single slot being moved are delayed until the move from one Data node to the other is completed.
In order to do so the Handling node, before moving a given slot, marks it as "write-locked" in the Configuration node, then asks all the Proxy nodes to refresh the configuration.
Then the slot is moved (1/1024 of all the keys). The Configuration node is updated to reflect the new slot mapping, the slot is unlocked, and the Proxy nodes are notified.
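A sketch of the whole procedure as the Handling node might run it. The "newnodes" queue comes from this document; the lock key, the rebalancing policy, the key-moving step and the notification mechanism are assumptions introduced only to make the example self-contained:

    import redis

    HASH_SLOTS = 1024
    conf = redis.Redis(host='192.168.1.100', port=6379)  # Configuration node

    def slots_to_rebalance(new_addr):
        # Placeholder policy: move the first ~341 slots (see the arithmetic
        # above); a real Handling node would pick them from the old nodes.
        return range(HASH_SLOTS // 3)

    def move_slot_keys(slot, new_addr):
        # Placeholder: here the keys belonging to the slot would be copied
        # from the old Data node to the new one.
        pass

    def notify_proxies():
        # Placeholder: the document only says the Proxy nodes "receive a
        # notification"; the mechanism is not specified.
        pass

    def add_data_node(new_addr):
        for slot in slots_to_rebalance(new_addr):
            conf.set('slot:%d:lock' % slot, 'write-locked')  # hypothetical key
            notify_proxies()                 # writes to this slot are delayed
            move_slot_keys(slot, new_addr)   # move 1/1024 of the keys
            conf.hset('cluster:slots', slot, new_addr)  # new slot mapping
            conf.delete('slot:%d:lock' % slot)
            notify_proxies()                 # unlock: writes resume

    def process_new_nodes():
        # Consume pending Data nodes from the queue fed by LPUSH.
        while True:
            addr = conf.rpop('newnodes')
            if addr is None:
                break
            add_data_node(addr.decode())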
Implementation details
======================
Running the Handling node and the Configuration node on the same physical computer is probably a good idea.