Consul leader election issues
Problem: The cluster is in a broken state because Consul can't gather a quorum with its Raft implementation.
In my case, one of the Raft peers was bogus. I had accidentally left it advertising its IP as 127.0.0.1, but there was no process with that node ID at that address.
There are two possible paths out that I know of. You can put a peers.json file in the Consul data directory, or you can manually bring up the consul process with the -bootstrap flag to allow it to elect itself leader. The peers.json file approach worked for me.
The peers.json format differs depending on the Raft protocol version in use, but mine looked like this:
[
  {
    "id": "e4c3529a-c3ad-ae8b-7e8a-60c784d72eea",
    "address": "192.168.88.2:8300",
    "non_voter": false
  },
  {
    "id": "0ab95c84-c779-6439-289b-781e74f64503",
    "address": "192.168.88.3:8300",
    "non_voter": false
  },
  {
    "id": "5942fa52-081f-44c8-4ba7-ffc4f14f8807",
    "address": "192.168.88.4:8300",
    "non_voter": false
  }
]
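Since my whole problem was a peer advertising 127.0.0.1, it may be worth sanity-checking the entries before writing the file. Here is a minimal Python sketch that builds the JSON above and refuses loopback addresses. It assumes IPv4 `ip:port` addresses and the `id`/`address`/`non_voter` layout shown here; `build_peers` is a made-up helper name, not something Consul ships.

```python
import ipaddress
import json


def build_peers(servers):
    """Return peers.json text for a list of (node_id, "ip:port") pairs.

    Raises ValueError if an address is malformed or is a loopback address,
    since other servers cannot reach a peer advertised that way.
    """
    peers = []
    for node_id, address in servers:
        host, _, port = address.rpartition(":")
        if not host or not port.isdigit():
            raise ValueError(f"expected ip:port, got {address!r}")
        # ip_address() itself raises ValueError on a malformed IP.
        if ipaddress.ip_address(host).is_loopback:
            raise ValueError(f"{node_id} advertises loopback address {address}")
        # Every server is written as a voter, matching the file above.
        peers.append({"id": node_id, "address": address, "non_voter": False})
    return json.dumps(peers, indent=2)


if __name__ == "__main__":
    print(build_peers([
        ("e4c3529a-c3ad-ae8b-7e8a-60c784d72eea", "192.168.88.2:8300"),
        ("0ab95c84-c779-6439-289b-781e74f64503", "192.168.88.3:8300"),
        ("5942fa52-081f-44c8-4ba7-ffc4f14f8807", "192.168.88.4:8300"),
    ]))
```

Passing an entry such as `("some-id", "127.0.0.1:8300")` raises a `ValueError` instead of producing a file that repeats my mistake.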
To have the system re-bootstrap, stop the consul process (sudo systemctl stop consul) on all nodes in the quorum. Put the peers.json file in $CONSUL_DATA_DIR/raft/ on each node (the Consul data directory is specified in the Consul config). Then start the processes again.
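The "put the file in place" step is easy to get slightly wrong (wrong directory, malformed JSON), so here is a hedged Python sketch of it. It assumes consul is already stopped on the node and that you pass in the data directory from your own config. `install_peers_file` is a hypothetical helper, and it does not deal with file ownership or permissions.

```python
import json
import shutil
from pathlib import Path


def install_peers_file(peers_file, data_dir):
    """Copy peers_file to <data_dir>/raft/peers.json and return the destination."""
    peers_file = Path(peers_file)
    raft_dir = Path(data_dir) / "raft"

    # Fail early on malformed JSON (json.JSONDecodeError is a ValueError).
    json.loads(peers_file.read_text())

    # The raft directory should already exist on a node that has run as a server.
    if not raft_dir.is_dir():
        raise FileNotFoundError(f"{raft_dir} does not exist; is this the right data directory?")

    dest = raft_dir / "peers.json"
    shutil.copyfile(peers_file, dest)
    return dest
```

You would run this on every server while consul is stopped, for example `install_peers_file("peers.json", "/path/to/consul/data")`, and only then start the processes again.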
Example logs from failing to elect a leader
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.326Z [ERROR] agent: failed to sync changes: error="No cluster leader"
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.523Z [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.524Z [INFO] agent.server.raft: entering candidate state: node="Node at 192.168.88.3:8300 [Candidate]" term=31
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.532Z [INFO] agent.server.raft: election won: term=31 tally=2
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.532Z [INFO] agent.server.raft: entering leader state: leader="Node at 192.168.88.3:8300 [Leader]"
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.532Z [INFO] agent.server.raft: added peer, starting replication: peer=e4c3529a-c3ad-ae8b-7e8a-60c784d72eea
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.532Z [INFO] agent.server.raft: added peer, starting replication: peer=5942fa52-081f-44c8-4ba7-ffc4f14f8807
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.532Z [INFO] agent.server.raft: added peer, starting replication: peer=6826fedb-99ea-196e-bbb8-bf57ad0989fe
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.533Z [INFO] agent.server: cluster leadership acquired
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.533Z [INFO] agent.server: New leader elected: payload=abrahms-server-1
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.533Z [WARN] agent.server.raft: unable to get address for server, using fallback address: id=6826fedb-99ea-196e-bbb8-bf57ad0989fe fallback=127.0.0.1:8300 error="Could not find address for server >
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.533Z [INFO] agent.server.raft: pipelining replication: peer="{Voter e4c3529a-c3ad-ae8b-7e8a-60c784d72eea 192.168.88.2:8300}"
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.536Z [INFO] agent.server.raft: pipelining replication: peer="{Voter 5942fa52-081f-44c8-4ba7-ffc4f14f8807 192.168.88.4:8300}"
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.540Z [INFO] agent.server.raft: entering follower state: follower="Node at 192.168.88.3:8300 [Follower]" leader-address= leader-id=
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.540Z [INFO] agent.server.raft: aborting pipeline replication: peer="{Voter e4c3529a-c3ad-ae8b-7e8a-60c784d72eea 192.168.88.2:8300}"
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.540Z [INFO] agent.server.raft: aborting pipeline replication: peer="{Voter 5942fa52-081f-44c8-4ba7-ffc4f14f8807 192.168.88.4:8300}"
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.540Z [ERROR] agent.server: failed to wait for barrier: error="leadership lost while committing log"
Jul 20 05:29:38 abrahms-server-1 consul[1323]: 2023-07-20T05:29:38.540Z [INFO] agent.server: cluster leadership lost
Jul 20 05:29:42 abrahms-server-1 consul[1323]: 2023-07-20T05:29:42.426Z [INFO] agent.server.serf.wan: serf: attempting reconnect to abrahms-server-9rzl.dc1 127.0.0.1:8302
Jul 20 05:29:45 abrahms-server-1 consul[1323]: 2023-07-20T05:29:45.680Z [WARN] agent: Syncing service failed.: service=consul error="No cluster leader"
Jul 20 05:29:45 abrahms-server-1 consul[1323]: 2023-07-20T05:29:45.680Z [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.236Z [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader-addr= last-leader-id=
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.236Z [INFO] agent.server.raft: entering candidate state: node="Node at 192.168.88.3:8300 [Candidate]" term=32
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.244Z [INFO] agent.server.raft: election won: term=32 tally=2
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.245Z [INFO] agent.server.raft: entering leader state: leader="Node at 192.168.88.3:8300 [Leader]"
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.245Z [INFO] agent.server.raft: added peer, starting replication: peer=e4c3529a-c3ad-ae8b-7e8a-60c784d72eea
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.245Z [INFO] agent.server: cluster leadership acquired
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.245Z [INFO] agent.server.raft: added peer, starting replication: peer=5942fa52-081f-44c8-4ba7-ffc4f14f8807
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.245Z [INFO] agent.server.raft: added peer, starting replication: peer=6826fedb-99ea-196e-bbb8-bf57ad0989fe
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.245Z [WARN] agent.server.raft: unable to get address for server, using fallback address: id=6826fedb-99ea-196e-bbb8-bf57ad0989fe fallback=127.0.0.1:8300 error="Could not find address for server >
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.245Z [INFO] agent.server: New leader elected: payload=abrahms-server-1
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.246Z [INFO] agent.server.raft: pipelining replication: peer="{Voter e4c3529a-c3ad-ae8b-7e8a-60c784d72eea 192.168.88.2:8300}"
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.248Z [INFO] agent.server.raft: pipelining replication: peer="{Voter 5942fa52-081f-44c8-4ba7-ffc4f14f8807 192.168.88.4:8300}"
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.248Z [INFO] agent.server.raft: entering follower state: follower="Node at 192.168.88.3:8300 [Follower]" leader-address= leader-id=
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.248Z [INFO] agent.server.raft: aborting pipeline replication: peer="{Voter e4c3529a-c3ad-ae8b-7e8a-60c784d72eea 192.168.88.2:8300}"
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.248Z [ERROR] agent.server: failed to wait for barrier: error="node is not the leader"
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.249Z [INFO] agent.server: cluster leadership lost
Jul 20 05:29:47 abrahms-server-1 consul[1323]: 2023-07-20T05:29:47.248Z [INFO] agent.server.raft: aborting pipeline replication: peer="{Voter 5942fa52-081f-44c8-4ba7-ffc4f14f8807 192.168.88.4:8300}"
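The telling line in these logs is the WARN where raft can't find an address for a peer ID and falls back to 127.0.0.1:8300. A small Python sketch for pulling those peers out of a log dump follows. The regex is based only on the message format seen in my logs above, so treat it as an assumption that may not hold for other Consul versions; `find_unresolvable_peers` is a hypothetical name.

```python
import re

# Matches the "using fallback address" warning as it appears in the logs above.
FALLBACK_RE = re.compile(
    r"unable to get address for server, using fallback address: "
    r"id=(?P<id>[0-9a-f-]+) fallback=(?P<fallback>\S+)"
)


def find_unresolvable_peers(log_text):
    """Map each peer ID raft could not resolve to the fallback address it used."""
    return {
        match.group("id"): match.group("fallback")
        for match in FALLBACK_RE.finditer(log_text)
    }
```

Run against the logs above, this yields a single entry mapping 6826fedb-99ea-196e-bbb8-bf57ad0989fe to 127.0.0.1:8300. That ID appears among the peers being replicated to but is not in my peers.json, which makes it the bogus peer.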