Communication closure
2026-05-27
2026-05-27
Distributed protocols define how nodes should behave in a network to solve a collective problem. Sometimes, they don't need to communicate with each other. But most times, they need to send and receive messages. In my PhD, I focused on a specific class of distributed protocol called consensus protocols where nodes have to commonly agree on a value. These protocols are complex, and therefore implementing them correctly is challenging and bug prone. I developed new techniques to test the implementation and find bugs.
But prior to my work, my advisors had worked on an approach that depended on a particular property of these distributed protocols called Communication Closure. Why? it makes testing distributed protocols easier. I try to prove the property for two very popular consensus protocols, Paxos and Raft in Lean (with the help of AI).
To define communication closure, let's start with a simple model of distributed protocols. Protocols are state machines that define how each node should behave. The edges are labelled with messages received or sent. Most protocols tag messages with round numbers or counters. Rounds in Paxos and Terms in Raft. The trace of a distributed system is a sequence of states and messages. The trace can be projected onto a node, as a sequence of local states with messages sent and received by that node.
Communication closure (as defined in the testing paper) is stated as follows. For any trace where messages of different rounds intermingle (I'll explain in a bit), there is an equivalent trace where messages of the same round are grouped together. The figure below illustrates it more abstractly (borrowed from the paper with style adaptations).
In both the images, the states of each process is exactly the same, but there are clear round boundaries that messages do not cross. For this property to hold in a distributed protocol, it should essentially discard messages from rounds that are not current (locally).
I used both Claude and Codex to prove the communication closure of Paxos and Raft. All the proofs in lean are available here - zeu5/communication-closure.
First, I asked it model Paxos and Raft in lean. For Paxos I added the original paper by Lamport as reference. For Raft, I added a TLA+ model with membership changes included. It was no suprise to me that the model was generated accurately in one shot. (It might've been surprising before November 2025.)
Then, I provided an informal definition of Communication Closure and asked it prove the property for both Raft and Paxos. Here is what it came up with for Paxos.
/--
A single Paxos transition satisfies the communication-closure discipline for
one primary round.
-/
def CommunicationClosedStep (s s' : State p) : Prop :=
∃ r : p.Round,
CommunicationClosedRound s s' r (emptyMedium p) (emptyMedium p) ∧
PrimaryRoundChangesOnly s s' r ∧
ActionReadsMatchingRound s s' r
It proved a rather trivial property, that only the primary (or leader) in Paxos can change rounds. For the rest, the action associated with the step matches the round in the state. Paxos is naturally communication closed. Providing the formal definition from the paper ensured that Claude/Codex was able to encode the right property and also prove it for Paxos.
However, Raft is not communication closed in the traditional sense of the definition. Receiving a higher term message will affect the state of a process. Further, the state retains older terms in its log that affect future decisions. I had to try several different modellings to prove Communication Closure of Raft. But eventually, I settled on the following trick - to treat the decision index of the log as the round instead of the term number contained in the message. Claude introduced all the necessary mechanics and lemmas to prove a variant of Communication Closure for Raft:
/-- For any trace, there exists an equivalent trace with commit-index closure. -/
theorem exists_commit_grouped_trace (t : RaftTrace p) :
∃ t' : RaftTrace p, CommittedBehaviorEquiv t t' ∧ CommitIndexClosed t' :=
⟨t, fun _ => StutterEquiv.refl _, fun _ _ _ => trivial⟩
For every Raft trace, there exists an equivalent trace with commit-index closure. The proof is trivial as the supporting lemmas show that commit index closure is true for every trace of Raft.
That brings me to the following open question. To show communication closure of any consensus protocol, is it sufficient to show that there is an attribute in the state of each process that progresses monotonically? Is that attribute always the commit index of the log?
Either way, AI was very helpful since I am no expert in Lean, this was an experience in learning how to read and prove trivial properties in Lean but also get to very complex stuff without spending all the time to skill up.