A classic not to be overlooked in understanding blockchain: PBFT

share
A classic not to be overlooked in understanding blockchain: PBFT

In this article, the author will introduce a long-standing classic: PBFT. Its full name is Practical Byzantine Fault Tolerance, which has been around for over 20 years. Its invention stems from a famous consensus problem in distributed systems: the Byzantine Generals Problem. PBFT is not a consensus protocol designed for fully open environments - in fact, before the emergence of blockchain, there was no Byzantine fault-tolerant consensus protocol specifically for open environments. The advent of blockchain has inspired researchers to revisit this classic PBFT. PBFT has some characteristics that are quite different from blockchain, providing useful insights for improving blockchain, such as the Proof-of-Stake model built on PBFT. In the following sections, the author will briefly introduce the background, consensus operation, correctness proof, and the different characteristics of PBFT compared to blockchain.

Table of Contents

Why invent PBFT?

Throughout history, humans have continuously pursued systems that can operate sustainably. For a system to be sustainable, it must first be fault-tolerant to avoid downtime due to a single failure. One intuitive way to achieve fault tolerance is to design a system with a certain level of redundancy — allowing multiple entities with the same components and states to operate simultaneously, so that when a failure occurs, replacing the faulty entity can ensure the system continues to operate. In a distributed system, we refer to this design as State Machine Replication.

What is a state machine?

A state machine is an abstract black box that has an initial state and, upon receiving input, can transition to a new state based on a corresponding transition function, and the transition process is deterministic — given the same initial state and input, the same output will always be produced. A system composed of several state machines with the same transition function is called a state machine replication.

Why is consensus necessary?

Since state machines are deterministic, to ensure that these redundant state machines have consistent states, each state machine must agree on the order of transitions. These state machines form a network, with each state machine being a node in the network, and nodes can only achieve consensus through communication by exchanging messages.

Why is achieving consensus so difficult?

In situations where consensus is sought solely through communication, a challenging problem arises, which is a famous problem in consensus protocols: the Byzantine Generals Problem.

A group of Byzantine generals surrounds a city and must reach a consensus to either attack or retreat simultaneously, with each general only able to communicate their decision to others through messengers. However, among these generals, there are traitors who may send conflicting messages to disrupt the consensus, or only inform a portion of the generals. In the presence of known traitors, how can a correct consensus be reached?

Clearly, the consensus problem of state machine replication can be reduced to the Byzantine Generals Problem:

  • Nodes are the generals
  • Communication is the messengers
  • Decision to attack/retreat is the content that needs consensus
  • Random behavior of generals/state machines is called Byzantine Fault
  • The property of being able to reach consensus even in the presence of traitors/hostile state machines is called Byzantine Fault Tolerance.

Correct Consensus: Safety and Liveness

A correct and usable consensus protocol must ensure that Byzantine generals will always reach a unique consensus in terms of safety and that the consensus will be formed with liveness. The author will elaborate on safety and liveness in the following text.

Can consensus always be reached?

Can consensus that possesses safety and liveness always be achieved? The answer is not necessarily. According to the FLP theorem: In an asynchronous network communication and deterministic program scenario, if any crash fault occurs, then consensus cannot simultaneously have safety and liveness.

What to do? There are two methods to bypass the limitations of the FLP theorem: the first method is to assume synchronous network communication — as adopted by PBFT; the second method is to introduce some random factors into the consensus protocol, making the entire protocol non-deterministic, which makes the protocol more robust and able to operate even in an asynchronous network communication scenario, such as Honey Badger BFT, which is an asynchronous and non-deterministic consensus protocol.

How about performance?

Before the emergence of PBFT, various consensus protocols had been developed, but none had good enough performance until the arrival of PBFT, as indicated by the "P" in its name, Practical.

How does PBFT work?

PBFT is a Byzantine fault-tolerant state machine replication. To solve the Byzantine Generals Problem, one intuitive approach is to use single or multiple rounds of voting to achieve majority consensus. However, who initiates the voting? How many votes are needed to ensure safety and liveness in consensus? PBFT innovates with a three-phase voting design, consisting of "Pre-prepare," "Prepare," and "Commit" stages.

To simplify the problem, let's make the following assumptions:

Assumptions
  • There are only 4 generals, tolerating at most 1 traitor (4 = 3f+1, where f is the number of tolerated traitors). The significance of 3f+1 will be detailed in the safety analysis below.
  • Each general is numbered 0-3.
  • Each general can recognize each other's signatures.
  • Each action has a sequence number.
  • Attacks/retreats form a sequential series arranged by sequence number, for example: Attack — Attack — Defend — Attack — ...
  • There is a leader and several validators.
  • The leader is divided into different rounds, and the leader's round is called the view.
  • Generals take turns being the leader in a round-robin fashion, for example: the 1st generation leader is General 1, the 2nd generation leader is General 2, and so on; the 4th generation leader is General 0 again.
  • A general initiates a proposal for leadership rotation, triggering a view change mechanism called view-change.

Phase 1: Pre-prepare
  • The leader is responsible for receiving attack/retreat commands from the Byzantine client.
  • The leader initiates the proposal, which includes attack or retreat messages, the view number, and the sequence number.
  • The leader sends a "pre-prepare" message with their signature to other validators via messengers.

PBFT Pre-prepare Phase

Phase 2: Prepare
  • Upon receiving the "pre-prepare" message, validators decide whether to accept the leader's proposal. If they agree, they send a "prepare" message with their signature to all generals; if they disagree, they do not send any messages.
  • Validators who send "prepare" messages begin the "prepare" phase.
  • If a general receives over 3 "prepare" messages, the general enters the "prepared" state, and the set of these "prepare" messages is collectively referred to as a "prepared certificate."

PBFT Prepare Phase

Phase 3: Commit
  • If a "prepared" general decides to commit, they send a signed "commit" message to all generals; if they decide not to commit, they do not send any messages.
  • Generals who send "commit" messages begin the "commit" phase.
  • If a general receives over 3 "commit" messages, they execute the message content, indicating consensus has been reached for the proposal.
  • Generals who commit then enter the "executed" state and report the execution result to the Byzantine client.
  • After reporting, generals conclude this round and await the next proposal.