
GRANDPA Equivocation and sysinfo Process Collection Results In Slashing on Kusama Network: a Post-Mortem.

August 18, 2020 in Kusama, Slashing
by Polkadot

Multiple bugs in the node code resulted in nodes dropping out of the Kusama network and losing the stored record of which blocks they had validated. Consequently, the same nodes double-signed those blocks on restart. The slashes caused by this issue have been reverted via Kusama Council motions.

On Friday July 31, two Kusama validators on runtime version v2019 reported an issue: their nodes had started crashing every few minutes with two distinctive errors. At first glance, the problem seemed to be related to the validators' keys. This was subsequently ruled out, as the affected validators confirmed they had not changed keys in the process. Additionally, the issue seemed to be present solely on the Kusama network, not on Polkadot.

Going a bit further down the rabbit hole, the team realised that the issue seemed to have started as a result of a GRANDPA equivocation causing a slash event in Kusama, originally triggered by a file descriptor leak that caused nodes to crash. This leak prevented nodes from writing the GRANDPA voter state (the votes at a given round) to disk and caused the nodes that lost this data to vote again after restarting, this time voting for a block newer than their original choice. This led to an equivocation.
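To make the failure mode concrete, here is a minimal sketch of what an equivocation looks like as data: the same voter casting votes for two different blocks in the same round. It is deliberately simplified and is not the GRANDPA implementation; the Vote type, the validator names and the block hashes are hypothetical, and the distinction between GRANDPA's vote types is ignored.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Vote(NamedTuple):
    voter: str         # hypothetical validator identifier
    round_number: int  # round in which the vote was cast
    block_hash: str    # hash of the block voted for (made-up values below)


def find_equivocations(votes: Iterable[Vote]) -> Dict[Tuple[str, int], Set[str]]:
    """Return {(voter, round): block hashes} for every voter that voted
    for more than one distinct block within a single round."""
    seen: Dict[Tuple[str, int], Set[str]] = defaultdict(set)
    for vote in votes:
        seen[(vote.voter, vote.round_number)].add(vote.block_hash)
    return {key: blocks for key, blocks in seen.items() if len(blocks) > 1}


# validator-a votes, crashes before the vote is written to disk, restarts
# without any record of that vote, and votes again for a newer block.
votes = [
    Vote("validator-a", 7, "0xaaa"),  # vote cast before the crash
    Vote("validator-b", 7, "0xaaa"),  # an unaffected validator
    Vote("validator-a", 7, "0xbbb"),  # vote cast after the restart
]
equivocations = find_equivocations(votes)
print(equivocations)
```

Had the first vote been persisted, the restarted node could have seen that it had already voted in round 7 and refrained from voting again.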

The combination of these two events, which resulted in validators being slashed, started at some point after v0.8.15 (v2015 in Kusama) was released and the network was upgraded. The Authority Discovery feature had already been in place for some time at the runtime module level but had not been enabled by default on the client, and this version also enabled GRANDPA to report equivocations via unsigned extrinsics.

With this information in hand, the team's main hypothesis was that equivocations caused by the file descriptor leak could have started happening some time earlier but were only reported after the v0.8.15 upgrade in July: nodes running this version started reporting themselves after crashing, which attracted the attention of the teams involved. Still, an investigation of the logs of nodes run by Parity did not find any earlier equivocation (these would have been logged to the terminal).

Further investigation into the root causes of the file descriptor leak pointed at two main culprits: authority discovery and metrics collection. Authority discovery was using an excessive number of sockets to query data from the DHT (i.e. to discover other authorities' IP addresses). For system metrics collection (e.g. CPU and memory) we were relying on the sysinfo crate, which was keeping a cache of file descriptors for all processes in the system and for the threads of each process (it fetches the data by reading from /proc).
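A leak of this kind can be observed from the outside by counting a process's open file descriptors over time. The sketch below is only an illustration of that idea and is not how the sysinfo crate or the Kusama node works: it counts the descriptors of the current Python process by listing its descriptor directory (assumed to be /proc/self/fd on Linux, or /dev/fd on some other Unix systems), and the helper name open_fd_count is hypothetical.

```python
import os
from typing import Optional


def open_fd_count() -> Optional[int]:
    """Count the file descriptors currently open in this process.

    Returns None on systems that expose neither descriptor directory.
    Listing the directory briefly opens one extra descriptor, so the
    absolute number is slightly inflated; differences between two calls
    are what matter here.
    """
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(fd_dir):
            return len(os.listdir(fd_dir))
    return None


before = open_fd_count()

# Simulate a leak: open descriptors and keep them instead of closing them.
leaked = [os.open(os.devnull, os.O_RDONLY) for _ in range(5)]
after_leak = open_fd_count()

# Closing them releases the descriptors again.
for fd in leaked:
    os.close(fd)
after_close = open_fd_count()

print(before, after_leak, after_close)
```

A count that keeps growing while the workload stays the same is the typical signature of a descriptor leak; once the per-process limit is reached, further attempts to open files or sockets fail.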

The short-term solution was to disable the Authority Discovery feature by default and to stop collecting system metrics. The Authority Discovery module will be re-enabled in a future release once there is a proper fix for the excessive use of sockets.

Until a new version was available, the Parity team recommended manually disabling Authority Discovery. Additionally, whenever a node crashed, validators were advised to wait 1-2 minutes before restarting it. This reduces the likelihood of the node equivocating in GRANDPA if its votes were not persisted to disk.
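The restart delay can be automated rather than applied by hand. The following is a rough sketch of a supervisor loop that waits a fixed time before restarting a crashed process; the function name, its parameters and the 120-second default are assumptions chosen for illustration, not part of any Polkadot tooling. The run and sleep arguments can be swapped out so the logic can be exercised without starting a real node.

```python
import subprocess
import time
from typing import Callable, List, Sequence


def supervise(
    command: Sequence[str],
    restart_delay_seconds: float = 120.0,
    max_restarts: int = 3,
    run: Callable[[Sequence[str]], int] = subprocess.call,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `command` and restart it after a fixed delay whenever it exits
    with a non-zero status. Returns the number of restarts performed."""
    restarts = 0
    while run(command) != 0 and restarts < max_restarts:
        sleep(restart_delay_seconds)  # wait before restarting, as advised above
        restarts += 1
    return restarts


# Dry run with stand-ins: the "node" crashes twice, then exits cleanly.
exit_codes = iter([1, 1, 0])
recorded_delays: List[float] = []
restart_count = supervise(
    ["node-binary"],                  # placeholder command, never executed here
    restart_delay_seconds=90,
    run=lambda cmd: next(exit_codes),
    sleep=recorded_delays.append,
)
print(restart_count, recorded_delays)
```

Operators who run their node under systemd can get a similar effect from the unit's RestartSec= setting instead of a custom wrapper.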

After some discussion and development, Polkadot v0.8.22 was released with the short-term fixes detailed above. All validators should upgrade to this version and monitor the results. All slashes caused by this bug were reverted by the Kusama Council, and in this spirit a new discussion was opened regarding the reversion of validators' economic loss but not their nomination loss.


To keep up with developments, there are plenty of ways to get plugged in to the Kusama community. Join the discussion on the Direction Channel. Learn more about Kusama on our website and in the Kusama Wiki. Want to join the core growth team behind Kusama? Join the Ambassador Program.
