
GRANDPA Equivocation and sysinfo Process Collection Result in Slashing on the Kusama Network: A Post-Mortem

August 18, 2020 in Kusama, Slashing
by Polkadot

Multiple bugs in the node software caused nodes to drop out of the Kusama network and lose the database that records which blocks they had validated. Consequently, the same nodes double-signed those blocks on restart. The slashes caused by this issue have been reverted via Kusama Council motions.

On Friday, July 31, two Kusama validators on runtime version v2019 reported that their nodes had started crashing every few minutes with two distinctive errors. At first glance, the problem seemed to be related to the validators' keys. This was later ruled out, as the affected validators confirmed they had not changed keys in the process. Additionally, the issue appeared only on the Kusama network, not on Polkadot.

Going a bit further down the rabbit hole, the team realised that the issue seemed to have started as a result of a GRANDPA equivocation causing a slash event in Kusama, originally triggered by a file descriptor leak that caused nodes to crash. This leak prevented nodes from writing the GRANDPA voter state (the votes at a given round) to disk and caused the nodes that lost this data to vote again after restarting, this time voting for a block newer than their original choice. This led to an equivocation.
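To make the failure mode concrete, here is a minimal toy sketch of how losing persisted voter state can turn a restart into an equivocation. It is not GRANDPA's actual implementation: the `ToyVoter` class, the `find_equivocations` helper, the round number and the block hashes are all hypothetical names and values chosen for illustration.

```python
from collections import defaultdict


def find_equivocations(votes):
    """Return {(voter, round): blocks} for every voter that cast
    votes for more than one block in the same round."""
    seen = defaultdict(set)
    for voter, round_number, block_hash in votes:
        seen[(voter, round_number)].add(block_hash)
    return {key: blocks for key, blocks in seen.items() if len(blocks) > 1}


class ToyVoter:
    """Hypothetical voter; `state` stands in for the voter state on disk."""

    def __init__(self, name):
        self.name = name
        self.state = {}  # round number -> block already voted for

    def vote(self, round_number, best_block):
        # Reuse the recorded vote if one exists; otherwise vote for the
        # block that currently looks best and record that choice.
        block = self.state.setdefault(round_number, best_block)
        return (self.name, round_number, block)


voter = ToyVoter("validator-1")
broadcast = [voter.vote(7, "0xaaa")]

# Simulated crash: the vote for round 7 never reached the disk.
voter.state.clear()

# After the restart the node sees a newer block and votes again in round 7.
broadcast.append(voter.vote(7, "0xbbb"))

equivocations = find_equivocations(broadcast)
```

With the state intact, the second call to `vote` would simply return the first choice again and nothing would be flagged; with the state lost, the same round ends up with two conflicting votes from the same voter.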

The combination of these two events, which resulted in validators being slashed, began at some point after v0.8.15 (v2015 in Kusama) was released and the network was upgraded. The Authority Discovery feature had already been in place for some time at the runtime module level but was not enabled by default on the client, and this version also enabled GRANDPA to report equivocations via unsigned extrinsics.

With this information in hand, the team's main hypothesis was that equivocations caused by the file descriptor leak could have started happening some time earlier but were only reported after the v0.8.15 upgrade in July: once running this version, nodes started reporting themselves after crashing, which attracted the attention of the teams involved. Still, an investigation of the logs of nodes run by Parity did not find any previous equivocations (these would have been logged to the terminal).

Further investigation into the root causes of the file descriptor leak pointed at two main culprits: authority discovery and metrics collection. Authority discovery was using an excessive number of sockets to query data from the DHT (i.e. to discover other authorities' IP addresses). For system metrics collection (e.g. CPU and memory) we were relying on the sysinfo crate, which was keeping a cache of file descriptors for every process in the system and for the threads of each process (it fetches the data by reading from /proc).
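A leak of this kind can be observed from outside the node by watching how many descriptors a process holds. The sketch below is a Linux-only illustration and not part of the node software: it counts the entries under `/proc/self/fd` for the current process and compares the result with the soft `RLIMIT_NOFILE` limit. The helper names `count_open_fds`, `fd_headroom` and `leak_descriptors` are hypothetical.

```python
import os
import resource  # Unix-only standard library module


def count_open_fds(fd_dir="/proc/self/fd"):
    """Count open file descriptors by listing the per-process fd directory.

    The listing itself briefly uses one descriptor, so compare counts
    taken the same way rather than reading the value as exact.
    """
    return len(os.listdir(fd_dir))


def fd_headroom():
    """Return how many more descriptors fit under the soft limit."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft_limit - count_open_fds()


def leak_descriptors(n, path=os.devnull):
    """Open n files and keep them open, imitating a descriptor leak."""
    return [open(path) for _ in range(n)]
```

Once the headroom reaches zero, further attempts to open files or sockets fail, which is consistent with the nodes described above being unable to write their voter state to disk.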

The short-term solution was to disable the Authority Discovery feature by default and to stop collecting system metrics. The Authority Discovery module will be re-enabled in a future release once there is a proper fix for the excessive use of sockets.

Until a new version was available, the Parity team recommended manually disabling Authority Discovery. Additionally, whenever a node crashed, validators were advised to wait 1-2 minutes before restarting it. This reduces the likelihood of the node equivocating in GRANDPA if its votes were not persisted to disk.
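The restart delay can be automated with a small supervisor. The following is a minimal sketch under stated assumptions, not an official tool: `run_with_restart_delay` is a hypothetical helper, the 90-second default is simply a value inside the advised 1-2 minute window, and the command to run is left to the operator.

```python
import subprocess
import time


def run_with_restart_delay(cmd, delay_seconds=90, max_restarts=3, sleep=time.sleep):
    """Run cmd and restart it after a non-zero exit, waiting first.

    Returns the list of exit codes observed. The `sleep` parameter is
    injectable so the waiting behaviour can be checked without waiting.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        code = subprocess.call(cmd)
        exit_codes.append(code)
        if code == 0:
            break  # clean shutdown: do not restart
        if attempt < max_restarts:
            # Wait before the next start instead of restarting immediately.
            sleep(delay_seconds)
    return exit_codes
```

An operator would pass their own node command as a list of arguments, for example `run_with_restart_delay(["my-node-binary"])`, where `my-node-binary` is a placeholder. Service managers offer equivalent restart-delay settings, which may be preferable in production.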

After some discussion and development, Polkadot v0.8.22 was released, including the short-term fixes detailed above. All validators should upgrade to this version and monitor the results. All slashes caused by this bug were reverted by the Kusama Council and, in this spirit, a new discussion was opened regarding the reversion of the economic loss, but not the nomination loss, suffered by validators.


To keep up with developments, there are plenty of ways to get plugged in to the Kusama community. Join the discussion on the Direction Channel. Learn more about Kusama on our website and in the Kusama Wiki. Want to join the core growth team behind Kusama? Join the Ambassador Program.
