A Polkadot Postmortem - 24.05.2021

TL;DR: On 24 May 2021, Polkadot nodes failed with an out-of-memory (OOM) error on block 5,202,216. This block contained an on-chain solution to the validator election; the election is normally computed off-chain and only takes place on-chain if no off-chain solution is submitted. Due to the large number of nominators, the election exceeded the memory allocated in the Wasm environment.

While an update was being prepared to fix the issue, validators were asked to temporarily downgrade their node software to a previous version that includes a native (non-Wasm) version of the runtime. The native version is not constrained by the Wasm memory allocator. The network recovered after an hour and ten minutes of downtime.

Later, on block 5,203,204, several nodes failed with a “storage root mismatch” error. Investigation showed that this was caused by a difference between the compiler versions used to build the native runtime and the on-chain Wasm runtime. The solution was to implement a feature that allows overriding the on-chain Wasm runtime with a Wasm runtime built with the correct compiler version.

The issue has since been resolved, and precautions have been implemented to prevent it from happening again.

The bad

On 24.05.2021, Polkadot nodes failed with an out-of-memory (OOM) error while trying to build block 5,202,216. The nodes themselves did not crash, but the runtime (i.e. the blockchain’s state transition function) did. Polkadot’s runtime is compiled to WebAssembly (Wasm) and is executed either by a Wasm interpreter or a Wasm compiler. In either case, the runtime execution environment provides a fixed amount of memory (64MB at that time), and this wasn’t enough for this block.

This block was the last block of the penultimate session in the era, meaning that a new validator set needed to be elected for the new era that would start after the next session. The election of the validator set can be done off-chain or on-chain, but off-chain is preferred because the election algorithm is computationally heavy. For this session, however, no validator submitted a solution (presumably because they ran into the same OOM while doing the election off-chain), so the election had to be done on-chain. The result was the OOM that all validators hit while trying to author this block. The fix for the OOM was straightforward: increase the default memory size of the Wasm runtime to 128MB (https://github.com/paritytech/substrate/pull/8892).
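To put the sizes mentioned above in Wasm terms, the following minimal sketch converts them into Wasm linear-memory pages. The 64 KiB page size comes from the WebAssembly specification; treating the "MB" figures in this post as MiB is an assumption, and the helper name `pages_for_mib` is purely illustrative. It does not show how the node actually configures its limit.

```rust
/// Size of one WebAssembly linear-memory page, fixed by the Wasm specification.
const WASM_PAGE_BYTES: u64 = 64 * 1024;
const MIB: u64 = 1024 * 1024;

/// Number of Wasm pages needed to hold `mib` mebibytes.
/// Assumption: the "MB" figures in the text are read as MiB.
fn pages_for_mib(mib: u64) -> u64 {
    mib * MIB / WASM_PAGE_BYTES
}

fn main() {
    // 64 and 128 are the old and new limits from the text;
    // 4096 MiB is the full 32-bit Wasm address space.
    for mib in [64u64, 128, 4096].iter() {
        println!("{} MiB = {} Wasm pages", mib, pages_for_mib(*mib));
    }
}
```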

To bring this change to all validators, a new release would need to be cut, and a large number of validators would need to update. However, there was a short-term solution that was much easier and, most importantly, faster to deploy. Polkadot’s runtime is compiled not only to Wasm but also to native code for better performance, and, crucially, the native runtime does not put any bounds on memory usage during execution. However, the native runtime only matches the on-chain runtime when the running node is from the same release as the on-chain runtime. The on-chain runtime at this point was the one matching the v0.8.30 release, which was released on 08.04.2021. Since then, there had already been 3 new releases, so most validators were already running the latest node release (v0.9.x).

So, to get past the problematic block as fast as possible, all validator operators were asked to downgrade their validators to v0.8.30 and to run them with the `--execution native` flag to force the use of the native runtime. Overall, it took about 1 hour and 10 minutes to go from detecting the issue, through coming up with a short-term solution and announcing it to validators, to having new blocks built and the network fully recovered.

After the network was back, we started preparing the 0.9.3 release to distribute the increased Wasm memory limit so that we could support using the Wasm runtime again. As part of this, we took a node and checked that syncing the problematic block with the increased memory ceiling now worked with Wasm. The problematic block did indeed import, but we encountered a storage root mismatch while trying to import block 5,203,204.

The ugly

A storage root mismatch means that importing a block doesn’t produce the storage root advertised by the block author. In a blockchain, the same input should always lead to the same output. In this case, however, the network was still running and building blocks, and we had instructed all validators to run with the native runtime, so the mismatch could only mean that there was non-determinism between the native and the Wasm runtime.

So we started to investigate the mismatch between the native and Wasm runtimes. We first tried to sync the chain locally with the same release and the native runtime. However, this led to the same storage root mismatch. This was even more alarming, because the same code compiled for the same architecture should always produce the same results. The Wasm runtime is compiled in the so-called `no-std` environment, which uses different code paths, so it is “easier” to introduce a mismatch between Wasm and native. Compiling the native runtime twice, however, should result in code that does the same thing both times.

This led us to suspect that the Rust compiler was generating faulty code that resulted in the mismatch we had seen. By sheer luck (otherwise our endeavour would probably have taken a bit longer), someone at Parity still had a binary of this release lying around that wasn’t the same as the one attached to the release on GitHub. This binary was able to sync the chain with the native runtime without any problems. The only difference between this binary and the one we had just built was the version of the Rust compiler used. So we thought that something might have changed between the latest compiler version and the version used to build the node back then. And indeed, after downgrading the Rust compiler and rebuilding the release branch, the node synced successfully.

The good

After verifying that the native runtime compiled with the old Rust compiler could sync the chain, we also tried compiling the Wasm runtime with this Rust compiler. There is a special flag for the Polkadot node that allows us to override the on-chain Wasm runtime with a local version, and we used this to verify that syncing worked. So the question became: why did we have this mismatch between the native and Wasm runtimes of the 0.8.30 release? It helps to know that we use the Rust nightly compiler to compile the Wasm runtime (nightly is required because not everything we use in the Wasm build is available in the stable Rust compiler yet). The compiler versions used for the node and the Wasm runtime are part of the release announcement.

So something must have changed between the 1.51.0 stable Rust compiler (released on 23.03.2021 and used to build the native runtime) and the Rust nightly compiler from 07.04.2021 that was used to build the Wasm runtime. After some time bisecting the Rust toolchains between these dates, we found the nightly from 05.03.2021 to be the first one that broke our determinism. So we only needed to check the commits that were merged between 04.03.2021 and 05.03.2021, and there we found the problematic commit.
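The bisection described above can be sketched as a plain binary search for the first failing toolchain. This is only an illustration: the function `first_bad`, the list of nightlies, and the boolean "syncs" flags are hypothetical stand-ins for the real (slow) step of building the node with a given toolchain and trying to sync the chain. It assumes that every toolchain before the culprit is good and every one from it onwards is bad.

```rust
/// Return the index of the first item for which `is_bad` is true,
/// assuming all items before it are good and all items from it on are bad.
fn first_bad<T>(items: &[T], is_bad: impl Fn(&T) -> bool) -> Option<usize> {
    let (mut lo, mut hi) = (0, items.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if is_bad(&items[mid]) {
            hi = mid; // the first bad item is at `mid` or earlier
        } else {
            lo = mid + 1; // the first bad item is after `mid`
        }
    }
    if lo < items.len() { Some(lo) } else { None }
}

fn main() {
    // Hypothetical data: (nightly date, "does the node sync when built with it?").
    // In reality each flag would come from an actual build-and-sync run.
    let nightlies = [
        ("03.03.2021", true),
        ("04.03.2021", true),
        ("05.03.2021", false),
        ("06.03.2021", false),
    ];
    if let Some(i) = first_bad(&nightlies, |(_, syncs)| !*syncs) {
        println!("first nightly that breaks syncing: {}", nightlies[i].0);
    }
}
```

Each probe halves the remaining range, so even a range of several weeks of nightlies needs only a handful of builds.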

Compiling the Rust compiler without this commit and using the self-built compiler to compile our node showed that the native runtime produced the correct data and we could sync the chain. The commit changed the `binary_search_by` function so that it could return a different index when there are multiple matches. As we use this function in the runtime, this can lead to a slightly different ordering of the data stored in the state, which in turn leads to a different storage root.
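To illustrate why this matters, here is a minimal sketch (not the runtime’s actual code; the `Entry` type and both insert helpers are hypothetical). The standard library documents that `binary_search_by` may return any of the matching indices when there are several, so an insert position derived from it is not fully specified. Using `partition_point` (stable since Rust 1.52) is one way to pin the position down.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    score: u32,
    who: &'static str,
}

/// Insert using `binary_search_by`. With duplicate scores the returned index
/// is unspecified, so the order among equal scores may vary with the std version.
fn insert_unspecified(list: &mut Vec<Entry>, e: Entry) {
    let idx = match list.binary_search_by(|probe| probe.score.cmp(&e.score)) {
        Ok(i) | Err(i) => i,
    };
    list.insert(idx, e);
}

/// Insert after all entries with a score <= the new one: a fully specified position.
fn insert_deterministic(list: &mut Vec<Entry>, e: Entry) {
    let idx = list.partition_point(|probe| probe.score <= e.score);
    list.insert(idx, e);
}

fn sample() -> Vec<Entry> {
    vec![
        Entry { score: 1, who: "a" },
        Entry { score: 2, who: "b" },
        Entry { score: 2, who: "c" },
        Entry { score: 2, who: "d" },
        Entry { score: 3, who: "e" },
    ]
}

fn main() {
    let mut unspecified = sample();
    insert_unspecified(&mut unspecified, Entry { score: 2, who: "new" });
    // "new" lands somewhere among the entries with score 2; exactly where is not guaranteed.
    println!("{:?}", unspecified);

    let mut deterministic = sample();
    insert_deterministic(&mut deterministic, Entry { score: 2, who: "new" });
    // "new" always lands right after the last existing entry with score 2.
    println!("{:?}", deterministic);
}
```

If such a list is written to state, two binaries that disagree on the position of "new" will compute different storage roots from the same block.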

This meant that we now had blocks built by the native runtime that could not be synced with the Wasm runtime, and we could not change the on-chain Wasm runtime to fix this, because you cannot rewrite the history of the blockchain without forking. We came up with a pull request that introduced `code_substitute` to the chain specification. The chain specification is mainly used to store the genesis and some other information about the chain. The new field `code_substitute` is a map that uses a block hash as key and maps to a Wasm runtime code blob. It instructs the node to override the on-chain Wasm runtime with the given one for every block after the one specified in the chain specification, until the spec version of the runtime no longer matches.
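The selection rule described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the node’s implementation: the `CodeSubstitute` struct, its fields, and `select_runtime` are hypothetical names, the block hash from the chain specification is assumed to have already been resolved to a block number, and the spec version is assumed to be known for both runtimes.

```rust
/// Hypothetical, simplified view of one `code_substitute` entry.
struct CodeSubstitute {
    /// Number of the block named (by hash) in the chain specification.
    after_block: u32,
    /// Spec version of the on-chain runtime this blob is meant to replace.
    spec_version: u32,
    /// The replacement Wasm runtime blob.
    wasm: Vec<u8>,
}

/// Pick the Wasm code to execute a given block with: the substitute applies to
/// every block after `after_block`, as long as the on-chain spec version still matches.
fn select_runtime<'a>(
    block: u32,
    on_chain_spec_version: u32,
    on_chain_wasm: &'a [u8],
    substitute: &'a CodeSubstitute,
) -> &'a [u8] {
    if block > substitute.after_block && on_chain_spec_version == substitute.spec_version {
        substitute.wasm.as_slice()
    } else {
        on_chain_wasm
    }
}

fn main() {
    // Illustrative values only; real blobs are full Wasm runtimes.
    let substitute = CodeSubstitute { after_block: 100, spec_version: 30, wasm: vec![1] };
    let on_chain = vec![0u8];

    for (block, spec) in [(100u32, 30u32), (101, 30), (500, 31)].iter() {
        let code = select_runtime(*block, *spec, &on_chain, &substitute);
        let used = if code == substitute.wasm.as_slice() { "substitute" } else { "on-chain" };
        println!("block {} (spec {}): {} runtime", block, spec, used);
    }
}
```

Tying the substitute to the spec version means it stops applying once a runtime upgrade is enacted on-chain, so it does not need to be removed by hand.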

We also created a pull request that uses the `code_substitute` with the correct values to enable the nodes to sync again using Wasm. Anyone can rebuild the runtime using `srtool` to make sure that what’s being built is the code from v0.8.30 and that they get the same Wasm blob.

With the 0.9.3 release, the node contains all the fixes required to make the chain work as expected.

In the future, we will improve the situation further:

  • The deprecation of the native runtime will now be pursued with much higher priority. Using the Wasm compiler Wasmtime already brings us to a performance level that is almost the same as using the native runtime, so we don’t really need the native optimization anymore, especially given all its potential downsides.
  • The allocator will be improved to support a much more flexible allocation of resources, meaning we will not cap the maximum allocation at 128MB and will probably support the maximum of Wasm (4GB).
  • On-chain elections will be completely disabled; an election now needs to always happen off-chain and be submitted to the runtime.
  • Until the allocator is improved, the off-chain worker will use a higher memory limit than the on-chain Wasm runtime execution. This should help ensure that off-chain elections don’t run out of memory and can be submitted successfully.
  • For the time being, with a native and Wasm runtime, we will make sure to use the same compiler version for the native and the Wasm build. This should prevent running into changes resulting from using different toolchain versions.