Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

eth/downloader: fix invalid hash chain error due to head mini reorg #17839

Merged

Conversation

karalabe
Copy link
Member

@karalabe karalabe commented Oct 4, 2018

For a very long time now we're received issue reports about the synchronization failing with "invalid hash chain" errors. We managed to reproducibly hit it on Rinkeby lately, so here's the fix! 💥

The issue happens during fast sync, while the downloader is busy pulling the latest state trie. Since we added in-memory pruning about a year ago, the downloader cannot stick to a single pivot point throughout sync, rather it needs to push it forward if the chain progresses enough (since old state will become unavailable). This was handled by the downloader constantly rechecking for new headers while state sync is ongoing, slowly pushing them on top of the current queue.

A small mini reorg on the chain head however meant that the newly discovered header might actually not fit on top of the old queue, causing an invalid hash chain error, dropping the master peer. This can happen on mainnet if a previous head becomes an uncle, and this happens relatively often on Rinkeby if out-of-turn signers race for the next block.


I've tried to create a solution that's minimally invasive, since the downloader is one of our most sensitive pieces of code. The PR contains two tiny extensions to the header sync logic:

  • If we're not syncing skeletons any more, only small batches of headers (i.e. we're honing in on the head of the chain), we reach out to our local database and see where we're at sync progress wise. If there's plenty of work left (48 blocks currently), then instead of delivering all the new headers to the queue, we retain/delay the last 2. This ensures that as long as we're busy syncing, any tiny 2-block head reorgs get silently ignored.
  • After a while, the only missing headers will be the 2 we ignored, so every header fetch will just return the same 2 headers. We don't want the downloader to spin like crazy fetching these over and over again, so the second extension is that if we received headers, but have none to import (i.e. all were delayed), we sleep for 3 seconds before trying again to check for new data.

This seems to have completely fixed my Rinkeby sync, where I can do a full fast sync without hitting any invalid hash chain errors whatsoever.

head = full
}
}
// If the head is way older that this batch, delay the last few headers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/that/than

}
}
// If the head is way older that this batch, delay the last few headers
if head+uint64(reorgProtThreshold) < headers[n-1].Number.Uint64() && n >= reorgProtHeaderDelay {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that means. If head+48 < headers[last] && len(headers) > 2 , then use headers[:-2]. So what if len(headers)== 1

Either we should pause there, or the initial check n > 0 should be n > reorgProtHeaderDelay

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Headers are always fetched in batches of 128. Worst case scenario if there's exactly 1 header after a skeleton fetch, then we skip delaying and potentially hit the previous issue. I could add an extra clause to handle the case if len(headers) is less then reorgProtHeaderDelay.

} else {
// No headers delivered, or all of them being delayed, sleep a bit and retry
p.log.Trace("All headers delayed, waiting")
select {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the full details about how this part works -- but you introduce a sleep here (3s) which wasn't there previously. Are you sure that's 1. sufficient time and 2. doesn't block other important things, like downloader events that cannot be delivered?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mirrors line 826, which did the same thing, just when we fully exhausted our headers.

@karalabe karalabe force-pushed the downloader-invalid-hash-chain-fix branch from 087ab30 to 6ee3b26 Compare October 5, 2018 07:45
@karalabe karalabe merged commit 5d3b7bb into ethereum:master Oct 5, 2018
gzliudan added a commit to gzliudan/XDPoSChain that referenced this pull request Feb 24, 2025
gzliudan added a commit to gzliudan/XDPoSChain that referenced this pull request Feb 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants