Skip to content

runtime: tight loops should be preemptible #10958

@aclements

Description

@aclements
Member

Currently goroutines are only preemptible at function call points. Hence, it's possible to write a tight loop (e.g., a numerical kernel or a spin on an atomic) with no calls or allocation that arbitrarily delays preemption. This can result in arbitrarily long pause times as the GC waits for all goroutines to stop.

In unusual situations, this can even lead to deadlock when trying to stop the world. For example, the runtime's TestGoroutineParallelism tries to prevent GC during the test to avoid deadlock. It runs several goroutines in tight loops that communicate through a shared atomic variable. If the coordinator that starts these is paused part way through, it will deadlock.

One possible fix is to insert preemption points on control flow loops that otherwise have no preemption points. Depending on the cost of this check, it may also require unrolling such loops to amortize the overhead of the check.

This has been a longstanding issue, so I don't think it's necessary to fix it for 1.5. It can cause the 1.5 GC to miss its 10ms STW goal, but code like numerical kernels is probably not latency-sensitive. And as far as I can tell, causing a deadlock like the runtime test can do requires looking for trouble.

@RLH

Activity

added this to the Go1.6 milestone on May 26, 2015
minux

minux commented on May 26, 2015

@minux
Member
randall77

randall77 commented on May 26, 2015

@randall77
Contributor

The problem with interrupts is that the goroutine can be at an arbitrary address. The interrupted address is probably at a place where we have no data about where the pointers are. We'd either have to record a lot more frame layout/type/liveness information (including registers), or we'd have to simulate the goroutine forward to a safepoint. Both are challenging.

minux

minux commented on May 26, 2015

@minux
Member
ianlancetaylor

ianlancetaylor commented on May 26, 2015

@ianlancetaylor
Contributor

It seems to me that we can preempt at a write barrier. So the only loops we are talking about are those that make no writes to the heap and make no function calls. If we think in terms of adding

    uint8 b
  loop:
    ++b
    if b == 0 {
        preemptCheck()
    }

then the normal path through the loop will have two extra instructions (add/beq) where the add may be to either a register or a memory location, depending on overall register pressure. This will be measurable in tight loops but for most cases shouldn't be too bad.

randall77

randall77 commented on May 26, 2015

@randall77
Contributor

No pointer writes to the heap.

But maybe this is enough? If we can preempt at the first write barrier, then even if we let the mutator run it can't modify anything that the GC cares about. So maybe we just set the write barrier enabled bit and let goroutines run. Once a goroutine sees the write barrier bit (how do we force that?) it can't modify any memory the GC cares about. So it is safe to start the GC without waiting for every goroutine to stop.

aclements

aclements commented on May 27, 2015

@aclements
MemberAuthor

The problem of inserting artificial preemption points is reduced
numerical performance.

True. That's why I suggested unrolling loops, which AFAIK is a standard solution to this problem.

However, I think the check can be done in just two instructions, even without adding Ian's loop counter. Just CMP the preempt flag and branch if it's set. That branch will almost never be hit, so it will be highly predictable, and the preempt flag should be in the L1, so the check may in fact be extremely cheap.

No pointer writes to the heap. But maybe this is enough?

This certainly reduces the set of programs that have this problem, but I don't think it actually helps with either of the examples I gave, since numerical kernels probably won't have heap pointer writes, and the runtime test that can deadlock the GC certainly has no pointer writes.

RLH

RLH commented on May 27, 2015

@RLH
Contributor

This is the reference for doing a GC safepoint at every instruction. It is
not something we want to do for x86 much less the various other
architectures we need to support.
http://dl.acm.org/citation.cfm?id=301652

Most systems I know of do what Austin suggests, unroll the loop and insert
a check. No numbers but 8 unrolls seems to be what I recall. Not only is
the branch highly predictable but the check does not introduce any
dependencies making it close to free on an out of order machine. There have
been other approaches such as code plugging and predicated branches but HW
has moved on. I had not seen Ian's suggestion.

On Tue, May 26, 2015 at 9:24 PM, Austin Clements notifications@github.com
wrote:

The problem of inserting artificial preemption points is reduced
numerical performance.

True. That's why I suggested unrolling loops, which AFAIK is a standard
solution to this problem.

However, I think the check can be done in just two instructions, even
without adding Ian's loop counter. Just CMP the preempt flag and branch if
it's set. That branch will almost never be hit, so it will be highly
predictable, and the preempt flag should be in the L1, so the check may in
fact be extremely cheap.

No pointer writes to the heap. But maybe this is enough?

This certainly reduces the set of programs that have this problem, but I
don't think it actually helps with either of the examples I gave, since
numerical kernels probably won't have heap pointer writes, and the runtime
test that can deadlock the GC certainly has no pointer writes.


Reply to this email directly or view it on GitHub
#10958 (comment).

modified the milestones: Go1.6Early, Go1.6 on Jun 30, 2015
kingluo

kingluo commented on Sep 9, 2015

@kingluo

@aclements,

Currently goroutines are only preemptible at function call points.

To be precise, the preemption check seems only at newstack()?

For example,

package main

import (
    "fmt"
    "runtime"
    "time"
)

func test() {
    a := 100
    for i := 1; i < 1000; i++ {
        a = i*100/i + a
    }
}

func main() {
    runtime.GOMAXPROCS(1)
    go func() {
        for {
            test()
        }
    }()
    time.Sleep(100 * time.Millisecond)
    fmt.Println("hello world")
}

The test() is not inline, and the infinite loop calls test(), but it would not preempt at the calls, because the morestack() --> newstack() not involved, so the "hello world" would never be printed.

randall77

randall77 commented on Sep 9, 2015

@randall77
Contributor

test() is not inlined, but since it is a leaf it is promoted to nosplit, so it doesn't have the preemption check any more. Fixing that sounds much easier than the rest of this bug. Maybe we could forbid nosplit promotion for functions called from loops?

griesemer

griesemer commented on Sep 10, 2015

@griesemer
Contributor

For the reference, the same problem appeared in the HotSpot JVM. I remember two approaches:

  1. Around 1997/1998, Rene Schmidt (http://jaoo.dk/aarhus2007/speaker/Rene+W.+Schmidt) implemented the following mechanism: Threads running a tight loop w/o function calls would receive a signal to temporarily suspend them. The runtime then dynamically generated a partial "copy" of the loop instructions the thread was running, from the current PC to the (unconditional) backward branch, except that the backward branch was replaced with a call into the runtime (leading to proper suspension at a safe point). The thread was then restarted with the pc modified such that it would run the newly generated piece of code. That code would run to the end of the loop body and then suspend itself at which point the code copy was discarded and the pc (return address) adjusted to continue with the original code (after the safe point).

This mechanism was ingenious but also insanely complicated. It was abandoned eventually for:

  1. A simple test and branch (2 additional instructions) at the end of a loop body which didn't contain any calls. As far as I recall, we didn't do any form of loop unrolling and the performance implications were not significant in the overall picture of a larger application.

My vote would be for 2) to start in loops where @ianlancetaylor's suggestion doesn't work. I suspect that for all but the smallest long-running inner loops (where unrolling might make sense independently), the performance penalty is acceptable.

If this is not good enough, and we don't want to pay the code size cost of unrolling a loop multiple times, here's another idea based on 1): Instead of the backward branch check as in 2) plus unrolling, keep the original loop, and generate (at compile time) a 2nd version of the loop body that ends in a runtime call to suspend itself instead of the backward branch. The code size cost is about the cost of having the loop unrolled twice. When the goroutine needs to run to a safe point, use a signal to temporarily suspend the goroutine, switch the pc to the corresponding pc in the copy of the loop body, continue with execution there and have the code suspend itself at a safe point. There's no dynamic code generation involved, generating the extra loop body is trivial at compile-time, the extra amount of code is less than with loop unrolling, and there's only a little bit of runtime work to modify the pc. The regular code would run at full speed if no garbage collection is needed.

271 remaining items

aclements

aclements commented on Jan 2, 2020

@aclements
MemberAuthor

Go 1.14 will handle loop preemption far better than previous versions, but there are still some loose ends to tie up in 1.15, so I'm bumping this issue to 1.15.

aclements

aclements commented on Jan 2, 2020

@aclements
MemberAuthor

Actually, rather than bump this to 1.15, since tight loops are preemptible now, I've created a new issue to keep track of loose ends (#36365) and will go ahead and close this issue as fixed.

gopherbot

gopherbot commented on Jan 8, 2020

@gopherbot
Contributor

Change https://golang.org/cl/213837 mentions this issue: runtime: protect against external code calling ExitProcess

locked and limited conversation to collaborators on Jan 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    FrozenDueToAgeNeedsFixThe path to resolution is known, but the work has not been done.early-in-cycleA change that should be done early in the 3 month dev cycle.

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @bradfitz@mdempsky@rsc@quentinmit@minux

        Issue actions

          runtime: tight loops should be preemptible · Issue #10958 · golang/go