
runtime: make sure blocked channels run operations in FIFO order #11506

Closed

Description

rsc (Contributor)

I've had mail conversations with two different projects using Go recently, both centered around the same surprise: if many goroutines are blocked on a particular channel and that channel becomes available, it is still possible for running goroutines that "drive by" at the right time to get their operation in before the goroutines that are blocked, and the blocked goroutines can be reordered as a result. If this happens repeatedly, the blocked goroutines can block arbitrarily long even though the channel is known to be ready at regular intervals. I wonder if we should adjust the channel implementation to ensure that when a channel does become available for sending or receiving, blocked operations take priority over "drive by" operations. It seems to me that this can be done by completing the blocked operation as part of the operation that unblocks it.

Sends and receives on unbuffered channels already behave this way: a send with blocked receivers picks the first receiver goroutine off the queue, delivers the value to it, readies it (puts it on a run queue), and continues execution. That is, the send completes the blocked operation. Receives on unbuffered channels similarly complete blocked sends.

Sends and receives on buffered channels do not behave this way:

A send into a buffered channel with blocked receivers stores the value in the buffer, wakes up a receiver, and continues executing. When the receiver is eventually scheduled, it checks the channel, and maybe it gets lucky and the value is still there. But maybe not, in which case it goes to the end of the queue.

A receive out of a buffered channel copies a value out of the buffer, wakes a blocked sender, and continues. When the sender is eventually scheduled, it checks the channel, and maybe it gets lucky and there is still room in the buffer for the send. But maybe not, in which case it goes to the end of the queue.

It seems to me that it would be easy and symmetric with the unbuffered channel operations for a send with blocked receivers to deliver the value straight to a receiver, completing the first pending receive operation. Similarly a receive with blocked senders could take its own value out of the channel buffer and then complete the send into the newly emptied space. That would be the only operation of the four that needs to transfer a pair of values instead of just one.

It doesn't seem to me that it would hurt performance to do this, and in fact it might help, since there would not be all the unnecessary wakeups that happen now. It would give much more predictable behavior when goroutines queue up on a stuck channel: once queued, that's the order they're going to be served, guaranteed.

From a "happens before" point of view, I'm talking nonsense. There is no "happens before" for two different goroutines blocking on the same channel, and so there is no difference between all the execution orders. I understand that. But from a "what actually happens" point of view, there certainly is a difference: if you know that the goroutines have been stacking up one per minute for an hour before the channel finally gets unblocked and they all complete, you know what order they blocked and might reasonably expect them to unblock in that same order. This remains roughly as true at shorter time scales.

Especially since so many Go programs care about latency, making channels first come, first served seems worth doing. And it looks to me like it can be done trivially and at no performance cost. The current behavior looks more like a bug than an intended result.

I'm not talking about a language change, just a change to our implementation.

Thoughts?

@robpike @dvyukov @randall77 @aclements

Activity

self-assigned this on Jul 1, 2015; added to the Go1.6Early milestone on Jul 1, 2015
dr2chase (Contributor) commented on Jul 1, 2015

I would make this change. The current behavior sounds unfriendly to me; in years past I used a similar policy for locks and waits (in Java) and thought it gave good behavior. Starvation is bad.

robpike (Contributor) commented on Jul 2, 2015

Language semantics require that the values be FIFO, but the implications of that with multiple goroutines reading the values are unspecified and muddy at best. That said, I agree with the claim here that it is more natural for blocked goroutines to go first; it just seems intuitive.

I think you could even make a case that the current situation is almost a bug.

There may be a noticeable performance hit under heavy contention, though, because I believe this style requires asymptotically more context switches. I still think channels should behave as you suggest.

i3d (Contributor) commented on Jul 2, 2015

I'd fully support this. I think this is the more "natural" result I was expecting from one of the discussion threads we had.

randall77 (Contributor) commented on Jul 2, 2015

I already have a CL for basically exactly this.

https://go-review.googlesource.com/#/c/9345/

aclements (Member) commented on Jul 2, 2015

From a specification perspective, it's true that our current approach satisfies the happens-before graph, but it does not satisfy liveness. That's a formally defined, reasonable, and desirable property we could specify as a requirement of channels without specifying something as specific as FIFO blocking.

Implementation-wise, I'm concerned about making blocking strictly FIFO order. This seems like the exact same mistake as making map order deterministic or goroutine scheduling deterministic. People can and will come to depend on this and eventually we may want to weaken it. In particular, this is asking for scalability problems. One of the cardinal rules that came out of my thesis was to not impose strict ordering on operations unless it is a fundamental requirement. It is by no means a fundamental requirement of the blocking order.

I fully support addressing the liveness issue, and I like the approach of eliminating the wake-up race, but if we're going to do that, we should think about whether we can introduce bounded randomness into the blocking order to prevent people from depending on strict FIFO blocking.

cespare (Contributor) commented on Jul 2, 2015

@aclements FIFO blocking seems harder to incorrectly rely on than map ordering. Perhaps, like goroutine scheduling, it should only be randomized with -race?

sougou (Contributor) commented on Jul 2, 2015

This change will improve tail latency during bursty load, which is what we observed in one of our benchmarks in vitess.
If accepted, should the language specification be amended to reflect this behavior?

dr2chase (Contributor) commented on Jul 2, 2015

Flip a coin (e.g. a linear feedback shift register) and choose one of the top two waiters, until we come up with a better reason for some other policy. Starvation is exponentially rare and service is nearly FIFO, but the order will usually differ if there's more than one waiter.
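A sketch of what that policy could look like (hypothetical; `lfsr16` and `pickWaiter` are illustrative names, not anything in the runtime): a cheap LFSR coin flip decides between the oldest and second-oldest waiter, so no waiter can be skipped many times in a row except with exponentially vanishing probability.

```go
package main

import "fmt"

// lfsr16 is a 16-bit Galois LFSR (feedback mask 0xB400), a cheap
// pseudo-random bit source suitable for a runtime-internal coin flip.
// Seed must be nonzero.
type lfsr16 uint16

// flip advances the register one step and returns the output bit.
func (s *lfsr16) flip() bool {
	bit := *s & 1
	*s >>= 1
	if bit != 0 {
		*s ^= 0xB400
	}
	return bit != 0
}

// pickWaiter chooses which blocked waiter to serve: the oldest
// (index 0), or, on a coin flip, the second-oldest (index 1).
// Service stays nearly FIFO, but strict ordering is not a property
// programs can come to depend on.
func pickWaiter(s *lfsr16, nWaiters int) int {
	if nWaiters >= 2 && s.flip() {
		return 1 // serve the second-oldest this time
	}
	return 0 // serve the oldest
}

func main() {
	s := lfsr16(0xACE1)
	var counts [2]int
	for i := 0; i < 1000; i++ {
		counts[pickWaiter(&s, 2)]++
	}
	// Both outcomes occur, so neither waiter position is starved.
	fmt.Println(counts[0] > 0 && counts[1] > 0) // prints: true
}
```

The probability that a given oldest waiter is passed over k consecutive times is about 2^-k, which is the "starvation is exponentially rare" claim above.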

randall77 (Contributor) commented on Jul 2, 2015

I think we're overthinking the randomization. With randomized goroutine scheduling (which is already going in, at least for -race testing), we'll get randomized channel ordering for free. The random scheduling should change the order in which waiters get queued and the order in which sends get matched to the waiters.

dvyukov (Member) commented on Jul 2, 2015

You seem to assume that the entity that benefits from reduced latency is always a goroutine. This is true when a goroutine services a request and needs to acquire some auxiliary resource (e.g. a database connection). But it is not true for all producer-consumer/pipelining scenarios, where the entity that benefits from reduced latency is a message in a chan. Today messages in chans are serviced strictly FIFO and with minimal latency: the next running goroutine picks up the first message in the chan. What you propose improves the database connection pool scenario but equally worsens the producer-consumer scenario, because under your proposal we hand off the first message in the chan to a goroutine that will run who-knows-when, while a goroutine that drives by the chan the next moment either blocks or services the second message ahead of the first one. If we switch the implementation we will start receiving complaints about the other scenarios.

Also, such synchronization primitives are usually the wrong level for fairness: they can't ensure user-perceived fairness. Consider that to service a request you need to do 10 DB queries. The DB pool has a fair queue, yet older requests (doing their 10th query) compete with newer requests (doing their first query). As a result some requests can experience no waiting time, while others can wait behind a hundred requests in total.

What you propose is known to significantly reduce performance due to the requirement of a lock-step scheduling order, which becomes significantly worse if you add a bunch of unrelated goroutines to the mix, so that you have lock-step with a very long step time.

Regarding the unnecessary wakeup, it's easy to fix. See #8900.

RLH (Contributor) commented on Jul 2, 2015

I like the imagery of a drive-by goroutine a lot; the literature also uses the boring term barging. To put a number on the cost of fair locks, Doug Lea (http://gee.cs.oswego.edu/dl/papers/aqs.pdf) reports a 1 to 2 orders of magnitude slowdown over locks that allow barging. They (java.util.concurrent, circa 2004) also stepped back from a definition of fair to something less than strict FIFO.

In the intervening decade since that paper, Hannes Payer has done some promising work on queues with even weaker semantics, such as allowing a pop to return some value near, as opposed to at, the front of the queue. I believe we can build fairness on top of weaker unfair semantics, so not cooking the stronger semantics into Go is the way forward.


gopherbot (Contributor) commented on Oct 23, 2015

CL https://golang.org/cl/9345 mentions this issue.

7 remaining items

