
[Question] Possible race condition on AbstractReferenceCounted #8563

Closed · opened by merlimat (Contributor)

Description

I'm investigating an issue with a custom object that inherits from AbstractReferenceCounted, and looking at the code I have a doubt.

My expectation is that after calling retain(), if I didn't get an exception, the object should be guaranteed not to be deallocated.

Looking at the retain and release methods:

private ReferenceCounted retain0(int increment) {
    int oldRef = refCntUpdater.getAndAdd(this, increment);
    if (oldRef <= 0 || oldRef + increment < oldRef) {
        // Ensure we don't resurrect (which means the refCnt was 0) and also handle overflow.
        refCntUpdater.getAndAdd(this, -increment);
        throw new IllegalReferenceCountException(oldRef, increment);
    }
    return this;
}

private boolean release0(int decrement) {
    int oldRef = refCntUpdater.getAndAdd(this, -decrement);
    if (oldRef == decrement) {
        deallocate();
        return true;
    } else if (oldRef < decrement || oldRef - decrement > oldRef) {
        // Ensure we don't over-release, and avoid underflow.
        refCntUpdater.getAndAdd(this, decrement);
        throw new IllegalReferenceCountException(oldRef, -decrement);
    }
    return false;
}

The scenario I was considering is:

  1. Object has refCnt==1
  2. We have 3 threads playing with this object
  3. Thread-1 calls obj.release()
  4. Thread-2 calls obj.retain(), sees oldRef==0 and rolls back the increment (but the rollback is not atomic)
  5. Thread-3 calls obj.retain() and sees oldRef==1 (from T-2's increment), therefore thinks the object is not dead
  6. Thread-1 calls obj.deallocate()

Is this a possible scenario or am I missing something?

Would it make a difference to use a loop and compareAndSet() (which could probably have a perf penalty)?
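For comparison, a compareAndSet() loop could look roughly like the following (an illustrative sketch, not Netty's actual code; the name CasRefCounted is made up). Because the new count is only published when the old count is known to be positive, other threads can never observe the transient increment-then-rollback state from step 4:

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Sketch only: a CAS loop never publishes an increment on a dead object,
// so the transient "+1 then roll back" state cannot be observed.
class CasRefCounted {
    private static final AtomicIntegerFieldUpdater<CasRefCounted> UPDATER =
            AtomicIntegerFieldUpdater.newUpdater(CasRefCounted.class, "refCnt");
    private volatile int refCnt = 1;

    CasRefCounted retain(int increment) {
        for (;;) {
            int oldRef = UPDATER.get(this);
            if (oldRef <= 0 || oldRef + increment < oldRef) {
                // Dead or would overflow: fail without ever touching the count.
                throw new IllegalStateException("refCnt: " + oldRef);
            }
            if (UPDATER.compareAndSet(this, oldRef, oldRef + increment)) {
                return this;
            }
            // Lost a race with another thread; retry with a fresh value.
        }
    }

    boolean release() {
        for (;;) {
            int oldRef = UPDATER.get(this);
            if (oldRef < 1) {
                throw new IllegalStateException("refCnt: " + oldRef);
            }
            if (UPDATER.compareAndSet(this, oldRef, oldRef - 1)) {
                // When the count hits 0, deallocate() would run here.
                return oldRef == 1;
            }
        }
    }

    int refCnt() {
        return refCnt;
    }
}
```

The cost is the retry loop under contention, which is exactly the perf trade-off the question raises.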

Activity

normanmaurer (Member) commented on Nov 15, 2018

carl-mastrangelo (Member) commented on Nov 15, 2018

@merlimat yes, but at least one thread does see an exception. This was mentioned on the original PR. Also, IIRC, the quiescent state still ends up at the same ref count of zero.

I don't think it's really a recoverable scenario. The Linux Kernel uses CAS and makes sure all retain calls starting from 0 stay at 0, and all release calls from INT_MAX stay at INT_MAX. Netty could copy this if it wants to pay the performance penalty, but from my point of view, the ref counting is not meant to be water-tight.
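The saturating scheme described here could look roughly like the following (a hedged sketch loosely modeled on this description of the kernel's behavior, not actual kernel or Netty code; SaturatingRefCount is an invented name). Terminal states are pinned, so a racing retain can never resurrect a dead object:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: 0 means "dead" and Integer.MAX_VALUE means "saturated";
// CAS loops refuse to move the count out of either terminal state.
class SaturatingRefCount {
    private final AtomicInteger refCnt = new AtomicInteger(1);

    boolean retain() {
        for (;;) {
            int old = refCnt.get();
            if (old == 0) {
                return false;  // dead: never resurrect
            }
            // Saturate instead of overflowing; a saturated count is pinned forever.
            int next = (old == Integer.MAX_VALUE) ? Integer.MAX_VALUE : old + 1;
            if (refCnt.compareAndSet(old, next)) {
                return true;
            }
        }
    }

    boolean release() {
        for (;;) {
            int old = refCnt.get();
            if (old == 0 || old == Integer.MAX_VALUE) {
                return false;  // dead or pinned: no-op
            }
            if (refCnt.compareAndSet(old, old - 1)) {
                return old == 1;  // true => caller must deallocate
            }
        }
    }
}
```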

merlimat (Contributor, Author) commented on Nov 15, 2018

(I hadn't actually checked the PR description for that change)

yes, but at least one thread does see an exception.

Yes, in my example above T-2 will get an exception but T-3 will not.

Netty could copy this if it wants to pay the performance penalty, but from my point of view, the ref counting is not meant to be water-tight.

The problem is that I have no way to validate that my object is still valid after calling retain(). E.g., we might have a ByteBuf pointing to garbage data or triggering NPEs.

Sure, this only happens with 3+ threads, but I don't see a workaround for it.

carl-mastrangelo (Member) commented on Nov 15, 2018

The problem is that I have no way to validate that my object is still valid after calling retain()

A reasonable alternative is to call System.exit on getting a ref count exception.

merlimat (Contributor, Author) commented on Nov 15, 2018

A reasonable alternative is to call System.exit on getting a ref count exception.

Take this code for example:

  • This is a cache of entries
  • The cache keeps 1 ref count to always keep the entry valid.
  • Whenever we try to get an entry out of the cache, we retain() the entry, so that if it's being evicted we either get it or we get an exception.

https://github.com/apache/pulsar/blob/579e0dc74039c081301662e5fd55209c1a4abe5c/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/util/RangeCache.java#L83-L96
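The pattern being described can be sketched as follows (illustrative names only, not the actual Pulsar RangeCache code): the cache owns one reference, readers retain() before use, and a refcount exception from retain() is treated as "this entry was concurrently evicted".

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: a minimal cache where eviction and reads race on the refcount.
class RefCountedCache<K> {
    static class Entry {
        final AtomicInteger refCnt = new AtomicInteger(1);  // the cache's own reference

        void retain() {
            for (;;) {
                int old = refCnt.get();
                if (old <= 0) {
                    throw new IllegalStateException("already released");
                }
                if (refCnt.compareAndSet(old, old + 1)) {
                    return;
                }
            }
        }

        void release() {
            if (refCnt.decrementAndGet() == 0) {
                // deallocate() would run here (no underflow guard in this sketch)
            }
        }
    }

    private final ConcurrentHashMap<K, Entry> map = new ConcurrentHashMap<>();

    void put(K key, Entry e) {
        map.put(key, e);
    }

    // Returns the entry with an extra reference held, or null if it was
    // missing or lost the race against eviction.
    Entry get(K key) {
        Entry e = map.get(key);
        if (e == null) {
            return null;
        }
        try {
            e.retain();
            return e;
        } catch (IllegalStateException evicted) {
            return null;
        }
    }

    // Eviction drops the cache's own reference.
    void evict(K key) {
        Entry e = map.remove(key);
        if (e != null) {
            e.release();
        }
    }
}
```

The correctness of get() hinges on retain() reliably failing once the count has hit zero, which is exactly the guarantee under discussion.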

carl-mastrangelo (Member) commented on Nov 15, 2018

From my reading, that code isn't valid. Catching and discarding Throwable is going to cause problems anyway. ReferenceCounted doesn't really say what should happen in the event of errors, likely because different implementations may want different behavior.

It's hard to see this as a bug when there isn't expected behavior. Ensuring that every retain/release throws upon reaching an invalid state is costly, and not every use may want to pay that cost.

Additionally, you have a reasonable work around: You could wrap the objects in your cache in a custom implementation of ReferenceCounted and rely on the exact error behavior you want.

merlimat (Contributor, Author) commented on Nov 15, 2018

Additionally, you have a reasonable work around: You could wrap the objects in your cache in a custom implementation of ReferenceCounted and rely on the exact error behavior you want.

I can certainly fix that by using a different AbstractReferenceCounted base class, though my concern is that this will still apply to ByteBuf instances coming from a pool, where I cannot assess whether the buffer is pointing to valid data or not.

carl-mastrangelo (Member) commented on Nov 15, 2018

though my concern is that this will still apply to ByteBuf instances coming from a pool, where I cannot assess whether the buffer is pointing to valid data or not.

Looking more thoroughly at that code, it seems the issue is that adding/removing values from the cache and retaining/releasing them is not done atomically. If it were, you could use wrapper objects to do the ref counting and not have to extend or override the Value objects.

merlimat (Contributor, Author) commented on Nov 15, 2018

In any case, 83a19d5 changed a semantic that used to hold before.

If retain() doesn't offer the guarantee that an object is valid, at the very minimum it should be printed in all caps in red in the documentation :)

carl-mastrangelo (Member) commented on Nov 15, 2018

@normanmaurer One thing we could do is make AbstractReferenceCountedByteBuf override retain/release, revert my PR, and move that optimization down the class hierarchy. It's unclear to me whether io.netty.util is a utility package for Netty itself to use, or for users of Netty to use. I don't like breaking existing behavior, but the speedup is going to be painful to give back.

merlimat (Contributor, Author) commented on Nov 15, 2018

@carl-mastrangelo would it work to have some system property to control which implementation is being used?

carl-mastrangelo (Member) commented on Nov 15, 2018

@merlimat I'm reluctant to make it configuration based, because that tends not to play well when Netty is shaded. It could work, though I can't make that call.

merlimat (Contributor, Author) commented on Nov 16, 2018

Would it make sense to offer both strict and relaxed semantics in the ReferenceCounted interface?

Eg:

  • void retain() -> requires the object to have refCnt >= 1, otherwise behavior is undefined
  • boolean tryRetain() -> returns true if it was able to keep the object alive, false otherwise (and no exception)
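A minimal sketch of the proposed tryRetain() semantics (a hypothetical API, not part of Netty's ReferenceCounted interface; TryRetainExample is an invented name):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: tryRetain() reports liveness via its return value
// instead of throwing on a dead object.
class TryRetainExample {
    private final AtomicInteger refCnt = new AtomicInteger(1);

    // Returns true if the count was raised from a live state, false if the
    // object was already at 0 (released / being released). Never throws.
    boolean tryRetain() {
        for (;;) {
            int old = refCnt.get();
            if (old <= 0) {
                return false;  // dead: do not resurrect
            }
            if (refCnt.compareAndSet(old, old + 1)) {
                return true;
            }
        }
    }

    // Returns true when this call dropped the count to 0.
    boolean release() {
        return refCnt.decrementAndGet() == 0;
    }

    int refCnt() {
        return refCnt.get();
    }
}
```

A caller like the cache above could then branch on the boolean instead of catching IllegalReferenceCountException.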

(2 hidden items)

normanmaurer (Member) commented on Nov 16, 2018

@carl-mastrangelo just to be clearer: I think we need to guarantee that once an object has been released (meaning release(...) returned true), all other calls to release(...) and retain(...) must throw.

njhill (Member) commented on Nov 19, 2018

@carl-mastrangelo @normanmaurer not sure if this is of interest, but I thought of a way to retain roughly half of the speedup and address most of the race-condition problems; see 0c0e092e8588d5c6729080b5ae7345e041470616. Apologies if this is just the same as your already-dismissed idea.

In particular this guarantees @normanmaurer's requirement above in all cases and also ensures there can never be a double deallocation (as is possible with the current logic). However, it can suffer from some of the existing problems if releases coincide with retains overflowing (above Integer.MAX_VALUE + 1). Maybe that still disqualifies it, in which case never mind!

A couple of other observations:

  • Although the original "optimistic add" optimization improves the most highly contended cases, it actually harmed uncontended cases, which I wonder may actually be more common? Maybe in those cases latency generally isn't as important.
  • If any CAS loop is going to be used (e.g. if reverting to the original logic), a small optimization could be to use a non-volatile read for the first try (as done in the above commit).
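The "non-volatile read for the first try" idea can be sketched with Java 9+ VarHandles (an assumption on my part; the referenced commit may implement it differently, and PlainFirstReadRefCount is an invented name):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Sketch only: the fast path uses a plain (non-volatile) read; only on a
// failed CAS or a suspicious value do we pay for a full volatile read.
class PlainFirstReadRefCount {
    private static final VarHandle REF_CNT;
    static {
        try {
            REF_CNT = MethodHandles.lookup()
                    .findVarHandle(PlainFirstReadRefCount.class, "refCnt", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }
    private volatile int refCnt = 1;

    PlainFirstReadRefCount retain() {
        // First try: plain read; the CAS itself still has full atomicity.
        int old = (int) REF_CNT.getPlain(this);
        for (;;) {
            if (old <= 0) {
                // The plain read may be stale, so re-check with a volatile
                // read before declaring the object dead.
                old = (int) REF_CNT.getVolatile(this);
                if (old <= 0) {
                    throw new IllegalStateException("refCnt: " + old);
                }
            }
            if (REF_CNT.compareAndSet(this, old, old + 1)) {
                return this;
            }
            old = (int) REF_CNT.getVolatile(this);  // contended: reread and retry
        }
    }

    int refCnt() {
        return refCnt;
    }
}
```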

normanmaurer (Member) commented on Nov 19, 2018

@njhill I'll need to check in detail. In the meantime, PTAL at #8574.

normanmaurer (Member) commented on Nov 20, 2018

@njhill @merlimat @carl-mastrangelo also somewhat related to this issue: #8576

This issue was added to the 4.1.32.Final milestone on Dec 6, 2018.