Closed
I'm investigating an issue with a custom object that inherits from `AbstractReferenceCounted`, and looking at the code I have a doubt.
My expectation is that after calling `retain()`, if I didn't get an exception, the object should be guaranteed not to be deallocated.
Looking at the retain and release methods:
The scenario I was considering is:
- Object has `refCnt==1`
- We have 3 threads playing with this object
- Thread-1 calls `obj.release()`
- Thread-2 calls `obj.retain()`, sees the `oldRef==0`, then rolls back the increment (but the rollback is not atomic)
- Thread-3 calls `obj.retain()`, sees `oldRef==1` (from T-2's increment), and therefore thinks the object is not dead
- Thread-1 will call `obj.deallocate()`

Is this a possible scenario or am I missing something?
Would it make a difference to use a loop with `compareAndSet()` (which would probably have a perf penalty)?
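To make the question concrete, here is a minimal sketch (not Netty's actual implementation) of what a `compareAndSet()` loop would look like: a thread that observes `refCnt == 0` fails before ever publishing a transient increment, so the T-2/T-3 race described above cannot happen. Class and method names are illustrative only.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: retain()/release() via a CAS loop instead of
// getAndIncrement() followed by a non-atomic rollback.
final class CasRefCounted {
    private final AtomicInteger refCnt = new AtomicInteger(1);

    int refCnt() {
        return refCnt.get();
    }

    CasRefCounted retain() {
        for (;;) {
            int current = refCnt.get();
            if (current == 0) {
                // Object is dead; fail without ever making it look alive
                // to concurrently retaining threads.
                throw new IllegalStateException("refCnt: 0");
            }
            if (refCnt.compareAndSet(current, current + 1)) {
                return this;
            }
            // Lost a race with another retain/release; re-read and retry.
        }
    }

    boolean release() {
        for (;;) {
            int current = refCnt.get();
            if (current == 0) {
                throw new IllegalStateException("refCnt: 0");
            }
            if (refCnt.compareAndSet(current, current - 1)) {
                if (current == 1) {
                    deallocate();
                    return true;
                }
                return false;
            }
        }
    }

    void deallocate() {
        // Real code would free the underlying resources here.
    }
}
```

The perf penalty mentioned above comes from the retry loop: under contention each thread may spin, whereas a plain `getAndIncrement()` always succeeds in one atomic op.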
Activity
normanmaurer commented on Nov 15, 2018
@carl-mastrangelo PTAL
carl-mastrangelo commented on Nov 15, 2018
@merlimat yes, but at least one thread does see an exception. This was mentioned on the original PR. Also, IIRC, the quiescent state still ends up at the same ref count of zero.
I don't think it's really a recoverable scenario. The Linux kernel uses CAS and makes sure all `retain` calls starting from 0 stay at 0, and all `release` calls from INT_MAX stay at INT_MAX. Netty could copy this if it wants to pay the performance penalty, but from my point of view, the ref counting is not meant to be water-tight.
merlimat commented on Nov 15, 2018
(I hadn't actually checked the PR description for that change)
Yes, in my above example T-1 will get an exception but T-2 will not.
The problem is that I have no way to validate that my object is still valid after calling `retain()`. Eg: we might have a `ByteBuf` pointing to garbage data or leading to NPEs. Sure, this only happens with 3+ threads, but I don't see a workaround for it.
carl-mastrangelo commented on Nov 15, 2018
A reasonable alternative is to call `System.exit` on getting a ref count exception.
merlimat commented on Nov 15, 2018
Take this code for example:
https://github.com/apache/pulsar/blob/579e0dc74039c081301662e5fd55209c1a4abe5c/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/util/RangeCache.java#L83-L96
carl-mastrangelo commented on Nov 15, 2018
From my reading, that code isn't valid. Catching and discarding throwable is going to cause problems anyways. ReferenceCounted doesn't really say what should happen in the event of errors, likely because different implementations might want different behavior.
It's hard to see this as a bug when there isn't expected behavior. Ensuring that every retain/release throws upon reaching an invalid state is costly, and not every use may want to pay that cost.
Additionally, you have a reasonable work around: You could wrap the objects in your cache in a custom implementation of ReferenceCounted and rely on the exact error behavior you want.
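The wrapper approach suggested above could be sketched roughly as follows. To keep the example self-contained it uses a hand-rolled holder class rather than Netty's `io.netty.util.ReferenceCounted`; all names here are hypothetical, not Netty API.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the cache stores RefCountedHolder<V> instead of V, and
// the holder enforces the exact strict fail-on-dead semantics the cache wants,
// regardless of how the wrapped value's own ref counting behaves.
final class RefCountedHolder<V> {
    private final AtomicInteger refCnt = new AtomicInteger(1);
    private volatile V value;

    RefCountedHolder(V value) {
        this.value = value;
    }

    // Returns the value only if the holder is still alive; on success the
    // caller owns one extra reference and must release() it later.
    V retainAndGet() {
        for (;;) {
            int current = refCnt.get();
            if (current == 0) {
                throw new IllegalStateException("already deallocated");
            }
            if (refCnt.compareAndSet(current, current + 1)) {
                return value;
            }
        }
    }

    // Returns true when this call dropped the last reference.
    boolean release() {
        for (;;) {
            int current = refCnt.get();
            if (current == 0) {
                throw new IllegalStateException("already deallocated");
            }
            if (refCnt.compareAndSet(current, current - 1)) {
                if (current == 1) {
                    value = null; // real code would free the wrapped value here
                    return true;
                }
                return false;
            }
        }
    }
}
```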
merlimat commented on Nov 15, 2018
I can certainly fix that by having a different `AbstractReferenceCounted` base class, though my concern is that this will still apply to `ByteBuf` instances coming from a pool, where I cannot assess whether the buffer is pointing to valid data or not.
carl-mastrangelo commented on Nov 15, 2018
Looking more thoroughly at that code, it seems the issue is that adding/removing values from the cache and retaining/releasing them is not done atomically. If it were, you could use wrapper objects to do the ref counting and not have to extend or override the Value objects.
merlimat commented on Nov 15, 2018
In any case, 83a19d5 changed a semantic that used to work before.
If `retain()` doesn't offer the guarantee that an object is valid, at the very minimum it should be printed in all caps in red in the documentation :)
carl-mastrangelo commented on Nov 15, 2018
@normanmaurer One thing we could do is make AbstractReferenceCountedByteBuf override retain/release, revert my PR, and move that optimization down the class hierarchy. It's unclear to me if io.netty.util is a utility package for Netty to use, or for users of Netty to use. I don't like breaking existing behavior, but the speedup is going to be painful to give back.
merlimat commented on Nov 15, 2018
@carl-mastrangelo would it work to have some system property to control which implementation is being used?
carl-mastrangelo commented on Nov 15, 2018
@merlimat I'm reluctant to make it configuration based, because that tends not to play well when Netty is shaded. It could work, though I can't make that call.
merlimat commented on Nov 16, 2018
Would it make sense to have options for both strict and relaxed semantics in the `ReferenceCounted` interface? Eg:
- `void retain()` -> requires the object to have `refCnt >= 1`, otherwise undefined behavior
- `boolean tryRetain()` -> returns true if it was able to keep the object alive, false otherwise (and no exception)

2 remaining items
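The proposed relaxed `tryRetain()` could be sketched like this (an illustration of the suggested semantics, not Netty code): the same CAS loop as a strict `retain()`, but observing a dead object yields `false` instead of an exception, so callers such as a cache lookup can treat "already released" as an ordinary miss.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the proposed boolean tryRetain() semantic.
final class TryRetainCounter {
    private final AtomicInteger refCnt;

    TryRetainCounter(int initial) {
        refCnt = new AtomicInteger(initial);
    }

    // Relaxed semantic: never throws, never publishes a transient increment
    // on a dead object; simply reports whether it kept the object alive.
    boolean tryRetain() {
        for (;;) {
            int current = refCnt.get();
            if (current == 0) {
                return false; // object already dead: no exception, no rollback
            }
            if (refCnt.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    int refCnt() {
        return refCnt.get();
    }
}
```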
normanmaurer commented on Nov 16, 2018
@carl-mastrangelo just to be more clear: I think we need to guarantee that once an object was released (which means `release(...)` returned true), all other calls to `release(...)` and `retain(...)` must throw.
njhill commented on Nov 19, 2018
@carl-mastrangelo @normanmaurer not sure this is of interest but I thought of a way to retain roughly half of the speedup and address most of the race condition problems, see 0c0e092e8588d5c6729080b5ae7345e041470616. Apologies if this is just the same as your already-dismissed idea.
In particular, this guarantees @normanmaurer's above requirement in all cases and also ensures there can never be a double deallocation (as is possible with the current logic). However, it can suffer from some of the existing problems if releases coincide with retains overflowing (above `Integer.MAX_VALUE + 1`). Maybe that still disqualifies it for consideration, in which case nevermind!
A couple of other observations:
- Ensure `retain(...)` will always throw once it has thrown once.
normanmaurer commented on Nov 19, 2018
@njhill will need to check in detail. In the meantime PTAL #8574
normanmaurer commented on Nov 20, 2018
@njhill @merlimat @carl-mastrangelo also somewhat related to this issue: #8576
Harden ref-counting concurrency semantics (#8583)