Skip to content

[FLINK-11094] [rocksdb] Make non-accessed restored state in RocksDBStateBackend snapshottable #7264

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 13 commits into from

Conversation

tzulitai
Copy link
Contributor

@tzulitai tzulitai commented Dec 9, 2018

What is the purpose of the change

The ultimate goal of this PR is to fix a bug where NPE is thrown if a non-accessed restored state in RocksDBKeyedStatebackend was snapshotted.
This was an overlooked issue caused by changes in FLINK-10679.

The problem is that in that change, in the RocksDBKeyedBackend, RegisteredStateMetaInfoBases were no longer created eagerly for all restored state, but instead only lazily created when the state was accessed again by the user. This causes non-accessed restored state to have empty meta info, and throws NPE when trying to take a snapshot of them.

The rationale behind FLINK-10679 was that, since RegisteredStateMetaInfoBase holds already serializer instances for state access, creating them eagerly at restore time with restored serializer snapshots did not make sense (because at that point-in-time, we do not have the new serializers yet for state access; the snapshot is only capable of creating the previous state serializer).

This PR fixes this by letting state meta infos hold a StateSerializerProvider, instead of serializer instances. This allows meta infos to be instantiated from snapshots without eagerly accessing the restore serializer.

The API for the StateSerializerProvider is as follows:

public abstract class StateSerializerProvider<T> {
    TypeSerializer<T> currentSchemaSerializer();
    TypeSerializerSchemaCompatibility<T> registerSerializerForRestoredState(TypeSerializer<T> newRegisteredSerializer);
    TypeSerializer<T> previousSchemaSerializer();

    // factory methods for provider:
    public static <T> StateSerializerProvider<T> fromNewState(TypeSerializer<T> registeredSerializer);
    public static <T> StateSerializerProvider<T> fromRestoredState(TypeSerializerSnapshot<T> snapshot);
}

A StateSerializerProvider can be created either from:

  1. A restored serializer snapshot when restoring the state.
  2. A fresh, new state's serializer, when registering the state for the first time.

For 1), state that has not been accessed yet after the restore will return the same serializer (i.e. the serializer for previous schema) for both currentSchemaSerializer and previousSchemaSerializer, meaning that since no new serializer has been registered, the schema remains the same anyways.
Once a restored state is re-accessed, then registerSerializerForRestoredState(TypeSerializer<T> newSerializer) should be used to update what serializer the provider returns in currentSchemaSerializer.

Brief change log

Main changes are:

  • First 3 hotfixes (3c99bf3 to 074e6aa) is some preliminary code cleanup / test hardening for the StateBackendMigrationTestBase.
  • 32cf5ab: Introduce the new StateSerializerProvider abstraction and added tests for it
  • 239e942: Eagerly create meta infos from restored snapshots, so that they are created even if the state isn't accessed.
  • 5e9d568: Let meta infos use StateSerializerProvider instead of serializer instance
  • 7cd35c7 to e13e4b3: Remove all restoredXXStateMetaInfos maps from all state backends. They can be removed because all restored meta infos are now implicitly part of the eagerly created RegisteredStateMetaInfoBases (which is kept in the registeredXXXMetaInfo maps in all state backends).
  • dc7fd17: Add test coverage for snapshotting non-accessed restored state

Verifying this change

This PR adds a new test StateBackendTestBase.testSnapshotNonAccessedState() to cover the issue.
The test fails for RocksDB state backend case without the fixes.

Also, unit tests are added for StateSerializerProvider, in StateSerializerProviderTest.

All existing migration related tests in StateBackendMigrationTestBase are also related and should not be failing.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

… tests should fail if exception isn't thrown

This commit strengthens tests in StateBackendMigrationTestBase that
depend on a certain state operation (restoring state, accessing state,
etc.) to be failing to assert correct behaviour. However, we previously
do not really fail the test if no exception was thrown when there should
be.

This also caught some bugs in the test itself which had the tests
verifying incorrect behaviour.
…ackend should always be compatible

Previously, we were only checking if the new namespace serializer is
incompatible, while properly we should be checking that it is strictly
compatible.

This doesn't affect any user expected behaviour, since the namespace
serializer is never exposed to users.
…ore understandable

The StateBackendMigrationTestBase previously mocked user behaviour of
upgrading serializers by using a single serializer class, that can be
configured with different target compatibility results when they are
checked for compatibility.

This is a bit hard to understand, also doesn't really reflect how a user
would actually approach the feature. Instead, instead of configuring a
single serializer class with different compatibility "personalities",
this commit uses actual different classes, V1TestTypeSerializer,
V2TestTypeSerializer, and IncompatibleTestTypeSerializer, to simulate
the compatibility cases in tests.

This commit also refactors the serializer migration related test
serializers / types / snapshots to the testutil package, so that it can
be shared by other state migration related tests in the future.
A StateSerializerProvider  wraps logic on how to obtain
serializers for registered state, either with the previous
schema of state in checkpoints or the current schema of state.
…in RocksDBKeyedStateBackend

This ensures that all restored state, even non-accessed ones after the
restore, have a meta info available on future snapshots.
…erProviders

This commit lets all state meta info subclasses use
StateSerializerProviders to replace direct serializer instances. This
allows meta infos that were instantiate with restored serializer
snapshots to not eagerly access the restore serializer when restoring
state. This needs to be avoided since when restoring from 1.6, the
restore serializer might not be available; for RocksDB, this should be
tolerable.

This further coherents with the principle of not accessing the restore
serializer unless neccessary.
… map from RocksDBKeyedStateBackend

With the fact that now restored serializer snapshots are held by their
eagerly created meta infos in the kvStateInformation map, we no longer
need to keep a separate map of restored meta info snapshots. As can be
seen in the changes, that map is no longer queried anywhere.
…apKeyedStateBackend

Likewise in the RocksDBKeyedStateBackend, since restored snapshots are
now part of the registered meta info map, we no longer need this map.
This commit also reflects the principal changes we made to the
RegisteredKeyValueStateMetaInfo class in the HeapKeyedStatBackend.
…faultOperatorStateBackend

Likewise in the RocksDBKeyedStateBackend, since restored snapshots
are now part of the registered meta info map, we no longer need these
maps. This commit also reflects the principal changes we made to the
RegisteredKeyValueStateMetaInfo class in the DefaultOperatorStateBackend.
…izer(...) from StateMetaInfoSnapshot

That method is no longer used after the series of changes in
FLINK-11094. This also corresponds to the new principle that the restore
serializer is only accessed, when the state backends attempt request it
from their state meta infos. We do not create restore serializers
eagerly when creating meta infos from a StateMetaInfoSnapshot.
…alizer snapshot in StateMetaInfoSnapshot

This corresponds to the abstraction rework of retiring
TypeSerializerConfigSnapshot to be replaced by TypeSerializerSnapshot.
The related fields / methods should not mention "config" anymore.
@igalshilman
Copy link
Contributor

Hey @tzulitai, thanks for addressing this issue!
I went through the commits and they align to our discussions offline, it looks good to me 👍

tzulitai added a commit to tzulitai/flink that referenced this pull request Dec 11, 2018
… map for restored StateMetaInfoSnapshots

Since now all restored state meta info snapshots are handled so that we
always eagerly create the corresponding RegisteredStateMetaInfoBase for
it, the information is already part of the registered state infos map.

As can be seen in the changes, those maps are no longer queried and can
therefore be safely removed.

This closes apache#7264.
@asfgit asfgit closed this in 97f556e Dec 11, 2018
tzulitai added a commit to tzulitai/flink that referenced this pull request Dec 11, 2018
… map for restored StateMetaInfoSnapshots

Since now all restored state meta info snapshots are handled so that we
always eagerly create the corresponding RegisteredStateMetaInfoBase for
it, the information is already part of the registered state infos map.

As can be seen in the changes, those maps are no longer queried and can
therefore be safely removed.

This closes apache#7264.
asfgit pushed a commit that referenced this pull request Dec 11, 2018
… map for restored StateMetaInfoSnapshots

Since now all restored state meta info snapshots are handled so that we
always eagerly create the corresponding RegisteredStateMetaInfoBase for
it, the information is already part of the registered state infos map.

As can be seen in the changes, those maps are no longer queried and can
therefore be safely removed.

This closes #7264.
tisonkun pushed a commit to tisonkun/flink that referenced this pull request Jan 17, 2019
… map for restored StateMetaInfoSnapshots

Since now all restored state meta info snapshots are handled so that we
always eagerly create the corresponding RegisteredStateMetaInfoBase for
it, the information is already part of the registered state infos map.

As can be seen in the changes, those maps are no longer queried and can
therefore be safely removed.

This closes apache#7264.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants