
Introduce maxMessagePublishBufferSizeInMB configuration to avoid broker OOM #6178

Merged
merged 6 commits into from Feb 16, 2020

Conversation

codelipenghui
Contributor

Master Issue: #5751

Motivation

Introduce maxMessagePublishBufferSizeInMB configuration to avoid broker OOM.

Modifications

If the in-flight message size exceeds this value, the broker will stop reading data from the connection. Once the buffered size drops below half of maxMessagePublishBufferSizeInMB, the broker resumes auto-reading data from the connection.
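The flow described above can be sketched as follows. This is a minimal, illustrative model, not the actual Pulsar classes: the connection's read switch is represented by a boolean where the broker would call Netty's `channel.config().setAutoRead(...)`.

```java
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch of the publish-buffer backpressure described above.
class PublishBufferLimiter {
    private final long maxBytes;        // maxMessagePublishBufferSizeInMB in bytes
    private final long resumeThreshold; // half of maxBytes
    private final LongAdder pendingBytes = new LongAdder();
    private volatile boolean autoRead = true;

    PublishBufferLimiter(long maxMessagePublishBufferSizeInMB) {
        this.maxBytes = maxMessagePublishBufferSizeInMB * 1024L * 1024L;
        this.resumeThreshold = maxBytes / 2;
    }

    // Called when a publish request is read from the connection.
    void onMessageRead(long msgSize) {
        pendingBytes.add(msgSize);
        if (maxBytes > 0 && pendingBytes.sum() >= maxBytes) {
            autoRead = false; // broker would stop reading from the channel here
        }
    }

    // Called once the write is persisted and the response is sent.
    void onMessagePersisted(long msgSize) {
        pendingBytes.add(-msgSize);
        if (!autoRead && pendingBytes.sum() < resumeThreshold) {
            autoRead = true; // broker would resume auto-read here
        }
    }

    boolean isAutoRead() { return autoRead; }
}
```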

Verifying this change

Unit tests added

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)
  • The rest endpoints: (no)
  • The admin cli options: (no)
  • Anything that affects deployment: (no)

Documentation

  • Does this pull request introduce a new feature? (no)

@codelipenghui codelipenghui added the type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages label Jan 31, 2020
@codelipenghui codelipenghui added this to the 2.6.0 milestone Jan 31, 2020
@codelipenghui codelipenghui self-assigned this Jan 31, 2020
if (maxMessagePublishBufferSize < 0) {
return false;
}
if (currentMessagePublishBufferSize.addAndGet(msgSize) >= maxMessagePublishBufferSize &&
Contributor

This would become a contention point across all the threads in the broker

Contributor Author

@merlimat Yes, this does increase contention. How about moving currentMessagePublishBufferSize to ServerCnx and periodically syncing it to totalMessagePublishBufferSize in BrokerService with a single thread?

Of course this introduces some delay, but it reduces the contention.

@jiazhai
Member

jiazhai commented Feb 5, 2020

@codelipenghui Thanks for the work. It is a good approach to avoid OOM.
@merlimat Thanks for the comments. It seems it is not easy to avoid the contention while still accurately tracking the memory usage. Are there any suggestions for this?

@jiazhai
Member

jiazhai commented Feb 10, 2020

ping @merlimat

@sijie
Member

sijie commented Feb 10, 2020

@codelipenghui

how about move currentMessagePublishBufferSize to ServerCnx and periodically sync them to totalMessagePublishBufferSize in BrokerService by a single thread?

this sounds good to me. Also consider using LongAdder rather than AtomicLong.
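A rough sketch of that combined suggestion (per-connection LongAdder counters, summed periodically by one thread): the class names here are illustrative stand-ins, not the actual ServerCnx/BrokerService code.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.LongAdder;

// Each connection tracks its own publish bytes in a LongAdder; a single
// monitor thread periodically sums them, so producer threads never
// contend on one shared AtomicLong.
class Cnx {
    final LongAdder messagePublishBufferSize = new LongAdder();
}

class BufferMonitor {
    private final List<Cnx> connections = new CopyOnWriteArrayList<>();
    private volatile long totalMessagePublishBufferBytes;

    void register(Cnx cnx) { connections.add(cnx); }

    // Runs on a single scheduled thread, e.g. every 100 ms.
    void check() {
        long total = 0;
        for (Cnx cnx : connections) {
            total += cnx.messagePublishBufferSize.sum();
        }
        totalMessagePublishBufferBytes = total;
    }

    long total() { return totalMessagePublishBufferBytes; }
}
```

The trade-off is exactly the one discussed above: the aggregate lags by up to one check interval, in exchange for removing the hot-path contention.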

@codelipenghui
Contributor Author

@merlimat @sijie @jiazhai I have applied the comment, please help take a look, thanks.

conf/broker.conf Outdated

# Interval between checks to see if message publish buffer size is exceed the max message publish buffer size
# Use 0 or negative number to disable the max publish buffer limiting.
messagePublishBufferCheckIntervalInMills=100
Member

Suggested change
messagePublishBufferCheckIntervalInMills=100
messagePublishBufferCheckIntervalInMillis=100

+ " but broker have not send response to client, usually waiting to write to bookies.\n\n"
+ " It's shared across all the topics running in the same broker.\n\n"
+ " Use -1 to disable the memory limitation. Default is 1/5 of direct memory.\n\n")
private int maxMessagePublishBufferSizeInMB = Math.max(64,
Member

The default value should make the broker behave as close as possible to the behavior without this code change. I understand we want to enable the rate-limiting feature, so should we make the default value 60% or 70% of max direct memory? Otherwise, people might experience unexpected performance issues when they upgrade a broker from an old version to a newer one.

Contributor Author

Maybe we'd better keep the default value at -1.

Contributor Author

Changed the default buffer size to half of the direct memory.
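Assuming the broker derives this default from the JVM's max direct memory (in Pulsar that figure would come from something like `io.netty.util.internal.PlatformDependent.maxDirectMemory()`), the computation might look like this illustrative helper, with the 64 MB floor from the snippet above:

```java
// Illustrative helper: default publish buffer = half of direct memory,
// floored at 64 MB.
class PublishBufferDefaults {
    static int defaultPublishBufferMb(long maxDirectMemoryBytes) {
        return (int) Math.max(64, maxDirectMemoryBytes / 2 / 1024 / 1024);
    }
}
```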

@@ -257,6 +268,8 @@ public BrokerService(PulsarService pulsar) throws Exception {
.newSingleThreadScheduledExecutor(new DefaultThreadFactory("pulsar-msg-expiry-monitor"));
this.compactionMonitor =
Executors.newSingleThreadScheduledExecutor(new DefaultThreadFactory("pulsar-compaction-monitor"));
this.messagePublishBufferMonitor =
Member

We should create this executor only when the feature is enabled.

Also I see we are creating more and more schedulers. Can we consider reusing some of the executors?

Contributor Author

Hmm, I think we need a different thread name. It's better for jstack analysis.

@@ -2011,4 +2033,34 @@ public ConfigField(Field field) {
return Optional.empty();
}
}

private void checkMessagePublishBuffer() {
currentMessagePublishBufferSize = 0;
Member

It seems to me that this variable doesn't have to be a class variable of BrokerService. It can just be a local variable, right?

Contributor Author

Yes, it is.

private final long maxMessagePublishBufferSize;
private final long resumeProducerReadMessagePublishBufferSize;
private volatile long currentMessagePublishBufferSize;
private volatile boolean isMessagePublishBufferThreshold;
Member

Suggested change
private volatile boolean isMessagePublishBufferThreshold;
private volatile boolean reachMessagePublishBufferThreshold;

@@ -216,8 +218,17 @@
private Channel listenChannel;
private Channel listenChannelTls;

private final long maxMessagePublishBufferSize;
private final long resumeProducerReadMessagePublishBufferSize;
private volatile long currentMessagePublishBufferSize;
Member

Suggested change
private volatile long currentMessagePublishBufferSize;
private volatile long currentMessagePublishBufferBytes;

I prefer using bytes rather than size to make the unit more explicit.

@codelipenghui codelipenghui merged commit 91dfa1a into apache:master Feb 16, 2020
kaynewu added a commit to kaynewu/pulsar that referenced this pull request Mar 10, 2020
* [Issue 5904]Support `unload` all partitions of a partitioned topic (apache#6187)

Fixes apache#5904 

### Motivation
Pulsar supports unloading a non-partitioned topic or a single partition of a partitioned topic. For a partitioned topic with many partitions, users currently need to list all partitions and unload them one by one. We need to support unloading all partitions of a partitioned topic.

* [Issue 4175] [pulsar-function-go] Create integration tests for Go Functions for production-readiness (apache#6104)

This PR is to provide integration tests that test execution of Go functions that are managed by the Java FunctionManager. This will allow us to test things like behavior during function timeouts, heartbeat failures, and other situations that can only be effectively tested in an integration test. 

Master issue: apache#4175
Fixes issue: apache#6204 

### Modifications

We must add Go to the integration testing logic. We must also build the Go dependencies into the test Dockerfile to ensure the Go binaries are available at runtime for the integration tests.

* [Issue 5999] Support create/update tenant with empty cluster (apache#6027)

### Motivation

Fixes apache#5999

### Modifications

Add the logic to handle the blank cluster name.

* Introduce maxMessagePublishBufferSizeInMB configuration to avoid broker OOM (apache#6178)

Motivation
Introduce maxMessagePublishBufferSizeInMB configuration to avoid broker OOM.

Modifications
If the in-flight message size exceeds this value, the broker will stop reading data from the connection. Once the buffered size drops below half of maxMessagePublishBufferSizeInMB, the broker resumes auto-reading data from the connection.

* Enable get precise backlog and backlog without delayed messages. (apache#6310)

Fixes apache#6045 apache#6281 

### Motivation

Enable get precise backlog and backlog without delayed messages.

### Verifying this change

Added new unit tests for the change.

* KeyValue schema support for pulsar sql (apache#6325)

Fixes apache#5560

### Motivation

Currently, Pulsar SQL can't read keyValue schema data. This PR adds support for Pulsar SQL to read messages with a key-value schema.

### Modifications

Add KeyValue schema support for Pulsar SQL. Add prefix __key. for the key field name.

* Avoid get partition metadata while the topic name is a partition name. (apache#6339)

Motivation

Avoid getting partition metadata when the topic name is already a partition name.
Currently, if users want to skip all messages for a partitioned topic or unload a partitioned topic, the broker will call get topic metadata many times. For a topic name that already carries the partition suffix, it is not necessary to call get partitioned topic metadata again.

* explicit statement env 'BOOKIE_MEM' and 'BOOKIE_GC' for values-mini.yaml (apache#6340)

Fixes apache#6338

### Motivation
This commit started while I was using helm in my local minikube; I noticed that there's a mismatch between the `values-mini.yaml` and `values.yaml` files. At first I thought it was a copy/paste error, so I created apache#6338;

Then I looked into the details of how these env vars [were used](https://github.com/apache/pulsar/blob/28875d5abc4cd13a3e9cc4f59524d2566d9f9f05/conf/bkenv.sh#L36), and found out it's OK to use `PULSAR_MEM` as an alternative. But it introduces problems:
1. Since `BOOKIE_GC` was not defined, the default [BOOKIE_EXTRA_OPTS](https://github.com/apache/pulsar/blob/28875d5abc4cd13a3e9cc4f59524d2566d9f9f05/conf/bkenv.sh#L39) will finally use the default value of `BOOKIE_GC`, and would thus override the JVM parameters defined earlier in `PULSAR_MEM`.
2. It may cause problems when the bootstrap scripts change in later development; better to make it explicit.

So I created this PR to solve the above (hidden) problems.

### Modifications

As mentioned above, I've made such modifications below:
1. make `BOOKIE_MEM` and `BOOKIE_GC` explicit in `values-mini.yaml` file.  Keep up with the format in`values.yaml` file.
2. remove all  print-gc-logs related args. Considering the resource constraints of minikube environment. The removed part's content is `-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC -verbosegc -XX:G1LogLevel=finest`
3. leave `PULSAR_PREFIX_dbStorage_rocksDB_blockCacheSize` empty as usual, as [conf/standalone.conf#L576](https://github.com/apache/pulsar/blob/df152109415f2b10dd83e8afe50d9db7ab7cbad5/conf/standalone.conf#L576) says it would to use 10% of the direct memory size by default.

* Fix java doc for key shared policy. (apache#6341)

The key shared policy does not support setting the maximum key hash range, so fix the java doc.

* client: make SubscriptionMode a member of ConsumerConfigurationData (apache#6337)

Currently, SubscriptionMode is a parameter used to create ConsumerImpl, but it is not exported, so users cannot set this value for a consumer. This change makes SubscriptionMode a member of ConsumerConfigurationData, so users can set this parameter when creating a consumer.

* Windows CMake corrections (apache#6336)

* Corrected method of specifying Windows path to LLVM tools

* Fixing windows build

* Corrected the dll install path

* Fixing pulsarShared paths

* remove future.join() from PulsarSinkEffectivelyOnceProcessor (apache#6361)

* use checkout@v2 to avoid fatal: reference is not a tree (apache#6386)

"fatal: reference is not a tree" is a known issue in actions/checkout#23 and fixed in checkout@v2, update checkout used in GitHub actions.

* [Pulsar-Client] Stop shade snappy-java in pulsar-client-shaded (apache#6375)

Fixes apache#6260 

Snappy, like other compression codecs (LZ4, ZSTD), depends on native libraries to do the real encode/decode work. When we shade them in a fat jar, only the Java implementations of the Snappy classes are shaded, leaving the JNI layer incompatible with the underlying C++ code.

We should just remove the shade for snappy, and let maven import its lib as a dependency.

I've tested the shaded jar generated locally by this PR; it works for all compression codecs.

* Fix CI not triggered (apache#6397)

In apache#6386 , checkout@v2 is brought in for checkout.

However, it checks out the PR merge commit by default, which breaks the diff-only action that looks for the commits a PR is based on, and makes all tests skipped.

This PR fixes this issue. and has been proven to work with apache#6396 Brokers/unit-tests.

* [Issue 6355][HELM] autorecovery - could not find or load main class (apache#6373)

This applies the recommended fix from
apache#6355 (comment)

Fixes apache#6355

### Motivation

This PR corrects the configmap data which was causing the autorecovery pod to crashloop
with `could not find or load main class`

### Modifications

Updated the configmap var data per [this comment](apache#6355 (comment)) from @sijie

* Creating a topic does not wait for creating cursor of replicators (apache#6364)

### Motivation

Creating a topic does not wait for creating cursor of replicators

### Verifying this change

The existing unit tests cover this change.

* [Reader] Should set either start message id or start message from roll back duration. (apache#6392)

Currently, when constructing a reader, users can set both start message id and start time. 

This is strange and the behavior should be forbidden.

* Seek to the first one >= timestamp (apache#6393)

The current logic for `resetCursor` by timestamp is odd. The first message it returns is the last message earlier than or equal to the designated timestamp. This "earlier" message should not be emitted.

* [Minor] Fix java code errors reported by lgtm.  (apache#6398)

Four kinds of errors are fixed in this PR:

- Array index out of bounds
- Inconsistent equals and hashCode
- Missing format argument
- Reference equality test of boxed types

According to https://lgtm.com/projects/g/apache/pulsar/alerts/?mode=tree&severity=error&id=&lang=java

* [Java Reader Client] Start reader inside batch result in read first message in batch. (apache#6345)

Fixes apache#6344 
Fixes apache#6350

The bug was brought in apache#5622 by changing the skip logic wrongly.

* Fix broker to specify a list of bookie groups. (apache#6349)

### Motivation

Fixes apache#6343

### Modifications

Add a method to cast object value to `String`.

* Fixed enum package not found (apache#6401)

Fixes apache#6400

### Motivation
This problem is blocking the current test. Version 1.1.8 of `enum34` seems to have some problems, and the problem reproduces as follows:

Use pulsar latest code:
```
cd pulsar
mvn clean install -DskipTests
docker pull apachepulsar/pulsar-build:ubuntu-16.04
docker run -it -v $PWD:/pulsar --name pulsar apachepulsar/pulsar-build:ubuntu-16.04 /bin/bash
docker exec -it pulsar /bin/bash
cmake .
make -j4 && make install 
cd python
python setup.py bdist_wheel
pip install dist/pulsar_client-*-linux_x86_64.whl
```
`pip show enum34`
```
Name: enum34
Version: 1.1.8
Summary: Python 3.4 Enum backported to 3.3, 3.2, 3.1, 2.7, 2.6, 2.5, and 2.4
Home-page: https://bitbucket.org/stoneleaf/enum34
Author: Ethan Furman
Author-email: ethan@stoneleaf.us
License: BSD License
Location: /usr/local/lib/python2.7/dist-packages
Requires:
Required-by: pulsar-client, grpcio
```

```
root@55e06c5c770f:/pulsar/pulsar-client-cpp/python# python
Python 2.7.12 (default, Oct  8 2019, 14:14:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from enum import Enum, EnumMeta
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named enum
>>> exit()
```

There is no problem with using 1.1.9 in the test.

### Modifications

* Upgrade enum34 from 1.1.8 to 1.1.9

### Verifying this change

local test pass

* removed comma from yaml config (apache#6402)

* Fix broker client tls settings error (apache#6128)

when broker create the inside client, it sets tlsTrustCertsFilePath as "getTlsCertificateFilePath()", but it should be "getBrokerClientTrustCertsFilePath()"

* [Issue 3762][Schema] Fix the problem with parsing of an Avro schema related to shading in pulsar-client. (apache#6406)

Motivation
Avro schemas are quite important for proper data flow, and it is a pity that the apache#3762 issue stayed untouched for so long. There were some workarounds for making Pulsar use an original Avro schema, but in the end, it is pretty hard to run an enterprise solution on workarounds. With this PR I would like to find a solution to the problem caused by shading Avro in pulsar-client. As discussed in the issue, there are two possible solutions:

1. Unshade the Avro library in the pulsar-client library. (IMHO it seems like a proper solution for this problem, but it also brings a risk of unknown side effects.)
2. Use reflection to get original schemas from generated classes. (I went for this solution.)

Could you please comment on whether this is a proper solution for the problem? I will add tests once my approach is confirmed.

Modifications
First, we try to extract the original Avro schema from the "$SCHEMA" field using reflection. If that doesn't work, the process falls back to generating the schema from the POJO.

* Remove duplicated lombok annotations in the tests module (apache#6412)

* Add verification for SchemaDefinitionBuilderImpl.java (apache#6405)

### Motivation

Add verification for SchemaDefinitionBuilderImpl.java

### Verifying this change

Added a new unit test.

* Cleanup pom files in the tests module (apache#6421)

### Modifications

- Removed dependencies on test libraries that were already imported in the parent pom file.

- Removed groupId tags that are inherited from the parent pom file.

* Update BatchReceivePolicy.java (apache#6423)

BatchReceivePolicy implements Serializable.

* Consumer received duplicated delayed messages upon restart

Fix a case where, when delayed messages are sent, a consumer that restarts pulls duplicate messages. apache#6403

* Bump netty version to 4.1.45.Final (apache#6424)

netty 4.1.43 has a bug preventing it from using Linux native Epoll transport

This results in pulsar brokers failing over to NioEventLoopGroup even when running on Linux.

The bug is fixed in netty releases 4.1.45.Final

* Fix publish buffer limit does not take effect

Motivation
If maxMessagePublishBufferSizeInMB is set to a value greater than Integer.MAX_VALUE / 1024 / 1024, the publish buffer limit does not take effect. The reason is that maxMessagePublishBufferBytes overflows in int arithmetic (evaluating to 0) with the following calculation:

pulsar.getConfiguration().getMaxMessagePublishBufferSizeInMB() * 1024 * 1024;

So it was changed to:

pulsar.getConfiguration().getMaxMessagePublishBufferSizeInMB() * 1024L * 1024L;
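The effect of the `L` suffix can be shown with a small, self-contained example (hypothetical method names, not the Pulsar code):

```java
// Demonstrates the int-overflow bug behind the fix above.
class OverflowDemo {
    static long wrongLimit(int maxMessagePublishBufferSizeInMB) {
        // int * int * int overflows before being widened to long
        return maxMessagePublishBufferSizeInMB * 1024 * 1024;
    }

    static long fixedLimit(int maxMessagePublishBufferSizeInMB) {
        // the 1024L literals force long arithmetic from the first multiply
        return maxMessagePublishBufferSizeInMB * 1024L * 1024L;
    }
}
```

For example, with 4096 MB (4 GB) the int product is exactly 2^32 and wraps to 0, which is why the limit silently never triggered.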

* doc: Add the missing right parenthesis (apache#6426)

* Add the missing right parenthesis

doc: Missing right parenthesis in the `token()` line of the Pulsar Client Java code.

* Add the missing right parenthesis on line L70

* Switch from deprecated MAINTAINER tag to LABEL with maintainer's info in Dockerfile (apache#6429)

Motivation & Modification
The MAINTAINER instruction is deprecated in favor of the LABEL instruction with the maintainer's info in docker files.

* Amend the default value of . (apache#6374)

* Fix the bug that authenticationData isn't initialized. (apache#6440)

Motivation
Fix the bug that authenticationData isn't initialized.

The method org.apache.pulsar.proxy.server.ProxyConnection#handleConnect doesn't initialize the value of authenticationData,
so you will get a null value from the method org.apache.pulsar.broker.authorization.AuthorizationProvider#canConsumeAsync
when implementing the org.apache.pulsar.broker.authorization.AuthorizationProvider interface.

Modifications
Initialize the value of authenticationData in the method org.apache.pulsar.proxy.server.ProxyConnection#handleConnect.

Verifying this change
Implement the org.apache.pulsar.broker.authorization.AuthorizationProvider interface and get the value of authenticationData.

* Remove duplicated test libraries in POM dependencies (apache#6430)

### Motivation
The removed test libraries were already defined in the parent pom

### Modification
Removed duplicated test libraries in POM dependencies

* Add a message on how to make log refresh immediately when starting a component (apache#6078)

### Motivation

Some users may be confused by pulsar/bookie logs not flushing immediately.

### Modifications

Add a message in `bin/pulsar-daemon` when starting a component.

* Close ZK before canceling future with exception (apache#6228) (apache#6399)

Fixes apache#6228

* [Flink-Connector]Get PulsarClient from cache should always return an open instance (apache#6436)

* Update sidebars.json (apache#6434)

The referenced markdown files do not exist and so the "Next" and "Previous" buttons on the bottom of pages surrounding them result in 404 Not Found errors

* [Broker] Create namespace failed when TLS is enabled in PulsarStandalone (apache#6457)

When starting Pulsar in standalone mode with TLS enabled, it will fail to create two namespaces during start. 

This is because it's using the unencrypted URL/port while constructing the PulsarAdmin client.

* Update version-2.5.0-sidebars.json (apache#6455)

The referenced markdown files do not exist and so the "Next" and "Previous" buttons on the bottom of pages surrounding them result in 404 Not Found errors

* [Issue 6168] Fix Unacked Message Tracker by Using Time Partition on C++ (apache#6391)

### Motivation
Fix apache#6168 .
>On C++ lib, like the following log, unacked messages are redelivered after about 2 * unAckedMessagesTimeout.

### Modifications
Same as apache#3118: fixed `UnackedMessageTracker` by using TimePartition.
- Add `TickDurationInMs`
- Add `redeliverUnacknowledgedMessages`, which requires `MessageIds`, to `ConsumerImpl`, `MultiTopicsConsumerImpl` and `PartitionedConsumerImpl`.

* [ClientAPI]Fix hasMessageAvailable() (apache#6362)

Fixes apache#6333 

Previously, `hasMoreMessages` was tested as:
```
return lastMessageIdInBroker.compareTo(lastDequeuedMessage) == 0
                && incomingMessages.size() > 0;
```
However, `incomingMessages` could be 0 when the consumer/reader has just started and hasn't received any messages yet.

In this PR, the last entry is retrieved and decoded to get the message metadata for populating the batchIndex field.

* Use System.nanoTime() instead of System.currentTimeMillis() (apache#6454)

Fixes apache#6453 

### Motivation
`ConsumerBase` and `ProducerImpl` use `System.currentTimeMillis()` to measure the elapsed time in the 'operations' inner classes (`ConsumerBase$OpBatchReceive` and `ProducerImpl$OpSendMsg`).

An instance variable `createdAt` is initialized with `System.currentTimeMillis()`, but it is not used for reading wall-clock time; the variable is only used for computing elapsed time (e.g. the timeout for a batch).

When the variable is used to compute elapsed time, it makes more sense to use `System.nanoTime()`.

### Modifications

The instance variable `createdAt` in `ConsumerBase$OpBatchReceive` and `ProducerImpl$OpSendMsg` is initialized with `System.nanoTime()`. Usage of the variable is updated to reflect that it holds nano time; computations of elapsed time take the difference between the current system nano time and the `createdAt` variable.

The `createdAt` field is package protected, and is currently only used in the declaring class and outer class, limiting the chances for unwanted side effects.
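A minimal sketch of the pattern (an illustrative class, not the actual `OpSendMsg`/`OpBatchReceive` code): the nanoTime reading is only ever subtracted from another nanoTime reading, never interpreted as wall-clock time.

```java
import java.util.concurrent.TimeUnit;

// Illustrative elapsed-time tracking with a monotonic clock.
class TimedOp {
    final long createdAtNanos = System.nanoTime();

    // True once at least timeoutMillis of elapsed time has passed.
    boolean hasTimedOut(long timeoutMillis) {
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - createdAtNanos);
        return elapsedMillis >= timeoutMillis;
    }
}
```

Unlike `currentTimeMillis()`, `nanoTime()` is unaffected by NTP adjustments or clock changes, so elapsed-time computations cannot go negative or jump.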

* Fixed the max backoff configuration for lookups (apache#6444)

* Fixed the max backoff configuration for lookups

* Fixed test expectation

* More test fixes

* upgrade scala-maven-plugin to 4.1.0 (apache#6469)

### Motivation
The Pulsar examples include some third-party libraries with security vulnerabilities.
- log4j-core-2.8.1
https://www.cvedetails.com/cve/CVE-2017-5645

### Modifications

- Upgraded the version of scala-maven-plugin from 4.0.1 to 4.1.0. log4j-core-2.8.1 were installed because scala-maven-plugin depends on it.

* [pulsar-proxy] fix logging for published messages (apache#6474)

### Motivation
Proxy logging fetches an incorrect producerId for the `Send` command; because of that, logging always gets producerId 0 and fetches an invalid topic name.

### Modification
Fixed topic logging by fetching the correct producerId for the `Send` command.

* [Issue 6394] Add configuration to disable auto creation of subscriptions (apache#6456)

### Motivation

Fixes apache#6394

### Modifications

- provide a flag `allowAutoSubscriptionCreation` in `ServiceConfiguration`, defaults to `true`
- when `allowAutoSubscriptionCreation` is disabled and the specified subscription (`Durable`) on the topic does not exist when trying to subscribe via a consumer, the server should reject the request directly by `handleSubscribe` in `ServerCnx`
- create the subscription on the coordination topic if it does not exist when initializing `WorkerService`

* Make tests more stable by using JSONAssert equals (apache#6435)

Similar to the change you already merged for AvroSchemaTest.java(apache#6247):
`jsonSchema.getSchemaInfo().getSchema()` in `pulsar-client/src/test/java/org/apache/pulsar/client/impl/schema/JSONSchemaTest.java` returns a JSON object. `schemaJson` compares with hard-coded JSON String. However, the order of entries in `schemaJson` is not guaranteed. Similarly, test `testKeyValueSchemaInfoToString` in `pulsar-client/src/test/java/org/apache/pulsar/client/impl/schema/KeyValueSchemaInfoTest.java` returns a JSON object. `havePrimitiveType` compares with hard-coded JSON String, and the order of entries in `havePrimitiveType` is not guaranteed.


This PR proposes to use JSONAssert and modify the corresponding JSON test assertions so that the test is more stable.

### Motivation

Using JSONAssert and modifying the corresponding JSON test assertions so that the test is more stable.

### Modifications

Adding `assertJSONEqual` method and replacing `assertEquals` with it in tests `testAllowNullSchema`, `testNotAllowNullSchema` and `testKeyValueSchemaInfoToString`.

* Avoid calling ConsumerImpl::redeliverMessages() when message list is empty (apache#6480)

* [pulsar-client] fix deadlock on send failure (apache#6488)

* Enhance Authorization by adding TenantAdmin interface (apache#6487)

* Enhance Authorization by adding TenantAdmin interface

* Remove debugging comment

Co-authored-by: Sanjeev Kulkarni <sanjeevk@splunk.com>

* Independent schema is set for each consumer generated by topic (apache#6356)

### Motivation

Master Issue: apache#5454 

When one consumer subscribes to multiple topics, setSchemaInfoProvider() is overwritten by the consumer generated for the last topic.

### Modification
Clone the schema for each consumer generated by a topic.
### Verifying this change
Added a schema test for it.

* Fix memory leak when running topic compaction. (apache#6485)


Fixes apache#6482

### Motivation
Prevent topic compaction from leaking direct memory

### Modifications

Several leaks were discovered using Netty leak detection and code review.
* `CompactedTopicImpl.readOneMessageId` would get an `Enumeration` of `LedgerEntry`, but did not release the underlying buffers. Fix: iterate though the `Enumeration` and release underlying buffer. Instead of logging the case where the `Enumeration` did not contain any elements, complete the future exceptionally with the message (will be logged by Caffeine).
* Two main sources of leaks in `TwoPhaseCompactor`. The `RawBatchConverter.rebatchMessage` method failed to close/release a `ByteBuf` (uncompressedPayload), and the ByteBuf returned by `RawBatchConverter.rebatchMessage` was not closed either. The first one was easy to fix (release the buffer); to fix the second one and make the code easier to read, I decided not to let `RawBatchConverter.rebatchMessage` close the message read from the topic. Instead, the message read from the topic is closed in a try/finally clause surrounding most of the method body handling a message from the topic (in the phase-two loop). Then, if a new message was produced by `RawBatchConverter.rebatchMessage`, we release it after adding it to the compacted ledger.

### Verifying this change
Modified `RawReaderTest.testBatchingRebatch` to show new contract.

One can run the test described to reproduce the issue, to verify no leak is detected.

* Fix create partitioned topic with a substring of an existing topic name. (apache#6478)

Fixes apache#6468

Fix creating a partitioned topic whose name is a substring of an existing topic name, and make partitioned topic creation async.

* Bump jcloud version to 2.2.0 and remove jcloud-shade module (apache#6494)

In jclouds 2.2.0, the [gson is shaded internally](https://issues.apache.org/jira/browse/JCLOUDS-1166). We could safely remove the jcloud-shade module as a cleanup.

* Refactor tests in pulsar client tools test (apache#6472)

### Modifications

The main modification was the reduction of repeated initialization of the variables in the tests.

* Fix Topic metrics documentation (apache#6495)

### Motivation

Motivation is to have correct reference-metrics documentation.

### Modifications

There is an error in the `Topic metrics` section

`pulsar_producers_count` => `pulsar_in_messages_total`

* [pulsar-client] remove duplicate cnx method (apache#6490)

### Motivation
Remove duplicate `cnx()` method for `producer`

* [proxy] Fix proxy routing to functions worker (apache#6486)

### Motivation


Currently, the proxy only works to proxy v1/v2 functions routes to the
function worker.

### Modifications

This changes this code to proxy all routes for the function worker when
those routes match. At the moment this is still a static list of
prefixes, but in the future it may be possible to have this list of
prefixes be dynamically fetched from the REST routes.

### Verifying this change
- added some tests to ensure the routing works as expected

* Fix some async method problems at PersistentTopicsBase. (apache#6483)

* Instead of always using admin access for topic, use read/write/admin access for topic (apache#6504)

Co-authored-by: Sanjeev Kulkarni <sanjeevk@splunk.com>

* [Minor]Remove unused property from pom (apache#6500)

This PR is a follow-up of apache#6494

* [pulsar-common] Remove duplicate RestException references (apache#6475)

### Motivation
Right now, various Pulsar modules each define their own `RestException` class, so the repo contains multiple duplicate classes. Move `RestException` to a common place so that all modules use the same exception class and the duplicates can be removed.

* pulsar-proxy: fix correct name for proxy thread executor name (apache#6460)

### Motivation
Use the correct name for the proxy thread executor.

* Add subscribe initial position for consumer cli. (apache#6442)

### Motivation

In some cases, users expect to consume messages from the beginning, similar to the `--from-beginning` option of the Kafka consumer CLI.

### Modifications

Add `--subscription-position` for `pulsar-client` and `pulsar-perf`.

* [Cleanup] Log format does not match arguments (apache#6509)

* Start namespace service and schema registry service before start broker. (apache#6499)

### Motivation

Once the broker service is started, clients can connect to the broker and send requests that depend on the namespace service, so we should create the namespace service before starting the broker. Otherwise, an NPE occurs.

![image](https://user-images.githubusercontent.com/12592133/76090515-a9961400-5ff6-11ea-9077-cb8e79fa27c0.png)

![image](https://user-images.githubusercontent.com/12592133/76099838-b15db480-6006-11ea-8f39-31d820563c88.png)


### Modifications

Move the namespace service creation and the schema registry service creation before the broker service starts.

* [pulsar-client-cpp] Fix Redelivery of Messages on UnackedMessageTracker When Ack Messages . (apache#6498)

### Motivation
Because of apache#6391, acked messages were counted as unacked messages.
Although messages from the broker were acknowledged, the following log was output.

```
2020-03-06 19:44:51.790 INFO  ConsumerImpl:174 | [persistent://public/default/t1, sub1, 0] Created consumer on broker [127.0.0.1:58860 -> 127.0.0.1:6650]
my-message-0: Fri Mar  6 19:45:05 2020
my-message-1: Fri Mar  6 19:45:05 2020
my-message-2: Fri Mar  6 19:45:05 2020
2020-03-06 19:45:15.818 INFO  UnAckedMessageTrackerEnabled:53 | [persistent://public/default/t1, sub1, 0] : 3 Messages were not acked within 10000 time

```

This behavior happened on the master branch.

* [pulsar-proxy] fixing data-type of logging-level (apache#6476)

### Modification
`ProxyConfig` has a wrapper method for `proxyLogLevel` to present an `Optional` data type. After apache#3543 we can define a config param as optional without creating wrapper methods.

* [pulsar-broker] recover zk-badversion while updating cursor metadata (apache#5604)

fix test

Co-authored-by: ltamber <ltamber12@gmail.com>
Co-authored-by: Devin Bost <devinbost@users.noreply.github.com>
Co-authored-by: Fangbin Sun <sunfangbin@gmail.com>
Co-authored-by: lipenghui <penghui@apache.org>
Co-authored-by: ran <gaoran_10@126.com>
Co-authored-by: liyuntao <liyuntao58607@gmail.com>
Co-authored-by: Jia Zhai <zhaijia@apache.org>
Co-authored-by: Nick Rivera <heronr@users.noreply.github.com>
Co-authored-by: Neng Lu <freeneng@gmail.com>
Co-authored-by: Yijie Shen <henry.yijieshen@gmail.com>
Co-authored-by: John Harris <jharris-@users.noreply.github.com>
Co-authored-by: guangning <guangning@apache.org>
Co-authored-by: newur <ruwen.reddig@gmail.com>
Co-authored-by: Sergii Zhevzhyk <vzhikserg@users.noreply.github.com>
Co-authored-by: liudezhi <33149602+liudezhi2098@users.noreply.github.com>
Co-authored-by: Dzmitry Kazimirchyk <dzmitryk@users.noreply.github.com>
Co-authored-by: futeng <ifuteng@gmail.com>
Co-authored-by: bilahepan <YTgaotianci@gmail.com>
Co-authored-by: Paweł Łoziński <pawel.lozinski@gmail.com>
Co-authored-by: Ryan Slominski <ryans@jlab.org>
Co-authored-by: k2la <mzq6mft9zz@gmail.com>
Co-authored-by: Rolf Arne Corneliussen <racorn@users.noreply.github.com>
Co-authored-by: Matteo Merli <mmerli@apache.org>
Co-authored-by: Sijie Guo <sijie@apache.org>
Co-authored-by: Rajan Dhabalia <rdhabalia@apache.org>
Co-authored-by: Sanjeev Kulkarni <sanjeevrk@gmail.com>
Co-authored-by: Sanjeev Kulkarni <sanjeevk@splunk.com>
Co-authored-by: congbo <39078850+congbobo184@users.noreply.github.com>
Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
Co-authored-by: Addison Higham <addisonj@gmail.com>
@tuteng
Member

tuteng commented Mar 21, 2020

Add label release-2.5.1, due to #6431 dependency

tuteng pushed a commit to AmateurEvents/pulsar that referenced this pull request Mar 21, 2020
…er OOM (apache#6178)

Motivation
Introduce maxMessagePublishBufferSizeInMB configuration to avoid broker OOM.

Modifications
If the processing message size exceeds this value, the broker will stop reading data from the connection. When the available size exceeds half of maxMessagePublishBufferSizeInMB, the broker resumes auto-reading data from the connection.

(cherry picked from commit 91dfa1a)
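The pause/resume logic described in this commit message can be sketched as follows. This is a minimal, standalone illustration; the class and method names here are hypothetical and do not reflect Pulsar's actual ServerCnx implementation:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the publish-buffer throttling described above:
// pause reads once the tracked buffer reaches the limit, and resume
// once the in-flight size drops below half of the limit.
public class PublishBufferThrottle {
    private final long maxBufferBytes; // maxMessagePublishBufferSizeInMB * 1024 * 1024
    private final AtomicLong currentBufferBytes = new AtomicLong();
    // Stands in for channel.config().setAutoRead(...) on the connection.
    private volatile boolean autoRead = true;

    public PublishBufferThrottle(long maxBufferBytes) {
        this.maxBufferBytes = maxBufferBytes;
    }

    // Called when a publish request of msgSize bytes arrives.
    public void onMessagePublish(long msgSize) {
        if (maxBufferBytes > 0
                && currentBufferBytes.addAndGet(msgSize) >= maxBufferBytes) {
            autoRead = false; // stop reading data from the connection
        }
    }

    // Called when the broker finishes persisting the message.
    public void onMessagePublishComplete(long msgSize) {
        long current = currentBufferBytes.addAndGet(-msgSize);
        // Resume once more than half of the buffer is available again.
        if (!autoRead && current < maxBufferBytes / 2) {
            autoRead = true;
        }
    }

    public boolean isAutoRead() {
        return autoRead;
    }
}
```

Note the trade-off flagged in the review comment: a single shared `AtomicLong` updated on every publish can become a contention point across broker threads.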
jiazhai added a commit that referenced this pull request Mar 22, 2020
In PR #6178, some of the methods in ServerCnx were changed from public to private; this change restores them to public.
jiazhai added a commit that referenced this pull request Mar 22, 2020
In PR #6178, some of the methods in ServerCnx were changed from public to private; this change restores them to public.
(cherry picked from commit 5bd0387)
tuteng pushed a commit that referenced this pull request Apr 6, 2020
In PR #6178, some of the methods in ServerCnx were changed from public to private; this change restores them to public.

(cherry picked from commit 5bd0387)
tuteng pushed a commit that referenced this pull request Apr 13, 2020
…er OOM (#6178)

Motivation
Introduce maxMessagePublishBufferSizeInMB configuration to avoid broker OOM.

Modifications
If the processing message size exceeds this value, the broker will stop reading data from the connection. When the available size exceeds half of maxMessagePublishBufferSizeInMB, the broker resumes auto-reading data from the connection.

(cherry picked from commit 91dfa1a)
tuteng pushed a commit that referenced this pull request Apr 13, 2020
In PR #6178, some of the methods in ServerCnx were changed from public to private; this change restores them to public.

(cherry picked from commit 5bd0387)
jiazhai pushed a commit to jiazhai/pulsar that referenced this pull request May 18, 2020
…er OOM (apache#6178)

Motivation
Introduce maxMessagePublishBufferSizeInMB configuration to avoid broker OOM.

Modifications
If the processing message size exceeds this value, the broker will stop reading data from the connection. When the available size exceeds half of maxMessagePublishBufferSizeInMB, the broker resumes auto-reading data from the connection.
(cherry picked from commit 91dfa1a)
jiazhai added a commit to jiazhai/pulsar that referenced this pull request May 18, 2020
In PR apache#6178, some of the methods in ServerCnx were changed from public to private; this change restores them to public.
(cherry picked from commit 5bd0387)
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020
…er OOM (apache#6178)

Motivation
Introduce maxMessagePublishBufferSizeInMB configuration to avoid broker OOM.

Modifications
If the processing message size exceeds this value, the broker will stop reading data from the connection. When the available size exceeds half of maxMessagePublishBufferSizeInMB, the broker resumes auto-reading data from the connection.
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020
In PR apache#6178, some of the methods in ServerCnx were changed from public to private; this change restores them to public.
@codelipenghui codelipenghui deleted the publish_message_buffer branch November 6, 2020 00:53