
Introduce maxMessagePublishBufferSizeInMB configuration to avoid broker OOM #6178

Merged
merged 6 commits into from Feb 16, 2020

Conversation

codelipenghui
Contributor

Master Issue: #5751

Motivation

Introduce maxMessagePublishBufferSizeInMB configuration to avoid broker OOM.

Modifications

If the in-flight message size exceeds this value, the broker will stop reading data from the connection. Once the buffered size drops below half of maxMessagePublishBufferSizeInMB, the broker resumes auto-reading data from the connection.
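The flow described above can be sketched as follows. This is a minimal, illustrative model, not the actual Pulsar classes: the connection's read switch is represented by a boolean where the broker would call Netty's `channel.config().setAutoRead(...)`.

```java
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch of the publish-buffer backpressure described above.
class PublishBufferLimiter {
    private final long maxBytes;        // maxMessagePublishBufferSizeInMB in bytes
    private final long resumeThreshold; // half of maxBytes
    private final LongAdder pendingBytes = new LongAdder();
    private volatile boolean autoRead = true;

    PublishBufferLimiter(long maxMessagePublishBufferSizeInMB) {
        this.maxBytes = maxMessagePublishBufferSizeInMB * 1024L * 1024L;
        this.resumeThreshold = maxBytes / 2;
    }

    // Called when a publish request is read from the connection.
    void onMessageRead(long msgSize) {
        pendingBytes.add(msgSize);
        if (maxBytes > 0 && pendingBytes.sum() >= maxBytes) {
            autoRead = false; // broker would stop reading from the channel here
        }
    }

    // Called once the write is persisted and the response is sent.
    void onMessagePersisted(long msgSize) {
        pendingBytes.add(-msgSize);
        if (!autoRead && pendingBytes.sum() < resumeThreshold) {
            autoRead = true; // broker would resume auto-read here
        }
    }

    boolean isAutoRead() { return autoRead; }
}
```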

Verifying this change

Unit tests added

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)
  • The rest endpoints: (no)
  • The admin cli options: (no)
  • Anything that affects deployment: (no)

Documentation

  • Does this pull request introduce a new feature? (no)

@codelipenghui codelipenghui added the type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages label Jan 31, 2020
@codelipenghui codelipenghui added this to the 2.6.0 milestone Jan 31, 2020
@codelipenghui codelipenghui self-assigned this Jan 31, 2020
if (maxMessagePublishBufferSize < 0) {
return false;
}
if (currentMessagePublishBufferSize.addAndGet(msgSize) >= maxMessagePublishBufferSize &&
Contributor

This would become a contention point across all the threads in the broker

Contributor Author

@merlimat Yes, this does increase contention. How about moving currentMessagePublishBufferSize to ServerCnx and periodically syncing it to totalMessagePublishBufferSize in BrokerService with a single thread?

Of course this introduces some delay, but it reduces the contention.

@jiazhai
Member

jiazhai commented Feb 5, 2020

@codelipenghui Thanks for the work. It is a good approach to avoid OOM.
@merlimat Thanks for the comments. It seems it is not easy to avoid the contention while still accurately tracking the memory usage. Are there any suggestions for this?

@jiazhai
Member

jiazhai commented Feb 10, 2020

ping @merlimat

@sijie
Member

sijie commented Feb 10, 2020

@codelipenghui

how about move currentMessagePublishBufferSize to ServerCnx and periodically sync them to totalMessagePublishBufferSize in BrokerService by a single thread?

this sounds good to me. Also consider using LongAdder rather than AtomicLong.
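A rough sketch of that combined suggestion (per-connection LongAdder counters, summed periodically by one thread): the class names here are illustrative stand-ins, not the actual ServerCnx/BrokerService code.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.LongAdder;

// Each connection tracks its own publish bytes in a LongAdder; a single
// monitor thread periodically sums them, so producer threads never
// contend on one shared AtomicLong.
class Cnx {
    final LongAdder messagePublishBufferSize = new LongAdder();
}

class BufferMonitor {
    private final List<Cnx> connections = new CopyOnWriteArrayList<>();
    private volatile long totalMessagePublishBufferBytes;

    void register(Cnx cnx) { connections.add(cnx); }

    // Runs on a single scheduled thread, e.g. every 100 ms.
    void check() {
        long total = 0;
        for (Cnx cnx : connections) {
            total += cnx.messagePublishBufferSize.sum();
        }
        totalMessagePublishBufferBytes = total;
    }

    long total() { return totalMessagePublishBufferBytes; }
}
```

The trade-off is exactly the one discussed above: the aggregate lags by up to one check interval, in exchange for removing the hot-path contention.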

@codelipenghui
Contributor Author

@merlimat @sijie @jiazhai I have applied the comment, please help take a look, thanks.

conf/broker.conf Outdated

# Interval between checks to see if message publish buffer size is exceed the max message publish buffer size
# Use 0 or negative number to disable the max publish buffer limiting.
messagePublishBufferCheckIntervalInMills=100
Member

Suggested change
messagePublishBufferCheckIntervalInMills=100
messagePublishBufferCheckIntervalInMillis=100

+ " but broker have not send response to client, usually waiting to write to bookies.\n\n"
+ " It's shared across all the topics running in the same broker.\n\n"
+ " Use -1 to disable the memory limitation. Default is 1/5 of direct memory.\n\n")
private int maxMessagePublishBufferSizeInMB = Math.max(64,
Member

The default value should make the broker behave as close as possible to the behavior without this code change. I understand we want to enable the rate-limiting feature, so should we make the default value 60% or 70% of max direct memory? Otherwise, people might experience unexpected performance issues when they upgrade a broker from an old version to a newer one.

Contributor Author

Maybe we'd better keep the default value at -1.

Contributor Author

Changed the default buffer size to half of the direct memory.
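Assuming the broker derives this default from the JVM's max direct memory (in Pulsar that figure would come from something like `io.netty.util.internal.PlatformDependent.maxDirectMemory()`), the computation might look like this illustrative helper, with the 64 MB floor from the snippet above:

```java
// Illustrative helper: default publish buffer = half of direct memory,
// floored at 64 MB.
class PublishBufferDefaults {
    static int defaultPublishBufferMb(long maxDirectMemoryBytes) {
        return (int) Math.max(64, maxDirectMemoryBytes / 2 / 1024 / 1024);
    }
}
```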

@@ -257,6 +268,8 @@ public BrokerService(PulsarService pulsar) throws Exception {
.newSingleThreadScheduledExecutor(new DefaultThreadFactory("pulsar-msg-expiry-monitor"));
this.compactionMonitor =
Executors.newSingleThreadScheduledExecutor(new DefaultThreadFactory("pulsar-compaction-monitor"));
this.messagePublishBufferMonitor =
Member

We should create this executor only when the feature is enabled.

Also I see we are creating more and more schedulers. Can we consider reusing some of the executors?

Contributor Author

Hmm, I think we need a different thread name. It's better for jstack analysis.

@@ -2011,4 +2033,34 @@ public ConfigField(Field field) {
return Optional.empty();
}
}

private void checkMessagePublishBuffer() {
currentMessagePublishBufferSize = 0;
Member

It seems to me that this variable doesn't have to be a class variable of BrokerService. It can just be a local variable, right?

Contributor Author

Yes, it is.

private final long maxMessagePublishBufferSize;
private final long resumeProducerReadMessagePublishBufferSize;
private volatile long currentMessagePublishBufferSize;
private volatile boolean isMessagePublishBufferThreshold;
Member

Suggested change
private volatile boolean isMessagePublishBufferThreshold;
private volatile boolean reachMessagePublishBufferThreshold;

@@ -216,8 +218,17 @@
private Channel listenChannel;
private Channel listenChannelTls;

private final long maxMessagePublishBufferSize;
private final long resumeProducerReadMessagePublishBufferSize;
private volatile long currentMessagePublishBufferSize;
Member

Suggested change
private volatile long currentMessagePublishBufferSize;
private volatile long currentMessagePublishBufferBytes;

I prefer using bytes rather than size to make the unit more explicit.

@codelipenghui codelipenghui merged commit 91dfa1a into apache:master Feb 16, 2020
kaynewu added a commit to kaynewu/pulsar that referenced this pull request Mar 10, 2020
* [Issue 5904]Support `unload` all partitions of a partitioned topic (apache#6187)

Fixes apache#5904 

### Motivation
Pulsar supports unloading a non-partitioned topic or a single partition of a partitioned topic. For a partitioned topic with many partitions, users currently need to list all partitions and unload them one by one. We need to support unloading all partitions of a partitioned topic.

* [Issue 4175] [pulsar-function-go] Create integration tests for Go Functions for production-readiness (apache#6104)

This PR is to provide integration tests that test execution of Go functions that are managed by the Java FunctionManager. This will allow us to test things like behavior during function timeouts, heartbeat failures, and other situations that can only be effectively tested in an integration test. 

Master issue: apache#4175
Fixes issue: apache#6204 

### Modifications

We must add Go to the integration testing logic. We must also build the Go dependencies into the test Dockerfile to ensure the Go binaries are available at runtime for the integration tests.

* [Issue 5999] Support create/update tenant with empty cluster (apache#6027)

### Motivation

Fixes apache#5999

### Modifications

Add the logic to handle the blank cluster name.

* Introduce maxMessagePublishBufferSizeInMB configuration to avoid broker OOM (apache#6178)

Motivation
Introduce maxMessagePublishBufferSizeInMB configuration to avoid broker OOM.

Modifications
If the in-flight message size exceeds this value, the broker will stop reading data from the connection. Once the buffered size drops below half of maxMessagePublishBufferSizeInMB, the broker resumes auto-reading data from the connection.

* Enable get precise backlog and backlog without delayed messages. (apache#6310)

Fixes apache#6045 apache#6281 

### Motivation

Enable get precise backlog and backlog without delayed messages.

### Verifying this change

Added new unit tests for the change.

* KeyValue schema support for pulsar sql (apache#6325)

Fixes apache#5560

### Motivation

Currently, Pulsar SQL can't read keyValue schema data. This PR adds support for Pulsar SQL to read messages with a key-value schema.

### Modifications

Add KeyValue schema support for Pulsar SQL. Add prefix __key. for the key field name.

* Avoid get partition metadata while the topic name is a partition name. (apache#6339)

Motivation

Avoid getting partition metadata when the topic name is already a partition name.
Currently, if users want to skip all messages for a partitioned topic or unload a partitioned topic, the broker will call get topic metadata many times. For a topic name that already carries the partition suffix, it is not necessary to call get partitioned topic metadata again.

* explicit statement env 'BOOKIE_MEM' and 'BOOKIE_GC' for values-mini.yaml (apache#6340)

Fixes apache#6338

### Motivation
This commit started while I was using helm in my local minikube; I noticed that there's a mismatch between the `values-mini.yaml` and `values.yaml` files. At first I thought it was a copy/paste error, so I created apache#6338;

Then I looked into the details of how these env vars [were used](https://github.com/apache/pulsar/blob/28875d5abc4cd13a3e9cc4f59524d2566d9f9f05/conf/bkenv.sh#L36), and found out it's OK to use `PULSAR_MEM` as an alternative. But it introduces problems:
1. Since `BOOKIE_GC` was not defined, the default [BOOKIE_EXTRA_OPTS](https://github.com/apache/pulsar/blob/28875d5abc4cd13a3e9cc4f59524d2566d9f9f05/conf/bkenv.sh#L39) will finally use the default value of `BOOKIE_GC`, and would thus override the JVM parameters defined earlier in `PULSAR_MEM`.
2. It may cause problems when the bootstrap scripts change in later development; better to make it explicit.

So I created this PR to solve the above (hidden) problems.

### Modifications

As mentioned above, I've made such modifications below:
1. make `BOOKIE_MEM` and `BOOKIE_GC` explicit in `values-mini.yaml` file.  Keep up with the format in`values.yaml` file.
2. remove all  print-gc-logs related args. Considering the resource constraints of minikube environment. The removed part's content is `-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC -verbosegc -XX:G1LogLevel=finest`
3. leave `PULSAR_PREFIX_dbStorage_rocksDB_blockCacheSize` empty as usual, as [conf/standalone.conf#L576](https://github.com/apache/pulsar/blob/df152109415f2b10dd83e8afe50d9db7ab7cbad5/conf/standalone.conf#L576) says it would to use 10% of the direct memory size by default.

* Fix java doc for key shared policy. (apache#6341)

The key shared policy does not support setting the maximum key hash range, so fix the java doc.

* client: make SubscriptionMode a member of ConsumerConfigurationData (apache#6337)

Currently, SubscriptionMode is a parameter used to create ConsumerImpl, but it is not exported, so users cannot set this value for a consumer. This change makes SubscriptionMode a member of ConsumerConfigurationData, so users can set this parameter when creating a consumer.

* Windows CMake corrections (apache#6336)

* Corrected method of specifying Windows path to LLVM tools

* Fixing windows build

* Corrected the dll install path

* Fixing pulsarShared paths

* remove future.join() from PulsarSinkEffectivelyOnceProcessor (apache#6361)

* use checkout@v2 to avoid fatal: reference is not a tree (apache#6386)

"fatal: reference is not a tree" is a known issue in actions/checkout#23 and fixed in checkout@v2, update checkout used in GitHub actions.

* [Pulsar-Client] Stop shade snappy-java in pulsar-client-shaded (apache#6375)

Fixes apache#6260 

Snappy, like other compression codecs (LZ4, ZSTD), depends on native libraries to do the real encode/decode work. When we shade them in a fat jar, only the Java implementations of the Snappy classes are shaded, leaving the JNI layer incompatible with the underlying C++ code.

We should just remove the shade for snappy, and let maven import its lib as a dependency.

I've tested the shaded jar generated locally by this PR; it works for all compression codecs.

* Fix CI not triggered (apache#6397)

In apache#6386 , checkout@v2 is brought in for checkout.

However, it checks out the PR merge commit by default, which breaks the diff-only action that looks for the commits a PR is based on, and makes all tests skipped.

This PR fixes this issue. and has been proven to work with apache#6396 Brokers/unit-tests.

* [Issue 6355][HELM] autorecovery - could not find or load main class (apache#6373)

This applies the recommended fix from
apache#6355 (comment)

Fixes apache#6355

### Motivation

This PR corrects the configmap data which was causing the autorecovery pod to crashloop
with `could not find or load main class`

### Modifications

Updated the configmap var data per [this comment](apache#6355 (comment)) from @sijie

* Creating a topic does not wait for creating cursor of replicators (apache#6364)

### Motivation

Creating a topic does not wait for creating cursor of replicators

### Verifying this change

The existing unit tests cover this change.

* [Reader] Should set either start message id or start message from roll back duration. (apache#6392)

Currently, when constructing a reader, users can set both start message id and start time. 

This is strange and the behavior should be forbidden.

* Seek to the first one >= timestamp (apache#6393)

The current logic for `resetCursor` by timestamp is odd. The first message it returns is the last message earlier than or equal to the designated timestamp. This "earlier" message should not be emitted.

* [Minor] Fix java code errors reported by lgtm.  (apache#6398)

Four kinds of errors are fixed in this PR:

- Array index out of bounds
- Inconsistent equals and hashCode
- Missing format argument
- Reference equality test of boxed types

According to https://lgtm.com/projects/g/apache/pulsar/alerts/?mode=tree&severity=error&id=&lang=java

* [Java Reader Client] Start reader inside batch result in read first message in batch. (apache#6345)

Fixes apache#6344 
Fixes apache#6350

The bug was brought in apache#5622 by changing the skip logic wrongly.

* Fix broker to specify a list of bookie groups. (apache#6349)

### Motivation

Fixes apache#6343

### Modifications

Add a method to cast object value to `String`.

* Fixed enum package not found (apache#6401)

Fixes apache#6400

### Motivation
This problem is blocking the current test. Version 1.1.8 of `enum34` seems to have some problems, and the problem reproduces as follows:

Use pulsar latest code:
```
cd pulsar
mvn clean install -DskipTests
docker pull apachepulsar/pulsar-build:ubuntu-16.04
docker run -it -v $PWD:/pulsar --name pulsar apachepulsar/pulsar-build:ubuntu-16.04 /bin/bash
docker exec -it pulsar /bin/bash
cmake .
make -j4 && make install 
cd python
python setup.py bdist_wheel
pip install dist/pulsar_client-*-linux_x86_64.whl
```
`pip show enum34`
```
Name: enum34
Version: 1.1.8
Summary: Python 3.4 Enum backported to 3.3, 3.2, 3.1, 2.7, 2.6, 2.5, and 2.4
Home-page: https://bitbucket.org/stoneleaf/enum34
Author: Ethan Furman
Author-email: ethan@stoneleaf.us
License: BSD License
Location: /usr/local/lib/python2.7/dist-packages
Requires:
Required-by: pulsar-client, grpcio
```

```
root@55e06c5c770f:/pulsar/pulsar-client-cpp/python# python
Python 2.7.12 (default, Oct  8 2019, 14:14:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from enum import Enum, EnumMeta
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named enum
>>> exit()
```

There is no problem with using 1.1.9 in the test.

### Modifications

* Upgrade enum34 from 1.1.8 to 1.1.9

### Verifying this change

local test pass

* removed comma from yaml config (apache#6402)

* Fix broker client tls settings error (apache#6128)

when broker create the inside client, it sets tlsTrustCertsFilePath as "getTlsCertificateFilePath()", but it should be "getBrokerClientTrustCertsFilePath()"

* [Issue 3762][Schema] Fix the problem with parsing of an Avro schema related to shading in pulsar-client. (apache#6406)

Motivation
Avro schemas are quite important for proper data flow, and it is a pity that the apache#3762 issue stayed untouched for so long. There were some workarounds for making Pulsar use an original Avro schema, but in the end, it is pretty hard to run an enterprise solution on workarounds. With this PR I would like to find a solution to the problem caused by shading Avro in pulsar-client. As discussed in the issue, there are two possible solutions:

1. Unshade the Avro library in the pulsar-client library. (IMHO it seems like a proper solution for this problem, but it also brings a risk of unknown side effects.)
2. Use reflection to get original schemas from generated classes. (I went for this solution.)

Could you please comment on whether this is a proper solution for the problem? I will add tests once my approach is confirmed.

Modifications
First, we try to extract the original Avro schema from the "$SCHEMA" field using reflection. If that doesn't work, the process falls back to generating the schema from the POJO.

* Remove duplicated lombok annotations in the tests module (apache#6412)

* Add verification for SchemaDefinitionBuilderImpl.java (apache#6405)

### Motivation

Add verification for SchemaDefinitionBuilderImpl.java

### Verifying this change

Added a new unit test.

* Cleanup pom files in the tests module (apache#6421)

### Modifications

- Removed dependencies on test libraries that were already imported in the parent pom file.

- Removed groupId tags that are inherited from the parent pom file.

* Update BatchReceivePolicy.java (apache#6423)

BatchReceivePolicy implements Serializable.

* Consumer received duplicated delayed messages upon restart

Fix a case where, when delayed messages are sent, a consumer that restarts pulls duplicate messages. apache#6403

* Bump netty version to 4.1.45.Final (apache#6424)

netty 4.1.43 has a bug preventing it from using Linux native Epoll transport

This results in pulsar brokers failing over to NioEventLoopGroup even when running on Linux.

The bug is fixed in netty releases 4.1.45.Final

* Fix publish buffer limit does not take effect

Motivation
If maxMessagePublishBufferSizeInMB is set to a value greater than Integer.MAX_VALUE / 1024 / 1024, the publish buffer limit does not take effect. The reason is that maxMessagePublishBufferBytes overflows in int arithmetic (evaluating to 0) with the following calculation:

pulsar.getConfiguration().getMaxMessagePublishBufferSizeInMB() * 1024 * 1024;

So it was changed to:

pulsar.getConfiguration().getMaxMessagePublishBufferSizeInMB() * 1024L * 1024L;
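The effect of the `L` suffix can be shown with a small, self-contained example (hypothetical method names, not the Pulsar code):

```java
// Demonstrates the int-overflow bug behind the fix above.
class OverflowDemo {
    static long wrongLimit(int maxMessagePublishBufferSizeInMB) {
        // int * int * int overflows before being widened to long
        return maxMessagePublishBufferSizeInMB * 1024 * 1024;
    }

    static long fixedLimit(int maxMessagePublishBufferSizeInMB) {
        // the 1024L literals force long arithmetic from the first multiply
        return maxMessagePublishBufferSizeInMB * 1024L * 1024L;
    }
}
```

For example, with 4096 MB (4 GB) the int product is exactly 2^32 and wraps to 0, which is why the limit silently never triggered.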

* doc: Add the missing right parenthesis (apache#6426)

* Add the missing right parenthesis

doc: Missing right parenthesis in the `token()` line of the Pulsar Client Java code.

* Add the missing right parenthesis on line L70

* Switch from deprecated MAINTAINER tag to LABEL with maintainer's info in Dockerfile (apache#6429)

Motivation & Modification
The MAINTAINER instruction is deprecated in favor of the LABEL instruction with the maintainer's info in docker files.

* Amend the default value of . (apache#6374)

* Fix the bug that authenticationData isn't initialized. (apache#6440)

Motivation
Fix the bug that authenticationData isn't initialized.

The method org.apache.pulsar.proxy.server.ProxyConnection#handleConnect doesn't initialize the value of authenticationData,
so you will get a null value from the method org.apache.pulsar.broker.authorization.AuthorizationProvider#canConsumeAsync
when implementing the org.apache.pulsar.broker.authorization.AuthorizationProvider interface.

Modifications
Initialize the value of authenticationData in the method org.apache.pulsar.proxy.server.ProxyConnection#handleConnect.

Verifying this change
Implement the org.apache.pulsar.broker.authorization.AuthorizationProvider interface and get the value of authenticationData.

* Remove duplicated test libraries in POM dependencies (apache#6430)

### Motivation
The removed test libraries were already defined in the parent pom

### Modification
Removed duplicated test libraries in POM dependencies

* Add a message on how to make log refresh immediately when starting a component (apache#6078)

### Motivation

Some users may be confused by pulsar/bookie logs not flushing immediately.

### Modifications

Add a message in `bin/pulsar-daemon` when starting a component.

* Close ZK before canceling future with exception (apache#6228) (apache#6399)

Fixes apache#6228

* [Flink-Connector]Get PulsarClient from cache should always return an open instance (apache#6436)

* Update sidebars.json (apache#6434)

The referenced markdown files do not exist and so the "Next" and "Previous" buttons on the bottom of pages surrounding them result in 404 Not Found errors

* [Broker] Create namespace failed when TLS is enabled in PulsarStandalone (apache#6457)

When starting Pulsar in standalone mode with TLS enabled, it will fail to create two namespaces during start. 

This is because it's using the unencrypted URL/port while constructing the PulsarAdmin client.

* Update version-2.5.0-sidebars.json (apache#6455)

The referenced markdown files do not exist and so the "Next" and "Previous" buttons on the bottom of pages surrounding them result in 404 Not Found errors

* [Issue 6168] Fix Unacked Message Tracker by Using Time Partition on C++ (apache#6391)

### Motivation
Fix apache#6168 .
>On C++ lib, like the following log, unacked messages are redelivered after about 2 * unAckedMessagesTimeout.

### Modifications
Same as apache#3118: fixed `UnackedMessageTracker` by using TimePartition.
- Add `TickDurationInMs`
- Add `redeliverUnacknowledgedMessages`, which requires `MessageIds`, to `ConsumerImpl`, `MultiTopicsConsumerImpl` and `PartitionedConsumerImpl`.

* [ClientAPI]Fix hasMessageAvailable() (apache#6362)

Fixes apache#6333 

Previously, `hasMoreMessages` was tested as:
```
return lastMessageIdInBroker.compareTo(lastDequeuedMessage) == 0
                && incomingMessages.size() > 0;
```
However, `incomingMessages` could be 0 when the consumer/reader has just started and hasn't received any messages yet.

In this PR, the last entry is retrieved and decoded to get the message metadata for populating the batchIndex field.

* Use System.nanoTime() instead of System.currentTimeMillis() (apache#6454)

Fixes apache#6453 

### Motivation
`ConsumerBase` and `ProducerImpl` use `System.currentTimeMillis()` to measure the elapsed time in the 'operations' inner classes (`ConsumerBase$OpBatchReceive` and `ProducerImpl$OpSendMsg`).

An instance variable `createdAt` is initialized with `System.currentTimeMillis()`, but it is not used for reading wall-clock time; the variable is only used for computing elapsed time (e.g. the timeout for a batch).

When the variable is used to compute elapsed time, it makes more sense to use `System.nanoTime()`.

### Modifications

The instance variable `createdAt` in `ConsumerBase$OpBatchReceive` and `ProducerImpl$OpSendMsg` is initialized with `System.nanoTime()`. Usage of the variable is updated to reflect that it holds nano time; computations of elapsed time take the difference between the current system nano time and the `createdAt` variable.

The `createdAt` field is package protected, and is currently only used in the declaring class and outer class, limiting the chances for unwanted side effects.
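A minimal sketch of the pattern (an illustrative class, not the actual `OpSendMsg`/`OpBatchReceive` code): the nanoTime reading is only ever subtracted from another nanoTime reading, never interpreted as wall-clock time.

```java
import java.util.concurrent.TimeUnit;

// Illustrative elapsed-time tracking with a monotonic clock.
class TimedOp {
    final long createdAtNanos = System.nanoTime();

    // True once at least timeoutMillis of elapsed time has passed.
    boolean hasTimedOut(long timeoutMillis) {
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - createdAtNanos);
        return elapsedMillis >= timeoutMillis;
    }
}
```

Unlike `currentTimeMillis()`, `nanoTime()` is unaffected by NTP adjustments or clock changes, so elapsed-time computations cannot go negative or jump.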

* Fixed the max backoff configuration for lookups (apache#6444)

* Fixed the max backoff configuration for lookups

* Fixed test expectation

* More test fixes

* upgrade scala-maven-plugin to 4.1.0 (apache#6469)

### Motivation
The Pulsar examples include some third-party libraries with security vulnerabilities.
- log4j-core-2.8.1
https://www.cvedetails.com/cve/CVE-2017-5645

### Modifications

- Upgraded the version of scala-maven-plugin from 4.0.1 to 4.1.0. log4j-core-2.8.1 were installed because scala-maven-plugin depends on it.

* [pulsar-proxy] fix logging for published messages (apache#6474)

### Motivation
Proxy logging fetches an incorrect producerId for the `Send` command; because of that, logging always gets producerId 0 and fetches an invalid topic name.

### Modification
Fixed topic logging by fetching the correct producerId for the `Send` command.

* [Issue 6394] Add configuration to disable auto creation of subscriptions (apache#6456)

### Motivation

Fixes apache#6394

### Modifications

- provide a flag `allowAutoSubscriptionCreation` in `ServiceConfiguration`, defaults to `true`
- when `allowAutoSubscriptionCreation` is disabled and the specified subscription (`Durable`) on the topic does not exist when trying to subscribe via a consumer, the server should reject the request directly by `handleSubscribe` in `ServerCnx`
- create the subscription on the coordination topic if it does not exist when initializing `WorkerService`

* Make tests more stable by using JSONAssert equals (apache#6435)

Similar to the change you already merged for AvroSchemaTest.java(apache#6247):
`jsonSchema.getSchemaInfo().getSchema()` in `pulsar-client/src/test/java/org/apache/pulsar/client/impl/schema/JSONSchemaTest.java` returns a JSON object. `schemaJson` compares with hard-coded JSON String. However, the order of entries in `schemaJson` is not guaranteed. Similarly, test `testKeyValueSchemaInfoToString` in `pulsar-client/src/test/java/org/apache/pulsar/client/impl/schema/KeyValueSchemaInfoTest.java` returns a JSON object. `havePrimitiveType` compares with hard-coded JSON String, and the order of entries in `havePrimitiveType` is not guaranteed.


This PR proposes to use JSONAssert and modify the corresponding JSON test assertions so that the test is more stable.

### Motivation

Using JSONAssert and modifying the corresponding JSON test assertions so that the test is more stable.

### Modifications

Adding `assertJSONEqual` method and replacing `assertEquals` with it in tests `testAllowNullSchema`, `testNotAllowNullSchema` and `testKeyValueSchemaInfoToString`.

* Avoid calling ConsumerImpl::redeliverMessages() when message list is empty (apache#6480)

* [pulsar-client] fix deadlock on send failure (apache#6488)

* Enhance Authorization by adding TenantAdmin interface (apache#6487)

* Enhance Authorization by adding TenantAdmin interface

* Remove debugging comment

Co-authored-by: Sanjeev Kulkarni <sanjeevk@splunk.com>

* Independent schema is set for each consumer generated by topic (apache#6356)

### Motivation

Master Issue: apache#5454 

When one consumer subscribes to multiple topics, setSchemaInfoProvider() is overwritten by the consumer generated for the last topic.

### Modification
Clone the schema for each consumer generated by a topic.
### Verifying this change
Added a schema test for it.

* Fix memory leak when running topic compaction. (apache#6485)


Fixes apache#6482

### Motivation
Prevent topic compaction from leaking direct memory

### Modifications

Several leaks were discovered using Netty leak detection and code review.
* `CompactedTopicImpl.readOneMessageId` would get an `Enumeration` of `LedgerEntry`, but did not release the underlying buffers. Fix: iterate though the `Enumeration` and release underlying buffer. Instead of logging the case where the `Enumeration` did not contain any elements, complete the future exceptionally with the message (will be logged by Caffeine).
* Two main sources of leaks in `TwoPhaseCompactor`. The `RawBatchConverter.rebatchMessage` method failed to close/release a `ByteBuf` (uncompressedPayload), and the ByteBuf returned by `RawBatchConverter.rebatchMessage` was not closed either. The first one was easy to fix (release the buffer); to fix the second one and make the code easier to read, I decided not to let `RawBatchConverter.rebatchMessage` close the message read from the topic. Instead, the message read from the topic is closed in a try/finally clause surrounding most of the method body handling a message from the topic (in the phase-two loop). Then, if a new message was produced by `RawBatchConverter.rebatchMessage`, we release it after adding it to the compacted ledger.

### Verifying this change
Modified `RawReaderTest.testBatchingRebatch` to show new contract.

One can run the test described to reproduce the issue, to verify no leak is detected.

* Fix create partitioned topic with a substring of an existing topic name. (apache#6478)

Fixes apache#6468

Fix creating a partitioned topic whose name is a substring of an existing topic name, and make partitioned topic creation async.

* Bump jcloud version to 2.2.0 and remove jcloud-shade module (apache#6494)

In jclouds 2.2.0, the [gson is shaded internally](https://issues.apache.org/jira/browse/JCLOUDS-1166). We could safely remove the jcloud-shade module as a cleanup.

* Refactor tests in pulsar client tools test (apache#6472)

### Modifications

The main modification was the reduction of repeated initialization of the variables in the tests.

* Fix Topic metrics documentation (apache#6495)

### Motivation

Motivation is to have correct reference-metrics documentation.

### Modifications

There is an error in the `Topic metrics` section

`pulsar_producers_count` => `pulsar_in_messages_total`

* [pulsar-client] remove duplicate cnx method (apache#6490)

### Motivation
Remove duplicate `cnx()` method for `producer`

* [proxy] Fix proxy routing to functions worker (apache#6486)

### Motivation


Currently, the proxy only works to proxy v1/v2 functions routes to the
function worker.

### Modifications

This changes this code to proxy all routes for the function worker when
those routes match. At the moment this is still a static list of
prefixes, but in the future it may be possible to have this list of
prefixes be dynamically fetched from the REST routes.

### Verifying this change
- added some tests to ensure the routing works as expected

* Fix some async method problems at PersistentTopicsBase. (apache#6483)

* Instead of always using admin access for topic, use read/write/admin access for topic (apache#6504)

Co-authored-by: Sanjeev Kulkarni <sanjeevk@splunk.com>

* [Minor]Remove unused property from pom (apache#6500)

This PR is a follow-up of apache#6494

* [pulsar-common] Remove duplicate RestException references (apache#6475)

### Motivation
Right now, various Pulsar modules each define their own `RestException` class, so the repo contains multiple duplicate classes. Move `RestException` to a common place so that all modules use the same exception class and the duplicates can be removed.

* pulsar-proxy: fix correct name for proxy thread executor name (apache#6460)

### Motivation
Use the correct name for the proxy thread executor.

* Add subscribe initial position for consumer cli. (apache#6442)

### Motivation

In some cases, users expect to consume messages from the beginning, similar to the `--from-beginning` option of the Kafka consumer CLI.

### Modifications

Add `--subscription-position` for `pulsar-client` and `pulsar-perf`.

* [Cleanup] Log format does not match arguments (apache#6509)

* Start namespace service and schema registry service before start broker. (apache#6499)

### Motivation

Once the broker service is started, clients can connect to the broker and send requests that depend on the namespace service, so we should create the namespace service before starting the broker. Otherwise, an NPE occurs.

![image](https://user-images.githubusercontent.com/12592133/76090515-a9961400-5ff6-11ea-9077-cb8e79fa27c0.png)

![image](https://user-images.githubusercontent.com/12592133/76099838-b15db480-6006-11ea-8f39-31d820563c88.png)


### Modifications

Move the namespace service creation and the schema registry service creation before the broker service starts.

* [pulsar-client-cpp] Fix Redelivery of Messages on UnackedMessageTracker When Ack Messages . (apache#6498)

### Motivation
Because of apache#6391, acked messages were counted as unacked messages.
Although messages from the broker were acknowledged, the following log was output.

```
2020-03-06 19:44:51.790 INFO  ConsumerImpl:174 | [persistent://public/default/t1, sub1, 0] Created consumer on broker [127.0.0.1:58860 -> 127.0.0.1:6650]
my-message-0: Fri Mar  6 19:45:05 2020
my-message-1: Fri Mar  6 19:45:05 2020
my-message-2: Fri Mar  6 19:45:05 2020
2020-03-06 19:45:15.818 INFO  UnAckedMessageTrackerEnabled:53 | [persistent://public/default/t1, sub1, 0] : 3 Messages were not acked within 10000 time

```

This behavior happened on the master branch.

* [pulsar-proxy] fixing data-type of logging-level (apache#6476)

### Modification
`ProxyConfig` has a wrapper method for `proxyLogLevel` to present an `Optional` data type. After apache#3543 we can define a config param as optional without creating wrapper methods.

* [pulsar-broker] recover zk-badversion while updating cursor metadata (apache#5604)

fix test

Co-authored-by: ltamber <ltamber12@gmail.com>
Co-authored-by: Devin Bost <devinbost@users.noreply.github.com>
Co-authored-by: Fangbin Sun <sunfangbin@gmail.com>
Co-authored-by: lipenghui <penghui@apache.org>
Co-authored-by: ran <gaoran_10@126.com>
Co-authored-by: liyuntao <liyuntao58607@gmail.com>
Co-authored-by: Jia Zhai <zhaijia@apache.org>
Co-authored-by: Nick Rivera <heronr@users.noreply.github.com>
Co-authored-by: Neng Lu <freeneng@gmail.com>
Co-authored-by: Yijie Shen <henry.yijieshen@gmail.com>
Co-authored-by: John Harris <jharris-@users.noreply.github.com>
Co-authored-by: guangning <guangning@apache.org>
Co-authored-by: newur <ruwen.reddig@gmail.com>
Co-authored-by: Sergii Zhevzhyk <vzhikserg@users.noreply.github.com>
Co-authored-by: liudezhi <33149602+liudezhi2098@users.noreply.github.com>
Co-authored-by: Dzmitry Kazimirchyk <dzmitryk@users.noreply.github.com>
Co-authored-by: futeng <ifuteng@gmail.com>
Co-authored-by: bilahepan <YTgaotianci@gmail.com>
Co-authored-by: Paweł Łoziński <pawel.lozinski@gmail.com>
Co-authored-by: Ryan Slominski <ryans@jlab.org>
Co-authored-by: k2la <mzq6mft9zz@gmail.com>
Co-authored-by: Rolf Arne Corneliussen <racorn@users.noreply.github.com>
Co-authored-by: Matteo Merli <mmerli@apache.org>
Co-authored-by: Sijie Guo <sijie@apache.org>
Co-authored-by: Rajan Dhabalia <rdhabalia@apache.org>
Co-authored-by: Sanjeev Kulkarni <sanjeevrk@gmail.com>
Co-authored-by: Sanjeev Kulkarni <sanjeevk@splunk.com>
Co-authored-by: congbo <39078850+congbobo184@users.noreply.github.com>
Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
Co-authored-by: Addison Higham <addisonj@gmail.com>
@tuteng
Member

tuteng commented Mar 21, 2020

Add label release-2.5.1, due to #6431 dependency

tuteng pushed a commit to AmateurEvents/pulsar that referenced this pull request Mar 21, 2020
…er OOM (apache#6178)

Motivation
Introduce maxMessagePublishBufferSizeInMB configuration to avoid broker OOM.

Modifications
If the processing message size exceeds this value, the broker will stop reading data from the connection. When the available size exceeds half of maxMessagePublishBufferSizeInMB, the broker resumes auto-reading data from the connection.

(cherry picked from commit 91dfa1a)
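The pause/resume logic described in this commit message can be sketched as follows. This is a minimal, standalone illustration; the class and method names here are hypothetical and do not reflect Pulsar's actual ServerCnx implementation:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the publish-buffer throttling described above:
// pause reads once the tracked buffer reaches the limit, and resume
// once the in-flight size drops below half of the limit.
public class PublishBufferThrottle {
    private final long maxBufferBytes; // maxMessagePublishBufferSizeInMB * 1024 * 1024
    private final AtomicLong currentBufferBytes = new AtomicLong();
    // Stands in for channel.config().setAutoRead(...) on the connection.
    private volatile boolean autoRead = true;

    public PublishBufferThrottle(long maxBufferBytes) {
        this.maxBufferBytes = maxBufferBytes;
    }

    // Called when a publish request of msgSize bytes arrives.
    public void onMessagePublish(long msgSize) {
        if (maxBufferBytes > 0
                && currentBufferBytes.addAndGet(msgSize) >= maxBufferBytes) {
            autoRead = false; // stop reading data from the connection
        }
    }

    // Called when the broker finishes persisting the message.
    public void onMessagePublishComplete(long msgSize) {
        long current = currentBufferBytes.addAndGet(-msgSize);
        // Resume once more than half of the buffer is available again.
        if (!autoRead && current < maxBufferBytes / 2) {
            autoRead = true;
        }
    }

    public boolean isAutoRead() {
        return autoRead;
    }
}
```

Note the trade-off flagged in the review comment: a single shared `AtomicLong` updated on every publish can become a contention point across broker threads.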
jiazhai added a commit that referenced this pull request Mar 22, 2020
In PR #6178, some of the methods in ServerCnx were changed from public to private; this change restores them to public.
jiazhai added a commit that referenced this pull request Mar 22, 2020
In PR #6178, some of the methods in ServerCnx were changed from public to private; this change restores them to public.
(cherry picked from commit 5bd0387)
tuteng pushed a commit that referenced this pull request Apr 6, 2020
In PR #6178, some of the methods in ServerCnx were changed from public to private; this change restores them to public.

(cherry picked from commit 5bd0387)
tuteng pushed a commit that referenced this pull request Apr 13, 2020
…er OOM (#6178)

Motivation
Introduce maxMessagePublishBufferSizeInMB configuration to avoid broker OOM.

Modifications
If the processing message size exceeds this value, the broker will stop reading data from the connection. When the available size exceeds half of maxMessagePublishBufferSizeInMB, the broker resumes auto-reading data from the connection.

(cherry picked from commit 91dfa1a)
tuteng pushed a commit that referenced this pull request Apr 13, 2020
In PR #6178, some of the methods in ServerCnx were changed from public to private; this change restores them to public.

(cherry picked from commit 5bd0387)
jiazhai pushed a commit to jiazhai/pulsar that referenced this pull request May 18, 2020
…er OOM (apache#6178)

Motivation
Introduce maxMessagePublishBufferSizeInMB configuration to avoid broker OOM.

Modifications
If the processing message size exceeds this value, the broker will stop reading data from the connection. When the available size exceeds half of maxMessagePublishBufferSizeInMB, the broker resumes auto-reading data from the connection.
(cherry picked from commit 91dfa1a)
jiazhai added a commit to jiazhai/pulsar that referenced this pull request May 18, 2020
In PR apache#6178, some of the methods in ServerCnx were changed from public to private; this change restores them to public.
(cherry picked from commit 5bd0387)
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020
…er OOM (apache#6178)

Motivation
Introduce maxMessagePublishBufferSizeInMB configuration to avoid broker OOM.

Modifications
If the processing message size exceeds this value, the broker will stop reading data from the connection. When the available size exceeds half of maxMessagePublishBufferSizeInMB, the broker resumes auto-reading data from the connection.
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020
In PR apache#6178, some of the methods in ServerCnx were changed from public to private; this change restores them to public.
@codelipenghui codelipenghui deleted the publish_message_buffer branch November 6, 2020 00:53