
mvcc: fully concurrent read #10523

Merged
merged 5 commits into etcd-io:master on Jun 14, 2019

Conversation

jingyih
Contributor

@jingyih jingyih commented Mar 6, 2019

This PR solves the "large read blocking write" issue (#10525) in mvcc/backend, as an alternative approach to #9384. The idea is to make ALL reads concurrent, thereby avoiding the need to predict and postpone large reads.
For the full story, please refer to the etcd: Fully Concurrent Reads Design Proposal.

TL;DR

In the presence of long-running reads, this PR increases the throughput of other requests (short reads and writes) by about 70% and reduces their P99 and P999 latency by about 90%. In other benchmark cases where there are no long-running reads, performance with and without this PR is comparable. (See the later sections for detailed results.)

What does this PR solve

Concurrency bottleneck in the current backend implementation

Any ongoing kvstore read transaction holds an RWMutex read lock [1], which blocks any modification to the backend read buffer. There are two places where a backend write transaction needs to acquire the write lock and modify the read buffer:

  • when a kvstore write transaction ends, the write buffer (i.e., the changes made by this kvstore write transaction) needs to be merged into the read buffer; [2]
  • when a batch interval ends and changes are actually committed, the current read buffer and boltdb Tx need to be reset / re-created for the next batch interval. [3]

Therefore a long read transaction can potentially block writes and the following reads. Note that this blocking is necessary to provide the linearizability guarantee; simply removing the locking will not work. The backend needs to make sure the combined view of the read buffer + boltdb Tx is always consistent.

Ref:
[1]

etcd/mvcc/kvstore_txn.go

Lines 32 to 34 in 886d30d

func (s *store) Read() TxnRead {
	s.mu.RLock()
	tx := s.b.ReadTx()

etcd/mvcc/kvstore_txn.go

Lines 53 to 54 in 886d30d

func (tr *storeTxnRead) End() {
	tr.tx.RUnlock()

[2]

t.backend.readTx.Lock() // blocks txReadBuffer for writing.
t.buf.writeback(&t.backend.readTx.buf)
t.backend.readTx.Unlock()

[3]

t.backend.readTx.Lock()
t.unsafeCommit(stop)
t.backend.readTx.Unlock()
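
To make the consistency requirement concrete, here is a hedged sketch of how a backend range read combines the two sources. Names and signatures (readTx, buf.Range, the unsafeRange helper) are illustrative simplifications, not the exact etcd code:

```go
// Sketch: a range read serves pending writes of the current batch interval
// from the read buffer, and falls back to the boltdb Tx for committed data.
// If the buffer and the Tx could change independently mid-read, the combined
// result would no longer be a consistent snapshot, hence the locking above.
func (rt *readTx) UnsafeRange(bucket, key, end []byte, limit int64) ([][]byte, [][]byte) {
	keys, vals := rt.buf.Range(bucket, key, end, limit) // uncommitted writes buffered in memory
	if int64(len(keys)) == limit {
		return keys, vals
	}
	// unsafeRange is a hypothetical helper reading the remaining entries from boltdb.
	k2, v2 := unsafeRange(rt.tx, bucket, key, end, limit-int64(len(keys)))
	return append(k2, keys...), append(v2, vals...)
}
```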

How this PR addresses the bottleneck

  • At the beginning of a kvstore read transaction, a copy of the current backend read buffer is created. Any range ops in this transaction read from the copy and the boltdb Tx, so the transaction does not need to block subsequent changes to the backend read buffer.
  • At the end of a batch interval, we do not roll back the boltdb Tx immediately. Instead, we create a new boltdb Tx (and reset the read buffer) and immediately start the next batch interval. An old boltdb Tx is closed only when all kvstore read transactions referencing it are done. (See the sketch after this list.)
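
Below is a minimal, self-contained sketch of these two ideas. All names (readBuffer, txWg, ConcurrentReadTx, rollbackWhenIdle) are illustrative; this is a simplified model of the approach, not the code in this PR:

```go
package backend

import (
	"sync"

	bolt "go.etcd.io/bbolt"
)

// readBuffer stands in for the backend's txReadBuffer: the not-yet-committed
// writes of the current batch interval.
type readBuffer map[string][]byte

func (rb readBuffer) copy() readBuffer {
	cp := make(readBuffer, len(rb))
	for k, v := range rb {
		cp[k] = v
	}
	return cp
}

type backend struct {
	mu   sync.RWMutex
	buf  readBuffer
	tx   *bolt.Tx        // boltdb read Tx of the current batch interval
	txWg *sync.WaitGroup // counts readers still using the current boltdb Tx
}

// concurrentReadTx owns a private copy of the read buffer, so its range ops
// never block writers from modifying the backend's buffer.
type concurrentReadTx struct {
	buf  readBuffer
	tx   *bolt.Tx
	txWg *sync.WaitGroup
}

// ConcurrentReadTx copies the buffer under a short-lived read lock and pins
// the current boltdb Tx; after that, the reader proceeds without any lock.
func (b *backend) ConcurrentReadTx() *concurrentReadTx {
	b.mu.RLock()
	defer b.mu.RUnlock()
	b.txWg.Add(1)
	return &concurrentReadTx{buf: b.buf.copy(), tx: b.tx, txWg: b.txWg}
}

// RUnlock signals that this reader no longer needs the pinned boltdb Tx.
func (rt *concurrentReadTx) RUnlock() { rt.txWg.Done() }

// rollbackWhenIdle closes an old boltdb Tx asynchronously, only once all
// readers pinned to it are done, so the next batch interval can start
// immediately with a fresh Tx and an empty buffer.
func rollbackWhenIdle(tx *bolt.Tx, wg *sync.WaitGroup) {
	go func() {
		wg.Wait()
		tx.Rollback()
	}()
}
```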

Benchmark result (PR rebased on 5/8/2019)

Setup

Generated 100K keys

tools/benchmark/benchmark put --key-space-size 100000 --key-size 256 --val-size 10240 --total 100000 --sequential-keys

Throughput benchmark

1. Short reads

Result is comparable.

2. Writes

Result is comparable.
#10523 (comment)

3. Short reads + Writes

Result is comparable; this PR is slightly worse in the P99 and P999 read latency distribution.
This is likely the worst-case scenario for this PR: since the server is overloaded by writes, the backend read buffer is loaded with pending writes.
#10523 (comment)

4. Long reads + Writes

Result: this PR increased write throughput by 69.5%, reduced P99 write latency by 90.8%, reduced P999 write latency by 85.3%.
This is the scenario this PR tries to improve.
#10523 (comment)

5. Long reads + (short) Reads + Writes

Result: In the presence of long, expensive reads, short read throughput is increased by 77.5%, P99 read latency is reduced by 93.0%, and P999 read latency is reduced by 82.9%; write throughput is increased by 85.3%, P99 write latency is reduced by 94.4%, and P999 write latency is reduced by 83.9%.
#10523 (comment)

Kubernetes scalability test result

Posted here: #10523 (comment)

@jingyih
Contributor Author

jingyih commented Mar 6, 2019

/cc @xiang90 @jpbetz @heyitsanthony

@jingyih
Contributor Author

jingyih commented Mar 6, 2019

Benchmark

Setup:
We used the same setup as in #9384 (comment), where we have 3 types of client requests running concurrently: long reads, short reads, and writes.

Result:
-Master
Short read response time histogram:
0.0006 [1] |
0.0741 [2609] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.1475 [0] |
0.2210 [0] |
0.2945 [0] |
0.3680 [17] |
0.4414 [121] |∎
0.5149 [151] |∎∎
0.5884 [79] |∎
0.6618 [12] |
0.7353 [10] |

Write response time histogram:
0.0004 [1] |
0.0736 [638] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.1468 [57] |∎∎∎
0.2199 [55] |∎∎∎
0.2931 [57] |∎∎∎
0.3663 [56] |∎∎∎
0.4395 [75] |∎∎∎∎
0.5127 [44] |∎∎
0.5859 [12] |
0.6590 [3] |
0.7322 [2] |

Long read response time histogram:
1.7072 [1] |∎∎∎
1.7822 [3] |∎∎∎∎∎∎∎∎∎∎
1.8572 [9] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1.9322 [12] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.0073 [8] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.0823 [2] |∎∎∎∎∎∎
2.1573 [2] |∎∎∎∎∎∎
2.2323 [0] |
2.3073 [0] |
2.3824 [2] |∎∎∎∎∎∎
2.4574 [1] |∎∎∎

-This PR
Short read response time histogram:
0.0006 [1] |
0.0082 [2831] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0158 [36] |
0.0234 [22] |
0.0309 [24] |
0.0385 [20] |
0.0461 [21] |
0.0537 [26] |
0.0613 [14] |
0.0689 [4] |
0.0765 [1] |

Write response time histogram:
0.0004 [1] |
0.0068 [936] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0131 [28] |∎
0.0195 [7] |
0.0258 [4] |
0.0322 [2] |
0.0385 [4] |
0.0449 [2] |
0.0512 [6] |
0.0576 [6] |
0.0639 [4] |

Long read response time histogram:
1.6783 [1] |∎∎∎∎∎
1.7156 [5] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1.7528 [5] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1.7901 [8] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1.8274 [7] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1.8647 [5] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1.9019 [2] |∎∎∎∎∎∎∎∎∎∎
1.9392 [2] |∎∎∎∎∎∎∎∎∎∎
1.9765 [1] |∎∎∎∎∎
2.0138 [2] |∎∎∎∎∎∎∎∎∎∎
2.0510 [2] |∎∎∎∎∎∎∎∎∎∎

@jingyih
Contributor Author

jingyih commented Mar 6, 2019

Please ignore the conflict; this PR needs a rebase onto #10506.

@jingyih jingyih added the WIP label Mar 6, 2019
@jingyih jingyih changed the title [WIP] mvcc: full concurrent read [WIP] mvcc: fully concurrent read Mar 6, 2019
Resolved review threads: mvcc/backend/backend.go, mvcc/backend/read_tx.go, mvcc/backend/batch_tx.go
@xiang90
Contributor

xiang90 commented Mar 11, 2019

@jingyih Have you done a throughput benchmark? Copying the cache might affect throughput but not the latency.

@jingyih
Contributor Author

jingyih commented Mar 11, 2019

@xiang90

We do plan to get more benchmark results. Do you have suggestions on what kind of workload we should use?

Copying the cache will affect the system most when the cache size is large, i.e., when the system is write-heavy. We plan to design a workload to simulate the worst case. We are also thinking about adding metrics (for debugging purposes only) to measure the cache size and the time it takes to make the copy.

@jingyih
Contributor Author

jingyih commented Mar 11, 2019

@xiang90

BTW, in the previous benchmark, we were already hitting the throughput limit, i.e., the rate we requested as described in #9384 (comment) is higher than the rate actually served.

Edit (5/9/2019): this statement is not correct.

@jingyih
Contributor Author

jingyih commented May 9, 2019

Rebased to master.

@jingyih
Contributor Author

jingyih commented May 9, 2019

Throughput comparison: short reads

tools/benchmark/benchmark range "A" "Z" --total 300000 --rate 10000 --conns 10 --clients 10

(Note that the requested ops per second is higher than actual served ops per second.)

Master

Summary:
Total: 45.4235 secs.
Slowest: 0.0167 secs.
Fastest: 0.0002 secs.
Average: 0.0010 secs.
Stddev: 0.0014 secs.
Requests/sec: 6604.5057

Response time histogram:
0.0002 [1] |
0.0019 [265438] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0035 [17638] |∎∎
0.0052 [7170] |∎
0.0068 [4330] |
0.0084 [3031] |
0.0101 [1845] |
0.0117 [450] |
0.0134 [64] |
0.0150 [23] |
0.0167 [10] |

Latency distribution:
10% in 0.0003 secs.
25% in 0.0004 secs.
50% in 0.0005 secs.
75% in 0.0008 secs.
90% in 0.0021 secs.
95% in 0.0039 secs.
99% in 0.0081 secs.
99.9% in 0.0105 secs.

Master + PR

Summary:
Total: 44.9010 secs.
Slowest: 0.0147 secs.
Fastest: 0.0002 secs.
Average: 0.0010 secs.
Stddev: 0.0014 secs.
Requests/sec: 6681.3650

Response time histogram:
0.0002 [1] |
0.0017 [260698] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0031 [20088] |∎∎∎
0.0046 [8230] |∎
0.0060 [4482] |
0.0075 [2960] |
0.0089 [2165] |
0.0103 [1202] |
0.0118 [151] |
0.0132 [20] |
0.0147 [3] |

Latency distribution:
10% in 0.0003 secs.
25% in 0.0004 secs.
50% in 0.0005 secs.
75% in 0.0008 secs.
90% in 0.0022 secs.
95% in 0.0037 secs.
99% in 0.0078 secs.
99.9% in 0.0101 secs.

@jingyih
Contributor Author

jingyih commented May 9, 2019

Throughput comparison: writes

tools/benchmark/benchmark put --key-space-size 100 --key-size 256 --val-size 10240 --total 60000 --rate 2000 --clients 10 --conns 10

(Note that the requested ops per second is higher than actual served ops per second.)

Master

Summary:
Total: 38.4857 secs.
Slowest: 0.2721 secs.
Fastest: 0.0003 secs.
Average: 0.0041 secs.
Stddev: 0.0058 secs.
Requests/sec: 1559.0213

Response time histogram:
0.0003 [1] |
0.0275 [59633] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0547 [356] |
0.0818 [0] |
0.1090 [0] |
0.1362 [0] |
0.1634 [0] |
0.1906 [0] |
0.2177 [0] |
0.2449 [0] |
0.2721 [10] |

Latency distribution:
10% in 0.0011 secs.
25% in 0.0018 secs.
50% in 0.0030 secs.
75% in 0.0042 secs.
90% in 0.0063 secs.
95% in 0.0153 secs.
99% in 0.0260 secs.
99.9% in 0.0333 secs.

Master + PR

Summary:
Total: 38.0085 secs.
Slowest: 0.0395 secs.
Fastest: 0.0003 secs.
Average: 0.0040 secs.
Stddev: 0.0047 secs.
Requests/sec: 1578.5939

Response time histogram:
0.0003 [1] |
0.0042 [45641] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0081 [10040] |∎∎∎∎∎∎∎∎
0.0120 [1093] |
0.0160 [333] |
0.0199 [822] |
0.0238 [1011] |
0.0277 [700] |
0.0316 [267] |
0.0356 [72] |
0.0395 [20] |

Latency distribution:
10% in 0.0011 secs.
25% in 0.0018 secs.
50% in 0.0030 secs.
75% in 0.0042 secs.
90% in 0.0059 secs.
95% in 0.0151 secs.
99% in 0.0261 secs.
99.9% in 0.0328 secs.

@jingyih
Contributor Author

jingyih commented May 9, 2019

Throughput comparison: short reads + writes

tools/benchmark/benchmark range "A" "Z" --total 50000 --rate 10000 --conns 10 --clients 10

tools/benchmark/benchmark put --key-space-size 100 --key-size 256 --val-size 10240 --total 60000 --rate 2000 --clients 10 --conns 10

(Note that the requested ops per second is higher than actual served ops per second.)

Master

Read

Summary:
Total: 40.9591 secs.
Slowest: 0.2451 secs.
Fastest: 0.0003 secs.
Average: 0.0080 secs.
Stddev: 0.0068 secs.
Requests/sec: 1220.7308

Response time histogram:
0.0003 [1] |
0.0248 [48584] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0493 [1405] |∎
0.0737 [0] |
0.0982 [0] |
0.1227 [0] |
0.1472 [0] |
0.1717 [0] |
0.1961 [0] |
0.2206 [0] |
0.2451 [10] |

Latency distribution:
10% in 0.0011 secs.
25% in 0.0024 secs.
50% in 0.0081 secs.
75% in 0.0092 secs.
90% in 0.0138 secs.
95% in 0.0214 secs.
99% in 0.0286 secs.
99.9% in 0.0357 secs.

Write

Summary:
Total: 38.7327 secs.
Slowest: 0.2424 secs.
Fastest: 0.0004 secs.
Average: 0.0044 secs.
Stddev: 0.0053 secs.
Requests/sec: 1549.0771

Response time histogram:
0.0004 [1] |
0.0246 [59568] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0488 [421] |
0.0730 [0] |
0.0972 [0] |
0.1214 [0] |
0.1456 [0] |
0.1698 [0] |
0.1940 [0] |
0.2182 [0] |
0.2424 [10] |

Latency distribution:
10% in 0.0014 secs.
25% in 0.0022 secs.
50% in 0.0034 secs.
75% in 0.0045 secs.
90% in 0.0071 secs.
95% in 0.0153 secs.
99% in 0.0236 secs.
99.9% in 0.0290 secs.

Master + PR

Read

Summary:
Total: 40.4403 secs.
Slowest: 0.2657 secs.
Fastest: 0.0003 secs.
Average: 0.0079 secs.
Stddev: 0.0071 secs.
Requests/sec: 1236.3901

Response time histogram:
0.0003 [1] |
0.0268 [48986] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0534 [1003] |
0.0799 [0] |
0.1065 [0] |
0.1330 [0] |
0.1596 [0] |
0.1861 [0] |
0.2126 [0] |
0.2392 [0] |
0.2657 [10] |

Latency distribution:
10% in 0.0009 secs.
25% in 0.0020 secs.
50% in 0.0081 secs.
75% in 0.0091 secs.
90% in 0.0135 secs.
95% in 0.0212 secs.
99% in 0.0299 secs.
99.9% in 0.0408 secs.

Write

Summary:
Total: 38.1133 secs.
Slowest: 0.2609 secs.
Fastest: 0.0003 secs.
Average: 0.0041 secs.
Stddev: 0.0054 secs.
Requests/sec: 1574.2543

Response time histogram:
0.0003 [1] |
0.0264 [59700] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0524 [289] |
0.0785 [0] |
0.1046 [0] |
0.1306 [0] |
0.1567 [0] |
0.1827 [0] |
0.2088 [0] |
0.2348 [0] |
0.2609 [10] |

Latency distribution:
10% in 0.0011 secs.
25% in 0.0019 secs.
50% in 0.0031 secs.
75% in 0.0042 secs.
90% in 0.0066 secs.
95% in 0.0146 secs.
99% in 0.0238 secs.
99.9% in 0.0339 secs.

@jingyih
Contributor Author

jingyih commented May 9, 2019

Throughput comparison: long reads + writes

tools/benchmark/benchmark range --prefix "" --total 60 --rate 1 --conns 2 --clients 2

tools/benchmark/benchmark put --key-space-size 100 --key-size 256 --val-size 10240 --total 60000 --rate 2000 --clients 10 --conns 10

(Note that the requested ops per second is higher than actual served ops per second.)

Master

Long Read

Summary:
Total: 62.1513 secs.
Slowest: 2.9222 secs.
Fastest: 1.6638 secs.
Average: 1.8026 secs.
Stddev: 0.2047 secs.
Requests/sec: 0.9654

Response time histogram:
1.6638 [1] |
1.7896 [44] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1.9154 [11] |∎∎∎∎∎∎∎∎∎∎
2.0413 [0] |
2.1671 [0] |
2.2930 [2] |∎
2.4188 [0] |
2.5447 [1] |
2.6705 [0] |
2.7964 [0] |
2.9222 [1] |

Latency distribution:
10% in 1.6877 secs.
25% in 1.7274 secs.
50% in 1.7508 secs.
75% in 1.7909 secs.
90% in 1.8802 secs.
95% in 2.2883 secs.

Write

Summary:
Total: 65.8218 secs.
Slowest: 0.5919 secs.
Fastest: 0.0003 secs.
Average: 0.0086 secs.
Stddev: 0.0465 secs.
Requests/sec: 911.5515

Response time histogram:
0.0003 [1] |
0.0595 [59265] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.1186 [134] |
0.1778 [0] |
0.2370 [0] |
0.2961 [10] |
0.3553 [0] |
0.4145 [62] |
0.4736 [259] |
0.5328 [240] |
0.5919 [29] |

Latency distribution:
10% in 0.0010 secs.
25% in 0.0017 secs.
50% in 0.0029 secs.
75% in 0.0041 secs.
90% in 0.0065 secs.
95% in 0.0166 secs.
99% in 0.2727 secs.
99.9% in 0.5160 secs.

Master + PR

Long Read

Summary:
Total: 61.5149 secs.
Slowest: 2.5447 secs.
Fastest: 1.6309 secs.
Average: 1.7897 secs.
Stddev: 0.1694 secs.
Requests/sec: 0.9754

Response time histogram:
1.6309 [1] |∎
1.7223 [20] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1.8136 [29] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1.9050 [2] |∎∎
1.9964 [2] |∎∎
2.0878 [2] |∎∎
2.1792 [2] |∎∎
2.2705 [0] |
2.3619 [0] |
2.4533 [0] |
2.5447 [2] |∎∎

Latency distribution:
10% in 1.6706 secs.
25% in 1.7047 secs.
50% in 1.7399 secs.
75% in 1.7940 secs.
90% in 2.0302 secs.
95% in 2.1252 secs.

Write

Summary:
Total: 38.8433 secs.
Slowest: 0.0912 secs.
Fastest: 0.0003 secs.
Average: 0.0040 secs.
Stddev: 0.0061 secs.
Requests/sec: 1544.6694

Response time histogram:
0.0003 [1] |
0.0094 [56287] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0185 [1841] |∎
0.0276 [1398] |
0.0367 [133] |
0.0457 [50] |
0.0548 [22] |
0.0639 [111] |
0.0730 [68] |
0.0821 [66] |
0.0912 [23] |

Latency distribution:
10% in 0.0010 secs.
25% in 0.0017 secs.
50% in 0.0028 secs.
75% in 0.0040 secs.
90% in 0.0061 secs.
95% in 0.0145 secs.
99% in 0.0251 secs.
99.9% in 0.0760 secs.

@jingyih jingyih requested review from jpbetz and xiang90 May 9, 2019 18:07
@jingyih
Contributor Author

jingyih commented May 10, 2019

@jingyih Have you done a throughput benchmark? Copying the cache might affect throughput but not the latency.

@xiang90
Updated with new benchmark results (for throughput). Summarized at the end of comment #1.

@jingyih
Contributor Author

jingyih commented May 10, 2019

Throughput comparison: long reads + (short) reads + writes

tools/benchmark/benchmark range --prefix "" --total 80 --rate 1 --conns 2 --clients 2

tools/benchmark/benchmark range "A" "Z" --total 50000 --rate 10000 --conns 10 --clients 10

tools/benchmark/benchmark put --key-space-size 100 --key-size 256 --val-size 10240 --total 60000 --rate 2000 --clients 10 --conns 10

Master

Long Read

Summary:
Total: 81.9649 secs.
Slowest: 2.5878 secs.
Fastest: 1.6621 secs.
Average: 1.8583 secs.
Stddev: 0.1484 secs.
Requests/sec: 0.9760

Response time histogram:
1.6621 [1] |
1.7547 [4] |∎∎∎
1.8473 [52] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1.9398 [14] |∎∎∎∎∎∎∎∎∎∎
2.0324 [4] |∎∎∎
2.1249 [0] |
2.2175 [1] |
2.3101 [0] |
2.4026 [2] |∎
2.4952 [1] |
2.5878 [1] |

Latency distribution:
10% in 1.7674 secs.
25% in 1.7961 secs.
50% in 1.8277 secs.
75% in 1.8530 secs.
90% in 1.9546 secs.
95% in 2.3388 secs.

(Short) Read

Summary:
Total: 78.3127 secs.
Slowest: 0.8620 secs.
Fastest: 0.0003 secs.
Average: 0.0155 secs.
Stddev: 0.0597 secs.
Requests/sec: 638.4658

Response time histogram:
0.0003 [1] |
0.0865 [49225] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.1727 [14] |
0.2588 [10] |
0.3450 [10] |
0.4312 [58] |
0.5173 [554] |
0.6035 [118] |
0.6897 [0] |
0.7759 [0] |
0.8620 [10] |

Latency distribution:
10% in 0.0014 secs.
25% in 0.0047 secs.
50% in 0.0081 secs.
75% in 0.0094 secs.
90% in 0.0149 secs.
95% in 0.0228 secs.
99% in 0.4799 secs.
99.9% in 0.5367 secs.

Write

Summary:
Total: 75.1863 secs.
Slowest: 0.8549 secs.
Fastest: 0.0004 secs.
Average: 0.0105 secs.
Stddev: 0.0540 secs.
Requests/sec: 798.0180

Response time histogram:
0.0004 [1] |
0.0858 [59228] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.1713 [10] |
0.2567 [10] |
0.3422 [10] |
0.4276 [71] |
0.5131 [549] |
0.5985 [112] |
0.6840 [0] |
0.7694 [0] |
0.8549 [9] |

Latency distribution:
10% in 0.0015 secs.
25% in 0.0023 secs.
50% in 0.0035 secs.
75% in 0.0046 secs.
90% in 0.0078 secs.
95% in 0.0170 secs.
99% in 0.4489 secs.
99.9% in 0.5298 secs.

Master + PR

Long Read

Summary:
Total: 83.0383 secs.
Slowest: 2.6376 secs.
Fastest: 1.6201 secs.
Average: 1.8671 secs.
Stddev: 0.2233 secs.
Requests/sec: 0.9634

Response time histogram:
1.6201 [1] |∎
1.7218 [25] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1.8236 [10] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
1.9253 [23] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.0271 [9] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎
2.1289 [4] |∎∎∎∎∎∎
2.2306 [2] |∎∎∎
2.3324 [2] |∎∎∎
2.4341 [0] |
2.5359 [0] |
2.6376 [4] |∎∎∎∎∎∎

Latency distribution:
10% in 1.6657 secs.
25% in 1.7005 secs.
50% in 1.8408 secs.
75% in 1.9274 secs.
90% in 2.1897 secs.
95% in 2.5555 secs.

(Short) Read

Summary:
Total: 44.1265 secs.
Slowest: 0.4275 secs.
Fastest: 0.0003 secs.
Average: 0.0086 secs.
Stddev: 0.0108 secs.
Requests/sec: 1133.1050

Response time histogram:
0.0003 [1] |
0.0431 [49631] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0858 [217] |
0.1285 [131] |
0.1712 [0] |
0.2139 [0] |
0.2567 [0] |
0.2994 [10] |
0.3421 [0] |
0.3848 [0] |
0.4275 [10] |

Latency distribution:
10% in 0.0013 secs.
25% in 0.0037 secs.
50% in 0.0081 secs.
75% in 0.0094 secs.
90% in 0.0139 secs.
95% in 0.0213 secs.
99% in 0.0336 secs.
99.9% in 0.0919 secs.

Write

Summary:
Total: 40.5730 secs.
Slowest: 0.4169 secs.
Fastest: 0.0003 secs.
Average: 0.0046 secs.
Stddev: 0.0091 secs.
Requests/sec: 1478.8157

Response time histogram:
0.0003 [1] |
0.0420 [59668] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0836 [255] |
0.1253 [56] |
0.1669 [0] |
0.2086 [0] |
0.2503 [0] |
0.2919 [10] |
0.3336 [0] |
0.3752 [0] |
0.4169 [10] |

Latency distribution:
10% in 0.0012 secs.
25% in 0.0020 secs.
50% in 0.0032 secs.
75% in 0.0043 secs.
90% in 0.0070 secs.
95% in 0.0153 secs.
99% in 0.0252 secs.
99.9% in 0.0851 secs.

@jingyih jingyih changed the title [WIP] mvcc: fully concurrent read mvcc: fully concurrent read May 29, 2019
@jingyih jingyih removed the WIP label May 29, 2019
@jingyih
Contributor Author

jingyih commented Jun 5, 2019

GCE 5k node scalability test result

SIG-scalability lead @wojtek-t helped me run the test and provided a quick analysis.

Result compared to baseline

Using test run 1130383116527996934 as the baseline. It ran right before 1130745634945503235.

  • Expensive LIST requests themselves are comparable. Using "LIST pods" as an example:
Latency (ms)    Baseline    Image built from this PR
P50             1927        2185
P90             5280        5349
P99             7725        7871

Ref: http://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=APIServer&metricname=DensityResponsiveness&Resource=pods&Scope=cluster&Subresource=&Verb=LIST

  • At the very beginning of the test, when there are a lot of expensive "LIST pods" requests, POST requests have much better latency. The example here is "POST clusterrolebindings":
Latency (ms)    Baseline    Image built from this PR
P50             2.40        2.67
P90             67.4        11.2
P99             993         25.4

Ref: http://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=APIServer&metricname=DensityResponsiveness&Resource=clusterrolebindings&Scope=cluster&Subresource=&Verb=POST

  • "POST pods" also has better latency, this type of request spread through most part of the test.
Latency (ms) Baseline Image built from this PR
P50 18.4 3.16
P90 400 38.6
P99 962 218

Ref: http://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=APIServer&metricname=DensityResponsiveness&Resource=pods&Scope=namespace&Subresource=&Verb=POST

@xiang90 xiang90 self-assigned this Jun 5, 2019
@xiang90
Contributor

xiang90 commented Jun 5, 2019

The P99 of POST pod is still bad. Probably worth some effort to dig.

@jingyih
Contributor Author

jingyih commented Jun 5, 2019

The P99 of POST pod is still bad. Probably worth some effort to dig.

My understanding is that there was a scalability regression back then; the scalability team spent a couple of weeks trying to root-cause it. (Not sure about the current status.)

This test was run during their debugging effort. The P99 of POST pod is bad, but not due to this PR.

@jingyih
Contributor Author

jingyih commented Jun 11, 2019

@xiang90 I updated the test: 2a9320e. Is this what you had in mind?

Add TestConcurrentReadTxAndWrite which creates random reads and writes,
and ensures reads always see latest writes.
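
A rough sketch of what such a randomized test can look like is below; the helpers (newTestBackend, randKV, copySnapshot, unsafeRangeAll) are hypothetical and only illustrate the idea, they are not the merged test code:

```go
func TestConcurrentReadTxAndWrite(t *testing.T) {
	b := newTestBackend(t) // hypothetical helper; assumed to create a "key" bucket

	var (
		mu        sync.Mutex
		committed = map[string]string{} // ground truth: writes that have been committed
		wg        sync.WaitGroup
	)

	// Writers: commit random key-values and record them once Commit has returned.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				k, v := randKV() // hypothetical helper
				tx := b.BatchTx()
				tx.Lock()
				tx.UnsafePut([]byte("key"), []byte(k), []byte(v))
				tx.Unlock()
				tx.Commit()
				mu.Lock()
				committed[k] = v
				mu.Unlock()
			}
		}()
	}

	// Readers: every write committed before a read starts must be visible to it.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				mu.Lock()
				want := copySnapshot(committed) // hypothetical map copy
				mu.Unlock()
				rt := b.ConcurrentReadTx()
				rt.RLock()
				got := unsafeRangeAll(rt, "key") // hypothetical full-range read
				rt.RUnlock()
				for k, v := range want {
					if got[k] != v {
						t.Errorf("concurrent read missed committed write %s=%s", k, v)
					}
				}
			}
		}()
	}
	wg.Wait()
}
```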
@xiang90
Contributor

xiang90 commented Jun 12, 2019

@jingyih

yes. that is what i would expect.

@jingyih
Contributor Author

jingyih commented Jun 12, 2019

how long does it take to finish this test?

I think locally it takes between 1 and 2 seconds to finish. On Travis I did not see timing for each individual unit test, but I compared the finishing time of the mvcc package with and without the new test, and it is comparable.
https://travis-ci.com/etcd-io/etcd/jobs/207264327#L591
https://travis-ci.com/etcd-io/etcd/jobs/207257875#L591

@xiang90
Contributor

xiang90 commented Jun 12, 2019

LGTM.

maybe @jpbetz wants to give this a final review.

@jingyih
Contributor Author

jingyih commented Jun 12, 2019

Thanks! @xiang90

mvcc/backend/backend.go (review thread, resolved)
@@ -493,6 +518,7 @@ func (b *backend) begin(write bool) *bolt.Tx {
db := tx.DB()
atomic.StoreInt64(&b.size, size)
atomic.StoreInt64(&b.sizeInUse, size-(int64(db.Stats().FreePageN)*int64(db.Info().PageSize)))
Contributor

nit: Since the db.Stats() call acquires a read lock, only call it once here (instead of once for FreePageN and once for OpenTxN) ?

Contributor Author

Sounds good.
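
For reference, the suggestion would look roughly like the following sketch (the openReadTxN field is illustrative, not necessarily the real field name):

```go
// Call db.Stats() once and reuse the result, since each call acquires a
// read lock on the bolt DB.
stats := db.Stats()
atomic.StoreInt64(&b.size, size)
atomic.StoreInt64(&b.sizeInUse, size-(int64(stats.FreePageN)*int64(db.Info().PageSize)))
atomic.StoreInt64(&b.openReadTxN, int64(stats.OpenTxN)) // hypothetical counter of open read txns
```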

@@ -323,6 +319,17 @@ func (t *batchTxBuffered) unsafeCommit(stop bool) {
}
}

func waitAndRollback(tx *bolt.Tx, wg *sync.WaitGroup, lg *zap.Logger) {
Contributor

Should this be a receiver function of batchTxBuffered? The parameters don't all appear necessary.

Contributor Author

tx and wg need to be passed into the function. batchTxBuffered's tx and wg will be reset / re-created immediately after this function call, whereas this function waits until all ongoing reads using the input tx are done.

Contributor

Making this a receiver and moving the logger out of the params might make that even clearer.

Contributor Author

Did you mean something like func (lg *zap.Logger) waitAndRollback(tx *bolt.Tx, wg *sync.WaitGroup)? I have mixed feelings about this. By convention (in this repo), the zap logger is passed to functions as an input parameter.

Contributor

I was thinking more like func (t *batchTxBuffered) waitAndRollback(tx *bolt.Tx, wg *sync.WaitGroup), so that the only params that need to be passed are the ones that are semantically important to the call. But this is minor.

Contributor

Or just inline the function in unsafeCommit? It doesn't seem to benefit from being separate.

Contributor Author

Good idea.
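
For context, here is a hedged sketch of what the inlined form could look like; field names such as readTx.txWg and readTx.reset are illustrative, not necessarily the merged code:

```go
func (t *batchTxBuffered) unsafeCommit(stop bool) {
	if t.backend.readTx.tx != nil {
		// Roll back the old boltdb read Tx asynchronously, only after every
		// ongoing kvstore read transaction pinned to it has finished.
		go func(tx *bolt.Tx, wg *sync.WaitGroup) {
			wg.Wait()
			if err := tx.Rollback(); err != nil {
				t.backend.lg.Fatal("failed to rollback tx", zap.Error(err))
			}
		}(t.backend.readTx.tx, t.backend.readTx.txWg)
	}
	t.batchTx.commit(stop)
	if !stop {
		// Immediately start the next batch interval with a fresh read buffer and Tx.
		t.backend.readTx.reset()
		t.backend.readTx.tx = t.backend.begin(false)
	}
}
```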


func (rt *concurrentReadTx) Lock() {}
func (rt *concurrentReadTx) Unlock() {}
func (rt *concurrentReadTx) RLock() {}
Contributor

I'm a bit surprised we don't do anything in RLock(). Should we try to adhere to the lock contract more closely (i.e., move logic from ConcurrentReadTx to here)? If not, we should clearly document what's going on.

Should Lock() and Unlock() be no-ops? Alternatives would be to delegate to RLock()/RUnlock() or to panic.

Contributor Author

This one is tricky. ConcurrentReadTx() holds a lock while it is being created (and everything needs to happen while holding that lock, so I cannot really move part of the logic to a separate function). After creation, there is no need for (or concept of) locking and unlocking, so these functions end up being no-ops. They only exist so that *concurrentReadTx implements the ReadTx interface.

Contributor

Adding some comments with what you just said would certainly make it easier to understand.

The violation of the ReadTx interface guarantees does make for some odd behavior, e.g.:

tx := ConcurrentReadTx()
tx.RLock()
// do something
tx.RUnlock()
tx.RLock()
// do something
tx.RUnlock()

Would certainly not do what a developer might expect.

Contributor Author

I agree. I am not happy with the current state. The current implementation of ConcurrentReadTx is compromised in readability because I wanted to make minimal changes to the various other existing interfaces. (I tried to define a new interface for the concurrent read Tx, but it caused ripple effects and the end result was kind of messy.) Fortunately, the ReadTx interface is used internally in the mvcc backend; developers normally do not need to interact with this level. Instead, the mvcc store-level transaction interface [1] should be enough (and it is kept unchanged).

But I will add some comments to try to clarify the implementation.

[1]

etcd/mvcc/kv.go

Lines 55 to 57 in 0de9b8a

// TxnRead represents a read-only transaction with operations that will not
// block other read transactions.
type TxnRead interface {

Contributor

Okay, let's loop back on this interface later. The comments should be sufficient for now.
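
For reference, the comments being discussed could look roughly like this sketch (the wording and the txWg field are illustrative):

```go
// Lock, Unlock and RLock are no-ops: a concurrentReadTx reads from its own
// private buffer copy plus a boltdb Tx pinned at creation time, so there is
// nothing to lock afterwards. These methods exist only so that
// *concurrentReadTx satisfies the ReadTx interface.
func (rt *concurrentReadTx) Lock()   {}
func (rt *concurrentReadTx) Unlock() {}
func (rt *concurrentReadTx) RLock()  {}

// RUnlock is the exception: it marks the end of this read transaction, e.g.
// by releasing the reference that keeps the underlying boltdb Tx alive.
func (rt *concurrentReadTx) RUnlock() { rt.txWg.Done() }
```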

mvcc/backend/read_tx.go (review thread, resolved)
@jingyih
Contributor Author

jingyih commented Jun 13, 2019

@jpbetz Tried to address your comments in a823555. PTAL.

Contributor

@jpbetz jpbetz left a comment

lgtm

@jpbetz
Contributor

jpbetz commented Jun 13, 2019

From Travis CI:

gofmt checking failed:
mvcc/kvstore_txn.go
diff -u mvcc/kvstore_txn.go.orig mvcc/kvstore_txn.go
--- mvcc/kvstore_txn.go.orig	2019-06-13 00:24:39.392461717 +0000
+++ mvcc/kvstore_txn.go	2019-06-13 00:24:39.392461717 +0000
@@ -49,7 +49,7 @@
 }
 
 func (tr *storeTxnRead) End() {
-	tr.tx.RUnlock()	// RUnlock signals the end of concurrentReadTx.
+	tr.tx.RUnlock() // RUnlock signals the end of concurrentReadTx.
 	tr.s.mu.RUnlock()
 }

Some whitespace difference?

@jingyih
Contributor Author

jingyih commented Jun 14, 2019

@jpbetz Updated. PTAL.

@jpbetz
Contributor

jpbetz commented Jun 14, 2019

Great work!

LGTM

@xiang90
Contributor

xiang90 commented Jun 14, 2019

@jingyih can you check the CI failures?

@jingyih
Contributor Author

jingyih commented Jun 14, 2019

Travis CI failed with:

=== RUN TestLeasingGetWithOpts
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x60 pc=0xb5c7ca]
goroutine 24216 [running]:
go.etcd.io/etcd/vendor/google.golang.org/grpc.(*csAttempt).sendMsg(0xc00065c160, 0x1346c20, 0xc0003b0ec0, 0xc002edc378, 0x5, 0x5, 0xc002edc380, 0xb, 0xb, 0xc002edc380, ...)
/go/src/go.etcd.io/etcd/vendor/google.golang.org/grpc/stream.go:713 +0x14a
go.etcd.io/etcd/vendor/google.golang.org/grpc.(*clientStream).SendMsg.func2(0xc00065c160, 0xb5b88f, 0xc00016f130)
/go/src/go.etcd.io/etcd/vendor/google.golang.org/grpc/stream.go:637 +0x186
go.etcd.io/etcd/vendor/google.golang.org/grpc.(*clientStream).withRetry(0xc00183a240, 0xc00016f130, 0xc0003edaa8, 0xc002edc378, 0x0)
/go/src/go.etcd.io/etcd/vendor/google.golang.org/grpc/stream.go:530 +0x43d
go.etcd.io/etcd/vendor/google.golang.org/grpc.(*clientStream).SendMsg(0xc00183a240, 0x1346c20, 0xc0003b0ec0, 0x0, 0x0)
/go/src/go.etcd.io/etcd/vendor/google.golang.org/grpc/stream.go:643 +0x777
go.etcd.io/etcd/etcdserver/etcdserverpb.(*watchWatchClient).Send(0xc00088cbb0, 0xc0003b0ec0, 0x0, 0x1e53018)
/go/src/go.etcd.io/etcd/etcdserver/etcdserverpb/rpc.pb.go:3697 +0x6b
go.etcd.io/etcd/clientv3.(*watchGrpcStream).run(0xc0001275f0)
/go/src/go.etcd.io/etcd/clientv3/watch.go:546 +0xd9b
created by go.etcd.io/etcd/clientv3.(*watcher).newWatcherGrpcStream
/go/src/go.etcd.io/etcd/clientv3/watch.go:277 +0x58c

@jingyih
Contributor Author

jingyih commented Jun 14, 2019

The stack trace looks basically the same as those of the known flaky "TestLeasingXXX" tests listed in #10700. I don't think it is related to this PR.

@jingyih
Contributor Author

jingyih commented Jun 14, 2019

SemaphoreCI failed with a known flaky test, TestCtlV3AuthFromKeyPerm.

@xiang90 xiang90 merged commit 2c5162a into etcd-io:master Jun 14, 2019
@wojtek-t
Contributor

This is awesome to see landing before the 3.4 release - great work @jingyih, and thanks a lot to the reviewers for making this happen.

Just to answer one thing (sorry, 3 weeks on vacation)

The P99 of POST pod is still bad. Probably worth some effort to dig.

@xiang90 - note that those aren't etcd latencies, but apiserver latencies. The latencies are affected mostly by apiserver-related work. That experimental image obviously doesn't help with what the apiserver itself is doing, thus the numbers are not as small as we would like them to be :)
