-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Description
Final-finally-finalizationally, here's an asynchronous, safe, atomic cut-over phase.
This solution doesn't cause "table outage" (as in #65).
Here are the steps for a safe, atomic cut-over:
The solution we offer is now based on two connections only (as opposed to three, in the optimistic approach). "Our" connections will be C10, C20. The "normal" app connections are C1..C9, C11..C19, C21..C29.
-
Connections C1..C9 operate on
tbl
with normal DML:INSERT, UPDATE, DELETE
-
Connection C10:
CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
-
Connection C10:
LOCK TABLES tbl WRITE, tbl_old WRITE
-
Connections C11..C19, newly incoming, issue queries on
tbl
but are blocked due to theLOCK
-
Connection C20:
RENAME TABLE tbl TO tbl_old, ghost TO tbl
- This is blocked due to the
LOCK
, but gets prioritized on top connections C11..C19 and on top C1..C9 or any other connection that attempts DML ontbl
- This is blocked due to the
-
Connections C21..C29, newly incoming, issue queries on
tbl
but are blocked due to theLOCK
and due to theRENAME
, waiting in queue -
Connection C10: checks that C20's
RENAME
is applied (looks for the blockedRENAME
inshow processlist
) -
Connection 10:
DROP TABLE tbl_old
Nothing happens yet;tbl
is still locked. All other connections still blocked. -
Connection 10:
UNLOCK TABLES
BAM! The
RENAME
is first to execute, ghost table is swapped in place oftbl
, then C1..C9, C11..C19, C21..C29 all get to operate on the new and shinytbl
Some notes
- We create
tbl_old
as a blocker for a premature swap - It is allowed for a connection to
DROP
a table it has under aWRITE LOCK
- A blocked
RENAME
is always prioritized over a blockedINSERT/UPDATE/DELETE
, no matter who came first
What happens on failures?
Much fun. Just works; no rollback required.
- If C10 errors on the
CREATE
we do not proceed. - If C10 errors on the
LOCK
statement, we do not proceed. The table is not locked. App continues to operate as normal. - If C10 dies just as C20 is about to issue the
RENAME
:- The lock is released, the queries C1..C9, C11..C19 immediately operate on
tbl
. - C20's
RENAME
immediately fails becausetbl_old
exists. - The entire operation is failed, but nothing terrible happens; some queries were blocked for some time is all. We will need to retry everything
- The lock is released, the queries C1..C9, C11..C19 immediately operate on
- If C10 dies while C20 is blocked on
RENAME
: Mostly similar to the above. Lock released, then C20 fails theRENAME
(becausetbl_old
exists), then all queries resume normal operation - If C20 dies before C10 drops the table, we catch the error and let C10 proceed as planned:
DROP
,UNLOCK
. Nothing terrible happens, some queries were blocked for some time. We will need to retry - If C20 dies just after C10
DROP
s the table but before the unlock, same as above. - If both C10 and C20 die, no problem:
LOCK
is cleared;RENAME
lock is cleared. C1..C9, C11..C19, C21..C29 are free to operate ontbl
.
No matter what happens, at the end of operation we look for the ghost
table. Is it still there? Then we know the operation failed, "atomically". Is it not there? Then it has been renamed to tbl
, and the operation worked atomically.
A side note on failure is the matter of cleaning up the magic tbl_old
. Here this is a matter of taste. Maybe just let it live and avoid recreating it, or you can drop it if you like.
Impact on app
App connections are guaranteed to be blocked, either until ghost
is swapped in, or until operation fails. In the former, they proceed to operate on the new table. In the latter, they proceed to operate on the original table.
Impact on replication
Replication only sees the RENAME
. There is no LOCK
in the binary logs. Thus, replication sees an atomic two-table swap. There is no table-outage.
Activity
an atomic cut-over implementation, as per issue #82
baloo commentedon Aug 1, 2016
Hello @shlomi-noach ,
First thanks for opensourcing gh-ost :) It looks very good :)
The only question remaining in my mind was about this particular subject (cut-over phase), thanks for documenting it. But I believe I now have two more:
What ensures that no other connections writes to
tbl
table in-between those events:CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
LOCK TABLES tbl WRITE, tbl_old WRITE
wouldn't it be safer to:
LOCK TABLES tbl WRITE;
CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
LOCK TABLES tbl_old WRITE;
I believe this way the replication client knowns once the
CREATE_TABLE
occurs in the replication stream, no writes totbl
will occur ever. And it's safe to proceed further down (with the rename and so on).Other question: You assume:
From my understanding of the mysql documentation this is only guaranteed by the lock acquirement algorithm that mysql uses [https://dev.mysql.com/doc/refman/5.7/en/lock-tables.html]:
I believe it to be defined in https://github.com/percona/percona-server/blob/5.7/sql/mdl.h#L147
Although I believe the code tends to confirm your hypothesis, the documentation clearly says that user-code shouldn't assume on any ordering. Am I missing something?
shlomi-noach commentedon Aug 2, 2016
@baloo, thank you for your review!
There is nothing to ensure that, and there is no need to ensure that; the table creation is merely a step in the direction of beginning the cut-over. There is no problem that
tbl
still takes writes, these are still being read in the binary log and propagated to the ghost table.Moreover, if we reverse the order as you suggest, implication is we waste more time during the
lock
ed phase.This doesn't work like that in MySQL, see: http://dev.mysql.com/doc/refman/5.7/en/lock-tables.html:
You cannot just grab more locks as you go along, you have to grab them all at once.
The relevant code in MySQL is https://github.com/mysql/mysql-server/blob/a533e2c786164af9bd276660b972d93649434297/sql/mdl.cc#L2312. You are correct this relies on internal behavior. This behavior exists for over 15 years or so, and, having discussed (informally, Safe Harbour etc.) with engineers - has no intention of going away.
I'm glad you point it out, I'll make it explicit in the fine print.
This entire scheme came to be because of a limitation in MySQL, where you cannot
rename
a table you have underlock
in the same session. I'm in good mind to influence next MySQL version to support the above, rendering this entire algorithm unnecessary.baloo commentedon Aug 2, 2016
Thanks for clarifications :)
13 remaining items
shlomi-noach commentedon Apr 25, 2020
It can’t happen because gh-ost tails the binary log, either on master or on the replica, and there may be still events to handle
In the binary logs during the cut-over.
wukongHH commentedon Jun 11, 2021
Hi,has gh-ost already utilize this new functionality?
shlomi-noach commentedon Jun 11, 2021
It is unfortunate that the implementation actually does not meet gh-ost's requirement, and using the new implementation sadly complicates the logic rather than simplifies it. I therefore made no progress meeting the new implementation.
wukongHH commentedon Jun 11, 2021
Can I ask why the new implementation can't simplify the logic? I don't understand by myself.o(╥﹏╥)o
shlomi-noach commentedon Jun 11, 2021
cenkore commentedon Jun 16, 2021
@shlomi-noach
This entry should be incorrect. In this case, the table name should be cut-over normally, because after the session is broken, the lock is released on C10 and C20 will get the lock and cut-over table directly. Finally gh-ost will fail with error
Expected magic comment on _xxx_del, did not find it
in retry stage and exit, but the cut-over has actually been completed.shlomi-noach commentedon Jun 21, 2021
I don't think I have the bandwidth to solve this again. Unless someone else has a definitive solution, consider using
-cut-over=two-step