Skip to content

Describing safe, blocking, atomic, pure-mysql cut-over phase #82

@shlomi-noach

Description

@shlomi-noach
Contributor

Final-finally-finalizationally, here's an asynchronous, safe, atomic cut-over phase.
This solution doesn't cause "table outage" (as in #65).

Here are the steps for a safe, atomic cut-over:

The solution we offer is now based on two connections only (as opposed to three, in the optimistic approach). "Our" connections will be C10, C20. The "normal" app connections are C1..C9, C11..C19, C21..C29.

  • Connections C1..C9 operate on tbl with normal DML: INSERT, UPDATE, DELETE

  • Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'

  • Connection C10: LOCK TABLES tbl WRITE, tbl_old WRITE

  • Connections C11..C19, newly incoming, issue queries on tbl but are blocked due to the LOCK

  • Connection C20: RENAME TABLE tbl TO tbl_old, ghost TO tbl

    • This is blocked due to the LOCK, but gets prioritized on top connections C11..C19 and on top C1..C9 or any other connection that attempts DML on tbl
  • Connections C21..C29, newly incoming, issue queries on tbl but are blocked due to the LOCK and due to the RENAME, waiting in queue

  • Connection C10: checks that C20's RENAME is applied (looks for the blocked RENAME in show processlist)

  • Connection 10: DROP TABLE tbl_old
    Nothing happens yet; tbl is still locked. All other connections still blocked.

  • Connection 10: UNLOCK TABLES

    BAM! The RENAME is first to execute, ghost table is swapped in place of tbl, then C1..C9, C11..C19, C21..C29 all get to operate on the new and shiny tbl

Some notes

  • We create tbl_old as a blocker for a premature swap
  • It is allowed for a connection to DROP a table it has under a WRITE LOCK
  • A blocked RENAME is always prioritized over a blocked INSERT/UPDATE/DELETE, no matter who came first

What happens on failures?

Much fun. Just works; no rollback required.

  • If C10 errors on the CREATE we do not proceed.
  • If C10 errors on the LOCK statement, we do not proceed. The table is not locked. App continues to operate as normal.
  • If C10 dies just as C20 is about to issue the RENAME:
    • The lock is released, the queries C1..C9, C11..C19 immediately operate on tbl.
    • C20's RENAME immediately fails because tbl_old exists.
    • The entire operation is failed, but nothing terrible happens; some queries were blocked for some time is all. We will need to retry everything
  • If C10 dies while C20 is blocked on RENAME: Mostly similar to the above. Lock released, then C20 fails the RENAME (because tbl_old exists), then all queries resume normal operation
  • If C20 dies before C10 drops the table, we catch the error and let C10 proceed as planned: DROP, UNLOCK. Nothing terrible happens, some queries were blocked for some time. We will need to retry
  • If C20 dies just after C10 DROPs the table but before the unlock, same as above.
  • If both C10 and C20 die, no problem: LOCK is cleared; RENAME lock is cleared. C1..C9, C11..C19, C21..C29 are free to operate on tbl.

No matter what happens, at the end of operation we look for the ghost table. Is it still there? Then we know the operation failed, "atomically". Is it not there? Then it has been renamed to tbl, and the operation worked atomically.

A side note on failure is the matter of cleaning up the magic tbl_old. Here this is a matter of taste. Maybe just let it live and avoid recreating it, or you can drop it if you like.

Impact on app

App connections are guaranteed to be blocked, either until ghost is swapped in, or until operation fails. In the former, they proceed to operate on the new table. In the latter, they proceed to operate on the original table.

Impact on replication

Replication only sees the RENAME. There is no LOCK in the binary logs. Thus, replication sees an atomic two-table swap. There is no table-outage.

Activity

baloo

baloo commented on Aug 1, 2016

@baloo

Hello @shlomi-noach ,

First thanks for opensourcing gh-ost :) It looks very good :)
The only question remaining in my mind was about this particular subject (cut-over phase), thanks for documenting it. But I believe I now have two more:

  • What ensures that no other connections writes to tbl table in-between those events:

    • Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
    • Connection C10: LOCK TABLES tbl WRITE, tbl_old WRITE

    wouldn't it be safer to:

    • Connection C10: LOCK TABLES tbl WRITE;
    • Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
    • Connection C10: LOCK TABLES tbl_old WRITE;

    I believe this way the replication client knowns once the CREATE_TABLE occurs in the replication stream, no writes to tbl will occur ever. And it's safe to proceed further down (with the rename and so on).

  • Other question: You assume:

    A blocked RENAME is always prioritized over a blocked INSERT/UPDATE/DELETE, no matter who came first

    From my understanding of the mysql documentation this is only guaranteed by the lock acquirement algorithm that mysql uses [https://dev.mysql.com/doc/refman/5.7/en/lock-tables.html]:

    LOCK TABLES acquires locks as follows:

    • Sort all tables to be locked in an internally defined order. From the user standpoint, this order is undefined.
    • If a table is to be locked with a read and a write lock, put the write lock request before the read lock request.
    • Lock one table at a time until the session gets all locks.

    I believe it to be defined in https://github.com/percona/percona-server/blob/5.7/sql/mdl.h#L147
    Although I believe the code tends to confirm your hypothesis, the documentation clearly says that user-code shouldn't assume on any ordering. Am I missing something?

shlomi-noach

shlomi-noach commented on Aug 2, 2016

@shlomi-noach
ContributorAuthor

@baloo, thank you for your review!

What ensures that no other connections writes to tbl table in-between those events:

There is nothing to ensure that, and there is no need to ensure that; the table creation is merely a step in the direction of beginning the cut-over. There is no problem that tbl still takes writes, these are still being read in the binary log and propagated to the ghost table.

Moreover, if we reverse the order as you suggest, implication is we waste more time during the locked phase.

wouldn't it be safer to:

  • Connection C10: LOCK TABLES tbl WRITE;
  • Connection C10: CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'
  • Connection C10: LOCK TABLES tbl_old WRITE;

This doesn't work like that in MySQL, see: http://dev.mysql.com/doc/refman/5.7/en/lock-tables.html:

... A session that requires locks must acquire all the locks that it needs in a single LOCK TABLES statement. ...

You cannot just grab more locks as you go along, you have to grab them all at once.

I believe it to be defined in https://github.com/percona/percona-server/blob/5.7/sql/mdl.h#L147
Although I believe the code tends to confirm your hypothesis, the documentation clearly says that user-code shouldn't assume on any ordering. Am I missing something?

The relevant code in MySQL is https://github.com/mysql/mysql-server/blob/a533e2c786164af9bd276660b972d93649434297/sql/mdl.cc#L2312. You are correct this relies on internal behavior. This behavior exists for over 15 years or so, and, having discussed (informally, Safe Harbour etc.) with engineers - has no intention of going away.

I'm glad you point it out, I'll make it explicit in the fine print.

This entire scheme came to be because of a limitation in MySQL, where you cannot rename a table you have under lock in the same session. I'm in good mind to influence next MySQL version to support the above, rendering this entire algorithm unnecessary.

baloo

baloo commented on Aug 2, 2016

@baloo

Thanks for clarifications :)

13 remaining items

shlomi-noach

shlomi-noach commented on Apr 25, 2020

@shlomi-noach
ContributorAuthor

It can’t happen because gh-ost tails the binary log, either on master or on the replica, and there may be still events to handle
In the binary logs during the cut-over.

ghost
wukongHH

wukongHH commented on Jun 11, 2021

@wukongHH

Noteworthy that as of MySQL 8.0.13 there's support for RENAME TABLES under LOCK TABLES, specifically designed for gh-ost, see: https://mysqlserverteam.com/the-mysql-8-0-13-maintenance-release-is-generally-available/

gh-ost is yet to utilize this new functionality.

Hi,has gh-ost already utilize this new functionality?

shlomi-noach

shlomi-noach commented on Jun 11, 2021

@shlomi-noach
ContributorAuthor

It is unfortunate that the implementation actually does not meet gh-ost's requirement, and using the new implementation sadly complicates the logic rather than simplifies it. I therefore made no progress meeting the new implementation.

wukongHH

wukongHH commented on Jun 11, 2021

@wukongHH

Can I ask why the new implementation can't simplify the logic? I don't understand by myself.o(╥﹏╥)o

shlomi-noach

shlomi-noach commented on Jun 11, 2021

@shlomi-noach
ContributorAuthor
cenkore

cenkore commented on Jun 16, 2021

@cenkore

@shlomi-noach

If C20 dies just after C10 DROPs the table but before the unlock, same as above.

This entry should be incorrect. In this case, the table name should be cut-over normally, because after the session is broken, the lock is released on C10 and C20 will get the lock and cut-over table directly. Finally gh-ost will fail with error Expected magic comment on _xxx_del, did not find it in retry stage and exit, but the cut-over has actually been completed.

shlomi-noach

shlomi-noach commented on Jun 21, 2021

@shlomi-noach
ContributorAuthor

I don't think I have the bandwidth to solve this again. Unless someone else has a definitive solution, consider using -cut-over=two-step

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @baloo@shlomi-noach@youzipi@wukongHH@cenkore

        Issue actions

          Describing safe, blocking, atomic, pure-mysql cut-over phase · Issue #82 · github/gh-ost