Describing safe, blocking, atomic, pure-mysql cut-over phase

_Final-finally-finalizationally_, here's an asynchronous, safe, **atomic** cut-over phase.
This solution doesn't cause "table outage" (as in #65).
### Here are the steps for a safe, atomic cut-over:

The solution we offer is now based on two connections only (as opposed to three, in the optimistic approach). "Our" connections will be C10, C20. The "normal" app connections are C1..C9, C11..C19, C21..C29.
- Connections C1..C9 operate on `tbl` with normal DML: `INSERT, UPDATE, DELETE`
- Connection C10: `CREATE TABLE tbl_old (id int primary key) COMMENT='magic-be-here'`
- Connection C10: `LOCK TABLES tbl WRITE, tbl_old WRITE`
- Connections C11..C19, newly incoming, issue queries on `tbl` but are blocked due to the `LOCK`
- Connection C20: `RENAME TABLE tbl TO tbl_old, ghost TO tbl`
  - This is blocked due to the `LOCK`, but gets prioritized on top connections C11..C19 and on top C1..C9 or any other connection that attempts DML on `tbl`
- Connections C21..C29, newly incoming, issue queries on `tbl` but are blocked due to the `LOCK` and due to the `RENAME`, waiting in queue
- Connection C10: checks that C20's `RENAME` is applied (looks for the blocked `RENAME` in `show processlist`)
- Connection 10: `DROP TABLE tbl_old`
  Nothing happens yet; `tbl` is still locked. All other connections still blocked.
- Connection 10: `UNLOCK TABLES`
  
  **BAM!** The `RENAME` is first to execute, ghost table is swapped in place of `tbl`, then C1..C9, C11..C19, C21..C29 all get to operate on the new and shiny `tbl`
#### Some notes
- We create `tbl_old` as a blocker for a premature swap
- It is allowed for a connection to `DROP` a table it has under a `WRITE LOCK`
- A blocked `RENAME` is always prioritized over a blocked `INSERT/UPDATE/DELETE`, no matter who came first
### What happens on failures?

Much fun. Just works; no rollback required.
- If C10 errors on the `CREATE` we do not proceed.
- If C10 errors on the `LOCK` statement, we do not proceed. The table is not locked. App continues to operate as normal.
- If C10 dies just as C20 is about to issue the `RENAME`:
  - The lock is released, the queries C1..C9, C11..C19 immediately operate on `tbl`.
  - C20's `RENAME` immediately fails because `tbl_old` exists.
  - The entire operation is failed, but nothing terrible happens; some queries were blocked for some time is all. We will need to retry everything
- If C10 dies while C20 is blocked on `RENAME`: Mostly similar to the above. Lock released, then C20 fails the `RENAME` (because `tbl_old` exists), then all queries resume normal operation
- If C20 dies before C10 drops the table, we catch the error and let C10 proceed as planned: `DROP`, `UNLOCK`. Nothing terrible happens, some queries were blocked for some time. We will need to retry
- If C20 dies just after C10 `DROP`s the table but before the unlock, same as above.
- If both C10 and C20 die, no problem: `LOCK` is cleared; `RENAME` lock is cleared. C1..C9, C11..C19, C21..C29 are free to operate on `tbl`.

No matter what happens, at the end of operation we look for the `ghost` table. Is it still there? Then we know the operation failed, "atomically". Is it not there? Then it has been renamed to `tbl`, and the operation worked atomically.

A side note on failure is the matter of cleaning up the magic `tbl_old`. Here this is a matter of taste. Maybe just let it live and avoid recreating it, or you can drop it if you like.
### Impact on app

App connections are guaranteed to be blocked, either until `ghost` is swapped in, or until operation fails. In the former, they proceed to operate on the new table. In the latter, they proceed to operate on the original table.
### Impact on replication

Replication only sees the `RENAME`. There is no `LOCK` in the binary logs. Thus, replication sees an atomic two-table swap. There is no table-outage.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Describing safe, blocking, atomic, pure-mysql cut-over phase #82

Here are the steps for a safe, atomic cut-over:

Some notes

What happens on failures?

Impact on app

Impact on replication

13 remaining items

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Describing safe, blocking, atomic, pure-mysql cut-over phase #82

Description

Here are the steps for a safe, atomic cut-over:

Some notes

What happens on failures?

Impact on app

Impact on replication

Activity

baloo commented on Aug 1, 2016

shlomi-noach commented on Aug 2, 2016

baloo commented on Aug 2, 2016

13 remaining items

shlomi-noach commented on Apr 25, 2020

wukongHH commented on Jun 11, 2021

shlomi-noach commented on Jun 11, 2021

wukongHH commented on Jun 11, 2021

shlomi-noach commented on Jun 11, 2021

cenkore commented on Jun 16, 2021

shlomi-noach commented on Jun 21, 2021

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Issue actions