memtx_tx: add documentation page #47

Draft
wants to merge 1 commit into base: master
1 change: 1 addition & 0 deletions source/index.rst
@@ -13,3 +13,4 @@ Tarantool internals
allocators
cbus
ports
memtx_tx
202 changes: 202 additions & 0 deletions source/memtx_tx.rst
@@ -0,0 +1,202 @@
.. vim: ts=4 sw=4 et

MVCC in memtx
=============

Introduction
------------

This section gives a rough explanation of MVCC in Tarantool, which is handy to have in mind before digging into its implementation details.

MVCC is a facility that allows executing multiple transactions concurrently. All transactions in Tarantool are executed on one thread (TX), but thanks to cooperative multitasking, if MVCC is enabled, transactions can yield control to one another. The advantage is that long-lasting transactions can be executed concurrently with fast ones as long as they yield once in a while.
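
A minimal sketch of what this looks like from the Lua side (assuming MVCC is enabled; the space, keys and timings are illustrative):

.. code:: lua

    -- memtx_use_mvcc_engine must be set on the first box.cfg call.
    box.cfg{memtx_use_mvcc_engine = true}

    local fiber = require('fiber')
    local s = box.schema.space.create('test')
    s:create_index('pk')

    -- A long transaction running in its own fiber: it yields in the middle,
    -- so other transactions can run on the same TX thread in the meantime.
    fiber.create(function()
        box.begin()
        s:insert{1}
        fiber.sleep(0.1) -- yield; with MVCC enabled the transaction survives
        s:insert{2}
        box.commit()
    end)

    -- A short transaction executed while the long one is still open.
    box.begin()
    s:insert{3}
    box.commit()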

The problem with this facility is that conflicts between transactions are possible. In general, a transaction must see the data it's working with in a consistent state. That means two things:

1. If a transaction once read a tuple by a key, it must see the same tuple at this key for the rest of its life (up until it's committed or rolled back).
2. There must be a theoretical point in time where the transaction could be executed the way it has been executed and see the database in the state it has seen.

The database state includes data from different indexes and spaces. If a transaction has:

- read a ``tuple_1`` by ``key_1`` in ``space_1``;
- read ``tuple_2``, which is the first tuple greater than ``key_2`` in the ``space_2``;
- found out that there's no ``key_3`` in the ``space_3``,

it can safely assume that for the rest of the transaction:

- ``tuple_1`` still exists at ``key_1`` in the ``space_1``;
- ``tuple_2`` is still the first tuple greater than ``key_2`` in ``space_2``;
- and there's still no ``key_3`` in the ``space_3``.

If any of these assumptions is broken (for example, someone replaced ``tuple_1`` with ``tuple_1a`` at ``key_1`` in ``space_1``), then the two transactions are in conflict, and there are several options of what to do next; see the `corresponding section <#memtx-tx-conflicts>`_ for more details on memtx conflicts and how they're handled. Essentially, all the MVCC machinery boils down to maintaining the state visible to each transaction and conflicting transactions when required.

MemTX TX conflicts
------------------

There are a number of ways transactions may conflict:

1. Someone deletes the tuple someone else has read:

.. code:: lua

tx1.index:get({1}) -- {1}
tx2.index:delete({1}) -- conflict, tx1 had read the key we delete.

2. Someone looks up in a space by a key, and another one inserts a new matching tuple:

.. code:: lua

tx1.index:select({0}, {iterator = 'GE'}) -- {} (nothing found)
tx2.index:insert({1}) -- conflict, we insert a key matching the one tx1 had requested (but hadn't found).

3. Someone replaces the tuple someone else has read:

.. code:: lua

tx1.index:get({1}) -- {1, 37}
tx2.index:replace({1, 73}) -- conflict, tx1 had read the key we replace.

In case of a conflict we have two transactions: a breaker and a victim. The breaker is the transaction whose write breaks the state another transaction relies on; the victim is the transaction whose assumptions get broken. There are two ways to handle a conflict (both are sketched below the list):

1. Abort the victim (if it is writing).
2. Put the victim into a (possibly deeper) read view (if the victim is read-only).
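
Both outcomes in the same informal notation as the examples above (a sketch; the keys and tuples are illustrative):

.. code:: lua

    -- Case 1: the victim is writing, so it gets aborted.
    tx1.index:get({1})      -- reads {1}
    tx2.index:delete({1})   -- breaks tx1's read
    tx2:commit()            -- tx2 (the breaker) commits first
    tx1.index:replace({2})  -- tx1 is a writer now; this statement (or the
                            -- following commit) fails with a conflict error

    -- Case 2: the victim is read-only, so it is put into a read view.
    tx3.index:get({1})      -- reads {1, 'old'}
    tx4.index:replace({1, 'new'})
    tx4:commit()            -- tx3 is moved into a read view
    tx3.index:get({1})      -- still sees {1, 'old'}
    tx3:commit()            -- succeeds: tx3 never wrote anything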

Key points to remember:

1. Transactions conflict if one deletes a tuple another one had read.
2. Transactions conflict if one inserts a tuple another one unsuccessfully attempted to read (requested by a matching key).
3. A conflicted read-only transaction is moved into a read view at the moment the conflicting writing transaction commits.
4. If both conflicted transactions are writing, the one to commit first aborts the other one.

Also (not mentioned explicitly):

1. Two transactions can only conflict if at least one of them is writing.
2. If a transaction being in a read view attempts to write a tuple, it's aborted.
3. If a transaction is aborted, it's required to abort all the transactions that read the values the transaction has written.
4. Sometimes a conflict can be caused not by just any write, but only by an insertion or deletion. For example, if tx1 counted three tuples (``{1}``, ``{2}`` and ``{3}``), a replace of ``{2}`` might not cause a conflict, since it does not break the state the first transaction has seen: it still counts the same number of tuples:

.. code:: lua

tx1.index:count({0}, {iterator = 'GE'}) -- 3: {1}, {2}, {3}.
tx2.index:replace({2}) -- might not conflict, count({0}, {iterator = 'GE'}) is still 3 for tx1.

There are a bunch of ways a conflict might occur and a few ways to proceed when it happens:

1. If the conflicted transaction is read-only, it's moved into a read view once the other (writing) transaction commits.
2. If the conflicted transaction is writing (so both transactions are writing), the one committed first aborts the other one.

As noted above, there are a number of conditions under which conflicts may happen, and also several ways in which they can be resolved.

General overview
----------------

As mentioned above, in general we only have a few things to consider:

- the read set (the tuples the transaction has read);
- the read gaps (the keys the transaction attempted to read by and got nothing, these are index-specific as keys are);
- the write set (the tuples the transaction has written).

The read set is maintained simply: each tuple read is performed via the ``memtx_tx_tuple_clarify`` function. The function looks for the version of the tuple that is visible to the reading transaction and adds the transaction to the list of readers of that tuple. So for each tuple there's a set of reading transactions stored in ``memtx_story::reader_list``. There's also a dedicated field in the transaction struct for storing its read set (``struct txn::read_set``).

The read gaps are collected in a bit more sophisticated manner: they're stored as gap items in ``memtx_story::link[n].read_gaps`` and ``index->read_gaps``. There are a few types of such items in the system:

1. ``GAP_INPLACE``.
2. ``GAP_NEARBY``.

There are also gap items that don't exactly represent read gaps, but rather operations performed on indexes. These can only be found in ``index->read_gaps``:

1. ``GAP_FULL_SCAN``.
2. ``GAP_COUNT``.

The write sets are not stored explicitly, but the concept of stories, mentioned above, can be used to track the modifications performed on a tuple in an index. This concept is explained in a separate chapter.

Successors and their use
------------------------

Typically, tracking of key accesses is done using interval trees, which allows checking whether operations conflict (for example, whether an inserted key matches a key range read by a concurrent transaction). But this approach performs checks with logarithmic complexity, while memtx's TX manager has been created with the goal of getting O(1) for all operations, given that the number of concurrent transactions is small (in-memory transactions are meant to be fast).

In order to approach O(1) complexity when checking for key collisions with concurrent transactions, a successor-based system has been introduced. The idea is simple:

1. Once an element is inserted into a memtx tree index, its successor in the index can be returned by the insertion method for free.
2. We can store some information in it, for example, the unsuccessful attempts to read tuples by key prior to that element.
3. So we can use this information to find conflicts in O(1) complexity.

Consider the following example:

1. TX1 selects tuples matching ``>= 1`` and gets ``{2}`` and ``{3}``, so we store in the ``{2}`` element the information that for TX1 there was no element matching ``>= 1`` before ``{2}``.
2. TX2 concurrently inserts ``{1}``. It sees in its successor (which is ``{2}``) that TX1 requested tuples from the index and hadn't found anything matching ``>= 1`` prior to ``{2}``. But we have just inserted ``{1}``, and it matches the key, so we have a conflict with TX1 here.

Just like that, we have found a conflict of TX2 with TX1 and can handle it. The good thing is that whether TX1 also requested data from other places in the index does not affect this specific conflict: it's found in place.
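
The same scenario in the informal notation used above (the index initially contains ``{2}`` and ``{3}``):

.. code:: lua

    tx1.index:select({1}, {iterator = 'GE'}) -- {2}, {3}; the fact that nothing
                                             -- matching >= 1 was found before
                                             -- {2} is remembered in {2}
    tx2.index:insert({1})                    -- the successor of {1} is {2}; the
                                             -- information stored there says
                                             -- tx1 read >= 1, and {1} matches,
                                             -- so tx2 conflicts with tx1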

Story chains
------------

One of the basic components of memtx's transaction manager is story chains: chains of ``struct memtx_story`` linked in such a way that the following invariants are maintained:

- If the story object describes the current state of the index (i.e. it corresponds to the last operation), ``struct memtx_story::link[n].in_index`` points to the index (it is ``NULL`` otherwise).
- A story chain describes the history of a tuple; only the last story entry has the link to the index in its ``in_index`` member.
- The actual story entry also contains the ``read_gaps`` member, which is kept up to date with the current state of the index (otherwise it's empty).
- For the actual story the ``newer_story`` pointer is ``NULL`` (otherwise it points to the more actual story).
- The story chains are ordered by psn.
- GC looks for the oldest opened read views and checks their psn. Then it deletes all the stories with a psn less than that (the states they represent are not used by any read view).

The story chain starts with a dirty tuple in an index. The first N >= 0 stories are written by in-progress transactions, the next N >= 0 stories are created by prepared transactions, the next N >= 0 by committed transactions, and the last N >= 0 stories are created by transactions that have been rolled back.

- For stories of committed transactions: ``add_stmt == NULL``.
- For stories of prepared transactions: ``add_psn != 0``.
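
A sketch of what a story chain gives in terms of visibility (informal notation as above; ``s:get({1})`` initially returns ``{1, 'old'}``, a committed story):

.. code:: lua

    tx2.index:replace({1, 'new'}) -- tx2 puts a new, not yet committed story
                                  -- on top of this tuple's chain
    tx1.index:get({1})            -- {1, 'old'}: tx2's dirty story is invisible
                                  -- to tx1, so the chain is walked down to the
                                  -- visible committed version
    tx2:commit()                  -- the new story becomes the committed one;
                                  -- once no read view needs {1, 'old'} any
                                  -- more, GC removes its story from the chain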

Read trackers
-------------

Each read of a tuple is tracked by memtx indexes using the ``memtx_tx_tuple_clarify`` function. The idea of the function is to record the fact that a tuple has been read by a transaction. This is done using linked lists of ``struct tx_read_tracker`` objects stored in the tuple stories (``struct memtx_story::reader_list``) and in transactions (``struct txn::read_set``).

So the tracker has two links: ``in_read_set`` (linking it into the corresponding transaction's read set) and ``in_reader_list`` (linking it into the reader list of the story that was read).

Gap items
---------

The gap items are created directly or indirectly by various operations on an index. The directly created ones are ``GAP_NEARBY``, ``GAP_FULL_SCAN`` and ``GAP_COUNT``. The ``GAP_INPLACE`` item is created either lazily after an index calls ``memtx_tx_track_point`` or when a conflict is detected with ``GAP_NEARBY`` (this is explained in detail in its own section).

**GAP_INPLACE**

This gap item is an object that records the fact that some transaction has read by a key matching this story and has found nothing. That means the story's author conflicts with it.

This item is created in a number of cases:

1. The first case is an explicit call of the ``memtx_tx_track_point`` function. In this case the TX manager saves the fact that a transaction has found nothing equal to a full key in a structure called "point holes" (more about it in a separate section).
2. The second case is the insertion of a new element into an index (an element new not only for the current transaction, but for all concurrent transactions). In case we insert an element into the index prior to a successor and it turns out it matches a ``read_gaps`` entry of type ``GAP_NEARBY`` of its successor (or of the index, if the inserted element is the last one in the index), the ``GAP_INPLACE`` gap item is created in its ``read_gaps``, so now the transaction conflicts with the one that created the ``GAP_NEARBY``.

**GAP_NEARBY**

The gap item is created using a dedicated function: ``memtx_tx_track_gap``. It represents the fact that a transaction hasn't found a tuple prior to a specific successor (or up to the end of an index). Once a new tuple is inserted into the index prior to that successor (or at the end of the index), it's compared with the key specified in the gap item, and if it matches, the story of the new tuple is marked with the ``GAP_INPLACE`` item.

**GAP_FULL_SCAN**

This gap item is created on an iterator-based lookup in an unordered index (hash). The item causes a transaction conflict on any insertion into the index after the iterator is created. Such items are only placed in the index the lookup is performed in (the one ``memtx_tx_track_full_scan`` is called for).
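
A sketch of the behavior this item produces (informal notation; here the primary index is assumed to be a hash index):

.. code:: lua

    tx1.index:select({}, {iterator = 'ALL'}) -- full scan of the hash index,
                                             -- a GAP_FULL_SCAN item is stored
                                             -- in the index
    tx2.index:insert({42})                   -- any insertion into the scanned
                                             -- index conflicts with tx1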

**GAP_COUNT**

The gap item is created in case a transaction counts the tuples matching a key in an index. Once such an item is created for a transaction, any following insertion or deletion of a matching key in the index will conflict with it (it will also conflict with all the previously created concurrent modifications of matching tuples, not only inserts and deletes).
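
Complementing the ``count`` example above, a sketch of the case where this item does cause a conflict (informal notation, illustrative keys):

.. code:: lua

    tx1.index:count({0}, {iterator = 'GE'}) -- 3: {1}, {2}, {3}; a GAP_COUNT
                                            -- item is created for tx1
    tx2.index:insert({4})                   -- conflict: the number of tuples
                                            -- matching >= 0 changes for tx1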

Point holes
-----------

The point holes storage is used to store the point hole entries created by the ``memtx_tx_track_point`` function. These entries are used and removed from the storage once a matching tuple is inserted into the index.
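
A sketch of how such an entry is created and then consumed (informal notation; the key is illustrative):

.. code:: lua

    tx1.index:get({5})    -- nothing is found by the full key, so a point hole
                          -- entry is recorded for tx1
    tx2.index:insert({5}) -- a matching tuple appears: the entry is consumed,
                          -- a GAP_INPLACE item is attached to the new story,
                          -- and tx2 conflicts with tx1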

Unsorted
--------

``story->del_stmt`` (and the list with ``next_in_del_list``).

``memtx_tx_history_rollback_added_story``

Actors:

TX manager (singleton):

- ``read_view_txns`` - all transactions moved to a read view;
- ``point_holes`` - a list of point holes;
- ``history`` - tuple -> history mappings;
- ``all_stories`` - a list of all ``memtx_story`` objects;
- ``all_txns`` - a list of all transactions.

Space struct:

- ``memtx_stories`` - all stories of the space.
3 changes: 2 additions & 1 deletion source/toctree.rst
@@ -13,4 +13,5 @@
wal-fmt
allocators
cbus
ports
ports
memtx_tx