Notes on the Designing Data-Intensive Applications book (GitHub repo)


Comments partial


Part I: Foundations of Data Systems


Ch 1: Reliable, Scalable and Maintainable Apps


Data-intensive vs Compute-Intensive

CPU is rarely a limit these days. Usually it is:
  • amount of data
  • complexity of data
  • speed at which it is changing


image


Thinking about Data Systems

  • Boundaries between data systems (DBs, caches, queues) are becoming blurred

image


image

  • the role of Software Engineer now also includes data-system designer (architect?)
    • We have to address:
      • keeping data correct and complete, even when things go wrong internally
      • providing consistently good performance to clients, even when parts of the system are degraded
      • how to scale to handle increased load
      • what is a good API for this service?
    • factors:
      • team skill / exp
      • legacy sys
      • time-pressure
      • risk appetite
      • regulatory
      • etc etc
    • Note: what is legacy? assumptions/conventions w/o data

image

Reliability

  • fault-tolerant (aka resilient)
  • fault vs failure
  • Define the scope of faults - we can’t tackle them all (e.g. a whole region down? alien invasion?)
  • Essentially we build reliable sys from unreliable parts (i.e. my Mec*ano kit)
  • we need to deliberately trigger faults (e.g. kill processes without warning). Many bugs are due to poor error handling
  • Netflix Chaos Monkey
  • we prefer tolerating faults over preventing faults

Hardware Faults

1st response: add redundancy, as it is well understood. Until recently hardware redundancy was sufficient, but that is changing as flexibility and elasticity are prioritised over single-machine reliability. Hence the move towards systems that can tolerate the loss of entire machines, using software fault-tolerance techniques (in preference to, or in addition to, hardware redundancy).

Software Errors

Human Error

image

Scalability

image

Maintainability

image

Simplicity - managing complexity

  • Explosion of the state space; accidental complexity vs essential complexity

    Evolvability - making changes easy


Your solution will be custom

Ch 2: Data Models and Query Languages

Abstraction

image


image

Relational Model

image

Document based Model

image

Graph model

image image


Data Storage and Data Retrieval - a data model and its querying go hand-in-hand

image


image


Ch 3: Storage and Retrieval

While the group has heard of many concepts in this chapter and is reasonably comfortable with relational databases, we don’t have great knowledge of non-relational terminology.

Action points

Which DB to use?

image image

DB Indexing

image

Extra on DB indexing

image

Hash Indexes

Hash tables, or dictionaries, are common to many programming languages. Ruby has a Hash class, while JavaScript has both Objects and Maps.

Understanding Big-O notation is important to digesting this chapter.

The book says:

As described so far, we only ever append to a file—so how do we avoid eventually running out of disk space? A good solution is to break the log into segments of a certain size by closing a segment file when it reaches a certain size, and making subsequent writes to a new segment file.

It wasn’t well explained why having multiple files would stop you running out of disk space, by comparison to the compaction process. We believe, however, that you would want the compaction to occur on “closed” segments, so after a new segment is opened, the old segment can be compacted. It is also possible for the compaction process to fall behind if there is a spike in writes. In this case, lots of smaller segments might be faster to compact, at the expense of less effective compression.
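As a rough illustration of the segment + compaction idea (the SegmentedLog class below is our own sketch, not how Bitcask or any real engine is implemented): each segment is an append-only list of key/value pairs, an in-memory hash maps each key to its latest segment and offset, and compacting a closed segment keeps only the entries the index still points at.

# Hypothetical sketch of a segmented append-only log with an in-memory hash index.
class SegmentedLog
  SEGMENT_LIMIT = 3 # entries per segment, tiny so the example rolls over quickly

  def initialize
    @segments = [[]] # each segment is an append-only array of [key, value] pairs
    @index = {}      # key => [segment_number, offset_in_segment]
  end

  def put(key, value)
    @segments << [] if @segments.last.size >= SEGMENT_LIMIT # close segment, open a new one
    @segments.last << [key, value]
    @index[key] = [@segments.size - 1, @segments.last.size - 1]
  end

  def get(key)
    segment, offset = @index[key]
    segment && @segments[segment][offset][1]
  end

  # Compaction of a closed segment: keep only entries the index still points at.
  def compact!(segment_number)
    compacted = []
    @segments[segment_number].each_with_index do |(key, value), offset|
      next unless @index[key] == [segment_number, offset] # stale entry, drop it
      compacted << [key, value]
      @index[key] = [segment_number, compacted.size - 1]
    end
    @segments[segment_number] = compacted
  end
end

log = SegmentedLog.new
log.put("cat", 1); log.put("cat", 2); log.put("cat", 3) # fills segment 0
log.put("dog", 1)                                       # segment 0 is closed, segment 1 opened
log.compact!(0)                                         # segment 0 shrinks to [["cat", 3]]
log.get("cat")                                          # => 3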

Databases like PostgreSQL have a complex process for continuing to serve one version of a record while transactions create new versions. This is discussed in a later chapter so we won’t dig into this now.

LSM-Trees (Log-Structured Merge-Trees) and SSTables (Sorted String Tables)

image image

AWS defaults to general SSD storage these days, but you can get a discount for purchasing magnetic storage.

An advantage of an SSTable over log segments with hash indexes is the data storage can be greater than the available memory. Now that memory is cheaper than when the book was written, it is more feasible to store a whole database in memory. This, however, is unlikely to be possible with a data-intensive application. While AWS offers instances with 24TiB of memory, this is more inconvenient data, rather than big data, and comes at a high cost (US$218.40ph)

LSM trees can be a lot faster for writes than a B-tree; however, they can be slower for reads since data can exist across multiple files and memory.

While we understand that a Bloom filter is a probabilistic data structure that can tell you a value is definitely not in a set (or that it might be), we are not entirely clear on how one works.
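A toy Bloom filter sketch in Ruby (the BloomFilter class and its parameters are made up for illustration): k hash positions each set one bit per added value, and a lookup returns false only if some bit is unset, so a false answer means "definitely not present" while a true answer only means "possibly present". This is what lets an LSM-tree skip SSTables that cannot contain a key.

require "digest"

# Hypothetical toy Bloom filter: m bits, k hash positions derived from SHA-256.
class BloomFilter
  def initialize(bits: 1024, hashes: 3)
    @bits = Array.new(bits, false)
    @hashes = hashes
  end

  def add(value)
    positions(value).each { |i| @bits[i] = true }
  end

  # false => definitely not in the set; true => possibly in the set
  def include?(value)
    positions(value).all? { |i| @bits[i] }
  end

  private

  # Derive k bit positions by hashing "0<value>", "1<value>", ... mod the bit count.
  def positions(value)
    (0...@hashes).map do |seed|
      Digest::SHA256.hexdigest("#{seed}#{value}").to_i(16) % @bits.size
    end
  end
end

filter = BloomFilter.new
filter.add("cat")
filter.include?("cat") # => true  (possibly present; here it really is)
filter.include?("dog") # => false (definitely absent, so the SSTable can be skipped)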

B-Trees Index

image image

B-tree does not mean “binary tree”, because B-trees can have more than 2 branches. We are choosing to believe that the “B” means “balanced”; however:

Bayer and McCreight never explained what, if anything, the B stands for: Boeing, balanced, between, broad, bushy, and Bayer have been suggested.

https://en.wikipedia.org/wiki/B-tree

The book states that:

A four-level tree of 4 KB pages with a branching factor of 500 can store up to 250 TB

The equation for this is:

branching_factor ^ levels * page_size_KB / (KB_to_MB * MB_to_GB * GB_to_TB)
500              ^ 4      * 4            / (1000     * 1000     * 1000)
= 250 TB

Other Indexing Structures

It’s amazing how efficiently databases can find ranged data with multi-column filters, e.g. geospatial data. Most major databases (e.g. PostgreSQL and MySQL) are smart enough to use table statistics to pick the best available index and minimise sequential scanning, but need to fall back to a scan when handling any column filters not covered by an index. Thankfully, those same databases can combine multiple indexes to perform a single query. They can also use tricks like parallel sequential scanning to minimise load time.

It was surprising to learn that:

the performance advantage of in-memory databases is not due to the fact that they don’t need to read from disk. Even a disk-based storage engine may never need to read from disk if you have enough memory, because the operating system caches recently used disk blocks in memory anyway. Rather, they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk

When updating a value in a heap file:

the record can be overwritten in place, provided that the new value is not larger than the old value. The situation is more complicated if the new value is larger, as it probably needs to be moved to a new location in the heap where there is enough space. In that case, either all indexes need to be updated to point at the new heap location of the record, or a forwarding pointer is left behind in the old heap location

This idea of moving data around and leaving pointers is reminiscent of a 2017 talk by Aaron Patterson on compacting garbage collection in Ruby 2.7.

Column-Oriented Storage

There was a discussion about whether filters would be more performant on row- or column-based storage. The current theory is that column-based filtering could be faster because values are grouped. This does, however, assume that there is an order on the column.

Ch 4: Encoding and Evolution

image image

Formats for Encoding Data

  • Language-specific formats
  • JSON, XML and binary variants
  • Binary encoding
  • Thrift and Protocol Buffers

image

Field tags and schema evolution

  • Data types and schema evolution

    Avro

    The writer’s schema and the reader’s schema

  • Schema evolution rules
  • But what is the writer’s schema?
  • Dynamically generated schemas

    Code generation and dynamically typed languages

    The merits of schemas

    Modes of Data Flow

    Data flow through databases

  • Different values written at different times
  • Archival storage

    Data flow through services: REST and RPC

    Web services

    Remote procedure calls (RPC)

    Current directions for RPC / Data encoding and evolution for RPC

    Message passing data flow

    Message brokers / Distributed actor frameworks

Part II: Distributed Data

Ch 5: Replication

Intro

WHY:

  1. Geo proximity to users
  2. fail-safe -> increases availability
  3. scale out -> increases throughput
Types:
    • single-leader
    • multi-leader
    • leaderless

      Synchronous VS Async

Leaders and Followers

aka active/passive or master-slave replication. Used in:

  • DBs: PostgreSQL, MySQL, as well as MongoDB, etc.
  • Non DBs: Kafka, RabbitMQ

Synchronous vs. asynchronous replication

  • Synchronous generally means that only one of the followers is synchronous, since making all followers synchronous is impractical (any one node going down would halt the system). This is then often called semi-synchronous
  • Often, though, replication is fully async -> writes are not guaranteed to be durable, but they are not slowed down by replication
  • chain replication - sync replication with good performance (used in Microsoft Azure Storage)

Setting up new followers

Steps:

  1. take snapshot (built in feature often, for MySQL use innobackupex)
  2. copy snapshot to the follower
  3. update the follower with the data written after the snapshot, starting from the snapshot’s exact position in the replication log (aka log sequence number in PostgreSQL, binlog coordinates in MySQL)
  4. confirm it caught up

Handling node outages

  1. Follower failure: catch-up recovery
    • the follower recovers from its log, which points to the last transaction processed, and can then copy newer data from the leader
  2. Leader failure: Failover
    1. Determine Leader has Failed
      1. nodes bounce messages between each other
    2. Choose new leader
      1. make all nodes agree on a new leader(consensus problem)
      2. ideally new leader needs to have most up-to-date data
    3. Reconfigure the system to use new leader
      1. Ensure the old leader becomes a follower after recovery
      ISSUES:
        • async replication loses the leader’s newest writes
        • may cause issues in dependent systems - e.g. the GitHub MySQL/Redis incident
        • split-brain issue - both leaders accept writes
        • timeout choice - time to detect failure and recover vs false positives (during peak load)

Perform Failovers manually to minimise issues

Implementation of replication logs

1. Statement-based replication

The leader’s log is sent to all followers (it includes every WRITE statement - INSERT, UPDATE, DELETE for a relational DB) so they can execute these statements themselves.
Issues:
  1. nondeterministic functions generate different values - e.g. RAND(), NOW()
  2. autoincrement - statements must be executed in the same order on each replica
  3. side effects of statements (e.g. triggers, stored procedures, user-defined functions)
Real-life use cases unclear?

2. Write-ahead log (WAL) shipping

An append-only log. In the case of B-trees, data is written into the WAL first, then onto the B-tree index.
  • used in PostgreSQL, Oracle
  • Disadvantage: a very low-level representation of data -> replication is closely coupled to the storage engine - hence leaders and followers can’t run different DB versions

3. Logical (row-based) log replication

The log contains:
  • for inserts - the new values
  • for deletions - data to identify the row (primary key or values)
  • for updates - data to identify the row + the new values
Used in MySQL’s binlog when configured for row-based replication.
Advantages:
  • decoupled from the storage engine, so it can be backwards compatible or even use different storage engines
  • easier to use with external apps. Refer to change data capture, Ch 11

4. Trigger-based replication

Unlike the previous algorithms, it is defined in the application, which gives flexibility (e.g. conflict resolution, replicating only certain subsets of data). It uses built-in RDB features - triggers and stored procedures - to execute custom code on a WRITE transaction, which can log the change into a separate table (e.g. Bucardo for Postgres).


Problems With Replication Lag

Eventual consistency - replication lag can be seconds, or even minutes+ under stress, resulting in the following three issues

1. Reading your own writes

Users might not see their own changes/writes instantly; we need read-after-write (or read-your-writes) consistency. Implementation techniques for read-after-write consistency:

  1. Read from leader if user is an author (easy for profile settings / etc since only one user generally can read them)
  2. Track times of the last update and then decide if you need to read from leader
  3. The client can remember the timestamp of its most recent WRITE and query a read replica only if the replica is up to date as of that timestamp.
    • this timestamp can be a logical timestamp or the actual system clock (refer to Unreliable Clocks)
    • cross-device complexities:
      • the timestamp of the user’s last update needs to be centralised
      • a cell/radio network might route to a different datacentre than a wired connection

2. Monotonic reads

  • a lesser guarantee than strong consistency
  • a stronger guarantee than eventual consistency
  • achieved by making each user always read from the same replica assigned to them

3. Consistent prefix reads

  • readers need to see writes in the same order

Solutions for replication lag

Transactions (txns)

  • Single-node txns have existed for a long time
    • however they can be expensive (performance & availability) in distributed systems
    • and some claim eventual consistency is therefore inevitable

Multi-leader replication

aka master-master or active/active. Leaders also act as followers to other leaders.

Use cases for multi-leader replication

Multiple Datacentres

Advantages over single-leader

  1. Performance -> lower latency due to locality of data-centres
  2. Datacentre Outage Tolerance
  3. Network tolerance - a temporary network interruption between datacentres does not prevent writes from being processed

Multi-leader configs are sometimes supported by default, but can also be added via external tools:

  • BDR for Postgres
  • Tungsten Replicator for MySQL

Disadvantages of Multi-leader replication:

  • requires conflict resolution - as the same data can be concurrently modified
  • since it is often a retrofitted feature, it requires extra config and care (e.g. auto-incrementing keys, triggers, integrity constraints)

    Clients with Offline ops

    In this case, every device has a local database that acts as a leader (it accepts write requests), and there is an asynchronous multi-leader replication process (sync) between the replicas of your calendar on all of your devices. The replication lag may be hours or even days, depending on when you have internet access available.

CouchDB is designed for this mode of operation; Lotus Notes is another example.

Collaborative editing

for faster collaboration, you may want to make the unit of change very small (e.g. a single keystroke), and avoid locking

Handling write conflicts

Synchronous vs. asynchronous conflict detection

Synchronous conflict detection loses the main advantage of multi-leader replication: allowing each replica to accept writes independently

Conflict avoidance

  • route all changes for a particular record through a particular datacentre
  • but if datacentre needs to be changed it poses an issue

    Converging towards a consistent state

    all replicas must arrive at the same final value when all changes have been replicated. Ways to achieve:

    1. Give each write a unique ID and let the highest win - i.e. LWW (last write wins); refer to Detecting Concurrent Writes (implies data loss)
    2. Give each replica a unique ID and let writes from the higher-numbered replica win (implies data loss)
    3. merge the values together
    4. Record the conflict and resolve it later (e.g. by the user)

        Custom conflict resolution logic

  • on write
  • on read

    Automatic Conflict Resolution

  • Conflict-free replicated data types (CRDTs)
    • two-way merge function
  • Mergeable persistent data structures
    • similar to the Git version control system; uses a three-way merge function
  • Operational transformation
    • Google Docs

      What is a conflict?

      Multi-leader replication topologies

image

1. Circular Topology (MySQL default)

  • one node failing can be an issue, as working around the failed node often requires manual reconfiguration

2. Star Topology (can be generalised to a tree)

3. All-to-all topology (most common)

  • better fault tolerance, as it avoids a single point of failure
  • but some replicas can overtake others (similar to Consistent Prefix Reads).

    Attaching a timestamp to every write is not sufficient, because clocks cannot be trusted to be sufficiently in sync to correctly order these events at leader 2

image

Version vectors can be used to order concurrent writes correctly.

Conflict detection techniques are poorly implemented in many multi-leader replication systems.

  • PostgreSQL BDR does not provide causal ordering of writes
  • Tungsten Replicator for MySQL doesn’t even try to detect conflicts

Leaderless replication

Some of the earliest replicated data systems were leaderless, but the idea was mostly forgotten during the era of dominance of relational databases.

Modern examples:

  • Amazon Dynamo system
  • Riak, Cassandra and Voldemort are open source datastores with leaderless replication models inspired by Dynamo

Writing to the database when a node is down

Problem: nodes that were down come back with stale data and can return stale responses.
Solution: read requests are also sent to several nodes in parallel, and the responses are compared using version numbers.

Read repair and anti-entropy - methods to copy data to all replicas

  1. Read repair - checks versions on read; good for frequently read data (see the sketch after the quorum formula below)
  2. anti-entropy - a background process. Unlike the replication log in leader-based replication, this anti-entropy process does not copy writes in any particular order, and there may be a significant delay before data is copied.

Quorums for reading and writing

n = total_node_count
w = write_quorum  # minimum number of nodes that must confirm a write
r = read_quorum   # minimum number of nodes that must respond to a read

w + r > n

# i.e. 2 + 2 > 3
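A small Ruby sketch of the same condition plus the version-number comparison used for read repair (the reply structure is an assumption for illustration):

# Quorum condition: every read quorum overlaps every write quorum in >= 1 node.
def quorum?(n:, w:, r:)
  w + r > n
end

quorum?(n: 3, w: 2, r: 2) # => true

# Read path: query r replicas, keep the value with the highest version number,
# and (read repair) write that value back to any replica that returned a stale one.
replies = [
  { node: "a", version: 7, value: "new" },
  { node: "b", version: 7, value: "new" },
  { node: "c", version: 6, value: "old" },
]
newest = replies.max_by { |reply| reply[:version] }
stale  = replies.select { |reply| reply[:version] < newest[:version] }
# => newest[:value] is returned to the client and written back to node "c"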

Limitations of quorum consistency

Sloppy quorums and hinted handoff

Detecting concurrent writes

Ch 6: Partitioning

partition AKA:

  • shard (MongoDB, Elastic, SolrCloud)
  • region (HBase)
  • tablet (BigTable)
  • vnode (Cassandra, Riak)
  • vBucket (Couchbase)

Main reason for partitioning: scalability

Different partitions can be placed on different nodes in a shared-nothing cluster

Partitioning VS replication

image

Often both are combined, although the choice of partitioning scheme is mostly independent of the choice of replication scheme.

Strategies (key range VS hash of key)

GOAL: to spread the data and the query load evenly across nodes -> avoid skewed data, hot spots

The simplest approach of avoiding hot spots would be to assign records to nodes randomly BUT when you’re trying to read a particular item, you have no way of knowing which node it is on, so you would have to query all nodes in parallel

Partitioning by key range

image

Key range partitioning, where keys are sorted, and a partition owns all the keys from some minimum up to some maximum. This has the advantage that efficient range queries are possible, but there is a risk of hot spots if the application often accesses keys that are close together in the sorted order. In this approach, partitions are typically rebalanced dynamically, by splitting the range into two sub-ranges when a partition gets too big.
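A sketch of the lookup under key-range partitioning (the boundaries below are made up): keeping the partition boundaries sorted means finding the right partition is a binary search, and a range query only touches the partitions whose ranges overlap it.

# Hypothetical boundaries: each partition owns keys from its lower bound
# (inclusive) up to the next partition's lower bound (exclusive).
BOUNDARIES = ["", "f", "m", "t"] # partition 0: keys < "f", partition 1: "f".."l…", etc.

def partition_for(key)
  # binary search for the first boundary greater than the key
  idx = BOUNDARIES.bsearch_index { |boundary| boundary > key }
  idx.nil? ? BOUNDARIES.size - 1 : idx - 1
end

partition_for("cat")   # => 0
partition_for("mouse") # => 2

# A range query only needs the contiguous run of partitions covering the range.
def partitions_for_range(from, to)
  (partition_for(from)..partition_for(to)).to_a
end

partitions_for_range("cat", "goat") # => [0, 1]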

Partitioning by hash of key

image

Hash partitioning, where a hash function is applied to each key, and a partition owns a range of hashes. This destroys the ordering of keys, making range queries inefficient, but may distribute load more evenly. When partitioning by hash, it is common to create a fixed number of partitions in advance, to assign several partitions to each node, and to move entire partitions from one node to another when nodes are added or removed.


e.g. a 32-bit hash function which takes a string: whenever you give it a new string, it returns a seemingly random number between 0 and 2^32 − 1.

The hash function does not need to be cryptographically strong:

  • MD5 -> MongoDB, Cassandra
  • Fowler-Noll-Vo func -> Voldemort

But Ruby’s Object#hash or Java’s Object.hashCode() and other language built-in algorithms might not be suitable, as the same key can hash to different values in different processes

Note: suitability might depend on the type of the hash input - for some inputs it might be OK, for others not
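A sketch of hash partitioning in Ruby, using MD5 (as MongoDB and Cassandra do, per the book) rather than Object#hash, whose value is randomly seeded per process; the number of partitions and the keys are assumptions for illustration.

require "digest"

NUM_PARTITIONS = 8

# MD5 gives the same 128-bit value for a key on every node and in every process,
# unlike Ruby's Object#hash, which is randomly seeded per process.
def partition_for(key)
  hash = Digest::MD5.hexdigest(key.to_s).to_i(16)
  hash / (2**128 / NUM_PARTITIONS) # each partition owns a contiguous range of hashes
end

partition_for("user:42") # always the same partition, wherever it is computed

# Keys that are adjacent in sort order scatter across partitions, which is why
# range queries become inefficient under hash partitioning.
("user:40".."user:49").map { |key| partition_for(key) } # e.g. [3, 0, 6, ...]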

Compound Primary Key (ref p204-205)

Skewed workloads and relieving hot spots

A simple technique is to add a random number to the beginning or end of the key. Just a two-digit decimal random number would split the writes to the key evenly across 100 different keys, allowing those keys to be distributed to different partitions. However, having split the writes across different keys, any reads now have to do additional work, as they have to read the data from all 100 keys and combine it. This technique also requires additional bookkeeping: it only makes sense to append the random number for the small number of hot keys; for the vast majority of keys with low write throughput this would be unnecessary overhead. Thus, you also need some way of keeping track of which keys are being split.
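A sketch of that technique (the HOT_KEYS list and key format are made up): writes to a known hot key get a random two-digit suffix, and reads of that key fan out over all 100 sub-keys and combine the results.

HOT_KEYS = ["celebrity:123"] # keys known (via separate bookkeeping) to be hot

def write_key(key)
  return key unless HOT_KEYS.include?(key)
  format("%s:%02d", key, rand(100)) # spread writes over 100 sub-keys
end

def read_keys(key)
  return [key] unless HOT_KEYS.include?(key)
  (0..99).map { |i| format("%s:%02d", key, i) } # read all sub-keys and combine
end

write_key("celebrity:123")      # => e.g. "celebrity:123:37"
write_key("user:42")            # => "user:42" (not hot, so no overhead)
read_keys("celebrity:123").size # => 100 reads to merge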


Hybrid approaches (Key range partitioning + Hash partitions) are also possible, for example with a compound key: using one part of the key to identify the partition, and another part for the sort order

Secondary indexes

A secondary index usually doesn’t identify a record uniquely, but rather, it’s a way of searching for occurrences of a particular value

Document-partitioned index (aka local index):

image

the secondary indexes are stored in the same partition as the primary key and value. This means that only a single partition needs to be updated on write, but a read of the secondary index requires a scatter/gather across all partitions.

scatter/gather

used by: MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, and VoltDB

Drawbacks:

Even if you query the partitions in parallel, scatter/gather is prone to tail latency amplification.

Most database vendors recommend that you structure your partitioning scheme so that secondary index queries can be served from a single partition, but that is not always possible, especially when you’re using multiple secondary indexes in a single query (such as filtering cars by color and by make at the same time)
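A sketch of a document-partitioned (local) secondary index with hypothetical in-memory partitions: each partition indexes only its own documents, so a write touches one partition, but a query by color has to scatter to every partition and gather the results.

# Hypothetical: each partition holds its own documents and its own local index on :color.
partitions = [
  { docs: { 191 => { make: "Honda", color: "red" }, 306 => { make: "Ford", color: "silver" } },
    color_index: { "red" => [191], "silver" => [306] } },
  { docs: { 768 => { make: "Audi", color: "red" } },
    color_index: { "red" => [768] } },
]

# Reading by a secondary attribute: scatter the query to every partition, gather results.
def cars_by_color(partitions, color)
  partitions.flat_map do |partition|
    partition[:color_index].fetch(color, []).map { |id| partition[:docs][id] }
  end
end

cars_by_color(partitions, "red") # => two cars, one gathered from each partition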

Term-partitioned index (aka global index):

image

the secondary indexes are partitioned separately, using the indexed values. An entry in the secondary index may include records from all partitions of the primary key. When a document is written, several partitions of the secondary index need to be updated; however, a read can be served from a single partition.


advantage of a global (term-partitioned) index over a document-partitioned index:

  • can make reads more efficient: rather than doing scatter/gather over all partitions, a client only needs to make a request to the partition containing the term that it wants.

downside of a global index:

  • writes are now slower and more complicated, because a write to a single document may now affect multiple partitions of the index (every term in the document might be on a different partition, on a different node).

    Rebalancing partitions

The process of moving data around between nodes in the cluster

Requirements:
  • After rebalancing, the load (data storage, read and write requests) should be shared fairly between the nodes in the cluster.
  • While rebalancing is happening, the database should continue accepting reads and writes.
  • Don’t move more data than necessary between nodes, to avoid overloading the network.

Strategies for rebalancing

How not to do it: hash mod N

Fixed number of partitions

create many more partitions than there are nodes, and assign several partitions to each node. For example, a database running on a cluster of 10 nodes may be split into 1,000 partitions from the outset, so that approximately 100 partitions are assigned to each node.


the number of partitions is usually fixed when the database is first set up, and not changed afterwards. Although in principle it’s possible to split and merge partitions (see next section), a fixed number of partitions is operationally simpler, and so many fixed-partition databases choose not to implement partition splitting. Thus, the number of partitions configured at the outset is the maximum number of nodes you can have, so you need to choose it high enough to accommodate future growth. However, each partition also has management overhead, so it’s counterproductive to choose too high a number.
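A sketch of the fixed-partitions approach (the numbers are just for illustration): keys always hash into the same 1,000 partitions, so rebalancing only changes the partition-to-node assignment and moves whole partitions; the new node steals a fair share from each existing node and everything else stays where it is (whereas hash mod N over the node count would reshuffle most keys).

NUM_PARTITIONS = 1_000 # fixed when the database is first set up

# partition => node, starting with 4 nodes holding 250 partitions each
assignment = (0...NUM_PARTITIONS).to_h { |p| [p, "node#{p % 4}"] }

# Adding a node: it steals a fair share of whole partitions from the existing
# nodes; every other partition -> node mapping stays put, and the key -> partition
# mapping (e.g. hash(key) mod NUM_PARTITIONS) never changes at all.
def add_node(assignment, new_node)
  target = assignment.size / (assignment.values.uniq.size + 1)
  assignment.group_by { |_, node| node }.each do |_, pairs|
    pairs.first(pairs.size - target).each { |partition, _| assignment[partition] = new_node }
  end
end

add_node(assignment, "node4")
assignment.values.tally # => 200 partitions per node; only 200 of the 1,000 moved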

Dynamic partitioning

Partitioning proportionally to the nodes

Operations: automatic or manual rebalancing

Fully automated rebalancing can be convenient, because there is less operational work to do for normal maintenance. However, it can be unpredictable. Rebalancing is an expensive operation, because it requires re-routing requests and moving a large amount of data from one node to another. If it is not done carefully, this can overload the network or the nodes, and cause performance problems for other systems.

This can be dangerous in combination with automatic failure detection. For example, say one node is overloaded and is temporarily slow to respond to requests. The other nodes conclude that the overloaded node is dead, and automatically rebalance the cluster to move load away from it. This puts additional load on the other nodes and the network, thus potentially overloading more nodes and causing a cascading failure.

Request routing

image

There are a few different approaches to this problem (a sketch of approach 2 follows the list):

  1. Allow clients to contact any node (e.g. via a round-robin load balancer). If that node coincidentally owns the partition to which the request applies, it can handle the request directly; otherwise it forwards the request to the appropriate node.

  2. Send all requests from clients to a routing tier first, which determines the node that should handle the request and forwards it accordingly. This routing tier does not itself handle any requests, it only acts as a partition-aware load balancer

  3. Require that clients be aware of the partitioning and the assignment of partitions to nodes. In this case, a client can connect directly to the appropriate node, without any intermediary.
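A sketch of approach 2, a partition-aware routing tier (the partition map, addresses and request shape are made up): the router only looks up which node owns the key’s partition and forwards the request there.

require "digest"

# Hypothetical routing tier: it tracks the current partition -> node assignment
# (kept up to date by the cluster) and forwards each request accordingly.
PARTITION_MAP = { 0 => "10.0.0.1", 1 => "10.0.0.2", 2 => "10.0.0.3" }

def node_for(key)
  partition = Digest::MD5.hexdigest(key).to_i(16) % PARTITION_MAP.size
  PARTITION_MAP.fetch(partition)
end

def route(request)
  node = node_for(request[:key])
  # the routing tier handles no data itself; it only forwards to the owning node
  { forward_to: node, request: request }
end

route({ key: "user:42", op: :get }) # => { forward_to: "10.0.0.x", request: {...} }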

Ch 7: Transactions

Why?

  1. hardware may fail
  2. application may crash
  3. network interruptions
  4. Several clients may write
  5. reads of partially updated data
  6. Race conditions

Interesting items to be discussed:

  • concurrency control
  • race conditions
  • isolation levels (i.e. read committed, snapshot isolation and serializability)

TXNs make error handling easier for Application

The slippery concept of a transaction

There emerged a popular belief that transactions were the antithesis of scalability, and that any large-scale system would have to abandon transactions in order to maintain good performance and high availability [5, 6]. VS On the other hand, transactional guarantees are sometimes presented by database vendors as an essential requirement for “serious applications” with “valuable data”.

Both viewpoints are pure hyperbole.

The meaning of ACID

in practice, one database’s implementation of ACID does not equal another’s - e.g. there is a lot of ambiguity around the meaning of isolation

Half of ACID concepts are non-DB related

ACID vs BASE

Atomicity, Consistency, Isolation and Durability.

Basically Available, Soft state and Eventual consistency (some leaderless replication systems use this?)

Atomicity - abortability would be a better term

The ability to abort a transaction on error, and have all writes from that transaction discarded, is the defining feature of ACID atomicity

It does not describe what happens if several processes try to access the same data at the same time, because that is covered under the letter I for isolation

Consistency - not really a DB concept

an application-specific notion of the database being in a “good state”.

i.e. in an accounting system, credits and debits across all accounts must always be balanced.

this idea of consistency depends on the application’s notion of invariants, and it’s the application’s responsibility to define its transactions correctly so that they preserve consistency.

Isolation - TXNs appear to have run serially

Isolation in the sense of ACID means that concurrently executing transactions are isolated from each other: they cannot step on each others’ toes

reads behave as if the TXNs had run serially

Durability - not really a DB concept

data it has written will not be forgotten, even if there is a hardware fault or the database crashes.

usually also involves a write-ahead log or similar (see “Update-in-place vs. append-only logging”)

Single-object and multi-object operations

Based on the above -> we will focus on Atomicity and Isolation

Single Object Writes

  • Atomicity can be implemented using a log for crash recovery (see “Update-in-place vs. append-only logging” on page 80),
  • isolation can be implemented using a lock on each object (allowing only one thread to access an object at any one time).

Generally useless, as a transaction is usually understood as a mechanism for grouping multiple operations on multiple objects into one unit of execution.

Multi Object Writes

Many distributed datastores have abandoned multi-object transactions because they are difficult to implement across partitions, and they can get in the way in some scenarios where very high availability or performance are required

Used in:

  • rows that have a foreign key reference to a row in another table
  • in a document data model, document databases lacking join functionality also encourage denormalization
  • secondary indexes

Handling errors and aborts

ACID databases are based on this philosophy: if the database is in danger of violating its guarantee of atomicity, isolation or durability, it would rather abandon the transaction entirely than allow it to continue.

BUT

Not all systems follow that philosophy: especially datastores with leaderless replication

Popular object-relational mapping (ORM) frameworks such as Rails’ ActiveRecord and Django don’t retry aborted transactions — the error usually results in an exception bubbling up the stack, so any user input is thrown away and the user gets an error message. This is a shame, because the whole point of aborts is to enable safe retries.

A Rails app needs to catch deadlock errors manually and retry

Retrying an aborted transaction is a simple and effective error-handling mechanism, but it isn’t perfect; example cases (a retry sketch follows the list):

  • network failures
  • error is due to overload,
  • Only transient errors are worth retrying
  • TXN side-effects outside of the database,
  • client process fails
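A hedged, Rails-flavoured sketch of the manual retry discussed above (the with_retries helper and the Account model are ours, not Rails APIs): it retries the whole transaction a bounded number of times, and only for transient errors like deadlocks or serialization failures.

# Hypothetical helper: re-run a transaction when it is aborted for a transient
# reason, instead of letting the exception bubble up to the user.
def with_retries(max_attempts: 3)
  attempts = 0
  begin
    ActiveRecord::Base.transaction { yield }
  rescue ActiveRecord::Deadlocked, ActiveRecord::SerializationFailure
    attempts += 1
    raise if attempts >= max_attempts # only transient errors are worth retrying
    sleep(0.05 * attempts)            # brief backoff before trying again
    retry
  end
end

# The whole block is re-executed on retry, so it must be safe to run twice
# (beware side effects outside the database, e.g. sending an email).
with_retries do
  account = Account.lock.find(42) # row-level lock: SELECT ... FOR UPDATE
  account.update!(balance: account.balance - 10)
end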

    Weak isolation levels

transaction isolation - serializable isolation means that the database guarantees that transactions have the same effect as if they ran serially, i.e. one at a time, without any concurrency.

Serializable isolation is not on by default, as it’s expensive for the DB to implement

“use an ACID database if you’re handling financial data!”, but that misses the point. Even many popular relational database systems (which are usually considered “ACID”) use weak isolation.

Read committed

  1. When reading from the database, you will only see data that has been committed (no dirty reads).
  2. When writing to the database, you will only overwrite data that has been committed (no dirty writes).

    No dirty reads

why:

  • If a transaction needs to update several objects, a dirty read means that another transaction may see some of the updates but not others.
  • If a transaction aborts, any writes it has made need to be rolled back

    No dirty writes

Why - avoids some kinds of concurrency problems::

  • If transactions update multiple objects, dirty writes can lead to a bad outcome (e.g. Alice wins the bid but Bob gets the invoice)
  • read committed does not prevent the race condition between two counter increments

Implementing read committed

databases prevent dirty writes by using row-level locks: when a transaction wants to modify a particular object (row or document), it must first acquire a lock on that object … locking is done automatically by databases in read committed mode (or stronger isolation levels).

However, the approach of requiring read locks does not work well in practice, because one long-running write transaction can force many read-only transactions to wait until the long-running transaction has completed. This harms the response time of read-only transactions, and is bad for operability: a slowdown in one part of an application can have a knock-on effect in a completely different part of the application, due to waiting for locks.

Only when the new value is committed do transactions switch over to reading the new value.

Snapshot isolation and repeatable read

Non-repeatable read (read skew) is a timing anomaly. Some situations cannot tolerate such temporary inconsistency:

  • backups
  • Analytic queries and integrity checks

Snapshot isolation, also known as multiversion concurrency control (MVCC), is the most common solution to this problem

MVCC maintains several versions of an object side by side

Implementing snapshot isolation

readers never block writers, and writers never block readers

For snapshot isolation, the database must potentially keep several different committed versions of an object, because various in-progress transactions may need to see the state of the database at different points in time. Hence snapshot isolation is also known as a multiversion technique

Visibility rules for observing a consistent snapshot

  1. At the start of each transaction, the database makes a list of all the other transactions which are in progress (not yet committed or aborted) at that time. Any writes made by one of those transactions are ignored, even if the transaction subsequently commits.
  2. Any writes made by aborted transactions are ignored.
  3. Any writes made by transactions with a later transaction ID (i.e. which started after the current transaction started) are ignored, regardless of whether that transaction has committed.
  4. All other writes are visible to the application’s queries.

These rules apply to both creation and deletion of objects

A transaction can observe a consistent snapshot for a long time, continuing to read values which (from other transactions’ point of view) have long been overwritten or deleted.

PostgreSQL stores transaction IDs on each tuple (xmin for the creating transaction, xmax for the deleting/overwriting one); a sketch of the visibility check follows below.
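A sketch of the visibility rules above, with a hypothetical tuple layout loosely mirroring PostgreSQL’s xmin/xmax: each version records the transaction that created it and, if overwritten or deleted, the transaction that removed it; the snapshot is the reader’s transaction ID plus the sets of in-progress and aborted transactions at the time it started.

# Hypothetical tuple layout: value plus the creating and deleting transaction IDs.
Version = Struct.new(:value, :created_by, :deleted_by, keyword_init: true)

# Snapshot taken when a transaction starts.
Snapshot = Struct.new(:txn_id, :in_progress, :aborted, keyword_init: true) do
  # Rules 1-3 above: ignore in-progress, aborted, and later transactions.
  def effects_visible?(txn)
    txn < txn_id && !in_progress.include?(txn) && !aborted.include?(txn)
  end

  # A version is visible if its creator is visible and its deleter is not.
  def visible?(version)
    effects_visible?(version.created_by) &&
      (version.deleted_by.nil? || !effects_visible?(version.deleted_by))
  end
end

versions = [
  Version.new(value: 500, created_by: 3, deleted_by: 13), # overwritten by txn 13
  Version.new(value: 400, created_by: 13, deleted_by: nil),
]

# Txn 12 started while txn 13 was still in progress, so it keeps seeing the old
# value 500 even after txn 13 commits.
snapshot = Snapshot.new(txn_id: 12, in_progress: [13], aborted: [])
versions.select { |v| snapshot.visible?(v) }.map(&:value) # => [500]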

Indexes and snapshot isolation

many implementation details determine the performance of multiversion concurrency control. For example, PostgreSQL has optimizations for avoiding index updates if different versions of the same object can fit on the same page

MySQL -> UNDO log (might impact READ performance)

Preventing lost updates

Preventing write skew and phantoms

Serializability