Advanced Database Systems (COMP90050)

Ruochen ChenMarch 2, 2025About 54 min

Advanced Database Systems (COMP90050)

Week 1

Performance of a database system comes from Hardware (The speed of the processor, Number of processors, Number of drives and I/O bandwidth, Size of main memory, Communication network, Type of architecture), Software (Type and details of database technology used for a given application). and Database tuning, crash recovery (Indexing parameters, Data duplication, Sharing data)

Disk

Tracks: circular path on disk surface
Tracks are subdivided into disk sectors
magnetic head, which reads and writes data to the platter surfaces

Modelling Disk Access

Disk access time = seek time + rotational delay + (transfer length / bandwidth)

Seek time: the time the head of the actuator arm moves from a position to a required track.
Rotational delay: the time the rotation of the disk brings the required track sector to the head of the actuator arm.
Transfer length: the number of bytes to be transferred.
Bandwidth: the maximum rate of data transfer.

Disk Access Time

For HDD:

Disk Access Time = Seek Time + Rotational Delay + Transfer Time

For SSD:

Disk Access Time = Transfer Time

For both, Transfer Time = Transfer Length / Bandwidth

SSD (Solid-State Drive/Solid-State Disk)

No moving parts like Hard Disk Drive (HDD)
Silicon rather than magnetic materials
No seek/rotational latency
No start-up times like HDD
Runs silently
Random access of typically under 100 micro-seconds compared 2000 - 3000 micro-seconds for HDD
Relatively very expensive, nevertheless did not effect DBMS when they were born so key design decisions remain
Certain read/write limitations plagued it for years
SSD do not have seek and rotation delays.
Moore’s law: memory chip capacity doubles every 18 months since 1970
Joy’s law for processors: processor performance doubles every two years since 1984

Info

HDD vs SSD

Feature	HDD	SSD
Storage Type	Magnetic disks	Flash memory
Speed	Slower (100MB/s to 200MB/s)	Faster (500MB/s to 5000MB/s+)
Durability	Less durable, susceptible to damage	More durable, resistant to shocks
Capacity	Larger capacities (up to 10TB+)	Smaller capacities (up to 4TB+)
Noise	Noisy due to moving parts	Silent
Power Consumption	Higher power consumption	Lower power consumption
Cost	More affordable for large storage	More expensive per GB

Sequential reads are ideal for performance-sensitive workloads because the high throughput compensates for the initial seek.
Random reads, especially of small amounts of data, are heavily penalized by seek time, leading to poor performance.

Hit Ratio

Hit ratio = (Number of hits) / (Number of requests), or references safisfied by cache / total references
Hit ratio (HR): the percentage of requests that are satisfied by the buffer.
If the hit ratio (HB) is high, the effective access time (EA) will be closer to the buffer access time (BC), meaning accessing data will generally be fast.
If the hit ratio is low, the access time will be closer to the disk access time (D), meaning it will take longer to access the data, as more requests will result in disk accesses instead of buffer hits.
Effective memory access time, *EA = H*C+(1-H)M, where H = hit ratio, C = cache access time; M = memory access time

Week 2

Storage Area Networks (SANs)

A SAN is a network of storage devices that provides a single logical storage space to multiple hosts.
SANs are used to increase storage capacity, availability, and performance.
They are used for shared-disk file systems
They regularly also allow for automated back up functionality
It was the fundamental storage for data center type systems with mainframes for decades
Different versions evolved over time to allow for more data but fundamentals are the same even today
They came with their own networking capabilities
We visit a version of these when we see what can be done about failures
In a nutshell failure probability of one disk is different to 100s of disks which requires design choices

Communication Cost

transmit time = distance/c + message_bits/bandwidth, where c = speed of light (200 million meters/sec) with fibre optics
Today it is hard to reduce hardware related latency on contemporary systems further
The key cost in latency comes from software in networking
So each message length should be large to achieve better link utilization while keep the overall need to transfer data to a minimum

Module Availability

measures the ratio of service accomplishment to elapsed time
M = mean time to failure / (mean time to failure + mean time to repair), where Mean time to failure is the time elapsing before a failure is experienced

Database Types

数据库类型	特点	优点	缺点
Simple file(s)	简单文本文件，每行一条记录，字段由分隔符分开	对简单应用非常快	可靠性较低，维护困难，存在并发问题
RDBS	表格集合，行和列组成，键用于唯一标识每行	数据一致性高（有事务有ACID），与具体应用无关的优化，适用于多种应用	对于某些特殊应用可能较慢
Object Oriented DB Systems	直接存储对象，可能包含数据和方法的混合体	适合复杂数据处理的应用	应用范围有限，商业系统较少采用
NoSQL	非关系型数据库，灵活无固定模式，支持大数据处理	设计简单，线性扩展性好	一致性妥协，可能导致读取旧版本数据

Deductive database systems (DDBS)

Allows recursion, similar to logic programming paradigm
e.g. prolog

Database Architecture

架构类型 (Architecture Type)	描述 (Description)	优点 (Advantages)	缺点 (Disadvantages)
Centralized (集中式)	数据存储在一个中心位置。	简单易管理，数据一致性强。	单点故障风险高，可扩展性差。
Client-Server (客户端-服务器)	客户端（请求方）向服务器（服务提供方）发送请求，服务器处理请求并返回结果。数据通常存储在服务器端。	结构清晰，易于管理和维护；服务器可以集中处理数据和业务逻辑；可扩展性相对较好（可增加服务器）。	服务器可能成为性能瓶颈或单点故障；网络依赖性强；客户端需要安装或配置。
Distributed (分布式)	数据分布在多个节点上，可以位于不同地点，节点间可能协同工作。	高可用性，负载均衡，容错能力强；可扩展性好。	复杂度高，一致性维护困难（如CAP理论），网络延迟可能影响性能。
WWW (万维网)	数据分散在世界各地，有多个所有者，缺乏统一组织结构。	资源极其丰富，访问极其广泛，用户界面统一（浏览器）。	组织结构松散，安全性低，数据质量参差不齐，难以保证一致性和可靠性。
Grid (网格)	类似于分布式，但每个节点管理自己的资源，并遵循一定规则或协议进行协作。	资源共享，协同工作能力强；可扩展性好；容错性较好。	结构复杂，管理难度大，部署和维护成本高。
P2P (点对点)	与Grid类似，但节点地位平等，可以自由加入或离开网络，资源共享更自由。	自由度高，去中心化，抗单点故障能力强。	安全性和稳定性较差，资源查找效率可能不高，难以进行集中管理。
Cloud (云计算)	Grid的泛化版，资源（计算、存储等）作为服务按需获取，通常由第三方公司运营和管理。	弹性伸缩，按需付费，成本低廉（通常），易于管理（对用户而言）。	依赖互联网连接，隐私和数据安全问题（可能受制于服务商），服务可用性依赖于提供商。

Simple Nested Loop Join

for each tuple tr in r do begin
	for each tuple ts  in s do begin
		test pair (tr,ts) to see if they satisfy the join condition theta 
		if they do, add tr • ts to the result.
	end
end

r is called the outer relation and s the inner relation of the join.
Expensive since it examines every pair of tuples in the two relations.

Nested Loop Join and Block Nested-Loop Join

Criteria	Nested Loop Join (NLJ)	Block Nested-Loop Join (BNLJ)
Data Access Level	Record-by-record	Block-by-block
I/O Cost	High	Reduced
Seeks	High	Lower
Memory Requirement	Low	Higher (requires memory to store a block)
Best for	Small tables or indexed joins	Large tables with limited memory
Performance	Poor for large tables	Better for large tables

Worst Case Cost Estimation

r is called the outer relation and s is the inner relation of the join. In the worst case, if there is enough memory only to hold one block of each table, the estimated cost is:

Given:

Table	Records	Blocks
`customer`	10,000	400
`depositor`	5,000	100

Costs:

Method	Block Transfers	Seeks
Nested Loop Join	10,000 * 100 + 400 = 1,000,400	10,000 + 400 = 10,400
Block Nested-Loop Join	400 * 100 = 40,000	400 + 100 = 500

Nested Loop Join 会遍历 customer 表的每一行，然后对每一行都扫描整个 depositor 表，因此它的 Block Transfers 和 Seeks 都非常高。
Block Nested-Loop Join 会以块为单位处理 customer 表，减少了对 depositor 表的重复扫描次数，因此它的 Block Transfers 和 Seeks 都显著降低。

Other Join Comparison

Criteria	Nested-Loop Join	Block Nested-Loop Join	Indexed Nested-Loop Join	Merge Join	Hash Join
Data Access Level	Record-by-record	Block-by-block	Record-by-record (with index)	Sort and merge	Partitioning and hashing
I/O Cost	High	Reduced	Low	Low	Low
Seeks	High	Lower	Low	Low	Low
Memory Requirement	Low	Higher (for block storage)	Low	Moderate	High
Best for	Small tables or indexed joins	Large tables with limited memory	When the inner table has an index	When both tables are sorted	Large tables where sorting is expensive
Performance	Poor for large tables	Better for large tables	Efficient when an index exists	Very efficient for sorted data	Efficient when no natural order of data exists

Week 3

Sql query will be convert to relational algebra and then convert to physical plan.
Left deep join: the left table is processed first, then the right table is processed. This is what the optimizer will do by default.

Benefits of Relational Algebra and how to generate alternatives

Query optimizers use equivalence rules to systematically generate expressions equivalent to a given expression
SQL does not give this power as it defines what you want to get and not how you get it (good for the user but not for the machine)
And hence relational algebra is used, which is a procedural language and deals with the “how” part
One can generate all equivalent expressions exhaustively given a relational algebra expression, but it is very expensive in space and time for complex queries
So most systems take short cuts, i.e. consider only some plans

Heuristic and exhaustive optimization techniques

Perform selections early (reduces the number of tuples)
Perform projections early (reduces the number of attributes)
Perform most restrictive selection and join operations (i.e. with smallest result size) before other similar operations
Optimizers often use simple heuristics for very cheap queries, and perform exhaustive enumeration for more expensive queries

Indexing (A key choice to make for optimization)

Indexing mechanisms used to speed up access to desired data in a similar way to look up in a phone book
Search Key - attribute or set of attributes used to look up records/rows in a system, like an ID of a person
An index file consists of records (called index entries) of the form search-key, pointer to where data is
Index files are typically much smaller than the original data files and many parts of it are already in memory

Hash indices (more on start of week 4)

search keys are distributed hopefully uniformly across “buckets” using a “function”
A hash index organizes the search keys, with their associated record pointers, into a hash file structure. Order is not important.
Hash indices are always secondary indices.
Given a key the aim is to find the related record on file in one shot which is important.
An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.
Ideal hash function is random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.
Typical hash functions perform computation on the internal binary representation of the search-key.

What gets faster with index?

Disk access, insertions, deletions, space overhead

B+ Tree

InnoDB default index type
Automatically reorganizes itself with small, local, changes, in the face of insertions and deletions.
Reorganization of entire file is not required to maintain performance.
Most of the higher level nodes of a B+tree would be in main memory already!
Check this out

Leaf nodes in a B+-tree file organization store records, instead of pointers
- Helps keep data records clustered even when there are insertions/deletions/updates
pointers support range queries

Metrics

If there are K search-key values in the file, the height of the tree is no more than logn/2 and it would be balanced, where n is the number of search-key values per node 一个节点最多的子节点数（即树的“阶”.
A node is generally the same size as a disk block, typically 4 kilobytes
- and n is typically around 100 (40 bytes per index entry).
With 1 million search key values and n = 100
- log50(1,000,000) = 4 nodes are accessed in a lookup.
Contrast this with a balanced binary tree with 1 million search key values — around 20 nodes are accessed in a lookup
- above difference is significant since every node access may need a disk I/O, costing around 20 milliseconds

Week 4

Hash Indexing

Hashing is a technique used to map data of an arbitrary size to a fixed-size value, typically a number.
The hash function takes an input of any size and produces a fixed-size output.
The hash function should be deterministic, meaning that it should always produce the same output for the same input.

Bitmap Indexing

A bitmap index is a data structure that stores a set of values in a compact form, allowing for fast and efficient retrieval of values that match a given condition.
Applicable on attributes that take on a relatively small number of distinct values, used for business analysis
A bitmap index is used to quickly locate the values that match a given condition, without having to search through the entire index.
A bitmap index is typically used to speed up queries that involve a large number of values that are stored in a database.
Bitmap Example
Check This Out

Index Structures Comparison

Index Type	Data Structure	Key Characteristics	Advantages	Disadvantages	Best Use Cases
B+ Tree	Balanced Tree	Multi-way search tree All data in leaf nodes	- Disk-friendly - Efficient range queries - Stable performance	- High insert/delete overhead - Space consumption	Relational databases Ordered data Range queries
Hash Index	Hash Table	Direct mapping Keys to storage locations	- Fast insertion - O(1) equality queries - Compact space	- No range queries - Hash collisions - Not order-preserving	Equality lookups Caching Unique identifiers
Bitmap Index	Bitmap array	Represent data existence via bits	- Compressed storage - Fast bit operations - Efficient aggregation	- High update cost - Suitable for low-cardinality columns - Space overhead	Data warehouses Analytical queries Low-mutation data
Quadtree	Tree-based spatial division	Recursive four-way space partitioning 2D spatial indexing	- Fast spatial queries - Data clustering - Recursive division	- Unbalanced depth - Additional space overhead - Dimension-limited	Geographic Information Systems Spatial clustering Collision detection
K-D Tree	Binary space partitioning	Multi-dimensional space recursive division	- High-dimensional space queries - Nearest neighbor search - Dynamic adjustment	- Complex construction - Unstable performance - Curse of dimensionality	Machine learning Spatial clustering High-dimensional data
R-Tree	Balanced tree for spatial data	Hierarchical minimum bounding rectangles	- Efficient multi-dimensional indexing - Handles complex spatial queries - Supports overlapping regions	- Higher computational complexity - Overhead in updates - Performance degrades with high dimensionality	Geographical databases Spatial networks Location-based services
R+ Tree	Variant of R-Tree	No overlapping node regions	- Improved query performance - More precise spatial indexing - Reduced search space	- More complex construction - Higher maintenance cost - Less flexible with irregular data	Advanced spatial databases Precise geographic querying Complex spatial analysis

B+ Tree

Widely used in database indexing
Maintains data order
Supports efficient range queries

Hash Index

Commonly used for fast key-value lookups
Trades space for query speed
Unsuitable for sorting-required scenarios

Bitmap Index

Compressed storage
Ideal for low-cardinality, infrequently updated data
Bit operations provide efficient aggregation

Quadtree

2D space specialized indexing
Recursive division improves query efficiency
Common in Geographic Information Systems
典型的 Quadtree 实现中，其叶节点通常代表一个空间区域，该区域最终只包含一个数据点

K-D Tree

Multi-dimensional space generic indexing
Effective method for high-dimensional data
Excellent for nearest neighbor search scenarios、

R-Tree

Designed for multi-dimensional indexing
Uses minimum bounding rectangles (MBR)
Key in spatial databases and GIS
Handles non-point spatial objects efficiently
Bounding boxes of children of a node are allowed to overlap, multiple paths may need to be searched due to overlaps
To find data items intersecting a given query point/region, do the following, starting from the root node:
- If the node is a leaf node, output the data items whose keys intersect the given query point/region.
- Else, for each child of the current node whose bounding box intersects the query point/region, recursively search the child.

R+ Tree

Addresses R-Tree's overlapping limitations by ensuring no node region overlap
Ensures no node region overlap
Provides more precise spatial querying
Computational overhead for maintenance

Week 5

ACID Properties

Atomicity: A transaction is a single unit of work that either succeeds completely or fails completely.
Consistency: A transaction must maintain database consistency.
Isolation: Transactions should not interfere with each other.
Durability: Once a transaction is committed, it should remain committed even in the event of a system failure.

Flat Transactions

A flat transaction is a transaction that is executed as a single unit of work.
Flat transactions do not model many real applications
If a system fails during a flat transaction, all the changes are lost.

Savepoints

a mechanism that allows a transaction to create a savepoint before executing a set of statements.
If a system fails during a savepoint, all the changes made after the savepoint are lost.

Nested Transactions

Commit Rules:
- A subtransaction can either commit or abort, however, commit cannot take place unless the parent itself commits.
- Subtransactions have A, C, and I properties but not D property unless all its ancestors commit.
- Commit of subtransaction makes its results available only to its parents.
Rollback Rules:
- If a subtransaction rolls back, all its children are forced to roll back.
Visibility Rules:
- Changes made by a subtransaction are visible to the parent only when it commits.
- All objects of parent are visible to its children.
- Implication of this is that the parent should not modify objects while children are accessing them. [This is not a problem as parent does not run in parallel with its children.]

How do Transactions get to run: Monitors

The main function of a Transaction Processing Monitor is to integrate system components and manage resources so that transactions can run:
- TP monitors manage the transfer of data between clients and servers
- Breaks down applications or code into transactions and ensures that all the database(s) are updated properly
- It also takes appropriate actions if any error occurs

TP Monitor Services

Terminal management: Since many terminals run client software, the TP monitor should provide appropriate ACID properties between the client and the server processes
Presentation service: this is similar to terminal management in the sense it has to deal with different presentation (user interface) software
Context management: E.g. maintaining the sessions etc.
Start/Restart: Note: there is no difference between start and restart in TP based systems
Heterogeneity: If the application needs access to different DB systems, local ACID properties of individual DB systems is not sufficient. Local TP monitor needs to interact with other TP monitors to ensure the overall ACID properties.
Control communications: If the application communicates with other remote processes, the local TP monitor should maintain the communication status among the processes to be able to recover from a crash.

模块	主要职责
Terminal Management	管理客户端与服务器之间事务的一致性
Presentation Service	管理不同用户界面与后台服务之间的数据一致性
Context Management	跟踪用户会话和事务上下文
Start/Restart	系统故障恢复与事务一致性保障
Heterogeneity	支持跨多个数据库的分布式事务一致性
Control Communications	管理远程通信状态，支持事务恢复

using a single process for all the transaction processing monitor services is bad since One error in the process can impact all the transactions. This could lead to poor performance as well as monitors of different transactions cannot be distributed either.

Week 6

Some simple concurrency control approaches:

Dekker's algorithm (write some code to guarantee exclusive access) - needs almost no hardware support, but the code turns out to be complicated to implement especially for more than two transactions/processes
- takes storage space
- uses busy waiting
- efficient if the lock contention is low (that is frequency of access to the locks)
OS supported primitives (through interrupt calls) - expensive, independent of number of processes
- OS supported primitives such as lock and unlock
- need no special hardware
- do not use busy waiting and therefore more effective sometimes
Spin locks (using atomic lock/unlock instructions) – common
- efficient for low lock contentions - many DBMSs systems use them
- uses busy waiting
- Need some hardware support
  - Should be able to lock bus (communication channel between CPU and memory + any other devices) for two memory cycles (one for reading and one for writing).
  - During this time no other devices’ access is allowed to this memory location

Compare and swap

使用Compare and Swap操作可以在多线程环境中安全地实现变量的更新，避免了竞态条件和数据不一致的问题。

特性	自旋锁 (Spinlock)	Compare-and-Swap (CAS)
本质	一种同步锁机制/数据结构	一种原子操作/指令
目的	保护临界区，确保互斥访问	原子地比较并更新内存中的值
获取失败行为	忙等待（循环不断尝试获取锁）	立即返回结果（成功或失败，可能带旧值）
线程状态	获取失败时，线程保持运行状态，占用CPU	通常用于无锁算法，线程自行决定后续操作
资源消耗	长时间持有锁时，浪费CPU资源	单次操作开销小，但循环使用CAS会消耗CPU
实现方式	通常使用 CAS 或 Test-and-Set 等原子操作	现代CPU通常提供硬件支持
适用场景	锁持有时间短、竞争不激烈	实现无锁/低锁结构、原子变量、乐观锁
主要优势	锁获取成功后延迟低（无上下文切换）	无锁，避免阻塞和上下文切换，可伸缩性好
主要劣势	竞争激烈或持有时间长时效率低，耗CPU	实现复杂，可能发生无限重试（ABA问题）
关系	CAS 常被用作自旋锁的实现基础	CAS 是一种底层原子操作，可用于构建锁

特性	Test-and-Set (TAS)	Compare-and-Swap (CAS)
基本操作	读取并设置固定值 (通常是 true/1)	比较期望值，若匹配则更新为新值
灵活性	低，总是设置固定值	高，可以检查任意期望值并更新为任意新值
所需参数	通常只需内存地址	需要内存地址、期望值、新值
返回值	原始值 (用于判断是否成功获取锁)	通常返回操作是否成功 (布尔值)，有时也返回旧值
主要用途	实现简单的自旋锁	实现自旋锁、无锁数据结构、原子变量、乐观锁等
复杂度	较简单	相对复杂一些
硬件支持	许多处理器支持	现代处理器普遍支持
关系	可以用 CAS 来实现 TAS (CAS(addr, 0, 1))	CAS 是更通用的原子操作，TAS 可以看作是 CAS 的一个特例
总结:

TAS 更简单直接，适合实现非常基础的锁机制。它的行为是“尝试获取锁：如果没被占，我就占；如果被占，我知道了”。
CAS 更强大灵活，它说“尝试更新：只有当当前状态是我预期的样子时，我才把它改成新的样子；否则我就放弃这次更新，并告诉你实际情况”。这种条件性的原子更新能力使得 CAS 在构建复杂并发结构时非常有用。

A semaphore is a more general concurrency control tool, more flexible than a lock. Its working mechanism can be understood as:

get(): Attempt to acquire the semaphore, if it is already occupied:

If no one is waiting, "wait"
If someone is waiting, add yourself to the waiting queue

give(): Release the semaphore, and:

If there is a waiting queue, wake up the first one (FIFO)

Feature	Semaphore	Spin Lock
Nature	Concurrency control mechanism (counting allowed)	A type of lock implementation
Thread Blocking	Yes (threads are put to sleep while waiting)	No (threads busy-wait without yielding CPU)
Suitable Scenarios	High-contention situations needing queuing	Short critical sections, context switch is expensive
CPU Usage	Low (waiting threads don't consume CPU)	High (waiting threads consume CPU while spinning)
Wake-up Mechanism	FIFO queue wakes the next waiting thread	No wake-up mechanism, relies on continuous retry
Counting Support	Yes (can manage multiple resources)	No (typically only supports 1 lock)
Risk of Deadlock	Yes (misuse can lead to deadlocks)	Yes (can hang if lock isn't properly released)
Typical Implementation	OS-level (e.g., P/V operations, wait queues)	Atomic variables + busy-loop (user/kernel space)
Language Primitives	Java's `Semaphore`, POSIX `sem_wait()`	Atomic ops in C, custom SpinLock in Java
Reentrancy	Not reentrant by default	Not reentrant by default

Semaphore is like “you queue up and wait to be called”;

Spin Lock is like “you keep knocking on the door until someone opens”.

Deadlock solutions

In a deadlock, each process in the deadlock is waiting for another to release the resources it wants.
Have enough resources so that no waiting occurs, but not practical
Do not allow a process to wait long, simply rollback after a certain time
Linearly order the resources and request of resources should follow this order
Pre-declare all necessary resources
Periodically check the graph for cycles
Allow waiting for a maximum time on a lock then force Rollback

Recall that we want Concurrency

achieve isolation by sequentially processing each transaction
- not efficient
  - Provides poor response times
  - Wastes useful resources by not allowing transactions to use them
Concurrent execution should:
- still make sense/work
- not cause application programs (transactions) to malfunction
- not have lower throughput or bad response times than serial execution

Week 7

Dependency model

Concurrency control is about finding/addressing all the associated problems
need to define what actions and what objects in multiple concurrent transactions can cause a problem: dependency graph
Read-Read dependency does not affect Isolation
When dependency graph has cycles (circular dependencies) then there is a violation of isolation and a possibility of inconsistency
If two transactions have the same dependency graph, then they will have the same final state.

Define dependencies

Given two histories H1 and H2, H1 and H2 are equivalent if DEP(H1) = DEP(H2)
This implies that a given database will end up in exactly the same final state by executing either of the sequence of operations in H1 or H2
从历史序列 H1 根据定义求出 DEP(H1)，方法是枚举所有操作对，只要满足对象相同、顺序正确且中间没有写操作，就形成依赖
- example
```
H = ⟨(T1, R, O1), (T3, W, O5), (T3, W, O1), (T2, R, O5),
  (T2, W, O2), (T5, R, O4), (T1, R, O2), (T5, R, O3)⟩
```
1. T1 R O1 → 后面 T3 W O1 → 冲突：T1 → T3 ✅
2. T3 W O5 → 后面 T2 R O5 → 冲突：T3 → T2 ✅
- 1. T2 W O2 → 后面 T1 R O2 → 冲突：T2 → T1 ✅
- 所以 DEP (H) = < T 1, O1, T 3 >, < T 3, O5, T 2 >, < T 2, O2, T 1 >

Serial History

A history is considered isolated if and only if it is equivalent to a serial history.
In other words, all transactions are executed in a sequential order.
A serial history is the result obtained by executing each transaction in sequence.
N transactions can produce up to N! different serial histories.
There is no need to worry about choosing which serial history to accept.
So creating a concurrent transaction set that is actually equivalent to a serial history will be isolated.

Bad Executions: Wormhole

Transaction Order:
- If transaction T1 is executed before transaction T2, it is denoted as T1 << T2.
Before and After Sets:
- Before(T) = {T' | T' << T}: Represents the set of all transactions executed before transaction T.
- After(T) = {T' | T << T'}: Represents the set of all transactions executed after transaction T.
Definition of Wormhole Transaction:
- If a transaction T' belongs to both Before(T) and After(T), i.e., T' ∈ Before(T) ∩ After(T), then T' is called a wormhole transaction.
- This implies the existence of a circular dependency relationship T << T' << T.
Impact of Wormhole Transactions:
- The presence of wormhole transactions indicates that there is no good isolation between transactions, i.e., the conditions for serial execution are not met.
- This situation can lead to cycles in the transaction dependency graph, so disrupting the isolation of transactions.
Wormhole Theorem:
- A history is isolated (isolated) if and only if it has no wormhole transactions.

To avoid wormhole

use locks to make sure access is done in an orderly fashion, but need think about:
- types of locks we get
- how and when we get locks considering the progression of the transaction
- also paying attention to how we release the locks and when
avoid cycles in the dependency graph

Lock Compatibility Matrix

如果当前某个事务已经持有一个对象的锁，而你请求的锁与之不兼容，那你就不能获得锁，必须等待
共享锁又叫读锁（允许其他读，不允许写），独占锁又叫写锁（不允许其他读写）

Current Mode of Lock

Request	Free	Shared	Exclusive
Shared request (SLOCK) Used to block others writing/modifying	Compatible Request granted immediately Changes Mode from Free to Shared	Compatible Request granted immediately Mode stays Shared	Conflict Request delayed until the state becomes compatible Mode stays Exclusive
Exclusive request (XLOCK) Used to block others reading or writing/modifying	Compatible Request granted immediately Changes Mode from Free to Exclusive	Conflict Request delayed until the state becomes compatible Mode stays Shared	Conflict Request delayed until the state becomes compatible Mode stays Exclusive

Locking Theorem

A transaction is two phase if its all lock operations precede its unlock operations.
Locking theorem: If all transactions are well formed and two-phased, then any legal history will be isolated. Otherwise possible wormhole transactions.
Rollback theorem: An update transaction that does an UNLOCK and then does a ROLLBACK is not two phase.

Two-Phase Locking (2PL) and Strict Two-Phase Locking (S2PL)

two-phase locking guarantee serializability, with two-phase we want to prevent transactions from getting in-between each other and causing cycles in the dependencies basically, 都是悲观锁
“没有缩小阶段”（no shrinking phase）是指在 S2PL，一旦事务获得了所需的锁，它就不会在事务执行过程中释放这些锁
Locks must be released when commit/abort happens
Reduces concurrency and efficiency further。牺牲部分并发性来换取更强的一致性保障，特别是防止级联回滚
preventing cascading aborts 级联回滚:
- 当一个事务的中止导致多个事务回滚并且浪费工作，因为它们读取了一个被中止事务写入的对象时，这种情况下，S2PL是一个更好的选择，因为它可以减少资源的浪费

特性	普通 2PL	严格 2PL (S2PL)
锁的释放时机	可在事务中途释放	必须在 commit 或 abort 时统一释放
是否有“缩小阶段”	✅ 有缩小阶段	❌ 没有缩小阶段
是否防止级联回滚	❌ 不一定（可能读取未提交的数据）	✅ 可以防止（不允许读未提交的数据）
并发性	相对较高	更低（锁保留时间更长）
实现复杂度	较低	稍高
常见使用场景	适用于某些可容忍回滚的并发系统	适用于需严格隔离和避免错误传播的系统

Phantoms

commonly occur when we lock records but another transaction adds a new record
So reading the same table twice from one transaction presents two different sets of records as the other has inserted some in the meantime
Predicate locks solve this problem. Rather than locking records, lock based on condition
Page Locks can solve this problem too. In this case, the storage should be organized based on values of attributes.
High volume systems use specialized locking mechanisms as well:
- Key Range Locks: to protect b-trees completeness
- Hole Locks: to protect space for uncommitted deletes

Week 8

Degrees of Isolation

Degree	保证内容	对应 SQL 隔离级别
0	事务仅保证不写脏数据	低于 `Read Uncommitted`
1	不写脏数据，不读脏数据	`Read Uncommitted`/接近 `Read Committed`
2	+ 不可重复读	`Read Committed`（接近）
3	+ 无幻读，完整可串行化	`Serializable`

Isolation Level	Problems Avoided	Description
Read Uncommitted（Read Uncommitted）	❌Dirty Read、❌Non-repeatable Read、❌Phantom Read	A transaction can read data that has not been committed by other transactions. The lowest level, data is not secure.
Read Committed（Read Committed）	✅Dirty Read，❌Non-repeatable Read、❌Phantom Read	Can only read data that has been committed by other transactions. This is the default isolation level for SQL Server and Oracle.
Repeatable Read（Repeatable Read）	✅Dirty Read、✅Non-repeatable Read，❌Phantom Read	The same row data will be read consistently within a transaction. This is the default level for MySQL InnoDB.
Serializable（Serializable）	✅Dirty Read、✅Non-repeatable Read、✅Phantom Read	The highest isolation level, transactions are executed completely serially, with the lowest efficiency but the highest consistency. Usually implemented through locking or range query control.

Exception	Explanation
Dirty Read（Dirty Read）	A transaction reads data modified by an uncommitted transaction.
Non-repeatable Read（Non-repeatable Read）	A transaction reads the same data twice, with different results (because another transaction updated and committed the data).
Phantom Read（Phantom Read）	A transaction executes the same query twice, seeing "phantom" rows added (because another transaction inserted new rows).

Granularity of Locks

Lock Granularity	Description	Concurrency	System Overhead
Database-level Lock	Locks the entire database. The coarsest, suitable for management tools or backup operations.	Lowest	Smallest
Table-level Lock (Table Lock)	Locks an entire table, preventing other transactions from writing or reading and writing.	Low	Small
Page-level Lock (Page Lock)	Locks a page (e.g., 8KB) containing multiple rows. A compromise solution.	Medium	Medium
Row-level Lock (Row Lock)	Locks a single record/row, commonly used in OLTP systems.	High	High
Column-level Lock (Column Lock)	Locks a specific column, rarely used.	Highest	Highest

Acquire locks from root to leaf（从根到叶获取锁）
- 这意味着在访问或修改数据时，应该首先获取较高层次的锁（例如目录或父节点），然后再逐步获取较低层次的锁（例如子节点或叶子节点）。这样做可以确保在整个数据结构上的锁定顺序一致，避免死锁的发生。
Release locks from leaf to root（从叶到根释放锁）
- 这表示在完成数据操作后，应该先释放最低层次的锁（叶子节点），然后逐步向上释放较高层次的锁（直到根节点）。这样可以确保在释放锁的过程中，数据的完整性得到维护，同时减少锁持有的时间，提高系统的并发性能。

Lock Type (从排他性弱到强)

Lock Type	Full Name	Description
IS	Intent Shared（Intent Shared Lock）	Indicates that the transaction intends to add a shared lock at a finer granularity (e.g., row). Acts on table/page level.
IX	Intent Exclusive（Intent Exclusive Lock）	Indicates that the transaction intends to add an exclusive lock at a finer granularity. Allows multiple transactions to add exclusive locks at different granularity resources.
S	Shared Lock（Shared Lock）	Allows multiple transactions to read simultaneously, but cannot write. Used for read operations.
U	Update Lock（Update Lock）	A "middle lock" to prevent deadlocks, used in scenarios where a read is about to become a write. At most one U lock, other transactions can share reads but cannot write.
SIX	Shared with Intent Exclusive	Adds an S lock at the table level, allowing an X lock to be added at the row level. Typically, only one SIX lock can exist for a table.
X	Exclusive Lock（Exclusive Lock）	Prevents other transactions from reading or modifying the resource. The transaction itself can read and write, while others cannot do anything. Used for write operations.

加锁必须“从上到下声明意图”：保证锁的层次性和一致性

读操作：只要某一个父节点声明有“读意图”就够了。
写操作：所有父节点都要声明“写意图”。这是为了防止你在写某个对象时，上层节点没有声明写的意图，可能造成其他事务也读写同一块资源而无法检测到冲突。

Two-Version Locking and Nested Transactions and Locking

特性 / Feature	Two-Version Locking	Nested Transactions and Locking
目标 / Goal	提高读写并发性	管理嵌套事务中的锁继承与隔离
核心机制 / Core Idea	写事务创建临时版本；其他事务读取已提交版本	子事务锁递归继承至父事务，由最顶层事务最终决定提交或回滚
读操作影响 / Read Delay	只在提交阶段短暂阻塞	取决于锁粒度与子事务的并发结构
锁继承 / Lock Inheritance	无特殊继承机制	子事务完成后其锁会递归传递给父事务
父子事务并发 / Parent-Child Concurrency	不涉及事务结构	父事务不能与子事务并发执行；子事务可复用父事务锁
同级子事务访问 / Sibling Subtransactions	无需协调，同读不同版本	若访问同一对象，必须串行使用父事务的锁，保证串行化
适用场景 / Application Scenario	高并发读 + 轻写场景	模块化复杂业务流程，事务嵌套较多的系统
并发性 / Read-Write Concurrency	高	中等，依赖锁粒度和结构
实现复杂度 / Complexity	中等到高	高

Week 9

乐观锁允许事务在不加锁的情况下进行，然后在提交前进行验证以确保没有事务冲突。如果有冲突，则重新执行事务。这种方法可以提高并发性和性能，尤其是在冲突较少的环境中

Optimistic Concurrency Control and Pessmistic Concurrency Control

Feature	Optimistic Concurrency Control (OCC)	Pessimistic Concurrency Control (PCC)
Basic Assumption	Conflicts rarely occur	Conflicts frequently occur
Locking Strategy	Transactions do not lock data, allowing free reading and modification	Transactions lock data when reading and modifying, preventing other transactions from accessing
Conflict Detection Timing	Conflict detection is performed when the transaction is committed	Conflict detection is performed during the execution of the transaction
Performance Impact	Reduces lock contention, improves concurrency performance	Increases lock contention, may reduce concurrency performance
Applicable Scenarios	Read-heavy, write-light, with few conflicts	Write-heavy, with many conflicts
Transaction Termination Frequency	May be high, because conflict detection is performed at commit time	Usually low, because conflicts are prevented during the execution of the transaction
Retry Overhead	May be high, because aborted transactions need to be retried	Usually low, because fewer conflicts lead to fewer aborted transactions
Implementation Technology	Version control, shadow pages, validation, etc	Locking mechanism (shared locks, exclusive locks, etc)

"Backward Validation" and "Forward Validation" in OCC

项目	Backward Validation（向后验证）	Forward Validation（向前验证）
事务阶段	当前事务即将提交	当前事务正在执行或即将提交
当前事务创建的集合	读集（Read Set）	写集（Write Set）
检查的对象	与当前事务重叠的已提交事务的写集	与当前事务重叠的仍在执行的事务的读集
冲突检测目标	检查是否有其他事务修改了当前事务读取过的数据	检查是否有其他事务读取了当前事务准备写入的数据
假设状态	其他事务已完成	其他事务仍活跃中
冲突处理选项	通常：中止当前事务并重试	- 延迟当前事务 - 中止冲突事务 - 中止当前事务（多种选择）
策略选择依据	通常固定策略（如中止当前）	根据系统代价或策略决策（如事务优先级、代价）
并发性能	好（冲突后处理，提升读性能）	更高（灵活处理写冲突，减少无谓中止）
使用场景	冲突较少或主要关注读取一致性的系统	写入冲突更常见、希望更细粒度控制提交策略的系统

Backward Validation：读完再查有没有被别人改 → 改了就失败
Forward Validation：写之前看有没有人已经读了 → 冲突就选择怎么处理（退让 or 胜出）

Time-stamp-based Concurrency Control

特性	内容
相似性	和 OCC 一样，需要对事务进行冲突检测和验证（Validation）
主要区别	验证是持续的，在每个操作发生时就立即检查合法性
结果特性	一旦检测到冲突，事务立即被中止（Abort），没有延迟到提交阶段
核心机制	依赖时间戳（Timestamp）决定事务和操作的先后顺序，从而判断操作是否合法

TBCC 适合高度确定性的场景，如需要事务顺序完全一致。
OCC 更适合大多数一般性高并发系统，如 Web 应用等。

Comparison of Concurrency Control Schemes

特性	时间戳排序	两阶段加锁（2PL）	乐观并发控制（OCC）
控制机制	全局时间戳	加锁，遵循加锁阶段+解锁阶段	执行时无锁，提交前验证
死锁风险	无	有死锁风险	无死锁
冲突处理	立即中止	等待或依赖死锁检测	提交阶段统一中止
适合更新密集场景	不适合（中止频繁）	很适合（中止少）	不适合
适合读密集场景	一般	表现一般	非常适合
中止概率	高	低	中等至高
并发性能	中	中等	高（如果冲突少）
实现复杂度	中	高（死锁检测+锁管理）	中高

分布式系统中的事务一致性问题如何解决 Two-phase atomic commit protocol (2PC):

one-phase 一阶段提交协议忽略了确认与异常处理，在面对失败、冲突、或通信异常时无法保障原子性和一致性，不适合可靠的分布式数据库系统。
2PC is designed to allow any participant to choose to abort a transaction.
在多个分布式节点之间达成一致提交或中止事务的决定，确保事务的原子性（Atomicity） —— 要么所有节点都提交，要么全部回滚

阶段	协调者操作	参与者操作	要点说明
阶段 1（Prepare）	发送 Prepare 请求	检查本地事务状态：若能提交则记录日志并返回 Yes 否则返回 No	投票结果一旦为 Yes，参与者不能再改变决定
阶段 2（Commit/Abort）	根据投票结果广播 Commit 或 Abort	若收到 Commit → 提交事务若收到 Abort → 回滚事务	所有节点达成一致，事务原子性保障

具体操作

接口名称	发起者	接收者	作用说明
`canCommit(trans)`	协调者	参与者	协调者询问参与者是否可以提交事务。返回 `Yes` 或 `No`。
`doCommit(trans)`	协调者	参与者	协调者通知参与者提交事务。
`doAbort(trans)`	协调者	参与者	协调者通知参与者回滚事务。
`haveCommitted(trans, participant)`	参与者	协调者	参与者告知协调者其已经提交事务（用于清理协调者的状态信息）。
`getDecision(trans)` → Yes / No	参与者	协调者	参与者在投票 Yes 后长时间未收到回复时，请求协调者告知最终决策。用于恢复宕机或消息延迟。

canCommit 是第一阶段的核心接口：投票。
doCommit 和 doAbort 是第二阶段的决策通知。
haveCommitted 用于协助协调者垃圾回收旧事务状态。
getDecision 是异常恢复机制的一部分，防止参与者长时间卡在“已准备”状态。

What happens in failures…

失败场景	应对措施或动作
⏳ 协调者等待投票超时	协调者主动中止事务，向所有参与者发送 `doAbort` 消息以终止事务。
❓ 参与者已投票 Yes（不确定状态），但未收到协调者决策	参与者不能自行决定，必须使用 `getDecision` 向协调者查询事务结果。
📬 协调者崩溃或消息丢失，参与者完成操作后未收到决策	参与者可以单方面中止事务，避免无限期阻塞（这种情况依赖系统策略，可能限制单方面中止）。
📉 协调者崩溃导致不回复参与者	参与者通过超时机制调用 `getDecision` 或使用日志等待协调者恢复再获取决策。

超时机制（timeout actions）是 2PC 中防止“永远等待”的核心手段。
getDecision 方法是参与者用来从协调者那里获取最终决策（commit 或 abort）。
协调者有最终决策权，参与者只能在明确协调者失败或长期无响应时考虑本地中止。
2PC 因此被称为阻塞协议，因为投票 Yes 的参与者必须等待协调者做决定。

General Principle:

If transaction T is before transaction U in their conflicting access to objects at one of the servers then:
- They must be in that order at all of the servers whose objects are accessed in a conflicting manner by both T and U
- Plus, the objects remain locked locally until two-phase commit ends
每个服务器负责对其自己的对象应用并发控制
分布式事务的服务器集合成员共同负责确保事务按顺序执行
服务器独立行动会导致死锁

Time-stamp ordering

The coordinator accessed by a transaction issues a globally unique timestamp
The timestamp is passed with each object access
The servers are jointly responsible for ensuring serial equivalence:
- that is if T access an object before U, then T is before U at all objects

OCC

Validation takes place in phase 1 of 2PC protocol at each server
Transactions use a globally unique order for validation

Transactions with replicated data

The simple read one/write all scheme is not realistic
- because it cannot be carried out if some of the servers are unavailable which beats the purpose in many cases
The available copies replication scheme is designed to allow some servers to be temporarily unavailable

one-copy serializability

the effect of transactions on replicated objects should be the same as if they had been performed one at a time on a single set of objects

Available copies replication rule

Before a transaction commits, it checks for failures and recoveries of the RMs it has contacted, the set should not change during execution

Week 10

P(A) =

probability of a disk failing which we call event A, given a certain duration of time of use
As time t goes to infinity P(A) approaches 1…

Module availability

measures the ratio of service accomplishment to elapsed time
Mean time to failure / (Mean time to failure + mean time to repair), where Mean time to failure is the time elapsing before a failure is experienced

RAID

RAID (Redundant Array of Independent Disks) is a data storage virtualization technology that combines multiple physical disk drives into one or more logical units to enhance performance, reliability, and/or capacity. RAID technology achieves distributed storage and redundant backup of data across multiple disks through techniques such as striping, mirroring, and parity. The main goals of RAID include:

Improving performance: By distributing data across multiple disks, RAID can increase data read and write speeds, as multiple disks can operate simultaneously.
Enhancing reliability: Through data redundancy, RAID can protect data in the event of disk failure, preventing data loss.
Expanding capacity: By combining multiple disks into a single logical unit, RAID can provide greater storage capacity than a single disk.

RAID 级别	数据分布方式	校验信息	优点	缺点	可靠性（MTTF）说明
RAID 0	块级条带化	无	高 I/O 吞吐量，性能好	无冗余，任何磁盘故障都导致数据丢失	最低。整体系统 MTTF 显著降低，约为单块磁盘 MTTF 的 1/N（N 是磁盘数）。
RAID 1	镜像（Mirroring）	无	高可靠性，读取性能好（可并行读取）	存储空间利用率低（50%），成本高	很高。可容忍任意一块磁盘故障，整体 MTTF 明显高于单块磁盘。
RAID 2	位级条带化Bit level striping	按位校验	理论上高速（针对特定错误模式），高可靠性	实现复杂，成本极高，实际很少使用	高。具备纠错能力，整体 MTTF 提升，但因应用少不常见。
RAID 3	字节级条带化Byte level striping	专用校验盘（异或）	高传输速率，可容忍单盘故障	写入性能受校验盘瓶颈限制，读取有延迟	高。单盘故障可恢复，整体 MTTF 显著提升。
RAID 4	块级条带化Block level striping	专用校验盘（异或）	高吞吐量，可容忍单盘故障	写入性能差（校验盘瓶颈），读取性能好	高。与 RAID 3 类似，支持单盘恢复，MTTF 明显高于 RAID 0。
RAID 5	块级条带化	分布式校验块（异或）	高吞吐量，写入性能优于 RAID 4，可容忍单盘故障	写入性能仍有一定瓶颈，控制器复杂	较高。分布式校验避免单点瓶颈，单盘恢复，MTTF 较好。
RAID 6	块级条带化	分布式双校验块	可容忍双盘同时故障，可靠性最高	写入性能比 RAID 5 更差（需计算两个校验块）	非常高。可容忍两块盘同时故障，MTTF 极高，适用于关键数据场景。

RAID 0 专注于性能，牺牲了所有冗余。只要一盘坏，全部数据丢，因此整体 MTTF 大幅降低。
RAID 1 专注于可靠性和读取性能，牺牲了存储空间。镜像保护，只要一盘工作，数据仍可恢复，MTTF 提高。
RAID 2 是一种理论上的高性能方案，但实际应用很少。
RAID 3, 4, 5 在性能和可靠性之间提供了不同的平衡点，都提供单盘容错能力。RAID 5 通过分布式校验盘改善了 RAID 4 的写入瓶颈。
RAID 6 提供了最高的可靠性，能够容忍两块磁盘的故障，但写入性能相对较差。

Fault tolerance by voting

Failvote - Stops if there is no majority agreement.
Failfast - Similar to failvote except the system senses which modules are available and uses the majority of the available modules.
Failfast system has better availability than failvoting (since failvote stops when there is no majority agreement). | 特性 | Failvote | Failfast | | -------- | -------------------- | -------------------- | | 是否依赖总数 | 是（必须形成“总体多数”） | 否（只依赖“活着的”模块中形成多数） | | 容错能力 | 中（容忍不到半数失败） | 高（可容忍最多 N-1 个失败） | | 示例（10模块） | 最多容忍 4 个失败（第5个失败时停止） | 最多容忍 9 个失败（最后1个也能工作） | | 可用性 | 较低 | 更高 | | 风险 | 较少错误（需大多数同意） | 可能会依据小多数决策，风险略高 |

Compute availability of a failvote system

If there are n events, mean time to the first event = m/n, m is years of MTTF
Consider a system with modules each with MTTF of 10 years, Failvoting with 2 devices, MTTF = 10/2 = 5 years (system fails with 1 device failure). Failvoting with 3 devices, MTTF = 10/3 for the first failure + 10/2 for 2nd failure = 8.3 years
Lower availability for higher reliability (multiple modules agreeing on a value means that value is more likely to be accurate/reliable)

Probability of a particular module is not available

MTTR/(MTTF+MTTR) where MTTR is mean time to repair

Computing fault tolerance of a supermodule with repair

RAID系统MTTF计算公式 tut11例题

What is the mean time to failure value for different RAID systems (e.g., RAID 0 with 2 disks and RAID 2 with 2 disks, RAID 1 with 2 disks, RAID 1 with 3 disks, RAID 3 with 3 disks, RAID 6 with 5 disks)?
Let’s label the probability of one disk failure as p, and the mean time to failure of one individual disk is MTTF. P is between 0 and 1.
RAID 0 with 2 disks and RAID 2 with 2 disks – The system fails if one of the disks fails. The probability that one of the two disks fails (disk A or disk B) is p + p = 2p. So, as failure probability doubles mean time to failure is halved = 1/2 × MTTF.
RAID 1 with 2 disks - The system fails if both of the disks fail at the same time. The probability that both disks fail (disk A and disk B) is p ∗ p = p^2 as p is between 0 and 1, mean time to failure of the system will increase to MTTF^2.
RAID 1 with 3 disks - The system fails if three of the disks fail at the same time. The probability that all disks fail (disk A and disk B and disk C) is p ∗ p ∗ p = p^3. So, the mean time to failure of the system is accordingly MTTF^3.
RAID 3 with 3 disks- The system fails if 2 of the 3 disks fail at the same time. The probability that 2 disks fail is p ∗ p = p^2. There are 3 different possible combinations of 2 disk failures (A, B; or A, C; or B, C), so the probability that any of the 2 disks out of these 3 disks fail is 3p^2. The mean time to failure of the system is 1/3 × MTTF^2.
RAID 6 with 5 disks - The system fails if 3 out of the 5 disks fail at the same time. The probability that 3 disks fail is p ∗ p ∗ p = p^3. There are 10 different possible combinations of 3 disks failures out of 5 disks (A,B,C; or A,B,D; or A,B,E; A,C,D; or A,C,E; or A,D,E; or B,C,D; or B,C,E; or B,D,E; or C,D,E), so the probability that any of the 3 disks out of these 5 disks fail is 10p^3. Mean time to failure of the system is then 1/10 × MTTF^3.

RAID空间利用率公式

RAID级别	公式	说明
RAID 0	`n/n = 100%`	n = 总盘数，无冗余
RAID 1	`1/n × 100%`	n = 总盘数，镜像冗余
RAID 3/4/5	`(n-1)/n × 100%`	n = 总盘数，1个奇偶校验盘
RAID 6	`(n-2)/n × 100%`	n = 总盘数，2个奇偶校验盘
RAID 10	`50%`	无论多少盘，固定50%

RAID 1的通用公式

RAID 1空间利用率 = 1/n × 100%

2盘RAID 1: 1/2 = 50%
3盘RAID 1: 1/3 = 33.3%
4盘RAID 1: 1/4 = 25%

RAID 3/5的通用公式

RAID 3/5空间利用率 = (n-1)/n × 100%

3盘RAID 3: (3-1)/3 = 66.7%
4盘RAID 5: (4-1)/4 = 75%
5盘RAID 5: (5-1)/5 = 80%

RAID 6的通用公式

RAID 6空间利用率 = (n-2)/n × 100%

4盘RAID 6: (4-2)/4 = 50%
5盘RAID 6: (5-2)/5 = 60%
6盘RAID 6: (6-2)/6 = 66.7%

Recovery

Recovery algorithms have two parts

Actions taken during normal transaction processing to ensure enough information exists to recover from failures
Actions taken after a failure to recover the database contents to a state that ensures atomicity, consistency and durability

Store data in DBMS

存储类型	描述	示例	崩溃后是否保留数据	用途
易失性存储（Volatile）	存储在断电或系统崩溃后会丢失的数据	主内存（RAM）、缓存	❌ 不保留	临时数据处理，如缓冲池
非易失性存储（Nonvolatile）	在大多数故障中可以幸存的数据存储设备，但不能完全依赖于其进行故障恢复	磁盘、磁带、闪存	✅ 一般保留	持久存储，如日志和数据文件
稳定存储（Stable）	理论上在所有硬件或软件故障中都能保留数据的存储（通过复制和写入协议实现）	使用双磁盘块+复制	✅ 模拟“绝对可靠”	用于日志和事务恢复的关键数据

Data Access

Physical blocks are those blocks residing on the disk.
Buffer blocks are the blocks residing temporarily in main memory.
Block movements between disk and main memory
- Input(B) transfers the physical block B to main memory.
- Output(B) transfers the buffer block B to the disk, and replaces the data there.
Transaction has private work-area, 保存了事务访问或更新过的所有数据项的本地副本（local copy）
Transactions transfer data between system buffer blocks and private work-area
- read(X) assigns the value of data item X to the local variable xi
- write(X) assigns the value of local variable xi to data item {X} in the buffer block
  - 初始的读/写操作是在内存层次结构memory hierarchy内进行的。
  - 第一次访问数据项 X 时执行 read(X)，后续访问都是对本地副本的操作。
  - 最后一次访问后，事务可以执行 write(X)。
  - output(B_x) 不一定立即跟随 write(X) 执行，系统可以在稍后执行输出操作。

Cyclic Redundancy Check

DBMSs borrow this method from networks to increase durability on disks
It further complicates matters but increases durability
It is a method, similar to parity computations we saw before, to catch problems in written data on disks
Most errors in communications or in our case on disk happen contiguously, i.e. is in bursts
CRC can be used to detect burst errors on disks with many bits changing
In fact the parity bit we saw can be seen as a 1 bit CRC computation
CRC/parity can be used even when we do not use a RAID design on disks

log-based recovery

Deferred database modification

The deferred database modification scheme records all modifications to the log, but defers all the writes to after partial commit.
在崩溃恢复期间，只有当日志中同时存在 <T_i start> 和 <T_i commit> 时，才需要对事务 T_i 进行重做。
redo T_i会将事务更新过的所有数据项的值设置为新的值, 重做必须按顺序

Immediate database modification

Allows database updates of an uncommitted transaction to be made as the writes are issued
Since undoing may be needed, update logs must have both old value and new value
Update log record must be written before database item is written
- undo(Ti) restores the value of all data items updated by Ti to their old values, going backwards from the last log record for Ti
- redo(Ti) sets the value of all data items updated by Ti to the new values, going forward from the first log record for Ti
- both must be idempotent
- if log contains <T_i start> and <T_i commit> then redo(Ti) must be executed
- if log contains <T_i start> but no <T_i commit> then undo(Ti) must be executed
- undo(Ti) must be executed before redo(Ti)

Feature	Deferred Database Modification	Immediate Database Modification
Change Application Time	Batch applied at transaction commit	Applied immediately during transaction
I/O Operations	Reduced	Increased
Real-time Performance	Lower	Higher
Commit Time	Potentially longer	Shorter
Recovery Process	Simple, discard uncommitted transactions	Complex, requires rollback of incomplete transactions
Transaction Log	Larger	Relatively smaller
Applicable Scenarios	High consistency requirements, can tolerate longer commit times	High real-time requirements, need immediate reflection of changes
Advantages	Increases efficiency, simplifies recovery	Enhances real-time performance, reduces commit time
Disadvantages	Commit time may be longer, larger transaction log	Complex recovery process

The redo log records the physical changes of data pages, used for recovery after a service crash to ensure transaction durability. The undo log records logical logs, used to restore original data during transaction rollback, ensuring transaction atomicity and consistency.

Problems in recovery procedure:

Searching the entire log is time-consuming.
We might unnecessarily redo transactions that have already output their updates to the database.
To solve this, use checkpoint:
- Output all log records currently residing in main memory onto stable storage
- Output all modified buffer blocks to the disk
- Write a log record < checkpoint> onto stable storage

Checkpointing

Scan backwards from the end of the log to find the most recent <checkpoint> record.
Continue scanning backwards until <Ti start> is found.
Only consider the part of the log following the start record. The earlier part of the log can be ignored during recovery and can be erased whenever desired.
For all transactions (starting from Ti or later) with no <Ti commit>, execute undo(Ti). (Done only in the case of immediate modification.)
Scanning forward in the log, for all transactions starting from Ti or later with a <Ti commit>, execute redo(Ti).

Recovery with Concurrent Transactions

Allow concurrent transactions: The system design allows multiple transactions to execute simultaneously.
Single disk buffer and log: All transactions share the same disk buffer and a single log file.
Buffer block updates: A buffer block can be updated by one or more transactions.
Strict two-phase locking: Use strict two-phase locking to ensure that updates of uncommitted transactions are not visible to other transactions.
Checkpointing technique changes: Due to the possibility of multiple active transactions during checkpointing, adjustments to the checkpointing technique are required.

Specific recovery procedure

Initialize undo-list and redo-list to empty
Scan the log backwards from the end, stopping when the first
- <checkpoint L> record is found
- For each record found during the backward scan:
  - if the record is <Ti commit>, add Ti to redo-list
  - if the record is <Ti start>, then if Ti is not in redo-list, add Ti to undo-list
  - For every Ti in L, if Ti is not in redo-list, add Ti to undo-list
At this point undo-list consists of incomplete transactions which must be undone, and redo-list consists of finished transactions that must be redone.
Recovery now continues as follows:
- Scan log backwards from most recent record, stopping when <Ti start> records have been encountered for every Ti in undo-list.
- During the scan, perform undo for each log record that belongs to a transaction in undo-list
- Locate the most recent <checkpoint L> record
- Scan log forwards from the <checkpoint L> record till the end of the log
- During the scan, perform redo for each log record that belongs to a transaction on redo-list

Log Record Buffering

Log Record Buffering:
- Log records are buffered in main memory.
- Log records can be output with a single output operation.
- Log records are output to stable storage when a block of log records in the buffer is full, or a log force operation is executed.
Log Force:
- Perform log force operations to commit transactions by forcing all log records (including commit records) to be written to stable storage.

Rules for log record buffering

Log records are output to stable storage in the order in which they are created.
Transaction Ti enters the commit state only when the log record <Ti commit> has been output to stable storage.
write-ahead logging or WAL rule: Before a block of data in main memory is output to the database, all log records pertaining to data in that block must have been output to stable storage.

Issues with Database Buffering

Removing existing blocks when the buffer is full:
- When new data blocks are needed and the buffer is full, existing data blocks must be removed from the buffer.
Write-back strategy for updated blocks:
- If the selected block has been updated, the write operation must be completed before outputting it to disk.
Handling uncommitted updates:
- If a block with uncommitted updates is output to disk, log records containing undo information must first be output to stable log storage (Write-Ahead Logging, WAL).
Avoid ongoing updates:
- There should be no ongoing update operations when outputting data blocks to disk.
Use of locks:
- Before writing a data item, a transaction acquires an exclusive lock on the block containing the data item to ensure no interference from other transactions.
Locks held for a short time are called latches:
- This type of lock is typically used to control short-term access to shared resources.

Buffer Management Complications

数据库缓冲可以有两种实现方式：
- 在真实主存：为数据库预留一部分真实主存区域。
- 在虚拟内存：使用虚拟内存作为数据库缓冲。
使用预留主存的缺点
- 灵活性受限：需要在数据库缓冲与应用程序之间提前划分内存，限制了灵活性。
- 需求变化：需求可能会发生变化，尽管操作系统知道如何最佳分配内存，但它无法改变内存分区。

Buffer Management

Page replacement: When the operating system needs to replace a modified page, the page is written to the swap space on disk.
Double paging problem: When the database decides to write buffered pages to disk, the buffered pages may be in swap space, leading to additional I/O operations.
Ideal handling process:
- The operating system should hand over control to the database when it needs to evict pages from the buffer.
- The database should first output log records and then output pages to the database instead of swap space.
- Release pages for use by the operating system.
Avoid double paging: The above methods can avoid double paging, but some operating systems do not support this feature.

Logical Logging

Limitations of physical undo: Operations like B⁺ tree insertions and deletions cannot be undone by restoring old values (physical undo) because once locks are released, other transactions may have updated the B⁺ tree.
Method of logical undo: Instead, insertions (or deletions) are undone by performing delete (or insert) operations (called logical undo).

Remote Backup Systems

provide high availability by allowing transaction processing to continue even if the primary site is destroyed

Part 1: Failure Detection and Control Transfer

Failure Detection
- The backup site must detect the failure of the primary site.
- To distinguish between primary site failure and link failure, multiple communication links must be maintained between the primary site and the remote backup.
- Use heartbeat messages for detection.
How to Transfer Control
- To take over control, the backup site first uses its database copy and all log records received from the primary site for recovery.
- Completed transactions will be re-executed, and incomplete transactions will be rolled back.
- When the backup site takes over processing, it becomes the new primary site.

Part 2: Recovery Time and Hot Standby Configuration

Recovery Time
- To reduce takeover delay, the backup site regularly processes redo log records.
- Perform recovery from the previous database state, execute checkpoints, and then delete the earlier part of the log.
Hot Standby Configuration
- The backup continuously processes arriving redo log records and applies updates locally.
- When a primary site failure is detected, the backup rolls back incomplete transactions and prepares to process new transactions.

Part 3: Durability Levels

Ensure the durability of updates by delaying transaction commits until updates are recorded in the backup. However, this delay can be avoided by allowing lower levels of durability.
One-safe
- Commit once the transaction's commit log record is written to the primary site.
- Problem: Updates may not reach the backup before it takes over.
Two-very-safe
- Commit only when the transaction's commit log record is written to both the primary site and the backup.
- Reduces availability because if either site fails, the transaction cannot commit.
Two-safe
- If both the primary site and backup are active, proceed as two-very-safe.
- If only the primary site is active, commit immediately when the transaction's commit log record is written to the primary site.
- Offers better availability than two-very-safe; avoids the issue of lost transactions in one-safe.

模式	何时提交事务	优点	缺点
One-safe	提交日志写入主站即可提交	高可用，快	有事务丢失风险：日志可能尚未到达备份站就宕机
Two-very-safe	日志写入主站和备站才允许提交	非常安全	可用性差，任何一边挂了都无法提交事务
Two-safe	两边都活着：two-very-safe；仅主站活：one-safe	折中方案，兼顾安全与可用性	小概率仍会丢失事务（主站挂、备站未及时同步）

Shadow Paging

是一种简单但效率不高的恢复方法，适合小规模、低并发的系统。虽然恢复快、不需要日志，但写入开销大、不适合现代高并发数据库系统。不用日志，不回滚，只切换指针。

Commit Process:

Flush all modified pages in main memory to disk.
Output the current page table to disk.
Make the current page table the new shadow page table:
- Maintain a pointer to the shadow page table at a fixed location on disk.
- Update the pointer to point to the current page table. Once the pointer to the shadow page table is written, the transaction is committed.

Recovery after Crash:

No recovery is needed—new transactions can start immediately using the shadow page table.

Page Reclamation:

Pages not pointed to by the current/shadow page table should be released (garbage collection).

Advantages of Shadow Paging:

No overhead of writing log records.
Simple recovery.

Disadvantages of Shadow Paging:

Copying the entire page table is expensive when the page table is large—can be reduced by using a B+ tree-like page table structure.
Even with the above extension, commit overhead is still high—requires flushing every updated page and page table.
Data becomes fragmented (related pages are separated on disk).
After each transaction completes, database pages containing old versions of modified data need garbage collection.
Difficult to extend the algorithm to allow concurrent transactions.

Data Warehousing

Data sources基本存的都是current data没historical，Corporate decision making需要historical，所以需要data warehouse
repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site
Greatly simplifies querying, permits study of historical trends
Shifts decision support query load away from transaction processing systems

Design Issues

Source driven architecture: data sources transmit new information to warehouse, either continuously or periodically (e.g. at night)
Destination driven architecture: warehouse periodically requests new information from data sources
Keeping warehouse exactly synchronized with data sources (e.g. using two-phase commit) is too expensive
- Usually OK to have slightly out-of-date data at warehouse
- Data/updates are periodically downloaded form online transaction processing (OLTP) systems (most of the DBMS work we have seen so far)
Schema (what schema to use)
Data cleaning
- Merge address lists from different sources and purge duplicates
How to propagate updates
What data to summarize
- raw data is too large
- Aggregate values (totals/subtotals) often enough

OLAP: Online Analytical Processing

Interactive analysis of data
Multidimensional data
- Measture attributes e.g. sales, profit, etc.
- Dimension attributes e.g. time, location, product, etc.

Cross-tabulation (pivot table)

A table that summarizes data by combining different dimensions

Data Cube

A data cube is a multidimensional generalization of a cross-tab

OLAP Operations

Pivoting旋转: changing the dimensions used in a cross-tab
Slicing切片: creating a cross-tab for fixed values only
- Sometimes called dicing切块, particularly when values for multiple dimensions are fixed
Rollup上卷: moving from finer-granularity data to a coarser granularity
Drill down下钻: The opposite operation - that of moving from coarser-granularity data to finer-granularity data

OLAP implementation

The earliest OLAP systems used multidimensional arrays in memory to store data cubes, and are referred to as multidimensional OLAP (MOLAP) systems
OLAP implementations using only relational database features using rows and columns and no precalculated data cubes called relational OLAP (ROLAP) systems
Hybrid systems, which store some summaries in memory and store the base data and other summaries in a relational database, are called hybrid OLAP (HOLAP) systems
Early OLAP systems precomputed all possible aggregates in order to provide online response, Space and time requirements for doing so can be very high, Several optimizations available for computing multiple aggregates

Information Retrieval (IR) 比如Elasticsearch

特性	信息检索系统（IR）	数据库系统（DBMS）
数据结构	非结构化文本（文档）	结构化数据（表格、关系）
查询方式	基于关键词/相关性	基于 SQL、精确匹配
典型应用	搜索引擎、文档检索	OLTP、财务系统、库存管理
Schema	无需 schema	依赖 schema
响应速度	更快（针对检索任务）	更慢（复杂事务、多表联结）

Differences from database systems:
- IR systems don’t deal with transactional updates (including concurrency control and recovery)
- Database systems deal with structured data, with schemas that define the data organization.
- IR systems deal with some querying issues not generally addressed by database systems.
  - Approximate searching by keywords.
  - Ranking of retrieved answers by estimated degree of relevance.
- 使用了inverted index倒排索引，比如说ES。An inverted index is a data structure that maps keywords to the list of documents that contain them, enabling fast full-text search.

In Memory Databases

优点

高性能硬件与并行性：利用现代高性能CPU和多核并行处理，提高事务处理速度。
减少磁盘瓶颈：大幅增加缓冲区大小，减少对磁盘I/O的依赖，提升响应速度。
适用实时场景：尤其适合实时数据分析和在线事务处理（OLTP）场景。

挑战

磁盘I/O仍是瓶颈：尽管缓冲区大，但日志写入和持久化仍需依赖磁盘，限制整体性能。
数据冲突和并行性限制：多个事务并发访问相同数据会产生冲突，影响并行执行效率。
日志记录瓶颈：高频率事务导致日志写入成为性能瓶颈。
主存易失性：系统崩溃时，内存中数据全部丢失，恢复机制复杂。

商业64位系统支持

大容量主存支持：支持数十GB甚至更多内存容量，适合大规模内存数据库部署。
内存中数据处理：Memory resident data allows faster processing of transactions

优化机会

To reduce space overheads use different data structures, i.e. do not use indices built for disks
No need to pin buffer pages in memory before data are accessed, since buffer pages will never be replaced
Design query-processing techniques to minimize space overhead - avoid exceeding main memory limits during query evaluation
Improve implementation of operations such as locking and latching, so they do not become bottlenecks
Optimize recovery algorithms, since pages rarely need to be written out to make space for other pages

Week 12

NoSQL

类型	数据模型特点	典型用途	查询灵活性	关系处理能力	存储结构
Key-Value	简单的键-值对存储	缓存、会话管理、简单配置存储	低（基于键直接访问）	无	键映射到任意数据
Document	存储 JSON、XML 等半结构化文档	内容管理、用户资料、日志	高（支持复杂查询）	轻度关联（嵌套文档）	文档（如 JSON）
Graph	节点和边，适合复杂关系建模	社交网络、推荐系统、网络拓扑	高（基于关系查询）	强（复杂连接和路径）	图结构（节点+边）
Column	类似表格，但列是按族分组且可变长	大数据分析、时序数据	中（基于列族查询）	低（不强调复杂关联）	列族存储（如HBase）

Key-Value：最快最简单，适合快速读写。
Document：灵活，适合半结构化数据和复杂查询。
Graph：最擅长处理复杂关系和连接查询。
Column：适合大规模数据分析，列式存储优化了扫描性能。

CAP Theorem

Consistency: every node always sees the same data at any given instance (i.e., strict consistency)
Availability: the system continues to operate, even if nodes crash, or some hardware or software parts are down due to upgrades
Partition Tolerance: the system continues to operate in the presence of network partitions
CAP theorem: any distributed database with shared data, can have at most two of the three desirable properties, C, A or P. 在网络分区的情况下要么保证一致性，要么保证可用性

特性	ACID	BASE
全称	Atomicity, Consistency, Isolation, Durability	Basically Available, Soft state, Eventual consistency
设计目标	强调严格的数据一致性和事务完整性	强调系统的高可用性和可扩展性
一致性模型	强一致性（Strong Consistency）	最终一致性（Eventual Consistency）
事务处理	事务是原子操作，要么全部成功，要么全部失败	不保证每次操作立即一致，允许临时不一致
适用场景	传统关系型数据库，银行、电商、财务等需要严格一致性的系统	分布式系统、NoSQL数据库、大规模Web应用等需要高可用性的场景
性能和可用性	牺牲一定性能和可用性以保证数据严格一致	牺牲一定一致性以换取高性能和高可用性
隔离级别	多种隔离级别，保证并发事务的隔离	允许一定程度的状态不一致（软状态）
状态管理	数据状态必须严格符合预期	状态可能是“软状态”，会逐渐收敛到一致

ACID 更适合对数据正确性和一致性要求极高的场景，比如银行转账。
BASE 更适合对系统响应速度和可用性要求高、可以容忍短暂不一致的场景，比如社交媒体、搜索引擎。

一致性类型	说明	示例/应用场景
强一致性	所有读操作都能看到最近一次写操作的结果	分布式数据库的严格事务要求，ACID模型
弱一致性	不保证写操作后立即可读；可能看到旧数据	高吞吐系统，如缓存或某些NoSQL数据库
最终一致性	如果不再有新更新，所有副本最终都会收敛到一致的状态	DNS系统、CDN分发网络

Eventual consistency 变体

模型名称	保证内容	举例
因果Causal一致性	如果一个操作因另一个操作而发生（存在因果关系），它们的顺序会被保留	社交网络（例如：评论必须在帖子之后显示）
读-写Read-your-write一致性	用户在写入后总能读到自己的更新，不会读到旧值	用户设置变更后再次查看设定值
会话Session一致性	单个会话中，系统保证读-写一致性	移动App用户会话
单调读Monotonic read一致性	一旦读取了某个值，就不会看到更旧的值	时间轴/日志浏览
单调写Monotonic write一致性	写操作会按照发起顺序被应用	用户评论依次出现

A number of these properties can be combined
Monotonic reads and read-your-writes are most desirable