Database Management System (DBMS)

Encoding
Data Model
Storage Management
Index
Execution
Query Optimization
Transaction

Resources:

MIT 6.830; another version
SimpleDB, recommended
Database Management Systems, by Johannes Gehrke and Raghu Ramakrishnan, recommended
UW CSE444, recommended
Designing Data-Intensive Applications [recommended]

Database Management System (DBMS) Implementations: SQLite, PostgreSQL, MySQL, SQLite, Oracle, Microsoft, etc.

Categories:

online transaction processing (OLTP): small number of records per query
online analytics processing (OLAP): aggregate over large number of records
- Data warehouse:
  - works without affecting OLTP operations
  - OLTP system -> Extract-Transform-Load (ETL) -> data warehouse for OLAP operations

Encoding

evolution:

backward compatibility
- keep the new field intact
forward compatability

how to store data for un/marshaling, de/serialization, encode/decode:

Textual:
- JSON, XML
- CSV
Binary
- ASN.1 (deprecated)
- Apache Thrift, Google Protobuf
- Apache Avro

models:

Database
Service call
- service-oriented architecture (SOA) / micro services architecture: a server itself is a clientt to another service.
- REST v.s. SOAP (SOAP: an XML-based protocol described in Web Services Description Language (WSDL). It aims to be indenpendent from HTTP thought it is often over HTTP)
- RPC: gRPC (based on Protobuf), Rest.li (JSON over HTTP), Finagle (based on Thrift), RESTful API
Asynchronous message passing
- distributed actor framework

Data Model

Relation model (SQL): enforce a schema
Document database: one-to-many (NoSQL): self-contained document; relationships between documents are rare.
Graph model: many-to-many (NoSQL): anything is potentially related to everything

data structures:

LSM-tree (log-strutured merge-tree)
- log structure
- SSTable (sorted string table)
- an in-memory balanced tree data structure (memtable)
B-tree
- crash recovery
  - WAL (write-ahead log): an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself.
  - CoW (copy on write)
- latch (lightweight lock)

in-memory database:

faster:
- not because they dont need to read from the disk
- It is because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk
provide data models that are difficult to implement with disk-based indexes
Further changes: NVM (non-volatile memory)

Storage Management

Record ID (RID) typically is (PageID, SlotNumber). See future discussion on indexes and transactions.

LOB (large objects):

binary large object (BLOB)
character large object (CLOB)

Key questions:

How do we organize pages into a file (so that we can find free space efficiently) ?
- Sequence of pages (impl in simpleDB)
- Listed list of pages
- Directory of pages. Directory contains free-space count for each page. Faster inserts for variable-length records
How do we organize data within a page?

Buffer Manager: Brings pages in from memory and caches them

Column store v.s. Row store

Row-store storage managers are most commonly used today for OLTP systems. They offer high-performance for transactions.
Column-stores win for analytical workloads and are widely used in OLAP.

Index

RID: unique tuple identifier (e.g. (PageID, SlotID) )

search key: can be any set of fields (not the same as the primary key!)

data entry for key k can be:

(k, RID)
(k, list-of-RIDs)
the actual record with key k (also called indexed file organization)

index: collection of data entries (now we have two files: data file and index file)

dense or sparse
- dense: 1 2 3 4
- sparse: 10 20 30 40
primary or secondary
- primary: determines the location of indexed records
clustered or unclustered
- clustered (covering index): store the row data within the index
- unclustered (index with included column)

heap file: the value is a referece to the row stored elsewhere

Large indexes: index the index itself

hash-based: not good for range queries
tree-based:
- B tree: 1 node = 1 page
- B+ tree:
  - keep tree balanced in height
  - make leaves into a linked list to facilitate range queries

B+ tree (default index structure on most DBMSs):

parameter d
each node (except root) has d ≤ m ≤ 2d keys and m+1 pointers
each leaf has d ≤ m ≤ 2d keys and pointers to
- the data records
- the next leaf (for range queries)
usually: 2d x key-size + (2d+1) x pointer-size ≤ block-size

Multi-column indexes:

latitude, longtitude (R-tree)
R, G, B

Execution

SQL query is first transformed into physical plan. Execution of physical plan is pull-based.

Each operator pre-allocates heap space for I/O tuples and its internal state. DBMS limits how much memory each operator, or each query can use.

join operator(R S):

Hash join: one-pass when the memory can hold the relation
nested loop join; and (reduce the number of swap-in/out of pages) B(R) + T(R)B(S)
- page-at-a-time refinement B(R) + B(R)B(S)
- block-nested-loop refinement: B(R) + B(R)B(S)/(M-1) (M is the memory size)
sort-merge join
index-based nested loop join
- If index on S is clustered: B(R) + T(R)B(S)/V(S,a) B: sizeof R, T: #tuples, V: distinct values
- If index on S is unclustered: B(R) + T(R)T(S)/V(S,a)
  - Don’t build unclustered indexes when V(R,a) is small! (T » B)

What if the data does not fit in memory: two-pass algorithm

sorting
- a run: an increasing subsequence
- merge and sort run ~3B(R)
join (R S):
- runs of R; runs of S
- join while merging
grace-join (hashing)
- use hash to split R and S into k buckets (k <= M)
  - expected bucket size = B(R) / k <= M
  - When a bucket fills up, flush it to disk
  - using another hash function to join each pair of buckets
- join every pair of buckets
Hybrid join
- still into k buckets, but the first t buckets stay in the memory, while k-t buckets to disk (k <= M)
- we first join t buckets immediately and then join the k-t pairs of buckets
- t / k * B(s) <= M
- adjust t dynamically
  - start with t = k
  - when run out of memory, send one bucket to disk, t -= 1
- time: 3B(R) + 3B(S)

Query Optimization

Collect statistical summaries of stored data
Estimate size in a bottom-up fashion (most difficult)
- T(R)
- V(R, a)
- B(R)
- min, max
- histograms
- Collection approach: periodic, using sampling
Estimate cost by using the estimated size

Selectivity Factor: Each condition cond reduces the size by some factor called selectivity factor.

Q = SELECT * 
    FROM R, S, T
    WHERE R.B=S.B and S.C=T.C and R.A<40

T(Q) = T(R) x T(S) x T(T) x Selectivity(R.B = S.B) x Selectivity(S.C=T.C) x Selectivity(R.A<40)

Selectivity Factors for Conditions

A = c: 1 / V(R, A)
A < c: (c - Low(R, A)) / (high(R, A) - Low(R, A))
A = B: 1 / max(V(R, A), V(S, A))
- T(R⨝S) = T(R) x T(S) / max(V(R, A), V(S, A))

Histograms: make size estimation much more accurate

eq-width: 0-10: 1000, 10-20: 500, 20-30: 5000
eq-depth: 0-33: 2000, 33 - 47: 2000, 47 - 100: 2000
v-optimal (adopted by modern DBMS): Defines bucket boundaries in an optimal way, to minimize the error over all point queries

Transaction

Simplify the programming model:

the database takes case of the safety guarantees
Transitions are antithesis of scalability. (Hyperbole statement)
ACID: a marketing term because it’s unclear what guarantees you can actually expect. It depends on the implementation.
- Atomicity / Abortability: It does not mean concurerency, but means that multiple writes are done atomically.
- Consistency: the meaning depends on the application’s notion of invariants.
- Isolation / Serializability: concurrent transactions are isolated from each other. Each transaction appears to be committed serially. It incurs high performance penalty!
  - Weak Isolation: protect against some but not all concurrency issues (common practice)
- Durability: data won’t be lost. No absolute durability.

Weak isolations:

read committed: default setting for many databases
- no dirty reads: can only see the data that has been committed. Implemented by remembering both the old and new values, or by a row-level lock (slower)
- no dirty writes: can only overwrite data that has been committed. Implemented by a row-level lock
- NOTE: read committed does not preven race condition between two counter increments
  - A: read 10; B: read 10; A: write 11; B: write 11
  - expected: 12, output 11
Snapshot Isolation
- motivation: nonrepeatable read or read skew
- Design principle: readers never block writers and vice versa
  - Multi-version concurrency control (MVCC). (no dirty reads can be implemented by two-verison concurrency control.)
- Name confusion: also imprecisely called “serializable” in Oracle and “repeatable read” in PostgreSQL and MySQL
- There are big differences in the guarantees that databases actually provide, depite being ostensibly standardized.
Preventing Lost Updates