https://dl.acm.org/doi/book/10.5555/77708
The “relational model” is a data modeling approach that captures all data as rows and columns of (special) tables: relations. The model was first introduced by Edgar Codd in 1969, and he championed it throughout the rest of his life - often battling against its misinterpretation and buzzwording. Out of that struggle came this book, published in 1989, as a compiled “this is why you’re wrong”.
The book is pretty dry, as befits a semi-technical book from the ’80s written by a mathematician. A lot of it is recognizable if you have any background in SQL, data analysis, or databases - and it’s still a fascinating look at the opinions and theories of a giant in the database field. Also, although it was written over 30 years ago, some of the complaints might as well have been written today.
- right off the bat, Codd hits us with a legendary roast:
“The slow acceptance of [AI] programs has been largely due to the use of incredibly complex data structures in that field. This supports the contention that AI researchers write their programs solely for other AI researchers to comprehend.”
- relation != table: relations may not have duplicate rows
- each row in a relation is an assertion: this tuple is a fact
- no positional concepts: the relation has no inherent ordering
- all values are atomic (smallest representable form)
- how does this affect JSON fields?
- why duplicate rows aren’t allowed:
- with concurrent users, a duplicate row would need a single agreed-upon meaning that applies across all relations and all users, and that can’t be guaranteed
- more on that later
- DBMS != DB; DB ∈ DBMS
- DBMS includes UX, operations, administration as well as data structure and access
- Read/Write/Update/Delete ops for each data type -> 4N total operational commands
- data relates by comparison of values
- RM (Relational Model) is intended as an abstraction - implementation is flexible, but Codd holds that it should possess the (many, many) enumerated properties
- not a fan of SQL: duplicate rows, many missing features
- transactions are required
- interesting, mentions the concept of WAL - did they exist at the time?
- DBMS must be able to manage databases entirely through relational capabilities
- Views must be supported (multiple chapters on this)
- 3- or 4-valued logic
- “the future lies with specialized sublanguages that can intercommunicate with one another”
- sort of true today, I suppose
- PK
- uniquely identifies the tuple
- each relation must have a PK
- may be constructed from multiple columns
- FK indicates the PK of another table, and all the values in FK must exist in that PK
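- a minimal SQL sketch of the PK/FK rules above (hypothetical schema; the RM predates SQL, so this is an analogue, not Codd’s notation):
      -- composite PK; every FK value must exist as a PK value in the referenced relation
      CREATE TABLE orders (
          store_id integer,
          order_no integer,
          PRIMARY KEY (store_id, order_no)           -- PK constructed from multiple columns
      );
      CREATE TABLE order_lines (
          store_id integer,
          order_no integer,
          line_no  integer,
          PRIMARY KEY (store_id, order_no, line_no),
          FOREIGN KEY (store_id, order_no)
              REFERENCES orders (store_id, order_no) -- FK points at the PK of another table
      );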
Ch. 1
- databases are collections of assertions: facts about the world
- in RM, all information only represented as data values in relations
- repeating groups are discouraged - normalization to remove these where possible
- prior data models handled repeating groups by positioning tuples in those groups close to each other
- this violates non-positionality of RM
- data catalog is necessary, because everything must be accessible from same relational UX
- no duplicate rows allowed
- physical data independence and transparency to users
- DB operation and user interaction should never be affected as a result of a change in the physical representation of the data
- “three-schema architecture”: storage representation, base relations, views
- better known as levels of abstraction, today: physical, logical, external
- domains as extended data types (see the SQL sketch at the end of this chapter’s notes)
- enforce same-type comparisons during joins and whatnot
- integrity constraints on the type
- column descriptions: column declarations with name, type, constraints
- like:
num_hamburgers int check(num_hamburgers > 0)
- PK
- required for each relation
- unique
- no missing values allowed
- composite OK
- Views can support PK
- or at a minimum a “weak identifier”: column(s) that uniquely ID a row, but aren’t defined as PK
- FK
- must reference PK in another relation
- no automatic FK-PK pairing done by the DBMS - must be hand-defined by DBA
- composite domains
- domains but built out of multiple columns
- order of definition matters: (colA, colB) is different from (colB, colA)
- composite columns
- requires a pre-defined composite domain
- must be within a single relation (columns can’t be from different relations)
- may themselves consist of other composite columns
- missing values
- replacement values are not allowed: no “” (empty string) as a stand-in for NULL, for example
- DBA must be able to declare if a column allows missing values or not
- universal relation not allowed
- a single relation built by combining all relations into one gigantic super-relation
- this is a weird idea. Codd makes more arguments against it later.
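- back to domains and column descriptions: a minimal PostgreSQL sketch (hypothetical names; CREATE DOMAIN is the closest modern analogue to Codd’s extended data types):
      -- the domain centralizes type + range + constraint...
      CREATE DOMAIN hamburger_count AS integer
          CHECK (VALUE > 0);
      -- ...and columns draw their values from it
      CREATE TABLE orders (
          order_id       integer PRIMARY KEY,
          num_hamburgers hamburger_count NOT NULL  -- DBA declares whether missing values are allowed
      );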
Domains as Extended Data Types
- Codd treats this as fundamental to the RM
- domain != column; column values ∈ domain
- named, are objects, declarable, have defined range, have defined comparability (<, >)
- “domains are the glue that holds a relational database together”
- which relations use which domains make it clear if groups of relations can be separated into a different database
- enforcing use of domains makes it clear when a comparison is valid
- makes explicit how relations are connected
- domain definition is more consistent and efficient than storing them in application logic
- also centralizes constraints
- “domain-constrained operators” for type safety/constraint enforcement
- also during joins and comparisons
- “essential that the DBMS + RL contain a command that is capable of referring to all columns currently drawing their values from a specified domain without the user having to list those columns”
- FAO_AV: Find All Occurrences of Active Values - search the whole DB for that value, using the domain, return all places it shows up
- FAO_LIST: Find All Occurrences in a List of values - search the whole DB for the listed values, using the domain, return all places where they show up (see the catalog-query sketch after this list)
- domains must support integrity checks
- domains participate in PK, FK, </>, unionability, integrity, inclusion checks
- unioning may be more inferrable with domains
- not sure on this, maybe in some cases but it seems a fringe benefit
- non-trivial unions where a domain may be used by multiple columns won’t see a difference
- indexes over the domains - largely for performance
- DBMS must also support some native data types and functions in wide use
- dates, decimal currency
- UDFs for special date handling
- day(), month(), year(), etc. functions
- clock time functions like hour(), minute()
- datetime types and functions
- timezones + conversions
- non-negative decimal currency and free-decimal currency
- functions as a data type
- function arguments as a data type
- maximum type safety all the time always
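- no modern system ships FAO_AV, but the domain-to-column bookkeeping it depends on is queryable; a PostgreSQL sketch (assuming the hamburger_count domain above):
      -- find every column currently drawing values from a domain -
      -- the first step of FAO_AV / FAO_LIST
      SELECT table_name, column_name
      FROM information_schema.column_domain_usage
      WHERE domain_name = 'hamburger_count';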
The Basic Operators
- set of tools for data retrieval
- equivalent in power to “first-order predicate logic”
- operations/functions that return true or false
- the operations take 1 or 2 relations and return a relation (“operational closure”)
- as a result, results can be used in later operations
- a join b -> c; c join d -> e
- Codd notes it’s expressed algebraically here, but logical is simpler + more concise
- contemporary optimizers could handle logical more easily, because they could only optimize one command at a time
- concise -> fewer commands -> better optimization
- operators designed to use relations: no duplicates allowed
- allowing duplicates means operators can’t be used in any order
- alternatively, duplicates can be handled explicitly
- example of Cartesian product U = S x T = T x S
- given that columns are prefixed S.column and T.column if any column names overlap: “citation ordering”
- unless S == T, in which case S.column_1 and T.column_2
- “but like please don’t use the Cartesian product”
- operators
- projection: subset of columns, the SELECT part (but DISTINCT implied because these are relations)
- ϴ-select: subset of rows, the WHERE part, using value comparisons like <= or > or !=
- produces boolean values, so ϴ-selects can be chained
- implies: p -> q, “if p is true then q is also true”, p is a subset of q
- ϴ-join, usually just called “join”
- column-wise concatenate two relations
- A join B -> columns(a) ++ columns(b) where ϴ is true
- ϴ is a boolean-producing function of columns (the comparands)
- if it’s an equality function, the “equi-join”
- A join B on A.x = B.y -> x and y are the comparands
- logically equal to Cartesian(A, B) where A.x = B.y, but Cartesian is inefficient
- also degree(A ϴ B) = degree(A) + degree(B); (degree(X) means “# of columns in X”)
- natural join: equi-join, but only one copy of the comparands is kept
- relational union
- union of elements of the same type (so same data type or same domain)
- A, B are union-compatible if degree(A) = degree(B) and a mapping between columns(A) and columns(B) exists
- length(intersect(columns(A), columns(B))) == degree(A) == degree(B)
- after union, duplicates are removed b/c the result is a relation
- intersection: relations A, B are union-compatible -> return only rows that appear in both relations
- difference: relations A, B are union-compatible -> return only rows from A that don’t appear in B
- division
- relations A, B share at least one comparand column
- find the largest relation such that result[target columns] x B[comparand] ⊆ A[target columns ++ comparand]
- i.e., find the biggest “block” (target columns) in A whose pairing with every row of B already exists in A (see the SQL sketch after this list)
- degree(quotient) == degree(dividend) - degree(divisor)
- degree(remainder) == degree(dividend)
- fuckin’ wild
- projection, ϴ-selection, and ϴ-join can be executed in any sequence
- but only if no duplicates
- easier to optimize if possible
- quantifiers
- existential: exists x in A
- universal: for all x in A
- relational assignment: table aliasing
- insert: add new rows to a relation
- duplicate rows ignored; rows with duplicates in unique-constrained columns also ignored
- indexes updated
- union may also be allowed
- update: change values in existing rows (must respect referential integrity)
- cascade update
- multi-user environments mean it’s impossible for any one person to know all the locations where a PK’s values occur
- DBMS should cascade any PK updates automatically
- if sibling PK exists, update those too if desired
- indices also updated
- domain support would make it easier to track all occurring values of the target
- cascade missing
- same as cascade update, but filling with missing values instead
- delete: remove rows from database
- PK requirement for relations means delete should cascade, or else integrity violations in FK
- cascade delete
- safer than regular delete because avoids integrity violations
- but deletes more than just target row, so should still be handled with care
- sibling PK can be targeted optionally
- cascade missing
- even safer than cascade delete, since no downstream rows are deleted
- but introducing missing means any columns that disallow missing will cause op to fail
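- as promised, division: SQL has no direct division operator, so the standard workaround is a double NOT EXISTS (a sketch over hypothetical supplier/part tables):
      -- supplies ÷ parts: suppliers for whom no part exists that they don't supply
      SELECT DISTINCT s.supplier_id
      FROM supplies s
      WHERE NOT EXISTS (
          SELECT 1
          FROM parts p
          WHERE NOT EXISTS (
              SELECT 1
              FROM supplies s2
              WHERE s2.supplier_id = s.supplier_id
                AND s2.part_id     = p.part_id
          )
      );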
Advanced Operators
- more power and flexibility, “without introducing programming concepts”
- must respect operational closure (all operations return a relation)
- framing a relation: equivalent of GROUP BY
- does NOT create a subrelation for each group (a “partition”, like in Spark)
- for each group in the framing column, order rows by ascending value
- may allow range partitioning (like grouping over a calculated column)
- aggregation functions: calculating a function over each group
- these reduce to a row for each group
- implies the application of grouping to relations joined to from the framing relation
- so behind the scenes order of operations is join-group-agg
- extending a relation: adding columns
- could be used to make two relations union-compatible
- is used in outer joins: non-matching tuples are extended to include missing columns with imputed missing values
- semi-ϴ-join
- like a regular join, but where one operand is a projection just to the comparand
- relations A, B; B is reduced to only the join columns
- intended for use where A and B are located in different physical locations
- reducing one of the relations to just the join columns minimizes data over wire
- similar to “filter join”
- left- and right-outer-join
- relations A, B; A join B
- find all tuples from indicated side that didn’t match condition
- left: tuples from A
- right: tuples from B
- extend the found tuples with a tuple of missing values (length = degree of non-indicated side)
- union join result with the extended tuples
- full join
- relations A, B; right-join(A, B) union left-join(A, B)
- because duplicates are removed
- probably has a different implementation when duplicates are involved
- chock-a-block full o’ missing values
- MAYBE qualifier: in joins, allow missing values in the comparand
- outer natural joins: same as left- right-joins, but omits one of the comparand columns
- outer set operators: relations A, B
- outer union: extend both by each other, then take the union
- outer difference: extend both by each other, union, then set difference(extended A, union result)
- outer intersect: extend both by each other, semi-join each result to each other, then union
- T-join operators: inequality joins, but without requiring a complete join
- during the join, values are sorted in the comparands
- when values are repeating, they’re ordered within the value using a tiebreaker, often the PK
- go through A, and compare each tuple to the first tuple of B that matches the join condition
- introduces possibility of “cross-ties”/“quads”: multiple tuples in A match multiple tuples in B
- T-joins are only allowed to use each row of each operand once, to remove quads
- outer T-join: same as T-join, but keeps all tuples from A
- user-defined select/join operators: UDFs that return booleans, for filtering and join conditions
- recursive join operator: bridge table; a single parent-child relation self-joined until fully formed
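- the recursive join maps cleanly onto SQL’s WITH RECURSIVE (a sketch over a hypothetical parent_child bridge table):
      -- self-join the parent-child relation until the full ancestry is formed
      WITH RECURSIVE ancestry AS (
          SELECT parent_id, child_id
          FROM parent_child
        UNION
          SELECT a.parent_id, pc.child_id
          FROM ancestry a
          JOIN parent_child pc ON pc.parent_id = a.child_id
      )
      SELECT * FROM ancestry;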
Naming
- every object needs a single unambiguous name; each object class may have specific rules governing names
- “interchangeability of operands must not be reduced”
- domains/data types must have unique names across relations and functions
- relations and functions must have unique names, extending across columns
- columns only need to be unique within a relation
- domains, relations, and functions share a name pool
- columns should be named for their purpose + their domain
- an aside: Codd implies that SQL did not include a specification for table.column referencing, but literally every RDBMS I’ve worked with has this convention - was this introduced after the book was written? or perhaps not adhered as much as I think?
- selection by table.column must be enforced and unambiguous
- specifically for non-distributed databases: distributed ones must also include source
- UNION should be by column name + domain whenever possible; user should only have to specify if duplicate columns occur
- join or division should automatically prefix duplicate columns with source relation name (table.column notation)
- preserves column uniqueness in resulting relation
- projection (column subset) must accept user-specified names or keep source column names
- function-generated columns
- default to function_name.first_argument
- append “a sufficiently large substring of the function-invoking expression” to avoid duplicate names
- archived relations default to relation_name + YYYYMMDD + version # that day
- for example daily_sales_20001122_3
- integrity constraints also nameable
- useful for error investigation: “well-named constraint x threw error”
- “detective mode” naming: query results can be named
- called “detective mode” because a user might work through several queries on the DB looking for an answer
Commands for the DBA
- retrieval, ingest, deletion, archival, organization, and user access
- does not address physical data representation and access handling
- DB must support
- domain creation: name, type, range, comparator validity (<=, >= logic)
- domain rename: change only reference name, keep all other characteristics
- needs to update anything referencing this domain
- may lock large parts of DB depending on implementation
- using an immutable ID may be more suitable? separate name from ID
- alter domain: change some characteristic of the domain (not just the name)
- probably will affect application logic so, uh, handle with care
- drop domain: also drops any index based on it just fyi
- create R-table or view: name, definition (if view), column names + domains, PK or substitute, FK(s)
- DBMS may include automatic index creation on PK/FK (or inclusion into existing global index on their domain)
- rename R-table/view: any references by name in application programs will need to change
- drop R-table/view
- my favorite
- also drops its integrity constraints, and dependent views, and authorization constraints
- “catalog block” option: postpone updates to the master catalog until a consistent state is reached before committing
- omits cascade by default, but requires user input if a dependent is found
- if catalog block not specified, drop ’em all (compare the CASCADE/RESTRICT sketch at the end of this chapter’s notes)
- temp tables are dropped immediately and unceremoniously
- not a true delete; dropped tables are archived for a configurable time period, then rows are deleted
- more like a timestamped GC mark
- append columns: populates with missing values unless some default is specified
- rename columns: also updates any referencing indexes
- alter columns
- drop columns: doesn’t truly drop - locks user access, and DBMS deletes later
- create index: DBMS may offer delayed creation to handle in batches
- not clear if batching is Codd’s suggestion or just an option
- doesn’t seem super user-friendly, unless DBA team is coordinating batch ddl updates
- drop index
- create snapshot: much like creating materialized view, produces and physically stores a snapshot of relation
- load an R-table: may need special logic to convert from non-relational source data to relational format
- “control duplicate rows”: remove duplicate rows if found
- extract an R-table: dump a relation (like to CSV)
- note: indexes are solely for performance and should be separated from logical handling
- I.e., they can’t be used to enforce uniqueness, they should benefit from uniqueness.
- “an abuse” and “a DBMS design error” - tell us how you really feel
- another note!
- Codd uses a “shopping cart analysis” example where a grocery store manager would want to understand which items are bought together. To do this, each purchase would need to be identified - a transaction ID would be necessary. But you wouldn’t necessarily need each customer to be uniquely ID’d! “[T]he social security [number] approach represents a serious confusion between distinctiveness and identification”
- as in, there’s a difference between saying “user X bought [a, b, c], user Y bought [d, e]” and “Abby bought [a, b, c], Bill bought [d, e]”
- distinguishing between users != storing personally identifying information about the users
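- the catalog-block-or-drop-’em-all choice survives as SQL’s drop behaviors (PostgreSQL sketch):
      DROP TABLE daily_sales RESTRICT;  -- refuse if dependent views/constraints exist (the default)
      DROP TABLE daily_sales CASCADE;   -- drop the table plus its dependent views and constraints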
Missing Information
- aren’t we all
- Codd says “I do not feel this part of the relational model rests on such a solid theoretical foundation as the other parts”
- missing information commonly identified with a placeholder value, like -1 for an integer field
- specifies two types (marks) of missing information
- A-mark: “missing and applicable” - “we could know, but don’t”
- like not knowing someone’s birthday - presumably they have one but we don’t know it
- I-mark: “missing and inapplicable” - “we don’t know and it wouldn’t make sense to know”
- like not knowing your cat’s SSN - not only do you not know it, why the fuck would you?
- row may contain 1 or more missing values, but not ALL missing values
- a row of all missing values is useless information
- “we know nothing about this unidentified row” should get the response “yeah no shit”
- semantics for missing values are NOT the same as semantics for the type of those values
- A-marks in an integer column does NOT behave like an integer
- would an I-mark be necessary in a properly normalized database…?
- maybe in a view in such a DB: business logic may require a view that does introduce I-marks
- interaction between I-marks, A-marks, and real values is hierarchical
- I > A > real => I + A = I; A + real = A
- inherits from the higher level
- may be possible to denote a column that allows missing values with a single byte
- how DO current DBs denote nullability?
- “entity integrity”
- PK not allowed to contain missing values b/c referential integrity
- neither PK nor FK can contain I-marks
- disallowing missing values does NOT make a column a PK
- FK may allow A-marks in the RM, but this isn’t represented in contemporary DBs
- if A-marks are allowed in an FK, document why
- “referential integrity”: FK must have an existing PK to draw from
- handling of missing values must be globally consistent in the DB
- otherwise you’ll see different behavior, decidedly user-unfriendly and probably toxic to consistency
- mark replacement
- A marks may be replaced with any real value from that domain
- if authorized and maintains integrity constraints, I-marks may be replaced with A-marks or real values from that domain
- I-marks represent “unknowable” information, so replacing it should require someone to know what they’re doing
- hence “if authorized” and “maintains integrity constraints”
- A-marks allow some kinds of what-if analysis to take place
- like “if these A-marks in column X are set to 100, what is the sum of X?”
- 3-valued logic: True, False, Maybe
- tautological correctness: conditions that are logically true, but evaluation would cause Maybe or False
- (x < 1) | (x = 1) | (x > 1), x can be A-marks
- logically should be True, but an A-mark for x would cause a Maybe
- detect these conditions and warn users, or develop safeguards
- 4-valued logic: True, False, A-mark, I-mark
- fun fact, the truth tables can be generated from matrix multiplication
- allowing Maybe means users will need to specify if those values are allowed in joins or WHERE conditions
- similar to “where x = y or x is NULL” (see the three-valued-logic sketch at the end of this chapter’s notes)
- equality/ordering of missing values
- semantic equality: what the values mean are equal
- formal/symbolic equality: value type and missing type are equal
- A-mark == I-mark -> always false
- I-mark == I-mark -> I-mark
- semantic ordering: < or > in an RL sublanguage
- symbolic ordering: aforementioned I > A > real values
- important: semantic and symbolic do NOT match
- Codd argues this is because semantic occurs at a higher level of abstraction than symbolic
- semantic equality used at the query level (highest abstraction from actual data)
- symbolic equality used to order results before return (lower abstraction)
- they don’t have to match because they affect different things, and users shouldn’t have to care
- in removing duplicates, missing values are evaluated symbolically (A == A, I == I)
- not sure I buy this one theoretically, but if PK is required for all relations and cannot have missing, doesn’t seem like an issue
- joins follow semantic rules
- scalar functions follow symbolic rules
- statistical functions should allow substitutions of missing values at query time
- allows more what-if analysis
- may also be allowed to exclude missing
- replacements should be in the arguments, not the results, and must match domain
- joins can only produce A-marks, not I-marks
- 3- or 4-valued logic adds more query complexity
- seems like a lot of extra burden on the users
- Codd mentions this as a limitation of SQL, but seems like a natural consequence of having >2 truth values
- re: normalisation
- when normalising, pretend missing values do not exist
- if a missing is replaced with a real value, then normalise that value
- apply constraints only when you try to replace the missing value
- “please stop @-ing me about SQL, it’s not my fault”
- the RM != SQL; SQL is an attempted implementation of the RL of RM
- “missing information is not just a representation issue”
- defining what it means for a value to be missing?
- handling should be global and automatic, not left to users or application programs
- these criticisms are wrt missing values, instead of allowing default values
- note: “default value scheme” refers to replacing missing with some placeholder value
- criticism: “requires user to make the distinction between ‘X known to the system to not be Y’ and ‘X not Y’”
- rephrased: “users need to understand the difference between ‘X is missing’ and ‘X != Y’”
- default value scheme means users need to decide that value, make sure everyone understands what it means, and enforce constraints themselves
- application programs will also need to have handling for that value
- default values don’t actually fix the problem of missing values, they just change it to an arbitrary replacement
- Codd re: claim that RM’s handling of missing values is counterintuitive:
- “people used to think the Earth was flat and said round was counterintuitive”
- spicy
- criticism: “leads to a breakdown of normalization”
- statements
- relation R = columns(A, B, C)
- A -> B (A determines B; B is functionally dependent on A)
- tuple (?, b1, c1) ∈ R
- value b1 != b2
- value ? = A-mark
- insert(R, (?, b2, c2))
- if ? is later set to a1 for both tuples, A -> B condition is violated
- rebuttal
- A -> B wouldn’t be enforced until attempt is made to replace ? with a1
- at that time, constraints would reject the duplication of PK
- default value scheme would require this to be enforced immediately on insert of (?, b2, c2)
- by contrast, missing values are NOT real-valued and do not have to follow functional/multi-valued dependency rules
- criticism: “implementation anomalies”
- rebuttal: “like, again, dude, SQL isn’t the RM and it doesn’t perfectly follow the RM, so fuck you”
- criticism: “application of statistical functions returns undefined if missing values exist”
- rebuttal: “ok sure, there should be a way to skip missing values or replace them with something”
- criticism: “interface with host languages is difficult because many don’t handle missing values very well or at all”
- rebuttal: “it’s easier and clearer to have a systemic, global way to define and handle missing values rather than have users define it for each and every column”
- so now for Codd’s criticism of the default value scheme
- doesn’t provide tools for handling missing, just representing missing
- representing missing with a real value forces constraints to be applied on INSERT
- RM allows constraint at UPDATE
- missing has to be separately defined for each column: more burden on users
- handling missing becomes responsibility of applications - they have to provide those values
- so they also have to know what those values are
- instead of just providing/handling nothing, which is a lot easier
- missing values are handled like real values, ignoring logical semantics
- “a step backward from the RM to an ad hoc, unsystematic approach”
- more of a complaint than a criticism
- not all default values are bad, but they must make sense
- like opening a bank account and not putting $ in, so it defaults to 0 instead of missing
- Codd implores DB vendors + makers to “place more emphasis on the introduction and retention of database integrity”
- and “invest more effort in learning the systematic treatment of missing information”
- “do better, you animals”
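- SQL’s NULL is a single, conflated stand-in for Codd’s marks, and its three-valued logic is easy to demo at the prompt (a sketch):
      -- NULL comparisons yield UNKNOWN (Codd's "Maybe"), not TRUE or FALSE
      SELECT (NULL = NULL)    AS maybe_eq,    -- NULL
             (NULL OR TRUE)   AS still_true,  -- TRUE
             (NULL AND FALSE) AS still_false; -- FALSE
      -- hence the explicit escape hatch when Maybe rows are wanted:
      --   SELECT ... WHERE x = y OR x IS NULL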
Qualifiers
- “an expression that can be used in a command to alter some aspect of the execution of the command”
- truth-valued expressions default to True
- MAYBE expressions
- A-MAYBE: “expression evaluates to TRUE or A-mark”
- I-MAYBE: “expression evaluates to TRUE or I-mark”
- temporarily replace missing values with a real value
- AR: “A-mark Replace”
- IR: “I-mark Replace”
- ESR: “Empty Set Replace”
- ORDER BY
- if < is inapplicable, the base type is used
- can order either by one of the source relations, or the resulting relation
- must warn if the ordering isn’t represented by one of the columns in the result
- like ordering by “part_id”, but it isn’t in the result
- RM operators can’t exploit any information not represented by the values themselves
- ONCE ONLY: each tuple can only be used once to avoid quad situation
- each relation has 2 or more values that match the other -> 2 x 2 -> 4 -> “quad”
- DOMAIN CHECK OVERRIDE: temporarily disable DB check that join columns are from the same domain
- EXCLUDE SIBLINGS
- rude
- for UPDATE/DELETE/etc., do not hit values in sibling columns in other relations
- DEGREE OF DUPLICATION: after an operation, print a message with how many duplicate rows were removed
- SAVE: save the result to the database
- applied at the session level
- result is added to the catalog when the session ends
- assuming sufficient permissions
- VALUE: in a new column, indicate the value to use instead of a missing value
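- the closest modern analogue to the VALUE qualifier is substituting for missing at query time (sketch, hypothetical table):
      -- replace missing salaries with 0 for this query only; stored data is untouched
      SELECT emp_id, COALESCE(salary, 0) AS salary_or_zero
      FROM employees;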
Indicators
- “a side effect of executing a relational command”
- like status codes and logs
- expressed immediately before a relational command
- so { indicator } { query }
- indicators come (mostly) in pairs
- “u” to accept indicator from previous command
- “v” to accept indicator from current command
- empty relation: result returned no rows
- empty divisor: in relational division, if the “denominator” is empty
- missing information: were any missing values encountered
- non-existing argument indicator: a command is missing arguments
- also cancels the execution
- domain not declared: in a create table command, one of the columns is trying to draw from a nonexistent domain
- domain-check indicator: two columns are being compared, but their domains don’t match and the override wasn’t specified
- domain not droppable, column still exists
- drop domain was attempted, but columns that draw from it still exist
- cancels execution
- duplicate-row indicator: during load to DB, indicates that duplicate rows were found
- assumes table already exists
- never activates inside a truly relational DB, because all relations are unique
- duplicate PK indicator: during load to DB, duplicate PK values were found
- not allowed, so cancels execution
- non-redundant ordering operator: during ORDER BY, the column(s) being ordered by don’t appear in the query result
- catalog block operator: indicates that a catalog block is active, so the DB is suppressing cascading actions
- view not tuple-insertible: the view’s definition doesn’t allow for tuples to be inserted
- view not component-updatable: at least one component of every tuple in the view is not updatable
- view not tuple-deletable: tuples can’t be deleted from the view
- the last three indicators are off when the view is created, and are then inferred by the DB
Query and Manipulation
- “guaranteed access”: all values can be accessed by some combination of table name, PK, and column name
- this path should always be available “independent of database or vendor”
- “represents an associate-addressing scheme”: data isn’t accessed by pointer, it’s accessed by association with these names + PK
- only works if PK are absolutely required by the DB
- parseable relational data sublanguage (also called “RL” in this book)
- SQL is one (although Codd would scoff)
- must be built into the DB
- parseable: must be string-representable
- all operands are relations
- all manipulative operations produce relations, with indicators as necessary (see previous chapter)
- referenced earlier as “operational closure”
- Codd mentions multiple-choice questions as an interface - like drop-down menus
- this has its place, but it’s not in an interface to raw data manipulation
- an RL simplifies program maintenance, formal analysis, interface development
- “interactive tools that are claimed to make written or typed programs obsolete do not yet appear to support the maintenance requirement adequately”
- Ha! Tell me about it. Could have been written today.
- RL should express four-valued, first-order predicate logic
- support retrieval, view definition, insertion, update, deletion, missing information, integrity constraints, authorization constraints
- if distributed, management with “distribution independence”
- breaking commands into parts relative to the distribution of the data
- reconstructing results from the distributed elements
- high-level insert/update/delete
- high-level refers to abstraction: higher gives more freedom for behind-the-scenes optimization
- “can also be extremely important in obtaining efficient handling of transactions across a distributed database”
- victim of “papering over” of distributed realities? Codd would have been the type to be aware of those issues, but not sure how developed the awareness/theory would have been in 1989
- operational closure
- operations must only produce relations or sets of relations (they cannot do anything else)
- claim “no more operators are allowed” is wrong: this concept is just a constraint on the nature of potential new operators
- claim “operators cannot generate a relation that lacks a PK” is wrong:
- constraint on relations is row uniqueness
- conveniently represented by a PK
- but adequately captured by a “weak identifier” column set
- transaction support
- “a logical unit of work that transforms a consistent state of the DB into a consistent state, without necessarily preserving consistency at all intermediate points”
- an atomic unit of work that ensures consistency at start and end states
- entire operation succeeds or fails
- DBMS responsible for ensuring none of the cached changes are persisted in the event of a hardware/software failure
- errors in the transaction itself also cause it to fail
- BEGIN/COMMIT/ROLLBACK (transaction sketch at the end of this chapter’s notes)
- blocks on the database description (the catalog)
- CAT/END CAT commands
- postpones cascading commands so it’s possible to make changes to relations without immediately affecting anything referencing them
- a concurrency mechanism then
- seems underdeveloped in this text
- dynamic mode
- live command interpretation (e.g. psql) without needing source code change or recompilation
- principle: avoid halting DB activities as much as possible
- triple mode
- RL can be used at terminal, in application programs, and in “automated constraint reactive statements”
- develop/debug DB statements in isolation from the rest of the application code
- the basis for embedding “select a, b, c from schema.table where a = x” in app code
- four-valued logic: truth tables
- True, missing-Applicable, missing-Inapplicable, False
- handled behind the scenes - users need to specify what actions to take in response, not how these are defined or evaluated
- missing information: manipulation
- uniform, systematic, type-independent missing information handling
- MAYBE qualifier
- effect of missing values on arithmetic operators
- previously mentioned in the missing information chapter
- operations involving missing values have pre-defined behavior
- inherit the highest-order value: I-mark > A-mark > real values
- effect of missing values on concatenation
- same as with arithmetic: I-mark > A-mark > real values
- domain-constrained operators and DOMAIN CHECK OVERRIDE
- value comparisons are constrained to same domain unless the override is enabled
- operators constrained by basic data type
- when comparing values generated by a function or two functions
- only the base types must match, not the entire domain
- because function return types may not be defined in a domain already, but the functions need to work
- prohibition of essential ordering
- ordering on a relation cannot contain information not present in the relation
- or more specifically, it can’t be relied on or enforced
- not in the RM, anyway - some vendors support this behavior
- interface to single-record-at-a-time host languages
- result delivery must be “paginated”: deliverable in chunks of rows
- cursors for the host languages to iterate can be supported
- but these should only scan results, not DB internals
- without an ORDER BY, applications shouldn’t assume consistent or repeatable ordering
- comprehensive data sublanguage
- all operations must be supported in all modes (terminal, application, trigger statement)
- there are DB products for which the three have different sublanguages
- Codd references an ANSI/SPARC document as an example of this madness
- “advocated 42 distinct interfaces and (potentially) 42 distinct languages”
- fuck that
- library checkout: version control checkout built into the DB
- library return: version control merge
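- the transaction contract in its familiar SQL form (a sketch):
      BEGIN;
      UPDATE accounts SET balance = balance - 100 WHERE id = 1;
      UPDATE accounts SET balance = balance + 100 WHERE id = 2;
      COMMIT;  -- both updates persist, or (on error/ROLLBACK) neither does;
               -- consistency is only required at the start and end states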
Integrity Constraints
- “the preservation of accuracy”
- “preventing the DB from being damaged … [is] far easier to conceive than … repairing the damage once done”
- should NOT be the responsibility of add-ons, extensions, packages, etc. - more easily bypassed
- non-relational DBMS and AI prototypes were developed to encode “as much as possible the system’s behavior into the data structure”
- justified as “might simplify the programming”
- but “actually, the programming became much more complicated, and the systems became much harder to understand”
- what a familiar conversation
- integrity constraints “independent of the data structure” and specified in RL
- entity integrity: PK cannot have missing values, FK cannot contain I-marks
- referential integrity: FK must have a corresponding PK somewhere
- Codd blames poor support for referential integrity on a lack of support for domains
- types of constraints
- D: Domain integrity
- C: Column integrity
- E: Entity integrity
- R: Referential integrity
- U: User-defined integrity
- Timing
- TC: constraints are checked no later than the end of each relational request
- TT: constraints are checked no later than the end of the transaction
- integrity constraints live in the catalog
- type E only stores the PK declarations there, because it’s otherwise an automated constraint
- evaluation
- which constraints apply?
- what timings apply?
- type E is always TC
- types D, C are usually TC
- before each command completes, TC constraints are checked
- any TT checks found for each command are added to the list of the entire transaction’s checks
- before the transaction commits, all of the accumulated TT checks are enforced
- referential and user-defined integrity constraints need to specify if they’re TC or TT
- R- and U-type violations need a defined response
- before any RL statement completes, DBMS must examine catalog for constraints
- R-type constraints store response + column names of related PK/FK in the catalog
- U-type constraints store response + constraint definition in the catalog
- DBMS is responsible for activating constraints, NOT users or applications
- constraints are considered part of the DBMS environment + user community, not individual users
- DBMS models a micro-world - constraints are akin to speed of light and other physical laws
- enforcement should be constant and automatic - inherent to the DBMS
- in a sense, a CI pipeline running against the state of the DB and its allowed states
- D-, C-, E-type violations
- not allowed: execution cancelled and an error + reason returned to the user
- users may prohibit missing values (“not missing” constraint)
- blocks A-marks and I-marks from the indicated columns/relations
- independent of indexes: indexes are supposed to be for performance only, entirely separate from logical structure of the DB
- objects with indexes and not-missing constraints are common (like PK) but this is not a strict requirement
- user may prohibit duplicate values in a column (“unique” constraint) - see the named-constraint sketch at the end of this chapter’s notes
- a row of all missing values is illegal: it represents no information
- DB should maintain an audit log of all actions performed against the DB, and by whom
- non-subversion: any non-RL interface language must be provably unable to subvert these restrictions
- creating a constraint
- specify name, type, definition, timing
- catalog block required, then DB checks its current state against the constraint
- if current state is invalid, and user can’t correct it immediately at terminal, constraint is rejected
- if a C-type constraint to prohibit missing values is added, new missing values are rejected but existing ones are left alone
- not sure this makes sense, the DB would have to track the existing missing values and factor them into later constraint checks
- “minimal adequate scope of checking”
- inspect the minimum possible area in the DB to ensure integrity
- integrity constraints should be executable as commands (or functions that accept a relation)
- on-the-fly, end-of-command, end-of-transaction techniques
- allow the postponing of violations until later
- developed in the context of transactions: intermediate violations of state are allowed, as long as the end state is consistent
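- the not-missing and unique constraints, named so violations trace back to a well-named constraint (a sketch; hypothetical table):
      CREATE TABLE employees (
          emp_id   integer PRIMARY KEY,                         -- E-type: unique, no missing values
          email    text    CONSTRAINT email_required NOT NULL,  -- "not missing" without being a PK
          badge_no integer CONSTRAINT badge_unique   UNIQUE     -- "unique" without being a PK
      );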
User-defined Integrity Constraints
- constraints that reflect business needs, which are outside of the default built-in DB constraints
- also triggers, including time-based triggers
- must define timing (TC, TT), conditions to trigger, the constraint to check, and the response
- stored in the catalog
- can be triggered by application program or user
- retrieval, insert, update, or delete
- time-based triggers
- no timing conditions
- scheduled, or at set intervals
- queue structure, FIFO: DBMS sets a reminder for the earliest trigger, and once it fires, the queue advances to the next one
- conditions must evaluate to True/False
- can be set on DB states, or on changes in it
- “salary <= x”, or “salary cannot be increased by >y% at one time”
- Codd uses the convention of OLD.x and NEW.x - same as they are in Postgres, interesting (see the trigger sketch after this list)
- inserting missing values
- where allowed, default missing is an A-mark
- if not allowed, operation is rejected
- updating values to I-marks
- must be allowed and not violate any other constraints
- “Normalisation was originally conceived as a systematic way of ensuring that a logical design of a relational database would be free from insertion, update, and deletion anomalies.”
- functional, multi-valued, join, inclusion dependencies
- Codd mentions “a need for a data model that can easily accommodate such changes without impairing the correctness of already developed application programs”
- a way of modeling database migrations - like with TLA+
- functional dependency: “B depends on A”, A -> B, “A therefore B”
- multi-valued dependency: “B and C both depend on A, but independently”: A -> B and A -> C
- join dependency: “A can be created by joining B and C”, A = B * C
- inclusion dependency: “A is a subset of B”, A is in B; all FK are inclusion-dependent on their PK
- possible responses
- reject: do nothing
- cascade: if a value has any dependencies, cascade changes to those downstream dependencies
- mark: “if the cause of a violation can be pinned on a non-PK column(s) and missing values are allowed, mark the column(s) with A-marks”
- sort of like an ELSE NULL
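- a U-type constraint written with Codd’s OLD/NEW convention as a PostgreSQL trigger (a sketch; assumes an employees table with a salary column):
      -- enforce "salary cannot be increased by more than 10% at one time"
      CREATE FUNCTION cap_raise() RETURNS trigger AS $$
      BEGIN
          IF NEW.salary > OLD.salary * 1.10 THEN
              RAISE EXCEPTION 'salary raise exceeds 10%% cap';  -- the defined response: reject
          END IF;
          RETURN NEW;
      END;
      $$ LANGUAGE plpgsql;
      CREATE TRIGGER salary_cap
          BEFORE UPDATE OF salary ON employees
          FOR EACH ROW EXECUTE FUNCTION cap_raise();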
Catalog
- “the DB and its descriptions (the catalog) are seen as a collection of relations”
- should be fast, since it’s a central control point and should not bottleneck activities
- dynamic on-line catalog: the catalog must exist as a relation and be interacted with as one (sketch after this list)
- concurrency: DBMS supports multiple users at the same time
- responsible for storing table, view, and domain definitions; also constraints and UDFs
- stores authorization data - who has permission to access what
- database statistics for optimization
- row counts and cardinality for every column in every relation
- optimizer should reference this information
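- the dynamic on-line catalog survives today as information_schema and the system catalogs, queryable like any other relation (PostgreSQL sketch):
      -- table and view definitions, retrieved through the same relational interface as the data
      SELECT table_name, table_type
      FROM information_schema.tables
      WHERE table_schema = 'public';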
Views
- “insulate users from the base relations” - like front-end tables
- the external level of abstraction: data UX
- views are defined by a query over the base relations or other views (sketch at the end of this chapter’s notes)
- definitions are stored in the catalog
- must also have a unique name, like a base relation
- no procedures - must be a query only, but any query is ok
- do not have knowledge of data representation/access paths/methods
- “one should not have to be a programmer to define a view”
- definitions can be retrieved from the catalog by RL
- views must act like base relations; queries shouldn’t make any distinction
- DBMS should allow insert/update/delete to both relations and views
- but there are conditions when views can’t be altered: more later
- and views may not have PK, while relations must
- column names should be a combination of source table and source column
- definitions do not indicate the column domains - they’re deduced after the definition is stored, then kept in catalog
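- a view in its modern form: a named query whose definition lives in the catalog (sketch, hypothetical tables):
      CREATE VIEW active_suppliers AS
      SELECT supplier_id, supplier_name
      FROM suppliers
      WHERE status = 'active';
      -- queried exactly like a base relation:
      SELECT * FROM active_suppliers;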
View Updatability
- some views can’t accept inserts/updates/deletes in an unambiguous way: the “view updatability problem”
- these commands have to be translated into commands onto the base relations
- but the construction of the view may not allow an unambiguous decision of which rows to affect or which relation to apply to
- “all views that are theoretically updatable are updatable by the system”
- H.W. Buff proved the general question of “can this view be updated?” can’t be answered for all cases (it’s undecidable in general)
- Dayal and Bernstein; Keller have done works on view updatability
- DBAs may supply a program or procedure to decide the updating logic if they so choose
- these are suggested algorithms that would supply a “unique or uniquely sensible collection of corresponding changes to be made to the base relations that have as their effect upon the extension of the view the changes requested by the user”
- is there a single unambiguous change set on the base relations that will have this exact effect on the view?
- can happen within a transaction as needed
- assumptions
- A1: the definition of a view and its consequences with respect to insert/update/delete of its tuples must be understood by users
- users need to know what effect any of those commands will have on the base relations
- A2: deciding if a view is tuple-{x} able depends on expanded view definition, base table definitions, integrity constraints, and any statistical/aggregate functions in the view
- do the functions have an inverse?
- A3: “tuple-{x}able” is decided at view definition time, independent of any downstream views and ignoring actual table contents
- as in, it’s entirely a logical determination
- it also means no prohibitions are set on inspections of table contents at execution time
- A4: operator type immutability
- inserts can only be inserts, updates only updates, etc.
- so an update should never be transmuted to a delete plus an insert, for example
- A5: all views are subject to interpretation algorithms that decide the response to the commands
- only applicable to the 2nd algorithm
- operands are relations and may not have duplicate rows
- much more discipline is required if not true relations
- consequences
- A1: when the DBMS refuses an action, users have all the information to figure out why
- A2: “decision algorithm is independent of implementation”, because it’s all based on relational logic
- A3: instead of re-evaluating on every query, decide once at view definition by static analysis
- A4: it’s easier to trace actions, and produces no surprising effects - may restrict available solutions?
- the results from applying the algorithms are stored in the catalog
- non-updatable views are not rejected - they just have actions restricted
- steps
- get fully-expanded view definition
- convert it to RL
- reduce to minimum operations (without Cartesian products if possible)
- apply algorithm
- record result in catalog
- during update or insert, is the target untransformed (stored in the DB) or transformed (result of a computation)
- i.e., is the target a base relation or a view
- if it’s transformed, what were the functions used?
- if they don’t have an inverse, not component-updatable
- “back-traceable”: the target and its components can be traced back to specific rows in the base relations
- part 1: are all columns back-traceable?
- part 2: are all rows back-traceable?
- if part 2, tuple-deletable
- if part 1 and part 2, tuple-insertible
- if part 2, and relevant columns in the target are back-traceable, component-updatable
- Codd remarks that base relations and views should be indicated as fully normalised or not - if not, may have update anomalies
- a view based solely on a row filter from 1 (one) relation, without duplicates (see the CHECK OPTION sketch at the end of this chapter’s notes):
- is tuple-deletable
- is component-updatable for all backtraceable columns
- is tuple-insertible if all columns are backtraceable
- at insert, DB still needs to check that it matches the view definition
- at update, the row may be removed or rejected by other DBA-set constraints
- a view based solely on a column filter from 1 (one) relation:
- if PK is included, it’s a 1:1 row match -> tuple-insertible AND tuple-deletable
- equivalent to a row filter with no conditions
- candidate keys do NOT get the same treatment
- they’re not in the catalog so the 1:1 match can’t be directly inferred
- and their constraints can be dropped (while a PK/FK’s can’t), which would violate the inferred updatability
- possible improvement: any column with unique AND not-missing constraints gets this determination
- a view based on the join of 2 (two) relations
- if the comparands are many:many, not tuple-deletable
- deletion can’t guarantee that specific rows will be deleted
- the quad problem again
- the comparands must be candidate key, PK, or unique + not missing constrained to be tuple-deletable
- if comparands are backtraceable, tuple-insertible as well
- if the comparands are 1:many or many:1
- tuples can be inserted if the component parts from each base relation do not exist in both relations
- if the target values exist in only 1 of the relations, the relation that is missing the values gets the insert
- if both relations have the component parts, that means the row already exists
- if a quad would be generated by the join, deleting any 1 of them would result in a number that can’t be generated by the m x n rows
- the algorithms here apply to PK-PK, PK-FK, or FK-PK joins - Codd doesn’t consider non-key comparands even if they meet the conditions
- inequality joins: only greatest/least-style joins, plus the ONCE qualifier, are eligible
- natural join: same treatment as join-based views
- outer join: same as join, but allows the second-cited operand to contribute tuples padded with missing values
- relational division: none of the tuple-{x}able apply
- union
- deletion allowed if row only occurs in one of the base relations
- V1: insert/update allowed if the view is disjoint union and the row can only match integrity constraints of one of the base relations
- V2: DBA may define a function that unambiguously determines which relation to update/insert
- it’s evaluated like an integrity constraint, but at command-interpretation instead of command or transaction time
- outer union
- if insertion contains a column that doesn’t exist in the defining relation, the offending column will be dropped from the inserted tuple and the DB will throw a warning
- as in, it truncates the insert tuple to only columns that exist in the defining relation
- deletions are handled the same as in regular union
- RL difference (T = R - S)
- deletions are applied to R, since rows from S aren’t represented in T
- inserting into S would work, but violates A4 (operator immutability)
- inserts go to R only, provided it doesn’t exist in S (which causes no action)
- outer difference: same as RL difference but with type checking on differences between T and R/S
- more on normalization
- “make insert/update/delete operations clear in meaning”
- “involves breaking relations into relations of smaller degree”
- “tends to reduce performance because many more joins must often be executed”
- “the database models a micro-world”
- PK: each distinct type of object has a distinct id within that type
- table name: each table has a distinct name (table being a type of object)
- every fact in the DB is a fact about the object ID’d by that PK
- each fact is the single-valued immediate property of that object
- facts are collected together into a single relation if they share a type
- anomalies in non-normalized DBs arise from when the facts are too coupled
- an update to one type unduly affects facts of another type
- views that serve as base relations for other views should be described in the catalog
- called “conceptual relations”
- semi-insert
- base relations Suppliers (S) and Capabilities (K)
- view T = S[B = C]> K (right outer)
- insert new tuple for K’s columns into T: (s3, k8)
- if s3 !∈ S => reject
- if s3 ∈ T and the K component is missing, update this row to use the k component of the inserting tuple
- otherwise, copy the s-component from an s3 tuple and attach to the k-component of the inserting tuple
- semi-update (target s3)
- if s3 !∈ T => reject
- else, update all tuples in T that have value s3
- semi-delete/archive
- if s3 !∈ T => reject
- if s3 ∈ T exactly once, mark all target components as missing
- otherwise, remove all s3 rows except one, then mark its target components missing
- note: this chapter was important because of normalization
- the original intent of normalization is to control data change anomalies
- a consequence of splitting relations into smaller normalized parts is that seeing all the data together again requires a view
- hence the importance of supporting their updatability - according to Codd
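- for the simplest updatable case (a row filter over one relation), modern PostgreSQL behaves roughly as Codd prescribes; a sketch with a hypothetical customers table:
      CREATE VIEW ca_customers AS
          SELECT * FROM customers WHERE state = 'CA'
          WITH CHECK OPTION;  -- inserts/updates must satisfy the view's definition
      -- back-traceable: the insert lands in the base relation
      INSERT INTO ca_customers (customer_id, state) VALUES (42, 'CA');
      -- this one would be rejected by the CHECK OPTION:
      -- INSERT INTO ca_customers (customer_id, state) VALUES (43, 'NY');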
Authorization
- “who has access to what parts of the database and for what purpose”