https://dl.acm.org/doi/book/10.5555/77708
The “relational model” is a data modeling approach that captures all data as rows and columns of (special) tables: relations. The model was first introduced by Edgar Codd in 1969, and he championed it throughout the rest of his life - often battling against its misinterpretation and buzzwording. Out of that struggle came this book, published in 1989, as a compiled “this is why you’re wrong”.
The book is pretty dry, as befits a semi-technical book from the ’80s written by a mathematician. A lot of it is recognizable if you have any background in SQL, data analysis, or databases - and it’s still a fascinating look at the opinions and theories of a giant in the database field. Also, although it was written over 30 years ago, some of the complaints might as well have been written today.
- right off the bat, Codd hits us with a legendary roast:
“The slow acceptance of [AI] programs has been largely due to the use of incredibly complex data structures in that field. This supports the contention that AI researchers write their programs solely for other AI researchers to comprehend.”
- relation != table: relations may not have duplicate rows
- each row in a relation is an assertion: this tuple is a fact
- no positional concepts: the relation has no inherent ordering
- all values are atomic (smallest representable form)
- how does this affect JSON fields?
- why duplicate rows aren’t allowed:
- with concurrent users, a duplicate row would need a single agreed-upon meaning that applies across all relations and all users, and that can’t be guaranteed
- more on that later
- DBMS != DB; DB ∈ DBMS
- DBMS includes UX, operations, administration as well as data structure and access
- Read/Write/Update/Delete ops for each data type -> 4N total operational commands
- data relates by comparison of values
- RM (Relational Model) is intended as an abstraction - implementation is flexible, but Codd holds that it should possess the (many, many) enumerated properties
- not a fan of SQL: duplicate rows, many missing features
- transactions are required
- interesting, mentions the concept of WAL - did they exist at the time?
- DBMS must be able to manage databases entirely through relational capabilities
- Views must be supported (multiple chapters on this)
- 3- or 4-valued logic
- “the future lies with specialized sublanguages that can intercommunicate with one another”
- sort of true today, I suppose
- PK
- uniquely identifies the tuple
- each relation must have a PK
- may be constructed from multiple columns
- FK indicates the PK of another table, and all the values in FK must exist in that PK
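- a minimal SQL sketch of the PK/FK rules above (hypothetical schema; the RM predates SQL, so this is an analogue, not Codd’s notation):
      -- composite PK; every FK value must exist as a PK value in the referenced relation
      CREATE TABLE orders (
          store_id integer,
          order_no integer,
          PRIMARY KEY (store_id, order_no)           -- PK constructed from multiple columns
      );
      CREATE TABLE order_lines (
          store_id integer,
          order_no integer,
          line_no  integer,
          PRIMARY KEY (store_id, order_no, line_no),
          FOREIGN KEY (store_id, order_no)
              REFERENCES orders (store_id, order_no) -- FK points at the PK of another table
      );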
Ch. 1
- databases are collections of assertions: facts about the world
- in RM, all information only represented as data values in relations
- repeating groups are discouraged - normalization to remove these where possible
- prior data models handled repeating groups by positioning tuples in those groups close to each other
- this violates non-positionality of RM
- data catalog is necessary, because everything must be accessible from same relational UX
- no duplicate rows allowed
- physical data independence and transparency to users
- DB operation and user interaction should never be affected as a result of a change in the physical representation of the data
- “three-schema architecture”: storage representation, base relations, views
- better known as levels of abstraction, today: physical, logical, external
- domains as extended data types (see the SQL sketch at the end of this chapter’s notes)
- enforce same-type comparisons during joins and whatnot
- integrity constraints on the type
- column descriptions: column declarations with name, type, constraints
- like:
num_hamburgers int check(num_hamburgers > 0)
- PK
- required for each relation
- unique
- no missing values allowed
- composite OK
- Views can support PK
- or at a minimum a “weak identifier”: column(s) that uniquely ID a row, but aren’t defined as PK
- FK
- must reference PK in another relation
- no automatic FK-PK pairing done by the DBMS - must be hand-defined by DBA
- composite domains
- domains but built out of multiple columns
- order of definition matters: (colA, colB) is different from (colB, colA)
- composite columns
- requires a pre-defined composite domain
- must be within a single relation (columns can’t be from different relations)
- may themselves consist of other composite columns
- missing values
- replacement values are not allowed: no “” (empty string) as a stand-in for NULL, for example
- DBA must be able to declare if a column allows missing values or not
- universal relation not allowed
- a single relation built by combining all relations into one gigantic super-relation
- this is a weird idea. Codd makes more arguments against it later.
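- back to domains and column descriptions: a minimal PostgreSQL sketch (hypothetical names; CREATE DOMAIN is the closest modern analogue to Codd’s extended data types):
      -- the domain centralizes type + range + constraint...
      CREATE DOMAIN hamburger_count AS integer
          CHECK (VALUE > 0);
      -- ...and columns draw their values from it
      CREATE TABLE orders (
          order_id       integer PRIMARY KEY,
          num_hamburgers hamburger_count NOT NULL  -- DBA declares whether missing values are allowed
      );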
Domains as Extended Data Types
- Codd treats this as fundamental to the RM
- domain != column; column values ∈ domain
- named, are objects, declarable, have defined range, have defined comparability (<, >)
- “domains are the glue that holds a relational database together”
- which relations use which domains make it clear if groups of relations can be separated into a different database
- enforcing use of domains makes it clear when a comparison is valid
- makes explicit how relations are connected
- domain definition is more consistent and efficient than storing them in application logic
- also centralizes constraints
- “domain-constrained operators” for type safety/constraint enforcement
- also during joins and comparisons
- “essential that the DBMS + RL contain a command that is capable of referring to all columns currently drawing their values from a specified domain without the user having to list those columns”
- FAO_AV: Find All Occurrences of Active Values - search the whole DB for that value, using the domain, return all places it shows up
- FAO_LIST: Find All Occurrences in a List of values - search the whole DB for the listed values, using the domain, return all places where they show up (see the catalog-query sketch after this list)
- domains must support integrity checks
- domains participate in PK, FK, </>, unionability, integrity, inclusion checks
- unioning may be more inferrable with domains
- not sure on this, maybe in some cases but it seems a fringe benefit
- non-trivial unions where a domain may be used by multiple columns won’t see a difference
- indexes over the domains - largely for performance
- DBMS must also support some native data types and functions in wide use
- dates, decimal currency
- UDFs for special date handling
- day(), month(), year(), etc. functions
- clock time functions like hour(), minute()
- datetime types and functions
- timezones + conversions
- non-negative decimal currency and free-decimal currency
- functions as a data type
- function arguments as a data type
- maximum type safety all the time always
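- no modern system ships FAO_AV, but the domain-to-column bookkeeping it depends on is queryable; a PostgreSQL sketch (assuming the hamburger_count domain above):
      -- find every column currently drawing values from a domain -
      -- the first step of FAO_AV / FAO_LIST
      SELECT table_name, column_name
      FROM information_schema.column_domain_usage
      WHERE domain_name = 'hamburger_count';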
The Basic Operators
- set of tools for data retrieval
- equivalent in power to “first-order predicate logic”
- operations/functions that return true or false
- the operations take 1 or 2 relations and return a relation (“operational closure”)
- as a result, results can be used in later operations
- a join b -> c; c join d -> e
- Codd notes it’s expressed algebraically here, but logical is simpler + more concise
- contemporary optimizers could handle logical more easily, because they could only optimize one command at a time
- concise -> fewer commands -> better optimization
- operators designed to use relations: no duplicates allowed
- allowing duplicates means operators can’t be used in any order
- alternatively, duplicates can be handled explicitly
- example of Cartesian product U = S x T = T x S
- given that columns are prefixed S.column and T.column if any column names overlap: “citation ordering”
- unless S == T, in which case S.column_1 and T.column_2
- “but like please don’t use the Cartesian product”
- operators
- projection: subset of columns, the SELECT part (but DISTINCT implied because these are relations)
- ϴ-select: subset of rows, the WHERE part, using value comparisons like <= or > or !=
- produces boolean values, so ϴ-selects can be chained
- implies: p -> q, “if p is true then q is also true”, p is a subset of q
- ϴ-join, usually just called “join”
- column-wise concatenate two relations
- A join B -> columns(a) ++ columns(b) where ϴ is true
- ϴ is a boolean-producing function of columns (the comparands)
- if it’s an equality function, the “equi-join”
- A join B on A.x = B.y -> x and y are the comparands
- logically equal to Cartesian(A, B) where A.x = B.y, but Cartesian is inefficient
- also degree(A ϴ B) = degree(A) + degree(B); (degree(X) means “# of columns in X”)
- natural join: equi-join, but only one copy of the comparands is kept
- relational union
- union of elements of the same type (so same data type or same domain)
- A, B are union-compatible if degree(A) = degree(B) and a mapping between columns(A) and columns(B) exists
- length(intersect(columns(A), columns(B))) == degree(A) == degree(B)
- after union, duplicates are removed b/c the result is a relation
- intersection: relations A, B are union-compatible -> return only rows that appear in both relations
- difference: relations A, B are union-compatible -> return only rows from A that don’t appear in B
- division
- relations A, B share at least one comparand column
- find the largest relation such that result[target columns] x B[comparand] ⊆ A[target columns ++ comparand]
- i.e., find the biggest “block” (target columns) in A whose pairing with every row of B already exists in A (see the SQL sketch after this list)
- degree(quotient) == degree(dividend) - degree(divisor)
- degree(remainder) == degree(dividend)
- fuckin’ wild
- projection, ϴ-selection, and ϴ-join can be executed in any sequence
- but only if no duplicates
- easier to optimize if possible
- quantifiers
- existential: exists x in A
- universal: for all x in A
- relational assignment: table aliasing
- insert: add new rows to a relation
- duplicate rows ignored; rows with duplicates in unique-constrained columns also ignored
- indexes updated
- union may also be allowed
- update: change values in existing rows (must respect referential integrity)
- cascade update
- multi-user environments mean it’s impossible for any one person to know all the locations where a PK’s values occur
- DBMS should cascade any PK updates automatically
- if sibling PK exists, update those too if desired
- indices also updated
- domain support would make it easier to track all occurring values of the target
- cascade missing
- same as cascade update, but filling with missing values instead
- delete: remove rows from database
- PK requirement for relations means delete should cascade, or else integrity violations in FK
- cascade delete
- safer than regular delete because avoids integrity violations
- but deletes more than just target row, so should still be handled with care
- sibling PK can be targeted optionally
- cascade missing
- even safer than cascade delete, since no downstream rows are deleted
- but introducing missing means any columns that disallow missing will cause op to fail
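- as promised, division: SQL has no direct division operator, so the standard workaround is a double NOT EXISTS (a sketch over hypothetical supplier/part tables):
      -- supplies ÷ parts: suppliers for whom no part exists that they don't supply
      SELECT DISTINCT s.supplier_id
      FROM supplies s
      WHERE NOT EXISTS (
          SELECT 1
          FROM parts p
          WHERE NOT EXISTS (
              SELECT 1
              FROM supplies s2
              WHERE s2.supplier_id = s.supplier_id
                AND s2.part_id     = p.part_id
          )
      );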
Advanced Operators
- more power and flexibility, “without introducing programming concepts”
- must respect operational closure (all operations return a relation)
- framing a relation: equivalent of GROUP BY
- does NOT create a subrelation for each group (a “partition”, like in Spark)
- for each group in the framing column, order rows by ascending value
- may allow range partitioning (like grouping over a calculated column)
- aggregation functions: calculating a function over each group
- these reduce to a row for each group
- implies the application of grouping to relations joined to from the framing relation
- so behind the scenes order of operations is join-group-agg
- extending a relation: adding columns
- could be used to make two relations union-compatible
- is used in outer joins: non-matching tuples are extended to include missing columns with imputed missing values
- semi-ϴ-join
- like a regular join, but where one operand is a projection just to the comparand
- relations A, B; B is reduced to only the join columns
- intended for use where A and B are located in different physical locations
- reducing one of the relations to just the join columns minimizes data over wire
- similar to “filter join”
- left- and right-outer-join
- relations A, B; A join B
- find all tuples from indicated side that didn’t match condition
- left: tuples from A
- right: tuples from B
- extend the found tuples with a tuple of missing values (length = degree of non-indicated side)
- union join result with the extended tuples
- full join
- relations A, B; right-join(A, B) union left-join(A, B)
- because duplicates are removed
- probably has a different implementation when duplicates are involved
- chock-a-block full o’ missing values
- MAYBE qualifier: in joins, allow missing values in the comparand
- outer natural joins: same as left- right-joins, but omits one of the comparand columns
- outer set operators: relations A, B
- outer union: extend both by each other, then take the union
- outer difference: extend both by each other, union, then set difference(extended A, union result)
- outer intersect: extend both by each other, semi-join each result to each other, then union
- T-join operators: inequality joins, but without requiring a complete join
- during the join, values are sorted in the comparands
- when values are repeating, they’re ordered within the value using a tiebreaker, often the PK
- go through A, and compare each tuple to the first tuple of B that matches the join condition
- introduces possibility of “cross-ties”/“quads”: multiple tuples in A match multiple tuples in B
- T-joins are only allowed to use each row of each operand once, to remove quads
- outer T-join: same as T-join, but keeps all tuples from A
- user-defined select/join operators: UDFs that return booleans, for filtering and join conditions
- recursive join operator: bridge table; a single parent-child relation self-joined until fully formed
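- the recursive join maps cleanly onto SQL’s WITH RECURSIVE (a sketch over a hypothetical parent_child bridge table):
      -- self-join the parent-child relation until the full ancestry is formed
      WITH RECURSIVE ancestry AS (
          SELECT parent_id, child_id
          FROM parent_child
        UNION
          SELECT a.parent_id, pc.child_id
          FROM ancestry a
          JOIN parent_child pc ON pc.parent_id = a.child_id
      )
      SELECT * FROM ancestry;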
Naming
- every object needs a single unambiguous name; each object class may have specific rules governing names
- “interchangeability of operands must not be reduced”
- domains/data types must have unique names across relations and functions
- relations and functions must have unique names, extending across columns
- columns only need to be unique within a relation
- domains, relations, and functions share a name pool
- columns should be named for their purpose + their domain
- an aside: Codd implies that SQL did not include a specification for table.column referencing, but literally every RDBMS I’ve worked with has this convention - was this introduced after the book was written? or perhaps not adhered as much as I think?
- selection by table.column must be enforced and unambiguous
- specifically for non-distributed databases: distributed ones must also include source
- UNION should be by column name + domain whenever possible; user should only have to specify if duplicate columns occur
- join or division should automatically prefix duplicate columns with source relation name (table.column notation)
- preserves column uniqueness in resulting relation
- projection (column subset) must accept user-specified names or keep source column names
- function-generated columns
- default to function_name.first_argument
- append “a sufficiently large substring of the function-invoking expression” to avoid duplicate names
- archived relations default to relation_name + YYYYMMDD + version # that day
- for example daily_sales_20001122_3
- integrity constraints also nameable
- useful for error investigation: “well-named constraint x threw error”
- “detective mode” naming: query results can be named
- called “detective mode” because a user might work through several queries on the DB looking for an answer
Commands for the DBA
- retrieval, ingest, deletion, archival, organization, and user access
- does not address physical data representation and access handling
- DB must support
- domain creation: name, type, range, comparator validity (<=, >= logic)
- domain rename: change only reference name, keep all other characteristics
- needs to update anything referencing this domain
- may lock large parts of DB depending on implementation
- using an immutable ID may be more suitable? separate name from ID
- alter domain: change some characteristic of the domain (not just the name)
- probably will affect application logic so, uh, handle with care
- drop domain: also drops any index based on it just fyi
- create R-table or view: name, definition (if view), column names + domains, PK or substitute, FK(s)
- DBMS may include automatic index creation on PK/FK (or inclusion into existing global index on their domain)
- rename R-table/view: any references by name in application programs will need to change
- drop R-table/view
- my favorite
- also drops its integrity constraints, and dependent views, and authorization constraints
- “catalog block” option: postpone updates to the master catalog until a consistent state is reached before committing
- omits cascade by default, but requires user input if a dependent is found
- if catalog block not specified, drop ’em all (compare the CASCADE/RESTRICT sketch at the end of this chapter’s notes)
- temp tables are dropped immediately and unceremoniously
- not a true delete; dropped tables are archived for a configurable time period, then rows are deleted
- more like a timestamped GC mark
- append columns: populates with missing values unless some default is specified
- rename columns: also updates any referencing indexes
- alter columns
- drop columns: doesn’t truly drop - locks user access, and DBMS deletes later
- create index: DBMS may offer delayed creation to handle in batches
- not clear if batching is Codd’s suggestion or just an option
- doesn’t seem super user-friendly, unless DBA team is coordinating batch ddl updates
- drop index
- create snapshot: much like creating materialized view, produces and physically stores a snapshot of relation
- load an R-table: may need special logic to convert from non-relational source data to relational format
- “control duplicate rows”: remove duplicate rows if found
- extract an R-table: dump a relation (like to CSV)
- note: indexes are solely for performance and should be separated from logical handling
- I.e., they can’t be used to enforce uniqueness, they should benefit from uniqueness.
- “an abuse” and “a DBMS design error” - tell us how you really feel
- another note!
- Codd uses a “shopping cart analysis” example where a grocery store manager would want to understand which items are bought together. To do this, each purchase would need to be identified - a transaction ID would be necessary. But you wouldn’t necessarily need each customer to be uniquely ID’d! “[T]he social security [number] approach represents a serious confusion between distinctiveness and identification”
- as in, there’s a difference between saying “user X bought [a, b, c], user Y bought [d, e]” and “Abby bought [a, b, c], Bill bought [d, e]”
- distinguishing between users != storing personally identifying information about the users
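- the catalog-block-or-drop-’em-all choice survives as SQL’s drop behaviors (PostgreSQL sketch):
      DROP TABLE daily_sales RESTRICT;  -- refuse if dependent views/constraints exist (the default)
      DROP TABLE daily_sales CASCADE;   -- drop the table plus its dependent views and constraints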
Missing Information
- aren’t we all
- Codd says “I do not feel this part of the relational model rests on such a solid theoretical foundation as the other parts”
- missing information commonly identified with a placeholder value, like -1 for an integer field
- specifies two types (marks) of missing information
- A-mark: “missing and applicable” - “we could know, but don’t”
- like not knowing someone’s birthday - presumably they have one but we don’t know it
- I-mark: “missing and inapplicable” - “we don’t know and it wouldn’t make sense to know”
- like not knowing your cat’s SSN - not only do you not know it, why the fuck would you?
- row may contain 1 or more missing values, but not ALL missing values
- a row of all missing values is useless information
- “we know nothing about this unidentified row” should get the response “yeah no shit”
- semantics for missing values are NOT the same as semantics for the type of those values
- A-marks in an integer column does NOT behave like an integer
- would an I-mark be necessary in a properly normalized database…?
- maybe in a view in such a DB: business logic may require a view that does introduce I-marks
- interaction between I-marks, A-marks, and real values is hierarchical
- I > A > real => I + A = I; A + real = A
- inherits from the higher level
- may be possible to denote a column that allows missing values with a single byte
- how DO current DBs denote nullability?
- “entity integrity”
- PK not allowed to contain missing values b/c referential integrity
- neither PK nor FK can contain I-marks
- disallowing missing values does NOT make a column a PK
- FK may allow A-marks in the RM, but this isn’t represented in contemporary DBs
- if A-marks are allowed in an FK, document why
- “referential integrity”: FK must have an existing PK to draw from
- handling of missing values must be globally consistent in the DB
- otherwise you’ll see different behavior, decidedly user-unfriendly and probably toxic to consistency
- mark replacement
- A marks may be replaced with any real value from that domain
- if authorized and maintains integrity constraints, I-marks may be replaced with A-marks or real values from that domain
- I-marks represent “unknowable” information, so replacing it should require someone to know what they’re doing
- hence “if authorized” and “maintains integrity constraints”
- A-marks allow some kinds of what-if analysis to take place
- like “if these A-marks in column X are set to 100, what is the sum of X?”
- 3-valued logic: True, False, Maybe
- tautological correctness: conditions that are logically true, but evaluation would cause Maybe or False
- (x < 1) | (x = 1) | (x > 1), x can be A-marks
- logically should be True, but an A-mark for x would cause a Maybe
- detect these conditions and warn users, or develop safeguards
- 4-valued logic: True, False, A-mark, I-mark
- fun fact, the truth tables can be generated from matrix multiplication
- allowing Maybe means users will need to specify if those values are allowed in joins or WHERE conditions
- similar to “where x = y or x is NULL” (see the three-valued-logic sketch at the end of this chapter’s notes)
- equality/ordering of missing values
- semantic equality: what the values mean are equal
- formal/symbolic equality: value type and missing type are equal
- A-mark == I-mark -> always false
- I-mark == I-mark -> I-mark
- semantic ordering: < or > in an RL sublanguage
- symbolic ordering: aforementioned I > A > real values
- important: semantic and symbolic do NOT match
- Codd argues this is because semantic occurs at a higher level of abstraction than symbolic
- semantic equality used at the query level (highest abstraction from actual data)
- symbolic equality used to order results before return (lower abstraction)
- they don’t have to match because they affect different things, and users shouldn’t have to care
- in removing duplicates, missing values are evaluated symbolically (A == A, I == I)
- not sure I buy this one theoretically, but if PK is required for all relations and cannot have missing, doesn’t seem like an issue
- joins follow semantic rules
- scalar functions follow symbolic rules
- statistical functions should allow substitutions of missing values at query time
- allows more what-if analysis
- may also be allowed to exclude missing
- replacements should be in the arguments, not the results, and must match domain
- joins can only produce A-marks, not I-marks
- 3- or 4-valued logic adds more query complexity
- seems like a lot of extra burden on the users
- Codd mentions this as a limitation of SQL, but seems like a natural consequence of having >2 truth values
- re: normalisation
- when normalising, pretend missing values do not exist
- if a missing is replaced with a real value, then normalise that value
- apply constraints only when you try to replace the missing value
- “please stop @-ing me about SQL, it’s not my fault”
- the RM != SQL; SQL is an attempted implementation of the RL of RM
- “missing information is not just a representation issue”
- defining what it means for a value to be missing?
- handling should be global and automatic, not left to users or application programs
- these criticisms are wrt missing values, instead of allowing default values
- note: “default value scheme” refers to replacing missing with some placeholder value
- criticism: “requires user to make the distinction between ‘X known to the system to not be Y’ and ‘X not Y’”
- rephrased: “users need to understand the difference between ‘X is missing’ and ‘X != Y’”
- default value scheme means users need to decide that value, make sure everyone understands what it means, and enforce constraints themselves
- application programs will also need to have handling for that value
- default values don’t actually fix the problem of missing values, they just change it to an arbitrary replacement
- Codd re: claim that RM’s handling of missing values is counterintuitive:
- “people used to think the Earth was flat and said round was counterintuitive”
- spicy
- criticism: “leads to a breakdown of normalization”
- statements
- relation R = columns(A, B, C)
- A -> B (A determines B; B is functionally dependent on A)
- tuple (?, b1, c1) ∈ R
- value b1 != b2
- value ? = A-mark
- insert(R, (?, b2, c2))
- if ? is later set to a1 for both tuples, A -> B condition is violated
- rebuttal
- A -> B wouldn’t be enforced until attempt is made to replace ? with a1
- at that time, constraints would reject the duplication of PK
- default value scheme would require this to be enforced immediately on insert of (?, b2, c2)
- by contrast, missing values are NOT real-valued and do not have to follow functional/multi-valued dependency rules
- criticism: “implementation anomalies”
- rebuttal: “like, again, dude, SQL isn’t the RM and it doesn’t perfectly follow the RM, so fuck you”
- criticism: “application of statistical functions returns undefined if missing values exist”
- rebuttal: “ok sure, there should be a way to skip missing values or replace them with something”
- criticism: “interface with host languages is difficult because many don’t handle missing values very well or at all”
- rebuttal: “it’s easier and clearer to have a systemic, global way to define and handle missing values rather than have users define it for each and every column”
- so now for Codd’s criticism of the default value scheme
- doesn’t provide tools for handling missing, just representing missing
- representing missing with a real value forces constraints to be applied on INSERT
- RM allows constraint at UPDATE
- missing has to be separately defined for each column: more burden on users
- handling missing becomes responsibility of applications - they have to provide those values
- so they also have to know what those values are
- instead of just providing/handling nothing, which is a lot easier
- missing values are handled like real values, ignoring logical semantics
- “a step backward from the RM to an ad hoc, unsystematic approach”
- more of a complaint than a criticism
- not all default values are bad, but they must make sense
- like opening a bank account and not putting $ in, so it defaults to 0 instead of missing
- Codd implores DB vendors + makers to “place more emphasis on the introduction and retention of database integrity”
- and “invest more effort in learning the systematic treatment of missing information”
- “do better, you animals”
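- SQL’s NULL is a single, conflated stand-in for Codd’s marks, and its three-valued logic is easy to demo at the prompt (a sketch):
      -- NULL comparisons yield UNKNOWN (Codd's "Maybe"), not TRUE or FALSE
      SELECT (NULL = NULL)    AS maybe_eq,    -- NULL
             (NULL OR TRUE)   AS still_true,  -- TRUE
             (NULL AND FALSE) AS still_false; -- FALSE
      -- hence the explicit escape hatch when Maybe rows are wanted:
      --   SELECT ... WHERE x = y OR x IS NULL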
Qualifiers
- “an expression that can be used in a command to alter some aspect of the execution of the command”
- truth-valued expressions default to True
- MAYBE expressions
- A-MAYBE: “expression evaluates to TRUE or A-mark”
- I-MAYBE: “expression evaluates to TRUE or I-mark”
- temporarily replace missing values with a real value
- AR: “A-mark Replace”
- IR: “I-mark Replace”
- ESR: “Empty Set Replace”
- ORDER BY
- if < is inapplicable, the base type is used
- can order either by one of the source relations, or the resulting relation
- must warn if the ordering isn’t represented by one of the columns in the result
- like ordering by “part_id”, but it isn’t in the result
- RM operators can’t exploit any information not represented by the values themselves
- ONCE ONLY: each tuple can only be used once to avoid quad situation
- each relation has 2 or more values that match the other -> 2 x 2 -> 4 -> “quad”
- DOMAIN CHECK OVERRIDE: temporarily disable DB check that join columns are from the same domain
- EXCLUDE SIBLINGS
- rude
- for UPDATE/DELETE/etc., do not hit values in sibling columns in other relations
- DEGREE OF DUPLICATION: after an operation, print a message with how many duplicate rows were removed
- SAVE: save the result to the database
- applied at the session level
- result is added to the catalog when the session ends
- assuming sufficient permissions
- VALUE: in a new column, indicate the value to use instead of a missing value
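- the closest modern analogue to the VALUE qualifier is substituting for missing at query time (sketch, hypothetical table):
      -- replace missing salaries with 0 for this query only; stored data is untouched
      SELECT emp_id, COALESCE(salary, 0) AS salary_or_zero
      FROM employees;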
Indicators
- “a side effect of executing a relational command”
- like status codes and logs
- expressed immediately before a relational command
- so { indicator } { query }
- indicators come (mostly) in pairs
- “u” to accept indicator from previous command
- “v” to accept indicator from current command
- empty relation: result returned no rows
- empty divisor: in relational division, if the “denominator” is empty
- missing information: were any missing values encountered
- non-existing argument indicator: a command is missing arguments
- also cancels the execution
- domain not declared: in a create table command, one of the columns is trying to draw from a nonexistent domain
- domain-check indicator: two columns are being compared, but their domains don’t match and the override wasn’t specified
- domain not droppable, column still exists
- drop domain was attempted, but columns that draw from it still exist
- cancels execution
- duplicate-row indicator: during load to DB, indicates that duplicate rows were found
- assumes table already exists
- never activates inside a truly relational DB, because all relations are unique
- duplicate PK indicator: during load to DB, duplicate PK values were found
- not allowed, so cancels execution
- non-redundant ordering operator: during ORDER BY, the column(s) being ordered by don’t appear in the query result
- catalog block operator: indicates that a catalog block is active, so the DB is suppressing cascading actions
- view not tuple-insertible: the view’s definition doesn’t allow for tuples to be inserted
- view not component-updatable: at least one component of every tuple in the view is not updatable
- view not tuple-deletable: tuples can’t be deleted from the view
- the last three indicators are off when the view is created, and are then inferred by the DB
Query and Manipulation
- “guaranteed access”: all values can be accessed by some combination of table name, PK, and column name
- this path should always be available “independent of database or vendor”
- “represents an associate-addressing scheme”: data isn’t accessed by pointer, it’s accessed by association with these names + PK
- only works if PK are absolutely required by the DB
- parseable relational data sublanguage (also called “RL” in this book)
- SQL is one (although Codd would scoff)
- must be built into the DB
- parseable: must be string-representable
- all operands are relations
- all manipulative operations produce relations, with indicators as necessary (see previous chapter)
- referenced earlier as “operational closure”
- Codd mentions multiple-choice questions as an interface - like drop-down menus
- this has its place, but it’s not in an interface to raw data manipulation
- an RL simplifies program maintenance, formal analysis, interface development
- “interactive tools that are claimed to make written or typed programs obsolete do not yet appear to support the maintenance requirement adequately”
- Ha! Tell me about it. Could have been written today.
- RL should express four-valued, first-order predicate logic
- support retrieval, view definition, insertion, update, deletion, missing information, integrity constraints, authorization constraints
- if distributed, management with “distribution independence”
- breaking commands into parts relative to the distribution of the data
- reconstructing results from the distributed elements
- high-level insert/update/delete
- high-level refers to abstraction: higher gives more freedom for behind-the-scenes optimization
- “can also be extremely important in obtaining efficient handling of transactions across a distributed database”
- victim of “papering over” of distributed realities? Codd would have been the type to be aware of those issues, but not sure how developed the awareness/theory would have been in 1989
- operational closure
- operations must only produce relations or sets of relations (they cannot do anything else)
- claim “no more operators are allowed” is wrong: this concept is just a constraint on the nature of potential new operators
- claim “operators cannot generate a relation that lacks a PK” is wrong:
- constraint on relations is row uniqueness
- conveniently represented by a PK
- but adequately captured by a “weak identifier” column set
- transaction support
- “a logical unit of work that transforms a consistent state of the DB into a consistent state, without necessarily preserving consistency at all intermediate points”
- an atomic unit of work that ensures consistency at start and end states
- entire operation succeeds or fails
- DBMS responsible for ensuring none of the cached changes are persisted in the event of a hardware/software failure
- errors in the transaction itself also cause it to fail
- BEGIN/COMMIT/ROLLBACK (transaction sketch at the end of this chapter’s notes)
- blocks on the database description (the catalog)
- CAT/END CAT commands
- postpones cascading commands so it’s possible to make changes to relations without immediately affecting anything referencing them
- a concurrency mechanism then
- seems underdeveloped in this text
- dynamic mode
- live command interpretation (e.g. psql) without needing source code change or recompilation
- principle: avoid halting DB activities as much as possible
- triple mode
- RL can be used at terminal, in application programs, and in “automated constraint reactive statements”
- develop/debug DB statements in isolation from the rest of the application code
- the basis for embedding “select a, b, c from schema.table where a = x” in app code
- four-valued logic: truth tables
- True, missing-Applicable, missing-Inapplicable, False
- handled behind the scenes - users need to specify what actions to take in response, not how these are defined or evaluated
- missing information: manipulation
- uniform, systematic, type-independent missing information handling
- MAYBE qualifier
- effect of missing values on arithmetic operators
- previously mentioned in the missing information chapter
- operations involving missing values have pre-defined behavior
- inherit the highest-order value: I-mark > A-mark > real values
- effect of missing values on concatenation
- same as with arithmetic: I-mark > A-mark > real values
- domain-constrained operators and DOMAIN CHECK OVERRIDE
- value comparisons are constrained to same domain unless the override is enabled
- operators constrained by basic data type
- when comparing values generated by a function or two functions
- only the base types must match, not the entire domain
- because function return types may not be defined in a domain already, but the functions need to work
- prohibition of essential ordering
- ordering on a relation cannot contain information not present in the relation
- or more specifically, it can’t be relied on or enforced
- not in the RM, anyway - some vendors support this behavior
- interface to single-record-at-a-time host languages
- result delivery must be “paginated”: deliverable in chunks of rows
- cursors for the host languages to iterate can be supported
- but these should only scan results, not DB internals
- without an ORDER BY, applications shouldn’t assume consistent or repeatable ordering
- comprehensive data sublanguage
- all operations must be supported in all modes (terminal, application, trigger statement)
- there are DB products for which the three have different sublanguages
- Codd references an ANSI/SPARC document as an example of this madness
- “advocated 42 distinct interfaces and (potentially) 42 distinct languages”
- fuck that
- library checkout: version control checkout built into the DB
- library return: version control merge
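- the transaction contract in its familiar SQL form (a sketch):
      BEGIN;
      UPDATE accounts SET balance = balance - 100 WHERE id = 1;
      UPDATE accounts SET balance = balance + 100 WHERE id = 2;
      COMMIT;  -- both updates persist, or (on error/ROLLBACK) neither does;
               -- consistency is only required at the start and end states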
Integrity Constraints
- “the preservation of accuracy”
- “preventing the DB from being damaged … [is] far easier to conceive than … repairing the damage once done”
- should NOT be the responsibility of add-ons, extensions, packages, etc. - more easily bypassed
- non-relational DBMS and AI prototypes were developed to encode “as much as possible the system’s behavior into the data structure”
- justified as “might simplify the programming”
- but “actually, the programming became much more complicated, and the systems became much harder to understand”
- what a familiar conversation
- integrity constraints “independent of the data structure” and specified in RL
- entity integrity: PK cannot have missing values, FK cannot contain I-marks
- referential integrity: FK must have a corresponding PK somewhere
- Codd blames poor support for referential integrity on a lack of support for domains
- types of constraints
- D: Domain integrity
- C: Column integrity
- E: Entity integrity
- R: Referential integrity
- U: User-defined integrity
- Timing
- TC: constraints are checked no later than the end of each relational request
- TT: constraints are checked no later than the end of the transaction
- integrity constraints live in the catalog
- type E only stores the PK declarations there, because it’s otherwise an automated constraint
- evaluation
- which constraints apply?
- what timings apply?
- type E is always TC
- types D, C are usually TC
- before each command completes, TC constraints are checked
- any TT checks found for each command are added to the list of the entire transaction’s checks
- before the transaction commits, all of the accumulated TT checks are enforced
- referential and user-defined integrity constraints need to specify if they’re TC or TT
- R- and U-type violations need a defined response
- before any RL statement completes, DBMS must examine catalog for constraints
- R-type constraints store response + column names of related PK/FK in the catalog
- U-type constraints store response + constraint definition in the catalog
- DBMS is responsible for activating constraints, NOT users or applications
- constraints are considered part of the DBMS environment + user community, not individual users
- DBMS models a micro-world - constraints are akin to speed of light and other physical laws
- enforcement should be constant and automatic - inherent to the DBMS
- in a sense, a CI pipeline running against the state of the DB and its allowed states
- D-, C-, E-type violations
- not allowed: execution cancelled and an error + reason returned to the user
- users may prohibit missing values (“not missing” constraint)
- blocks A-marks and I-marks from the indicated columns/relations
- independent of indexes: indexes are supposed to be for performance only, entirely separate from logical structure of the DB
- objects with indexes and not-missing constraints are common (like PK) but this is not a strict requirement
- user may prohibit duplicate values in a column (“unique” constraint) - see the named-constraint sketch at the end of this chapter’s notes
- a row of all missing values is illegal: it represents no information
- DB should maintain an audit log of all actions performed against the DB, and by whom
- non-subversion: any non-RL interface language must be provably unable to subvert these restrictions
- creating a constraint
- specify name, type, definition, timing
- catalog block required, then DB checks its current state against the constraint
- if current state is invalid, and user can’t correct it immediately at terminal, constraint is rejected
- if a C-type constraint to prohibit missing values is added, new missing values are rejected but existing ones are left alone
- not sure this makes sense, the DB would have to track the existing missing values and factor them into later constraint checks
- “minimal adequate scope of checking”
- inspect the minimum possible area in the DB to ensure integrity
- integrity constraints should be executable as commands (or functions that accept a relation)
- on-the-fly, end-of-command, end-of-transaction techniques
- allow the postponing of violations until later
- developed in the context of transactions: intermediate violations of state are allowed, as long as the end state is consistent
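- the not-missing and unique constraints, named so violations trace back to a well-named constraint (a sketch; hypothetical table):
      CREATE TABLE employees (
          emp_id   integer PRIMARY KEY,                         -- E-type: unique, no missing values
          email    text    CONSTRAINT email_required NOT NULL,  -- "not missing" without being a PK
          badge_no integer CONSTRAINT badge_unique   UNIQUE     -- "unique" without being a PK
      );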
User-defined Integrity Constraints
- constraints that reflect business needs, which are outside of the default built-in DB constraints
- also triggers, including time-based triggers
- must define timing (TC, TT), conditions to trigger, the constraint to check, and the response
- stored in the catalog
- can be triggered by application program or user
- retrieval, insert, update, or delete
- time-based triggers
- no timing conditions
- scheduled, or at set intervals
- queue structure, FIFO: DBMS sets a reminder for the earliest trigger, and once it fires, the queue advances to the next one
- conditions must evaluate to True/False
- can be set on DB states, or on changes in it
- “salary <= x”, or “salary cannot be increased by >y% at one time”
- Codd uses the convention of OLD.x and NEW.x - same as they are in Postgres, interesting (see the trigger sketch after this list)
- inserting missing values
- where allowed, default missing is an A-mark
- if not allowed, operation is rejected
- updating values to I-marks
- must be allowed and not violate any other constraints
- “Normalisation was originally conceived as a systematic way of ensuring that a logical design of a relational database would be free from insertion, update, and deletion anomalies.”
- functional, multi-valued, join, inclusion dependencies
- Codd mentions “a need for a data model that can easily accommodate such changes without impairing the correctness of already developed application programs”
- a way of modeling database migrations - like with TLA+
- functional dependency: “B depends on A”, A -> B, “A therefore B”
- multi-valued dependency: “B and C both depend on A, but independently”: A -> B and A -> C
- join dependency: “A can be created by joining B and C”, A = B * C
- inclusion dependency: “A is a subset of B”, A is in B; all FK are inclusion-dependent on their PK
- possible responses
- reject: do nothing
- cascade: if a value has any dependencies, cascade changes to those downstream dependencies
- mark: “if the cause of a violation can be pinned on a non-PK column(s) and missing values are allowed, mark the column(s) with A-marks”
- sort of like an ELSE NULL
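- a U-type constraint written with Codd’s OLD/NEW convention as a PostgreSQL trigger (a sketch; assumes an employees table with a salary column):
      -- enforce "salary cannot be increased by more than 10% at one time"
      CREATE FUNCTION cap_raise() RETURNS trigger AS $$
      BEGIN
          IF NEW.salary > OLD.salary * 1.10 THEN
              RAISE EXCEPTION 'salary raise exceeds 10%% cap';  -- the defined response: reject
          END IF;
          RETURN NEW;
      END;
      $$ LANGUAGE plpgsql;
      CREATE TRIGGER salary_cap
          BEFORE UPDATE OF salary ON employees
          FOR EACH ROW EXECUTE FUNCTION cap_raise();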
Catalog
- “the DB and its descriptions (the catalog) are seen as a collection of relations”
- should be fast, since it’s a central control point and should not bottleneck activities
- dynamic on-line catalog: the catalog must exist as a relation and be interacted with as one (sketch after this list)
- concurrency: DBMS supports multiple users at the same time
- responsible for storing table, view, and domain definitions; also constraints and UDFs
- stores authorization data - who has permission to access what
- database statistics for optimization
- row counts and cardinality for every column in every relation
- optimizer should reference this information
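- the dynamic on-line catalog survives today as information_schema and the system catalogs, queryable like any other relation (PostgreSQL sketch):
      -- table and view definitions, retrieved through the same relational interface as the data
      SELECT table_name, table_type
      FROM information_schema.tables
      WHERE table_schema = 'public';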
Views
- “insulate users from the base relations” - like front-end tables
- the external level of abstraction: data UX
- views are defined by a query over the base relations or other views (sketch at the end of this chapter’s notes)
- definitions are stored in the catalog
- must also have a unique name, like a base relation
- no procedures - must be a query only, but any query is ok
- do not have knowledge of data representation/access paths/methods
- “one should not have to be a programmer to define a view”
- definitions can be retrieved from the catalog by RL
- views must act like base relations; queries shouldn’t make any distinction
- DBMS should allow insert/update/delete to both relations and views
- but there are conditions when views can’t be altered: more later
- and views may not have PK, while relations must
- column names should be a combination of source table and source column
- definitions do not indicate the column domains - they’re deduced after the definition is stored, then kept in catalog
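- a view in its modern form: a named query whose definition lives in the catalog (sketch, hypothetical tables):
      CREATE VIEW active_suppliers AS
      SELECT supplier_id, supplier_name
      FROM suppliers
      WHERE status = 'active';
      -- queried exactly like a base relation:
      SELECT * FROM active_suppliers;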
View Updatability
- some views can’t accept inserts/updates/deletes in an unambiguous way: the “view updatability problem”
- these commands have to be translated into commands onto the base relations
- but the construction of the view may not allow an unambiguous decision of which rows to affect or which relation to apply to
- “all views that are theoretically updatable are updatable by the system”
- H.W. Buff proved the general question of “can this view be updated?” can’t be answered for all cases (it’s undecidable in general)
- Dayal and Bernstein; Keller have done works on view updatability
- DBAs may supply a program or procedure to decide the updating logic if they so choose
- these are suggested algorithms that would supply a “unique or uniquely sensible collection of corresponding changes to be made to the base relations that have as their effect upon the extension of the view the changes requested by the user”
- is there a single unambiguous change set on the base relations that will have this exact effect on the view?
- can happen within a transaction as needed
- assumptions
- A1: the definition of a view and its consequences with respect to insert/update/delete of its tuples must be understood by users
- users need to know what effect any of those commands will have on the base relations
- A2: deciding if a view is tuple-{x} able depends on expanded view definition, base table definitions, integrity constraints, and any statistical/aggregate functions in the view
- do the functions have an inverse?
- A3: “tuple-{x}able” is decided at view definition time, independent of any downstream views and ignoring actual table contents
- as in, it’s entirely a logical determination
- it also means no prohibitions are set on inspections of table contents at execution time
- A4: operator type immutability
- inserts can only be inserts, updates only updates, etc.
- so an update should never be transmuted to a delete plus an insert, for example
- A5: all views are subject to interpretation algorithms that decide the response to the commands
- only applicable to the 2nd algorithm
- operands are relations and may not have duplicate rows
- much more discipline is required if not true relations
- consequences
- A1: when the DBMS refuses an action, users have all the information to figure out why
- A2: “decision algorithm is independent of implementation”, because it’s all based on relational logic
- A3: instead of re-evaluating on every query, decide once at view definition by static analysis
- A4: it’s easier to trace actions, and produces no surprising effects - may restrict available solutions?
- the results from applying the algorithms are stored in the catalog
- non-updatable views are not rejected - they just have actions restricted
- steps
- get fully-expanded view definition
- convert it to RL
- reduce to minimum operations (without Cartesian products if possible)
- apply algorithm
- record result in catalog
- during update or insert, is the target untransformed (stored in the DB) or transformed (result of a computation)
- i.e., is the target a base relation or a view
- if it’s transformed, what were the functions used?
- if they don’t have an inverse, not component-updatable
- “back-traceable”: the target and its components can be traced back to specific rows in the base relations
- part 1: are all columns back-traceable?
- part 2: are all rows back-traceable?
- if part 2, tuple-deletable
- if part 1 and part 2, tuple-insertible
- if part 2, and relevant columns in the target are back-traceable, component-updatable
- Codd remarks that base relations and views should be indicated as fully normalised or not - if not, may have update anomalies
- a view based solely on a row filter from 1 (one) relation, without duplicates (see the CHECK OPTION sketch at the end of this chapter’s notes):
- is tuple-deletable
- is component-updatable for all backtraceable columns
- is tuple-insertible if all columns are backtraceable
- at insert, DB still needs to check that it matches the view definition
- at update, the row may be removed or rejected by other DBA-set constraints
- a view based solely on a column filter from 1 (one) relation:
- if PK is included, it’s a 1:1 row match -> tuple-insertible AND tuple-deletable
- equivalent to a row filter with no conditions
- candidate keys do NOT get the same treatment
- they’re not in the catalog so the 1:1 match can’t be directly inferred
- and their constraints can be dropped (while a PK/FK’s can’t), which would violate the inferred updatability
- possible improvement: any column with unique AND not-missing constraints gets this determination
- a view based on the join of 2 (two) relations
- if the comparands are many:many, not tuple-deletable
- deletion can’t guarantee that specific rows will be deleted
- the quad problem again
- the comparands must be candidate key, PK, or unique + not missing constrained to be tuple-deletable
- if comparands are backtraceable, tuple-insertible as well
- if the comparands are 1:many or many:1
- tuples can be inserted if the component parts from each base relation do not exist in both relations
- if the target values exist in only 1 of the relations, the relation that is missing the values gets the insert
- if both relations have the component parts, that means the row already exists
- if a quad would be generated by the join, deleting any 1 of them would result in a number that can’t be generated by the m x n rows
- the algorithms here apply to PK-PK, PK-FK, or FK-PK joins - Codd doesn’t consider non-key comparands even if they meet the conditions
- inequality joins: only greatest/least-style joins, plus the ONCE qualifier, are eligible
- natural join: same treatment as join-based views
- outer join: same as join, but allows the second-cited operand to contribute tuples padded with missing values
- relational division: none of the tuple-{x}able apply
- union
- deletion allowed if row only occurs in one of the base relations
- V1: insert/update allowed if the view is disjoint union and the row can only match integrity constraints of one of the base relations
- V2: DBA may define a function that unambiguously determines which relation to update/insert
- it’s evaluated like an integrity constraint, but at command-interpretation instead of command or transaction time
- outer union
- if insertion contains a column that doesn’t exist in the defining relation, the offending column will be dropped from the inserted tuple and the DB will throw a warning
- as in, it truncates the insert tuple to only columns that exist in the defining relation
- deletions are handled the same as in regular union
- RL difference (T = R - S)
- deletions are applied to R, since rows from S aren’t represented in T
- inserting into S would work, but violates A4 (operator immutability)
- inserts go to R only, provided it doesn’t exist in S (which causes no action)
- outer difference: same as RL difference but with type checking on differences between T and R/S
- more on normalization
- “make insert/update/delete operations clear in meaning”
- “involves breaking relations into relations of smaller degree”
- “tends to reduce performance because many more joins must often be executed”
- “the database models a micro-world”
- PK: each distinct type of object has a distinct id within that type
- table name: each table has a distinct name (table being a type of object)
- every fact in the DB is a fact about the object ID’d by that PK
- each fact is the single-valued immediate property of that object
- facts are collected together into a single relation if they share a type
- anomalies in non-normalized DBs arise from when the facts are too coupled
- an update to one type unduly affects facts of another type
- views that serve as base relations for other views should be described in the catalog
- called “conceptual relations”
- semi-insert
- base relations Suppliers (S) and Capabilities (K)
- view T = S[B = C]> K (right outer)
- insert new tuple for K’s columns into T: (s3, k8)
- if s3 !∈ S => reject
- if s3 ∈ T and the K component is missing, update this row to use the k component of the inserting tuple
- otherwise, copy the s-component from an s3 tuple and attach to the k-component of the inserting tuple
- semi-update (target s3)
- if s3 !∈ T => reject
- else, update all tuples in T that have value s3
- semi-delete/archive
- if s3 !∈ T => reject
- if s3 ∈ T exactly once, mark all target components as missing
- otherwise, remove all s3 rows except one, then mark its target components missing
- note: this chapter was important because of normalization
- the original intent of normalization is to control data change anomalies
- a consequence of splitting relations into smaller normalized parts is that seeing all the data together again requires a view
- hence the importance of supporting their updatability - according to Codd
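- for the simplest updatable case (a row filter over one relation), modern PostgreSQL behaves roughly as Codd prescribes; a sketch with a hypothetical customers table:
      CREATE VIEW ca_customers AS
          SELECT * FROM customers WHERE state = 'CA'
          WITH CHECK OPTION;  -- inserts/updates must satisfy the view's definition
      -- back-traceable: the insert lands in the base relation
      INSERT INTO ca_customers (customer_id, state) VALUES (42, 'CA');
      -- this one would be rejected by the CHECK OPTION:
      -- INSERT INTO ca_customers (customer_id, state) VALUES (43, 'NY');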
Authorization
- “who has access to what parts of the database and for what purpose”