Infolinks

Tuesday 17 July 2012

Codd Rules (Edgar F.Codd)

Codd's 12 rules

From Wikipedia, the free encyclopedia


Codd's twelve rules are a set of thirteen rules (numbered zero to twelve) proposed by Edgar F. Codd, a pioneer of the relational model for databases, designed to define what is required from a database management system in order for it to be considered relational, i.e., a relational database management system (RDBMS).[1][2] They are sometimes jokingly referred to as "Codd's Twelve Commandments".
Codd produced these rules as part of a personal campaign to prevent his vision of the relational database being diluted, as database vendors scrambled in the early 1980s to repackage existing products with a relational veneer. Rule 12 was particularly designed to counter such a positioning.
Even if such repackaged non-relational products eventually gave way to SQL DBMSs, no popular "relational" DBMSs are actually relational, be it by Codd’s twelve rules or by the more formal definitions in his papers, in his books or in succeeding works in the academia or by its coworkers and successors, Christopher J. Date, Hugh Darwen, David McGoveran and Fabian Pascal. Only less known DBMSs, most of them academic, strive to comply. The only commercial example, as of December 2010, is Dataphor.
Some rules are controversial, especially rule three, due to the debate on three-valued logic.

The rules

Rule (0): The system must qualify as relational, as a database, and as a management system.
For a system to qualify as a relational database management system (RDBMS), that system must use its relational facilities (exclusively) to manage the database.
Rule 1: The information rule:
All information in a relational database (including table and column names) is represented in only one way, namely as a value in a table.
Rule 2: The guaranteed access rule:
All data must be accessible. This rule is essentially a restatement of the fundamental requirement for primary keys. It says that every individual scalar value in the database must be logically addressable by specifying the name of the containing table, the name of the containing column and the primary key value of the containing row.
Rule 3: Systematic treatment of null values:
The DBMS must allow each field to remain null (or empty). Specifically, it must support a representation of "missing information and inapplicable information" that is systematic, distinct from all regular values (for example, "distinct from zero or any other number", in the case of numeric values), and independent of data type. It is also implied that such representations must be manipulated by the DBMS in a systematic way.
Rule 4: Active online catalog based on the relational model:
The system must support an online, inline, relational catalog that is accessible to authorized users by means of their regular query language. That is, users must be able to access the database's structure (catalog) using the same query language that they use to access the database's data.
Rule 5: The comprehensive data sublanguage rule:
The system must support at least one relational language that
  1. Has a linear syntax
  2. Can be used both interactively and within application programs,
  3. Supports data definition operations (including view definitions), data manipulation operations (update as well as retrieval), security and integrity constraints, and transaction management operations (begin, commit, and rollback).
Rule 6: The view updating rule:
All views that are theoretically updatable must be updatable by the system.
Rule 7: High-level insert, update, and delete:
The system must support set-at-a-time insert, update, and delete operators. This means that data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables. This rule states that insert, update, and delete operations should be supported for any retrievable set rather than just for a single row in a single table.
Rule 8: Physical data independence:
Changes to the physical level (how the data is stored, whether in arrays or linked lists etc.) must not require a change to an application based on the structure.
Rule 9: Logical data independence:
Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application based on the structure. Logical data independence is more difficult to achieve than physical data independence.
Rule 10: Integrity independence:
Integrity constraints must be specified separately from application programs and stored in the catalog. It must be possible to change such constraints as and when appropriate without unnecessarily affecting existing applications.
Rule 11: Distribution independence:
The distribution of portions of the database to various locations should be invisible to users of the database. Existing applications should continue to operate successfully :
  1. when a distributed version of the DBMS is first introduced; and
  2. when existing distributed data are redistributed around the system.
Rule 12: The nonsubversion rule:
If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert the system, for example, bypassing a relational security or integrity constraint.

 ==========

Codd's 12 rules



Codd wrote these rules to keep database software vendors honest. Especially in the old days (1980), many products were touted as relational database management systems (RDBMS) when they weren't. Failure to adhere to these rules can corrupt data, and it always leads to confusion.
I'll try to explain practical aspects as opposed to rigorous mathematical proofs. A full and detailed explanation requires reading Codd's published works, as well as works by several other luminaries. The topic is very technical and difficult.

http://en.wikipedia.org/wiki/Codd%27s_12_rules



Many thanks to the Wikipedia. I've kept some of the Wikipedia text, and added my own notes. (I found a rumor that Rule 0 is the foundation of all the other rules, and as a summary is not apparently viewed as one of the 12 detailed rules.)



Rule 0: The system must qualify as relational, as a database, and as a management system. 

It turns out that none of our RDBMS is a "true" RDBMS by the strict definition. However nearly all SQL databases are reasonably close. The are relational in that tables are related to each other and can be queried via a "join". They are databases in that they store data in an organized structure, and are well suited to large amounts of data. They are a management systems in that they include SQL statements to "manage" database structure and to perform administrative tasks such as granting permissions. The management system aspect is crucial: the system must be administered via the same query language and facilities as the data to be considered an RDBMS.



Rule 1: The information rule. All information in the database is to be represented in one and only one way, namely by values in column positions within rows of tables.

This refers to normalization (and perhaps to a lesser extent, data typing). Basically, only put one piece of data in each field, and do not put the same data into different fields or different tables.  People are templed to do things like put a comma separated list of names into a text field. Software outside of SQL can easily parse the comma separated list into an array. We call this "de-normalization" and it is a bad idea. De-normalization is a shortcut that usually ends poorly.


Rule 2: The guaranteed access rule. All data must be accessible.

When Codd invented relational databases, the standard practice was that much of the data could not be accessed directly. It was necessary to start with some known snippet of information, and then read linking data from various (obscure) places to find the buried scrap of data. All data in SQL databases can be accessed directly, even if it usually is accessed together with data that gives it context. Restated, every row must have a unique primary key. That key might be a multi-field key, or a single field, but it must be unique. As a result, some RDBMS have a hidden, unique row id for every row.



Rule 3: Systematic treatment of null values.

Codd realized that empty fields occur, and the database must have a consistent way to handle nulls. These aren't just empty strings or the number zero. SQL does handle nulls differently from empty strings. However, handling of "missing" is not necessarily distinct from null in nearly all RDBMS. This is an important mathematical problem, but in the real world it boils down to a policy decision that has to be carefully documented, and carefully tested in the application.



Rule 4: Active online catalog based on the relational model.

There must be system table that describe the data tables. All aspects of creating and modifying the data tables properties must be manipulated via SQL. It is also bad if the system uses strange, non-SQL tricks to keep track of your data. It is worse if the system can't find your data. 



Rule 5: The comprehensive data sublanguage rule.

We use SQL as the relational language. I've heard rumors that Codd was upset about the weaknesses and limitations of SQL. In defense of SQL I'll point out that it works. Granted, SQL is nearly always used together with a normal programming language. On the plus side (and as required by Codd) SQL does support definitions (tables and rows), views, create, retrieval, update, delete, security, integrity constraints, and transactions. Transactions allow work to be done without effecting the database until a "commit" is issued at which point all the changes take effect. Transactions may also be canceled via "rollback" in which case all the work essentially never existed and the state of the data is unchanged.



Rule 6: The view updating rule. 

A view is a table that has been created dynamically by a query. Not all SQL databases have updateable views in the practical sense, although in theory the system can update any view by recreating it from scratch.


Rule 7: High-level insert, update, and delete.

"High-level" means that you have effect multiple rows from multiple tables with a single query. While obvious to modern SQL users, old products were unable to do this, and I'm sure that led to obscure code, horrible bugs, and data corruption. The system must support set-at-a-time insert, update, and delete operators. This means that data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables. This rule states that insert, update, and delete operations should be supported for any retrievable set rather than just for a single row in a single table.



Rule 8: Physical data independence. 

You can move the database to a different disk and it still works the same way. I think this refers to old products which relied on the file structure of the disk for data access.Changes to the physical level (how the data is stored, whether in arrays or linked lists etc.) must not require a change to an application based on the structure.



Rule 9: Logical data independence. 

In practice there are changes you can make to your table structures that won't effect queries, but you have to be careful. SQL is not comprehensive enough allow any logical change. There are known limits of SQL. In real life this is a challenge, but generally dealt with by common practices and certain safeguards (such as copying a table before major logical changes). I think by "logical changes" the rule refers to adding columns, changing column data types, and perhaps even adding and deleting rows. It would be very bad (and unlikely) to design a data warehouse where deleting rows from one table made a query fail. (Where "fail" is "crash". Returning no records when no records exist is not failure from the databases's point of view.)Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application based on the structure. Logical data independence is more difficult to achieve than physical data independence.




Rule 10: Integrity independence. 
SQL saves table information in other SQL tables which are together known as the catalog. These changes are made at an administrative level via SQL queries, and usually these have little or no effect on the application.




Rule 11: Distribution independence. 
To me, this is similar to Rule 8 about physical independence. In fact, some RDBMS move data around to improve performance, and the SQL is completely uneffected.The distribution of portions of the database to various locations should be invisible to users of the database. 


Rule 12: The non-subversion rule. 
The system must not have features that allow you to subvert database structure integrity. Basically, the system must not include back doors that let you cheat the system for features such as administrative privileges or data constraints. We would never want to do this, and I can't think of any SQL systems that allow it.

No comments:

Post a Comment