MultiValue Lacks Value
With the advent of XML — itself of a hierarchic bent — there is effort to reposition the old “multivalue” (MV) database technology as “ahead of its time,” and the products based on it (MVDBMS) as undiscovered “diamonds in the rough.”
Now, a well-known and often expressed peeve of mine is how widespread the lack of foundation knowledge is in the IT industry. It is the conceptual and logical-physical confusion deriving from it that produced the MV technology in the first place, and is now behind current attempts at its resurgence. And there hardly is a more confused bunch than the proponents of MV technology. Anything written by them is so incomprehensible and utterly confused that it readily invokes Date’s Incoherence Principle: It is not possible to treat coherently that which is incoherent. What is more, efforts to introduce clarity and precision meet with even more fuzziness and confusion (see "No Value In MultiValue"). For anybody who believes that anything of value (pun intended) can come from such thinking, I have a bridge in Brooklyn for sale.
Notwithstanding its being called “post-relational,” MV databases and DBMSs originate in the Pick operating system invented decades ago, and are essentially throwbacks to hierarchic database technology of old. For a feel of how problematic the multivalue thinking — if it can be called that — is, consider quotes from two attempts to explain what MV technology is all about. It is always fascinating, although no longer that surprising, to see how many errors and how much confusion can be packed into short paragraphs.
The first quote is the starting paragraph in "The Innovative 3-Dimensional Data Model," an explanation of the Pick model posted on the Web site of a MV software.
“D3 significantly improves on the relational data structure by providing the ability to store all of the information that would require three separate tables in a relational database, in a single 3-dimensional file. Through the use of variable field and variable record lengths, the D3 database system uses what is called a 'post-relational" or "three-dimensional' data model. Using the same example (project reporting by state and fiscal period), a single file can be set up for a project. Values that are specific to each state are grouped logically and stored in the project record itself. In addition, the monthly budget and actual numbers can then be located in the same project definition item. There is no limit to the amount of data that can be stored in a single record using this technology … the same data that requires a multi-table relational database structure can be constructed using a single file in D3."
Comments:
- It is at best misleading, and at worst disingenuous to claim the MV data structure is an “improvement” on the relational structure. First, the hierarchic structure underlying MV precedes the relational model. And second, the relational model was invented to replace the hierarchic model (which it did), the exact opposite of the claim!
Note: In fact, if I recall correctly, the Pick operating system preceded even the first generation hierarchic DBMSs and was only later extended to database management.
- The logical-physical confusion raises its ugly head right in the first sentence of the first paragraph. Unlike a MV file, which is physical, relational tables are logical. There is nothing in the relational model — and intentionally so — to dictate how the data in tables should be physically stored and, therefore, nothing to prevent RDBMSs to store data from multiple logical tables in one physical file. And, in fact, even SQL products — which are far from true implementations of the relational model — support such features. The important difference is that while true RDBMSs (TRDBMS) insulate applications and users from the physical details, MVDBMSs do not.
- Paper representations of R-tables are two-dimensional because they are pictures of R-tables, not the real thing. A R-table with N columns is a N-dimensional representation of the real world.
- The term “post-relational” — which has yet to be precisely defined — is used in marketing contexts to obscure the non-relational nature of MV products. Neither it, nor the term “three-dimensional” have anything to do with “variable field” and “variable record length,” implementation features that can be supported by TRDBMSs. That current SQL DBMSs lack such support is not a relational, but product flaw.
- It’s the “Values that are specific to each state [that] are grouped logically” that give MV technology its name and throw into serious question whether MV technology adheres to the relational principle of single-valued columns. The purpose of this principle is practical: it avoids serious complications, and takes advantage of the sound foundations of logic and math. This should not be interpreted to mean that "single-valued" means no lists, arrays, and so on. A value can be anything and of arbitrary complexity, but it must be defined as such at the data type (domain) level, and MV products do not do that. In fact, MV files are not relational databases for a variety of reasons, so even if they adhered to the SVC principle, it wouldn’t have made a difference (for an explanation why, see the first two papers in the new commercial Database Foundation Series launched at Database Debunkings.)
The second quote is from a response by Steve VanArsdale to my two-part article, "The Dangerous Illusion: Normalization, Performance and Integrity" in DM Review)
“Multi-value has been called an evolution of the post-relational data base. It is based upon recognition of a simple truth. First considered in the original theories and mathematics surrounding relational data base rules in the 1960’s, multi-value was presumed to be inefficient in the computer systems of the time. A simplifying assumption was made that all data could be normal. Today that is being reconsidered. The simple truth is that real data is not normalized; people have more than one phone number. And they buy more than one item at a time, sometimes with more than one price and quantity. Multi-value is a data base model with a physical layout that allows systematic manipulation and presentation of messy, natural, relational, data in any form, first-normal to fifth-normal. In other words: with repeating groups in a normalized (one-key and one-key-only) table.”
VanArsdale repeats the “post-relational evolution” nonsense. He suffers from the same physical-logical confusion, distorts history to fit his arguments, and displays an utter lack of knowledge and understanding of data fundamentals.
- Some “simple truth.” Multivalue was not “first considered in the original theories and mathematics surrounding relational database rules." The relational model was invented explicitly to replace hierarchic technology, of which multivalue is one version, the latter having nothing to do with mathematics.
- VanArsdale has it backwards. It was, in fact, relational technology that was deemed inefficient at its inception by hierarchic proponents who claimed their approach had better performance. The relational model has indeed simplifying purposes, but that is an issue separate of efficiency. How can logic, which governs truth of propositions about the real world, have anything to say about the performance of hardware and software (except, of course, that via data independence, it gives complete freedom to DBMS designers and database implementers to do whatever they darn please at the physical level to maximize performance, as long as they don’t expose that level to users)?
- How we represent data logically has to do with the kind of questions we need to ask of databases — data manipulation — and with ensuring correctness (defined as consistency) via DBMS integrity enforcement. We have learned from experience that hierarchic representations complicate manipulation and integrity, completely ignored by MV proponents. What is more, such complications are unnecessary: there is nothing that can be done with hierarchic databases, that cannot be achieved with relational databases in a simpler manner. And simplicity means easier and less costly database design and administration, and fewer application development and maintenance efforts.
I have no idea what “Multi-value is a data base model with a physical layout that allows systematic manipulation and presentation of messy, natural, relational, data in any form, first-normal to fifth-normal” means: - What is a “database model”? Is it anything like a data model? If so, why use a different name?
- What “physical layout” does not allow “systematic manipulation and presentation”? And what does a physical layout have to do with the data model — any data model — employed at the logical level? Is there any impediment to relational databases implementing any physical layout that multivalue databases implement?
- Is “messy natural, relational data” a technical term? Data is not “naturally” relational or hierarchic/multivalue. Any data can be represented either way, and Occam’s Razor says the simplest one should be preferred (which is exactly what Codd’s Information Principle says.).
- Every R-table (formally, time-varying relation, or relvar) is in first normal form by definition. But multivalue logical structures are not relations, so does it make sense to speak of normal forms in general, and 1NF in particular in the MV context? (Again, see the Foundation Series.)
If I am correct, then how can multivalue proponents claim that their technology is superior to relational technology? Regarding the first quote above:
- There is no reference to integrity constraints.
- The focus is on one, relatively simple application — “project reporting by state and fiscal” — for which the hierarchic representation happens to be convenient; no consideration is given to other applications, which it likely complicates.
- What happens if and when the structure changes?
It is common for MV proponents to use as examples relatively simple and fixed logical structures, to focus on a certain type of application, and to ignore integrity altogether.
Note: This is, in fact, exactly what Oracle did when it added the special CONNECT BY clause to its version of SQL, for explode operations on tree structures. Aside from violating relational closure by producing results with duplicates and meaningful ordering, it works only for very simple trees.
Why don’t MV proponents mention integrity? You can figure that out from another reaction to my above mentioned DM Review article by Geoff Miller:
“The valid criticism of the MV structure is that the flexibility which it provides means that integrity control generally has to be done at the application level rather than the DBMS level — however, in my experience this is easily managed.”[emphasis added]
I would not vouch for flexibility (databases with a hierarchic bent like MVDBMSs are notoriously difficult to change), but be that as it may, anybody with some fundamental knowledge knows that integrity is 70 — 80 percent of database effort. Schema definition and maintenance is, in effect, nothing but specification and updating of integrity constraints, the sum total of which are a DBMSs understanding of what the database means (the internal predicate, see Practical Issues in Database Management). It follows that in the absence of integrity support, a DBMS does not know what the database means and, therefore, cannot manage it. Products failing to support the integrity function — leaving it to users in applications — are not fully functional DBMSs. That is what we used to have before we had DBMSs: files and application programs.
That MV products do not support a full integrity function is a direct implication of the hierarchic MV structure: data manipulation of hierarchic databases is very complex and, therefore, so is integrity, which is a special application of manipulation. So complex that integrity is not implemented at all, which, by the way, is one reason performance, may sometimes be better. In other words, they trade integrity for performance.
Chris Date says about hierarchic structures like MV and XML:
“Yet another problem with [hierarchies] is that it’s usually unclear as to why one hierarchy should be chosen over another. For example, why shouldn’t we nest [projects] inside [states], instead of the other way around? Note very carefully too that when the data has a “natural” hierarchic structure as — it might be argued — in the case with (e.g.) departments and employees [projects and states is not that natural], it does not follow that it should be represented hierarchically, because the hierarchic representation isn’t suitable for all of the kinds of processing that might need to be done on the data. To be specific, if we nest employees inside departments, then queries like “Get all employees in the accounting department” might be quite easy, but queries like “Get all departments that employ accountants” might be quite hard.”
So here’s what I suggest users insist on, if they want to assess MV products meaningfully. For a real-world database that has a moderately complex schema that sometimes changes, a set of integrity constraints covering the four types of constraints supported by the relational model, and multiple applications accessing data in different ways:
- Have MV proponents formulate constraints in applications and queries the MV way,
- Have relational proponents design the database and formulate the constraints in the database and queries using truly relational (not SQL!) products such as Alphora’s Dataphor data language, and/or an implementation of Required Technologies’ TransRelational Model™,
- Then, judge which approach is superior by comparing them.
To quote:
“I've been teaching myself Dataphor, a product that I learned about through your Web site! As a practice project, I've been rewriting a portion of a large Smalltalk application to use Dataphor, and I've been stunned to see just how much application code disappears when you have a DBMS that supports declarative integrity constraints. In some classes, over 90% of the methods became unnecessary.”
—David Hasegawa, "On Declarative Integrity Support and Dataphor"
Wouldn’t Miller say this is easier to manage?
References
--
Fabian Pascal has a national and international reputation as an independent technology analyst, consultant, author and lecturer specializing in data management. He was affiliated with Codd & Date and for 20 years held various analytical and management positions in the private and public sectors, has taught and lectured at the business and academic levels, and advised vendor and user organizations on data management technology, strategy and implementation. Clients include IBM, Census Bureau, CIA, Apple, Borland, Cognos, UCSF, and IRS. He is founder, editor and publisher of Database Debunkings, a Web site dedicated to dispelling persistent fallacies, flaws, myths and misconceptions prevalent in the IT industry (Chris Date is a senior contributor). Author of three books, he has published extensively in most trade publications, including DM Review, Database Programming and Design, DBMS, Byte, Infoworld and Computerworld. He is author of the contrarian columns Against the Grain, Setting Matters Straight, and for The Journal of Conceptual Modeling. His third book, Practical Issues in Database MANAGEMENT serves as text for his seminars.
Contributors : Fabian Pascal
Last modified 2005-04-12 06:21 AM