The Data Exchange Tail - Part 2
As I explained in Part 1, data management is about meaning. A DBMS that does not know what the data means cannot manage it, that is, protect its integrity and manipulate it. Those tasks then fall to users and their applications, which is exactly what DBMSs were invented to avoid. The meaning of data is conveyed to DBMSs in the form of logical models, which are mappings to databases of conceptual (or business) models based on some data model, a general theory of data. Communication with the DBMS is via a data language that concretizes the data model.
Is XML a data management technology? For a database specialist, this should be the first consideration. Unfortunately, as is typical of bandwagons, the "me-too's" and overnight "experts" who come out of the woodwork to make pronouncements on new fads know and understand very little about the fad itself, let alone about data fundamentals. They are "mechanics" who try to extend the fad everywhere, whether it belongs or not, just so they can be associated with it. Therefore, extreme care must be exercised in assessing XML; one must "go to the source," so to speak, to its founders, to avoid nonsense that may seem reasonable in the absence of foundation knowledge.
The Problem
Last month, I quoted Bosak and Bray's definition of the problem to be solved by XML, from their seminal article in Scientific American, as follows:
"Give people a few hints, and they can figure out the rest. They can look at this page, see some large type followed by blocks of small type and know that they are looking at the start of a magazine article. They can look at a list of groceries and see shopping instructions. They can look at some rows of numbers and understand the state of their bank account. Computers, of course, are not that smart; they need to be told exactly what things are, how they are related and how to deal with them ... although HTML is the most successful electronic-publishing language ever invented, it is superficial: in essence, it describes how a Web browser should arrange text, images and push-buttons on a page. HTML's concern with appearances makes it relatively easy to learn, but it also has its costs ... although your doctor may be able to pull up your drug reaction history on his Web browser, he cannot then e-mail it to a specialist and expect her to be able to paste the records directly into her hospital's database. Her computer would not know what to make of the information."
In other words, unlike people, who can infer meaning from the way data is presented, computerized systems cannot. Therefore, they cannot exchange data in Web pages because the HTML underlying them is a presentation technology. On the face of it, then, XML was intended for data exchange, not data management, but note that the focus is on meaning.
Data exchange requires agreement on (a) what data is to be exchanged, and (b) its physical format, which are orthogonal (independent) considerations. Suppose, for example, that a personnel management system feeds data to a payroll system. For this to work, the two departments must agree on what personnel data is to be fed (say, name, position, seniority, and so on) and the physical format in which it will be transmitted (say, ASCII delimited).
Note very carefully that when they agree on the data, the departments actually agree on a common meaning of that data. This must be the case, because the agreement derives from their own systems, which contain the two departments' logical models, within which the data must fit. Note also that once the common meaning is agreed upon, the payroll system does not need to be told "what the data is" each time data is sent to it by the personnel system. Indeed, that's the point of the upfront agreement in the first place. Thus, given an agreed meaning, data exchange requires only a physical format which, as I mentioned, is orthogonal to meaning. Any format will do, as long as it is agreed upon. Now, the industry lacks many things, but format is hardly one of them; there is a plethora of physical formats (see the Conclusion on this point) to choose from. So why invent yet another one?
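The personnel-to-payroll example can be sketched in a few lines. This is only an illustration under stated assumptions: the field list, the sample records, and the function names are hypothetical, and comma-delimited ASCII stands in for whatever format the two departments might actually agree on. The point it shows is that the agreed meaning lives outside the transmission, so each record carries only values and delimiters.

```python
import csv
import io

# Hypothetical agreement between the two departments: field names and their
# order are fixed once, up front, and are never part of a transmission.
AGREED_FIELDS = ("name", "position", "seniority_years")

def export_personnel(rows):
    """Personnel side: serialize records as comma-delimited ASCII lines."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow([row[field] for field in AGREED_FIELDS])
    return buf.getvalue()

def import_payroll(payload):
    """Payroll side: apply the agreed field order to each received line."""
    reader = csv.reader(io.StringIO(payload))
    return [dict(zip(AGREED_FIELDS, values)) for values in reader]

# Made-up sample records, for illustration only.
records = [
    {"name": "Ann Lee", "position": "Analyst", "seniority_years": "4"},
    {"name": "Raj Patel", "position": "Manager", "seniority_years": "11"},
]

payload = export_personnel(records)   # values and delimiters only
received = import_payroll(payload)    # meaning restored from the agreement
```

Swapping the comma for a tab or a fixed-width layout would change only the two functions, not the agreement, which is one way to see that format and meaning are orthogonal.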
Consider now the XML solution to the exchange problem:
"The solution, in theory, is very simple: use tags that say what the information is, not what it looks like. For example, label the parts of an order for a shirt not as boldface, paragraph, row and column -- what HTML offers -- but as price, size, quantity and color ... tags almost always come in pairs. Like parentheses, they surround the text to which they apply. And like quotation marks, tag pairs can be nested inside one another to multiple levels."
- Since the data is already agreed on, tags that "say what the data is" carry no value. For data exchange purposes the tags are no more than delimiters, just like commas, spaces, or any other marker that tells the receiving system where data values physically start and end. It follows that while XML can be used as a physical format for data exchange, it has no particular advantage over any other such format, including existing ones that could have been used.
- While XML does not confer any advantage on data exchange, as a physical format it does have drawbacks. The physical level is where performance issues can and should be legitimately considered, and that's where the XML choice does not seem sensible. Consider the following sample from an XML document:
<FILE_INFO>
  <FILENAME_VERSIONED>xlplot.zip</FILENAME_VERSIONED>
  <FILENAME_PREVIOUS>xlplot.zip</FILENAME_PREVIOUS>
  <FILENAME_GENERIC>xlplot.zip</FILENAME_GENERIC>
  <FILENAME_LONG />
  <FILE_SIZE_BYTES>2697970</FILE_SIZE_BYTES>
  <FILE_SIZE_K>2570</FILE_SIZE_K>
  <FILE_SIZE_MB>2.57</FILE_SIZE_MB>
</FILE_INFO>
<EXPIRE_INFO>
  <HAS_EXPIRE_INFO>N</HAS_EXPIRE_INFO>
  <EXPIRE_COUNT />
  <EXPIRE_BASED_ON>Days</EXPIRE_BASED_ON>
  <EXPIRE_OTHER_INFO />
  <EXPIRE_MONTH />
  <EXPIRE_DAY />
  <EXPIRE_YEAR />
</EXPIRE_INFO>
First, tags overwhelm data. Second, the tags are repeated in each and every XML document or record transmitted. And as I explained, neither the tags' content nor their repetition is necessary for data exchange. What is necessary is an agreed-upon delimiter that maximizes the transfer efficiency of the physical format.
The consequences in practice are readily predictable and have not taken long to materialize. For a reality check see, for example, "The Horror of XML." They defeat the very purpose for which XML was purportedly invented, and demonstrate the point of this paper very well.
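The proportion of markup to data can be checked directly against the FILE_INFO fragment of the sample. The following is a rough sketch rather than a benchmark: it counts characters only, it writes FILENAME_LONG as an empty element (an assumption about the original document), and the variable and function names are illustrative.

```python
import xml.etree.ElementTree as ET

# The FILE_INFO fragment from the sample. FILENAME_LONG is written as an
# empty element here, which is an assumption about the original document.
SAMPLE = (
    "<FILE_INFO>"
    "<FILENAME_VERSIONED>xlplot.zip</FILENAME_VERSIONED>"
    "<FILENAME_PREVIOUS>xlplot.zip</FILENAME_PREVIOUS>"
    "<FILENAME_GENERIC>xlplot.zip</FILENAME_GENERIC>"
    "<FILENAME_LONG />"
    "<FILE_SIZE_BYTES>2697970</FILE_SIZE_BYTES>"
    "<FILE_SIZE_K>2570</FILE_SIZE_K>"
    "<FILE_SIZE_MB>2.57</FILE_SIZE_MB>"
    "</FILE_INFO>"
)

def data_values(xml_text):
    """Return the text of each child element, with '' for empty elements."""
    root = ET.fromstring(xml_text)
    return [(child.text or "") for child in root]

values = data_values(SAMPLE)

# The same values in a comma-delimited record, with field order agreed upfront.
delimited = ",".join(values)

data_chars = sum(len(v) for v in values)   # characters of actual data
xml_chars = len(SAMPLE)                    # characters transmitted as XML
markup_chars = xml_chars - data_chars      # characters spent on tags

def transmitted_chars(record_count):
    """Characters sent for record_count records: (as XML, as delimited lines)."""
    return (record_count * xml_chars, record_count * (len(delimited) + 1))

print(f"data: {data_chars}, markup: {markup_chars}, delimited: {len(delimited)}")
```

Because the tags travel with every record, the markup cost grows linearly with the number of records sent, whereas the delimited form pays only one delimiter per value.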
Conclusion
Reacting to a previous criticism of mine, Rick Jelliffe, who claims to be a contributor to XML, took me to task:
"XML is just a nice, little low-level technique which has some nice properties at the current state of technology for transmitting data ... [it] was developed for entirely physical purposes: to provide a fairly rich and adaptable format for sending small collections of data between systems in a way that has some nice performance characteristics (readable, lo-tech, integrates with URLs, etc.) ... clearly some people do want XML for more than just for transmitting data. They do want XML Schemas to be the basic model for database systems. That particular sub-use of XML-related systems is fair target for concerns such as Mr. Pascal's, and I think it is very good to have vigorous discussion on them ... [but] to blanket condemn XML comes across a little hysterical … Opposing XML in general … is as futile as opposing the stack, or the CRC (Cyclic Redundancy Check), or PostScript to relational databases [sic] ... It would be more productive to see how XML as a technology could enhance serious relational database implementations, or how relational analysis can improve XML, rather than spreading confusion."
Points arising:
- I am the last to disagree that fighting fads is futile, but that does not make fads correct solutions to problems.
- XML simply does not have any edge (let alone a performance edge!) over any other physical format, many of which have long been available for data exchange.
- If XML is a physical exchange format, then how is it possible to "analyze it relationally", when the relational model is purely logical and has nothing to do with physical implementation details? The logical-physical confusion rears its ugly head yet again. Never fails.
It is important to point out how lack of foundation knowledge leads to confusion and inconsistency in the positions practitioners take on performance. On the one hand, they complain that relational technology, which is purely logical, causes performance problems because it "ignores the physical level"; yet on the other hand, they accept and advocate a physical format for exchange which is highly (and unnecessarily!) inefficient.
Given that XML is now blatantly claimed to be not just a data exchange format, but a data management technology, and a better substitute for relational technology to boot, it must be assessed as such. Which is precisely what the framework proposed in Part 1 is for.
Stay tuned for Part 3, "The Data Management Dog."
References
"Managing Data With XML: Forward To The Past?"
"XML Data Management: Setting Some Matters Straight, Part I"
"XML Data Management: Setting Some Matters Straight, Part II "
"XML Data Management: Setting Some Matters Straight, Part III"
"XML: Response To A Response To A Response, Part I"
"XML: Response To A Response To A Response, Part II"
--
Fabian Pascal has a national and international reputation as an independent technology analyst, consultant, author and lecturer specializing in data management. He was affiliated with Codd & Date and for 20 years held various analytical and management positions in the private and public sectors, has taught and lectured at the business and academic levels, and advised vendor and user organizations on data management technology, strategy and implementation. Clients include IBM, Census Bureau, CIA, Apple, Borland, Cognos, UCSF, IRS. He is founder, editor and publisher of Database Debunkings, a web site dedicated to dispelling persistent fallacies, flaws, myths and misconceptions prevalent in the IT industry (Chris Date is a senior contributor). Author of three books, he has published extensively in most trade publications, including DM Review, Database Programming and Design, DBMS, Byte, Infoworld and Computerworld. He is author of the contrarian columns Against the Grain, Setting Matters Straight, and for The Journal of Conceptual Modeling. His third book, Practical Issues in Database Management, serves as text for his seminars.
Contributors : Fabian Pascal
Last modified 2006-01-04 01:47 PM