XML is supposed to be a universal syntax for representing information. It is poor at this task because of three fundamental limitations of the syntax. ConciseXML resolves these problems in a backward-compatible way.
The worst problem with XML is that the syntax does not help you distinguish between the name of a part and a "type". Because of XML’s limitations, workarounds are convoluted and confusing. Most people who have some experience with XML don’t recognize this problem, which compounds it further.
First, some terminology. Here’s a typical statement in XML:
<boat color="white" size="23"> <sail area="47"/> </boat>
Using XML terminology: this is an XML element. "boat" is the tag name, "color" is an attribute name, and "white" is an attribute value. The content of the element is <sail area="47"/>.
Using English terminology, we’ve got a boat with two properties [color and size] and one part, a sail. From object-oriented experience we know that the difference between a "property" and a "part" often isn’t so important. Usually both are represented by the same construct, a field in an object. Sometimes it is ambiguous whether something is a property or a part: is color really the "paint part" of the object? Happily, most of the time this doesn’t really matter and our programs don’t need to delve into Knowledge Representation theory. We just use them [a property or a part] how we want to in the domain of our program and everything works out OK.
So if we think the difference between a part and a property doesn’t really matter then we ought to use the same syntactic construct to describe each, so we might write our above structure as:
<boat color="white" size="23" sail=<sail area="47"/> > </boat>
or we might have a complex object for the color, instead of a string, say:
<boat color=<paint "white" "marine_finish"/> size="23" />
Unfortunately, XML permits only strings as values of attributes. So you might hope we could write:
<boat color="white" size="23" sail="<sail area="47"/>" > </boat>
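Even if we could "escape" the double quotes around 47, we still can’t do this, since attribute values must be strings WHICH DO NOT CONTAIN A LITERAL "<". Escaping the bracket as &lt; doesn’t help either: the value is then just a string of characters, not a sail element.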
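To see this restriction in practice, here is a minimal sketch using Python's standard xml.etree.ElementTree parser. The helper name try_parse is made up for illustration, and the inner quotes around 47 are switched to single quotes so that only the angle bracket is at issue.

```python
import xml.etree.ElementTree as ET

def try_parse(text):
    """Return the parsed root element, or None if the text is not well-formed XML."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None

# A literal "<" inside an attribute value is rejected by the parser.
nested = """<boat color="white" size="23" sail="<sail area='47'/>"></boat>"""
rejected = try_parse(nested)

# Escaping the markup makes the document well-formed,
# but the attribute value is then plain text, not a sail element.
escaped = """<boat color="white" size="23" sail="&lt;sail area='47'/&gt;"></boat>"""
accepted = try_parse(escaped)
sail_value = accepted.get("sail")

print(rejected)                  # None: the first form does not parse
print(type(sail_value).__name__) # the second form yields only a string
```

The escaped form parses, but the boat has no child element and no sail object anywhere; the program would have to parse the attribute string a second time to get one.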
OK so we’re forced to use our original syntax. Now let’s look a little more closely at it.
Observe that "<boat>" makes us an instance of a "boat". So "boat" is really the name of a class, or a type.
Well, using that same logic, what’s "<sail/>"? It must be creating a sail object, so "sail" is the name of a class. But if sail is the name of a class, then it can’t be the "part name" of something within boat. Yet people use just this syntax to name parts in XML all the time, because they can’t use the foo="bar" syntax for naming their part.
Now sometimes people try to get around this by doing:
<boat color="white" size="23"> <sail name="mainsail" area="47"/> </boat>
We’ve now got a clear "name of part within boat", right? Well, not really, because we are using our attribute syntax for the part name. Although we MIGHT want the name of something to be an attribute of the thing, for the most part it’s better to have the part name OUTSIDE of the thing itself. After all, the thing might be the "mainsail" part of the boat, but at night we use it as the "top sheet" part on the bed. It is still the same thing, so we can’t change it just because we’re using it differently.
If you embed a "name" in a part, then when you go to find the part of the boat named mainsail, you essentially have to do a content-addressable lookup, searching the parts for the one whose name matches. That is slower, first of all, but more important is the inability to share this object in other places, as is normal in object-oriented programming.
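The contrast can be sketched in Python. Everything here is illustrative: the "jib" sail, the function find_part_by_embedded_name, and the classes Sail, Boat and Bed are assumptions invented for the example, not part of any XML API.

```python
import xml.etree.ElementTree as ET

# Part name embedded in the part itself (the hypothetical "jib" is added
# only so that there is something to search through).
boat_xml = ET.fromstring(
    '<boat color="white" size="23">'
    '<sail name="jib" area="20"/>'
    '<sail name="mainsail" area="47"/>'
    '</boat>'
)

def find_part_by_embedded_name(parent, name):
    """Scan the children one by one until one carries the requested name."""
    for child in parent:
        if child.get("name") == name:
            return child
    return None

mainsail_element = find_part_by_embedded_name(boat_xml, "mainsail")

# Part name kept OUTSIDE the part: the name is a field of the owner.
class Sail:
    def __init__(self, area):
        self.area = area

class Boat:
    def __init__(self, mainsail):
        self.mainsail = mainsail

class Bed:
    def __init__(self, top_sheet):
        self.top_sheet = top_sheet

cloth = Sail(area=47)
boat = Boat(mainsail=cloth)   # the same object, called "mainsail" here
bed = Bed(top_sheet=cloth)    # and "top_sheet" here, without modifying it
```

With the name outside the part, looking up boat.mainsail is a direct field access, and the one Sail object is shared by the boat and the bed under two different part names.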
You may observe another problem with our example. We’ve encoded the size attribute as a string "23". But we all know it is really a number. Well what if we WANT the size to sometimes actually be a string of digits, and other times actually be a number? We can define our schema one way or the other but we can’t have it both ways as we can in an object system that permits not just fields that can hold several different types of objects, but also a syntax like "23" for strings and 23 for numbers.
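A small Python sketch of the difference, assuming the standard xml.etree.ElementTree parser; the class BoatObject is a made-up stand-in for "an object system" and not anything defined by XML.

```python
import xml.etree.ElementTree as ET

# Whatever we meant, the parser hands back the attribute value as a string.
boat_element = ET.fromstring('<boat color="white" size="23"/>')
size_from_xml = boat_element.get("size")

# In an object system the same field can hold a number or a string,
# and the literal syntax says which one we meant.
class BoatObject:
    def __init__(self, color, size):
        self.color = color
        self.size = size

numeric_boat = BoatObject(color="white", size=23)    # a number
textual_boat = BoatObject(color="white", size="23")  # a string of digits
```

Without a schema or a convention agreed on outside the document, the XML reader cannot tell which of the two the author intended.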
In the XML community all this hair doesn’t seem to bother anyone. They make up objects like:
<address> <street><some_street_obj/></street> </address>
and everyone’s happy since "street" functions as both a type and a part name. Then someone wants to add a "cross street" part to our element. Suppose we represent it like so:
<address ..> <street><some_street_obj/></street> <cross_street><some_other_street_obj/></cross_street> </address>
Now we’ve got a problem, because the value of our part named cross_street is an object of type "street", but we’ve got some other PART named "street".
XML fails to make clear the distinction between part-names and types. This ambiguity is deadly for any kind of reasonably complex information.
The most common and most important kind of computer data is code. So we might represent a function call in XML like so:
<launch_boat when="3:14PM" where="LA"/>
OK that’s not so bad [until we have non-string arguments that is, see the above problem].
But programmers like to be terse and don’t like to have to type in keywords all the time, so they generally prefer syntaxes that allow passing of arguments by position such as:
<launch_boat "3:14PM" "LA"/>
The arguments are distinguished by their order. But XML doesn’t allow such syntax. You must put in the name of the attribute. We could do something like:
<launch_boat> <arg>"3:14PM"</arg> <arg>"LA"</arg> </launch_boat>
but that’s even more verbose than mentioning the keywords in the first place.
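For comparison, here is a rough Python sketch of the two XML encodings driving the same call. The function launch_boat and its return string are invented for illustration; only the XML shapes come from the examples above.

```python
import xml.etree.ElementTree as ET

def launch_boat(when, where):
    """Hypothetical function standing in for the call being encoded."""
    return f"launching at {when} from {where}"

# Keyword form: the attribute names map directly onto parameter names.
keyword_xml = '<launch_boat when="3:14PM" where="LA"/>'
keyword_call = ET.fromstring(keyword_xml)
keyword_result = launch_boat(**keyword_call.attrib)

# Positional workaround: one <arg> child per argument, order is significant.
positional_xml = '<launch_boat> <arg>"3:14PM"</arg> <arg>"LA"</arg> </launch_boat>'
positional_call = ET.fromstring(positional_xml)
# The quotes inside <arg> are ordinary text in XML, so strip them by hand.
positional_args = [arg.text.strip('"') for arg in positional_call.findall("arg")]
positional_result = launch_boat(*positional_args)
```

Both forms end up making the same call, but the positional one costs more characters than the keyword one it was meant to shorten, whereas the native call is simply launch_boat("3:14PM", "LA").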
The lack of by-position arguments and the lack of elements as attribute values make XML a lousy syntax for the most important kind of data on a computer.
When an element has content, you must close it with an ending tag that mentions the tag name again, as in our top example:
<boat color="white" size="23"> <sail area="47" /> </boat>
Mentioning boat a second time is redundant. Worse, if you want to change the name to, say, "sailboat", you’ve got to change it in two places, making it likely that you’ll get the two out of sync and cause an error. In situations where an ending tag is a long way from the beginning tag, it is often clearer to name the ending tag. But for short expressions, like those typical of the most important kind of computer data, long ending tags are more a burden than an aid.
Long ending tags were designed to make the XML more readable. However, because you’ve got to use the name of a tag twice every time you use it with content, developers will be reluctant to give tags long descriptive names in the first place, thus rendering their XML less readable.
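The out-of-sync failure is easy to reproduce. This is a minimal sketch with Python's standard xml.etree.ElementTree parser; the helper is_well_formed is a name made up for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Report whether the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

original = '<boat color="white" size="23"> <sail area="47" /> </boat>'

# Rename only the start tag, as happens when the ending tag is forgotten.
half_renamed = original.replace("<boat", "<sailboat", 1)

# The rename is complete only once the ending tag is changed as well.
fully_renamed = half_renamed.replace("</boat>", "</sailboat>")

print(is_well_formed(original))       # parses
print(is_well_formed(half_renamed))   # rejected: start and end tags differ
print(is_well_formed(fully_renamed))  # parses again
```

One conceptual change needs two textual edits, and the parser rejects the document until both are made.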
This error-prone and verbose syntax conspires to make it more difficult to produce syntactically correct information that is easily readable by humans.
Water solves these problems with XML 1.0 by using a concise form of XML called ConciseXML.