W3C XML Schema: DOs and DON'Ts

Written by Kohsuke KAWAGUCHI
Translation to Japanese by R.Nanba
$Revision: 1.9 $

Introduction

It's easy to learn and use W3C XML Schemas once you know how to avoid the pitfalls. Here are some DOs. You should at least learn the following things.

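Here are some DON'Ts: avoid complex types, global attribute declarations, notation declarations, local declarations, substitution groups, and chameleon schemas.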

The fact is, you don't lose anything by following these DON'Ts, as the rest of this paper demonstrates.

Too long to remember? Then here is the one line version.

Consider W3C XML Schema as DTD + datatype + namespace

Motivation for this document

Several similar documents are already available on the web. But I discovered that those documents are written by a special kind of people: brilliant people who always drive things to the limit. They simply can't stop inventing cool tricks that even the working group members can't imagine.

XML Schema is their new favorite toy.

There has to be a different document: a document for those who use W3C XML Schema for business, for those who are at a loss as to how to use it.

So the goal of this document is to provide a set of solid guidelines about what you should do and what you shouldn't do.

I always welcome comments. If you have one, please let me know.

Why you should avoid complex types

If you don't know what a complex type is, then don't let it trouble you. Whatever small gain this functionality offers is vastly outweighed by its complexity.

Furthermore, you won't lose anything by losing complex types; the fact is, if a schema can be written by using complex types, then you can always write it without complex types.

To be precise, you can always write it without understanding complex types, but unfortunately you have to type <complexType> elements.

So what you should do is to consider a <complexType> as something you have to write as a sole child of the <element> element. That is, you write element declarations as follows:



<xs:element name="head">
  <xs:complexType>    <!-- consider this as a placeholder -->

    <!-- define content model by using model groups. -->
    ...

    <!-- then refer to attribute groups -->
    <xs:attributeGroup ref="head.attributes" />

  </xs:complexType>
</xs:element>

So why spend your precious time learning something you don't need?

Convinced? Then there is no need to read more.

In short, a complex type is a model group plus inheritance minus ease of use. A complex type and a model group are siblings in the sense that they are used to define content models. A complex type lacks ease of use because you can't use it from other complex types or model groups. On the other hand, model groups can be used without such a restriction.

"Inheritance" is the only advantage that a complex type has. So let me explain why you don't want to use inheritance. There are two types of inheritance: specifically, extension and restriction.

Extension allows you to append additional elements after the content model of the base type. So the following model group reproduces the semantics of the extension.



<xs:group name="extendedType">
  <xs:sequence>
    <xs:group ref="baseType"/>

    <!-- append things that you want -->
    ....
  </xs:sequence>
</xs:group>

Restriction allows you to restrict the content model of the base type. But even if you use this functionality, you still have to write the whole content model of the new type. Basically you type the same thing whether you use a complex type or a model group.

So, what do you get by using the restriction? The only thing you get is error checking. Validators are supposed to report an error if you fail to make a content model a restricted one.

But unfortunately, this is hardly an advantage.

First, it is a tough job for validators to strictly enforce this check. You can have a look at the part of the spec that defines this constraint. The entire section 3.9.6 is devoted to specifying what is allowed and what is not.
And you should know that there exists a strong temptation for developers to skip the enforcement of this constraint because most people won't notice that the check is skipped. At the time of this writing, no validators are known to strictly enforce this constraint.

So it is highly likely that your validator is not capable of fully enforcing this constraint. That takes away the only advantage of restriction.

Second, even if you write the restriction correctly, you may get an error from your validator. Consider the following example:



Base type:
<xs:all>
  <xs:element name="a" />
  <xs:element name="b" />
  <xs:element name="c" minOccurs="0" />
</xs:all>

New type derived by restriction:
<xs:all>
  <xs:element name="b" />
  <xs:element name="a" />
</xs:all>

The latter looks like a proper restriction of the former. In fact, every content model that is accepted by the new type is also accepted by the base type. But W3C XML Schema prohibits this. Specifically, the above derivation violates "schema component constraint: particle derivation OK (all:all, sequence:sequence -- recurse)". This is just the tip of the iceberg. If you are interested in this issue, consult the last page of MSL.

None of the above problems occurs if you use model groups instead of complex types.
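
As a rough sketch of what that looks like, the two content models above can be written as two independent model groups. The group names base.content and narrow.content are made up for this illustration, and the default namespace declaration is an assumption added so that the unprefixed references resolve. Since no derivation is declared between the two groups, no restriction check applies to them.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      targetNamespace="http://example.com"
      xmlns="http://example.com">
  <xs:element name="a" />
  <xs:element name="b" />
  <xs:element name="c" />

  <!-- corresponds to the base type -->
  <xs:group name="base.content">
    <xs:all>
      <xs:element ref="a" />
      <xs:element ref="b" />
      <xs:element ref="c" minOccurs="0" />
    </xs:all>
  </xs:group>

  <!-- corresponds to the new type; written out in full, in any order -->
  <xs:group name="narrow.content">
    <xs:all>
      <xs:element ref="b" />
      <xs:element ref="a" />
    </xs:all>
  </xs:group>
</xs:schema>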

When it comes to derivation by restriction, a general understanding is not enough; you have to have a very detailed understanding of how it works.

Why you should avoid attribute declarations

To be precise, what you should avoid is global attribute declarations and not local attribute declarations. The following is an example of a global attribute declaration.



<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      targetNamespace="http://example.com"
      xmlns="http://example.com">
  <!-- attribute whose name is foo -->
  <xs:attribute name="foo" type="xs:float" />

  <xs:element name="root">
    <xs:complexType>
      <!-- the default namespace lets ref="foo" resolve to the declaration above -->
      <xs:attribute ref="foo" />
    </xs:complexType>
  </xs:element>
</xs:schema>

The fact is, this schema does not accept the following instance.


<root xmlns="http://example.com" foo="5.12"/>

Instead, it accepts the following instance, which is probably not what you want.


<root xmlns="http://example.com" ns:foo="5.12" xmlns:ns="http://example.com" />

Attribute groups do not have this problem. So instead of using an attribute declaration, you should use an attribute group.



<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      targetNamespace="http://example.com"
      xmlns="http://example.com">
  <xs:attributeGroup name="root.attributes">
    <!-- attribute whose name is foo -->
    <xs:attribute name="foo" type="xs:float" />
  </xs:attributeGroup>

  <xs:element name="root">
    <xs:complexType>
      <!-- content model -->
      ....

      <!-- the default namespace lets this reference resolve to the group above -->
      <xs:attributeGroup ref="root.attributes" />
    </xs:complexType>
  </xs:element>
</xs:schema>

An attribute group can refer to other attribute groups. In this way, you can write common attributes in one attribute group, then refer to it from others.
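
A minimal sketch of that pattern follows. The group name common.attributes and its id and lang attributes are hypothetical, chosen only for illustration; the default namespace declaration is assumed so that the unprefixed reference resolves.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      targetNamespace="http://example.com"
      xmlns="http://example.com">
  <!-- hypothetical attributes shared by several elements -->
  <xs:attributeGroup name="common.attributes">
    <xs:attribute name="id" type="xs:ID" />
    <xs:attribute name="lang" type="xs:language" />
  </xs:attributeGroup>

  <xs:attributeGroup name="root.attributes">
    <xs:attribute name="foo" type="xs:float" />
    <!-- pull in the shared attributes -->
    <xs:attributeGroup ref="common.attributes" />
  </xs:attributeGroup>
</xs:schema>

Any element that refers to root.attributes then accepts foo, id, and lang, all as unqualified attributes.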

Why you should avoid notation declarations

If you haven't heard about notations, then please be assured that you are not losing anything. Notations are there only for backward compatibility. There is no need to learn them.

If you do know notations, then you should know that notations in W3C XML Schema are not compatible with notations in DTD because, in W3C XML Schema, a notation name is a QName.

The following example is from the spec.



<xs:notation name="jpeg"
             public="image/jpeg" system="viewer.exe" />

<xs:element name="picture">
 <xs:complexType>
  <xs:simpleContent>
   <xs:extension base="xs:hexBinary">
    <xs:attribute name="pictype">
     <xs:simpleType>
      <xs:restriction base="xs:NOTATION">
       <xs:enumeration value="jpeg"/>
       <xs:enumeration value="png"/>
       . . .
      </xs:restriction>
     </xs:simpleType>
    </xs:attribute>
   </xs:extension>
  </xs:simpleContent>
 </xs:complexType>
</xs:element>



<picture pictype="jpeg">...</picture>

This example is OK. But the following fragment is not valid even if the prefix "pic" is properly declared.



<pic:picture pictype="jpeg"> ... </pic:picture>

Confused? You have to write it as follows because it is a QName.



<pic:picture pictype="pic:jpeg"> ... </pic:picture>

Apparently it fails to serve its only raison d'etre.

There is really no reason to stick to notations. Notations are for SGML.

Why you should avoid local declarations

W3C XML Schema allows you to declare elements inside another element:



<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      targetNamespace="http://example.com">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="familyName" type="xs:string" />
        <xs:element name="firstName" type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

But generally you should avoid this if possible because the above schema does not match the following instance:



<person xmlns="http://example.com">
  <familyName> KAWAGUCHI </familyName>
  <firstName> Kohsuke </firstName>
</person>

Instead, you have to write it as:



<foo:person xmlns:foo="http://example.com">
  <familyName> KAWAGUCHI </familyName>
  <firstName> Kohsuke </firstName>
</foo:person>

Not only does this require more typing, it is also a bad use of XML namespace. To avoid this problem, you should write:



<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      targetNamespace="http://example.com"
      xmlns="http://example.com">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <!-- the default namespace lets these references resolve -->
        <xs:element ref="familyName" />
        <xs:element ref="firstName" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="familyName" type="xs:string"/>
  <xs:element name="firstName" type="xs:string"/>
</xs:schema>

Another way to solve the problem is to blindly add elementFormDefault="qualified" to the <schema> element. In this way, you can safely use local element declarations.

But it probably isn't worth the effort to understand exactly what this means. Just understand that it makes the schema behave in the "right" way.
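
For illustration, here is the earlier person schema with only that attribute added (a sketch; everything else is unchanged from the local-declaration version above). With it, the local familyName and firstName elements belong to the target namespace, so the instance that uses xmlns="http://example.com" on <person> is accepted.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      targetNamespace="http://example.com"
      elementFormDefault="qualified">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <!-- local declarations, now qualified by the target namespace -->
        <xs:element name="familyName" type="xs:string" />
        <xs:element name="firstName" type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>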

Why you should avoid substitution groups

In short, the complexity of this functionality is too much to be practical. The main difficulties are:

Simply put, a substitution group is another way to write a <choice>. So you can always use a <choice> instead of a substitution group. And <choice> is necessary anyway.
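
As a sketch of the <choice> approach, the model group below plays the role that a substitution group head would otherwise play. The names block, para, list, and body are hypothetical, and the default namespace declaration is assumed so that the unprefixed references resolve.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      targetNamespace="http://example.com"
      xmlns="http://example.com">
  <!-- lists every element allowed in this position -->
  <xs:group name="block">
    <xs:choice>
      <xs:element ref="para" />
      <xs:element ref="list" />
    </xs:choice>
  </xs:group>

  <xs:element name="para" type="xs:string" />
  <xs:element name="list" type="xs:string" />

  <xs:element name="body">
    <xs:complexType>
      <xs:sequence>
        <!-- any number of para or list elements, in any order -->
        <xs:group ref="block" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Allowing a new element in that position means adding one line to the choice, and the content models of para and list need not be related by type derivation.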

To use substitution groups properly, first you have to learn complex types, then several additional attributes, rules to use them, and finally the effect of using them. Even if you manage to get through this brave new world, your document authors still need to follow the same path all over again because otherwise they can't write documents properly. What a pity.

If you still think you want to use substitution groups, then let me show you it's not as easy as you think.

Firstly, the content models of substitution group members must be related to each other by type derivation. That means you cannot write their content models freely. Soon you'll find yourself writing an abstract element as a substitution group head with a strange content model, just to maintain proper derivations between members. That's not right.


Secondly, attributes to control the substitution behavior are difficult to use and understand. There is an attribute called block, which is one of the attributes you use to control the substitution group. There is another attribute called final, which basically takes one of "extension", "restriction", or "#all" as its value.

final may look irrelevant to the substitution group, but the truth is that it's internally called "substitution group exclusions" and, as its name suggests, it controls the behavior of the substitution group. Do you know what the internal name for the block attribute is? It's "disallowed substitutions". Having trouble understanding the difference? Yeah, me too. Actually, both are used to control the substitution behavior, but in different ways.

For example, the only way to prohibit the substitution of element Y by another element Z is to add block="substitution" to Y. But even in the presence of this attribute, it is NOT an error to have Z in the substitution group of Y. It's just that you can never substitute Y with Z in your documents.

Even worse, if Y designates yet another element X as its substitution group head ( X <- Y <- Z ), then it is OK to substitute X with Z.
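
The situation just described can be sketched as follows. The element names X, Y, and Z come from the text; the xs:string types, the root element, and the default namespace declaration are assumptions added to make the fragment complete.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
      targetNamespace="http://example.com"
      xmlns="http://example.com">
  <xs:element name="X" type="xs:string" />

  <!-- block="substitution" stops other elements from standing in for Y -->
  <xs:element name="Y" type="xs:string"
      substitutionGroup="X" block="substitution" />

  <!-- naming Y as the head is still not an error -->
  <xs:element name="Z" type="xs:string" substitutionGroup="Y" />

  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="X" />  <!-- as the text describes, Z may appear here -->
        <xs:element ref="Y" />  <!-- but Z may not appear here -->
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>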

All these things make it impractical to use a substitution group in the real world, although it may look harmless when you are experimenting. And that's why you should avoid it.

Why you should avoid a chameleon schema

W3C XML Schema allows a <schema> element without the targetNamespace attribute, and some people call such schemas chameleon schemas.
Why they are called "chameleon" is irrelevant; what you should know is that you should avoid them.

One reason is that it is highly likely that validators will have interoperability problems here.

Another reason is that some people like to invent cool tricks by using a chameleon schema. But don't be fooled by those tricks; they are for schema hackers, not for ordinary good citizens.

Unfortunately, if you want to know exactly why you should avoid them, then you have to learn what they are.

Consider the following chameleon schema



<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- note that targetNamespace attribute is absent. -->

  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="familyName" />
        <xs:element ref="firstName" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="familyName" type="xs:string"/>
  <xs:element name="firstName" type="xs:string"/>
</xs:schema>

Then you write another schema file and include the above by using the include element.



<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.com">

  <xs:include schemaLocation="above.xsd" />

  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="person" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

It seems OK, but actually it's not. Look at the line <xs:element ref="familyName" /> in the chameleon schema. It looks like a reference to the familyName element. But it's wrong.
Since this chameleon schema is included by a schema with targetNamespace="http://example.com", the familyName element is in this namespace. So to refer to this declaration, you have to rewrite that line to



<xs:element ref="bp:familyName" xmlns:bp="http://example.com" />

Now what happens if you want to reuse this chameleon schema from a schema whose target namespace is http://www.foo.com? The answer is you can't.

As you can see, the only merit of the chameleon schema is gone.

Even worse, you can't detect this error in some validators because they think that those missing components may appear afterward.

Conclusion

As you can see, there are many pitfalls that should be avoided. But avoiding those pitfalls will actually make your life easier, because you have less to learn. You don't even lose the expressiveness of W3C XML Schema.

So, keep it simple and have a happy life!