Content-typing XHTML
Introduction
The content of documents on the WWW is important - not only in the traditional sense, but also because a user agent (UA) needs to understand the content from a technical point of view.
It is a common belief that the file extension - typically the last three letters of a filename or URI - is used to identify the content of the file or resource. This practice was used by many older systems, but is not applicable on the web. There, the type of content is identified as laid down in the HTTP specification, specifically by the Content-Type header.
The single most used such content type is text/html. This value lets a web browser or other user-agent know that the content following is HTML of some sort, and that the browser should attempt to present it to the user in a way she can handle.
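To make the point about headers versus file extensions concrete, here is a minimal Python sketch. It assumes a made-up URI and header value (neither comes from a real server) and uses the standard library's email.message.Message to pick the declared media type and charset out of a Content-Type value.

```python
from email.message import Message
import posixpath


def declared_type(content_type_header):
    """Return (media type, charset) as declared by a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    # Both values come back lower-cased; charset is None when absent.
    return msg.get_content_type(), msg.get_content_charset()


# Hypothetical response: the URI ends in ".php", yet the header declares HTML.
uri = "http://example.org/index.php"
header = "text/html; charset=ISO-8859-1"

print(posixpath.splitext(uri)[1])  # the extension, which tells the UA nothing
print(declared_type(header))       # what the UA is actually told
```

A user-agent that follows HTTP acts on the second value and ignores the first.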
XHTML is quite a different beast.
Please note that for the rest of this document the key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as described in RFC 2119.
XHTML 1.0
XHTML 1.0 can be regarded as a transitional stage between HTML 4.01 and XHTML 1.1.
Since the 1.0 version is designed to be compatible with HTML 4.01, a content type of text/html MAY [1] be used to identify the content - but only if the XHTML document is written according to the HTML Compatibility Guidelines.
For all other documents claiming to be XHTML 1.0, a content type of application/xhtml+xml MUST be used.
XHTML 1.1
For the 1.1 version of XHTML the specifications are clear: text/html SHOULD NOT be used. The XHTML "native" content type of application/xhtml+xml SHOULD be used, whilst the generic XML content type of application/xml MAY be used.
This soup of content types leaves us with a fairly clear course of action for 1.1: if we use XHTML 1.1 [2], we should serve application/xhtml+xml as the content type.
The horror! The horror! [3]
You've guessed it. Serving up a document containing what is to all intents and purposes XHTML as text/html simply means that browsers will jump into error-correcting mode and deal with it. Serving the same document as application/xhtml+xml will, in most cases, present the user with a download dialogue of some sort.
This is undesirable. Sadly there is no way out offered by the specifications, so we'll have to roll our own.
mod_rewrite trickery
Running a modern version of the Apache web server gives you the possibility of voodoo - the mod_rewrite type. What we want to do is make sure that only those browsers that can handle XHTML documents get the content type identifying it as such.
HTTP gives us the ability to do this. Most user-agents will [4] send the Accept [5] request header, which
"... can be used to specify certain media types which are acceptable for the response. Accept headers can be used to indicate that the request is specifically limited to a small set of desired types, as in the case of a request for an in-line image."
This is called content negotiation and is exactly what we want. User-agents which believe, correctly or incorrectly, that they can handle XHTML should include the string application/xhtml+xml in their list of accepted content types.
If they do, we can use this fact to both comply with the specification and avoid making a mess of other browsers' handling of our pages by invoking the following magic:
RewriteEngine On
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml
RewriteCond %{HTTP_ACCEPT} !application/xhtml\+xml\s*;\s*q=0
RewriteCond %{REQUEST_URI} \.html$
RewriteCond %{THE_REQUEST} HTTP/1\.1
RewriteRule .* - "[T=application/xhtml+xml; charset=ISO-8859-1]"
The above could be placed in the Apache main configuration file, in a separate file included by the main configuration, or even in a .htaccess file.
Written in more humane language, the rule works as follows:
Turn the rewrite engine on;
IF HTTP_ACCEPT contains the string "application/xhtml+xml"
AND HTTP_ACCEPT does not match "application/xhtml\+xml\s*;\s*q=0"
AND REQUEST_URI ends in ".html"
AND THE_REQUEST is an HTTP/1.1 request
THEN change the content type sent to "application/xhtml+xml; charset=ISO-8859-1"
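Please note that the above algorithm does not really take the q parameter into consideration -- and it really, really should. Its only check is the crude "q=0" pattern, which also matches values such as q=0.9, so a browser that lists application/xhtml+xml immediately followed by any quality value below 1 falls back to text/html.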
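For comparison, here is a rough Python sketch of what taking the q parameter into account could look like. It is not what mod_rewrite does and not a complete RFC 2616 implementation. The function names are made up for illustration, and two choices are my own assumptions: only an explicit application/xhtml+xml entry counts (wildcards such as */* do not), and a tie between XHTML and HTML goes to XHTML.

```python
def parse_accept(header):
    """Return {media type: q} for an Accept header value (simplified)."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # RFC 2616: a missing q parameter means q=1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    q = 0.0  # assumption: a malformed q means "not acceptable"
                break
        prefs[media_type] = q
    return prefs


def choose_content_type(accept_header):
    """Pick application/xhtml+xml only when the browser explicitly asks for it."""
    prefs = parse_accept(accept_header or "")
    # Assumption: wildcards never count as support for XHTML.
    xhtml_q = prefs.get("application/xhtml+xml", 0.0)
    # For HTML, fall back to the wildcard entries if text/html is not listed.
    html_q = prefs.get("text/html", prefs.get("text/*", prefs.get("*/*", 0.0)))
    # Assumption: on a tie, prefer XHTML.
    if xhtml_q > 0 and xhtml_q >= html_q:
        return "application/xhtml+xml"
    return "text/html"


print(choose_content_type("text/html,application/xhtml+xml;q=0.9"))
print(choose_content_type("application/xhtml+xml,text/html;q=0.9,*/*;q=0.5"))
```

With this logic a browser that merely prefers HTML over XHTML still gets text/html, while one that rates XHTML at least as high as HTML gets application/xhtml+xml.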
I know, I know ...
This, as I am painfully aware, does not solve the problem that any XHTML 1.1 content is served up as text/html to browsers that don't happen to understand application/xhtml+xml.
Technically speaking this is in grave violation of the specification.
Ignoring, for a moment, that most things these days violate one specification or other, methods exist to solve the problem by alternative means. Content could be stored in the XHTML format on the server and - since XHTML is just another XML-based language - converted to plain HTML on the fly. With heavy caching this might not even be painful.
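As a very rough idea of what such an on-the-fly conversion might involve, here is a Python sketch using only the standard library. It assumes the stored document is well-formed and uses no named entities such as &nbsp;, since the parser does not read the DTD. It also ignores the DOCTYPE and any caching. The function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def xhtml_to_html(xhtml_source):
    """Parse XHTML as XML and re-serialise it using HTML syntax rules."""
    root = ET.fromstring(xhtml_source)
    for element in root.iter():
        # Drop the XHTML namespace so plain tag names are written out.
        if isinstance(element.tag, str) and element.tag.startswith(XHTML_NS):
            element.tag = element.tag[len(XHTML_NS):]
    # method="html" writes empty elements such as <br> without a closing tag.
    return ET.tostring(root, encoding="unicode", method="html")


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>T</title></head>"
    "<body><p>Hi<br /></p></body></html>"
)
print(xhtml_to_html(sample))
```

A real setup would also need to emit a suitable HTML DOCTYPE and cope with entities, which this sketch leaves out.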
The method outlined in this document, however, really does no harm. It will get the job done without too much pain, and saves the author from wrapping each page in a server-side script that inspects the Accept header and decides what to send.
But first and foremost: it does no harm.
References
- XHTML Media Types: http://www.w3.org/TR/xhtml-media-types/
- The HyperText Transfer Protocol: http://www.w3.org/Protocols/rfc2616/rfc2616.html
- The mod_rewrite reference documentation: http://httpd.apache.org/docs/mod/mod_rewrite.html
- Key words for use in RFCs to Indicate Requirement Levels: http://www.rfc-editor.org/rfc/rfc2119.txt
[1] For a definition of the word MAY in this context, please refer to RFC 2119.
[2] ... and I did.
[3] From Joseph Conrad's Heart of Darkness via Francis Ford Coppola's Apocalypse Now and finally to Genndy Tartakovsky's Dexter's Laboratory, illustrating how a good quote bubbles up the tree of culture until it falls off and bashes its head in on a rock.
[4] ... in a perfect world.
[5] http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.1