Formatted text in XML files
XML files are a relatively easy format to store simple user content in. The files are human readable (and editable), can be opened in any text editor, and many parsers are available that will parse your custom XML files.
But how do you tell the parser which tags are valid? Two options are available: a Document Type Definition (DTD) or an XML Schema.
As per usual, w3schools provides solid tutorials on both. My favourite method is using DTDs, and that's the one I'll be describing here.
An example
The DTD file shown below defines a very simple article structure:
<!ELEMENT title (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT content (#PCDATA)>
<!ELEMENT article (title, description, content)>
<!ATTLIST article publish (true|false) "true">
Each XML element is defined using an ELEMENT declaration that describes the element's contents. If an element can contain any kind of typed text but no other XML elements, its content is said to be 'parsed character data', or #PCDATA. If an element must contain a strictly ordered set of child elements, this is written as in the example: (title, description, content). An article must start with a 'title', continue with a 'description' and finish off with a 'content' element. Alternatives are separated by the or sign '|', and a regex-like system of stars and plus signs can be used to indicate multiple occurrences of items (for example <!ELEMENT lunch (fish|chips)*>).
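The sequence, choice and repetition rules described above behave much like a regular expression over the list of child element names. The following is only a rough Python sketch of that analogy, not how a real validating parser works: the helper `matches` and the two hand-written patterns are made up for illustration.

```python
import re
from typing import Sequence

def matches(model_regex: str, children: Sequence[str]) -> bool:
    """Check a list of child element names against a regex version of a content model."""
    # Every name gets a trailing space so that names cannot run together.
    sequence = "".join(name + " " for name in children)
    return re.fullmatch(model_regex, sequence) is not None

# (title, description, content): exactly these three children, in this order.
ARTICLE = r"title description content "
# (fish|chips)*: any number of fish or chips elements, in any order.
LUNCH = r"(?:(?:fish|chips) )*"

print(matches(ARTICLE, ["title", "description", "content"]))  # True
print(matches(ARTICLE, ["description", "title", "content"]))  # False
print(matches(LUNCH, ["fish", "chips", "chips"]))             # True
print(matches(LUNCH, ["fish", "peas"]))                       # False
```

Swapping the star for a plus sign in LUNCH would require at least one fish or chips element, just as it would in the DTD.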
Attributes are indicated per element using the ATTLIST tag. In the example only the article tag has an attribute: 'publish' can be either 'true' or 'false' and has a default value of 'true'. The complete syntax can be found at w3schools and is fairly self-explanatory.
XML files using the article structure must include a DOCTYPE header pointing to the location of the article DTD. For example:
<!DOCTYPE article SYSTEM "/doctypes/article.dtd">
<article>
<title>On squirrels</title>
<description>A short reflection on nut-burying rodents.</description>
<content>(content goes here)</content>
</article>
Formatting
And now for the promised gold: storing formatted text in an XML file. If you can store simple HTML mark-up in an XML file you can write a site that uses those files to create complex HTML pages. The files are user readable, so a content provider with a minimal knowledge of HTML will be able to create and edit them. At its most basic, HTML is as simple as BBCode, so as long as it can be made to fit inside a content <div> without wrecking the layout, HTML is the perfect way to let users enter simple formatting.
Even better: starting from this approach you can take it to the next level by building a CMS that creates these XML files. Provided the syntax is clear this should be no big problem. Having a strict format for your content will allow for a clean cut between the site's front-end and back-end code. You could even use it to provide your clients with a basic 'edit-your-own-content' site that they can later upgrade to a CMS-based system, without changing the front end or the content storage!
So what would a simple, strict, content-only HTML look like?
Content Markup Language
My answer to this question is Content Markup Language, or CML. It contains the following tags:
- Formatting: em, strong and q
- Special tags: br, img and code
- Headers: h1, h2 and h3. Here I went by the LaTeX convention that three is enough. If you want to use h1 for something different on the site you can easily make the parser replace h1 by h2, h2 by h3 etc.
- Paragraphs: p
- Lists: ul, ol, li, dl, dt and dd
- Tables: table, thead, tfoot, tbody, tr, th and td
- The anchor: a
All tags can have an id and a class, and additional attributes are defined where needed (src, href, rowspan etc). In addition, all the usual HTML entities (special characters) are defined.
The complete definition file can be found at http://whiteboxcomputing.com/dtd/cml1_4.dtd.
Using CML
Here's an updated version of the article XML format:
<!ENTITY % cml SYSTEM "http://whiteboxcomputing.com/dtd/cml1_4.dtd">
%cml;
<!ELEMENT title (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT content (#PCDATA | em | strong | p)*>
<!ELEMENT article (title, description, content)>
<!ATTLIST article publish (true|false) "true">
This version's content tags may contain parsed character data and any number of em, strong and p elements. The definition starts out by importing the CML DTD as an entity, %cml;. This entity is basically a macro for all the code in the imported file, so we need to reference it once to insert the text from that file into this one. After that, all elements, entities and attributes of the CML definition file are available to our article definition.
Element classes
Of course, naming each element separately is tedious and inefficient. This can be remedied by grouping elements using entities. Entities are a kind of macro: when the DTD file is parsed, every occurrence of an entity is replaced by the text it represents.
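To make the macro idea concrete, here is a small Python sketch of that text replacement. It is a simplification and not what a real XML parser does internally: the `ENTITIES` table and the `expand` helper are invented for this illustration, the three entity values are borrowed from the CML element classes listed further down, and the `<!ELEMENT p %inline;>` declaration is a hypothetical example.

```python
import re

# Illustrative stand-in for a parser's table of parameter entities.
ENTITIES = {
    "formatting": "em | strong | q",
    "special": "br | img | code",
    "inline": "(#PCDATA | %formatting; | %special; | a)*",
}

# Matches a parameter entity reference such as %inline;
ENTITY_REF = re.compile(r"%([A-Za-z_][\w.-]*);")

def expand(text, entities):
    """Replace %name; references repeatedly until none are left."""
    def substitute(match):
        return entities[match.group(1)]

    previous = None
    while previous != text:
        previous = text
        text = ENTITY_REF.sub(substitute, text)
    return text

print(expand("<!ELEMENT p %inline;>", ENTITIES))
# <!ELEMENT p (#PCDATA | em | strong | q | br | img | code | a)*>
```

The loop is needed because an entity's replacement text may itself contain references to other entities, as %inline; does here.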
The CML DTD defines six different element classes:
<!ENTITY % formatting "em | strong | q">
<!ENTITY % special "br | img | code">
<!ENTITY % headings "h1 | h2 | h3">
<!ENTITY % block "p | table | ol | ul | dl">
<!ENTITY % inline "(#PCDATA | %formatting; | %special; | a)*">
<!ENTITY % inblock "(#PCDATA | %formatting; | %special; | %block; | a)*">
Using these element classes we can define a final version of the article format that allows for any kind of CML-based formatting inside the content tag:
<!ENTITY % cml SYSTEM "http://whiteboxcomputing.com/dtd/cml1_4.dtd">
%cml;
<!ELEMENT title (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT content %inblock;>
<!ELEMENT article (title, description, content)>
<!ATTLIST article publish (true|false) "true">
Any XML file following the article structure and including the correct DOCTYPE header will now be parsable by a host of validating XML parsers in different languages. In addition, parsing invalid article files will result in a set of error messages generated automatically by your parser of choice. Most validating parsers will return line numbers and detailed error messages, allowing you to point the content provider in the right direction. This way you can create a powerful parser for your content without typing a line of parsing code.
And what about the HTML?
Because CML is based on XHTML, any CML code is valid XHTML, provided you use it inside a suitable element (a div, for example). This means that turning an article file into an HTML file is easy:
- Parse the file using your parser of choice
- Extract the non-CML parts (title etc) and use them as you will
- Paste the raw CML into your HTML document at the correct place
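The three steps above could look roughly like this in Python. This is a minimal sketch using the standard library's ElementTree, which does not validate against the DTD, so it assumes the article was already validated elsewhere and uses no DTD-defined entities. The sample article, the `inner_xml` and `render` helpers and the HTML class names are all made up for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical article; the DOCTYPE header is left out because
# ElementTree would not use it for validation anyway.
ARTICLE_XML = """<article publish="true">
<title>On squirrels</title>
<description>A short reflection on nut-burying rodents.</description>
<content><p>Squirrels are <em>excellent</em> at hiding nuts.</p></content>
</article>"""

def inner_xml(element):
    """Return the text and child markup inside an element, without its own tags."""
    parts = [element.text or ""]
    for child in element:
        # tostring() also emits the text that follows the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

def render(article_xml):
    """Parse an article and paste its parts into a small HTML fragment."""
    article = ET.fromstring(article_xml)               # step 1: parse
    title = article.findtext("title")                  # step 2: non-CML parts
    description = article.findtext("description")
    body = inner_xml(article.find("content"))          # step 3: raw CML
    return (
        f"<h1>{escape(title)}</h1>\n"
        f'<p class="description">{escape(description)}</p>\n'
        f'<div class="content">{body}</div>'
    )

print(render(ARTICLE_XML))
```

The title and description are plain text, so they are escaped before going into the page, while the CML content is pasted in as markup.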
Et voila, a strict, extendible, human-editable way to allow formatted content in your XML files.
Sep 21st, 2009