A Quick Intro to XML

I mentioned at the beginning of this chapter that the text of our novel needs style information and context information in order to be displayed. This is added to our text by way of markup, which logically describes the text, and a style sheet, which explains how the text marked up in certain ways should look.

There are lots of ways to mark up a text file, but the most common is XML (the Extensible Markup Language). This is the language LibreOffice uses in its .ODT format, and it is the basis of the .EPUB and .MOBI formats. XML is very simple in its basic form, as you can see from this well-formed XML file:

<greeting>Hello, World!</greeting>

Though if you have had a look at the uncompressed files inside an .ODT document, you will have discovered that they can get pretty complex pretty quickly!

I’m not going to go into great detail about XML here, largely because you don’t need to be an XML guru to get your export plugin working properly (though it might help if things aren’t working as you expect) and because lots of excellent free resources already exist on the internet (check out www.w3schools.com/xml/ for starters). You need to know the basics, though, and that’s what I’ll try to cover here as briefly as I can. If you are familiar with HTML, the language used to mark up web pages, you’ll find XML very similar.

Here we go:

  1. Firstly, all your content needs to be enclosed in tags.
    <thisIsCorrect>Some text</thisIsCorrect>
    <thisIsNot>Some</thisIsNot> text
  2. The start tag is the tag’s name enclosed in less-than and greater-than arrows (< and >).
  3. Your end tag has to have exactly the same name as the start tag (XML names are case sensitive!) but with the addition of a forward slash after the less-than arrow.
  4. You are free to use almost anything you like as a tag name (though a specific XML format, like LibreOffice’s .ODT format, might mandate specific tags). The only punctuation characters allowed, though, are ‘.’, ‘-’ and ‘_’, and of these only the underscore can start the name (a name can’t start with a digit either). The colon is also legal, but it is reserved for namespaces, which we’ll come to shortly. Whitespace is prohibited.
  5. An empty tag can be written without the end tag if you add a forward slash before the greater-than arrow, as in <emptyTag/> (empty tags are often used to include other files in your document, so the tag doesn’t have any content of its own).
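  6. Tags can be nested, with one tag sitting entirely inside another.
  7. But they can ***NOT*** be overlapped: a nested tag must be closed before its parent is.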
  8. Tags can have two types of data: attributes and content. Attributes are included inside the start tag, content between the start and end tags. An attribute’s value is enclosed in quote marks.
    <example anAttribute="true">Some content.</example>
    It is often pretty arbitrary whether data is included as an attribute or as a nested tag. Thus the data above could be marked up like so:
    <example><anAttribute>true</anAttribute>Some content.</example>
    As a rule of thumb, if the data is to be displayed on the screen, it should be content. If it isn’t directly seen on screen, it is best included as an attribute.
  9. If you want to include a comment in the XML as a reminder to yourself, you enclose it like so:
    <!-- don’t forget to buy milk! -->
    and it will be ignored by any program rendering the file’s contents.
  10. Finally, it is often useful to give the program reading the file some hints about its contents. This is done with XML processing instructions, which are enclosed in less-than and greater-than arrows next to question marks (<? and ?>). It is also a good idea to start the file with an XML declaration (which looks very similar to a processing instruction). This gives basic information about the file, such as which version of the XML specification it is using.
    <?xml version="1.0" standalone="yes" ?>
    The standalone attribute lets the program know that this XML document doesn’t need to load an external description of the format to make sense.
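
If you would like to test these rules for yourself, here is a minimal sketch using Python’s built-in xml.etree.ElementTree module. The helper name is_well_formed and the sample snippets are my own inventions for illustration, not part of any export plugin. The parser refuses anything that breaks the rules above, so catching its error is a quick way to check a snippet.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical snippets, one for each rule in the list above.
samples = {
    "matching tags": "<greeting>Hello, World!</greeting>",
    "case mismatch": "<greeting>Hello, World!</Greeting>",
    "text outside the tags": "<thisIsNot>Some</thisIsNot> text",
    "empty tag": "<doc><include/></doc>",
    "nested tags": "<a><b>text</b></a>",
    "overlapped tags": "<a><b>text</a></b>",
}

for label, text in samples.items():
    print(label, "->", is_well_formed(text))

# Attributes and content are read in different ways, and the comment
# is dropped by the parser's default settings.
example = ET.fromstring(
    '<example anAttribute="true">Some content.<!-- buy milk! --></example>'
)
print(example.get("anAttribute"))  # the attribute's value
print(example.text)                # the content between the tags
```

Only the matching, empty and nested samples should come back as True. Everything else stops the parser, which is exactly what will happen to a program trying to read a broken file.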

OK, hopefully by now you can read a pretty basic XML file. However, when dealing with LibreOffice files you will also come into contact pretty quickly with namespaces. Namespaces can look a bit complicated, but they aren’t. Namespaces allow tags with the same name to be used in the same document without confusion (at least as far as the application processing it is concerned). This is important because XML files can include external files, files that you have no control over. Imagine you have written a document which includes an SVG (an XML-based vector graphic format). Without namespaces you would run the risk of a tag in the SVG colliding with one you were using.

“So what do namespaces look like?” I hear you cry. Well, they are simply URIs like “http://www.w3.org/2000/svg”. They don’t have to actually point to anything, but it is customary and sensible for them to point to a document describing the format. As URIs are a bit long to actually use, you associate the URI with a more convenient prefix, and then use the prefix (with a colon, to separate it from the tag’s name) in place of the namespace URI, like so:

<myTag xmlns:svg="http://www.w3.org/2000/svg">
<svg:ellipse rx="100" ry="100"/>
</myTag>
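
To see that the prefix really is just shorthand for the URI, here is a small sketch using Python’s built-in xml.etree.ElementTree module. The second prefix, picture, is a made-up name for illustration, and the URI is the SVG namespace. ElementTree reports a namespaced tag as {URI}name, so two documents using different prefixes for the same URI end up with identical tag names.

```python
import xml.etree.ElementTree as ET

with_svg_prefix = ET.fromstring(
    '<myTag xmlns:svg="http://www.w3.org/2000/svg">'
    '<svg:ellipse rx="100" ry="100"/>'
    '</myTag>'
)
# Same URI, but bound to a different (hypothetical) prefix.
with_other_prefix = ET.fromstring(
    '<myTag xmlns:picture="http://www.w3.org/2000/svg">'
    '<picture:ellipse rx="100" ry="100"/>'
    '</myTag>'
)

# The parser replaces the prefix with the full URI in curly brackets.
print(with_svg_prefix[0].tag)
# The prefix itself is irrelevant: only the URI is compared.
print(with_svg_prefix[0].tag == with_other_prefix[0].tag)
# myTag has no prefix, so it is not in any namespace here.
print(with_svg_prefix.tag)
```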

Once a namespace has been linked to a prefix, the prefix can be used on the tag itself and all its nested children. If we want to simplify things still further we can set a namespace to be the default, and then we don’t even have to include the prefix and colon. The default namespace again applies to a tag and all its nested children.

<myTag xmlns="http://www.w3.org/2000/svg">
<ellipse rx="100" ry="100"/>
</myTag>

Note that in this example <myTag> itself has now moved into the http://www.w3.org/2000/svg namespace, so you have to remember this when trying to access it.
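
Here is what “remembering this” can look like in practice, again as a rough sketch with Python’s built-in xml.etree.ElementTree module rather than anything specific to LibreOffice. The svg key in the ns dictionary is a prefix I have chosen for the lookup. It does not have to appear in the document at all.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<myTag xmlns="http://www.w3.org/2000/svg">'
    '<ellipse rx="100" ry="100"/>'
    '</myTag>'
)

# The root tag now carries the namespace too.
print(doc.tag)

# Searching for the bare name finds nothing, because a bare name
# means "ellipse in no namespace".
print(doc.find("ellipse"))

# Map a prefix of our own choosing to the URI and search with that.
ns = {"svg": "http://www.w3.org/2000/svg"}
ellipse = doc.find("svg:ellipse", ns)
print(ellipse.get("rx"))
```

Notice that the rx attribute is still read by its plain name: a default namespace applies to tags, not to unprefixed attributes.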
