Resolution and Resolution Independence

Another thing we are going to have to bear in mind when formatting our book is that the resolutions of our output devices are very different. Resolution refers to the number of dots in a specified area, and generally speaking, the more densely the dots are packed, the better things look. These dots can be ink laid down by a printer or pixels displayed by a monitor (or electronic ink on the screen of an e-reader). Your document, too, will have a resolution, and may well be made up of bits and pieces that have their own resolutions (for example, if you have a couple of photographs included along with your text).

The most common unit of measurement for resolution differs depending on what sort of device you are outputting to. When you’re printing, dpi (dots per inch) is used; whereas when talking about monitors, the resolution of the entire device is most commonly given. So a printer might print an A4 page at 300 dpi, whereas a Kindle Keyboard has a resolution of 600×800. When talking about graphics, such as a photo, it is common to refer to its resolution in the same way as the output device it is intended to be viewed on, i.e. if it is to be printed, then its dpi is used; whereas if it is to be viewed on a screen, its entire resolution is given. This poses a bit of a dilemma for us as we are outputting to both!

The most important thing for you to understand about resolution is how it relates to size. Basically, the higher the resolution of your document, the bigger it will print/display, and the higher the resolution of the output device, the smaller it will print/display. However, in many situations you can set size as well as resolution. In these cases the image is scaled behind the scenes, so changing the resolution of the document or output device changes the amount of scaling that needs to be done and, therefore, increases or decreases the quality of the image.
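
To put some numbers on this, here is a minimal Python sketch of the arithmetic. The function name and the 1200×1800 pixel photo are made up purely for illustration; the only assumption is that, with no scaling, one pixel of the image becomes one dot on the page.

```python
def printed_size_inches(width_px, height_px, dpi):
    """Return the (width, height) in inches of an unscaled image printed at `dpi`."""
    return width_px / dpi, height_px / dpi

# A hypothetical 1200x1800 pixel photo sent to two different printers.
print(printed_size_inches(1200, 1800, 300))  # (4.0, 6.0)
print(printed_size_inches(1200, 1800, 600))  # (2.0, 3.0)
```

The same photo comes out half as wide and half as tall on the higher resolution device. If you insisted on a 4×6 inch print from the 600 dpi printer, the image would have to be scaled up behind the scenes.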

So what does all this mean for us, and where is the resolution independence that was mentioned in the title of this section? First the good news: unless you screw things up, the text of your novel will be resolution independent, which is to say that it will be displayed/printed optimally no matter what the resolution of the output device.

Unfortunately, it is quite easy to screw things up. We face three main challenges: firstly, though the text is resolution independent, any images contained alongside it aren’t; secondly, though our text may be resolution independent, we can still go awry when specifying how the text is laid out (e.g. when we specify how much we want paragraphs indented); and thirdly, because the printed and electronic versions of our book are going to be different, though we can, for example, specify that we want our title to be 3cm high and expect it to be rendered beautifully in both versions, 3cm takes up a different percentage of the space on a printed page than it does on an e-reader, and this gets even more confusing when you consider that an e-reader can be held both portrait and landscape.

The solution to all these problems is, sadly, in most cases at the moment, to simplify the formatting of the e-book (the printed version can remain as complicated as you like). This, for most people, will be reasonably straightforward, but if you rely a lot on clever layout for some artistic effect, or if you are writing a more academic style of book, which uses lots of diagrams and footnotes, it can be tricky and the results disappointing.

Vector Graphics, Raster Graphics and Text

From reading the last section you might be wondering how the text of your book manages to be resolution independent. To understand this you need to understand the difference between vector graphics and raster graphics. Raster graphics explicitly specify what colour each of their pixels is, e.g. pixel one is black, pixel two is black, pixel three is black etc., whereas vector graphics describe what colour their pixels are, e.g. pixels one to three are black. Vector graphics are, therefore, much better suited to graphics that consist of geometric shapes and patterns, as these are easy to describe, and raster graphics to images that have organic contents, like photos.

Vector graphics have two very important advantages over raster graphics (don’t worry, they also have some disadvantages). Firstly: they are usually a lot smaller. Imagine if earlier the line of pixels we were describing wasn’t three long, but three million. This can be described in our vector format thusly: pixels one to three million are black; whereas, in a raster graphic, we have to have three million different entries, one for each pixel! (Though if you were to try to convert something like a photo to a vector graphic you might actually end up with a bigger file as it might not be composed of nice easy to define shapes). Secondly: vector graphics are resolution independent. That is to say you can scale the graphic as much as you like and it will be exactly the same (except for being bigger or smaller of course!)

For example, imagine our vector graphic describes a circle that has a radius of five. Now imagine we want to scale it by a factor of two. All we do is say we have a circle whose radius is five*two. When our circle is drawn on the screen it will be a perfect circle (or, at least, as perfect as is possible), just twice as big as before. Now imagine we have a raster graphic that describes the same circle. The first problem we encounter is that if our circle has a radius of five pixels, we only have a 10×10 square of pixels to define it. This is not enough to create a smooth curve, so our original circle is going to be pretty jaggedy. Now, when we scale it by a factor of two, the computer has no way of knowing that this jaggedy shape is meant to be a circle, so it won’t be able to smooth out the curves even though it now has more pixels to do it with. All it can do is take its best guess at figuring out what the extra pixels should be filled with by averaging the values of pixels that have been specified, a process that will actually make the circle’s outline worse!
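
If you would like to see this for yourself, the following Python sketch draws both versions as ASCII art. It is only an illustration under a couple of assumptions: the function names are my own, a pixel is filled when its centre falls inside the circle, and the raster is enlarged by simply repeating pixels rather than by the averaging described above (which would blur the edge as well as leaving it blocky).

```python
def rasterize_circle(radius):
    """Vector to raster: fill every pixel whose centre lies inside the circle."""
    size = 2 * radius
    return [
        [1 if (x + 0.5 - radius) ** 2 + (y + 0.5 - radius) ** 2 <= radius ** 2 else 0
         for x in range(size)]
        for y in range(size)
    ]

def upscale_nearest(grid, factor):
    """Raster to bigger raster: every pixel becomes a factor x factor block."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in grid
        for _ in range(factor)
    ]

def show(grid):
    for row in grid:
        print("".join("#" if pixel else "." for pixel in row))

# Scaling the raster: the 10x10 circle blown up to 20x20 keeps its jagged steps.
show(upscale_nearest(rasterize_circle(5), 2))
print()
# Scaling the vector: redraw from the description (radius 5*2) at the new size.
show(rasterize_circle(10))
```

The two 20×20 grids are not the same: the first has the chunky two pixel steps of the small circle, the second makes full use of the extra pixels.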

Now, the text of your novel generally starts off life as a special type of vector graphic. This is how you experience it in your word processor, where it is made up of the characters themselves, a font, which describes how the characters should be drawn on the screen, and various levels of style information, which fine-tune everything. It is possible to convert this text into either a plain vector graphic or a raster graphic, though there are only a couple of special cases where you would do this, mostly when you want to edit the text in a graphics application, e.g. if you wanted to include a quote from the book on your book cover (though most graphics packages will allow you to use text and graphics together without first converting the text).

The Difference Between Text and Binary Files

Ok, so you have your novel. It is made up of some text and some graphics, either vector, raster or both. Now your novel needs to be stored in a file on your computer. Files can be divided into two main types: text and binary. A text file is, unsurprisingly, a file that contains text. By text I mean that the low level 0s and 1s that are stored on the computer’s disk can be mapped to the characters of a character set in a meaningful way. To be clear, a text file isn’t one which contains text when you open it in a word processor (though it might). It is one which contains text if you open it in a text editor. A binary file might also contain text, but it needs another level of processing to take the binary information and turn it into character data.

Why is this important? Because, in order to create our converter plugin for LibreOffice, we are going to want to have a good poke about inside the internals of our novel’s file so we can see how the format works, and if we can’t see those internals because the file type is binary, we are going to have problems. We are also going to want to poke about inside our converted file to make sure everything is indeed working as intended, and to be able to use standard system tools to create and check it.

Now, with all these advantages, you might wonder why anyone would use a binary format. However, it’ll come as no surprise to most of you to learn that an example of a binary file would be one using Microsoft Office’s older .doc format. You can see this for yourself by saving a copy of a file in LibreOffice using the .doc format and trying to open it up in a text editor (such as gedit).
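
If you would rather let the computer do the squinting, here is a rough Python sketch of the same test. The function name is my own and the rule it applies (no zero bytes, and everything decodes as UTF-8) is a simplifying assumption rather than an official definition of a text file. The eight bytes used for the binary sample are the signature that the older Microsoft Office container format begins with.

```python
def looks_like_text(data: bytes) -> bool:
    """Guess whether `data` is text: no NUL bytes and valid UTF-8 throughout."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# The first eight bytes of an old-style .doc file.
DOC_SIGNATURE = bytes.fromhex("d0cf11e0a1b11ae1")

print(looks_like_text("<greeting>Hello, World!</greeting>".encode("utf-8")))  # True
print(looks_like_text(DOC_SIGNATURE))                                         # False
```

To try it on a real file you would read the file's bytes (for example with `open(path, "rb").read()`) and pass them in.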

Binary files are preferred by Microsoft for almost exactly the same reasons that everyone else prefers text files: they make it harder for people to write programs that are compatible with that format, meaning you have to purchase specialised programs from them to accomplish simple tasks rather than using the tools that are provided by your operating system, a third party or yourself. You will find that lots of file formats used by older big software companies are binary for this reason.

The only advantages binary files have over text files (for the user) are that they are (usually) smaller, (possibly) faster and that they make it easier to combine different files into one (e.g. your novel and your cover). Though there are various ways this can be done in a text file, they often result in something that is not readable by standard tools.

A fairly simple way for text files to get most of the advantages of binary files is to split the information into various different files and store them in a directory compressed using the zip algorithm. This gives a small binary file that is so easy to turn back into a series of text files that it can, for all intents and purposes, be considered a text file. This is the technique used by LibreOffice for its .odt file format. If you like, you can make a copy of one of your files, decompress it and have a poke about inside, as, in a minute, we are going to be looking at some of these files (don’t worry if you don’t know how to do this, as I’ll be showing you later).
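
As a small taste of how little effort the round trip takes, here is a Python sketch using the standard zipfile module. The pack and unpack helpers and the toy file contents are made up for illustration, and everything happens in memory so nothing on your disk is touched. A real .odt contains a content.xml too, along with several other files.

```python
import io
import zipfile

def pack(files):
    """Zip a {name: text} mapping into a single blob of bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return buffer.getvalue()

def unpack(blob):
    """Turn the zipped bytes back into a {name: text} mapping."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return {name: archive.read(name).decode("utf-8") for name in archive.namelist()}

files = {
    "content.xml": "<greeting>Hello, World!</greeting>",
    "styles.xml": "<styles/>",
}
blob = pack(files)      # one binary file...
print(unpack(blob))     # ...that turns straight back into text files
```

To look inside a copy of one of your own documents, you could pass its path to `zipfile.ZipFile` in place of the in-memory buffer.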

A Quick Intro to XML

I mentioned at the beginning of this chapter that the text of our novel needs style information and context information in order to be displayed. This is added to our text by way of markup, which logically describes the text, and a style sheet, which explains how the text marked up in certain ways should look.

There are lots of ways to mark up a text file. The most common is with XML (the extensible markup language), which is the language LibreOffice uses in its .ODT format and is the basis of the .EPUB and .MOBI formats. XML is very simple in its basic form, as you can see from this well formed XML file:

<greeting>Hello, World!</greeting>

Though if you have had a look at the uncompressed files inside an .ODT document, you will have discovered that they can get pretty complex pretty quickly!

I’m not going to go into great detail about XML here, largely because you don’t need to be an XML guru to get your export plugin working properly (though it might help if things aren’t working as you expect) and because lots of excellent free resources already exist on the internet (check out www.w3schools.com/xml/ for starters). You need to know the basics, though, and that’s what I’ll try to cover here as briefly as I can (if you are familiar with HTML, the language used to mark up web pages, you’ll find XML very similar).

Here we go:

  1. Firstly, all your content needs to be enclosed in tags.
    <thisIsCorrect>Some text</thisIsCorrect>
    <thisIsNot>Some</thisIsNot> text
  2. The start tag is the tag’s name encased in less than and greater than arrows,
    <anExampleTag>
  3. Your end tag has to be exactly the same as the start tag (XML names are case sensitive!) but with the addition of a forward slash after the less than arrow
    </anExampleTag>
  4. You are free to use almost anything you like as a tag (though a specific XML format, like LibreOffice’s .ODT format, might mandate specific tags). The only punctuation characters allowed, though, are ‘.’, ‘-’ and ‘_’, and of these only the underscore can start the name. Whitespace is similarly prohibited.
  5. An empty tag can be written without the end tag if you add a forward slash before the greater than arrow (empty tags are often used to include other files in your document, and therefore the tag doesn’t have any content of its own).
    <noEndTagNeeded/>
  6. Tags can be nested
    <nestedTag1><nestedTag2></nestedTag2></nestedTag1>
  7. But ***NOT*** overlapped
    <overlappedTag1><overlappedTag2></overlappedTag1></overlappedTag2>
  8. Tags can have two types of data: attributes and content. Attributes are included inside the tag, content between them. An attribute’s value is enclosed in quote marks.
    <example anAttribute="true">Some content.</example>
    It is often pretty arbitrary whether data is included as an attribute or as a nested tag. Thus the data above could be marked up like so:
    <example><anAttribute>true</anAttribute>Some content.</example>
    As a rule of thumb, if the data is to be displayed on the screen, it should be content. If it isn’t directly seen on screen, it is best included as an attribute.
  9. If you want to include a comment in the XML as a reminder to yourself you enclose it thusly
    <!-- don’t forget to buy milk! -->
    and it will be ignored by any program rendering the file’s contents.
  10. Finally, it is often useful to include some hints to the program reading the file about its contents. This is done with XML processing instructions, which are enclosed in less than and greater than arrows next to question marks. It is also a good idea to include an XML Declaration (which looks very similar to a processing instruction), which gives basic information about the file, such as what version of the XML specification the file is using.
    <?xml version="1.0" standalone="yes" ?>
    The standalone attribute lets the program know that this XML document doesn’t need to load an external description of the format to make sense.
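
A quick way to check whether you have followed these rules is to hand your XML to a parser and see if it complains. The sketch below uses the XML parser from Python's standard library. The is_well_formed helper is my own name for it, and the samples are taken from the rules above.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts `xml_text`, False if it raises an error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

samples = [
    "<thisIsCorrect>Some text</thisIsCorrect>",                   # rule 1: fine
    "<thisIsNot>Some</thisIsNot> text",                           # rule 1: text outside the tags
    "<anExampleTag></anexampletag>",                              # rule 3: case mismatch
    "<noEndTagNeeded/>",                                          # rule 5: empty tag
    "<nestedTag1><nestedTag2></nestedTag2></nestedTag1>",         # rule 6: nested
    "<overlappedTag1><overlappedTag2></overlappedTag1></overlappedTag2>",  # rule 7
    '<example anAttribute="true">Some content.</example>',        # rule 8: attribute
]
for sample in samples:
    print(is_well_formed(sample), sample)
```

Note that this only tells you whether a file is well formed, i.e. whether it obeys the general rules of XML. It says nothing about whether the tags are the ones a particular format expects.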

Ok, hopefully by now you can read a pretty basic XML file. However, when dealing with LibreOffice files you will also come into contact pretty quickly with namespaces. Namespaces can look a bit complicated, but they aren’t. Namespaces allow tags with the same name to be used in the same document without confusion (at least as far as the application processing it is concerned). This is important because XML files can include external files, files that you have no control over. Imagine you have written a document which includes an SVG (an XML based vector graphic format). Without namespaces you would run the risk of a tag in the SVG colliding with one you were using.

“So what do namespaces look like?” I hear you cry. Well, they are simply URIs like “http://www.w3.org/2000/svg”. They don’t have to actually point to anything, but it is customary and sensible for them to point to a document describing the format. As URIs are a bit long to actually use, you associate the URI with a more convenient prefix, and then use the prefix (with a colon, to separate it from the tag’s name) in place of the namespace URI, like so:

<myTag xmlns:svg="http://www.w3.org/2000/svg">
<svg:ellipse rx="100" ry="100"/>
</myTag>

Once a namespace has been linked to a prefix it can be used on the tag itself and all its nested children. If we want to simplify things still further we can set a namespace to be the default, then we don’t even have to include the prefix and colon. The default namespace again applies to a tag and all its nested children.

<myTag xmlns="http://www.w3.org/2000/svg">
<ellipse rx="100" ry="100"/>
</myTag>

Note that in this example <myTag> has now moved into the http://www.w3.org/2000/svg namespace, so you would have to remember this when trying to access it.
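
You can see what a parser makes of the two examples with a few lines of Python. This is just a sketch: the tag_names helper is my own, and the curly bracket notation in the results is how Python's standard XML library chooses to write a tag name together with its namespace (other tools display it differently).

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

prefixed = f'<myTag xmlns:svg="{SVG_NS}"><svg:ellipse rx="100" ry="100"/></myTag>'
default = f'<myTag xmlns="{SVG_NS}"><ellipse rx="100" ry="100"/></myTag>'

def tag_names(xml_text):
    """List the full name of every tag in the document, root first."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]

print(tag_names(prefixed))
# ['myTag', '{http://www.w3.org/2000/svg}ellipse']
print(tag_names(default))
# ['{http://www.w3.org/2000/svg}myTag', '{http://www.w3.org/2000/svg}ellipse']
```

As far as the parser is concerned the prefix has vanished. What it actually sees is the namespace URI glued to the tag name, which is why the ellipse is the same tag in both examples while myTag is not.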

A Quick Intro to XML Schemas

I mentioned in the previous post that some XML based file formats, like LibreOffice’s .ODT format, specify which tags you can use. These are specified in an XML Schema (older formats might use a Document Type Definition, as HTML does). You don’t really need to worry about how schemas are created; just be aware that they exist and what they do. Most schemas are stored as external files and then included in the XML document,

<greeting xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.robertawood.com/xmlExample greeting.xsd">

Schemas can, though, be included inline rather than in an external file, but this is rarely done as then the schema becomes part of the document which needs to be described in the schema, which gets a bit confusing. Anyway just so you know what a schema actually looks like, here is the schema for the simple example used in the previous post.

<?xml version="1.0" encoding="utf-8" ?>
     <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
                attributeFormDefault="qualified"
                targetNamespace="http://www.robertawood.com/xmlExample">
         <xs:element name="greeting">
             <xs:complexType>
                 <xs:simpleContent>
                     <xs:extension base="xs:string">
                         <xs:attribute name="language" type="xs:string"/>
                     </xs:extension>
                 </xs:simpleContent>
             </xs:complexType>
         </xs:element>
     </xs:schema>
 

This schema first sets its own namespace before setting ‘attributeFormDefault="qualified"’, meaning that any attributes a tag has are required to use a namespace, too (by default they aren’t). It then defines the namespace for the document it is going to be describing (‘http://www.robertawood.com/xmlExample’) before going on to describe the document as consisting of a top level (root) tag called ‘greeting’ that can only contain an attribute called ‘language’ and a character string (i.e. the greeting itself). So, putting this all together, our example XML file would look something like this (note we can no longer say that it is standalone as we are using an external schema to describe it):

<?xml version="1.0" encoding="utf-8" ?>
 <!-- This is a simple example of an XML file -->
 <raw:greeting xmlns:raw="http://www.robertawood.com/xmlExample"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://www.robertawood.com/xmlExample greeting.xsd"
               raw:language="English">
     Hello, World!
 </raw:greeting>

Hopefully you could make sense of that. Don’t worry too much if you wouldn’t be able to write a simple file of your own. For our purposes all you really need to be able to do is read it.

A Quick Intro to CSS

In the previous post we had a look at marking up some example text with XML. Now I imagine you’re just as excited as me to view that example in all its glory, but before we can do that, we need to specify how the various parts of it should look. The most common way to specify this style information when using XML is with CSS (cascading style sheets). This is the same way it’s done on the web with HTML. For our ‘Hello, World’ example, the style sheet might look something like this:

greeting{
  font-size:120px;
  color:red;
}

Easy huh? Even if you are a complete CSS virgin, I’m pretty sure you’ve been able to guess at what I have specified, but if you are having one of those days: we told the viewer program that we wanted it to render any text contained within a greeting tag with a font size of 120 pixels and using the colour red (note that in the CSS file we have to use the American spelling: color). You may have noticed that we left out rather a lot of information too. This will all be provided by the viewer’s default style sheet. There is no standardisation of these defaults, but viewers tend to pick reasonably sane values.

If we save this style info in a file called helloworld.css, we can then link our XML file to it by including the following processing instruction:

<?xml-stylesheet type="text/css" href="helloworld.css"?>

Now we can take a quick peek at our first bit of marked up text! You may be wondering what to use to view it. The good news is that your computer probably already contains many XML viewing programs. Some of these will be designed for a particular XML document type, others will be general purpose. The best general purpose XML viewer you are likely to have is your web browser. So save the XML and CSS files (making sure they are in the same directory), fire up Firefox, Chrome or whatever you use, and then open the XML file from the File menu. And ta daa!

Well ok, it isn’t the world’s most exciting document, but hopefully you get the basic idea of how these technologies work together. CSS, like almost everything, can get pretty complex. Again I’m not going to dwell on the details because you don’t need to be any kind of CSS guru for what we are going to use it for, but what follows is what you need to know in a nutshell.

The first thing to grasp about CSS is the fact that it gives you lots of ways of referring to the same bit of text and that the more specifically a particular rule identifies the text, the higher its precedence is when merging all the matched rules together (it also doesn’t limit you to just the one CSS file, but we aren’t going to use that functionality). So, in our example we can add another rule which identifies the first letter of the text inside the greeting tag like so:

greeting:first-letter {
  color:blue;
}

This is more specific than our previous rule that applied to all the letters in the greeting tag, so it will have precedence over it and successfully turn the first letter blue.

As well as identifying text by tag name or sub-item you can also use class and id attributes. These aren’t really used with pure XML documents but are used extensively when styling HTML for the web. They are needed in HTML to deal with the fact that HTML can’t use arbitrary tag names; you are limited to those specified in the HTML standard. To get around this limitation, what would be the tag name in an XML document is set as the item’s class attribute, so an HTML fragment of our ‘Hello, World!’ example would look like this (note, <p> is the HTML paragraph tag):

<p class="greeting">Hello, World!</p>

To indicate we are giving a class attribute in our CSS rule we put a dot in front of the name (if we were using an id attribute we would indicate it with the # sign).

.greeting {
  font-size:120px;
  color:red;
}

Alright, getting back to XML and identifying tags by name, if we add a nested tag to our example,

<greeting>Hello, <nestedTag>World</nestedTag></greeting>

We can reference the text in the nested tag in three ways. Firstly, as it is a child of the <greeting> tag, all of <greeting>’s CSS rules apply to it. We can also create our own rule especially for <nestedTag>,

nestedTag {
  font-size:240px;
}

As this applies to the text more specifically than <greeting>’s rules do, it’ll override any property that is declared in both, as font-size is. Finally, we can reference it by including its family tree separated by spaces:

greeting nestedTag {
  font-size:60px;
}

This rule would only apply to <nestedTag> tags that are nested somewhere inside <greeting> tags (whereas the previous rule applied to all <nestedTag> tags regardless of parentage). As this is more specific than the previous rule, it would get priority over it and set the font size to 60px.
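
The way CSS measures "more specific" is to count the ids, the classes and the tag names in the part of the rule before the curly bracket, and then compare the three counts in that order. The Python sketch below imitates that counting for the simple rules used in this post. It is a toy under stated assumptions: the function name is my own, and it ignores plenty of things real CSS supports, such as attribute selectors and the `*` wildcard.

```python
import re

# Pseudo-elements that CSS still accepts with a single colon.
LEGACY_PSEUDO_ELEMENTS = {"first-letter", "first-line", "before", "after"}

def specificity(selector):
    """Return (ids, classes, tags) for a simple CSS selector."""
    ids = classes = tags = 0
    for marker, name in re.findall(r"(::?|#|\.)?([A-Za-z_][\w-]*)", selector):
        if marker == "#":
            ids += 1
        elif marker == "." or (marker == ":" and name not in LEGACY_PSEUDO_ELEMENTS):
            classes += 1   # classes and pseudo-classes
        else:
            tags += 1      # tag names and pseudo-elements
    return ids, classes, tags

for selector in ["greeting", "nestedTag", "greeting nestedTag",
                 "greeting:first-letter", ".greeting"]:
    print(selector, specificity(selector))
```

Python compares the resulting tuples left to right, just as CSS does, so (0, 0, 2) for `greeting nestedTag` beats (0, 0, 1) for `nestedTag`, and a single class beats any number of tag names. One caveat: a value that is merely inherited from a parent, as <nestedTag> inherits from <greeting>, loses to any rule that targets the tag directly, whatever the counts.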

The other major concept you need to grasp about CSS is the way in which it can position the bits of text identified by its rules. Every identified bit of text (or, more generally, content) is put into a box. This box has height, width, padding, margins and borders (all of which can be specified in the CSS), and also X (left), Y (top) and Z (z-index) coordinates.

Usually, these coordinates are relative to the text’s starting position (or, more accurately, the starting position of the box around the text with all its padding, margins, borders etc.). This starting position can be one of three different types. Firstly, and most simply, it can follow on from the piece of text before it (or the box around it). This is called being inline. Secondly, it can start a new line below the previous bit of text (or the box around it). This is called block layout. And finally, it can float around the previous or next bit of text. Floating is used a lot in HTML to achieve interesting and flexible layouts, but doing this is horribly unintuitive. Thankfully, we don’t need to use it in anything other than its most basic form, which is pretty straightforward to grasp: imagine a picture on a page with the text of that page flowing around it. That effect is achieved by using the float layout.

Once we have established the text’s starting position, we have two possible ways to move it from there. The most common is with relative positioning, the other is with absolute positioning. Relative positioning is pretty straightforward. If you set its X coordinate to 5px, the bit of text will move 5px right from its starting position. Behind it, it will leave a 5px space, but it will overlap any text that follows it by 5px. If absolute positioning is used, the text is pulled from its starting position and the hole it leaves closes behind it. It is then positioned 5px from the left edge of its nearest parent that was itself positioned (relatively or absolutely), overlapping any content that was already in that position (whether overlapping content goes above or below the content it overlaps is controlled by the text’s Z coordinate).

We also have a third, pseudo, way of moving text. We can set the left and top margins of our text’s box. This way doesn’t actually move the entire text, but the effect is very similar to moving it relatively. The big difference with doing it this way is that any text that follows is moved along by the same amount as well, meaning we don’t get overlapping text.

That’s probably enough CSS to get you going. However, now that you know all about CSS, I’m afraid I have some bad news for you: unfortunately, CSS isn’t how LibreOffice handles style information. But before you despair, it is how MOBI (Kindle) and EPUB (and pretty much everything else) do, so when we write our plugin, not only will we have to convert its markup, we will also have to convert its style information to CSS. That’s it for now on CSS. If you are interested in the subject there are lots of good resources on the web, such as http://www.w3schools.com.

A Quick Intro to XSLT

The other type of markup file you’ll be coming to know intimately is an XSLT. Where CSS tells the application to render a certain piece of marked up text in a certain way, XSLT tells the application to change the markup of a certain piece of marked up text. This is incredibly useful when you want to change an XML file from one format to another e.g. from LibreOffice’s XML based .ODT format to Amazon’s XML based .mobi format.

In a nutshell, you create templates that identify bits of XML, describe how to transform them, and say which of the identified bits of XML’s children should be processed next. When the parser reads the original XML file, each time it finds something that matches a template, it applies the changes. Because of this, it doesn’t matter in what order you write the templates in the XSLT file. However, this can get a bit confusing because it means that the order in which things are output depends upon a combination of the order in which things appear in the original XML file, the order in which the templates match and the order in which the templates allow the children of the matched XML to be processed.

To give a better idea of what a real XSLT stylesheet looks like we’ll create one that’ll convert our Hello, World example from XML to HTML (the language used on the web).

The first thing to note about XSLT is that it is written in XML. This has the advantage that you can use all the same tools to work on XSLT files as you would other XML files, but because XML isn’t really intended for marking up programming, it can sometimes be a bit clunky and, sometimes, things that are very common and very easy to do in other languages can be less straightforward. An example of this is looping. In most languages looping over a set of variables is pretty trivial; however, though it is possible in XSLT, it isn’t intuitive, especially when you consider that program flow in the XSLT is controlled by the original XML file (rather than the code defining the loop).

Another thing worth noting before we begin our example is that the text the XSLT outputs needs to be well formed XML. This doesn’t necessarily mean that you have to output XML (though it is by far the most common thing to do); it just needs to be compatible with it. For example, from our Hello, World XML file we will be outputting HTML. We just have to be careful not to use features of HTML that are invalid in XML (such as omitting closing tags).

So let’s begin. The first thing to do is tell helloWorld.xml that we are going to be transforming it. This is done by adding a processing instruction (this step isn’t necessary if you want to use a command line XSLT processor, to which you would supply both the original XML file and the XSLT).

<?xml version="1.0" encoding="utf-8" ?>
<?xml-stylesheet type="text/xml" href="./helloWorld.xslt"?>
<!-- This is a simple example of an XML file -->
<raw:greeting xmlns:raw="http://www.robertawood.com/xmlExample" raw:language="English">
  Hello, World!
</raw:greeting>

We then need to create our XSLT file (called helloWorld.xslt). (N.B. if you want to test this yourself in your browser, as of this writing, Chrome doesn’t support XSLT, but it works fine in Firefox.)

<?xml version="1.0"?>
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
               xmlns:raw="http://www.robertawood.com/xmlExample">
</xsl:transform>

This should look very familiar to you. As you can see, all we have done is create a top level transform element that we use to map the xsl namespace to the xsl prefix and the raw namespace to the raw prefix (we could map it to a different prefix if we wished). You can use either “stylesheet” or “transform” as your top level element. I’ll use transform throughout, but you might see stylesheet used if you look at other examples on the web.

This is a very minimal yet functional XSLT file. You might expect that, as we haven’t defined any templates, it won’t have any effect on the XML file. However, surprisingly, the output of this XSLT is

Hello, World!

All the markup has been stripped. How did this happen? Well, XSLT has a set of built in templates that will be applied if you don’t supply your own in a given situation. Most notably these match all the elements and instruct the processor to process their children whilst matching any text that is found and outputting it. Thus <raw:greeting>Hello, World!</raw:greeting> becomes plain old “Hello, World!”
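
Python's standard library can't run XSLT, but it can imitate the effect of these built in templates closely enough to make the idea concrete: visit every element, ignore its tags and attributes, and output whatever text is found along the way. The sketch below is only that, an imitation, and the text_only name is my own.

```python
import xml.etree.ElementTree as ET

source = """<raw:greeting xmlns:raw="http://www.robertawood.com/xmlExample" raw:language="English">
  Hello, World!
</raw:greeting>"""

def text_only(xml_text):
    """Walk the whole document and join up the text, dropping all the markup."""
    return "".join(ET.fromstring(xml_text).itertext())

print(text_only(source).strip())  # Hello, World!
```

Notice that "English" doesn't appear in the output. By default the built in templates only visit an element's children, and attributes aren't counted among them.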

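Ok, let’s add a template to our XSLT file so we can control the output ourselves. Because our greeting element lives in the raw namespace, the template has to match it using the raw prefix.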

<?xml version="1.0"?>
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
               xmlns:raw="http://www.robertawood.com/xmlExample">
  <xsl:template match="raw:greeting">
    We found a greeting!
  </xsl:template>
</xsl:transform>

We now get the output “We found a greeting!” Here we matched based on element name, but more complicated matches are possible by utilising XPath (though we shouldn’t have to use this functionality). However, we will need to be able to match attribute names. To do this you simply prefix the name with the @ symbol to indicate you are matching an attribute, not an element.

<?xml version="1.0"?>
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
               xmlns:raw="http://www.robertawood.com/xmlExample">
  <xsl:template match="raw:greeting">
    We found a greeting!
  </xsl:template>
  <xsl:template match="@raw:language">
    We found a language attribute!
  </xsl:template>
</xsl:transform>

And ta daa!

We found a greeting!

Oh, we seem to have got the same output we did before. Why is this? It’s because of the slightly confusing program flow of the XSLT processor. You have to explicitly tell it to apply templates to an element’s children. We do this by adding an “apply-templates” element to our template’s output, which also indicates where any output from any children should be placed. We can either apply all templates to the element’s children or just a subset by using the “select” attribute.

<?xml version="1.0"?>
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="greeting">
    We found a greeting!
    <xsl:apply-templates select="@language"/>
  </xsl:template>
  <xsl:template match="@language">
    We found a language attribute!
  </xsl:template>
</xsl:transform>

We now get the expected output of:

We found a greeting! We found a language attribute!

We need to add one more feature to our template before we actually get it to output HTML: we want to get at the text encased in our greeting element and language attribute. There are two ways of doing this. The first is by creating a rule to match the text and telling the XSLT processor to process the text children of the element (or using the built-in rule that does this). The second way is by using the “value-of” element. As you can probably guess, the “value-of” element returns the value of the element or attribute it selects. In our case, using the “value-of” element is slightly clearer, so that is what we’ll do. Note that “.” means get the value of the current node, whether that is an element or an attribute.

<?xml version="1.0"?>
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="greeting">
    We found a greeting! The greeting was: <xsl:value-of select="."/>
    <xsl:apply-templates select="@language"/>
  </xsl:template>
  <xsl:template match="@language">
    We found a language attribute! The language was: <xsl:value-of select="."/>
  </xsl:template>
</xsl:transform>
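
For comparison, the first approach might look something like the sketch below. It assumes, as in the rest of this section, that the greeting element contains only text and carries a language attribute. Rather than defining a text rule of our own, it leans on the built-in one and simply selects the text children with text().

```xml
<?xml version="1.0"?>
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="greeting">
    <!-- text() selects the text children; the built-in text rule outputs them. -->
    We found a greeting! The greeting was: <xsl:apply-templates select="text()"/>
    <xsl:apply-templates select="@language"/>
  </xsl:template>
  <xsl:template match="@language">
    <!-- Here the attribute itself is the current node, so "." is its value. -->
    We found a language attribute! The language was: <xsl:value-of select="."/>
  </xsl:template>
</xsl:transform>
```

Both approaches should give the same result for our simple file; “value-of” just says more directly what we are after.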

OK, now we have all the pieces we need to create our XSLT that will convert helloworld.xml into HTML. The most common way to convert XML to HTML is to convert the XML element name to an HTML class name, and that is what we’ll do (note that XSLT has a built-in template that outputs the value of an attribute, so we don’t need to supply our own).

<?xml version="1.0"?>
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="greeting">
    <html>
      <head><title>Greetings</title></head>
      <body>
        <h1>Greetings in <xsl:apply-templates select="@language"/></h1>
        <div class="greeting"><xsl:value-of select="."/></div>
      </body>
    </html>
  </xsl:template>
</xsl:transform>
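
The stylesheet above handles one element by name. The same idea can be generalised. The following is a minimal sketch, not the stylesheet we will use later, in which every element is wrapped in a div whose class is the element’s own name. Wrapping everything in a div, and the page title used here, are assumptions made purely for illustration.

```xml
<?xml version="1.0"?>
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Build the HTML skeleton once, at the document root. -->
  <xsl:template match="/">
    <html>
      <head><title>Converted document</title></head>
      <body><xsl:apply-templates/></body>
    </html>
  </xsl:template>

  <!-- Any element becomes a div; the braces insert the element's name as the class. -->
  <xsl:template match="*">
    <div class="{local-name()}">
      <xsl:apply-templates/>
    </div>
  </xsl:template>
</xsl:transform>
```

The curly braces are what XSLT calls an attribute value template: the expression inside them is evaluated and its result is placed in the attribute.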

A Brief Moan About Styles in LibreOffice

OK, well I think that’s enough background. It’s time we actually did something useful. The first step we are going to take is to prepare your manuscript in LibreOffice. Depending on how you wrote your novel, this might just be a case of a few minor tweaks or it might involve a fair bit of work. The thing that is important is whether or not you formatted your novel using styles. For the sake of simplicity, I’m going to assume that you didn’t and that you don’t have a clue what I’m talking about. If you did use styles, just use this section as a check-list.

So what are styles? They can be thought of as the poor cousins to XML tags and CSS style rules (discussed in the intro to XML and CSS). That is to say they can be used to both define sections of your LibreOffice document and describe how these sections should be displayed. They have two parts: the style definition (similar to the way CSS defines styles) and application (similar to the way in which XML tags divide your work into logical sections).

However, compared to CSS, LibreOffice’s styles are really limited. The first shortcoming you’re likely to spot is the limited number of options available to LibreOffice’s style rules. However, more annoying in practice is the fact that LibreOffice doesn’t have any cascade and only very limited inheritance. With a bit of thought and planning these drawbacks can be worked around, but what will have you pulling your hair out are direct styles and formatting. This is formatting that LibreOffice ‘helpfully’ creates for you, and if you are not careful, it will very subtly screw up your e-book conversion. We will deal with this later on in this section.

Some of you out there might remember I said earlier (or might already be aware) that LibreOffice’s ODF format uses XML, so you might be a bit confused by me saying that LibreOffice’s styles do the same job as XML tags. Surely the file already has XML tags? And what’s more, if it is XML based, why doesn’t LibreOffice just use CSS (or XSL-FO)? The answer to the first of these two related questions is yes, it does use XML, but for better or worse, LibreOffice uses a flat XML format.

A flat XML format is one where the XML hierarchy is collapsed to just one level. If this sounds complicated, it is only because I’m explaining it badly. Hopefully a quick comparison will clear it all up. Here is an example of a fairly normal XML document:

<example> 
  <person> 
    <name> 
      <first>Robert</first> 
      <middle>A</middle> 
      <last>Wood</last> 
    </name> 
    <profession>Writer</profession> 
  </person> 
  <person> 
    <name> 
      <first>Ann</first> 
      <last>Example</last> 
    </name> 
    <profession>Scuba Diver</profession> 
  </person> 
</example>

And here is the same information encoded in a flat XML file.

<example> 
    <first-name>Robert</first-name> 
    <middle-name>A</middle-name> 
    <last-name>Wood</last-name> 
    <profession>Writer</profession> 
    <first-name>Ann</first-name> 
    <last-name>Example</last-name> 
    <profession>Scuba Diver</profession> 
</example>

These two files both contain the same information, and they can be converted to and from each other fairly simply using XSL. However, the ‘flat’ example is clearly evil and should be burned at the stake. LibreOffice, though, goes one step further down the path to hell. Its file format is more akin to the following:

<example>
    <para style="first-name">Robert</para>
    <para style="middle-name">A</para>
    <para style="last-name">Wood</para>
    <para style="profession">Writer</para>
    <para style="first-name">Ann</para>
    <para style="last-name">Example</para>
    <para style="profession">Scuba Diver</para>
</example>

As you can see, if it wasn’t for the style information, we would have no way of telling that our second person had omitted their middle name, as all we would have would be a mess of para elements. But this is pretty much what LibreOffice does, and it is not going to change any time soon, so we need to accept it/drink copious amounts of alcohol until we forget about it and move on with our lives.
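
To see how much work the style information is doing, here is a rough sketch of an XSLT file that turns the made-up para format above back into named elements. It assumes every style name is also a valid XML element name. Neither the para element nor this stylesheet is part of LibreOffice’s real format; both are purely illustrative.

```xml
<?xml version="1.0"?>
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <!-- Ignore the whitespace used to indent the source file. -->
  <xsl:strip-space elements="*"/>

  <!-- Keep the wrapper element and process each paragraph in order. -->
  <xsl:template match="example">
    <example>
      <xsl:apply-templates select="para"/>
    </example>
  </xsl:template>

  <!-- Rebuild an element whose name is taken from the style attribute.
       A para with no style falls through to the built-in rules, which
       output only its text. -->
  <xsl:template match="para[@style]">
    <xsl:element name="{@style}">
      <xsl:value-of select="."/>
    </xsl:element>
  </xsl:template>
</xsl:transform>
```

Take the style attribute away and there is nothing left for a stylesheet like this to work with, which is exactly why every piece of your manuscript will need a style.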

However, no amount of alcohol could dull the pain of style inheritance in LibreOffice. The way it is implemented is enough to have you scooping out your eyes with a spoon. In short, inheritance works backwards, i.e. rather than children inheriting from their parents, parents inherit from their children.

To a certain extent this is an inevitable result of the flat XML format: as there isn’t any hierarchy in the structure of the document (as everything is on the same level), it is impossible for inheritance to work normally (as nothing has any parents or children). However, I fail to believe that the solution implemented by LibreOffice’s styles to this problem is the best one. But again, realistically, we aren’t going to be able to change this, so we had better try to live with it.

The net result of this backwards inheritance is that, as you will see shortly, when defining your styles you have to start with the youngest child and work your way back up to the greatest grandparent.

All this goes some way to answering the second question posed earlier (i.e. why CSS or XSL-FO aren’t used by LibreOffice), as without a coherent hierarchy of elements in the XML document, most of the features of these technologies wouldn’t be usable.

Before all this annoys me too much, we need to get going. Before we begin, though, we are going to have to do a bit of planning. Although the way we are setting up LibreOffice means we will be able to change the formatting later, we need to pick some sort of formatting to begin with, and it makes sense to pick an initial formatting that is pretty close to what we want to end up with. Also, because of the backwards way style inheritance works, we need to plan which styles are going to inherit from each other.

Get Set

OK, so let’s get some style. If your novel is formatted in a pretty standard manner, the styles I’m going to outline shortly should be all you need to create. However, if you do need to create a few extra styles, do so. Just remember to add them to the export plugin when we write that in a minute. These styles serve a dual purpose. They are most obviously how we lay out the book so we can print it using services such as CreateSpace, but they are also what formats our book so it can be exported to e-book formats to be read on things like the Kindle.

My suggestions of reasonable defaults for these styles are based on creating a printed book that uses the 5.25″ x 8″ form factor (CreateSpace is American so it is simpler if we deal in backwards imperial measurements). Change them to suit your own needs and aesthetic tastes. There are no set rules, but remember that less is more, and be careful not to make your book hard to read. After all, no one is going to recommend your book because it uses a pretty font, but they will criticise it if that font is unreadable or just plain annoying (Comic Sans, I’m looking at you).

I’m also assuming here that you want your novel laid out in a fairly standard fashion, e.g. the first page has on it the novel’s title and the author’s name, the second page has on it the copyright info, novel parts start on left hand pages, chapters start on right hand pages, the first paragraph in every section starts flush to the margin, all other paragraphs have their first line indented, all dialogue is indented, page numbers are on the footer, the author’s name is on the right page header, book title is on the left, a short bio is on the last left page etc. Again, it’s your book. You’re free to monkey about with the layout as much as you like, but you might find it easier if you work through the tutorial with this layout to begin with, and then make the changes you require once you understand what everything does.

Right, hopefully you got from the previous chapter that everything in your book needs to have a style assigned to it. If it doesn’t, it won’t be exported. It is important that you understand that this includes white space. It is quite common, when you are writing, to add white space by hitting the enter key a few times. At best, this padding will be ignored when the file is exported. At worst, it may confuse the export and make your document unreadable. If you want to add some white space, you must do it by defining some padding in the style that you are using. Tabs are included in this. If you are indenting something (such as the beginning of a paragraph), don’t use a tab stop to do it. Instead, use a style that has its first line set to be indented.

Now, before we start creating these styles, I strongly recommend starting a new document in LibreOffice (we will paste in the text of your novel later). While it is theoretically possible to add the styles to a pre-existing document, you often encounter strange formatting errors that can easily baffle you if it is your first attempt. So click ‘New’, and let’s begin.

Paragraph Styles

Right, to the styles. There are three main types that we care about: page styles, paragraph styles and character styles. As you’ve probably guessed, the paragraph and page styles define the pages and paragraphs in your book. The character styles can be best thought of as providing overrides to the character style information included in the paragraph styles. We’ll start by creating the paragraph styles, as these are used by the other two types of styles. I’ll step through creating the styles, and then summarise them all in a table. If you are au fait with using styles, you can skip ahead and just use the table.

You should have a brand new, empty LibreOffice document in front of you (Click “File->New->Text Document” if not), and let’s begin! Click the styles button on the tool bar or “Format->Styles and Formatting” from the menus if you don’t have the button there.

LibreOffice Styles and Formatting Button

LibreOffice's Styles and Formatting Menu

Click the paragraph icon along the top of the dialogue box to make sure it’s selected then right click on the list of styles beneath it (which might be empty at this point) and select “New”.

Selecting paragraph styles from LibreOffice's styles and formatting dialog

Creating a new paragraph style in LibreOffice's Styles and Formatting dialog

The Organiser

This pops up the style dialogue box (if the “Organiser” tab isn’t currently selected, select it).

LibreOffice's Paragraph Style dialog

Right, the first thing we have to do is give our style a name. Call it “Novel-Paragraph” (you can choose your own names for the styles, but you must remember them when we write the plugin), and turn AutoUpdate on by clicking the checkbox. This enables you to make changes to the style later on and have the contents update themselves without having to reapply the style. We then have to set which style will follow after it. Paragraph styles only apply to a single paragraph, so when you hit return at the end of one, a new paragraph is created with a new style. What this style is, by default, is set here. Seeing as the most common thing to follow a Novel-Paragraph is another Novel-Paragraph, that is what you should set it to (if it isn’t already). We then have the “Inherit from:” selection. As this is our base style, we don’t want it linked to anything, so select “-None-”. Our final option is what category we put the style in; add it to custom styles. Finally on this tab, we have a summary of what we have set the style to be. Currently we haven’t set anything, so it is blank, but not for long.

LibreOffice's Paragraph Style dialog with values filled in

Let’s crack on and move on to the next tab, “Indents & Spacing”.

Indents & Spacing

As this is a standard paragraph, we need to indent the first line of it. How much you want to indent is up to you. To my eye, 0.64 cm looks good. Bear in mind that this is (and all the other settings are) just for the print version. We’ll set style information for the electronic version later, in the export plugin. If you want quite dense text, the other options can all be left as they are, though for a younger audience you might want to experiment with more white space. We do, however, need to turn on “Register-true”.

Setting the indent and turning on Register-true in LibreOffice

Register-true is a somewhat confusingly titled option. Its name is a hangover from the printing era. What it does is make sure that all the text (that has Register-true enabled) is aligned to the same baselines. If you open a book up and place a ruler across both pages, you will find that the text on both facing pages sits happily on it, making the layout more appealing and, when printed on thin paper that shows the other side through, easier to read.

Tommy Lightbreaker by Robert A Wood showing nicely aligned text thanks to Register-true

The way we achieve this is with Register-true. You want this turned on, or else your book is going to look messy. It can, though, give you slightly confusing problems. If you find later that some elements of your book’s contents aren’t sitting where you expect them to, consider whether it is because Register-true is moving them out of position and onto the nearest baseline (n.b. we will also have to turn Register-true on in the relevant Page Styles for it to do anything!)

Alignment

Next we have Alignment. Novels are always justified (i.e. the text goes right to the edges of the page and extra white space is added between words to enable this). This can look a bit odd if you are used to reading text left aligned on a computer screen, but it is just the way it is done. Other alignments might be technically better, but this is what people reading books expect, so live with it. One side effect of justified text you might want to keep an eye out for is what’s called a river of white, where, purely by chance, the way the white space has been distributed over a series of lines creates distracting patterns. If you notice any of these when proofing your book, you’ll have to manually edit the lines till the effect goes away.

An example of text with a river of white
A typographic river running down the middle of a text passage (above bottom word “amet”).

The next tab is “Text Flow”. Though some of the options here might sound like a good idea, as they can theoretically prevent things like having a page with just one line on it, in my experience they generally introduce more problems than they solve, so I would strongly recommend you leave everything on that tab unchecked.

Font

The next tab allows us to set the font and font size of the text in our paragraph. It can be very tempting to try and do something creative here, but I would urge you to resist. It is very hard to improve significantly on a basic serif font and very easy to turn your book into an unreadable mess (sans-serif fonts might look trendy, but they are harder to read, so steer clear of them for the main body of your book). I would bear in mind when making this decision that, to the readers of your book, there is no difference between an adequate choice of font and an excellent choice of font. Once you start reading, you are completely unaware of the font used (as long as the choice wasn’t abysmal!) Remember, the classics are classic for a reason, so start with Liberation Serif 10pt and don’t go very far away from it (Liberation Serif is a free version of Times New Roman).

Choosing a boring but functional font

As with white space, font size increases as the audience’s age/intelligence decreases. A fairly simple way to gauge what the correct font size is for your audience is to grab a random sample of books aimed at your target audience from the library and copy roughly what size they go for (remember, these choices, as already mentioned, only affect the printed copy of the book).

The rest of the settings can be left at their defaults. So, if we now go back to the Organiser tab, you should see these settings: Western text: 10pt + Register-true + Indent left 0.0cm, First Line 0.64 cm, Indent right 0.0cm (for some reason the fact the Alignment has been set to Justified isn’t shown in the current version, which doesn’t matter other than it is slightly confusing).

Organiser showing selected options.
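
If you are curious how these settings end up in the file, the fragment below is a hand-written approximation of the sort of style definition ODF uses. The element and attribute names come from the OpenDocument specification, but the exact markup LibreOffice writes will differ (it includes many more attributes), so treat this as an illustration only.

```xml
<office:styles
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0">
  <!-- Organiser tab: name, next style and AutoUpdate; no parent style is set. -->
  <style:style style:name="Novel-Paragraph"
               style:family="paragraph"
               style:next-style-name="Novel-Paragraph"
               style:auto-update="true">
    <!-- Indents & Spacing and Alignment tabs. -->
    <style:paragraph-properties fo:text-indent="0.64cm"
                                fo:text-align="justify"
                                style:register-true="true"/>
    <!-- Font tab. -->
    <style:text-properties style:font-name="Liberation Serif"
                           fo:font-size="10pt"/>
  </style:style>
</office:styles>
```

Notice that the style is just a named bundle of properties. Nothing here says what a Novel-Paragraph means, which is why the style names you choose matter so much when we come to export.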