Let’s take a look inside our test file. Open it up in your text editor, and if you ignore all the info we’re not really interested in, you should be able to see that the LibreOffice document has the following basic structure (in simplified form):
<office:document>
  <office:automatic-styles>
    <style:style>Here are the automatic styles.</style:style>
    …
  </office:automatic-styles>
  <office:body>
    <office:text>
      <text:p text:style-name="Novel-Part-Title"/>
      <text:p text:style-name="Novel-Chapter-Title"/>
      …
      <text:p text:style-name="Novel-Paragraph">Here is the text of our book.</text:p>
      …
    </office:text>
  </office:body>
</office:document>
All we need to do is transform it into something resembling this (again in simplified form):
<html>
  <head>
    <style>The styles that format our book.</style>
  </head>
  <body>
    <div id="Novel-Part">
      <div id="Novel-Chapter">
        <p>Here is the text of our book.</p>
        …
      </div>
    </div>
  </body>
</html>
As you can see, they aren’t a million miles away from each other. It is mostly a case of just renaming the elements. The big exception to this is that, as I have said before, the .odt file has a flat layout that we want to turn into a nice hierarchical one. We are also going to have to do a bit of work to get rid of the automatic styles.
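To give a feel for what "renaming the elements" means in practice, here is a rough sketch of a single rule that turns a body paragraph into an HTML p element. It is only an illustration, not the template we will end up with: it assumes the paragraph carries the style name Novel-Paragraph directly, as in the simplified structure above, whereas a real document may point at an automatic style instead. The namespace declarations it relies on are explained further down.

```xml
<?xml version="1.0"?>
<!-- Illustrative sketch only: renames one kind of paragraph. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    exclude-result-prefixes="text">

  <!-- Assumption: body paragraphs use the style name from the
       simplified example, not an automatic style. -->
  <xsl:template match="text:p[@text:style-name='Novel-Paragraph']">
    <!-- Emit an HTML paragraph and carry on processing its contents. -->
    <p><xsl:apply-templates/></p>
  </xsl:template>

</xsl:stylesheet>
```

Every other element falls through to the default rules, so this sketch on its own only wraps the matching paragraphs in p tags; it does nothing about the hierarchy or the automatic styles.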
Now, there are a lot of different ways we can accomplish the export of our novel with XSLT. Luckily, we don’t really need to consider the efficiency of our stylesheet, as it will only be run once every few years!
Let’s get to writing our template. If you are writing your own, I’m sure you’re aware, but just in case, you need to do so in a text editor (like gedit) and, for the sake of clarity, it should have the file extension “.xsl”. I won’t go back over any other basics, as you can recap them in the Quick Intro to XSLT article if need be.
Right, the first thing we need to deal with is the namespaces. You might have noticed there are quite a lot of them mentioned in the LibreOffice document. Luckily, we only need to deal with the ones that we are actually using. A quick look at the example document will show you that we need a minimum of three namespace prefixes for the input document: office, style and text (if there are any images to deal with we will also need a fourth, draw, and a fifth, xlink, if there are any links). The output file uses the default namespace for the XHTML elements, so we don’t need to worry about that, but it will also use some extras from Amazon. These use the mbp namespace prefix. As far as I know, there isn’t an official namespace for it, so I just use “http://www.amazon.com”. That just leaves the xsl namespace itself.
Now, we don’t want all of these namespaces to show up in our output, so we need to exclude those that are only relevant to the stylesheet. We do that with the “exclude-result-prefixes” attribute.
If you are going to try this out in your browser, you need to take one further step: you need to tell it that it is getting HTML. You do this with the method attribute of the output element (remember to remove it again when your template is finished). However, if you are going to be testing with a command-line processor (such as xsltproc), you are probably better off leaving it at xml.
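As a minimal sketch of the command-line variant, the output element would look like this. Only the method value differs from the browser version; the namespace declarations for the input document are left out here to keep the sketch short.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Variant for testing with a command-line processor such as xsltproc. -->
  <xsl:output method="xml" indent="yes"/>

</xsl:stylesheet>
```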
Altogether this gives us:
<?xml version="1.0"?>
<!-- First set up the namespaces. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
    xmlns:mbp="http://www.amazon.com"
    exclude-result-prefixes="xlink office style draw text">

  <xsl:output method="html" indent="yes"/>

</xsl:stylesheet>
This is a fully functional stylesheet, if not a very useful one. Still, give it a quick test to check that it is all working. As we don’t have any templates defined, the default ones will be used. These just output the text content of all of the elements, so you should see all the text of your test document output in one long line.
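If you are curious what those default templates actually do, they behave as if the rules below had been written out by hand. These are the built-in rules described in the XSLT 1.0 specification; the sketch wraps them in a bare stylesheet, so running it should give you the same wall of text as the empty stylesheet does.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html" indent="yes"/>

  <!-- The root and every element: output nothing for the node itself,
       just keep walking down into its children. -->
  <xsl:template match="*|/">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Text and attribute nodes: copy their string value to the output.
       Attributes are only reached if a template selects them explicitly,
       which is why attribute values do not appear in the default output. -->
  <xsl:template match="text()|@*">
    <xsl:value-of select="."/>
  </xsl:template>

  <!-- Comments and processing instructions: output nothing. -->
  <xsl:template match="processing-instruction()|comment()"/>

</xsl:stylesheet>
```

Any template we write later takes priority over these built-in rules for the nodes it matches, which is what lets us build up the transformation one element at a time.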