DocBook and Modern Publishing

As researchers, developers, writers, or other producers of information and content, we create written works in one fashion or another. The goal, usually, is to disseminate this information widely, in a form that the audience can readily consume and that usefully extends their knowledge. In the current world of academic research, there are two main means of doing this, which I will call the informal and the formal distribution channels. The formal ones include publication of papers to various journals, conferences, and workshops. They might also include the publication of books, and usually do at some point. While much of this article is relevant to book publication, it applies more directly to the more commonly used formal channels of distribution. The informal channels include things like our academic and research blogs, our websites, and our talk slides.

In this world, among scientific and technical publications, the two most common means of producing content for distribution to formal channels are LaTeX and Microsoft Word. For informal content, a huge number of potential formats exist, but often they are the same as for the formal channels, usually LaTeX or Word. Presentations may also involve PowerPoint and other specialized slide formats. For the really casual publication, often the content that is most accessible to the uninitiated, the standard Web markup languages tend to see a lot of action. I know that many people in favor of open publishing and the like point to the Web languages, such as (X)HTML, as much better alternatives to LaTeX and its kin because they are easily consumed by anyone with a web browser.

All of these formats share a common historical rooting, and the state of modern publication in these channels shares this history: each of them, no matter the format, tends to focus on the targeted publication of a given document to a particular medium with a particular formatting. That is, they emphasize a specific look and feel targeting a specific distribution and consumption mechanism: they are tightly bound up in the publication target.

This tight coupling of the source and target has a few advantages. Often, the produced document looks pretty good on the target (though, not always), and it is usually quite practical and usable there. Moreover, the workflow to get from the source language to the target is often quite easy. Interestingly, three of the most common languages for casual academic publishing (Word, LaTeX, and HTML) can all be used either structurally or visually. That is, you can, for example, use LaTeX in a very strict manner, so that the markup remains very structural and relatively independent of the resulting visual presentation, or you can minutely control the visual appearance of the end result. The same is true of the other two major source languages.

Also rather interestingly, each of these sources tends to natively support one target above the others, with Word and LaTeX both tending to focus on paginated content and with HTML traditionally focusing on continuous, un-paged screens for online viewing. Both targets introduce some problems. On-screen viewing in a continuous format requires good reflow behavior that happens dynamically. It would also be nice if the traditional measures of typographic quality were available there, such as good paragraph breaking and good spacing for textual content. Fortunately, many modern browsers are starting to get some of these features. However, while some people would like to disagree, paginated content should not be considered a dying presentation format. Indeed, there are many compelling use cases for well-designed paginated content in today's set of media, and paginated content requires a different set of features than continuous text to look good. In particular, you want a degree of pixel-perfect accuracy when working with paginated content that you may not be able to get with continuous content. You may also want good pagination algorithms that ensure the right content goes onto the right page. The right combination of these design choices makes a big difference in how nice the content looks. In paginated content, you have a number of great opportunities for making really interesting presentation choices that contribute to the overall content in important ways.

Unfortunately, while each platform and medium for distribution has its benefits and disadvantages, clearly there is not one best solution, and I doubt that there ever will be. And in today's world, requiring your content to be consumed in a specific format designed for a specific medium can result in a severely degraded experience for people who are consuming your information in another fashion. This ranges from trying to read paginated content on small-screen devices that use a different aspect ratio, to trying to read continuous content on large screens (which can result in wasted real estate or difficult-to-read formatting), and also includes other important cases, such as blind or visually impaired readers trying to read your content using a screen reader or a braille display. Clearly there does not exist a one-size-fits-all means of distributing content. Source formats that privilege one distribution format above others only make this problem worse and are not at all desirable.

Politically, however, we cannot expect the whole world to suddenly change how it does things just at the drop of a hat. A good solution is one that allows the community at large to alter its publication habits while still maintaining compatibility with the requirements of formal distribution channels. But even then, you want a solution that actually scales to the needs of the formal distribution channels, so that in the future you can actually have a hope that they will adopt this new and better way of doing things.

So what does such a solution look like? We want something that is neutral to the particular publishing target, so that we can target all the different platforms equally well. At the same time, we need to be able to leverage the features of those platforms adequately to enable us to produce not only adequate content, but high-quality content. After all, it is no solution that asks us to accept lower quality end products.

Many people have realized that structurally oriented document preparation can make their lives easier, because it allows for a separation of concerns between the presentation of content and the content itself. You will find this in active practice with each of the three major source formats that we have previously mentioned. However, above and beyond that, one of the key insights of Semantic Web research is that structure alone is not what makes content truly accessible. What matters is the ability to scalably associate relevant semantic information with content at an appropriate resolution. If your document is structured using semantic labels instead of just structural labels, it becomes that much easier to lift a document away from a specific target platform and into true platform neutrality.

Nonetheless, semantic content cannot be a one-size-fits-all proposition. Without the ability to programmatically extend the semantic vocabulary to suit the needs of the author and publisher, one cannot really expect content to be scalable. Without a computer-friendly interface for defining these new vocabularies, we also cannot rely on the semantic markup to scale well, because the computer will have no way to understand a new vocabulary without a rewrite. There must be a well-defined means of establishing and automatically consuming the vocabulary used by semantic documents as well as the documents themselves. It should also be possible to automatically produce content suitable for consumption by traditional distribution channels, which in the case of casual academic publishing means Word, LaTeX, and HTML.

I know that a lot of people in my community will probably laugh at this paragraph, but there already exists a mature framework for doing exactly this. While not tremendously popular in my circles of academia, the standards surrounding XML are the only mature standards of which I am aware that meet all the requirements listed in the above paragraph. What is more, XML has a strong precedent for use in the publishing industry at scale (though not as an authoring format). XML also comes with a host of tooling support in nearly every language and platform that you can imagine, making it perhaps one of the most, if not the most, transferable structured interchange formats on the planet. When it comes to the underlying technical framework to support the kind of target-neutral publication that we should all desire, XML comes about as close to solving the problem as we could ask for. If only it were not so blasted ugly to look at. On the other hand, any competent Programming Languages researcher will easily be able to write up their own lexical syntax that suffices for their use, without harming the rest of us. All other academic authors will not care, so it does not matter.

The last piece of the puzzle, and in some ways the most critical, is having a pre-made semantic vocabulary for the authoring of our academic material, preferably in a public, easily consumed standard that everyone can use and consume. I mean, if we do not have the common vocabulary, all the underlying technical framework in the world is not going to do us any good. As luck and the insatiable laziness of the programming world would have it, just such a vocabulary exists, and it is a good one.

DocBook is a standard publishing vocabulary, defined in RELAX NG, for the authoring of technical documents. It is semantic, rather than presentation oriented, and is very rich. This means that, out of the box, for most technical documents, no extension of the language will be necessary. Moreover, there are a number of well-written, publicly available XSL stylesheets that allow for the seamless production of XSL-FO, LaTeX, DocX, and HTML output from this semantic content. This means that documents composed in DocBook will integrate into existing workflows, allowing you to easily adopt it for your own use without imposing a significant change on those around you. Unless, of course, that is something that you want. The XSL-FO target is interesting here, because it enables the production of paginated documents in PDF form without a dependency on LaTeX as the output engine. In practice, this means that you can tweak the formatting of paginated content using a saner layout and formatting language than TeX if you want to do so.
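To make "semantic, rather than presentation oriented" a little more concrete, here is a minimal sketch of what a small DocBook 5 article might look like, along with a few lines of Python that read it. The sample document, its title, and the summarize helper are all made up for illustration; only the DocBook namespace and the element names (article, info, title, section, para, command, filename) come from the standard. Note that the markup says what a piece of text is (a command, a file name), not how it should look.

```python
import xml.etree.ElementTree as ET

DOCBOOK_NS = "http://docbook.org/ns/docbook"

# A hypothetical, tiny DocBook 5 article. The content is invented;
# the element names are standard DocBook.
SAMPLE = """<article xmlns="http://docbook.org/ns/docbook" version="5.0">
  <info>
    <title>Measuring Cache Behavior</title>
    <author><personname>A. Researcher</personname></author>
  </info>
  <section>
    <title>Method</title>
    <para>We ran <command>perf stat</command> against the file
    <filename>bench.c</filename>.</para>
  </section>
</article>"""


def summarize(xml_text):
    """Return the article title and the (tag, text) pairs of the
    semantic inline elements found directly inside paragraphs."""
    root = ET.fromstring(xml_text)
    ns = {"db": DOCBOOK_NS}
    title = root.findtext("db:info/db:title", namespaces=ns)
    inline = []
    for para in root.iter("{" + DOCBOOK_NS + "}para"):
        for child in para:
            # Strip the "{namespace}" prefix to get the bare tag name.
            inline.append((child.tag.split("}")[1], child.text))
    return title, inline


title, inline = summarize(SAMPLE)
print(title)
print(inline)
```

Because the labels are semantic, a tool can ask questions such as "which commands does this paper mention?" without knowing anything about fonts or layout.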

The DocBook standard allows you to leverage existing vocabularies for handling all manner of additional semantic requirements that might not be a part of the DocBook standard, making it supremely scalable. As an example that is quite relevant to most academic, technical publications, instead of defining its own math layout vocabulary, it uses the pre-existing MathML standard, so that you get all the benefits of that standard, used together with the DocBook standard, without requiring change on behalf of either. Individual publishing houses or conferences can define their own formal vocabulary that will integrate into the DocBook standard and distribute it for use by authors. Authors who use the appropriate tools will then be able to validate their documents as well defined for the specific needs of the conference without additional labor. Authors publishing to channels that do not have support for DocBook directly can first process their documents into the format appropriate for that channel without affecting the neutrality of the source document. Indeed, without modification, the same source document can be used for pre-press and draft copies as well as the final version, and the same source can even be used to automatically generate anonymized copies without needing to keep around separate copies of the same document.
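The way two vocabularies live side by side is through XML namespaces. The following is a small sketch, assuming a made-up section that embeds a MathML formula inside a DocBook equation element; the sample content and the count_by_vocabulary helper are hypothetical, while the two namespace URIs are the standard ones for DocBook 5 and MathML.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for the two vocabularies.
NS = {
    "db": "http://docbook.org/ns/docbook",
    "mml": "http://www.w3.org/1998/Math/MathML",
}

# A hypothetical section: DocBook for the document structure,
# MathML for the formula itself.
SAMPLE = """<section xmlns="http://docbook.org/ns/docbook"
         xmlns:mml="http://www.w3.org/1998/Math/MathML">
  <title>Energy</title>
  <equation>
    <title>Mass-energy equivalence</title>
    <mml:math>
      <mml:mi>E</mml:mi><mml:mo>=</mml:mo>
      <mml:mi>m</mml:mi>
      <mml:msup><mml:mi>c</mml:mi><mml:mn>2</mml:mn></mml:msup>
    </mml:math>
  </equation>
</section>"""


def count_by_vocabulary(xml_text):
    """Count how many elements belong to each vocabulary."""
    root = ET.fromstring(xml_text)
    counts = {prefix: 0 for prefix in NS}
    for element in root.iter():
        for prefix, uri in NS.items():
            if element.tag.startswith("{" + uri + "}"):
                counts[prefix] += 1
    return counts


def math_identifiers(xml_text):
    """Return the text of every MathML identifier (mi) element."""
    root = ET.fromstring(xml_text)
    return [mi.text for mi in root.findall(".//mml:mi", NS)]


print(count_by_vocabulary(SAMPLE))
print(math_identifiers(SAMPLE))
```

Neither vocabulary needs to know about the other: a tool that only understands DocBook can skip the math, and a MathML-aware tool can pick out the formula without caring about sections and titles.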

Outside of formal channels, if DocBook is the source format, then you can use that same document to generate both an HTML and PDF version of your document in high-quality forms. That is, you will not be second-classing either format for the other. Each format will be equally readable and well presented with quality layout and design. That is more than can be said for a lot of projects. One of the places where you see DocBook being used very successfully right now is in the production of large technical documents for software products that need to distribute their documentation in a variety of formats.

The separation of style from the semantics means that you gain the ability to customize the presentation of your document to a variety of different styles if you so choose, simply by keeping around different stylesheets. Indeed, these stylesheets are not document specific, meaning that not only is the content divorced from the presentation, but the presentation is similarly independent, and will work with a variety of different documents. This can be seen in the public DocBook XSL stylesheets, which provide great defaults for producing content from DocBook sources in a variety of formats.
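The idea can be sketched in a few lines of Python. To be clear, this is a toy stand-in and not how the DocBook XSL stylesheets work: real stylesheets are written in XSLT and handle far more cases. Here a "stylesheet" is just a hypothetical table mapping each tag to an opening and closing string, the sample content is invented, namespaces are omitted for brevity, and no escaping of special characters is attempted.

```python
import xml.etree.ElementTree as ET

# Invented sample content, using DocBook-style element names
# without the namespace to keep the sketch short.
SOURCE = (
    "<section><title>Results</title>"
    "<para>The run took <emphasis>four</emphasis> seconds.</para>"
    "</section>"
)

# Two toy "stylesheets": each maps a tag to (opening, closing) text.
HTML_STYLE = {
    "section": ("<div>", "</div>"),
    "title": ("<h2>", "</h2>"),
    "para": ("<p>", "</p>"),
    "emphasis": ("<em>", "</em>"),
}
TEXT_STYLE = {
    "section": ("", ""),
    "title": ("== ", " ==\n"),
    "para": ("", "\n"),
    "emphasis": ("*", "*"),
}


def render(element, style):
    """Recursively render an element using the given style table."""
    opening, closing = style[element.tag]
    parts = [opening, element.text or ""]
    for child in element:
        parts.append(render(child, style))
        parts.append(child.tail or "")  # text that follows the child
    parts.append(closing)
    return "".join(parts)


root = ET.fromstring(SOURCE)
print(render(root, HTML_STYLE))
print(render(root, TEXT_STYLE))
```

The same source yields both outputs, and either style table would work unchanged on any other document that uses the same four tags, which is the independence described above.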

Now, I know that a lot of people are going to complain at this point that authoring anything in XML is just too much to ask anyone to do. The first response that I have to that is to say that writing in straight DocBook is not actually any different from authoring in HTML, except that you have more tags that you can use to create semantic (not visual) distinctions in your document, so it ends up being even more pleasant. On the other hand, there are certain types of documents that are not very semantically complex, and would use only a tiny portion of the semantic vocabulary. Many of these documents are mostly prose and have little more than paragraphs and sections in them. In this case, while the markup may not be that complex, I can understand why someone might want to stare at something a little nicer looking. In that case, I actually recommend that they start authoring their content in straight Markdown, which is nearly as structured for such a simple document, but which has a much less verbose format. However, as soon as you are tempted to insert some HTML to do something that the Markdown format does not support, that is an indication that you should move to authoring in full-blown DocBook.

The great thing about DocBook is that it makes all of your documents archive ready. It is a text-based standard that will ensure that your documents are always usable. I currently know of no better way to make your documents future ready than this. It is even better, I argue, than plain text, because it enables you to avoid presenting certain information using an ad hoc format, and instead ensures that any critical semantic content can always be presented in the appropriate format of the day.

Because most of the existing tools are easy to script, DocBook integration into existing workflows should be very easy. If you are already using a build script to render your documents, then you can simply add in a new stage to the compilation that will get you back to where you were. If you are not using a build script, it is usually one command to get you to the final rendered document unless you want to target a specific intermediate format for your workflow, in which case you might as well add some scripting to the mix.

Many tools exist for processing DocBook documents, from free software to enterprise-quality commercial offerings. The easiest to get started with are probably the Xalan and FOP applications from Apache, which are cross-platform and stable and which enable PDF output using XSL-FO. You will want to obtain the DocBook 5 XSL Stylesheets, which you can often find online or through your package manager. Finally, get your favorite text editor and see if it is schema aware, as that will enable you to edit while automatically ensuring that your document is valid. It can also give you hints about what tags are legal where and about the attributes that are available for a given tag. The XSL Stylesheets, when used with Xalan, will enable you to produce DocX, XSL-FO, LaTeX, and HTML output from your DocBook sources easily. And that is really all you need to be productive and produce some great looking documents.
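As a rough sketch of how such a workflow might be scripted, the following Python builds the two commands for a DocBook to XSL-FO to PDF pipeline. It assumes that Xalan-Java is on your Java classpath and that the fop launcher is on your PATH; the file names and the stylesheet location are placeholders, since where the DocBook XSL stylesheets live depends on how you installed them. The build_commands helper is hypothetical, and the script only constructs the commands rather than running them.

```python
from pathlib import Path


def build_commands(source, stylesheet, out_dir="build"):
    """Return the two commands for a DocBook -> XSL-FO -> PDF build.

    source:     path to the DocBook XML file
    stylesheet: path to an FO stylesheet (placeholder; the real
                location depends on your installation)
    """
    src = Path(source)
    fo_file = Path(out_dir) / (src.stem + ".fo")
    pdf_file = Path(out_dir) / (src.stem + ".pdf")

    # Stage 1: Xalan applies the stylesheet to produce XSL-FO.
    # Assumes the Xalan jars are on the Java classpath.
    to_fo = [
        "java", "org.apache.xalan.xslt.Process",
        "-IN", str(src),
        "-XSL", str(stylesheet),
        "-OUT", str(fo_file),
    ]
    # Stage 2: FOP renders the XSL-FO file to PDF.
    to_pdf = ["fop", "-fo", str(fo_file), "-pdf", str(pdf_file)]
    return [to_fo, to_pdf]


# Placeholder file names, for illustration only.
for command in build_commands("paper.xml", "docbook-fo.xsl"):
    print(" ".join(command))
```

To actually run the build, you could create the output directory and pass each command to subprocess.run(command, check=True), or translate the same two stages into a Makefile or shell script.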

In summary, the future of publishing is in formats that are platform and media independent, and you as an author are well served by using formats that allow you to get your content in high-quality form onto as many different platforms as possible. You want to make your content accessible to the consumer. For technical publishing, DocBook literally sets the standard and makes it possible for you to future-ready your documents, ease publication to many different venues, maintain a high quality of rendered output, and still integrate into existing publishing workflows. I highly recommend that people make use of these technologies to improve the quality of the documents that they produce and ease their publishing burden.