Splitting up Co-dfns Documentation

The Co-dfns compiler relies heavily on a workflow built around a number of technical documents. These documents coordinate the various development processes and support the careful, stepwise refinement of a Function Specification into a compiler design that can be trusted and relied on.

In existing Increments, the compiler and all of its documentation were organized according to a "one file per document" approach. This meant that the entire Co-dfns compiler occupied a single file, and each of the 13 or so technical documents resided in a single file of its own. Some of these documents are very large. And, while Co-dfns is not the largest compiler by any means, the size of the file containing the compiler itself generally exceeded what most people consider a good file length.

I decided to split up the Co-dfns compiler documentation and the compiler file itself, and I chose to do so based on two primary factors. Firstly, no one except myself seemed to be comfortable working with large files. Secondly, in order to work with these large files, I rely heavily on my text editing capabilities, which admittedly can vary depending on which editor I am using at the moment. Normally I don't have a problem with this, but some editors do not allow easy traversal through a file based on nodes and elements. Thus, splitting up the files serves not only to potentially assuage those developers who balk at large file sizes, but could also aid navigation when using an editor that doesn't make things easy on you within a single file. Something like the GitHub interface comes to mind here.

However, one does not simply split up files like this. To give you a picture of the relative non-triviality of the task, let's take a look at the flat document tree before the split:

    319 Cleanroom_Engineering_Guide.xml
   1901 Engineering_Change_Log.xml
  45797 Function_Specification.xml
     45 Increment_Certification_Report.xml
    169 Increment_Construction_Plan.xml
    677 Increment_Test_Plan.xml
    499 Project_Record.xml
   2521 Software_Architecture.xml
    629 Software_Development_Plan.xml
    388 Software_Requirements.xml
     34 Statement_of_Work.xml
   4116 Statistical_Testing_Report.xml
   1554 Usage_Specification.xml
  58649 total

Each document is a DocBook v5 XML document. On the one hand, this has many advantages, some of which we will use later on to make this whole task manageable. On the other hand, this also means that some things are verbose to write down. The Function Specification takes up the bulk of the documentation burden, and comes in at over 200 pages of tables and other structured data.

What's more, splitting these documents into multiple smaller files can make working with them harder, not easier. If the files are organized and divided in a logical manner that actually reduces the burden of in-file navigation, then the division helps. If, however, the division does not eliminate the need for sophisticated or non-trivial in-file navigation, then it only increases the amount of work to be done. Instead of making it easier to find things, it is now harder, because one must first find the right file and then find the right stuff inside of that file, rather than just finding the right stuff inside of a single file. This is important. I've seen too many multiple-file layouts that make it difficult to find what you want because they are organized in such a way as to increase the navigation burden, not lessen it.

What we really want is the ability to do nearly all of our navigation in the file system and then do nothing but normal editing once we open up a file. That is, we should be able to get to almost exactly where we want to be through the file system, and should not feel the need for additional navigation once we are inside of a file. Realistically, that means that you don't want to have complex structure inside of the file if you can avoid it. For these documents, we pick a small logical unit of organization as the unit of division. Basically, if you were to take the tables of contents that would normally be associated with such documents, you'd get a pretty good idea of what the file tree ought to look like. So, we divide on section boundaries, but we also tend to lift tables and figures out into their own files if they are big enough, or if there are enough of them within a single section to warrant it (there often are). Finally, complex sets of enumerations, where each item might cover a paragraph (rather than, say, something less than a sentence in length), also get divided out.
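As a rough illustration of letting the table of contents drive the file tree, here is a minimal Python sketch. It is not part of the Co-dfns tooling. It assumes DocBook v5 `section` elements with `title` children, and the names `slug` and `planned_paths`, the sample document, and the underscore-based naming scheme are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

DB = "{http://docbook.org/ns/docbook}"  # DocBook v5 namespace in ElementTree form

# Hypothetical sample document, used only to exercise the sketch.
SAMPLE = """<chapter xmlns="http://docbook.org/ns/docbook">
  <title>Sample Chapter</title>
  <section><title>Example Section</title>
    <section><title>Nested Part</title></section>
  </section>
  <section><title>Another Section</title></section>
</chapter>"""


def slug(title):
    """Turn a section title into a file-system-friendly name (assumed scheme)."""
    return re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")


def planned_paths(element, prefix=""):
    """Yield one relative file path per <section>, nested like the outline."""
    for child in element:
        if child.tag == DB + "section":
            title = child.findtext(DB + "title", default="untitled")
            path = prefix + slug(title)
            yield path + ".xml"
            # Subsections go into a directory named after their parent section.
            yield from planned_paths(child, path + "/")


paths = list(planned_paths(ET.fromstring(SAMPLE)))
print("\n".join(paths))
```

The point of the sketch is only that each path can be read off the outline, so locating a section becomes a file-system lookup and not a search inside a large file.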

This isn't easy to do. There is a strong element of manual effort here, but there are also many things that need to be automated if there is to be any hope of getting this done in a reasonably reliable fashion and on time. That's where having each of these documents organized as a DocBook document comes in handy. On some of the larger documents, or where there are many similar elements, I did not rely on a text editor to manually lift out elements and replace them with the appropriate include node. Instead, I created a conversion workspace in Dyalog APL to read in the XML data and let me manipulate the DOM tree directly through APL. With a few helper functions, I'm able to take the tree apart at whatever points are appropriate and store the pieces into their files with much more ease than if I had to do this manually. Indeed, I think it could take a week to split out the Function Specification document by hand without the aid of automated processing. This demonstrates one of the great benefits of having your documents stored in a machine-parsable format that you can work with. It's also another reason why semantic markup wins over presentation markup.
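The lift-and-replace step described above can be sketched outside of APL as well. The following Python version is only an illustration under stated assumptions, not the code used for Co-dfns. It assumes DocBook v5 `table` elements and XInclude `xi:include` nodes, and the names `lift_tables` and `make_href`, the sample document, and the numbered `tables/NNN.xml` naming are hypothetical.

```python
import xml.etree.ElementTree as ET

DB = "http://docbook.org/ns/docbook"      # DocBook v5 namespace
XI = "http://www.w3.org/2001/XInclude"    # XInclude namespace
TABLE = f"{{{DB}}}table"
INCLUDE = f"{{{XI}}}include"
ET.register_namespace("", DB)             # serialize DocBook as the default namespace
ET.register_namespace("xi", XI)

# Hypothetical sample document, used only to exercise the sketch.
SAMPLE = """<section xmlns="http://docbook.org/ns/docbook">
  <title>Example</title>
  <table><title>First</title></table>
  <para>Text between tables.</para>
  <table><title>Second</title></table>
</section>"""


def lift_tables(root, make_href):
    """Swap each <table> below root for an <xi:include>.

    Returns a dict mapping each href to the detached table element, so the
    caller decides how and where the extracted files are written.
    """
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    tables = [t for t in root.iter(TABLE) if t is not root]
    lifted = {}
    for number, table in enumerate(tables, start=1):
        href = make_href(number)
        include = ET.Element(INCLUDE, {"href": href})
        include.tail = table.tail          # keep the whitespace after the table
        table.tail = None
        parent = parents[table]
        parent[list(parent).index(table)] = include
        lifted[href] = table
    return lifted


root = ET.fromstring(SAMPLE)
lifted = lift_tables(root, lambda n: f"tables/{n:03d}.xml")
print(ET.tostring(root, encoding="unicode"))
```

Each value in `lifted` could then be written to its own file with `ET.ElementTree(element).write(path)`, and an XInclude-aware processor can reassemble the whole document from the pieces.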

After all is said and done, let's take a look at the explosion of files we have in our tech/ tree:

$ find . | wc -l # The original tree node count
24
$ find . | wc -l # The split documentation node count
817

Wow, that's a huge increase! You're welcome, everyone. You now have a much more sophisticated file and source tree layout for Co-dfns documentation, and it's about as good as it could get. This also gives some evidence of the amount of complexity that is actually contained inside of the technical documents, as well as the sheer amount of information. For obvious reasons, I'm not going to show the line counts for each of these files. However, the vast majority are now under 100 lines, which I consider a good metric for editability without requiring complex in-editor navigation. A few of them are larger than that, but none, I believe, are over 2000 lines. And I think fewer than five of those are actually greater than 1000 lines.
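A line-count claim like this is easy to check mechanically. Here is a minimal Python sketch, assuming the split files all end in `.xml`. The function name `long_files` and its parameters are hypothetical and not part of the Co-dfns repository.

```python
from pathlib import Path


def long_files(tree, limit=100, pattern="*.xml"):
    """Return (line_count, path) pairs for files under tree with more than limit lines.

    The result is sorted with the longest file first.
    """
    results = []
    for path in Path(tree).rglob(pattern):
        if not path.is_file():
            continue
        # Count lines in binary mode so the encoding of the file does not matter.
        with path.open("rb") as handle:
            lines = sum(1 for _ in handle)
        if lines > limit:
            results.append((lines, path))
    return sorted(results, reverse=True)
```

Calling `long_files("tech")` would list every XML file in the tree that exceeds 100 lines, and `long_files("tech", limit=1000)` would narrow that down to the few real outliers.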

Finally, here's the code that I used to do this splitting:

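⍝ Rows are ⎕XML matrix rows: depth, tag, content, attributes. Indexing assumes ⎕IO←0.
⍝ CR, xmlns, sanin, sanout and putfile are not defined in this listing.
caption←{⊃{⍺,' ',⍵}/2⌷⍉⊃((1+⊃⍵)=0⌷⍉⍵)⊂[0]⍵}  ⍝ text of the first child subtree, joined with spaces
crconv←{¯1↓¨(¯1⌽CR=x)⊂x←⍵,CR}  ⍝ split a character vector into lines at each CR
doit←{(1+⍳⍴⍵)write¨⍵}  ⍝ write each item, numbering the items from 1
gettables←{{((⊃⍵)≥0⌷⍉⍵)⊂[0]⍵}¨((1⌷⍉⍵)∊⊂'table')⊂[0]⍵}  ⍝ group rows from each 'table' row, cut where depth returns to the table's level
kids←{((⍺+⊃⍵)=0⌷⍉⍵)⊂[0]⍵}  ⍝ partition rows at each node ⍺ levels below the first row
lift←{w←⍵ ⋄ w[;0]-←⊃⍵ ⋄ w}  ⍝ shift depths so the first row sits at depth 0
mkfn←{'tables/',⍵,'.xml'}  ⍝ file name for an extracted table
mkinc←{⍺←1 ⋄ ⍺'xi:include' ''(1 2⍴'href'(mkfn ⍵))}  ⍝ an xi:include row at depth ⍺ (default 1)
numstr←{((⍺⍴10)⊤⍵)⊃¨⊂'0123456789'}  ⍝ ⍵ as an ⍺-digit, zero-padded decimal string
repl←{((~∨\(1⌷⍉⍵)∊⊂'table')⌿4↑[1]⍵)⍪⊃⍪/tableinc¨gettables ⍵}  ⍝ rows before the first table, then each table swapped for its include
rootify←{w←lift ⍵ ⋄ w[0;3]⍪←⊂xmlns ⋄ w}  ⍝ lift a subtree and add xmlns to its root attributes
sanitize←{⊃,/(sanin⍳⍵)⊃¨⊂sanout}  ⍝ map each character through the sanin → sanout lookup
tableinc←{⊃⍪/(⊂1 4⍴(⊃⊃⍵)mkinc sanitize caption⊃⍵),4↑[1]¨1↓⍵}  ⍝ include row for one table, followed by the rows after it
write←{1 putfile(⎕←mkfn ⍺)(xmlconv ⍵)}  ⍝ print the file name, then write the XML text to it
xmlconv←{(⎕UCS'UTF-8'⎕UCS⊢)¨crconv ⎕XML ⍵}  ⍝ serialize with ⎕XML, split into lines, UTF-8 encode each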

These are the definitions I used to do the automatic processing for the Function Specification. For each document I would tweak these definitions a little to get something that worked for the specific task at hand. This code is meant mostly as a one-off part of transforming the documents, and isn't intended to be the most elegant code. However, it is useful for demonstrating how I can leverage APL to perform quick automation with very few lines of code. I probably would not have attempted such a transformation with any other language, though it could probably be done reasonably with some others. It's just a little out of the "easy to do" range for me in other languages, even my beloved Scheme.

Please support my work via GitTip.