Part 1 of a series of posts on building an RTF parser in Ruby.
The impetus for writing an RTF parser, a project that started in the early 2000s, was my earlier work in computer-based graphic design and publishing.
Consistently formatting a document becomes much simpler and more reliable when there is a way to “centrally control” formatting.
This became possible with the introduction of paragraph styles in Microsoft Word version 3, which dates from the late 1980s [1].
Shortly thereafter, PageMaker version 3, released in 1988, added paragraph styles. QuarkXPress, initially released in 1987, also provided support for styles around that time.
These were crucial developments. The combination of Microsoft Word with PageMaker or QuarkXPress provided a structured document workflow for the new generation of desktop publishing tools [2].
Paragraph styles could be used to centrally control the formatting of document content.
A New Tool to Streamline Document Preparation
Despite the availability of these structuring mechanisms in Microsoft Word (the most widely used writing tool at the time), clients remained mostly ignorant of the concept of structuring, and would invariably supply unstructured documents.
(see previous post: A Closer Look at Document Structuring)
It was up to me to structure these documents. I repeated this manual task many times over many years.
Gradually I began to plan a new kind of tool that would make the document structuring process much faster and easier.
Choosing a Structured Document Format
Such a tool would need to read and write files using file formats that were in widespread use and openly documented. Microsoft Word’s native file format on the Mac was binary and proprietary. But RTF (“Rich Text Format”) was an ASCII text format with a published specification [3].
RTF was (and still is) supported by many word processors, which makes it a feasible candidate for a structured document format, despite being more complex than that task requires.
In addition, it is recognised, albeit with differences in interpretation, by most desktop publishing programs.
If the new tool was to read and understand the contents of an RTF file, then a parser for RTF would be required. Building such a parser became the first task of the project.
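To make the task concrete, here is a minimal sketch of what the first stage of such a parser might look like: a tokenizer that splits RTF source into groups, control words, control symbols and text. It is an illustration only, not the parser developed in this series. The method name `tokenize_rtf` and the token layout are assumptions made for this example, and the sketch ignores details such as `\'hh` hex escapes and binary data.

```ruby
require "strscan"

# Hypothetical helper (not from the original post): split RTF source
# into [type, value] tokens. A real parser needs to handle much more.
def tokenize_rtf(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    if scanner.scan(/\{/)
      tokens << [:group_start, nil]
    elsif scanner.scan(/\}/)
      tokens << [:group_end, nil]
    elsif scanner.scan(/\\([a-z]+)(-?\d+)? ?/)
      # Control word: lowercase letters, an optional numeric parameter,
      # and an optional single space that acts as a delimiter.
      tokens << [:control_word, [scanner[1], scanner[2]&.to_i]]
    elsif scanner.scan(/\\([^a-z])/)
      # Control symbol: a backslash followed by one non-letter character.
      tokens << [:control_symbol, scanner[1]]
    elsif scanner.scan(/[\r\n]+/)
      # Line breaks in the RTF source are not document content; skip them.
    elsif (text = scanner.scan(/[^\\{}\r\n]+/))
      tokens << [:text, text]
    else
      scanner.getch # stray trailing backslash: skip it so the loop ends
    end
  end
  tokens
end

# A tiny RTF document with a one-entry stylesheet and one styled paragraph.
sample = '{\rtf1\ansi{\stylesheet{\s1 Heading 1;}}\pard\s1 Title\par}'
tokenize_rtf(sample).each { |type, value| puts "#{type}: #{value.inspect}" }
```

In the sample, the `\s1` control word appears twice: once inside the `\stylesheet` group, where the style is defined, and once before the paragraph text, where it is applied. That link between a paragraph and its style is the structural information a document structuring tool would need to recover.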
The Missing Structured Document Format
It is a curious fact that despite nearly 40 years of desktop publishing workflows, there is still no standardised (and hence universally supported) document format that represents a document in purely structural form.
Neither of the ubiquitous Microsoft Word document formats, .doc and .docx, does this. Nor does RTF.
While each of these formats captures whatever structural information is available within the document, they also include presentational data plus various kinds of metadata required to fully replicate the document in the particular authoring tool.
There is no structure-only format which would enforce a structure-only workflow.
Some may claim that semantic HTML is that format, and it could be. The only problem is that most word processors export HTML embellished with styles and inline CSS, the goal of which is to preserve formatting at the expense of obscuring structure.
The only thing which comes close to a structure-only document format is XML, but XML as a document authoring format lives in its own rarefied ecosystem of large-scale publishing systems.
[1] In fact Microsoft Word for the PC, in its original character-mode incarnation, supported paragraph styles circa 1985 or earlier. This was the original version of Word developed by Charles Simonyi and others at Microsoft.
[2] It should be mentioned that SGML-based tools and workflows had existed since the late 1960s or 1970s (XML, which arose in the 1990s, was the de facto successor to SGML). These were complex mainframe or minicomputer-based tools that were expensive and involved specialised training of users.
[3] For more on the history and evolution of RTF, see the post “Why Don’t All RTF Parsers Recognise Styles?” on this blog.