Paul Howson’s Website tdgq.com.au

Building a Better Structured Editor

Why Don’t All RTF Parsers Recognise Styles?

The RTF document format grew in parallel with Microsoft Word. Parsing and interpreting formatting controls when styles are mixed with direct formatting can be a challenge. Perhaps that’s why many RTF parsers ignore styles and their value as a structural device.

RTF is a widely supported document encoding format which grew out of and closely paralleled the evolution of Microsoft Word. According to Wikipedia:

“Richard Brodie, Charles Simonyi, and David Luebbert, members of the Microsoft Word development team, developed the original RTF in the middle to late 1980s. Its syntax was influenced by the TeX typesetting language. The first RTF reader and writer shipped in 1987 as part of Microsoft Word 3.0 for Macintosh, which implemented the RTF version 1.0 specification. All subsequent releases of Microsoft Word for the Macintosh and all versions for Windows can read and write files in RTF format.”

Early History of Word and RTF

The earliest versions of Word allowed you to select text and then apply formatting to that selected text. This is often called “direct formatting”. Hence RTF started life as a language which could describe direct formatting. Each formatting attribute (e.g. bold, italic, font size, etc.) was described by an RTF control, a short sequence of letters preceded by a backslash and optionally followed by a numeric parameter. For example:

\li240

is an RTF control for setting the paragraph left indent to 240 units, where each unit is a twentieth of a point. Hence \li240 refers to a left indent of 12 points.
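
As a small illustration of that arithmetic, the sketch below splits a control such as \li240 into its name and numeric parameter and converts the parameter to points. The function names and the regular expression are assumptions made for this example. A real RTF tokenizer has more cases to handle.

```python
import re

TWIPS_PER_POINT = 20  # each RTF unit is a twentieth of a point

# Simplified pattern (an assumption of this sketch): a backslash, lowercase
# letters, then an optional signed integer parameter.
CONTROL = re.compile(r"\\([a-z]+)(-?\d+)?")


def parse_control(text):
    """Split one control such as '\\li240' into ('li', 240)."""
    match = CONTROL.fullmatch(text)
    if match is None:
        raise ValueError(f"not an RTF control: {text!r}")
    name, param = match.groups()
    return name, int(param) if param is not None else None


def twips_to_points(twips):
    """Convert RTF units (twentieths of a point) to points."""
    return twips / TWIPS_PER_POINT


name, value = parse_control(r"\li240")
print(name, twips_to_points(value))  # li 12.0
```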

There was no notion of styles or the separation of structure and format in early versions of Word.

Word 3 for the Mac Introduces Styles and Outlining

Word 3 for the Macintosh (1987) introduced paragraph styles and outlining.

Paragraph styles were named collections of formatting attributes. Instead of applying direct formatting to a paragraph, you could apply a style. Changing the definition of that style would change the formatting of all paragraphs tagged with that style. Obvious stuff these days, but a radical innovation at the time.

Word provided a built-in hierarchy of heading styles — “heading 1” to “heading 9”. If you used these styles for the headings in your document, Word could display the hierarchical structure of headings in outline mode, which also allowed you to manipulate the document structure.

Style Definitions in RTF Files

To accommodate style definitions, Microsoft added a style table near the start of an RTF file. The style table associated a style number and a style name with a collection of formatting attributes (RTF controls). A very basic style table containing one style might look like this:

{\stylesheet{\s5 \f34\li240\i Normal;}}

In this example the RTF controls and parameters are:

\s5      Style number 5 is being defined
\f34     Use font number 34
\li240   Paragraph left indent is 240 units (12 points)
\i       Use italic font
Normal   The style name (terminated by a semicolon, which is not part of the name)
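
To make the anatomy of a style entry concrete, here is a minimal sketch of how a reader might turn such an entry into a lookup table keyed by style number. The function name parse_style_entries and the simplified regular expression are assumptions for illustration. Real style tables carry more controls and cases than this handles.

```python
import re

# Simplified control pattern (an assumption): backslash, letters, optional
# integer parameter, and at most one delimiting space.
CONTROL = re.compile(r"\\([a-z]+)(-?\d+)? ?")


def parse_style_entries(rtf):
    """Return {style_number: (style_name, [(control, parameter), ...])}."""
    styles = {}
    # Each style entry is an innermost {...} group.
    for entry in re.findall(r"\{([^{}]*)\}", rtf):
        controls, pos = [], 0
        while (match := CONTROL.match(entry, pos)) is not None:
            param = match.group(2)
            controls.append(
                (match.group(1), int(param) if param is not None else None)
            )
            pos = match.end()
        if not controls or controls[0][0] != "s":
            continue  # this sketch only handles entries that begin with a style number
        # Whatever follows the controls is the name, minus the closing semicolon.
        style_name = entry[pos:].rstrip(";").strip()
        styles[controls[0][1]] = (style_name, controls[1:])
    return styles


print(parse_style_entries(r"{\s5 \f34\li240\i Normal;}"))
# {5: ('Normal', [('f', 34), ('li', 240), ('i', None)])}
```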

Connecting Styles with Paragraphs

There needed to be a way to connect style definitions to the paragraphs in the body of the document. This was done via the \sn RTF control, inserted at the start of a paragraph, where n is the style number. (Like most RTF controls, this would remain in effect through a series of paragraphs unless explicitly changed.)

On encountering an \sn RTF control at the start of a paragraph, Word would then consult the style table for a style with that number, and from there determine the style name and formatting attributes for that paragraph.
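
The lookup described above can be sketched as follows. The STYLE_TABLE dictionary and the function name are hypothetical. The sketch also assumes that the style control is the very first thing in the paragraph, and that a paragraph without one inherits the style of the previous paragraph.

```python
import re

# Hypothetical in-memory style table for this sketch:
# style number -> (style name, formatting controls from the definition).
STYLE_TABLE = {5: ("Normal", r"\f34\li240\i")}

STYLE_CONTROL = re.compile(r"\\s(\d+)")


def style_for_paragraph(paragraph, current=None):
    """Return the style number in effect for this paragraph.

    If the paragraph starts with a style control, its number is used;
    otherwise the number carried over from the previous paragraph stays
    in effect.
    """
    match = STYLE_CONTROL.match(paragraph)
    return int(match.group(1)) if match else current


paragraphs = [
    r"\s5\f34\li240\i This is a new paragraph…",
    r"This paragraph carries on with the same style…",
]

current = None
for paragraph in paragraphs:
    current = style_for_paragraph(paragraph, current)
    style_name, controls = STYLE_TABLE[current]
    print(current, style_name, controls)
```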

Incorporating Styles into RTF Files

The manner in which Microsoft incorporated styles into RTF files suggests that RTF had gone through an evolution prior to its public release with Microsoft Word 3.0 for Macintosh and that there existed RTF readers (perhaps earlier versions of Word) which did not support styles.

Styles were added in a way which did not require the RTF reader software to know about styles. How was this done?

Microsoft decided that the \sn style control must be followed by the direct formatting controls contained in the style definition. Older RTF readers would simply ignore the \sn style control and interpret the direct formatting controls.

So while it would have been much cleaner and made for much simpler parsing to allow just a style control at the start of a paragraph like this:

\s5 This is a new paragraph…

…you in fact need to insert the style control and the formatting controls corresponding to the style definition, perhaps like this:

\s5\f34\li240\i This is a new paragraph…

I call these formatting controls “style re-statement controls” because they re-state the formatting controls already present in the style definition.
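
From the writer's side, the rule amounts to emitting the style control and then the style's own controls before the text. The function below is a hypothetical sketch of that step only. It does not escape or encode the paragraph text, which a real RTF writer would need to do.

```python
def styled_paragraph(style_number, style_controls, text):
    """Emit the style control, then the style's own controls, then the text."""
    # The controls are re-stated after the style control so that a reader
    # which ignores styles still ends up with the right formatting.
    return rf"\s{style_number}{style_controls} {text}"


print(styled_paragraph(5, r"\f34\li240\i", "This is a new paragraph…"))
# \s5\f34\li240\i This is a new paragraph…
```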

Styles and Local Overrides

With an existing user base for Word already accustomed to direct formatting, Microsoft couldn’t just replace direct formatting with styles and force people to use styles for all formatting. Styles require a degree of abstract thinking which not everyone possesses. So Microsoft had to make styles an addition to, rather than a substitute for, direct formatting. This meant that you could still use direct formatting if you wished, as many people did and continue to do to this day.

Hence it became possible within Word to apply a style and direct formatting to a paragraph simultaneously.

How is such a combination represented in an RTF file? We have already seen how tagging a paragraph with a style requires re-stating the formatting controls corresponding to the style definition immediately after the \sn style control. The addition of direct formatting to a styled paragraph simply requires adding to or updating the series of formatting controls which follow the \sn style control.

Using the above example, assume we have a paragraph to which style number 5 has been applied. Here we see the style control followed by the formatting controls corresponding to the style definition:

\s5\f34\li240\i This is a new paragraph…

If we want to add, say, a 12pt right-hand indent to this paragraph we would write:

\s5\f34\li240\i\ri240 This is a new paragraph…

…where \ri240 is the RTF control for a 12pt right indent. Let’s say that instead we wanted to change the left indent to 6 points (or 120 units). Then we would write:

\s5\f34\li120\i This is a new paragraph…

Notice that we have not appended another \li control; rather, we have modified the \li control already present among the “style re-statement controls”. (We could instead have appended another \li control, which would override the first one.)

As explained above, an RTF reader that does not understand styles will simply interpret the formatting controls and ignore the \sn style control. The fact that this list of formatting controls no longer corresponds with those in the style definition will be of no concern.
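
A style-unaware reader of this kind can be sketched in a few lines: skip the style control, apply every other control in order, and let a later control replace an earlier one of the same name. The function name and the simplified control pattern are assumptions for this example. Recording a parameterless control such as \i as 1 is also a simplification.

```python
import re

# Simplified control pattern (an assumption of this sketch).
CONTROL = re.compile(r"\\([a-z]+)(-?\d+)? ?")


def effective_formatting(paragraph):
    """Resolve leading controls the way a style-unaware reader might.

    Returns (formatting, text), where formatting maps control name to value.
    """
    formatting, pos = {}, 0
    while (match := CONTROL.match(paragraph, pos)) is not None:
        name, param = match.groups()
        pos = match.end()
        if name == "s":
            continue  # the style control is simply ignored
        # A later control with the same name overrides an earlier one.
        formatting[name] = int(param) if param is not None else 1
    return formatting, paragraph[pos:]


edited, _ = effective_formatting(r"\s5\f34\li120\i This is a new paragraph…")
appended, _ = effective_formatting(r"\s5\f34\li240\i\li120 This is a new paragraph…")
print(edited)              # {'f': 34, 'li': 120, 'i': 1}
print(edited == appended)  # True
```

Editing the existing \li control and appending a second one give the same result for such a reader, because only the last value for each control survives.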

Parsing Local Overrides in the Presence of Styles

Things get more complicated for an RTF reader that does know about styles. How is it to interpret the combination of an \sn style control plus a series of direct formatting controls which differ from those in the style definition?

Word treats discrepancies between the direct formatting controls and the style definition as direct formatting overrides. InDesign does the same, which is why RTF text placed in InDesign often ends up with those little “+” signs next to the style name, indicating that paragraphs have local formatting overrides (sometimes these can be quite obscure and infuriating).

In other words, the interpretation is “the document author has applied a style and then added local formatting overrides”.
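
One way to picture this interpretation is as a comparison between the controls in the style definition and the controls that actually follow the style control in the paragraph. The sketch below is an assumption about how a reader might compute that difference, not a description of how Word or InDesign do it internally. Treating a control that is missing from the paragraph as a reset to the default is likewise a choice made for this example.

```python
import re

# Simplified control pattern (an assumption of this sketch).
CONTROL = re.compile(r"\\([a-z]+)(-?\d+)? ?")


def controls_to_dict(rtf):
    """Collect leading controls into {name: parameter}; later controls win."""
    result, pos = {}, 0
    while (match := CONTROL.match(rtf, pos)) is not None:
        name, param = match.groups()
        result[name] = int(param) if param is not None else 1
        pos = match.end()
    return result


def local_overrides(style_definition, paragraph):
    """Return the paragraph controls that differ from the style definition."""
    style = controls_to_dict(style_definition)
    para = controls_to_dict(paragraph)
    para.pop("s", None)  # the style control itself is not formatting
    overrides = {k: v for k, v in para.items() if style.get(k) != v}
    # A control present in the style but absent from the paragraph is
    # reported as None, meaning "reset to the default" in this sketch.
    overrides.update({k: None for k in style if k not in para})
    return overrides


definition = r"\f34\li240\i"
print(local_overrides(definition, r"\s5\f34\li240\i This is a new paragraph…"))
# {}
print(local_overrides(definition, r"\s5\f34\li240\i\ri240 This is a new paragraph…"))
# {'ri': 240}
print(local_overrides(definition, r"\s5\f34\li120\i This is a new paragraph…"))
# {'li': 120}
```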

Most RTF Parsers Take the Lazy Approach

The complexity of tracking styles and local overrides might be why many RTF readers ignore the style table and style controls altogether and fall back on interpreting only the direct formatting controls, which, according to the RTF specification, have the final say on formatting. Apple’s system-level RTF parser is one example.

This practice gets the formatting right, but completely ignores the structural value of styles.

And in the world of professional publishing, styles correctly applied are valuable for the way they make clear the structure of a document. Structure then drives formatting. Formatting (i.e. visual appearance) then reflects structure.

All of us who read books or magazines (or websites) are accustomed to inferring structure from visual appearance.

In future blog posts we will look at some of the challenges in parsing RTF files.