Paul Howson’s Website tdgq.com.au

Building a Better Structured Editor

The Difference Between Document Structure and its Representation

People talk about documents by saying things like: “this is a plain text document”, “this is an XML document”, “this is an HTML document”, “this is a Word document”, “this is a Markdown document”.

These phrases describe the way the document has been serialised as a stream of bytes in a file — i.e. they describe the file format. But a document has more abstract qualities than just its file format.

Documents have content and content has structure. Content and structure can be thought of quite independently of how the document is serialised to a file format.

Structure is a somewhat abstract concept. For example, consider the notion of a heading paragraph. Headings have a particular role within the structure of a document. A heading is still a heading whether the document is serialised as XML, HTML, RTF, Word, InDesign tags, Markdown, TeX, or whatever. The “headingness” of a particular paragraph can be expressed in any of these file formats, hence it is a quality which transcends how it is represented in a particular file format.

The same can be said of any structural elements: subheadings, block quotes, lists, etc.
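To make the distinction concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any existing editor: the `Block` type, the role names, and the two serialiser functions are all hypothetical, and the Markdown output is deliberately simplistic. The point is that the same abstract structure is held once and can be written out in more than one file format.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Block:
    """One structural element of a document, independent of any file format."""
    role: str  # hypothetical role names: "heading", "paragraph", "blockquote"
    text: str


def to_html(blocks):
    """Serialise the abstract structure as HTML elements."""
    tags = {"heading": "h1", "paragraph": "p", "blockquote": "blockquote"}
    return "\n".join(
        f"<{tags[b.role]}>{escape(b.text)}</{tags[b.role]}>" for b in blocks
    )


def to_markdown(blocks):
    """Serialise the same structure as (very basic) Markdown."""
    prefixes = {"heading": "# ", "paragraph": "", "blockquote": "> "}
    return "\n\n".join(prefixes[b.role] + b.text for b in blocks)


# The document itself: content plus structure, with no file format implied.
doc = [
    Block("heading", "Structure and Representation"),
    Block("paragraph", "A heading is still a heading."),
]

print(to_html(doc))
print(to_markdown(doc))
```

The “headingness” of the first block lives in its `role`, not in the `<h1>` tag or the `#` prefix. Those appear only when a serialiser is chosen.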

Many current tools are so tightly connected to a particular file format that they are thought of as editors for that file format, rather than as editors of documents with abstract structure.

For example, Word will easily read and write files encoded in its own “Word” file format (.doc, .docx). It can also handle RTF. But HTML? InDesign tags? Markdown? No can do. So Word is primarily limited to being a “Word file editor”. Other tools might be limited to being an “HTML editor” or an “XML editor” or a “Markdown editor”.

These tools have one or two preferred file formats and they cannot understand anything else. Few, if any, can move fluidly between a variety of formats.

This has given rise to a long-standing problem in the world of electronic publishing — the often severe difficulty of taking a document authored in one editing tool and converting it into a form suitable for another editing or production tool. Migrating documents in this way often involves a lot of manual work.

In a later post, we will look at what gets in the way of moving between document formats.