Newsgroups: comp.text.sgml
Path: utzoo!sq!dns
From: dns@sq.sq.com (David Slocombe)
Subject: Re: SGML translators
Message-ID: <1990Sep14.181719.12244@sq.sq.com>
Organization: SoftQuad Inc., Toronto
References: <BLARSEN.90Sep11085038@spider.uio.no>
Distribution: comp
Date: Fri, 14 Sep 90 18:17:19 GMT
Lines: 140

In article <BLARSEN.90Sep11085038@spider.uio.no> Bjorn.Larsen@usit.uio.no writes:
 
>Does anybody have a description of the SGML-based translators available?
>I'm interested in translators such as
>
>  WP5.0    <-> SGML
>  RTF      <-> SGML
>  LaTeX    <-> SGML
>  troff    <-> SGML

What is requested here is in general not possible without the user 
supplying substantial additional information -- in fact this touches
upon the key motivation for SGML.

<background-info>

	SGML == Standard Generalized Markup Language, ISO 8879:1986.

	See my parallel posting "How to obtain info on SGML"

	DTD == Document Type Definition. SGML tells how to create one.
	       This specifies the "grammar" of a class of documents.
	       A particular document is an "instance" of the language
	       specified by the "grammar", just like in your 
	       programming-languages course. 

</background-info>
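
To make the grammar analogy concrete, here is a minimal sketch in
Python.  It is not real SGML: the element names are made up, and
writing each content model as a regular expression over child names
is my own simplification of what a DTD declares.

```python
import re

# Hypothetical "DTD": each element's content model, written as a regex
# over space-terminated child names. "#text" stands for character data.
CONTENT_MODELS = {
    "memo":  r"(title )(para |note )+",   # one title, then paras/notes
    "title": r"(#text )",
    "para":  r"(#text )",
    "note":  r"(para )+",
}

def conforms(node):
    """Return True if the tree is an instance of the grammar above.

    A node is either a plain string (character data) or a tuple
    (element_name, [child, ...]).
    """
    if isinstance(node, str):
        return True
    name, children = node
    model = CONTENT_MODELS.get(name)
    if model is None:                      # element was never declared
        return False
    child_names = "".join(
        ("#text" if isinstance(c, str) else c[0]) + " " for c in children
    )
    if re.fullmatch(model, child_names) is None:
        return False
    return all(conforms(c) for c in children)

good = ("memo", [("title", ["SGML translators"]),
                 ("para", ["Some text."]),
                 ("note", [("para", ["A remark."])])])
bad = ("memo", [("para", ["No title first."])])

print(conforms(good), conforms(bad))
```

The first tree is an instance of the "language"; the second is
rejected because a memo must begin with a title.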

The need for SGML is demonstrated by examining the problem of translating
a troff document (for example) into an SGML document:

If a computer program looks at a file containing a troff document, it
will see things like...

	.sp .5v
	.ti 2m
	text text text ....
	...text text.
	.sp .5v

Now we may decide, in context, that these formatting codes are
formatting a paragraph (we *visualize* the effect of the codes!), but
they *might* be formatting a "note" or a cell of a table or whatever.
And this must be about the simplest case.  In general it is a kind of
AI problem to deduce the logical structure of a document from its
formatting codes, and a task that requires considerable "training"
before it can be done algorithmically with any accuracy!
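
A small Python sketch shows why.  The elements and formatting rules
below are invented for illustration; the point is only that two
different logical elements can quite legitimately be set with the same
troff codes, so the codes alone do not identify the element.

```python
# Hypothetical house style: logical element -> (codes before, codes after).
# "para" and "note" happen to be formatted identically.
FORMAT_RULES = {
    "para":  (".sp .5v\n.ti 2m", ".sp .5v"),
    "note":  (".sp .5v\n.ti 2m", ".sp .5v"),
    "title": (".sp 1v\n.ce", ".sp .5v"),
}

def to_troff(element, text):
    """Forward direction: structure -> formatting codes (always well defined)."""
    before, after = FORMAT_RULES[element]
    return f"{before}\n{text}\n{after}"

def candidate_elements(troff_fragment, text):
    """Reverse direction: every element that could have produced the fragment."""
    return sorted(e for e in FORMAT_RULES
                  if to_troff(e, text) == troff_fragment)

fragment = to_troff("note", "text text text")
print(candidate_elements(fragment, "text text text"))   # ['note', 'para']
```

Going forward is a simple table lookup; going backward returns more
than one answer, and something outside the file has to choose.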

(Of course most troff documents use macro-calls, but this only hides the
problem a little:  someone still has to map the macro-calls to the SGML
elements, and this may be one-to-many unless the designer of the macro
package was already thinking in an SGML way.  If he was, then SGML
gives him a rigor and a level of software support that he has
never had before.)

In fact there is a company that specializes in software for exactly
this kind of conversion.  They are:
	
	Avalanche Development Company
	947 Walnut Street
	Boulder, Colorado 80302
	(303) 449-5032
	FAX (303) 449-3246

Their FastTAG product accepts input from WP4.2 and WP5.0, OCR formats,
DCA/RFT files, Microsoft Word, print-image files, Calera PDA files
and Shaffstall Media Conversion files.  I think they are expanding
the list all the time.

BUT... you have to do considerable work to coach FastTAG, because by
itself it cannot be expected to know just what logical elements
make up your documents (i.e. it cannot intuit the DTD), *and* it cannot
guess at the format of each element in that DTD.  So you have to
tell it these things.  This is usually practical only if you are going
to convert a body of documents from a particular formatted form
to SGML.

Naturally!  That's why SGML is so important:  it is a way for document
creators to supply this valuable information about their work that
hitherto has been visible to the human reader (hopefully) but not
available to computer programs.

Instead of coding up your documents with formatting codes which result
in a visible image that your brain interprets to mean a certain logical
structure, you code your documents with the logical structure, and then
map that structure to formatting instructions in a separate operation.
The documents themselves then are much more "computable" as
data-structures, *and* you can take the same document/data-structure
and map it to different visual representations at different times for
different purposes.  Or even map it to different formatting languages
(e.g. troff at one site, TeX at another site). Or load it into a
database (mapping the logical structure into database-update language).
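
As a rough illustration in Python: one logical document, two mapping
tables, two outputs.  The element names and the particular troff and
plain TeX codes chosen for each element are assumptions of mine, not
anybody's actual house style.

```python
# The document as a logical structure: (element, text) pairs.
DOCUMENT = [
    ("title", "SGML translators"),
    ("para", "Markup describes structure."),
    ("note", "Formatting is applied later."),
]

# Hypothetical mappings: element -> (output before text, output after text).
TROFF_MAP = {
    "title": (".sp 1v\n.ce\n", "\n.sp .5v"),
    "para":  (".sp .5v\n.ti 2m\n", ""),
    "note":  (".sp .5v\n.in +4m\n", "\n.in -4m"),
}
TEX_MAP = {
    "title": ("\\centerline{\\bf ", "}\n\\medskip"),
    "para":  ("\\par ", ""),
    "note":  ("{\\narrower\\smallskip\\noindent ", "\\par}"),
}

def render(document, mapping):
    """Apply one mapping to the whole document; the document is unchanged."""
    return "\n".join(mapping[element][0] + text + mapping[element][1]
                     for element, text in document)

print(render(DOCUMENT, TROFF_MAP))
print(render(DOCUMENT, TEX_MAP))
```

Nothing in DOCUMENT mentions troff or TeX; swapping the table is all
it takes to change the target.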

But again note that going from SGML to troff, for example, requires
that you specify just what troff codes or macros you want used for
each SGML logical structure.  There is nothing in the SGML form of
the document that binds to a particular visual representation.
So SGML->formatter-language cannot be automatic unless you supply
additional information.  At least this *can* be done with great
reliability, which is often *not* the case for formatter-language->SGML.

The mapping from SGML to a formatter-language is usually done using an
SGML parser/translator, i.e., a program that parses the SGML documents
(using a supplied Document Type Definition) and writes to its output
suitable formatting codes (or the macro-calls that represent them) to
typeset the document in a specific format.  The user must either supply
a mapping to formatting codes to produce the particular "look" desired,
or supply a mapping to macro-calls and then write a macro-package that
has the same effect.  In either case, the SGML parser has to be told
what to put out.
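
Here is a toy version of such a translator in Python.  Big caveat:
Python's html.parser is only standing in as a tag tokenizer here; it
is *not* an SGML parser and checks nothing against a DTD.  The -ms
macro names are real, but pairing them with these made-up elements is
just an illustrative choice.

```python
from html.parser import HTMLParser

# User-supplied mapping: element -> (output at start-tag, output at end-tag).
MACRO_MAP = {
    "memo":    ("", ""),
    "heading": (".SH\n", "\n"),
    "para":    (".PP\n", "\n"),
    "note":    (".QP\n", "\n"),
}

class Translator(HTMLParser):
    """Emit macro-calls as tags go by; refuse elements with no mapping."""

    def __init__(self, mapping):
        super().__init__()
        self.mapping = mapping
        self.out = []

    def _lookup(self, tag):
        if tag not in self.mapping:
            raise KeyError(f"no output specified for element <{tag}>")
        return self.mapping[tag]

    def handle_starttag(self, tag, attrs):
        self.out.append(self._lookup(tag)[0])

    def handle_endtag(self, tag):
        self.out.append(self._lookup(tag)[1])

    def handle_data(self, data):
        text = data.strip()          # drop whitespace between tags
        if text:
            self.out.append(text)

def translate(source, mapping):
    translator = Translator(mapping)
    translator.feed(source)
    translator.close()
    return "".join(translator.out)

SOURCE = """<memo>
<heading>SGML translators</heading>
<para>Markup describes structure.</para>
<note>Formatting is applied later.</note>
</memo>"""

print(translate(SOURCE, MACRO_MAP))
```

An element that is missing from the mapping raises an error rather
than being guessed at, which is the whole point: the translator only
knows what it has been told.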

The parser has the advantage that a document that does not conform in
detail to the DTD simply won't be translated, just as a C compiler
refuses to compile a program containing syntax errors.  This greatly
eases the burden on the writer of the macro package, who doesn't have
to make his macros robust in the face of incorrect input!

As to available parsers, I quote from a comp.text posting by my
colleague Yuri Rubinsky only a short time ago:

   Today, the most popular parsers, which are generally conceded to also
   be the most conformant [to the Standard], are those of Software
   Exoterica (of Ottawa Canada), licensed by Frame, Arbortext and
   Intergraph; and of Sobemap (of Brussels Belgium, marketed by Yard
   Software of Chippenham Wiltshire UK), licensed by Agfa Compugraphic
   CAPS, Interleaf, Context and Xyvision.  We have made available to our
   consulting clients the parser from Author/Editor, which is optimized to
   work with our SoftQuad Publishing Software sqtroff component.


Hope all this helps someone...

David.

----------------------------------------------------------------
David Slocombe				(416) 963-8337
Vice-President, Research & Development  (800) 387-2777 (from U.S. only)
SoftQuad Inc.				uucp: {uunet,utzoo}!sq!dns
720 Spadina Ave.			Internet: dns@sq.com
Toronto, Ontario, Canada M5S 2T9	Fax: (416) 963-9575
