
DownScript MANUAL
=================


The Basic Stages
----------------

There are two basic stages when using DownScript:

   1. Convert the ".ps" file into a ".raw" file.  Let's call this the
      "extraction" stage, since it uses Ghostscript to extract the text
      information from the postscript file.

   2. Convert the ".raw" file into a ".html", ".txt" or whatever file.
      Let's call this the "interpretation" stage, since we interpret the
      extracted textual information, looking at where pieces of text are
      located on the page, where they join horizontally to form words,
      where they join vertically to form paragraphs, etc...

The reason for having two stages is that stage 1 (extraction) can take
a long time.  Having two stages is not strictly necessary (you can do
the conversion in one step), but you may want to run DownScript on a
document more than once (e.g. to improve the output, handle some pages
individually, etc...) and repeating the extraction process can be
annoying (especially on slower computers).


Extraction Stage
----------------

The command for this looks like this:

    downscript -i /path/to/file.ps -o /tmp/file.raw

This will invoke Ghostscript (and its ps2ascii.ps code) to extract the
textual information and store the results in the raw file.  Note that
these raw files can get pretty big, and there's no real need to keep
them, which is why I always put them in the /tmp directory.


Interpretation Stage
--------------------

This is the fun bit :).  The command looks like this:

    downscript -i /tmp/file.raw -o /tmp/file.html

If the document is small (or your computer is VERY fast :->), you could
skip the extraction stage altogether and do this:

    downscript -i /path/to/file.ps -o /tmp/file.html

This stage is where we have an interactive DownScript "session", which
is a blue screen showing a representation of the current page (a black
box), plus the pieces of extracted text (white boxes), and some
information in the top-right corner of the screen.  This screen is
showing you how DownScript is interpreting the extracted text.

The user interface is pretty basic right now.  DownScript started off as
a program for visualizing the extracted text, and is still not much more
than a laboratory for experimenting with ways of making the computer
"understand" PS documents and producing better HTML and ASCII.

Some keys:

    q  Q  Escape  :  Quit

    ]  >  +       :  Zoom in
    [  <  -       :  Zoom out
    
    PgDn  Space   :  Next page
    PgUp  BS      :  Previous page

    Ctrl-L        :  Redraw screen

    Cursor keys   :  Scroll around page

    x  y          :  Increase "x-fudge" / "y-fudge"
    X  Y          :  Decrease "x-fudge" / "y-fudge"
    
    w             :  Write current page
    W             :  Write all pages


Note that uppercase letters need Shift to be pressed (e.g. "W" means
shift + w).

There are some other keys, which are not very useful in general except
for the "split page" ones (see the section below on handling
double-sided documents).


X-fudge and Y-fudge
-------------------

You might be wondering, what the hell are "x-fudge" and "y-fudge" ?
Good question :-).  They are the amounts by which the box surrounding
each bit of text from the postscript file is made larger.  I'll try to
explain:

You see, postscript files do some horrible things to text, like
splitting the word "object" into two pieces "ob" and "ject", just so
that the "ject" bit can be moved a bit closer to (or further away from)
the "ob" bit.  This is bad enough, but some postscript files are worse,
because they place _every_ letter of every word individually on the
page.  Argh !

So it's DownScript's job to find all these pieces, and stitch them back
together to form whole words again.  This is where x-fudge comes in.  In
order to stitch "ob" and "ject" back together into "object", it checks
whether the box surrounding "ob" joins up to the box surrounding "ject".
Usually the boxes are not quite wide enough, so there is a small gap in
between them and we get "ob ject", clearly not what we want.  So by
making each box a bit wider, we can make the "ob" box join up to the
"ject" box and we get what we want ("object").  X-fudge is the
percentage (of the nominal font width) to make the boxes wider.
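
The joining test can be sketched in a few lines of Python.  This is only
an illustration, not DownScript's actual code: the `Piece` record, its
field names, and the choice to widen each box on its right-hand side by
x-fudge percent of the nominal font width are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    # Hypothetical record for one extracted bit of text.
    text: str
    x_start: float     # left edge of the box
    x_end: float       # right edge of the box
    font_width: float  # nominal font width


def join_pieces(pieces, x_fudge=8.0):
    """Stitch the pieces of one line of text into words.

    Assumption: each box is widened on its right side by x_fudge percent
    of its nominal font width, and two pieces join when the widened box
    reaches the start of the next one.
    """
    words = []
    prev = None
    for piece in sorted(pieces, key=lambda p: p.x_start):
        if prev is None:
            words.append(piece.text)
        else:
            reach = prev.x_end + prev.font_width * x_fudge / 100.0
            if piece.x_start <= reach:
                words[-1] += piece.text   # boxes touch: same word
            else:
                words.append(piece.text)  # a gap remains: new word
        prev = piece
    return words


pieces = [
    Piece("ob", 0.0, 10.0, 6.0),
    Piece("ject", 10.4, 30.0, 6.0),
    Piece("file", 36.0, 56.0, 6.0),
]
```

With these made-up coordinates, `join_pieces(pieces, x_fudge=0)` leaves
"ob" and "ject" apart, the default of 8 joins them into "object", and a
value that is too large (say 150) glues "file" onto the end as well.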

On screen, each bit of text is contained in a solid white box, with a
black cross through it, showing where it starts and ends.  When the zoom
factor is high enough, the text contained in the box is shown (in yellow
for normal text, orange for bold/italics).  Next to each white box is a
red area, which is as wide as the nominal font width.  The goal is an
x-fudge value which is large enough to have no red gaps between things
like "ob" and "ject", but small enough so that normal words are still
separate.  The default value (8%) is generally OK.

Y-fudge is similar, but in this case we want the white boxes to join up
*vertically* to form whole paragraphs.  For many documents, the default
value (15%) is either OK or close to OK.  But many other documents have
a double spacing style, and for these the y-fudge value needs to be much
greater, often > 100%.
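
To see why double spacing needs such a large value, here is a rough
Python sketch of the vertical test.  It assumes that y grows down the
page, that y-fudge is a percentage of the height of each line's box, and
that each box is stretched downwards by that amount; none of this is
taken from DownScript's source, and the tuple layout is hypothetical.

```python
def group_paragraphs(lines, y_fudge=15.0):
    """Group lines of text into paragraphs.

    lines is a list of (text, y_top, y_bottom) tuples, with y growing
    downwards.  Assumption: a line joins the previous one when its top
    edge lies within the previous box stretched by y_fudge percent of
    that box's height.
    """
    paragraphs = []
    prev_bottom = None
    prev_height = None
    for text, top, bottom in sorted(lines, key=lambda line: line[1]):
        if (prev_bottom is not None
                and top <= prev_bottom + prev_height * y_fudge / 100.0):
            paragraphs[-1].append(text)   # close enough: same paragraph
        else:
            paragraphs.append([text])     # too far away: new paragraph
        prev_bottom, prev_height = bottom, bottom - top
    return paragraphs


# A double-spaced page: the gap between lines is larger than a line.
double_spaced = [
    ("First line", 0.0, 10.0),
    ("second line", 22.0, 32.0),
    ("New paragraph", 60.0, 70.0),
]
```

Here the gap between the first two lines is 120% of the line height, so
the default of 15 treats every line as its own paragraph, while
`group_paragraphs(double_spaced, y_fudge=130)` joins the first two and
still keeps the third one separate.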

So... when converting single-sided documents, the usual method is to
adjust the y-fudge (and maybe x-fudge too, if needed) until the spacing
looks right, and then press "W" (shift + w) to write out the document.
Simple, huh ?


Output Formats
--------------

Here is a list of the currently supported output formats:

    HTML, ASCII, PLAIN, DEBUG and NROFF.

There is one more, RAW, which is not really an output format.  It just
turns on the extraction mechanism, so that the text information
extracted from the PS file can be saved.

[Note that NROFF support is currently broken; it is an experimental
 feature and may be removed later.]

The output format is normally deduced from the output file's extension,
for example ".html" for HTML, ".txt" for ASCII, etc...  You can override
this automatic selection by using the `--format' option.
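
The selection rules can be summed up in a small Python sketch.  The
function name and the tables are made up for illustration; the suffix
table only holds the suffixes mentioned in this manual (and assumes that
".raw" selects RAW), and the fallbacks follow the descriptions of the
`--output' and `--format' options further down.

```python
import os
import sys

# Only the suffixes named in this manual; the real table may be larger.
SUFFIX_FORMATS = {".html": "HTML", ".txt": "ASCII", ".raw": "RAW"}

# The format names listed in this manual.
KNOWN_FORMATS = {"HTML", "ASCII", "PLAIN", "DEBUG", "NROFF", "RAW"}


def choose_format(output_path=None, forced=None):
    """Pick an output format the way this manual describes it."""
    if forced is not None:
        # A forced format overrides whatever the suffix says.
        name = forced.upper()
        if name not in KNOWN_FORMATS:
            raise ValueError("unknown format: " + forced)
        return name
    if output_path is None:
        return "ASCII"  # no output file: ASCII on the standard output
    suffix = os.path.splitext(output_path)[1]
    if suffix in SUFFIX_FORMATS:
        return SUFFIX_FORMATS[suffix]
    print("warning: unrecognized suffix, assuming PLAIN", file=sys.stderr)
    return "PLAIN"
```

For example, `choose_format("/tmp/file.html")` gives "HTML", whereas
`choose_format("/tmp/file.html", forced="debug")` gives "DEBUG".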

What is the difference between ASCII and PLAIN ?  PLAIN, like the name
suggests, is purely plain text.  ASCII on the other hand uses the common
convention of using the backspace character (^H) to encode bold and
underline text.  For example, "A ^H A" is bold, since on a printer it
would overstrike the first 'A' with a second 'A', making it darker; and
"A ^H _" is underline, since on a printer it would overstrike the 'A'
with an '_' character underneath it.
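
As a small illustration of the overstrike convention, here is a Python
sketch that produces it, following the character order given above (the
letter first, then the backspace, then the overstriking character).  The
function names are made up, and `to_plain` is only my guess at how such
text could be reduced to plain text again.

```python
import re

BS = "\b"  # the backspace character, written as ^H above


def bold(text):
    """Overstrike every character with itself."""
    return "".join(ch + BS + ch for ch in text)


def underline(text):
    """Overstrike every character with an underscore."""
    return "".join(ch + BS + "_" for ch in text)


def to_plain(ascii_text):
    """Drop each backspace together with the character that follows it."""
    return re.sub(BS + ".", "", ascii_text, flags=re.DOTALL)
```

So `bold("Hi")` is the six characters H ^H H i ^H i, and running
`to_plain` over the result gives back "Hi".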


Handling Double-Sided Documents
-------------------------------

DownScript has special support for handling documents that are
"double-sided" (i.e. have two columns of text per page).  Some
postscript files render these columns simultaneously, which really
confuses certain other postscript->text programs that exist (the result
is that both columns are intermingled, making the document unreadable),
but DownScript can handle them most of the time.  Here's how:

The main key to remember is "S" (i.e. shift + s).  This causes all
pages to be "Split".  After the split operation, each page has become
two new pages, with the first one containing the text that was in the
left column, and the second one containing the text that was in the
right column.  The "s" key (i.e. without shift) will split just the
current page. 

[TIP: apart from "x" and "y", a letter key without shift generally acts
 on a single page (the current one), whereas the same key *with* shift
 acts on all pages.]
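
The effect of a split can be pictured with the following Python sketch.
Be warned that the rule used here (a box belongs to the left column when
its centre lies left of the middle of the page) is purely an assumption
made for the example, as are the function names and the tuple layout;
this manual does not say how DownScript really decides.

```python
def split_page(boxes, page_width):
    """Split one page into a left-column page and a right-column page.

    boxes is a list of (text, x_start, x_end) tuples.  Assumption: the
    centre of each box decides which column it belongs to.
    """
    middle = page_width / 2.0
    left = [b for b in boxes if (b[1] + b[2]) / 2.0 < middle]
    right = [b for b in boxes if (b[1] + b[2]) / 2.0 >= middle]
    return left, right


def split_all(pages, page_width):
    """Like "S": every page becomes two pages, left column first."""
    result = []
    for boxes in pages:
        result.extend(split_page(boxes, page_width))
    return result


page = [
    ("left one", 50.0, 250.0),
    ("right one", 330.0, 550.0),
    ("left two", 50.0, 280.0),
]
```

With a page width of 600, `split_all([page], 600.0)` returns two pages:
the first holds "left one" and "left two", the second "right one".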

So... when converting double-sided documents, the usual method is to
adjust the y-fudge (and maybe x-fudge, if needed) until the spacing
looks right, then press "S" (shift + s) to split the pages, and then
press "W" (shift + w) to write out the document.  Too easy ! :->


Command Line Options
--------------------

Here is an in-depth run-down of each of the command line options.  These
options must be entered exactly as shown: `-m' and `--mode' can be used
interchangeably, but an abbreviation like `--mo' is not allowed (unlike
in some other programs).  Also, the option name must be separated from
its value by a space; things like `--mode=640x400' and `-m640x400'
won't be accepted.
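
In other words, the matching is exact.  The following Python sketch
mimics that behaviour using the option names documented below; it is a
model of the rules just described, not the parser DownScript uses, and
the function and setting names are made up.

```python
# Each spelling maps to (setting name, whether a value follows).
OPTIONS = {
    "-m": ("mode", True), "--mode": ("mode", True),
    "-t": ("target", True), "--target": ("target", True),
    "-i": ("input", True), "--input": ("input", True),
    "-o": ("output", True), "--output": ("output", True),
    "-f": ("format", True), "--format": ("format", True),
    "-p": ("pages", True), "--pages": ("pages", True),
    "-x": ("xfudge", True), "--xfudge": ("xfudge", True),
    "-y": ("yfudge", True), "--yfudge": ("yfudge", True),
    "-d": ("double", False), "--double": ("double", False),
    "-n": ("nohag", False), "--nohag": ("nohag", False),
    "-a": ("auto", False), "--auto": ("auto", False),
}


def parse_args(argv):
    """Parse options strictly: exact names, value as a separate word."""
    settings = {}
    args = iter(argv)
    for arg in args:
        if arg not in OPTIONS:
            # Catches `--mo', `--mode=640x400' and `-m640x400' alike.
            raise ValueError("unknown option: " + arg)
        name, takes_value = OPTIONS[arg]
        if takes_value:
            try:
                settings[name] = next(args)  # the following word
            except StopIteration:
                raise ValueError("missing value for " + arg)
        else:
            settings[name] = True
    return settings
```

For example, `parse_args(["--mode", "640x400", "-a"])` is accepted,
while `parse_args(["--mode=640x400"])` raises an error.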

`-m <mode>' 
`--mode <mode>'
    
    This option specifies the LibGGI display mode to use.  When absent,
    DownScript will use the default mode that LibGGI provides.  Any type
    of mode (e.g. 8-bit, 16-bit, 24-bit, etc...) should work, with the
    exception of text modes.

    Example:  --mode 640x400x24
    
`-t <target>'
`--target <target>'

    This option specifies which LibGGI target to use.  When absent,
    DownScript will use LibGGI's default target.

    Example:  --target xlib
    
`-i <file>'
`--input <file>'

    This specifies the input file to use.  When absent, input is taken
    from the standard input (and assumed to be uncompressed postscript).
    When present, the file can be "raw", uncompressed postscript, or
    compressed so long as the suffix is recognizable, for example
    ".raw", ".ps", ".ps.Z", etc...
    
    Example:  --input /foo/great_garbage_collection_jokes.ps.Z
    
`-o <file>'
`--output <file>'

    This specifies the output file, and indirectly what the output
    format is.  When absent, DownScript defaults to writing ASCII
    (unless overridden by the `--format' option) to the standard output.
    When present, DownScript uses the file suffix (e.g. ".html") to
    determine the output format.  If DownScript does not recognize the
    suffix, a warning message is generated and PLAIN is assumed.
    
    Example:  --output /tmp/great_garbage_collection_jokes.html
    
`-f <format>'
`--format <format>'

    This is used to force the output to be in a particular format, such
    as HTML or ASCII.  If the format is unknown, DownScript will exit
    with an error message and show a list of supported formats.  This
    option is mainly useful when piping the output to another program.
    
    Example:  --format debug

`-p <first>,<last>'
`--pages <first>,<last>'

    This option specifies a range of pages (inclusive from first..last)
    on which to operate.  All other pages outside of this range will be
    skipped when loading.  Page numbers begin at 1.  This option is
    mainly useful for handling really big documents that don't fit into
    memory as a whole.
    
    Example:  --pages 10,20

`-x <percent>'
`--xfudge <percent>'
`-y <percent>'
`--yfudge <percent>'

    Specify the initial X-fudge / Y-fudge value(s) to use (see above for
    an explanation of these).  This option is mainly useful in
    conjunction with the `--auto' option (see below).

    Example:  -x 10 -y 100
    
`-d'
`--double'

    Specify that the document is double-sided (i.e. is primarily written
    in two vertical columns of text).  This option is also mainly useful
    in conjunction with the `--auto' option (see below).
    
`-n'
`--nohag'

    This option disables "Horizontal Aggregation", a heuristic which
    causes pieces of text that are horizontally adjacent (but not close
    enough to be considered joining) to be lumped together.  This option
    is mainly useful in conjunction with the `--auto' option, and is
    provided because it can produce better results in certain cases.
    
`-a'
`--auto'

    This option allows DownScript to be used in a non-interactive
    fashion.  You can set the important parameters (xfudge, yfudge &
    double-sided) using the above command line options, run DownScript,
    look at the results, and use a trial-and-error approach to finding
    the parameters which produce the best output.
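
To make the trial-and-error approach a bit less tedious, a small Python
helper can build one command line per y-fudge value to try.  This is
just a sketch: it only uses the options documented above, the helper
names and the output file naming scheme are hypothetical, and it prints
the commands rather than running them.

```python
import shlex


def auto_command(input_path, output_path, x_fudge=None, y_fudge=None,
                 double=False, nohag=False):
    """Build one non-interactive DownScript command line."""
    cmd = ["downscript", "--auto", "-i", input_path, "-o", output_path]
    if x_fudge is not None:
        cmd += ["-x", str(x_fudge)]
    if y_fudge is not None:
        cmd += ["-y", str(y_fudge)]
    if double:
        cmd.append("-d")
    if nohag:
        cmd.append("-n")
    return cmd


def trial_commands(input_path, y_values, **options):
    """One command per y-fudge value, each writing its own output file."""
    commands = []
    for y in y_values:
        # Hypothetical naming scheme, so that the results can be compared.
        output_path = "/tmp/trial-y%d.html" % y
        commands.append(auto_command(input_path, output_path,
                                     y_fudge=y, **options))
    return commands


if __name__ == "__main__":
    for cmd in trial_commands("/tmp/file.raw", [15, 60, 120], double=True):
        print(shlex.join(cmd))
```

Starting from the ".raw" file rather than the ".ps" file means that the
slow extraction stage is not repeated for every trial.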

