
Chapter 15  Filters

  A filter is, essentially, a program that operates on a stream of
  characters. The source and destination of the character stream can be
  files, another program, or almost any character device. The transformation
  applied by the filter to the character stream can range from an operation
  as simple as character substitution to one as elaborate as generating
  splines from sets of coordinates.

  The standard MS-DOS package includes three simple filters: SORT, which
  alphabetically sorts text on a line-by-line basis; FIND, which searches a
  text stream to match a specified string; and MORE, which displays text one
  screenful at a time.


System Support for Filters

  The operation of a filter program relies on two MS-DOS features that first
  appeared in version 2.0: standard devices and redirectable I/O.

  The standard devices are represented by five handles that are originally
  established by COMMAND.COM. Each process inherits these handles from its
  immediate parent. Thus, the standard device handles are already open when
  a process acquires control of the system, and it can use them with
  Interrupt 21H Functions 3FH and 40H for read and write operations
  without further preliminaries. The default assignments of the standard
  device handles are as follows:

  Handle             Name                                 Default device
  
  0                  stdin (standard input)               CON
  1                  stdout (standard output)             CON
  2                  stderr (standard error)              CON
  3                  stdaux (standard auxiliary)          AUX
  4                  stdprn (standard printer)            PRN
  

  The CON device is assigned by default to the system's keyboard and video
  display. AUX and PRN are respectively associated by default with COM1 (the
  first physical serial port) and LPT1 (the first parallel printer port).
  You can use the MODE command to redirect LPT1 to one of the serial ports;
  the MODE command will also redirect PRN.

  When executing a program by entering its name at the COMMAND.COM prompt,
  you can redirect the standard input, the standard output, or both from
  their default device (CON) to another file, a character device, or a
  process. You do this by including one of the special characters <, >, >>,
  and | in the command line, in the form shown on the following page.

  Symbol             Effect
  
  < file             Takes standard input from the specified file instead of
                     the keyboard.

  < device           Takes standard input from the named device instead of
                     the keyboard.

  > file             Sends standard output to the specified file instead of
                     the display.

  >> file            Appends standard output to the current contents of the
                     specified file instead of sending it to the display.

  > device           Sends standard output to the named device instead of
                     the display.

  p1 | p2            Routes standard output of program p1 to become the
                     standard input of program p2. (Output of p1 is said to
                     be piped to p2.)
  

  For example, the command

  C>SORT <MYFILE.TXT >PRN <Enter>

  causes the SORT filter to read its input from the file MYFILE.TXT, sort
  the lines alphabetically, and write the resulting text to the character
  device PRN (the logical name for the system's list device).

  The redirection requested by the <, >, >>, and | characters takes place at
  the level of COMMAND.COM and is invisible to the program it affects. Any
  other process can achieve a similar effect by redirecting the standard
  input and standard output with Int 21H Function 46H before calling the
  EXEC function (Int 21H Function 4BH) to run a child process.

  Note that if a program circumvents MS-DOS to perform its input and output,
  either by calling ROM BIOS functions or by manipulating the keyboard or
  video controller directly, redirection commands placed in the program's
  command line do not have the expected effect.


How Filters Work

  By convention, a filter program reads its text from the standard input
  device and writes the results of its operations to the standard output
  device. When it reaches the end of the input stream, the filter simply
  terminates. As a result, filters are both flexible and simple.

  Filter programs are flexible because they do not know, and do not care
  about, the source of the data they process or the destination of their
  output. Thus, any character device that has a logical name within the
  system (CON, AUX, COM1, COM2, PRN, LPT1, LPT2, LPT3, and so on), any file
  on any block device (local or network) known to the system, or any other
  program can supply a filter's input or accept its output. If necessary,
  you can concatenate several functionally simple filters with pipes to
  perform very complex operations.

  Although flexible, filters are also simple because they rely on their
  parent processes to supply standard input and standard output handles that
  have already been appropriately redirected. The parent must open or create
  any necessary files, check the validity of logical character-device names,
  and load and execute the preceding or following process in a pipe. The
  filter concerns itself only with the transformation it applies to the
  data.


Building a Filter

  Creating a new filter for MS-DOS is a straightforward process. In its
  simplest form, a filter need only use the handle-oriented read (Interrupt
  21H Function 3FH) and write (Interrupt 21H Function 40H) functions to
  get characters or lines from standard input and send them to standard
  output, performing any desired alterations on the text stream on a
  character-by-character or line-by-line basis.

  Figures 15-1 and 15-2 contain prototype character-oriented filters in
  both assembly language and C. In these examples, the translate routine,
  which is called for each character transferred from the standard input to
  the standard output, does nothing at all. As a result, both filters
  function rather like a very slow COPY command. You can quickly turn these
  primitive filters into useful programs by substituting your own translate
  routine.

  If you try out these programs, you'll notice that the C prototype filter
  runs much faster than its MASM equivalent. This is because the C runtime
  library is performing hidden blocking and deblocking of the input and
  output stream, whereas the MASM filter is doing exactly what it appears to
  be doing: making two calls to MS-DOS for each character processed. You can
  easily restore the MASM filter's expected speed advantage by adapting it
  to read and write lines instead of single characters.

  
           name      proto
           page      55,132
           title     PROTO.ASM--prototype filter
  ;
  ; PROTO.ASM:  prototype character-oriented filter
  ;
  ; Copyright 1988 Ray Duncan
  ;

  stdin   equ     0               ; standard input handle
  stdout  equ     1               ; standard output handle
  stderr  equ     2               ; standard error handle

  cr      equ     0dh             ; ASCII carriage return
  lf      equ     0ah             ; ASCII linefeed

  _TEXT   segment word public 'CODE'

          assume  cs:_TEXT,ds:_DATA,ss:STACK

  main    proc    far             ; entry point from MS-DOS

          mov     ax,_DATA        ; set DS = our data segment
          mov     ds,ax

  main1:                          ; read char from stdin...
          mov     dx,offset char  ; DS:DX = buffer address
          mov     cx,1            ; CX = length to read
          mov     bx,stdin        ; BX = standard input handle
          mov     ah,3fh          ; function 3fh = read
          int     21h             ; transfer to MS-DOS
          jc      main3           ; if error, terminate

          cmp     ax,1            ; any character read?
          jne     main2           ; if end of file, terminate

          call    translate       ; translate character

                                  ; write char to stdout...
          mov     dx,offset char  ; DS:DX = buffer address
          mov     cx,1            ; CX = length to write
          mov     bx,stdout       ; BX = standard output handle
          mov     ah,40h          ; function 40h = write
          int     21h             ; transfer to MS-DOS
          jc      main3           ; if error, terminate
          cmp     ax,1            ; was character written?
          jne     main3           ; if disk full, terminate

          jmp     main1           ; get another character

  main2:                          ; end of file reached
          mov     ax,4c00h        ; function 4ch = terminate
                                  ; return code = 0
          int     21h             ; transfer to MS-DOS

  main3:                          ; error or disk full
          mov     ax,4c01h        ; function 4ch = terminate
                                  ; return code = 1
          int     21h             ; transfer to MS-DOS

  main    endp

  ;
  ; Perform any necessary translation on character
  ; from standard input stored in variable 'char'.
  ; This example simply leaves character unchanged.
  ;
  translate proc  near

          ret                     ; does nothing

  translate endp

  _TEXT   ends


  _DATA   segment word public 'DATA'

  char    db      0               ; storage for input character

  _DATA   ends


  STACK   segment para stack 'STACK'

          dw      64 dup (?)

  STACK   ends

          end     main            ; defines program entry point
  

  Figure 15-1.  PROTO.ASM, the source code for a prototype
  character-oriented MASM filter.

  
  /*
      PROTO.C:  prototype character-oriented filter

      Copyright 1988 Ray Duncan
  */

  #include <stdio.h>

  main(int argc, char *argv[])
  {
      char ch;

      while((ch=getchar()) != EOF)    /* read a character            */
      {
          ch = translate(ch);         /* translate it if necessary   */

          putchar(ch);                /* write the character         */
      }
      exit(0);                        /* terminate at end of file    */
  }


  /*
      Perform any necessary translation on character
      from input file. This example simply returns
      the same character.
  */

  int translate(char ch)
  {
      return (ch);
  }
  

  Figure 15-2.  PROTO.C, the source code for a prototype character-oriented
  C filter.


The CLEAN Filter

  As a more practical example of MS-DOS filters, let's look at a simple but
  very useful filter called CLEAN. Figures 15-3 and 15-4 show the
  assembly-language and C source code for this filter. CLEAN processes a
  text stream by stripping the high bit from all characters, expanding tabs
  to spaces, and throwing away all control codes except carriage returns,
  linefeeds, and formfeeds. Consequently, CLEAN can transform almost any
  kind of word-processed document file into a plain ASCII text file.

  
          name    clean
          page    55,132
          title   CLEAN--Text-file filter
  ;
  ; CLEAN.ASM     Filter to turn document files into
  ;               normal text files.
  ;
  ; Copyright 1988 Ray Duncan
  ;
  ; Build:        C>MASM CLEAN;
  ;               C>LINK CLEAN;
  ;
  ; Usage:        C>CLEAN  <infile  >outfile
  ;
  ; All text characters are passed through with high
  ; bit stripped off. Formfeeds, carriage returns,
  ; and linefeeds are passed through. Tabs are expanded
  ; to spaces. All other control codes are discarded.
  ;

  tab     equ     09h             ; ASCII tab code
  lf      equ     0ah             ; ASCII linefeed
  ff      equ     0ch             ; ASCII formfeed
  cr      equ     0dh             ; ASCII carriage return
  blank   equ     020h            ; ASCII space code
  eof     equ     01ah            ; Ctrl-Z end-of-file

  tabsiz  equ     8               ; width of tab stop

  bufsiz  equ     128             ; size of input and
                                  ; output buffers

  stdin   equ     0000            ; standard input handle
  stdout  equ     0001            ; standard output handle
  stderr  equ     0002            ; standard error handle


  _TEXT   segment word public 'CODE'

          assume  cs:_TEXT,ds:_DATA,es:_DATA,ss:STACK

  clean   proc    far             ; entry point from MS-DOS

          push    ds              ; save DS:0000 for final
          xor     ax,ax           ; return to MS-DOS, in case
          push    ax              ; function 4ch can't be used
          mov     ax,_DATA        ; make data segment addressable
          mov     ds,ax
          mov     es,ax

          mov     ah,30h          ; check version of MS-DOS
          int     21h
          cmp     al,2            ; MS-DOS 2.0 or later?
          jae     clean1          ; jump if version OK

                                  ; MS-DOS 1, display error
                                  ; message and exit...
          mov     dx,offset msg1  ; DS:DX = message address
          mov     ah,9            ; function 9 = display string
          int     21h             ; transfer to MS-DOS
          ret                     ; then exit the old way

  clean1: call    init            ; initialize input buffer

  clean2: call    getc            ; get character from input
          jc      clean9          ; exit if end of stream

          and     al,07fh         ; strip off high bit

          cmp     al,blank        ; is it a control char?
          jae     clean4          ; no, write it

          cmp     al,eof          ; is it end of file?
          je      clean8          ; yes, write EOF and exit

          cmp     al,tab          ; is it a tab?
          je      clean6          ; yes, expand it to spaces

          cmp     al,cr           ; is it a carriage return?
          je      clean3          ; yes, go process it

          cmp     al,lf           ; is it a linefeed?
          je      clean3          ; yes, go process it

          cmp     al,ff           ; is it a formfeed?
          jne     clean2          ; no, discard it

  clean3: mov     column,0        ; if CR, LF, or FF,
          jmp     clean5          ; reset column to zero

  clean4: inc     column          ; if non-control character,
                                  ; increment column counter
  clean5: call    putc            ; write char to stdout
          jnc     clean2          ; if disk not full,
                                  ; get another character

                                  ; write failed...
          mov     dx,offset msg2  ; DS:DX = error message
          mov     cx,msg2_len     ; CX = message length
          mov     bx,stderr       ; BX = standard error handle
          mov     ah,40h          ; function 40h = write
          int     21h             ; transfer to MS-DOS

          mov     ax,4c01h        ; function 4ch = terminate
                                  ; return code = 1
          int     21h             ; transfer to MS-DOS

  clean6: mov     ax,column       ; tab code detected
          cwd                     ; tabsiz - (column MOD tabsiz)
          mov     cx,tabsiz       ; is number of spaces needed
          idiv    cx              ; to move to next tab stop
          sub     cx,dx

          add     column,cx       ; also update column counter

  clean7: push    cx              ; save spaces counter

          mov     al,blank        ; write an ASCII space
          call    putc

          pop     cx              ; restore spaces counter
          loop    clean7          ; loop until tab stop

          jmp     clean2          ; get another character

  clean8: call    putc            ; write EOF mark

  clean9: call    flush           ; write last output buffer
          mov     ax,4c00h        ; function 4ch = terminate
                                  ; return code = 0
          int     21h             ; transfer to MS-DOS

  clean   endp


  getc    proc    near            ; get character from stdin
                                  ; returns carry = 1 if
                                  ; end of input, else
                                  ; AL = char, carry = 0
          mov     bx,iptr         ; get input buffer pointer
          cmp     bx,ilen         ; end of buffer reached?
          jne     getc1           ; not yet, jump

                                  ; more data is needed...
          mov     bx,stdin        ; BX = standard input handle
          mov     cx,bufsiz       ; CX = length to read
          mov     dx,offset ibuff ; DS:DX = buffer address
          mov     ah,3fh          ; function 3fh = read
          int     21h             ; transfer to MS-DOS
          jc      getc2           ; jump if read failed

          or      ax,ax           ; was anything read?
          jz      getc2           ; jump if end of input

          mov     ilen,ax         ; save length of data
          xor     bx,bx           ; reset buffer pointer

  getc1:  mov     al,[ibuff+bx]   ; get character from buffer
          inc     bx              ; bump buffer pointer

          mov     iptr,bx         ; save updated pointer
          clc                     ; return character in AL
          ret                     ; and carry = 0 (clear)

  getc2:  stc                     ; end of input stream
          ret                     ; return carry = 1 (set)

  getc    endp


  putc    proc    near            ; send character to stdout,
                                  ; returns carry = 1 if
                                  ; error, else carry = 0

          mov     bx,optr         ; store character into
          mov     [obuff+bx],al   ; output buffer

          inc     bx              ; bump buffer pointer
          cmp     bx,bufsiz       ; buffer full?
          jne     putc1           ; no, jump


          mov     bx,stdout       ; BX = standard output handle
          mov     cx,bufsiz       ; CX = length to write
          mov     dx,offset obuff ; DS:DX = buffer address
          mov     ah,40h          ; function 40h = write
          int     21h             ; transfer to MS-DOS
          jc      putc2           ; jump if write failed

          cmp     ax,cx           ; was write complete?
          jne     putc2           ; jump if disk full

          xor     bx,bx           ; reset buffer pointer

  putc1:  mov     optr,bx         ; save buffer pointer
          clc                     ; write successful,
          ret                     ; return carry = 0 (clear)

  putc2:  stc                     ; write failed or disk full,
          ret                     ; return carry = 1 (set)

  putc    endp


  init    proc    near            ; initialize input buffer

          mov     bx,stdin        ; BX = standard input handle
          mov     cx,bufsiz       ; CX = length to read
          mov     dx,offset ibuff ; DS:DX = buffer address
          mov     ah,3fh          ; function 3fh = read
          int     21h             ; transfer to MS-DOS
          jc      init1           ; jump if read failed
          mov     ilen,ax         ; save actual bytes read
  init1:  ret

  init    endp


  flush   proc    near            ; flush output buffer

          mov     cx,optr         ; CX = bytes to write
          jcxz    flush1          ; exit if buffer empty
          mov     dx,offset obuff ; DS:DX = buffer address
          mov     bx,stdout       ; BX = standard output handle
          mov     ah,40h          ; function 40h = write
          int     21h             ; transfer to MS-DOS
  flush1: ret

  flush   endp

  _TEXT   ends
  _DATA   segment word public 'DATA'

  ibuff   db      bufsiz dup (0)  ; input buffer
  obuff   db      bufsiz dup (0)  ; output buffer

  iptr    dw      0               ; ibuff pointer
  ilen    dw      0               ; bytes in ibuff
  optr    dw      0               ; obuff pointer

  column  dw      0               ; current column counter

  msg1    db      cr,lf
          db      'clean: need MS-DOS version 2 or greater.'
          db      cr,lf,'$'

  msg2    db      cr,lf
          db      'clean: disk is full.'
          db      cr,lf
  msg2_len equ    $-msg2

  _DATA   ends


  STACK   segment para stack 'STACK'

          dw      64 dup (?)

  STACK   ends

          end     clean
  

  Figure 15-3.  CLEAN.ASM, the source code for the MASM version of the CLEAN
  filter.

  
  /*
      CLEAN.C     Filter to turn document files into
                  normal text files.

      Copyright 1988 Ray Duncan

      Compile:    C>CL CLEAN.C

      Usage:      C>CLEAN  <infile >outfile

      All text characters are passed through with high bit stripped
      off. Formfeeds, carriage returns, and linefeeds are passed
      through. Tabs are expanded to spaces. All other control codes
      are discarded.
  */

  #include <stdio.h>

  #define TAB_WIDTH   8              /* width of a tab stop     */
  #define TAB     '\x09'             /* ASCII tab character     */
  #define LF      '\x0A'             /* ASCII linefeed          */
  #define FF      '\x0C'             /* ASCII formfeed          */
  #define CR      '\x0D'             /* ASCII carriage return   */
  #define BLANK   '\x20'             /* ASCII space code        */
  #define EOFMK   '\x1A'             /* Ctrl-Z end of file      */


  main(int argc, char *argv[])
  {
      char c;                        /* character from stdin    */
      int col = 0;                   /* column counter          */

      while((c = getchar()) != EOF)  /* read input character    */
      {
          c &= 0x07F;                /* strip high bit          */

          switch(c)                  /* decode character        */
          {
              case LF:               /* if linefeed or          */
              case CR:               /* carriage return,        */
                  col=0;             /* reset column count      */

              case FF:               /* if formfeed, carriage   */
                  wchar(c);          /* return, or linefeed,    */
                  break;             /* pass character through  */

              case TAB:              /* if tab, expand to spaces*/
                  do wchar(BLANK);
                  while((++col % TAB_WIDTH) != 0);
                  break;

              default:               /* discard other control   */
                  if(c >= BLANK)     /* characters, pass text   */
                  {                  /* characters through      */
                      wchar(c);
                      col++;         /* bump column counter     */
                  }
                  break;
          }
      }
      wchar(EOFMK);                  /* write end-of-file mark  */
      exit(0);
  }


  /*
      Write a character to the standard output. If
      write fails, display error message and terminate.
  */

  wchar(char c)
  {
      if((putchar(c) == EOF) && (c != EOFMK))
      {
          fputs("clean: disk full",stderr);
          exit(1);
      }
  }
  

  Figure 15-4.  CLEAN.C, the source code for the C version of the CLEAN
  filter.

  When using the CLEAN filter, you must specify the source and destination
  files with redirection parameters in the command line; otherwise, CLEAN
  will simply read the keyboard and write to the display. For example, to
  filter the document file MYFILE.DOC and leave the result in the file
  MYFILE.TXT, you would enter the following command:

  C>CLEAN <MYFILE.DOC >MYFILE.TXT  <Enter>

  (Note that the original file, MYFILE.DOC, is unchanged.)

  One valuable application of this filter is to rescue assembly-language
  source files. If you accidentally edit such a source file in document
  mode, the resulting file may cause the assembler to generate spurious or
  confusing error messages. CLEAN lets you turn the source file back into
  something the assembler can cope with, without losing the time you spent
  to edit it.

  Another handy application for CLEAN is to list a word-processed document
  in raw form on the printer, using a command such as

  C>CLEAN <MYFILE.DOC >PRN  <Enter>

  Contrasting the C and assembly-language versions of this filter provides
  some interesting statistics. The C version contains 79 lines and compiles
  to a 5889-byte .EXE file, whereas the assembly-language version contains
  265 lines and builds an 1107-byte .EXE file. The size and execution-speed
  advantages of implementing such tools in assembly language is obvious,
  even compared with such an excellent compiler as the Microsoft C
  Optimizing Compiler. However, you must balance performance considerations
  against the time and expense required for programming, particularly when a
  program will not be used very often.



