VOYEUR N2.0 Development Notes


VOYEUR N2.0 (VYRN20.BA) is almost 900 bytes larger than its Tandy counterpart, VOYEUR T2.0 (VYRT20.BA).  Some explanation of this difference, and of the way BASIC tokens are decoded, may be of interest to users of both.  The NEC PC-8201A and the Tandy notebooks were made by Kyocera (Kyoto Ceramics) and share many features, including the use of the 80C85A microprocessor and Microsoft-designed operating systems, but the operating systems differ significantly in the tokenization of BASIC programs.  Writing VOYEUR for the 8201A was an exercise in cryptanalysis: unlike Tandy BASIC, which encodes only the line numbers proper, converting them to binary, NEC BASIC encodes all numbers other than data, using several different conventions.  Single-digit numbers are represented by a byte which is the number raised by 17.  For integers from 10 through 255, the byte is equal to the number, but is prefixed by byte 15.  Integers from 256 through 32767 are 16-bit binary (two bytes), prefixed by byte 28.
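
The three integer conventions above can be sketched in a few lines of Python.  This is only an illustration, not the code VOYEUR uses: the function name decode_integer is made up here, and the byte order of the 16-bit form (low byte first) is an assumption, since these notes do not state it.

```python
def decode_integer(data, pos):
    """Decode one NEC-style integer constant starting at data[pos].

    Returns (value, next_pos), or None if data[pos] does not start
    an integer constant.
    """
    b = data[pos]
    if 17 <= b <= 26:
        # Single digit 0-9, stored as the digit raised by 17.
        return b - 17, pos + 1
    if b == 15:
        # Prefix 15: the next byte is the number itself (10-255).
        return data[pos + 1], pos + 2
    if b == 28:
        # Prefix 28: 16-bit binary (256-32767).
        # ASSUMPTION: low byte first, as is usual on the 8085.
        return data[pos + 1] | (data[pos + 2] << 8), pos + 3
    return None


print(decode_integer(bytes([24]), 0))        # digit 7
print(decode_integer(bytes([15, 200]), 0))   # 200
print(decode_integer(bytes([28, 0, 1]), 0))  # 256 under the byte-order assumption
```

Returning the next position along with the value lets a caller walk a tokenized line one item at a time.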

For real numbers, prefix 29 is used for single precision, and prefix 31 for double precision, with 23-bit and 55-bit precision, respectively:  3 or 7 bytes, less 1 bit in the most significant byte.  That bit is presumably reserved for a sign, but it does not seem to be used in .BA files, a separate token for "-" being used instead!  A suffix byte is added to form the exponent for a binary multiplier that scales the number, shifting the decimal point or setting the exponent (after "E" for single precision or "D" for double).  This scaling is relative to the number "1", which is stored, when forced by suffix tags "!" or "#" on the number itself, as prefix 29 or 31 followed by three or seven nulls, respectively ("1" does not appear!), and a suffix of 129.  The suffix thus allows for a scaling relative to unity of 2 raised to a power ranging from -129 (0-129) to +126 (255-129).  This corresponds roughly to a power of 10 ranging from -38 to +38, the actual limits.
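
As a rough sketch of this layout in Python (not the algorithm VOYEUR uses), the value can be taken as (1 + fraction) times 2 raised to (suffix - 129), which reproduces the stored form of "1" described above.  The name decode_real is made up, and the mantissa byte order (least significant byte first) is an assumption not confirmed by these notes.

```python
def decode_real(data, pos):
    """Decode one NEC-style real constant starting at data[pos].

    Returns (value, next_pos), or None if data[pos] is not 29 or 31.
    """
    prefix = data[pos]
    if prefix == 29:
        n = 3            # single precision: 3 mantissa bytes
    elif prefix == 31:
        n = 7            # double precision: 7 mantissa bytes
    else:
        return None
    mantissa = data[pos + 1 : pos + 1 + n]
    suffix = data[pos + 1 + n]          # exponent byte; 129 means "times 1"
    # ASSUMPTION: mantissa bytes are stored least significant first.
    raw = int.from_bytes(mantissa, "little")
    bits = 8 * n - 1                    # 23 or 55 bits; the top bit is the unused sign
    fraction = raw & ((1 << bits) - 1)  # drop the sign bit
    # Note: a Python float holds 53 bits, so the last bits of a
    # 55-bit mantissa are rounded here.
    value = (1 + fraction / (1 << bits)) * 2.0 ** (suffix - 129)
    return value, pos + n + 2


print(decode_real(bytes([29, 0, 0, 0, 129]), 0))   # the stored form of 1!
```

With three nulls and a suffix of 129 the fraction is zero and the multiplier is 2 to the power 0, so the result is exactly 1.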

Although double-precision non-data numbers are unlikely to appear in a program for this computer, an algorithm in line 11 of VOYEUR N2.0 will correctly decode them (and single-precision numbers) in most instances.  Unfortunately, double precision cannot be assured: with certain exponents, these numbers are degraded by errors of unknown origin to roughly single precision.  Using a test program containing double-precision numbers of 1 and 1.234567890123457, each multiplied by 10 raised to a power ranging from -38 to +38, errors starting in the sixth or seventh place appeared in 29 of the 152 numbers, 14 for a mantissa of 1, and 15 for a mantissa of 1.234..., usually for the same ten exponents (-36, -33, -15, -8, 8, 9, 16, 17, 31, and 36).  These exponents also yielded small errors in single-precision numbers, usually changing the least-significant figure by one.  In addition, the highest exponent, 38, stopped the program with an error!  All of the other (123) numbers were perfectly decoded with full 16-figure precision.

Perhaps the strangest number codings in NEC BASIC are for references to line numbers, as with GOTO, etc.:  When the prefix byte is 14, the 16-bit binary value of the two bytes that follow is the line number, but when the prefix byte is 13, the binary value is the address of that line!  The reason for the 8201A BASIC selecting one or the other representation is not obvious;  e.g., in storing VOYEUR itself as a .BA file, the subroutine line numbers in the ON...GOSUB command in line 32 are coded in a mix of both, as are the GOSUB references to line 40 within these subroutines.  As the program was modified, the pattern of these mixed codes changed, but the logic for the changes was not apparent.  Since having the line address in place of the line number should increase execution speed, why is this not always done?  And why is prefix 14 used, not 28: is there a difference? (!)  In view of all this complexity, it is conceivable that there may be other special encoding not used in the programs tested to date, and therefore not incorporated in VOYEUR N2.0.

VOYEUR N2.0 is larger than VOYEUR T2.0 because of the number coding, but even more because of the differences in the ROM storage of BASIC keywords and the intricate assignment of tokens.  In Tandy ROM, the keywords are listed in order of their token values, and are complete but with their first letter byte (ASCII value) raised by 128.  (These can be read in their unraised form by VOYEUR, at the left end of the information line.)  Since the tokens and addresses are monotonically related, the address of each keyword can be approximated from its token by a linear equation.  This is done in line 11 of VOYEUR T2.0; a string of printable 7-bit ASCII characters stores the correction for the exact value.
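
The simplest way to use this table layout is a scan that counts raised bytes until the wanted keyword is reached, sketched below in Python.  This is the slow scan-search idea, not the linear-equation method of VOYEUR T2.0.  The function name and the three-word sample table are made up for illustration and are not the ROM contents; the sketch also assumes that token 128 corresponds to the first entry.

```python
def tandy_keyword(table, token):
    """Find a keyword by counting raised first-letter bytes.

    `table` holds keywords in token order, each with its first
    letter raised by 128.  Returns the keyword, or None if the
    table is too short.
    """
    index = token - 128      # ASSUMPTION: token 128 is the first entry
    count = -1               # entries seen so far, minus one
    word = []
    for b in table:
        if b >= 128:         # a raised byte starts a new keyword
            count += 1
            if count > index:
                break        # the wanted keyword is complete
            if count == index:
                word.append(chr(b - 128))
        elif count == index:
            word.append(chr(b))
    return "".join(word) if word else None


def raised(word):
    """Build one sample entry: first letter raised by 128."""
    return bytes([ord(word[0]) + 128]) + word[1:].encode("ascii")


# Made-up sample table, for illustration only.
SAMPLE = raised("END") + raised("FOR") + raised("NEXT")

print(tandy_keyword(SAMPLE, 129))   # second entry of the sample table
```

Because every lookup starts from the top of the table, the cost grows with the token value, which is why a direct token-to-address calculation is so much faster.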

In NEC ROM, the keywords are listed in alphabetical order along with their tokens, but with their first letters missing and their last-letter bytes raised by 128.  A null separates the "A" words from the "B" words, etc., so that counting the nulls while searching from the start allows the first letter to be restored; this was done initially by a prototype of VOYEUR for the NECs, but some words took as long as twelve seconds to find, leading to the direct token-to-address conversion now used, which is much faster but very much larger.  (A scan-search was first used in developing VOYEUR for the Tandys, counting the raised bytes to match the token, with similarly slow results.)
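
The null-counting search can be sketched in Python as follows.  The exact position of the token within each ROM entry is not given in these notes, so the sketch assumes that the token byte immediately follows the raised last letter; that assumption, the function names, and the sample table with its token values are all hypothetical.

```python
def nec_keyword(table, token):
    """Search an NEC-style keyword table for `token`.

    ASSUMED entry layout: keyword without its first letter, last
    letter raised by 128, then one token byte.  A null byte ends
    each letter group (A words, B words, ...).
    """
    letter = ord("A")
    pos = 0
    while pos < len(table):
        if table[pos] == 0:          # null: move on to the next letter
            letter += 1
            pos += 1
            continue
        tail = []
        while table[pos] < 128:      # plain letters of the keyword
            tail.append(chr(table[pos]))
            pos += 1
        tail.append(chr(table[pos] - 128))   # raised last letter
        pos += 1
        entry_token = table[pos]     # ASSUMPTION: token follows the word
        pos += 1
        if entry_token == token:
            return chr(letter) + "".join(tail)   # restore the first letter
    return None


def entry(word, token):
    """Build one sample entry in the assumed layout."""
    tail = word[1:]
    return tail[:-1].encode("ascii") + bytes([ord(tail[-1]) + 128, token])


# Made-up table and token values, for illustration only.
SAMPLE = (entry("ABS", 200) + entry("AND", 201) + b"\x00"
          + entry("BEEP", 202) + b"\x00" + entry("CLS", 203))

print(nec_keyword(SAMPLE, 202))
```

As with the Tandy scan, the search time depends on how far into the table the keyword lies, which matches the long delays seen with the prototype.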

Tandy BASIC uses all bytes from 128 to 255 as tokens, and no others.  NEC BASIC uses most bytes from 129 to 254, but with many gaps, requiring the use of bytes elsewhere.  Most are in the control code area, below 32, but there are gaps there as well, leading to a spillover into the printable 7-bit range, to 41.  The bytes from 1 to 41 are not used directly as tokens, but are first raised by 128; they are then prefixed in .BA files by byte 255 to distinguish them.  In VOYEUR N2.0, each gap in the token ranges is represented by a dummy character (#) in the two strings required to store the address (lines 13 and 14) and in the string storing the first letter (line 16), which significantly increases their already-large size.
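
A short Python sketch of how a reader of .BA files might separate the two kinds of NEC token follows.  The function name read_token and the labels "direct" and "prefixed" are made up for illustration; the sketch does not check for the gaps in either range.

```python
def read_token(data, pos):
    """Classify the token starting at data[pos].

    Returns (kind, value, next_pos), or None if the byte is not a
    token.  Gaps within the token ranges are NOT checked here.
    """
    b = data[pos]
    if b == 255:
        # Prefix 255: the next byte is a value from 1 to 41,
        # raised by 128.
        low = data[pos + 1] - 128
        if not 1 <= low <= 41:
            raise ValueError("unexpected byte after prefix 255")
        return "prefixed", low, pos + 2
    if 129 <= b <= 254:
        return "direct", b, pos + 1
    return None


print(read_token(bytes([255, 169]), 0))   # ('prefixed', 41, 2)
print(read_token(bytes([200]), 0))        # ('direct', 200, 1)
```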

To generalize, 8201A .BA files use an inconsistent mixture of intricate code, apparently intended to save program storage space, with inexplicable waste which does the opposite.  As a possible indication of design haste, some bytes cannot be typed, including one (96) for which a character has been defined, the opening single quote.  Why did Microsoft, which developed both the Tandy and NEC ROM software, not make them identical?  Suzuki, Hayashi, and Rick have their names immortalized in both versions.  (Bill apparently settled for the money!)  Which system came first?  And why are there such wasteful conventions that they share?  In both, a colon is prefixed to the "ELSE" token and to the remark apostrophe token.  (Yes, a token is actually used for the apostrophe, and yet is prefixed by the REM token!  Don't ask.)  All of this required extra lines to be correctly decoded by VOYEUR.  Of the many keywords requiring parentheses, often unnecessarily, only "TAB" has the opening parenthesis attached to it in ROM, saving a byte in .BA files.  Why are tokens wasted on single-byte math functions, while "AS" and the last half of words such as "OUTPUT" are not tokenized, and "MAXFILES" uses two tokens, one for each half?  Then, of course, there are those aptly-named "dummy" expressions!

End