https://www.os2museum.com/wp/learn-something-old-every-day-part-iii/

OS/2 Museum
OS/2, vintage PC computing, and random musings

Learn Something Old Every Day, Part III

Posted on September 8, 2021 by Michal Necasek

As part of a hobby project, I set out to reconstruct assembly source code that should be built with an old version of MASM and exactly match an existing old binary. In the process I learned how old MASM versions worked, and why programmers hated MASM. Note that "old versions" in this context means MASM 5.x and older, i.e. older than MASM 6.0.

The way old MASM works is relatively straightforward but its documentation often explains it very poorly or not at all. MASM is a two-pass assembler, and that indirectly explains almost everything about its quirks. This is different from more modern N-pass assemblers which automatically run multiple passes to resolve ambiguities.

The core of the problem is that MASM tries to be clever, but it's not nearly clever enough. It is very questionable whether MASM's cleverness is a solution or a problem; other assemblers are stricter, relying on programmers to resolve ambiguities. This perhaps puts slightly more of a burden on the programmer but results in more readable, consistent source code.
Most ambiguities result from the fact that like most assemblers, MASM does not require symbols to be declared before they're referenced. In the first pass, MASM generates "provisional" code, making guesses about what unknown symbols are. At the end of the first pass, all symbols are known (if they're not, the assembly will fail). In the second pass, MASM applies what it learned in the first pass and generates the final object code.

If the guesses made in the first pass turn out to be incompatible with the second pass, MASM will report the dreaded "phase error". More about that later.

The crucial thing to understand is that in the first pass, MASM generates enough object code to resolve all offsets, i.e. at the end of the first pass, MASM will know for each symbol defined in the source code at which offset it will be located, because it will have determined how big all generated code and data is.

Now comes the "cleverness". For example if MASM sees a JMP to an unknown label, it will assume a 16-bit near jump, i.e. a three-byte instruction. In the second pass, MASM may find out that the jump target is within +127/-128 bytes and generates a short jump, a two-byte instruction. Crucially, the third byte will be replaced by a NOP so that the instruction still effectively takes up three bytes. MASM might also find out that the label is in a different segment, requiring a far jump. In that case, the jump instruction will not fit within three bytes and a phase error will result.

The programmer also has the option of writing 'JMP SHORT xxx' rather than 'JMP xxx'. In that case, MASM will always generate a two-byte short jump, and possibly fail with an error if the target is not within the range of a short jump.

This is where those 'NOP after JMP' instructions come from. It is MASM (or perhaps some other assembler/compiler) turning a near jump into a short jump but not truly reducing the instruction size.
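The forward-jump behavior described above can be sketched as a toy model. This is not MASM's actual algorithm, just an illustration of the rule "reserve three bytes in pass 1, then fill exactly three bytes in pass 2 or fail". The names encode_forward_jmp and PhaseError are made up for this sketch; the opcode values (EB for a short jump, E9 for a near jump, 90 for NOP) are the standard 8086 encodings. Only the case of a label not yet seen in pass 1 is modeled.

```python
# Toy model of how a two-pass assembler might size 'JMP label' when the
# label is not yet known in pass 1. Hypothetical names; not MASM internals.

NEAR_JMP_SIZE = 3   # opcode E9 + 16-bit displacement
SHORT_JMP_SIZE = 2  # opcode EB + 8-bit displacement
NOP = 0x90


class PhaseError(Exception):
    """Pass 2 needs more room than pass 1 reserved."""


def encode_forward_jmp(jmp_offset, target_offset, target_is_far=False):
    """Return the bytes pass 2 would emit into the slot reserved by pass 1."""
    reserved = NEAR_JMP_SIZE  # pass 1 guessed a near jump
    if target_is_far:
        # A far jump needs 5 bytes (opcode, offset, segment): no room.
        raise PhaseError("far jump does not fit in %d bytes" % reserved)
    # Displacements are relative to the end of the jump instruction itself.
    short_disp = target_offset - (jmp_offset + SHORT_JMP_SIZE)
    if -128 <= short_disp <= 127:
        # Short jump, padded with a NOP so later offsets do not move.
        code = bytes([0xEB, short_disp & 0xFF, NOP])
    else:
        near_disp = target_offset - (jmp_offset + NEAR_JMP_SIZE)
        code = bytes([0xE9]) + (near_disp & 0xFFFF).to_bytes(2, "little")
    assert len(code) == reserved
    return code


if __name__ == "__main__":
    print(encode_forward_jmp(0x100, 0x110).hex())  # eb0e90
    print(encode_forward_jmp(0x100, 0x300).hex())  # e9fd01
```

In this model the slot size never changes after pass 1, which is why a nearby target yields a trailing NOP and a far target cannot be encoded at all.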
If the jump target is in another segment, the programmer may also write 'JMP FAR PTR xxx', telling MASM to generate a far jump and avoiding a phase error if the target label is in another segment but not yet known in the first pass.

Interestingly, there is at least one situation where the NOPs can be useful, especially because there does not appear to be any way (short of manually emitting opcodes) of telling MASM to generate a near jump when a short jump is possible. The BIOS component of DOS 1.x uses a CP/M inspired jump table where the "exported" interface is accessed by calling into some known base address plus an offset which is the function number times three (that being the JMP instruction size). The dispatch table looks like this:

    DISPATCH:
            JMP     FUNC0
            JMP     FUNC1
            JMP     FUNC3

This would be conceptually invoked as 'CALL FAR PTR DISPATCH+(FUNC*3)' because the dispatch table is assumed to consist of a sequence of near jumps. If MASM turns one or more of those jumps into short jumps but pads them with a NOP, the dispatch table will still work. If an assembler ends up producing only 2-byte jumps without padding, the dispatch table will go up in flames.

There are other situations where NOPs can be generated. For example 'MOV DATA, 5' will be byte or word sized, depending on the type of 'DATA'. If 'DATA' has not yet been seen in pass 1, MASM will generate a 6-byte MOV instruction, big enough for a word-sized move. In pass 2, MASM may know that 'DATA' is a byte variable; in that case, the instruction will be reduced to 5 bytes, but again followed by a NOP.

This situation is exactly what 'BYTE PTR' can be used for. When 'DATA' ends up being a variable with a known size (byte or word), MASM will set the MOV instruction size based on that and not complain. The programmer can write 'MOV BYTE PTR DATA, 5' to prevent MASM from guessing the instruction size, or to override what MASM would do.

There are other situations where MASM can be unpleasantly clever.
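Returning to the dispatch table for a moment, the arithmetic behind "function number times three" can be checked with a small sketch. The helper names (caller_offset, actual_offsets, table_is_usable) are invented for illustration; the only facts taken from the text are that callers assume 3-byte entries and that an unpadded short jump occupies 2 bytes.

```python
# Sketch: why NOP padding keeps a 3-byte-per-entry jump table usable.
# Hypothetical helper names; sizes are in bytes.

JMP_SLOT_SIZE = 3  # entry size callers assume (a near JMP)


def caller_offset(func_number):
    """Offset a caller computes for a given function number."""
    return func_number * JMP_SLOT_SIZE


def actual_offsets(entry_sizes):
    """Offsets at which the assembler really placed each table entry."""
    offsets = []
    position = 0
    for size in entry_sizes:
        offsets.append(position)
        position += size
    return offsets


def table_is_usable(entry_sizes):
    """True if every caller-computed offset lands on the start of an entry."""
    expected = [caller_offset(n) for n in range(len(entry_sizes))]
    return actual_offsets(entry_sizes) == expected


if __name__ == "__main__":
    padded = [3, 3, 3]    # near jumps, or short jumps followed by a NOP
    unpadded = [2, 2, 2]  # bare short jumps with no padding
    print(table_is_usable(padded))    # True
    print(table_is_usable(unpadded))  # False
```

With unpadded entries, function 1 would be entered at offset 3, which is the second byte of the third jump rather than the start of any instruction.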
Remember those ASSUME directives? They are quite important. Consider a situation where everything (code and data) is in a single segment named CODE, and the source file contains an 'ASSUME CS:CODE' directive but not more. If you write 'MOV BYTE PTR VAR,1', you may get a phase error depending on whether 'VAR' has been seen or not.

Why is that? MASM is clever and if it knows that VAR is in the code segment, it will automatically generate a CS segment override. But if it has not yet seen 'VAR' in the first pass, it won't leave room for the prefix, and in the second pass it'll report a phase error when it figures out that a segment prefix is needed but there's no room for it. Again, an explicitly coded segment prefix (e.g. 'MOV BYTE PTR CS:VAR,1') avoids this situation.

Programmers need to keep this cleverness in mind because if they forget to say 'ASSUME DS:CODE' (assuming the DS segment register does in fact point to the CODE segment containing the data items), MASM will helpfully generate unnecessary CS segment overrides.

Perhaps the most questionable MASM feature is guessing that when possible, a label refers to the value at the label's address. Thus 'MOV AX,WORD PTR [VAR]' can be shortened to 'MOV AX,[VAR]', because MASM reasonably assumes that moving to AX means a word-sized operation, but the same result can also be achieved with just 'MOV AX,VAR'. This leads to a confusing syntax where brackets sometimes must be used as a dereferencing operator and sometimes they're optional.

I'm not sure what problem Microsoft was trying to solve by making the syntax so vague. It is clearly inconsistent because 'MOV AX,BX' and 'MOV AX,[BX]' are two different things, yet 'MOV AX,VAR' and 'MOV AX,[VAR]' are (often) the same. It's the kind of syntactic sugar that's bad for you.

It's even worse because there are differences between MASM versions in this area. For example, MASM 1.10 will assemble 'MOV AX,VAR' the same way regardless of how VAR is defined.
But IBM MASM 2.0 accepts it only if we have 'VAR DW 0', and reports an error ("Operand types must match") if 'VAR DB 0' is seen instead. MASM 5.10A flags the situation as a warning (again "Operand type must match") and produces the same code as old MASM 1.10. Microsoft appears to have gone back and forth on this, probably because the original MASM behavior was unhelpfully vague but too much existing code relied on it.

Some other assemblers (e.g. SCP's ASM) have unambiguous syntax and 'MOV AX,VAR' will correspond to MASM's 'MOV AX,OFFSET VAR'; if dereferencing is desired, it must be made explicit with brackets.

Much of this used to be documented in old MASM manuals, like the one here. For whatever reason, newer MASM documentation (e.g. MASM 5.0 User's Guide) does not bother explaining these seemingly small but very important details which are tied to MASM's two-pass processing. The behavior is not difficult to grasp once the basics of MASM operation are understood, but without that, MASM may appear to behave in a very arbitrary and capricious manner.

This entry was posted in Assembler, Development, Microsoft, PC history.

2 Responses to Learn Something Old Every Day, Part III

1. DOS says:
   September 10, 2021 at 5:29 pm
   Did Borland document any of that behaviour better than Microsoft due to Turbo Assembler optionally emulating it?

2. Michal Necasek says:
   September 10, 2021 at 5:56 pm
   I don't recall seeing this clearly explained in Borland's documentation, but I could have missed it. And it's not like Microsoft never documented it, more like it became some kind of a lost art. You'd think a nearly 500-page MASM 5.0 Programmer's Guide could spare a few paragraphs explaining the MASM passes, but no. Phase errors are mentioned, but not explained in depth. On the other hand, IBM's MASM 1.0 manual from 1981 actually explains the two passes reasonably well.