16-Nov-79 06:55:47-EST,1355;000000000001
Mail from SU-SCORE rcvd at 16-Nov-79 0655-EST
Date: 16 Nov 1979 0341-PST
From: Mark Crispin
Subject: system crash bugfix - for release 3 and 4
To: [SU-SCORE]Tops-20.DIS.16: ;
cc: Hess at DEC-MARLBORO

Problem: ILMNRF bug halt, illegal reference is at SAVE8. Quite reproducible.

Diagnosis: The SAVE% JSYS is insufficiently paranoid. User did a CSAVE of a core image which has an indirect pointer to an inferior fork which in turn has a file mapped with no access. The program was (no surprise) an INTERLISP job. Len Bosack can explain it in more detail.

Solution: In MEXEC.MAC, after the CALL SETMPG at SAVE3A-1, insert:

        SKIP FPG0A              ;MAKE DAMN SURE THE PAGE IS READABLE
         ERJMP SAVE3B           ;IT WASN'T - TRY NEXT PAGE

This isn't necessarily the best solution, since it still does the SETMPG call, but it's better than trying to put ERJMPs at all the places in the save code where it is referencing the window page. A better fix would probably be to make the check before the SETMPG call more thorough, but I don't know enough yet to work that out.

I haven't tested this out completely, nor have I checked to see if a similar bug exists in SSAVE%, but the fix seems to work. If anybody else finds out anything more about this, I'll be glad to hear about it.
-------
16-Nov-79 10:30:12-EST,1265;000000000001
Mail from SU-SCORE rcvd at 16-Nov-79 1029-EST
Date: 16 Nov 1979 0715-PST
From: Mark Crispin
Subject: another ILMNRF bug fix, this time in IMPDV
To: [SU-SCORE]Tops-20.DIS.16: ;

I just discovered another cause of ILMNRF bug halts. IMPSTD incorrectly assumes that T2 contains a hash slot after the CHKNWP call, where actually T2 contains randomness. Since CHKNWP saves the temporaries (which is why T2 doesn't contain anything - even if it didn't, T2 would still be wrong), there is also no need to push T4 around that call, and T4 does have the correct index into HSTSTS.
This fix is relevant to release 4 and to any release 3 site running the long leader NCP (MIT-XX and ISIE, possibly others). I am surprised I haven't been crashed with this bug before.

At IMPSTD+2, replace:

        PUSH P,T4               ; Save index
        CALL CHKNWP             ; Does it know about RAS/RAR, etc?
         JRST [ SKIPGE HSTSTS(T2) ; No, so if we think it's up,
                 CALL HSTDED    ; mark it down.
                JRST .+1]
        POP P,T4                ; Restore index
IMPSTE: ...

with:

        CALL CHKNWP             ; Does it know about RAS/RAR, etc?
         JRST [ PUSH P,T4       ; No, save index
                SKIPGE HSTSTS(T4) ; If we think it's up,
                 CALL HSTDED    ; mark it down
                POP P,T4
                JRST IMPSTE]
IMPSTE: ...
-------
24-Nov-79 08:37:48-EST,1491;000000000001
Mail from SU-SCORE rcvd at 24-Nov-79 0837-EST
Date: 24 Nov 1979 0528-PST
From: Mark Crispin
Subject: probable fix to the release 4 FILLFW hung job bug
To: [SU-SCORE]Tops-20.DIS.16: ;

The following appears to be a fix for the famous release 4 bug of getting hung in LGOUT waiting for some JFN's FILLFW count to be zeroed. I put this in yesterday and I've gone a full day without anybody getting hung this way, which is a first on my system.

In JSYSA.MAC:

At PMAP6+2, before the HLRZ C,FILLFW(A), insert:

        NOSKD1                  ;SEIZE THE MACHINE

At PMAP3-1, after the HRLM C,FILLFW(A), insert:

        OKSKD1                  ;ALL CLEAR NOW

In JSYSF.MAC:

At CLZMRC+12 (what an appropriate name!) or so, before the HLLZ A,FILLFW(JFN), insert:

        NOSKD1                  ;OWN THE MACHINE WHILE DOING THIS

Three instructions down, after the HRRZS FILLFW(JFN), insert:

        OKSKD1

Dave Bell at DEC originally pointed me at the idea of checking PMAP's FILLFW hacking, although he wasn't sure it was really the problem. I decided to check JSYSF as well. These are the only two places which put FILLFW in an AC and hack it there instead of hacking it directly.

Bell thinks that locking the JFN is probably good enough. Since locks are only good if everybody respects them, I didn't go that route.
It is, however, probably alright to use OKSKED and NOSKED instead of OKSKD1 and NOSKD1; I'm just paranoid about these things and hate needless bug halts.

Mark
-------
30-Nov-79 02:10:51-EST,1893;000000000001
Mail from SU-SCORE rcvd at 30-Nov-79 0210-EST
Date: 29 Nov 1979 2302-PST
From: g.Meyers at SU-SCORE (Harris A. Meyers)
Subject: bug in 3, 3A and 4 EXEC
To: [SU-SCORE]Tops-20.DIS.16:

Ever been in the Mini-EXEC and done an Exec command without a Reset first? The result is a continuous string of "? Invalad CMBFP pointer" messages.

The cause is that the Exec command merges a new EXEC into the address space. The COMND JSYS data is in page 0, which gets merged from the file, but the flag that says that the EXEC has been initialized is in another page that is not in the file, and so it does not get cleared!

The attached srccom was our fix; however, maybe the Mini-EXEC should also do a reset before loading in a program.

harris
-------------------------------------------------------------------------

; EXECPR.MAC.2 & EXECPR.MAC.1 29-Nov-79 1714    PAGE 1

LINE 1, PAGE 1
1)      ;<3A-EXEC>EXECPR.MAC.2, 29-Nov-79 17:02:26, Edit by MEYERS
1)      ;1 move CINITF to page 0
1)      ;<3A-EXEC>EXECPR.MAC.3, 2-May-79 09:08:12, Edit by MCLURE
LINE 1, PAGE 1
2)      ;<3A-EXEC>EXECPR.MAC.3, 2-May-79 09:08:12, Edit by MCLURE

LINE 4, PAGE 2
1)      SRI,<   ;1
1)      ;1 move CINITF to page 0, this will cause an Exec command in the Mini-Exec
1)      ;1 without a Reset to work. The same bug may be exercised by non-privliged
1)      ;1 users by starting an EXEC, merging in a new copy, & doing a start. It
1)      ;1 will loop forever typing "?invalad CMBFP pointer"
1)
1)      CINITF: Z               ;NON-ZERO AFTER STARTUP INITIALIZATION COMPLETED
1)      >;1 SRI
1)
1)      ;STORAGE FOR EXEC COMMAND INTERPRETER
LINE 4, PAGE 2
2)      ;STORAGE FOR EXEC COMMAND INTERPRETER

LINE 1, PAGE 4
1)      NOSRI,< ;1 move to page 0
1)      CINITF: Z               ;NON-ZERO AFTER STARTUP INITIALIZATION COMPLETED
1)      >;1 NOSRI
1)
LINE 1, PAGE 4
2)      CINITF: Z               ;NON-ZERO AFTER STARTUP INITIALIZATION COMPLETED
2)
-------
30-Nov-79 18:53:43-EST,656;000000000001
Mail from SU-SCORE rcvd at 30-Nov-79 1853-EST
Date: 30 Nov 1979 1517-PST
From: Mark Crispin
Subject: Fix for NETRBG bughlts!!
To: [SU-SCORE]Tops-20.DIS.18: ;

At last, a fix for the infamous NETRBG bug halt when taking the network down. Credits to KODA@ISID for this one. This fix is for both release 3 and release 4, and I strongly recommend it.

In IMPDV.MAC, at IMPQOA, change:

IMPQOA: SKIPN IMPORD            ; Is output on?
         JRST RLNTBF            ; No. don't queue it up

to:

IMPQOA: SKIPN IMPORD            ; Is output on?
         JRST [ PUSH P,T1       ; No. Save AC1
                CALL RLNTBF     ; And discard the message
                POP P,T1
                RET]
-------
30-Nov-79 18:54:20-EST,1116;000000000001
Mail from SU-SCORE rcvd at 30-Nov-79 1853-EST
Mail-from: SRI-KL rcvd at 29-Nov-79 2223-PST
Date: 29 Nov 1979 1802-PST
From: LARSON at SRI-KL
Subject: Connect to subdirectory fix
To: MRC at SU-SCORE
Remailed-date: 30 Nov 1979 1518-PST
Remailed-from: Mark Crispin
Remailed-to: [SU-SCORE]Tops-20.DIS.18: ;

Here is a fix to allow users to connect to their own subdirectories without having to enter a password. It will allow user LARSON to connect to any directory in DSK*: provided the structure is mounted as domestic. This is the release 3 version.

- - - - - - - - - -
;<3-MONITOR>JSYSA.MAC.1029 23-Nov-79 16:22:30 EDIT BY LARSON
;22 Allow user to connect to subdirectories of his logged in directory
;22 on all domestic structures (including PS:)
- - - - - - - - - -
(at acces8-2):

;22     JRST ACCES8             ;NO. SPECIAL CASE DOESN'T APPLY
         jrst [ cain t4,"."     ;22 was it a .
                 jumpe t1,.+1   ;22 yes, ok if it was a substring
                jrst acces8]    ;22 not a substring or . was not next char
        SETZ P2,                ;YES. INDICATE OK TO DO THIS
-------
3-Dec-79 00:01:20-EST,798;000000000001
Mail from SU-SCORE rcvd at 3-Dec-79 0001-EST
Mail-from: BBN-TENEXD rcvd at 2-Dec-79 1320-PST
Date: 2 Dec 1979 1611-EST
From: JBORCHEK at BBN-TENEXD
Subject: Bug fix at impqoa:
To: MRC at SCORE, KODA at ISID
cc: ALLEN, TAPPAN
Remailed-date: 2 Dec 1979 1450-PST
Remailed-from: Mark Crispin
Remailed-to: [SU-SCORE]Tops-20.DIS.20: ;

I think that the proper fix would be to save t1 in rlntbf rather than at each call site, since there may be more than one place where rlntbf gets called and clobbers something important in t1. I know for a fact such a place exists in our imp driver. Put it in rlntbf and we won't ever have to worry about it again.
-------
[Note from MRC: I agree with John. I am going to make RLNTBF save T1 in my system.]

5-Dec-79 02:52:25-EST,764;000000000001
Mail from SU-SCORE rcvd at 5-Dec-79 0252-EST
Date: 4 Dec 1979 2338-PST
From: Mark Crispin
Subject: bug in SIN JSYS
To: [SU-SCORE]Tops-20.DIS.20: ;

I just got an answer to my SPR 20-13701 about inconsistent results being returned by the SIN JSYS when you specified a count and a null as the terminator.

It turns out BYTINA checks for a file less than 5 bytes long, and if so, assumes it has no line numbers, but forgets to tell the rest of the world. The patch is to change BYTIA2+3 from a CAIGE T3,T4 to a SKIPA. In the source code, you can delete the MOVE T3,FILCNT(JFN) as well as the patched over instruction and change the CAIN to a CAIE.

This patch is valid for both release 3 and release 4.
-------
8-Dec-79 00:46:30-EST,4180;000000000001
Mail from SU-SCORE rcvd at 8-Dec-79 0045-EST
Mail-from: UTAH-20 rcvd at 7-Dec-79 2044-PST
Date: 7 Dec 1979 2128-MST
From: FRANK at UTAH-20 (Randy Frank)
Subject: bug with bat blocks in swapping space
To: Tops-20:
cc: crossland at UTAH-20
Remailed-date: 7 Dec 1979 2048-PST
Remailed-from: Mark Crispin
Remailed-to: [SU-SCORE]Tops-20.DIS.21: ;

For about the past two weeks we've been going around in circles with soft disk errors in our swapping space. They haven't been bad enough to cause crashes, but have definitely caused performance degradation. Being good little boys we dutifully put the bad blocks in the BAT blocks, and were obviously consternated by the fact that the bad pages were still being used.

Sure enough, according to the code in dskalc, swapping pages which happen to be in bat blocks are marked as "used" in the drum bit table at system initialization time. Unfortunately, the wrong page in the drum bit table is marked as unavailable!

Evidently (this is Crossland's guess) at some point in the not too distant past, some code was added at GTCUB3 in PHYSIO which has the effect of converting logical drum addresses (as found in CST1) to disk addresses in a non-linear fashion. It does this by basically allocating the first cylinder's worth of drum addresses to the first disk in PS:, the next cylinder's worth of drum addresses to the second disk in PS:, etc. This was a (worthwhile) attempt to split drum references more evenly over the packs in PS:. Unfortunately, the reverse of the splitting algorithm was not done in dskalc when it converts disk addresses to drum addresses for the purpose of marking bat block pages as unavailable in the drum bit table. The patch to dskalc to fix this problem follows.

As an interesting aside, if you only have a one pack PS: you won't notice this bug, for obvious reasons!
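Editor's sketch (not part of the original message): the interleaving described above, and the inverse that dskalc was missing, can be modelled in a few lines of Python. The function names, parameter names, and toy geometry are assumptions made for illustration only; the real logic is the MACRO code in PHYSIO and DSKALC, and flag bits such as DRMOB are ignored here.

```python
def drum_to_disk(drum_addr, sec_per_cyl, n_packs):
    """Forward mapping (the direction attributed to GTCUB3 in the message):
    successive cylinders' worth of drum addresses go to successive packs.
    Returns (pack number, sector offset within that pack's swap space)."""
    cyl_index, sector = divmod(drum_addr, sec_per_cyl)
    cyl_on_pack, pack = divmod(cyl_index, n_packs)
    return pack, cyl_on_pack * sec_per_cyl + sector


def disk_to_drum(pack, offset, sec_per_cyl, n_packs):
    """Inverse mapping: what is needed when marking BAT-block pages
    as unavailable in the drum bit table."""
    cyl_on_pack, sector = divmod(offset, sec_per_cyl)
    return (cyl_on_pack * n_packs + pack) * sec_per_cyl + sector


def disk_to_drum_linear(pack, offset, swap_sectors_per_pack):
    """The older linear calculation: each pack is assumed to own one
    contiguous range of drum addresses."""
    return pack * swap_sectors_per_pack + offset
```

With two packs and four sectors per cylinder, sector offset 5 on pack 0 belongs at drum address 9 under the interleaved scheme, whereas the linear calculation marks address 5. With a single pack the two calculations agree, which is consistent with the remark that a one pack PS: never shows the bug.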
Another aside worth knowing: If you are crashing from swapping related bughlts (caused by real bad spots on the disk), the pages causing the crashes are NOT added by the monitor to the bat blocks before the monitor crashes. You must manually add the pages to the bat blocks. There is a nice little utility for manipulating bat blocks (printing them out, adding to them, deleting from them) written by Bill MacCormack at DDC, which I have if anyone wants it. Send me a msg. There was also a recent useful big buffer article on swapping related bughlts which should be read if you're suffering from such problems. If you can't get a copy from your favorite software specialist, let me know and I'll send you one.

The patch begins at DOBDRM+12. It appears to be valid for both 3A and 4 (only tested on 3A though).

; DSKALC.MAC.2 & DSKALC.MAC.1 7-Dec-79 2022     PAGE 1

LINE 1, PAGE 1
1)      ;PS:<3A-MONITOR-SOURCES>DSKALC.MAC.2 6-Dec-79 16:35:22, Edit by FRANK
1)      ; FIX CALCULATION OF LOGICAL DRUM ADDRESS FOR BAT BLOCKS
1)      ;<3A.MONITOR>DSKALC.MAC.11, 24-Jul-78 22:56:16, Edit by MCLEAN
LINE 1, PAGE 1
2)      ;<3A.MONITOR>DSKALC.MAC.11, 24-Jul-78 22:56:16, Edit by MCLEAN

LINE 31, PAGE 81
1)      MOVEM C,CKBNDX          ;SAVE BAD PAIR INDEX
1)      MOVE C,SDBTYP(D)        ;GET ADDR OF DSKSIZ TABLE
1)      ; A NOW CONTAINS SECTOR OFFSET WITHIN SWAP SPACE FOR THIS PACK
1)      IDIV A,SECCYL(C)        ;GET CYLINDER # WITHIN SWAPPING
1)      IMUL A,SDBNUM(D)        ;MULTIPLY BY NUMBER OF PACK IN THIS STRUCTURE
1)      MOVEI C,-SDBUDB(P4)     ;GET PACK # + SDB LOCATION
1)      SUB C,D                 ;COMPUTE PACK NUMBER IN STRUCTURE
1)      ADD A,C                 ;OFFSET BY PACK # WITHIN STRUCTURE
1)      MOVE C,SDBTYP(D)        ;GET ADDR OF DSKSIZ TABLE
1)      IMUL A,SECCYL(C)        ;CALCULATE LOGICAL DRUM ADDRESS
1)      ADD A,B                 ;GET THIS SWAP ADDRESS
1)      TXO A,DRMOB             ;SAY IS AN OVERFLOW ADDRESS
1)      MOVEM A,CKBDRA          ;SAVE ADDRESS TO BE ASSIGNED
LINE 31, PAGE 81
2)      MOVEI B,-SDBUDB(P4)     ;GET PACK # + SDB LOCATION
2)      SUB B,D                 ;COMPUTE PACK NUMBER IN STRUCTURE
2)      IMUL B,SDBNSS(D)        ;GET FIRST SWAP SECTOR ON THIS PACK
2)      ADD A,B                 ;GET THIS SWAP ADDRESS
2)      TXO A,DRMOB             ;SAY IS AN OVERFLOW ADDRESS
2)      MOVEM C,CKBNDX          ;SAVE BAD PAIR INDEX
2)      MOVEM A,CKBDRA          ;SAVE ADDRESS TO BE ASSIGNED
-------
25-Dec-79 01:16:01-EST,4112;000000000001
Mail from SU-SCORE rcvd at 25-Dec-79 0115-EST
Mail-from: RUTGERS rcvd at 24-Dec-79 2107-PST
Date: 19 Dec 1979 0207-EST
From: HEDRICK at RUTGERS
Subject: send to NVT's
To: admin.mrc at SU-SCORE
Remailed-date: 24 Dec 1979 2126-PST
Remailed-from: Mark Crispin
Remailed-to: [SU-SCORE]Tops-20.DIS.22: ;

We noticed that SEND * would randomly hang. At first we thought it was an RH11 problem, especially since it cleared up for a while. Then when we got back on the net, the problem came back. Very suspicious...

Turns out that SEND to a NVT doesn't always work. What happens is that TTMSG puts the message in the sendall buffer, and then calls a routine to start output (STRTOU). For normal F.E. lines, this routine actually starts the output. But NVT's are odd. In this case, they just set a flag. Then an asynchronous process checks it now and then. When the flag is on, the asynchronous process knows some NVT wants to do I/O, so it then checks all the NVT's. (This seems an excessively indirect way to do things, doesn't it?) The check consists of

        SOBE
         start output

Unfortunately, the SOBE may well skip, since it checks only the conventional output buffer, and the message is in the sendall buffer. What happens is that the next time some output would normally go to the terminal, the output process starts and you get the send, as well as the normal output. But if nothing is happening to the terminal, output on it never gets started, and the message is not delivered. The sender's job is meanwhile waiting for the character count in the sendall buffer to go to zero. Needless to say, this doesn't happen. So his job appears to be hung.

It turns out that the wait will eventually time out. The timeout is .5sec * the number of characters in the message.
At that point, the message gets thrown away, and the sender's job goes on.

The fix seems to be to start output also if there is a sendall. There are two choices: the sendall buffer count is non-zero, or the sendall-in-progress bit is on. The latter seems cleaner for some minor reasons. Anyway, here is the patch. It is for 3A. I have also looked at rel. 4 somewhere or other, and I conjecture that the same bug is present there. The following patch is to IMPDV.MAC. We also put a note in TTYSRV warning anyone who modifies it that ttsal is also defined in IMPDV.

It is possible that this won't be much use to other sites. It is important to us because
  - we like to warn all users, via SEND *, when something funny is going on in the hardware
  - SEND is not a privileged command on our system. (i.e. TTMSG is non-privileged, and the EXEC has a SEND command that is just like the old ^ESEND, as well as taking user names as arguments.)

***** CHANGE #1; page 1, line 1; page 1, line 1

 ---------------------------------
; -*- Macro -*-.=9195

***** CHANGE #2; page 2, line 16; page 2, line 16
;***TEMP-PARAMS***
PIESLC==0
;***TEMP-PARAMS

 ---------------------------------
;***TEMP-PARAMS***
PIESLC==0
;***TEMP-PARAMS
;[35] offsets in dynamic data in tty area (copied from TTYSRV)
;[35] anything changed here should be changed there, too
ttflg1==0                       ;[35]
tt%sal==1b0                     ;[35] sendall being done to this line
mskstr ttsal,ttflg1,tt%sal      ;[35]

***** CHANGE #3; page 16, line 11; page 16, line 11
        MOVE P1,NVTPTR          ;COUNT THRU NVT LINES
IMPTS1: HRR T2,P1               ;GET TERMINAL NUMBER IN T2
        CALL LCKTTY             ;GET ADDRESS OF DYNAMIC DATA
        JUMPLE T2,IMPTS2        ;IF NON-STANDARD BLOCK CHECK FOR OUTPUT
        PUSH P,T2               ;SAVE ADDRESS OF DYNAMIC DATA
        CALL TTSOBE             ;ANY OUTPUT?
         CALL NETTCS            ;YES

 ---------------------------------
        MOVE P1,NVTPTR          ;COUNT THRU NVT LINES
IMPTS1: HRR T2,P1               ;GET TERMINAL NUMBER IN T2
        CALL LCKTTY             ;GET ADDRESS OF DYNAMIC DATA
        JUMPLE T2,IMPTS2        ;IF NON-STANDARD BLOCK CHECK FOR OUTPUT
        PUSH P,T2               ;SAVE ADDRESS OF DYNAMIC DATA
        JN TTSAL,(T2),IMPT1A    ;[35] IF IN SENDALL, THERE IS OUTPUT
        CALL TTSOBE             ;ANY OUTPUT?
IMPT1A:  CALL NETTCS            ;YES

-------
-------
25-Dec-79 02:33:36-EST,1326;000000000001
Mail from SU-SCORE rcvd at 25-Dec-79 0233-EST
Date: 24 Dec 1979 2301-PST
From: Mark Crispin
Subject: follow up to Chuck Hedrick's message about NVT sendall
To: [SU-SCORE]Tops-20.DIS.22: ;

Chuck's patch does indeed fix the problem of sendall to NVT's hanging until output to the NVT is ready. Actually, any TTMSG to an NVT will hang this way, and since we also have unprivileged TTMSG, fixing this problem was also desirable for us. However, I felt that having IMPDV know about the format of dynamic data wasn't the right way to go about it, so I wrote a different patch, hitting at the root of the problem in TTSOBE.

In TTYSRV.MAC, change:

TTSBE1: JN TTOTP,(T2),R         ;NONSKIP IF OUTPUT IS STILL ACTIVE

to:

TTSBE1: LOAD T1,TSALC,(T2)      ;TRY TO GET SOMETHING REASONABLE IN T1
        JN ,(T2),R              ;NONSKIP IF OUTPUT ACTIVE OR SENDALL

What this does is make SOBE non-skip if there is a pending sendall at the same time it checks for output active. In addition, T1 is set up with the size of the sendall buffer, in an attempt to have SOBE always return a defined quantity in AC1 -- at worst, you'll get 0 in AC1, but that's better than randomness if output was active; and if it's non-zero then it's probably the right thing.
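Editor's sketch (not part of the original message): the behaviour described in these two messages can be modelled roughly in Python. Everything here is an assumption made for illustration: the Line fields merely stand in for the ordinary output buffer, TTOTP, TSALC and TTSAL, and the exact conditions tested by the real TTSOBE are not shown in the messages.

```python
from dataclasses import dataclass


@dataclass
class Line:
    output_chars: int = 0          # characters in the ordinary output buffer
    output_active: bool = False    # stands in for TTOTP
    sendall_chars: int = 0         # stands in for TSALC
    sendall_pending: bool = False  # stands in for TTSAL


def sobe_skips_old(line):
    """True means 'skip': the check sees nothing to send.
    Only the ordinary output state is examined."""
    return line.output_chars == 0 and not line.output_active


def sobe_skips_new(line):
    """Patched check: a pending sendall also counts as output."""
    return sobe_skips_old(line) and not line.sendall_pending


def lines_to_start(lines, skips):
    """Model of the asynchronous NVT scan: output is started on every
    line for which the SOBE-style check does not skip."""
    return [i for i, line in enumerate(lines) if not skips(line)]
```

For an otherwise idle line holding only a sendall message, the old check skips and output is never started, while the patched check lets the scan start output on that line.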
-- Mark --
-------
26-Dec-79 03:55:58-EST,727;000000000001
Mail from SU-SCORE rcvd at 26-Dec-79 0355-EST
Date: 26 Dec 1979 0007-PST
From: Mark Crispin
Subject: FTSCTL bug
To: [SU-SCORE]Tops-20.DIS.22: ;

Problem: FTSCTL occasionally hangs during ICP when talking to a host not in the system's hash table (for example, an unknown host) in release 4. It is hanging after sending the socket on the contact socket, but it doesn't try to open up the telnet sockets.

Diagnosis: The TLO A,(1B1) at ACPT1+1 should be removed. It's a good idea anyway, if only to speed up ICP. If the FTP contact socket is wedged, hanging for it isn't going to help things any, and it's a bug the way it now interacts with the NCP and other hosts.
-------
26-Dec-79 03:56:19-EST,601;000000000001
Mail from SU-SCORE rcvd at 26-Dec-79 0356-EST
Date: 26 Dec 1979 0011-PST
From: Mark Crispin
Subject: more on FTSCTL
To: [SU-SCORE]Tops-20.DIS.22: ;

(This is probably for release 4 only)

It takes 5 seconds after the current on-shelf job is given to be an FTP server before a new one is made. At ZATACH+3, between the SETOM PNDJOB and JRST REGO, you should insert a PUSHJ P,GETJOB to cause a new on-shelf job to be made as soon as possible. Otherwise, multiple FTPs trying to get at the same host can only come in one every 5 seconds or so.
-------