From genmark@ford.gatech.edu Fri May 15 17:04:34 1992
Received: from ford.gatech.edu by sunflower.bio.indiana.edu (4.1/9.5jsm)
        id AA04791; Fri, 15 May 92 17:04:33 EST
Received: from ford.gatech.edu by ford.gatech.edu (AIX 3.1/UCB 5.61/4.03)
        id AA23218; Fri, 15 May 92 18:00:05 -0400
Date: Fri, 15 May 92 18:00:05 -0400
From: genmark@ford.gatech.edu ("Electronic Mail Server")
Message-Id: <9205152200.AA23218@ford.gatech.edu>
Subject: Instructions
Apparently-To: gilbertd@sunflower.bio.indiana.edu
Status: R


GENMARK : SYSTEM FOR PREDICTING PROTEIN CODING REGIONS
Version 1.1  4/15/92
(Internet Electronic Mail Server)


GENERAL INFORMATION

GenMark is a software package available from the Georgia Tech School of
Applied Biology & Office of Information Technology for the quick analysis
of newly sequenced DNA.

GenMark 1.1 is based on a special type of Markov chain model of coding and
noncoding nucleotide sequences. It proves to be quite a sensitive indicator
of protein coding regions in E. coli and closely related species. For the
analysis of a 96 bp segment, the rate of false positive predictions is
about 10%, and the rate of false negatives is about 22.5%. The process for
training the program for other species is fairly straightforward, and new
species will be added later, based on demand and available information.

GenMark is robust to the presence of ambiguities in newly sequenced DNA:
up to 10% of the sample DNA may be indicated by ambiguity symbols.

GenMark receives its submissions from your local electronic mail service
and will reply with a list of open reading frames that it recognizes as
protein coding regions. Other output options, such as a PostScript(tm)
graph of the results, may also be requested.

GenMark should reply by electronic mail within an hour of a sequence's
submission.
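
The 10% ambiguity allowance mentioned above can be checked locally before
a sequence is mailed. The following is only a rough sketch: the function
names are hypothetical, the set of ambiguity symbols is assumed to be the
standard IUPAC nucleotide codes, and non-letter characters are skipped
because the sample submissions later in these instructions say they are
ignored. How the server itself counts ambiguities is not documented here.

```python
# Assumption: "ambiguity symbols" means the standard IUPAC nucleotide
# ambiguity codes, i.e. every IUPAC letter other than A, C, G and T.
AMBIGUITY_SYMBOLS = set("RYSWKMBDHVN")


def ambiguity_fraction(raw: str) -> float:
    """Return the fraction of letters in `raw` that are ambiguity symbols.

    Digits, punctuation and whitespace are skipped. Case is ignored.
    """
    letters = [ch.upper() for ch in raw if ch.isalpha()]
    if not letters:
        return 0.0
    ambiguous = sum(1 for ch in letters if ch in AMBIGUITY_SYMBOLS)
    return ambiguous / len(letters)


def within_ambiguity_limit(raw: str, limit: float = 0.10) -> bool:
    """True if no more than `limit` of the letters are ambiguity symbols."""
    return ambiguity_fraction(raw) <= limit
```

For example, within_ambiguity_limit(sequence_text) could be called before
composing the message; a False result suggests the sequence may be too
ambiguous for a reliable analysis.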


SUBMISSION OF SEQUENCES FOR ANALYSIS

Nucleotide sequences destined for processing should be sent via E-mail to:

        genmark@ford.gatech.edu

The subject line of this message must contain one of three keywords:

        instructions
        registration
        genmark

If the subject of the message is "instructions", GenMark will reply with
the most current submission instructions and news available on the system.

If the subject of the message is "registration", your message will be
logged in a registration roster. It is NOT necessary to register in order
to use GenMark. If you decide to register, we ask that you include your
name, your E-mail address, and a brief list of the organisms which you
would like to see supported in future versions of GenMark (the family
Enterobacteriaceae should be fairly well represented by the E. coli
information). We will keep those who register informed of further
developments in the software and its options.

If the subject of the message is "genmark", the program will try to analyze
the contents of the message as sequence information. At a minimum, the
message should have the word "data" on a line by itself, followed by the
sequence information (see below for a discussion of how to supply options
and some example submissions).


SUPPLYING OPTIONS TO GENMARK

No options are required for GenMark to function. The options specified
below only change the manner in which the program works. Only one option
is permissible per line. All of the options must occur before the keyword
"data" and the sequence information. ALL OF THE KEYWORDS MUST BE ENTERED IN
LOWERCASE LETTERS; the case of the sequence itself doesn't matter.

The options:

#           A comment. The rest of the line after this symbol is ignored.

address     Alternative E-mail address. After this option, include a valid
            E-mail address to which the program should send the output (if
            it differs from the address from which the submission was
            sent).

name        The name of the person who submitted the sequence.
            This is particularly important for sites where several people
            will be submitting sequences from the same E-mail address.
            After this option, include the name.

order       The Markov chain order to use. If you don't know what this is,
            don't mess with it. Higher is better, up to a point. The
            default is 4, though orders 1 through 5 are now available.
            After this option, include the new order.

psgraph     Give PostScript(tm) output. This instructs the program to
            include a PostScript graph of the results which can be printed
            on any PostScript compatible printer. The page is divided into
            six horizontal panels with the probability function on the
            y-axis and the nucleotide position along the x-axis. The six
            panels represent the six different frames: panels 1-3 indicate
            frames 1-3 on the direct strand, and panels 4-6 indicate frames
            1-3 on the complementary strand. Open reading frame indicators
            appear along the middle of each graph. Since there's a limit to
            the size of E-mail messages, expect the PostScript output to be
            sent as several messages.

step        Set the window step. This must be stated as a multiple of 3
            nucleotides. The default is 12. The practical upshot of this
            setting is that it allows you some freedom in adjusting the
            resolution of the PostScript(tm) graph. For instance, a step
            setting of 3 gives 4 times the resolution of the default of 12.

threshold   Set the open reading frame threshold: the minimum value of the
            probability function that an open reading frame must have to be
            accepted as a protein coding region. Give it as a number
            between 0 and 1, or as a percentage between 0 and 100. The
            default is 0.50.

title       The title you want to give to your PostScript(tm) graph.

window      The size of the analysis window (if you don't know what this
            is, don't play with it). The default is 96 nucleotides, and
            generally 96 to 144 nucleotides works best.
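
The option rules above can be sketched as a small helper that assembles
the body of a "genmark" message. This is only an illustration, not part of
GenMark: the function name build_submission and its parameters are
hypothetical. The checks mirror constraints stated in these instructions
(order 1 through 5, step a multiple of 3 and no larger than the window,
threshold between 0 and 100), and the defaults of 12 and 96 are the ones
quoted above.

```python
def build_submission(sequence, *, address=None, name=None, order=None,
                     psgraph=False, step=None, threshold=None,
                     title=None, window=None):
    """Assemble a GenMark message body: one lowercase option per line,
    then the keyword "data" on its own line, then the sequence."""
    # Documented defaults, used here only to cross-check step against window.
    effective_step = 12 if step is None else step
    effective_window = 96 if window is None else window

    if order is not None and order not in range(1, 6):
        raise ValueError("order must be between 1 and 5")
    if effective_step <= 0 or effective_step % 3 != 0:
        raise ValueError("step must be a positive multiple of 3")
    if effective_step > effective_window:
        raise ValueError("step should not be larger than window")
    if threshold is not None and not 0 <= threshold <= 100:
        raise ValueError("threshold must lie between 0 and 100")

    lines = []
    if address:
        lines.append(f"address {address}")
    if name:
        lines.append(f"name {name}")
    if order is not None:
        lines.append(f"order {order}")
    if psgraph:
        lines.append("psgraph")
    if step is not None:
        lines.append(f"step {step}")
    if threshold is not None:
        lines.append(f"threshold {threshold}")
    if title:
        lines.append(f"title {title}")
    if window is not None:
        lines.append(f"window {window}")

    # All options must come before the "data" keyword.
    lines.append("data")
    lines.append(sequence)
    return "\n".join(lines) + "\n"
```

The returned text would be used as the body of a message to
genmark@ford.gatech.edu with the subject "genmark"; sending the mail is
left to whatever mail client is available locally.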


SAMPLE SUBMISSIONS TO GENMARK

SAMPLE 1

> mail genmark@ford.gatech.edu
Subject: genmark
# This example shows a minimal submission, just using the defaults set by
# the program.
#
# NOTE: this will reply automatically to the exact address that it was sent
# from with only a list of open reading frames.
#
# The actual DNA sequence may have any standard ambiguity DNA symbols in it.
# Anything that isn't a letter (like numbers, punctuation, spaces, carriage
# returns) will just be ignored.
data
TCSSATGCATGHCATCGATWWCTCAGTCAGNA...

SAMPLE 2

> mail genmark@ford.gatech.edu
Subject: genmark
# This is an example of using all of the different options.
address biologist@college.edu
name John Doe
order 5
psgraph
step 6
threshold 0.50
title John Doe's New Protein Coding Region
window 144
data
TCAGTTCCAAGGTTTCCCAAAGGGTTTTCCCCAAAAGGGG...


THINGS TO WATCH OUT FOR

The sendmail program used for transferring messages across the network is
limited to messages of at most 64000 characters. Therefore, remember to
send any information you might have in chunks smaller than the 64000
character limit.

The PostScript(tm) output might take up more space than is permissible in
a mail message, so GenMark will send the graphic in parts that are smaller
than 64K in length. If you shrink the step down to 3 and send a good sized
sequence, the PostScript(tm) output will be huge, so don't be surprised.
Try to reserve that for smaller sequences.

For short sequences, you'll want to make the step smaller. We suggest a
step of 6 for any sequence under about 1.5 kb long, and a step of 3 for
sequences less than about 800 bases long.

Don't ask the program to make the step larger than the window. It won't
crash the program, but you'll probably just get garbage back.

The sequences you send are deleted as soon as they have been processed by
the program. We cannot recover them for you.

If you do not receive a response in a couple of hours, something's wrong.
Verify the format of your submission and resend it.

The graphic response may be effective for analyzing the intron/exon
structure of eukaryotic sequences, but there are no guarantees. In such a
case, the list of open reading frames would almost certainly be useless;
only the graphic would make any sense.

In many cases, the graphic output can tell you much more about the
sequence in question than the open reading frame listing alone. Careful
evaluation of the graphic could yield clues as to sequencing errors and
frameshifts.


REFERENCES

If you refer to the results of GenMark analysis, please cite the following
references:

Borodovsky M. (1990) Recognition of coding regions in nucleotide
    sequences. In M.F. Frank-Kamenetskii, ed., Computer Analysis of
    Genetic Texts, Nauka, Moscow.

Borodovsky M., McIninch J. Prediction of Gene Locations Using DNA Markov
    Chain Models (submitted to CABIOS).


QUESTIONS, PROBLEMS, SUGGESTIONS

Please send any comments or questions that you might have about the
software or the method of coding region recognition to:

        mb56@hydra.gatech.edu     (Mark Borodovsky)
or
        gt1619a@hydra.gatech.edu  (James McIninch)
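

The step and message-size guidance under THINGS TO WATCH OUT FOR can be
sketched as two small checks. This is only an illustration with
hypothetical names. The instructions give the cutoffs as approximate
("about 800 bases", "about 1.5kb"), so the exact boundaries below are an
assumption, as is treating the 64000 character limit as applying to the
message body alone.

```python
# Limit quoted in the instructions for a single mail message.
MAX_MESSAGE_CHARS = 64000


def suggested_step(sequence_length: int) -> int:
    """Suggest a window step from the sequence length in bases.

    Cutoffs follow the approximate guidance in the instructions:
    step 3 below about 800 bases, step 6 below about 1.5 kb,
    otherwise the default of 12.
    """
    if sequence_length < 800:
        return 3
    if sequence_length < 1500:
        return 6
    return 12


def fits_in_one_message(body: str) -> bool:
    """True if the message body stays under the 64000 character limit.

    Assumption: mail headers are not counted, so leave some margin.
    """
    return len(body) < MAX_MESSAGE_CHARS
```

A sequence whose message would not pass fits_in_one_message needs to be
split into smaller submissions by hand; the instructions do not describe
how the server would treat overlapping pieces.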