From bronze!sol.ctr.columbia.edu!zaphod.mps.ohio-state.edu!wupost!uunet!sun-barr!ccut!s.u-tokyo!bionet!crc.ac.uk!gwilliam Tue Nov 26 16:40:39 EST 1991
Article: 1643 of bionet.software
Path: bronze!sol.ctr.columbia.edu!zaphod.mps.ohio-state.edu!wupost!uunet!sun-barr!ccut!s.u-tokyo!bionet!crc.ac.uk!gwilliam
From: gwilliam@crc.ac.uk (Gary@VTVM2.CC.VT.EDU Williams x3294)
Newsgroups: bionet.software
Subject: GRAIL
Message-ID: <23937.9111261801@crc.ac.uk>
Date: 26 Nov 91 18:01:07 GMT
Sender: daemon@genbank.bio.net
Distribution: bionet
Lines: 161

Many thanks to those who sent me information on the GRAIL program.
For those of you who requested that I summarise the information on this
news-group, here it is.

GRAIL is a program accessed via e-mail. To use it you must be
registered. To register, follow the instructions below:

===================================================================

Welcome to GRAIL (Gene Recognition and Analysis Internet Link)

For book-keeping purposes we need to register all users to our system.
To become a registered user please send the following e-mail message to
grail@ornl.gov (please make sure that the key word "Register" is on the
first line of your message)

Register
Your Name
Your address
Your phone number
your E-mail address

We will return to you a user ID and the Help file which explains how to
submit sequences to grail and how to interpret the output.

Direct your questions to: grailmail@ornl.gov.

===================================================================

When you do the above, you will get a registration ID and a help-file
mailed to you, as below:

===================================================================

Your user ID is XXXXXXX   [not my ID! - GWW]

------------------------- Help File ------------------------

Welcome to GRAIL (Gene Recognition and Analysis Internet Link)

Grail is an interface to a system which will ultimately provide
automated gene assembly from DNA sequence data.
Currently the system provides analysis of protein coding potential of a
DNA sequence. The coding recognition module (CRM) uses a multiple-sensor
neural network approach to identify coding exons that are at least 100
bases long. In its current configuration the CRM identifies 90% of such
regions with less than 1 false positive coding exon per 5 coding exons
indicated. Your success rate will depend on a number of parameters
including the G/C content of your sequence. In general, coding regions
in sequences of low G/C content are not as well recognized as those in
higher G/C. Investigation is underway to try to improve the performance
for low G/C sequences.

This part of the system is specifically designed to locate regions of
DNA sequence with protein encoding potential. The system has been
trained to recognize coding regions in human DNA but seems to work well
on DNA sequences from other mammals. Because the system has not been
tested extensively on species other than human, no claims are made for
the predictions of coding potential on DNAs from other species.

To have sequences analyzed send e-mail to: grail@ornl.gov

The message will start with the word "sequences" followed by the number
of sequences you are sending followed by your user ID followed by the
sequences you wish to have analyzed in the following format:

Sequences number_of_sequences your_user_ID
>seq1name
AAAAAAAA........
>seq2name
TTTTTTTT..........
etc.

For the system to return any interpretation the sequence to be analyzed
must be at least 100 bases long (and not more than 100kb).

For each sequence the following information will be returned:

1. The score for the coding potential for each position analyzed on each
strand (the f-(forward) strand represents the sequence as received, and
the r-(reverse) strand is the reverse complement). These scores range
from 0.0 to 1.0 and a score greater than 0.5 identifies a region with
protein encoding potential. Non-coding regions often have a score of
0.000.
To reduce the output, only regions with scores of at least 0.01 are
reported.

2. frame. In calculating the coding potential, the system calculates the
reading-frame which is "preferred" in the window over which the
calculation is done and this information is returned for regions with
scores over 0.5.

3. orf. The limits between which the preferred frame is open are
returned for windows with scores over 0.5.

The second part of the output is the system's interpretation of the raw
data. This output gives the limits (in general a minimum) of the extent
of the coding exon, the most likely strand for the exon with a
probability for the correctness of the strand assignment, the preferred
reading frame for the exon and a quality assessment. An interesting
phenomenon we have noted is that some exons seem to have coding
character on both strands or even more coding character on the wrong
strand. Be aware that strand assignments are not always correct, and it
is sometimes useful to consider both strands as possible. Any exon with
a quality score of "excellent" is worth further consideration. Please
remember that the system is designed to find coding exons of 100 or more
bases, so small coding exons may well be missed.

This implementation of the CRM has been tested on a set of human genes
containing 102kb of sequence. This set contained 70 coding exons and the
system identified 62 (89%) and assigned them all to the correct strand.
(Though in a larger test set strand assignment was 90-95% correct). The
preferred reading frame assignment was correct for 60 (96%) of these
exons while the frame assignment for the other two had some ambiguity.
Of the eight missed, 6 were less than 100 bases long. Of 43 predicted
exons with a quality score of "excellent" all were actual coding exons.
Of predicted exons scoring "good" 11 of 16 (69%) were expected and of 49
predicted exons with a score of "marginal" only 8 (16%) were "real".
Though this is a rather limited test set, the results of this analysis
give some guidance for interpreting CRM output.

N.B. This is an alpha+ version so we are open to feed-back. We have a
new e-mail address called GRAILMAIL@ORNL.GOV for user feedback to the
GRAIL staff. Or communication can be addressed specifically to us:

Direct questions to:

Richard J. Mural, e-mail: m9l@stc10.ctd.ornl.gov
Phone: 615-576-2938
or
Edward C. Uberbacher, e-mail: uber@msr.epm.ornl.gov
Phone: 615-574-6134
or
GRAIL staff, e-mail: grailmail@ornl.gov

To receive a copy of this help file send the message "help" to
grail@ornl.gov.

===================================================================

Gary Williams
Computing Services Section,              Janet: G.Williams@UK.AC.CRC
MRC-CRC & Human Genome Mapping Centre,   Internet: G.Williams@CRC.AC.UK
Watford Rd, HARROW, Middx, HA1 3UJ, UK   EARN/Bitnet: G.Williams%CRC@UKACRL
Tel 081-869 3294  Fax 081-423 1275       Usenet: ...!mcsun!ukc!mrccrc!G.Williams
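
For anyone who wants to script submissions, the message format described
in the help file could be assembled roughly as follows. This is only a
sketch, not anything supplied by the GRAIL staff: the function name
build_grail_message, the 60-column line wrapping, and the reading of
"100kb" as 100,000 bases are all assumptions of mine, and the exact line
layout the server accepts should be checked against the help file you
receive on registration.

```python
MIN_BASES = 100      # help file: a sequence must be at least 100 bases long
MAX_BASES = 100_000  # assumption: "100kb" is read here as 100,000 bases


def build_grail_message(user_id, sequences, width=60):
    """Return a message body in the format the help file describes.

    `sequences` is a list of (name, bases) pairs. The name
    build_grail_message and the `width` wrapping are hypothetical
    choices for this sketch, not part of GRAIL.
    """
    # First line: the word "Sequences", the sequence count, the user ID.
    lines = [f"Sequences {len(sequences)} {user_id}"]
    for name, bases in sequences:
        bases = "".join(bases.split())  # drop any whitespace or newlines
        if not MIN_BASES <= len(bases) <= MAX_BASES:
            raise ValueError(
                f"{name}: {len(bases)} bases is outside "
                f"{MIN_BASES}..{MAX_BASES}"
            )
        lines.append(f">{name}")
        # Wrap the sequence so no line of the e-mail gets too long.
        lines.extend(bases[i:i + width] for i in range(0, len(bases), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # XXXXXXX stands in for the user ID returned on registration.
    print(build_grail_message("XXXXXXX", [("seq1name", "ACGT" * 30)]), end="")
```

The resulting text would be mailed to grail@ornl.gov as the body of the
message. Checking the length limits before sending avoids a round trip
for sequences the system would not interpret anyway.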