From usenet.ucs.indiana.edu!sol.ctr.columbia.edu!usc!rpi!ghost.dsi.unimi.it!univ-lyon1.fr!scsing.switch.ch!bernina!neptune!cbrg Mon Feb 8 11:46:15 EST 1993
Article: 1514 of bionet.software
Newsgroups: bionet.software
Path: usenet.ucs.indiana.edu!sol.ctr.columbia.edu!usc!rpi!ghost.dsi.unimi.it!univ-lyon1.fr!scsing.switch.ch!bernina!neptune!cbrg
From: cbrg@inf.ethz.ch (CompBioResGrp)
Subject: Searching proteins by their digested mass profile
Message-ID: <1993Feb8.115510.4888@neptune.inf.ethz.ch>
Sender: news@neptune.inf.ethz.ch (Mr News)
Nntp-Posting-Host: rutishauser-gw.inf.ethz.ch
Organization: Dept. Informatik, Swiss Federal Institute of Technology (ETH), Zurich, CH
Date: Mon, 8 Feb 1993 11:55:10 GMT
Lines: 155

A new server is available from the CBRG in Zurich. A description of
the server follows. This description can also be obtained by sending
e-mail to cbrg@inf.ethz.ch containing the single line "help MassSearch".
For a description of the other servers at the CBRG, send an e-mail
with the line: "help all"

------------------------------------------------------------------

In some cases, a protein can be recognized by fragmenting it according
to a certain pattern and using the molecular weights of the fragments
as a trace. This method is not effective for determining the
composition of an unknown protein, but it is effective for locating an
unknown sample if its sequence is recorded in a protein database.

One way of breaking a protein into smaller pieces according to a
certain pattern is to use enzymes which digest the protein. For
example, trypsin breaks a protein after every Arginine (R), and after
every Lysine (K) that is not followed by a Proline (P). AspN breaks a
protein before every Aspartic acid (D). A table of recognized enzymes
and their cleavage rules is given below.

The molecular weights of the fragments can be found experimentally by
mass spectrometry methods to a good level of accuracy.
More importantly, these methods typically require very small samples,
on the order of fractions of a picomole.

The problem of identifying a sampled protein can thus be reduced to
digesting the protein with an enzyme, finding the molecular weight of
each of the pieces, and then comparing this set of weights to what
would be obtained from the digestion of each protein in the database.
The process can be repeated with several different enzymes to increase
its selectivity.

The function MassSearch locates the best candidates in a protein
database (SwissProt at this time) that would fit the given weights
once digested by the given enzyme.

This type of searching has been found particularly useful in the
following circumstances:

  o To identify proteins when the amount available is very small,
    for example as can be separated by 2D gels.

  o To determine whether an unknown protein is already known in the
    database before spending a significant effort in sequencing.

  o To identify more than one protein which cannot be separated by
    other means (this method has been successfully used to identify
    two proteins which were digested together).

The template of the body of the message to be sent to
cbrg@inf.ethz.ch is (between but not including the dashed lines):

---------------------------------------------------------------------
MassSearch
Trypsin: 1524.0, 1509.7, 1387.5, 1169.4, 1014.4, 842.5, 836.4, 743.2,
717.2, 563.1, 511.3
---------------------------------------------------------------------

The token "MassSearch" indicates the operation to be run. The
following lines contain the name of the digesting enzyme followed by
the weights. The weights can be separated by spaces, commas, tabs or
newlines as convenient, but no other extraneous characters are
allowed. Several searches can be requested in a single message; each
request must be identified by the name of the enzyme, followed by the
weights.

The output of the above request is:

Searching on SwissProt version 23.
For each set of weights, the matching sequences are printed in
decreasing order of significance. Scores lower than 70 are
generally not significant.

Searching the weights 1524, 1509.7000, 1387.5000, 1169.4000,
1014.4000, 842.5000, 836.4000, 743.2000, 717.2000, 563.1000,
511.3000 as digested by Trypsin

Score   n  k  AC       DE  OS
159.4  15  9  P80049;  FATTY ACID-BINDING PROTEIN, LIVER (FABP).
                       GINGLYMOSTOMA CIRRATUM (NURSE SHARK).
 76.2  28  5  P22966;  ANGIOTENSIN-CONVERTING ENZYME PRECURSOR,
                       TESTIS-SPECIFIC (EC 3.4.15.1) (ACE) (DIPEPTIDYL
                       CARBOXYPEPTIDASE I) (KININASE II).
                       HOMO SAPIENS (HUMAN).
 72.4  11  4  P16291;  COAGULATION FACTOR IX (EC 3.4.21.22) (CHRISTMAS
                       FACTOR) (FRAGMENT).
                       OVIS ARIES (SHEEP).
 72.3  25  2  P18416;  TRANSPOSASE (TRANSPOSON TN552) (ORF 480).
                       STAPHYLOCOCCUS AUREUS.
 71.0   5  6  P08821;  DNA-BINDING PROTEIN II (HB) (HU).
                       BACILLUS SUBTILIS, AND BACILLUS GLOBIGII.
 66.9  23  7  P13214;  ANNEXIN IV (LIPOCORTIN IV) (ENDONEXIN I)
                       (CHROMOBINDIN 4) (PROTEIN II) (P32.5) (PLACENTAL
                       ANTICOAGULANT PROTEIN II) (PAP-II) (PP4-X)
                       (35-BETA CALCIMEDIN).
                       BOS TAURUS (BOVINE).
 . . . . .

The first column measures the quality of the match between the given
weights and a protein sequence in the database. The higher the score,
the better the match. The hits are listed in order of decreasing
score.

The second column, labelled n, gives the number of fragments that
result from the theoretical digestion of the protein found. The third
column, labelled k, gives the number of given weights which were
successfully matched against the theoretical digestion. The score is
calculated from the total number of fragments, the number of given
weights matched, and how closely these weights could be matched.

The fourth column gives the accession number of the sequence in
SwissProt. The rest of each entry contains the description and the
species of the sequence, which serve as a quick guide to identifying
the protein.
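
To make the meaning of the n and k columns concrete, here is a minimal
Python sketch. It is not the server's code, and it does not reproduce
the score. It assumes average residue masses (the post does not say
which mass table the server uses), and the function names and the
0.5 Da tolerance are hypothetical choices made for illustration only.

```python
# Average residue masses in daltons. ASSUMPTION: the server's own mass
# table is not given in the post; these are standard average values.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added per free peptide


def digest_trypsin(seq):
    """Cut after R, and after K unless the next residue is P
    (the Trypsin row of the enzyme table in this post)."""
    fragments, start = [], 0
    for i in range(len(seq) - 1):
        aa, nxt = seq[i], seq[i + 1]
        if aa == "R" or (aa == "K" and nxt != "P"):
            fragments.append(seq[start:i + 1])
            start = i + 1
    fragments.append(seq[start:])
    return fragments


def fragment_mass(fragment):
    """Mass of a peptide: sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in fragment) + WATER


def n_and_k(seq, weights, tol=0.5):
    """Return (n, k): n theoretical fragments, and k given weights that
    lie within tol (a hypothetical tolerance) of some fragment mass."""
    masses = [fragment_mass(f) for f in digest_trypsin(seq)]
    k = sum(1 for w in weights if any(abs(w - m) <= tol for m in masses))
    return len(masses), k


print(digest_trypsin("AKRPGKPLR"))                 # ['AK', 'R', 'PGKPLR']
print(n_and_k("AKRPGKPLR", [217.3, 174.2, 500.0]))  # (3, 2)
```

A real search would repeat this for every database entry and rank the
entries by a score; how that score is derived from n, k and the
closeness of the matches is described only in the reference cited in
this post.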

A complete description of the algorithm and its probabilistic
foundations can be found in chapter 20 of "A tutorial introduction to
computational biochemistry using the Darwin system" by G.H. Gonnet.
The boundary between insignificant and significant matches lies at a
score of about 70: scores below 70 are generally not significant,
while scores above 70 are.

The enzymes which are presently recognized, and the names to be used,
are the following (courtesy of Amos Bairoch):

Enzyme name         cuts between                     except for
###########         ############                     ##########
Armillaria          Xaa-Cys,Xaa-Lys
ArmillariaMellea    Xaa-Lys
BNPS_NCS            Trp-Xaa
Chymotrypsin        Trp-Xaa,Phe-Xaa,Tyr-Xaa,         Trp-Pro,Phe-Pro,Tyr-Pro,
                    Met-Xaa,Leu-Xaa                  Met-Pro,Leu-Pro
Clostripain         Arg-Xaa
CNBr_Cys            Met-Xaa,Xaa-Cys
CNBr                Met-Xaa
AspN                Xaa-Asp
LysC                Lys-Xaa
Hydroxylamine       Asn-Gly
MildAcidHydrolysis  Asp-Pro
NBS_long            Trp-Xaa,Tyr-Xaa,His-Xaa
NBS_short           Trp-Xaa,Tyr-Xaa
NTCB                Xaa-Cys
PancreaticElastase  Ala-Xaa,Gly-Xaa,Ser-Xaa,Val-Xaa
PapayaProteinaseIV  Gly-Xaa
PostProline         Pro-Xaa                          Pro-Pro
Thermolysin         Xaa-Leu,Xaa-Ile,Xaa-Met,
                    Xaa-Phe,Xaa-Trp,Xaa-Val
TrypsinArgBlocked   Lys-Xaa                          Lys-Pro
TrypsinCysModified  Arg-Xaa,Lys-Xaa,Cys-Xaa          Arg-Pro,Lys-Pro,Cys-Pro
TrypsinLysBlocked   Arg-Xaa                          Arg-Pro
Trypsin             Arg-Xaa,Lys-Xaa                  Lys-Pro
V8AmmoniumAcetate   Glu-Xaa                          Glu-Pro
V8PhosphateBuffer   Asp-Xaa,Glu-Xaa                  Asp-Pro,Glu-Pro

Please report any problems with the server to knecht@inf.ethz.ch.
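
For readers who want to apply the enzyme table themselves, the "cuts
between" and "except for" columns can be read as pairs of adjacent
residues. The Python sketch below shows one possible encoding. It is
an illustration only, not the server's implementation: the ENZYMES
dictionary, the use of "X" for Xaa, and the function names are all
hypothetical, and only four rows of the table are transcribed.

```python
# Each rule is a (left, right) pair of one-letter residue codes;
# "X" stands for Xaa (any residue). Only a few rows of the table are
# transcribed here; the encoding itself is an assumption.
ENZYMES = {
    "Trypsin":     {"cuts": [("R", "X"), ("K", "X")], "except": [("K", "P")]},
    "AspN":        {"cuts": [("X", "D")],             "except": []},
    "PostProline": {"cuts": [("P", "X")],             "except": [("P", "P")]},
    "CNBr":        {"cuts": [("M", "X")],             "except": []},
}


def pair_matches(rule, left, right):
    """True if the adjacent residues (left, right) fit the rule."""
    return rule[0] in ("X", left) and rule[1] in ("X", right)


def digest(seq, enzyme):
    """Split seq between every residue pair that matches a 'cuts
    between' rule and does not match an 'except for' rule."""
    rules = ENZYMES[enzyme]
    fragments, start = [], 0
    for i in range(len(seq) - 1):
        left, right = seq[i], seq[i + 1]
        cut = any(pair_matches(r, left, right) for r in rules["cuts"])
        blocked = any(pair_matches(r, left, right) for r in rules["except"])
        if cut and not blocked:
            fragments.append(seq[start:i + 1])
            start = i + 1
    fragments.append(seq[start:])
    return fragments


print(digest("AKRPGKPLR", "Trypsin"))   # ['AK', 'R', 'PGKPLR']
print(digest("ADGDK", "AspN"))          # ['A', 'DG', 'DK']
print(digest("APPGA", "PostProline"))   # ['APP', 'GA']
```

Keeping the rules as data rather than as code means that adding
another row of the table is a one-line change.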