Methodology
- DIpro is a cysteine disulfide bond predictor based on 2D recursive neural networks, support vector machines, graph matching, and regression algorithms. It predicts whether a sequence has disulfide bonds, estimates the number of disulfide bonds, and predicts the bonding state of each cysteine and the bonded pairs. It yields the best accuracy on the benchmark dataset SP39 [1]. It can handle any number of disulfide bonds, whereas most methods available so far can handle only up to five.
- Procedure: The sequence is processed in two steps. Step 1: a support vector machine classifies whether the sequence has disulfide bonds. Step 2: neural networks and graph algorithms predict the number of bonds, the bonding states, and the bonding pattern.
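The graph-matching part of Step 2 can be illustrated with a small sketch. It assumes the neural network has already produced a bonding probability for each cysteine pair and that the number of bonds has already been estimated. The function name best_bond_pattern, the pair_prob dictionary, and the exhaustive search are illustrative assumptions, not the DIpro implementation; DIpro uses a weighted graph matching algorithm [1], whereas this brute-force version is only practical for a handful of cysteines.

```python
def best_bond_pattern(cys_positions, pair_prob, n_bonds):
    """Pick n_bonds disjoint cysteine pairs with the highest summed probability.

    cys_positions: cysteine positions in the sequence (hypothetical input).
    pair_prob: dict mapping (pos_i, pos_j) with pos_i < pos_j to a probability.
    Exhaustive search, so only suitable for small illustrative inputs.
    """
    best_score, best_pairs = float("-inf"), []

    def search(remaining, chosen, score):
        nonlocal best_score, best_pairs
        if len(chosen) == n_bonds:
            if score > best_score:
                best_score, best_pairs = score, list(chosen)
            return
        # Not enough free cysteines left to reach n_bonds.
        if len(remaining) < 2 * (n_bonds - len(chosen)):
            return
        first, rest = remaining[0], remaining[1:]
        # Option 1: leave the first remaining cysteine unbonded.
        search(rest, chosen, score)
        # Option 2: bond it with each later cysteine in turn.
        for i, partner in enumerate(rest):
            p = pair_prob.get((first, partner), 0.0)
            chosen.append((first, partner))
            search(rest[:i] + rest[i + 1:], chosen, score + p)
            chosen.pop()

    search(sorted(cys_positions), [], 0.0)
    # Report pairs by descending probability, as in the output table.
    return sorted(best_pairs, key=lambda pr: pair_prob.get(pr, 0.0), reverse=True)


# Made-up probabilities for four cysteines.
probs = {
    (3, 10): 0.9, (3, 22): 0.2, (3, 30): 0.1,
    (10, 22): 0.3, (10, 30): 0.4, (22, 30): 0.8,
}
pattern = best_bond_pattern([3, 10, 22, 30], probs, n_bonds=2)
```

Each cysteine appears in at most one selected pair, which is the constraint that makes this a matching problem rather than a simple top-k selection.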
Input Format
- Email address: where the prediction result is sent
- Query name: optional, used to identify the query
- Sequence: the raw sequence text; white space is ignored
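As a rough illustration of how a raw sequence could be prepared before submission, the sketch below removes all white space and lists the cysteine positions. The function names, the upper-casing, and the 1-based numbering are assumptions made for this example; the server's own preprocessing is not documented here.

```python
def clean_sequence(raw):
    """Remove all white space (spaces, tabs, newlines) and upper-case the residues."""
    return "".join(raw.split()).upper()


def cysteine_positions(seq):
    """Return the 1-based positions of every cysteine (one-letter code C)."""
    return [i for i, residue in enumerate(seq, start=1) if residue == "C"]


raw_input_text = "ac d\nCg\tc"
sequence = clean_sequence(raw_input_text)
positions = cysteine_positions(sequence)
```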
Output Format
- Query name
- Query sequence length
- Protein sequence with cysteine positions identified
- Classification results from Support Vector Machine
- Total number of cysteines in sequence
- The number of predicted bonds
- Positions of cysteines which are predicted to form disulfide bonds
- A table to list the predicted pairs of bonded cysteines ordered by probability in descending order. Column 1: bond index, Column 2: position of the first cysteine, Column 3: position of the second cysteine.
Performance
We developed two versions of the disulfide bond predictor (DIpro 1.0 and DIpro 2.0). DIpro 1.0 has been online since Oct 23rd, 2003, and DIpro 2.0 since Aug 10th, 2004. In general, DIpro 2.0 should perform better than DIpro 1.0 because it was trained on a larger dataset.
- The classification accuracy of the support vector machine on SP51 [1] is 83%.
- The prediction accuracy of DIpro 2.0 on the SPX (or DIPRO2) dataset:
--------------------------------------------------------
Bond Num    Pair_Recall(%)    Pair_Precision(%)
            (Sensitivity)     (Specificity)
--------------------------------------------------------
1           71                48
2           63                63
3           62                67
4           50                55
5           37                41
6           29                33
7           31                36
8           30                32
9           61                71
10          37                40
12          50                55
14          57                62
16          22                23
17          35                40
19          42                73
25          24                40
---------------------------------------------------------
Overall     55                54
---------------------------------------------------------
Overall bond state recall (sensitivity): 89.4%
Overall bond state precision (specificity): 87.8%
Bond number prediction accuracy: 71%
Average difference between true bond number and predicted bond number: 1.04
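For readers who want to compute comparable figures on their own predictions, the sketch below shows one common way to calculate pair recall and pair precision for a single chain. The function name and the example pairs are hypothetical, and how the figures above were aggregated across chains is not specified here, so this should be read as an illustration of the two measures rather than the exact evaluation procedure.

```python
def pair_recall_precision(true_pairs, predicted_pairs):
    """Return (recall, precision) for predicted disulfide pairs of one chain.

    recall    = correctly predicted pairs / true pairs
    precision = correctly predicted pairs / predicted pairs
    Pairs are treated as unordered, so (10, 3) equals (3, 10).
    """
    true_set = {tuple(sorted(p)) for p in true_pairs}
    pred_set = {tuple(sorted(p)) for p in predicted_pairs}
    correct = len(true_set & pred_set)
    recall = correct / len(true_set) if true_set else 0.0
    precision = correct / len(pred_set) if pred_set else 0.0
    return recall, precision


# Made-up example: two true bonds, three predicted bonds, one of them correct.
recall, precision = pair_recall_precision(
    true_pairs=[(3, 10), (22, 30)],
    predicted_pairs=[(10, 3), (22, 40), (30, 50)],
)
```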
References
For a full paper including SVM classification, neural networks, statistical analysis, graph algorithm, and the SPX (or DIPRO2) dataset:
[1] Jianlin Cheng, Hiroto Saigo, Pierre Baldi, "Large-Scale Prediction of Disulphide Bridges Using Kernel Methods, Two-Dimensional Recursive Neural Networks, and Weighted Graph Matching", Proteins: Structure, Function, Bioinformatics, vol. 62, no. 3, pp. 617-629, 2006. [PDF]
For the results of neural networks on SP39 and SP41:
[2] Pierre Baldi, Jianlin Cheng, Alessandro Vullo, "Large-Scale Prediction of Disulphide Bond Connectivity", Advances in Neural Information Processing Systems (NIPS 2004) 17, L. Saul, Y. Weiss, and L. Bottou, editors, pp. 97-104, MIT Press, Cambridge, MA, 2005. [PDF] or [PDF at NIPS website]
- Pierre Baldi, Professor and Director of the Institute of Genomics and Bioinformatics, School of Information & Computer Science, University of California Irvine
- Jianlin Cheng, Ph.D. student, School of Information & Computer Science, UCI
- Hiroto Saigo, Ph.D. student, Kyoto University, Japan. He is currently a visitor at Pierre Baldi's lab.
- Alessandro Vullo, Ph.D. student, University of Florence, Italy
Download DIpro Software (free for scientific use)
Download DIpro 2.0 (about 29M, Linux version, predicts disulfide bond patterns). Click here or see readme.txt in the zip file for the installation instructions.
DIpro 2.0 depends on the SSpro package. You can download SSpro 4.0 here.
Download Cysbond (SVM classifier that predicts whether a protein chain has disulfide bonds). See README in the zip file for the installation instructions.
Dataset
The disulfide bond dataset (new name: SPX, previous name: DIPRO2) used to train the neural networks of DIpro 2.0 was derived from PDB and augmented with solvent accessibilities and secondary structures generated by the DSSP program. Each entry includes the sequence name (PDB code + chain id, line 1); the sequence length, the number of bonded cysteines, and the total number of cysteines (line 2); the sequence (line 3); the secondary structure (line 4); the relative solvent accessibility (line 5; e: exposed, -: buried, determined at a 25% threshold); and the disulfide bond information (remaining lines, each line corresponding to one disulfide bond identified by the positions of its cysteine pair). The redundancy in the dataset was reduced using UniqProt. The similarity between any two sequences is less than about 30%.
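The entry layout described above could be read with a parser along the following lines. This is a sketch under stated assumptions: the fields on line 2 and on each bond line are assumed to be separated by white space, and the SpxEntry class, the parse_entry function, and the sample entry are all hypothetical. The actual delimiters should be checked against the downloaded file.

```python
from dataclasses import dataclass


@dataclass
class SpxEntry:
    name: str                  # line 1: PDB code + chain id
    length: int                # line 2, field 1
    n_bonded_cys: int          # line 2, field 2
    n_cys: int                 # line 2, field 3
    sequence: str              # line 3
    secondary_structure: str   # line 4
    accessibility: str         # line 5: 'e' exposed, '-' buried
    bonds: list                # remaining lines: one (pos1, pos2) per bond


def parse_entry(lines):
    """Parse the lines of one entry; assumes white-space-separated fields."""
    rows = [ln.rstrip("\r\n") for ln in lines]
    length, n_bonded, n_cys = (int(x) for x in rows[1].split()[:3])
    bonds = [
        tuple(int(x) for x in row.split()[:2])
        for row in rows[5:]
        if row.strip()  # skip blank trailing lines
    ]
    return SpxEntry(
        name=rows[0].strip(),
        length=length,
        n_bonded_cys=n_bonded,
        n_cys=n_cys,
        sequence=rows[2].strip(),
        secondary_structure=rows[3],
        accessibility=rows[4],
        bonds=bonds,
    )


# Made-up entry in the assumed layout (not taken from the real dataset).
sample = "1ABCA\n6 2 2\nACDCGA\nCCEECC\ne--e-e\n2 4\n".splitlines()
entry = parse_entry(sample)
```

A simple consistency check after parsing is to confirm that every bond position points at a C in the sequence and that the stored length matches the sequence length.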
Download the disulfide bond dataset used to train neural networks
The positive and negative datasets used to train the support vector machine (Cysbond) to discriminate proteins with disulfide bonds from proteins without them were also derived from PDB. The pairwise sequence similarity is <25%.
Download the negative dataset used to train SVM
Download the positive dataset used to train SVM