====================================================================================================
SCRATCH Suite of One-Dimensional Predictors (SCRATCH-1D)
Package Description & Project Documentation
====================================================================================================

Author(s) : Christophe Magnan (cmagnan@ics.uci.edu)
            Gregor Urban (gurban@uci.edu)
            Pierre Baldi (pfbaldi@uci.edu)

Copyright : Institute for Genomics and Bioinformatics
            University of California, Irvine

Modified  : 2021/03/20

====================================================================================================
Package Description
====================================================================================================

SCRATCH-1D is a suite of one-dimensional deep-learning-based predictors included in the
long-established and widely used SCRATCH suite of predictors developed by the Institute for
Genomics and Bioinformatics (IGB) of the University of California, Irvine (UCI):

http://scratch.proteomics.ics.uci.edu

SCRATCH-1D currently includes the following predictors and tools:

- SSpro       Release 6.0   Protein secondary structure prediction (3-class)
- SSpro8      Release 6.0   Protein secondary structure prediction (8-class)
- ACCpro      Release 6.0   Protein relative solvent accessibility prediction (at the 25% threshold)
- ACCpro20    Release 6.0   Protein relative solvent accessibility prediction (thresholds 0% to 95%)
- PROFILpro   Release 2.0   Protein evolutionary information / profiles for 1D & 2D predictors
- HOMOLpro    Release 2.0   Homology-based secondary structure & solvent accessibility prediction
- EVALpro     Release 1.0   Evaluation of sequence-based and profile-based 1D predictors

Several 1D predictors in SCRATCH have similar methods, use identical reference protein databases,
and share the same third-party tools.
SCRATCH-1D unifies all these predictors in a single software package, allowing more efficient
processing of the queries, both in terms of computing resources and time. For instance, the
sequence profiles needed as input to the neural networks of each predictor are generated following
the same protocol and databases for all the predictors. These profiles are no longer generated
separately by each predictor but are now generated only once using a new tool developed
specifically for this task: PROFILpro. Also, all the predictors and corresponding tools are now
compatible with multi-core machines, allowing efficient processing of large datasets.

====================================================================================================
Package Contents
====================================================================================================

A brief description of the package folders is provided below:

- bin   Script to run SCRATCH-1D predictors
- dat   Datasets & models for SCRATCH-1D
- doc   Documentation and test example
- env   Bash profile for running SCRATCH-1D
- lib   SCRATCH-1D & predictors source code
- opt   Third-party tools (HHSUITE & BLAST+)
- tmp   Temporary work folder for the software

====================================================================================================
Program Usage
====================================================================================================

SCRATCH-1D comes with only one script to run all the predictors: bin/run_scratch1d_predictors.sh

USAGE: ./bin/run_scratch1d_predictors.sh [options]

See the sections below for the list of available options. The program usage is displayed anytime
by executing the main launcher script without any argument:

./bin/run_scratch1d_predictors.sh

============================== Input Fasta File & Protein Selection ==============================

--input_fasta      REQUIRED   User-provided set of protein sequences in FASTA file format.
                              The number of proteins is not limited.
                              Accepts gzip/bzip2 (.gz/.bz2) compressed files.

--num_proteins     OPTIONAL   These options can be used to select a subset of proteins in the
--first_protein               provided FASTA file rather than processing the entire set of
                              proteins. This can be useful to submit large sets of proteins to
                              multiple computers without having to split the FASTA file. The
                              first option specifies how many proteins to select and the second
                              one the index of the first protein to select (starting at 0):
                              First 10: --num_proteins 10 --first_protein 0
                              Next 10 : --num_proteins 10 --first_protein 10

--max_protein_len  OPTIONAL   For computational reasons, proteins longer than 5000 amino acids
                              are rejected from the provided FASTA file by default. This option
                              overrides this limit and allows longer proteins to be processed
                              as well. In this case, we recommend:
                              (1) processing these proteins separately
                              (2) a minimum of 32 GB of RAM available
                              (3) setting the option --num_threads to 4
                              (4) setting the option --timeout_hrs to 24
                              See the next sections for more details.

================================ Default & Optional Output Files =================================

--output_prefix    OPTIONAL   The prefix that will be used to name any output file of the
                              program. Default value: 'SCRATCH-1D'. Any existing file with this
                              prefix will be lost, and all folders in the prefix path must
                              already exist.

--all_predictions  OPTIONAL   Report all predictions made by each predictor in the output files
                              rather than reporting only the highest-accuracy consensus
                              prediction. With this option, the output files will include:
                              (1) the sequence-based predictions
                              (2) the profile-based predictions
                              (3) the template-based predictions
                              (4) the final consensus predictions
                              The format of the output files for each predictor changes in this
                              case (see next sections).

--keep_msa         OPTIONAL   Keep all multiple sequence alignments generated for each protein
                              (2 total). Will take between a few MBs and 1 GB of disk space per
                              protein.
                              Will be reported in the file: output_prefix.msa

--keep_S1D         OPTIONAL   Keep the 1D features extracted for each protein and used to make
                              the sequence-based prediction.
                              Will be reported in the file: output_prefix.S1D

--keep_S2D         OPTIONAL   Keep the 2D features extracted for each protein and used to make
                              the sequence-based prediction.
                              Will be reported in the file: output_prefix.S2D

--keep_P1D         OPTIONAL   Keep the 1D features extracted for each protein and used to make
                              the profile-based prediction.
                              Will be reported in the file: output_prefix.P1D

--keep_P2D         OPTIONAL   Keep the 2D features extracted for each protein and used to make
                              the profile-based prediction.
                              Will be reported in the file: output_prefix.P2D

================================= Resource Usage & Other Options =================================

--num_threads      OPTIONAL   Maximum number of threads that the program will run
                              simultaneously. Minimum & default value is 4. Use 'lscpu' to find
                              out how many threads can run in parallel on your computer
                              (= number after CPU(s)).

--timeout_hrs      OPTIONAL   Maximum number of hours available to blast+ or hhblits to generate
                              a multiple sequence alignment (MSA) before the process is killed
                              and the corresponding task is considered failed. Default value:
                              12 hours for each attempt to extract an MSA.

--help             OPTIONAL   Display the help message and exit.

====================================================================================================
Input Files Format
====================================================================================================

Input files must be in the standard FASTA file format. There is no limit to the number of input
sequences to be processed other than those imposed by the amount of RAM available on the machine
running the program.
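
As an illustration of the input requirements described above (FASTA format, optional gzip/bzip2
compression, default length limit of 5000 amino acids), the following Python sketch counts the
proteins in a FASTA file and lists those that would exceed the default limit. It is not part of
the package: the function names (open_fasta, read_fasta, summarize) are hypothetical, and the
way SCRATCH-1D itself parses and filters its input may differ.

```python
import bz2
import gzip
import io

DEFAULT_MAX_LEN = 5000  # default length limit stated in this documentation


def open_fasta(path):
    """Open a plain, gzip (.gz) or bzip2 (.bz2) FASTA file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(handle, max_len=DEFAULT_MAX_LEN):
    """Return (number of proteins, headers of proteins longer than max_len)."""
    count, too_long = 0, []
    for header, sequence in read_fasta(handle):
        count += 1
        if len(sequence) > max_len:
            too_long.append(header)
    return count, too_long


# Small in-memory example; a real file would be opened with open_fasta(path).
example = io.StringIO(">p1\nDTLDEAERQW\nKAEF\n>p2\nCCHH\n")
print(summarize(example, max_len=10))  # (2, ['p1'])
```

The protein count returned by such a check can be used to choose values for --num_proteins and
--first_protein, and the list of long proteins identifies the sequences that would need
--max_protein_len to be processed.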

====================================================================================================
Output Files Description
====================================================================================================

SCRATCH-1D will generate 5 output files by default, named using the value of --output_prefix:

- output_prefix.ss3 : predicted secondary structure (3-class, SSpro)
- output_prefix.ss8 : predicted secondary structure (8-class, SSpro8)
- output_prefix.acc : predicted solvent accessibility (2-class, ACCpro)
- output_prefix.rsa : predicted solvent accessibility (20-class, ACCpro20)
- output_prefix.dat : prediction statistics extracted for each protein

In case of a failure on one or more (but not all) proteins, 2 additional files are created:

- output_prefix.faa : sequences of the failed proteins in FASTA file format
- output_prefix.log : a short text file providing the reason for the failure

SCRATCH-1D can optionally provide 5 additional output files (see options above):

- output_prefix.msa : multiple sequence alignments generated for each protein
- output_prefix.S1D : 1D features used to make the sequence-based prediction
- output_prefix.S2D : 2D features used to make the sequence-based prediction
- output_prefix.P1D : 1D features used to make the profile-based prediction
- output_prefix.P2D : 2D features used to make the profile-based prediction

====================================================================================================
Output Files Format
====================================================================================================

This section provides a description of the various output file formats used by SCRATCH-1D.

===================================== Prediction File Format =====================================

The output files with extension ss3, ss8, acc, or rsa share the same file format, which itself
depends on the absence or presence of option --all_predictions in the command line.

Case 1: option --all_predictions is NOT provided in the command line (default).

In this case, the output files follow a FASTA-like file format enriched with the final consensus
prediction of the corresponding predictor and the confidence scores associated with each
prediction. Each protein is reported in output using 4 lines:

Line 1   >protein_id         The original protein header line as provided by the user
Line 2   DTLDEAERQWKAEF...   The original protein sequence as provided by the user
Line 3   CCHHHHHEEEHHHH...   The consensus prediction of the corresponding predictor
Line 4   88998856788998...   The confidence score associated with each prediction (0-9)

Case 2: option --all_predictions is provided in the command line.

In this case, the output files are written in a tab-separated file format providing all
predictions (sequence-based, profile-based, template-based, and the final output consensus
prediction reported in case 1) associated with each position in each protein, together with
confidence scores. Each protein starts with the original header line found in the input FASTA
file and is followed by one line per position in the protein. A description of the 22 columns in
the output files is provided below.

column 1    Fixed value "pos" / start of the protein position fields
column 2    sequential protein position starting with position 1
column 3    amino-acid found at the corresponding position
column 4    Fixed value "seq" / start of the sequence-based prediction
column 5    The network-predicted class using sequence-based features
column 6    The corresponding prediction probability from the network
column 7    Corrected prediction probability based on sequence similarity
column 8    The corresponding confidence score ranging from 0 to 9
column 9    Fixed value "pro" / start of the profile-based prediction
column 10   The network-predicted class using profile-based features
column 11   The corresponding prediction probability from the network
column 12   Corrected prediction probability based on profile similarity
column 13   The corresponding confidence score ranging from 0 to 9
column 14   Fixed value "hom" / start of the homology-based prediction
column 15   The class predicted by HOMOLpro or 'x' if no template was found
column 16   The frequency of the predicted class among the selected templates
column 17   Prediction probability based on template features and alignment
column 18   The corresponding confidence score ranging from 0 to 9
column 19   Fixed value "out" / start of the output consensus prediction
column 20   The final/recommended consensus prediction for the position
column 21   The prediction probability (from the selected prediction)
column 22   The corresponding confidence score ranging from 0 to 9

===================================== Statistics File Format =====================================

The output file "output_prefix.dat" provides information that can be useful for downstream
analyses in a tab-separated file format. Column description is provided below.

fasta_entry             The original protein identifier (truncated if longer than 25 chars)
chain_length            The length of the amino-acid sequence
hhblits_database        The database used to generate the MSA using hhblits
hhblits_iterations      The number of iterations performed on the database
hhblits_total_hits      Total number of hits in the MSA (including the query)
psiblast_database       The database used to generate the MSA using psiblast
psiblast_iterations     The number of iterations performed on the database
psiblast_total_hits     Total number of hits in the MSA (including the query)
ss3_accuracy_estimate   Estimated SSpro prediction accuracy on the protein
ss8_accuracy_estimate   Estimated SSpro8 prediction accuracy on the protein
acc_accuracy_estimate   Estimated ACCpro prediction accuracy on the protein
rsa_accuracy_estimate   Estimated ACCpro20 prediction accuracy on the protein

================================== Multiple Sequence Alignments ==================================

Multiple sequence alignments generated for each protein (2 total) are reported as shown below:

>protein_identifier
>program=HHSUITE database=UNICLUST30_2018 output_hom=3211 selected_hom=3211
DHCPLGPGRCCRLHSVRASLEDLGWADWVLSPREVQVTMCIGAC
DGCPLGEGRCCRLQ-PRASLQDLGWANWVVAPRELDVRMCVGAC
DGCPLGEGRCCRLQSLRAYLQDLGWASWVVAPRELDVRMCVGAC
DGCPLGEGRC-RLQSLRASL-DLGWANW-VAPRELDVRMCV---

The first line is the original protein header line from the input FASTA file. The second line
describes the MSA: the software used to generate it (HHSUITE or BLAST+), the database used to
generate it, the total number of hits returned by hhblits or psiblast (output_hom), and the
number of hits selected to generate the profiles and reported in the next lines (selected_hom).
The next lines provide the multiple sequence alignment, starting with the query protein.
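
The MSA layout described above can be read with a few lines of Python. The sketch below is only
an illustration under stated assumptions: the function names are hypothetical, and it assumes
that every alignment block in output_prefix.msa starts with exactly two '>' lines (the protein
header, then the description line), as in the example. If the two MSAs of a protein share a
single header line in the real file, the block detection would need to be adapted.

```python
import io


def parse_msa_description(line):
    """Parse '>program=... database=... output_hom=... selected_hom=...' into a dict."""
    fields = {}
    for item in line.lstrip(">").split():
        key, _, value = item.partition("=")
        fields[key] = value
    for key in ("output_hom", "selected_hom"):
        if key in fields:
            fields[key] = int(fields[key])
    return fields


def read_msa_blocks(handle):
    """Yield (protein_header, description, aligned_rows) for each alignment block.

    Assumption: each block begins with two consecutive '>' lines.
    """
    header, description, rows = None, None, []
    for raw in handle:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith(">"):
            if header is not None and description is None:
                # Second '>' line of the current block: the MSA description.
                description = parse_msa_description(line)
            else:
                if header is not None:
                    yield header, description, rows
                header, description, rows = line[1:], None, []
        else:
            rows.append(line)
    if header is not None:
        yield header, description, rows


# The example block from this section (only the first rows are shown there).
SAMPLE = """>protein_identifier
>program=HHSUITE database=UNICLUST30_2018 output_hom=3211 selected_hom=3211
DHCPLGPGRCCRLHSVRASLEDLGWADWVLSPREVQVTMCIGAC
DGCPLGEGRCCRLQ-PRASLQDLGWANWVVAPRELDVRMCVGAC
DGCPLGEGRCCRLQSLRAYLQDLGWASWVVAPRELDVRMCVGAC
DGCPLGEGRC-RLQSLRASL-DLGWANW-VAPRELDVRMCV---
"""

blocks = list(read_msa_blocks(io.StringIO(SAMPLE)))
for header, description, rows in blocks:
    print(header, description["program"], description["database"], len(rows))
```

The first aligned row of each block is the query protein, so rows[0] with gap characters removed
should match the input sequence.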

==================================== Protein 1D & 2D Features ====================================

The optional output files with extension S1D, S2D, P1D, and P2D provide the actual features
extracted for each protein and used as input to the neural networks. 1D features describe a
specific aspect of a single protein position, while 2D features describe a specific aspect of a
pair of protein positions (like correlated mutations). S1D and S2D are features extracted from
the query protein sequence alone, while P1D and P2D are features extracted using both the query
protein sequence and the two MSAs generated using hhblits and psiblast, that is, using
evolutionary profiles.

The file format for these four files is the same: six header lines describe the dataset, then
the proteins are reported one after the other. Here is an example of the six header lines:

Line 1 : amino_acid_BLOSUM62_pssm_A amino_acid_BLOSUM62_pssm_R ...
Line 2 : num_chain 150
Line 3 : min_chain_len 37
Line 4 : avg_chain_len 220.73
Line 5 : max_chain_len 673
Line 6 : num_features 71

And the corresponding definitions:

Line 1 : tab-separated list of feature/column names
Line 2 : total number of chains in the dataset
Line 3 : min chain length for the proteins in the dataset
Line 4 : average chain length for the proteins in the dataset
Line 5 : max chain length for the proteins in the dataset
Line 6 : number of features per position (1D) or pair of positions (2D)

Note that the S2D and P2D files have two extra columns, "sequential_position_1" and
"sequential_position_2", used to specify the position of each amino acid in the pair. While
useful for parsing the files, these are not actual 2D features and are not counted in the number
provided in Line 6. For the example provided above, the data lines of a 2D file would therefore
have 73 columns: 2 for the positions & 71 for the features.
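
The six header lines can be read as sketched below. This is a minimal Python illustration, not
code from the package: the function names are hypothetical, and it assumes that the "Line N :"
labels shown above are documentation only (not present in the files) and that each of header
lines 2 to 6 holds a name and a value separated by whitespace.

```python
import io


def read_feature_header(handle):
    """Read the six header lines of an S1D/S2D/P1D/P2D file.

    Returns (feature_names, metadata), where metadata maps num_chain,
    min_chain_len, avg_chain_len, max_chain_len and num_features to numbers.
    """
    names = handle.readline().rstrip("\n").split("\t")
    metadata = {}
    for _ in range(5):
        key, value = handle.readline().split()
        metadata[key] = float(value) if "." in value else int(value)
    return names, metadata


def expected_data_columns(metadata, two_dimensional):
    """Columns per data line: 2D files carry two extra position columns."""
    return metadata["num_features"] + (2 if two_dimensional else 0)


# Header values from the example above; the name list is shortened here
# exactly as it is in the documentation, so it does not contain 71 names.
HEADER = (
    "amino_acid_BLOSUM62_pssm_A\tamino_acid_BLOSUM62_pssm_R\n"
    "num_chain\t150\n"
    "min_chain_len\t37\n"
    "avg_chain_len\t220.73\n"
    "max_chain_len\t673\n"
    "num_features\t71\n"
)

names, metadata = read_feature_header(io.StringIO(HEADER))
print(metadata["num_chain"], metadata["avg_chain_len"])
print(expected_data_columns(metadata, two_dimensional=False))  # 71
print(expected_data_columns(metadata, two_dimensional=True))   # 73
```

Checking the column count of each data line against expected_data_columns is a cheap way to
detect a truncated or corrupted feature file before further processing.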

Proteins are then reported one after the other following this format for S1D and P1D:

Line 1   : >protein_identifier as provided in the input fasta file
Line 2   : corresponding protein length / number of positions (N)
Line 3   : features for position 1 (tab-separated)
Line 4   : features for position 2 (tab-separated)
...      : ...
Line N+2 : features for position N (tab-separated)

and following this format for S2D and P2D:

Line 1   : >protein_identifier as provided in the input fasta file
Line 2   : corresponding protein length / number of positions (N)
Line 3   : corresponding number of unique pairs of positions (P)
Line 4   : features for pair of positions 1 (tab-separated)
Line 5   : features for pair of positions 2 (tab-separated)
...      : ...
Line P+3 : features for pair of positions P (tab-separated)

====================================================================================================
Predictor Selection
====================================================================================================

A large fraction of the processing time is taken up by the generation of the sequence profiles
and the homology analysis using the Protein Data Bank. Since these steps are shared by all the
predictors, running one predictor only, or running all the predictors at once, on a set of
protein sequences does not make a significant difference in terms of computational time.
SCRATCH-1D will thus systematically run the four predictors on the input protein sequences and
all the predictions will be reported in the output files listed above. If only one predictor is
needed, please ignore the other predictions.

====================================================================================================
Prediction Probabilities & Confidence Scores
====================================================================================================

Since release 2.0, SCRATCH-1D provides various probabilities and confidence scores for each of
the predictions made at each protein position. The corrected prediction probabilities and
corresponding confidence scores (see section "Prediction File Format" above) are always
calculated from two different probabilities: (1) the prediction probability obtained from the
neural networks (or the class frequency in the selected templates for the homology-based
prediction), and (2) the expected accuracy of the prediction calculated by EVALpro, based on the
sequence or profile similarity between the query protein and the training dataset. For HOMOLpro,
the expected accuracy is computed from the blast hit length, e-value, % identity, and %
similarity between the query sequence and the template.

Based on our observations, these corrected prediction probabilities / accuracy estimates are
good indicators for separating high-confidence and low-confidence predictions (the actual
prediction accuracy goes up as these probabilities or confidence scores go up, and vice versa),
but their values tend to differ from the actual prediction accuracies. In short, the actual and
the estimated accuracy follow the same trends but their values can differ by a significant
margin.

====================================================================================================
Computational Considerations
====================================================================================================

With the rapid growth of the protein databases (especially since 2016), the computational cost
to generate multiple sequence alignments and evolutionary profiles has significantly increased.
In order to address this increase in computational time, SCRATCH-1D now requires a minimum of 4
threads to run (more is recommended) and a minimum of 16 GB of RAM to accommodate the growing
needs of psiblast and hhblits.

The time and resources needed to process a single protein can change drastically from one
protein to the next. We describe here what was observed during the preparation of SCRATCH-1D
release 2.0 and some tips to optimize computation time.

For ~97% of the proteins tested during our experiments, both computational time and resource
usage were fairly low: between 10 and 60 minutes per protein, and less than 4 GB of RAM used by
each alignment process running with 4 threads. For the remaining ~3% of the proteins, a
significant increase in computational time and RAM usage was observed, with the generation of a
single MSA taking up to 24 hours and 16 GB of RAM.

Depending on the computer used to run SCRATCH-1D and on time constraints, the best way to
process a large dataset can change significantly. Here are some tips that you may find useful:

- If your dataset is very large (thousands of proteins), using a cluster with several nodes will
  most likely be necessary. Two options were added to the package (--num_proteins and
  --first_protein) to split your dataset without having to rewrite your FASTA file.

- An efficient way to rapidly get most of the predictions is to reduce the default value of
  option --timeout_hrs to ~2 hours, and use as many threads as possible on your machine via the
  option --num_threads. Some failures can be expected in this case, but a separate FASTA file
  will be produced in the output, containing all the proteins for which the predictions could
  not be computed in the allotted time. You can then process these proteins in a second run; in
  that case, we recommend using --num_threads 4 (even if the machine has more threads available)
  and --timeout_hrs 24.
  Without this two-step approach, all the results (including the ones that can be obtained
  rapidly) will be put on hold until all the problematic proteins have either been processed or
  have failed.

- As a rule of thumb, in the worst-case scenario we observed 16 GB of RAM usage for generating a
  single MSA on 4 threads. The number of threads you allow via the option --num_threads will
  therefore determine the maximum RAM usage of the run (8 threads ~32 GB, etc.).

- Optimal performance is achieved whenever the number of threads is a multiple of 4.

- For small configurations (less than 4 threads, less than 16 GB of RAM), only three solutions
  can be offered at this time:
  (1) you can use release 1.3 of SCRATCH-1D with an expected drop in accuracy of ~3%;
  (2) you can use release 2.0 to process a large part of the proteins on your computer and
      submit the failed ones to http://scratch.proteomics.ics.uci.edu; or
  (3) during the installation of release 2.0, you can choose smaller protein databases via the
      installation option --version (see installation instructions).
  This last option is provided as a last resort for users trying to run SCRATCH-1D 2.0 on a
  small computer. Note that a drop in accuracy should be expected in this case (however, the
  exact performance of the predictors using small protein databases has not been measured
  systematically).
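
The tips above can be turned into a small planning helper. The Python sketch below is only an
illustration: the function names, the "_retry" output prefix, and the linear RAM extrapolation
are assumptions made for this example, while the command-line options themselves are the ones
documented in this file. It builds the --num_proteins / --first_protein pairs needed to spread a
dataset over several jobs, estimates the worst-case RAM for a given thread count from the
16 GB per 4 threads rule of thumb, and builds the command for a second pass on the failed
proteins (output_prefix.faa).

```python
LAUNCHER = "./bin/run_scratch1d_predictors.sh"


def batch_arguments(total_proteins, batch_size):
    """Return (num_proteins, first_protein) pairs covering indices 0..total-1."""
    return [
        (min(batch_size, total_proteins - first), first)
        for first in range(0, total_proteins, batch_size)
    ]


def worst_case_ram_gb(num_threads, gb_per_four_threads=16):
    """Rough linear extrapolation of the rule of thumb (assumption, not a guarantee)."""
    return num_threads / 4 * gb_per_four_threads


def first_pass_command(fasta, prefix, num_proteins, first_protein,
                       num_threads=8, timeout_hrs=2):
    """Fast first pass: short timeout, as many threads as the machine allows."""
    return [
        LAUNCHER,
        "--input_fasta", fasta,
        "--output_prefix", prefix,
        "--num_proteins", str(num_proteins),
        "--first_protein", str(first_protein),
        "--num_threads", str(num_threads),
        "--timeout_hrs", str(timeout_hrs),
    ]


def second_pass_command(prefix):
    """Second pass on the failed proteins, with the recommended 4 threads / 24 hours."""
    return [
        LAUNCHER,
        "--input_fasta", prefix + ".faa",
        "--output_prefix", prefix + "_retry",  # hypothetical naming choice
        "--num_threads", "4",
        "--timeout_hrs", "24",
    ]


# Example: 25 proteins split into jobs of at most 10 proteins each.
for index, (count, first) in enumerate(batch_arguments(25, 10)):
    print(" ".join(first_pass_command("proteins.fasta", f"job{index}", count, first)))
print(worst_case_ram_gb(8))  # 32.0
```

The command lists can be passed to a job scheduler or to subprocess.run; the second-pass command
only makes sense for jobs that actually produced an output_prefix.faa file.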

====================================================================================================
Terminal Colors
====================================================================================================

If you want to remove the custom coloring of the text displayed on your terminal:

1) Edit the bin/run_scratch1d_predictors.sh program and uncomment the following lines:

   #HD="\033[0m"   =>   HD="\033[0m"
   #PR="\033[0m"   =>   PR="\033[0m"
   #MS="\033[0m"   =>   MS="\033[0m"
   #WN="\033[0m"   =>   WN="\033[0m"
   #ER="\033[0m"   =>   ER="\033[0m"
   #NM="\033[0m"   =>   NM="\033[0m"

2) Edit the lib/SCRATCH1D_Utilities.pm program and uncomment the following lines:

   #my $HD="reset";   =>   my $HD="reset";
   #my $PR="reset";   =>   my $PR="reset";
   #my $MS="reset";   =>   my $MS="reset";
   #my $WN="reset";   =>   my $WN="reset";
   #my $ER="reset";   =>   my $ER="reset";
   #my $NM="reset";   =>   my $NM="reset";

Your default background/text colors will be used afterward.

====================================================================================================
Release Notes
====================================================================================================

Version 2.0 (2021)

Author(s)   : Christophe Magnan, Gregor Urban, Pierre Baldi
Description : Second major release of the software
Comments    : - Recent datasets extracted to train new models
              - Larger feature sets extracted by PROFILpro
              - PROFILpro now generates 1D and 2D features
              - hhsuite added to get evolutionary profiles
              - Sequence-based prediction is now available
              - Updated deep learning methods. Training done in PyTorch. Software uses
                custom, optimized Python code to implement the trained models.
              - Introduction of EVALpro 1.0 in the package
              - Use of sequence and profile similarity with the training dataset to extract a
                consensus prediction from the sequence-based, profile-based, and
                template-based predictions
              - HOMOLpro optimization (~2% gain observed)
              - Package rewritten to improve maintainability

Version 1.3 (2020)

Author(s)   : Christophe Magnan, Pierre Baldi
Description : Update of the package databases
Comments    : - PROFILpro protein database UNIREF50 updated
              - HOMOLpro template database pdb_full updated

Version 1.2 (2018)

Author(s)   : Christophe Magnan, Pierre Baldi
Description : Update of the package databases
Comments    : - PROFILpro protein database UNIREF50 updated
              - HOMOLpro template database pdb_full updated

Version 1.1 (2015)

Author(s)   : Christophe Magnan, Pierre Baldi
Description : Update + bug fixes for version 1.0
Comments    : - Databases for profiles and homology updated
              - Non-standard amino acids replaced by X
              - Sequences of length greater than 10,000 ignored

Version 1.0 (2013)

Author(s)   : Christophe Magnan, Pierre Baldi
Description : First release of the software
Comments    : - Wrapper tool for SSpro, SSpro8, ACCpro, ACCpro20, PROFILpro, HOMOLpro, 1D-BRNN.
                Earlier versions of the predictors (SSpro, SSpro8, ACCpro, ACCpro20) developed
                by Gianluca Pollastri, Jonathan Chen, Arlo Randall, and Pierre Baldi.

====================================================================================================