====================================================================================================

                      SCRATCH Suite of One-Dimensional Predictors (SCRATCH-1D)

                             Method Description & Project Documentation

====================================================================================================

Author(s) :  Christophe Magnan (cmagnan@ics.uci.edu)
Copyright :  Institute for Genomics and Bioinformatics
             University of California, Irvine
Modified  :  2020/12/30

====================================================================================================
                                         Method Description
====================================================================================================

SCRATCH-1D is a suite of one-dimensional predictors included in the long-established and widely used
SCRATCH suite of predictors developed by the Institute for Genomics and Bioinformatics (IGB) of the
University of California, Irvine (UCI) : http://scratch.proteomics.ics.uci.edu

SCRATCH-1D currently includes the following predictors and tools:

 - SSpro      Release 5.2  Protein secondary structure prediction (3-class)
 - SSpro8     Release 5.2  Protein secondary structure prediction (8-class)
 - ACCpro     Release 5.2  Protein relative solvent accessibility prediction (at the 25% threshold)
 - ACCpro20   Release 5.2  Protein relative solvent accessibility prediction (thresholds 0% to 95%)

 - PROFILpro  Release 1.2  Protein evolutionary information / sequence profiles for 1D predictors
 - HOMOLpro   Release 1.2  Homology-based secondary structure & solvent accessibility prediction
 - 1D-BRNN    Release 3.3  One-dimensional bidirectional recurrent neural networks

Several 1D predictors in SCRATCH use similar methods, rely on identical reference protein databases,
and share the same third-party tools. SCRATCH-1D unifies all these predictors in a single software
package, allowing more efficient processing of the queries in terms of both computing resources and
time. For instance, the sequence profiles needed as input to the neural networks of each predictor
are generated following the same protocol and databases for all the predictors. These profiles are
no longer generated separately by each predictor but are now generated only once using new software
developed specifically for this task : PROFILpro. Also, all the predictors and corresponding tools
are now compatible with multi-core machines, making it possible to process large datasets rapidly
and without manual intervention (up to 3,000 protein sequences a day on a 16-core machine, for
instance).

All the predictors and tools included in SCRATCH-1D are located in the 'pkg' sub-folder and
are delivered with their own documentation and method description. Please refer directly to
that documentation for more information about the SCRATCH-1D predictors and tools.

====================================================================================================
                                        Project Documentation
====================================================================================================

This section provides a description of the project folder and how to use SCRATCH-1D.

=========================================  Project Folder  =========================================

A brief description of the project folders is given below.

- bin             Scripts to run SCRATCH-1D predictors
- doc             Documentation of the software
- env             Bash profile for running SCRATCH-1D
- lib             SCRATCH-1D library scripts to run the predictors
- pkg             SCRATCH-1D predictors and tools
- tmp             Temporary work folder for the software

=========================================  Software Usage  =========================================

SCRATCH-1D comes with only one script to run all the predictors : bin/run_SCRATCH-1D_predictors.sh

    Usage :  ./run_SCRATCH-1D_predictors.sh  input_fasta  output_prefix  [num_threads]

With:

- input_fasta     Input protein sequences in FASTA file format

- output_prefix   Prefix for the output file names; four output files will be created:

                  - output_prefix.ss    : predicted secondary structure (3-class, SSpro)
                  - output_prefix.ss8   : predicted secondary structure (8-class, SSpro8)
                  - output_prefix.acc   : predicted solvent accessibility (2-class, ACCpro)
                  - output_prefix.acc20 : predicted solvent accessibility (20-class, ACCpro20)

- num_threads     Number of cores to use to process the dataset (default=1)
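
The call described above can also be driven from a Python script. The block below is a minimal
sketch only: the helper names (build_command, expected_outputs, run_predictors) are hypothetical and
not part of SCRATCH-1D, and it assumes that the script signals failure through a non-zero exit
status, which this documentation does not state.

```python
import subprocess
from pathlib import Path

# Output suffixes as listed in this documentation.
OUTPUT_SUFFIXES = ("ss", "ss8", "acc", "acc20")


def build_command(script, input_fasta, output_prefix, num_threads=1):
    """Return the argument list: script, input_fasta, output_prefix, num_threads."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    return [str(script), str(input_fasta), str(output_prefix), str(num_threads)]


def expected_outputs(output_prefix):
    """Return the four output paths derived from the output prefix."""
    return [Path(f"{output_prefix}.{suffix}") for suffix in OUTPUT_SUFFIXES]


def run_predictors(script, input_fasta, output_prefix, num_threads=1):
    """Run the script and return the expected output files that exist afterwards."""
    cmd = build_command(script, input_fasta, output_prefix, num_threads)
    # Assumption: a failed run exits with a non-zero status (raises CalledProcessError).
    subprocess.run(cmd, check=True)
    return [path for path in expected_outputs(output_prefix) if path.exists()]
```

For example, run_predictors("bin/run_SCRATCH-1D_predictors.sh", "proteins.fasta", "results/run1", 4)
would launch the predictors on four cores and look for results/run1.ss, results/run1.ss8,
results/run1.acc and results/run1.acc20 (the file names here are placeholders).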


A large part of the processing time is taken by generating the sequence profiles and by running the
homology analysis on the Protein Data Bank. Since these two steps are shared by all the predictors,
running one predictor only or running all the predictors at once on a set of protein sequences does
not make a significant difference in terms of computation time. SCRATCH-1D will thus systematically
run the four predictors on the input protein sequences and all the predictions will be reported in
the output files listed above. If only one predictor is needed, please ignore the other predictions.

An additional script 'get_abinitio_predictions.sh' is provided in the 'bin' folder of SCRATCH-1D to
obtain the ab-initio predictions only. This script does not perform the homology analysis, so the
predictions are not improved by that second prediction stage. It is provided for evaluation
purposes only. Usage is identical to 'run_SCRATCH-1D_predictors.sh'.

=======================================  Input Files Format  =======================================

Input files must be in the standard FASTA file format. There is no limit on the number of input
sequences to process other than the amount of RAM available on the machine running the program.
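
Before a large run, it can help to inspect the input file. The sketch below reads FASTA records and
flags the two situations mentioned in the Version 1.1 release notes: non-standard amino acids
(replaced by X) and sequences longer than 10,000 residues (ignored). It is illustrative only. The
function names are hypothetical, and treating the 20 canonical one-letter codes as the "standard"
alphabet is an assumption, since the exact alphabet used by SCRATCH-1D is not given here.

```python
# Assumption: "standard" means the 20 canonical amino acids.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
# Length limit quoted in the Version 1.1 release notes.
MAX_LENGTH = 10_000


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def precheck(lines):
    """Return (header, length, n_nonstandard, too_long) for each record."""
    report = []
    for header, seq in read_fasta(lines):
        nonstandard = sum(1 for aa in seq.upper() if aa not in STANDARD_RESIDUES)
        report.append((header, len(seq), nonstandard, len(seq) > MAX_LENGTH))
    return report
```

A file can be checked by passing an open handle, for instance precheck(open("proteins.fasta")),
where the file name is a placeholder.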

====================================  Output Files Description  ====================================

Output files are in the same file format as the input files, except that the protein amino-acid
sequence is replaced by the predicted secondary structure or relative solvent accessibility. Headers
are reported as provided in input and predictions are given in the same order as the input sequences.
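
Because headers and record order are preserved, an output file can be matched back to the input
file record by record. The following is a minimal sketch of that pairing, with hypothetical function
names. It assumes that every input sequence has a corresponding prediction; per the release notes,
sequences longer than 10,000 residues are ignored, so a count mismatch is reported as an error
rather than silently skipped. No assumption is made about how each prediction string is encoded.

```python
def read_records(lines):
    """Yield (header, body) pairs from FASTA-style lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def pair_predictions(fasta_lines, prediction_lines):
    """Return (header, sequence, prediction) triples, matched by position."""
    sequences = list(read_records(fasta_lines))
    predictions = list(read_records(prediction_lines))
    if len(sequences) != len(predictions):
        raise ValueError("input and output contain a different number of records")
    paired = []
    for (header, seq), (pred_header, pred) in zip(sequences, predictions):
        if header != pred_header:
            raise ValueError(f"header mismatch: {header!r} vs {pred_header!r}")
        paired.append((header, seq, pred))
    return paired
```

The same function can be applied to any of the four output files, for instance by passing the open
input FASTA file and the open '.ss' file as the two arguments.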

====================================================================================================
                                           Release Notes
====================================================================================================

Version 1.3 (2020)

Author      :  Christophe Magnan
Description :  update of the package databases
Comments    :  PROFILpro protein database UNIREF50 updated
               HOMOLpro template database pdb_full updated

Version 1.2 (2018)

Author      :  Christophe Magnan
Description :  update of the package databases
Comments    :  PROFILpro protein database UNIREF50 updated
               HOMOLpro template database pdb_full updated

Version 1.1 (2015)

Author      :  Christophe Magnan
Description :  Update + Bug fixes for version 1.0
Comments    :  Databases for profiles and homology updated
               Non-standard amino acids replaced by X
               Sequences of length greater than 10,000 ignored

Version 1.0 (2013)

Author(s)   :  Christophe Magnan
Description :  First release of the software
Comments    :  Wrapper tool for SSpro, SSpro8, ACCpro, ACCpro20, PROFILpro, HOMOLpro, and 1D-BRNN.

====================================================================================================