##########################################################################################
#                                                                                        #
#  Software    : EVALpro                                                                 #
#  Release     : 1.0 (Aug 2019)                                                          #
#                                                                                        #
#  Author(s)   : Christophe Magnan    cmagnan@ics.uci.edu                                #
#                Gregor Urban         gurban@uci.edu                                     #
#                Mirko Torrisi        torrisimirko@yahoo.com                             #
#                                                                                        #
#  Copyright   : Institute for Genomics and Bioinformatics                               #
#                University of California, Irvine                                        #
#                                                                                        #
##########################################################################################

EVALpro is a software package to evaluate the accuracy of a profile-based predictor as a
function of the maximum cosine similarity between training and test profiles.


Operating systems compatibility
===============================

EVALpro should be compatible with any Linux or Mac OS operating system. The installation
procedure in the next section, however, assumes that the following are available on the
system used to install and run the software:

  - Python 2.7 or higher
  - PIP package manager
  - GCC compiler

Please check that these three dependencies are available on your system before
installing EVALpro.


Package installation
====================

EVALpro can be installed from the downloaded package using the following instructions:

  tar -xzf EVALpro_1.0.tar.gz
  cd EVALpro_1.0/lib
  gcc -Wall multicore_analysis.c -o multicore_analysis -lm
  pip install -r python_libraries.txt
  cd ..
  chmod -R 755 *

For Mac OS, add the command line below to the instructions above:

  xattr -d com.apple.quarantine lib/multicore_analysis

If a problem occurs during the installation, please check that the dependencies listed in
the previous section are available on your operating system. To report an issue with the
installation of EVALpro, or for any question related to this software, please contact
Pierre Baldi (pfbaldi@ics.uci.edu).

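
The dependency check recommended above can be scripted. The sketch below is not part of
the EVALpro package: the function name check_dependencies is hypothetical, it assumes a
Python 3 interpreter (shutil.which is not available in Python 2.7), and it accepts either
'pip' or 'pip3' on the PATH as the PIP package manager, which is an assumption.

```python
import shutil
import sys


def check_dependencies():
    """Return a dict mapping each EVALpro prerequisite to True (found) or False."""
    return {
        # The README asks for Python 2.7 or higher.
        "python>=2.7": sys.version_info >= (2, 7),
        # Assumption: either executable name counts as the PIP package manager.
        "pip": shutil.which("pip") is not None or shutil.which("pip3") is not None,
        # GCC is needed to compile lib/multicore_analysis.c.
        "gcc": shutil.which("gcc") is not None,
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print("%-12s %s" % (name, "found" if found else "MISSING"))
```

Any entry reported as MISSING should be installed before running the instructions above.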

Package validation
==================

To validate the installation of EVALpro on your operating system, you can run the
analysis on the datasets provided in the 'test' folder:

  python EVALpro.py --sets test/PSSM_train.dat test/PSSM_test.dat --out test --cpu 4

The number of threads running the analysis (--cpu 4) should be adjusted to match the
number of cores/threads available on your system (e.g. --cpu 16). The output files
'test.csv' and 'test.png' should be identical to the provided files
'test/PSSM_output.csv' and 'test/PSSM_output.png'.


Using EVALpro (python EVALpro.py --help)
=============

EVALpro.py [--sets training_set test_set] [--out output_prefix] [--step window_size]
           [--cpu num_thread] [--keep] [--warn]

  --sets training_set test_set   training and test datasets, in that order    REQUIRED
  --out output_prefix            prefix for the name of the output files      REQUIRED
  --step window_size             fixed length of the profile windows          DEFAULT=30
  --cpu num_thread               number of threads to use for the analysis    DEFAULT=1
  --keep                         keep full results in separate output file    OPTIONAL
  --warn                         display all warning messages on screen       OPTIONAL

Required Arguments
------------------

Only the paths of the two input files and the prefix used to name the output files are
mandatory arguments of the program. For instance:

  python EVALpro.py --sets test/PSSM_train.dat test/PSSM_test.dat --out PSSM_test
  python EVALpro.py --sets test/FREQ_train.dat test/FREQ_test.dat --out FREQ_test

Input/output file formats are described in the next sections.

Optional Arguments
------------------

--cpu num_thread

    Using multiple threads to run the analysis is highly recommended to significantly
    reduce the computation time.
    The amount of RAM needed to run the analysis increases with the number of threads as
    follows:

      PSSM Profiles      - integer type :  5 * num_thread * SIZEOF(test_set_file)
      Frequency Profiles - float type   :  num_thread * SIZEOF(test_set_file)

--step window_size

    While not recommended, the default length of the profile windows (30) used during the
    analysis can be changed using the --step argument.

--keep

    The exact results (max cosine similarity value, accuracy) for each profile window in
    the test dataset are not provided by default in the output files. To keep these
    results in a separate output file, use the --keep flag.

--warn

    The warning messages frequently issued by the Python library 'sklearn' are not
    displayed on screen by default; they can be restored using the --warn flag.


Input File Formats
==================

EVALpro requires a training dataset and a test dataset to run the analysis. Examples of
such files are provided in the 'test' folder of the package for both PSSM and frequency
profiles. Detailed specifications are provided below.

Training Dataset
----------------

The profile of each training protein of length N must be reported on N+1 lines using the
following format:

  LINE 1     >protein_id
  LINE 2     space-separated profile values for position 1
  LINE 3     space-separated profile values for position 2
  ...
  LINE N+1   space-separated profile values for position N

For instance:

  >pdb1a12A
  -8 3 -8 -8 -11 -6 -6 -9 -8 -10 -10 9 -9 -11 -9 -8 -8 -11 -9 -10
  -1 0 -6 -5 -6 5 -3 -9 -6 5 -6 5 -5 -4 -7 -5 -5 -9 -6 2
  -3 -13 -14 -14 1 -7 -13 -14 -14 3 -6 -13 1 -6 -7 -7 -6 -14 -12 8
  -2 -8 -8 -8 -6 -4 -9 8 -5 -13 -13 -3 -7 -8 -9 -2 -9 -13 -8 -9

Test Dataset
------------

The test dataset must also include the actual and predicted classes for each position in
the protein using the following format:

  LINE 1     >protein_id
  LINE 2     actual_class predicted_class space-separated profile values for position 1
  LINE 3     actual_class predicted_class space-separated profile values for position 2
  ...
  LINE N+1   actual_class predicted_class space-separated profile values for position N

For instance:

  >pdb2ndiA
  C C 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3
  E E -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1
  H C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
  C H 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2

Additional Information
----------------------

protein_id

    The protein identifier is not required to be unique for each protein. Note however
    that it is used in the optional output file to indicate, for each profile window, the
    protein it comes from, so we recommend using short, unique identifiers for all
    proteins in the test dataset.

actual_class predicted_class

    Classes can be any string of characters NOT containing a space. EVALpro considers a
    prediction correct if the two provided characters or strings are identical, and
    incorrect otherwise.

profile

    EVALpro was primarily designed to evaluate the accuracy of profile-based predictors
    as a function of the max cosine similarity between the training and test profiles.
    As such, 20 numerical features per position in a protein should be provided in each
    input file, i.e. the PSSM or frequency profile values for the corresponding position
    on the protein. Note however that this is not a requirement of EVALpro: any set of
    features can be added to, or can replace, the intended 20 profile values as long as

      (1) the number of features is identical for each position of each protein in both
          datasets;
      (2) the features are numerical (int, float, double); and
      (3) a space is used to separate the values in both input files.

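
The test dataset format and the three constraints above can be checked before launching
an analysis. The sketch below is not part of EVALpro: the function names
parse_test_dataset and window_accuracies are hypothetical, and the per-window accuracy
(fraction of positions whose actual and predicted classes match, with windows advancing
one position at a time) is an assumption about how window results are derived, not a
description of the package's internal code.

```python
def parse_test_dataset(lines):
    """Parse test dataset lines into a list of (protein_id, positions).

    Each position is a tuple (actual_class, predicted_class, features).
    A list is returned rather than a dict because identifiers need not be unique.
    """
    proteins = []
    n_features = None
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            proteins.append((line[1:], []))
            continue
        if not proteins:
            raise ValueError("line %d: values found before a '>' header" % line_no)
        fields = line.split()  # constraint (3): space-separated values
        if len(fields) < 3:
            raise ValueError("line %d: expected two classes and features" % line_no)
        actual, predicted = fields[0], fields[1]
        # Constraint (2): float() raises ValueError on a non-numerical feature.
        features = [float(value) for value in fields[2:]]
        # Constraint (1): same number of features at every position.
        if n_features is None:
            n_features = len(features)
        elif len(features) != n_features:
            raise ValueError("line %d: expected %d features, found %d"
                             % (line_no, n_features, len(features)))
        proteins[-1][1].append((actual, predicted, features))
    return proteins


def window_accuracies(positions, step=30):
    """Accuracy of each window of `step` consecutive positions (assumed definition)."""
    correct = [actual == predicted for actual, predicted, _ in positions]
    return [sum(correct[i:i + step]) / float(step)
            for i in range(len(correct) - step + 1)]


# The example protein shown in the 'Test Dataset' section.
example = """>pdb2ndiA
C C 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3
E E -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1
H C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
C H 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2
""".splitlines()

proteins = parse_test_dataset(example)
protein_id, positions = proteins[0]
# A window of 2 is used only because the example has 4 positions.
accuracies = window_accuracies(positions, step=2)
print(protein_id, len(positions), accuracies)
```

With the four example positions, the first two predictions are correct and the last two
are not, so windows of length 2 yield accuracies of 1.0, 0.5 and 0.0.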

Output File Formats
===================

EVALpro generates 2 different output files by default and 1 optional output file:

output_prefix.csv

    Tab-separated text file providing, for each cosine similarity value between the
    lowest one observed in the provided datasets and 1.00, using 0.01 increments, the
    accuracy of the predictor as predicted by a Gaussian Process Regression (GPR) model
    trained using the results calculated for each profile window in the test dataset.
    An example is provided below.

      Cosine Similarity    Accuracy
      0.25                 0.2449
      0.26                 0.3877
      0.27                 0.5062
      0.28                 0.5978
      ...                  ...
      0.99                 0.8047
      1.00                 0.8038

output_prefix.png

    Figure automatically generated to visualize the results provided by the first output
    file "output_prefix.csv". The figure also shows the distribution of the profile
    windows by cosine similarity level, rescaled for visibility purposes. Two examples of
    such figures are provided in the 'test' folder of the package.

output_prefix.raw (optional)

    This optional output file can be obtained using the --keep flag when running EVALpro.
    It provides for each profile window in the test set:

      1) the identifier of the corresponding protein;
      2) the 1-based start position of the profile window on the protein sequence;
      3) the max cosine similarity value observed between the profile window and any
         profile window in the training dataset; and
      4) the actual accuracy of the predictor on that protein fragment.

    Values are tab-separated. An example is provided below.

      ProteinID    WindowStartPos    CosineSimilarity    Accuracy
      2ndiA        1                 0.432508984569      0.700000
      2ndiA        2                 0.440549612307      0.700000
      2ndiA        3                 0.445777008287      0.700000
      2ndiA        4                 0.405544893105      0.700000
      2ndiA        5                 0.410466247096      0.700000
      2ndiA        6                 0.415950622868      0.700000
      2ndiA        7                 0.402127856604      0.666667
      2ndiA        8                 0.407117389552      0.633333
      2ndiA        9                 0.403424689676      0.600000
      2ndiA        10                0.402063128061      0.600000


Enjoy!