######## README file for NNalign version 1.4 ########

Andreatta M., Schafer-Nielsen C., Lund O., Buus S. and Nielsen M. (2011) PLoS ONE 6(11): e26781. doi:10.1371/journal.pone.0026781

http://www.cbs.dtu.dk/services/NNAlign

## SUPPORTED PLATFORMS

The code has been compiled and tested on the following architectures:
- Linux x86_64
- Linux ia64
- Darwin x86_64
- Darwin i386

## BASIC INSTALLATION

Unpack the compressed tar file:
      tar -xvf nnalign.tar.gz

NNAlign is a self-contained program: after unpacking, you should be able to run it without any further action. Test this by running the Perl wrapper:
     ./nnalign-1.4.SA.pl -h

which will display all the possible options for the program.

## GRAPHICS - WebLogo (optional)

The output of NNalign is greatly enhanced by allowing the generation of sequence logos. You will need to install WebLogo (Crooks et al. 2004) for this. The code can be found at http://weblogo.berkeley.edu including instructions for the very straightforward installation. 
After you have successfully installed WebLogo, you can turn on logo generation in NNAlign by specifying the path to WebLogo with the -W option (e.g. -W /usr/bin/weblogo). Alternatively, you may hardcode its location at the beginning of the nnalign-1.4.SA.pl script, by setting the $weblogo variable.

## GRAPHICS - R (optional)

NNAlign can also produce graphs (data distribution, target vs. prediction scatterplots) using the statistics software R. If you have R installed, specify the path using the -R option (e.g. -R /usr/bin/R). If you are unsure about the location of R, you may try:
	which R

For a given installation, the R path may also be hardcoded in the nnalign-1.4.SA.pl script, by setting the $R variable in the SET CUSTOM PATHS HERE section.

The NNAlign scripts use the non-standard R package 'hexbin', which can be installed simply by typing, within the R interactive window:
	install.packages("hexbin")

########### INSTRUCTIONS ##############################
 
Run the program with the -h option to display the possible command line arguments:
   ./nnalign-1.4.SA.pl -h

The training data must be in two columns, with the peptides in the first column and the corresponding target values in the second column. You may inspect the sample data provided in the 'nnalign/test/' folder for examples.
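
The two-column layout can be checked before a run with a few lines of Python. This is only a sketch and not part of the NNAlign distribution: the helper name read_training_data is hypothetical, and it assumes that the columns are separated by whitespace (TAB or spaces) and that peptides consist of letters only.

```python
def read_training_data(lines):
    """Parse 'peptide<TAB>value' lines into (peptide, value) pairs.

    Hypothetical helper, not shipped with NNAlign. Assumes whitespace
    separated columns; blank lines are skipped.
    """
    records = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(
                f"line {number}: expected 2 columns, found {len(fields)}")
        peptide, raw_value = fields
        if not peptide.isalpha():
            raise ValueError(
                f"line {number}: peptide {peptide!r} has non-letter characters")
        try:
            value = float(raw_value)
        except ValueError:
            raise ValueError(
                f"line {number}: target value {raw_value!r} is not a number"
            ) from None
        records.append((peptide, value))
    return records
```

Passing an open file handle, for instance open("test/HLA-DRB1.0101.train"), as the lines argument would check the sample training set line by line.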

The program creates a dedicated folder for each run, containing all the result files, logos etc. A log showing the progress of the computation, the performance values, and the location of the results is printed to STDOUT.

An example:
	./nnalign-1.4.SA.pl -f test/HLA-DRB1.0101.train -m 9 -j 3 -F -L 

Another example with more options, an evaluation set, and graphics:
	./nnalign-1.4.SA.pl -f test/HLA-DRB1.0101.train -x test/HLA-DRB1.0101.test -P HLA_example -m 9 -j 3 -F -L -H 1 -n 3,8 -W /usr/bin/weblogo -R /usr/bin/R
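
For scripted or repeated runs, the command line can be assembled programmatically. The Python sketch below is a hypothetical helper (build_nnalign_command is not part of NNAlign) that only uses flags documented in this README; the default script path assumes that the wrapper sits in the current directory.

```python
import shlex


def build_nnalign_command(train_file, eval_file=None, prefix=None,
                          motif_length=9, pfr_size=None,
                          encode_pfr_length=False, encode_peptide_length=False,
                          partition_method=None, hidden_neurons=None,
                          weblogo_path=None, r_path=None,
                          script="./nnalign-1.4.SA.pl"):
    """Return the argument list for one NNAlign run (hypothetical helper)."""
    command = [script, "-f", train_file]
    if eval_file is not None:
        command += ["-x", eval_file]
    if prefix is not None:
        command += ["-P", prefix]
    command += ["-m", str(motif_length)]
    if pfr_size is not None:
        command += ["-j", str(pfr_size)]
    if encode_pfr_length:
        command.append("-F")
    if encode_peptide_length:
        command.append("-L")
    if partition_method is not None:
        command += ["-H", str(partition_method)]
    if hidden_neurons:
        # Several architectures are passed as a comma-separated list.
        command += ["-n", ",".join(str(n) for n in hidden_neurons)]
    if weblogo_path is not None:
        command += ["-W", weblogo_path]
    if r_path is not None:
        command += ["-R", r_path]
    return command
```

The returned list can be printed with shlex.join(command) for inspection, or handed to subprocess.run(command, check=True) to start the run.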

The program has several options, described in the following section.

#######################################################################
########### COMMAND LINE OPTIONS ######################################
#######################################################################

Command line options come in two kinds: options that require a value, and switches (on/off options). Switches require no argument.

### BASIC OPTIONS: ###################################################

-f [file]		Upload training set (peptide TAB signal). The training data must be in two columns, with the peptides in the first column and the corresponding target values in the second column. You may inspect the sample data provided in the 'nnalign/test/' folder for examples.

-M [file]		Upload a prediction model. If you have previously created a neural network model with NNAlign, you may upload it and apply it for prediction on new data.

-P [string]		Prefix for result files. The prefix will be prepended to the file names produced by the program.

-m [number or range]	Motif length. It can be given as a single number (e.g. -m 9), as a range of values (e.g. -m 7-11), or as a range with steps (e.g. -m 7-11/2 will run predictions for a window of 7, 9 and 11). Default is 9.
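
The three forms accepted by -m can be read as follows. This Python sketch is illustrative only (expand_motif_lengths is a hypothetical name, and NNAlign's own parser may treat edge cases differently); it assumes an inclusive range, which agrees with the 7-11/2 example above.

```python
def expand_motif_lengths(spec):
    """Expand a -m style specification into a list of motif lengths.

    Accepts "9", "7-11" or "7-11/2". The range is assumed to be
    inclusive at both ends, as in the README example for 7-11/2.
    """
    range_part, _, step_part = spec.partition("/")
    step = int(step_part) if step_part else 1
    if step < 1:
        raise ValueError("step must be a positive integer")
    start_text, dash, end_text = range_part.partition("-")
    start = int(start_text)
    end = int(end_text) if dash else start
    if end < start:
        raise ValueError("range end must not be smaller than range start")
    return list(range(start, end + 1, step))
```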


### DATA PROCESSING OPTIONS: ############################################

-l [0,1,2]		Data rescaling. For optimal usage of neural networks, data should be rescaled between 0 and 1. Specifying -l 0 applies a linear rescaling, -l 1 applies a log-transform, -l 2 does no rescaling of the data.

-Y [0,1]		Directionality of the data. Set to 0 if positives have high values or 1 if positives have low values. Default is 0.

-A				Preserve repeated flanks in raw data. If all sequences have repeated flanking amino acids, by default these flanks are removed from the data. Turn on to reverse this behaviour (switch).

-a [int]		Number of folds for cross-validation (default 5). Partitions the training data into n subsets to estimate predictive performance. You may specify the partition method with the -H option.

-H [0,1,2]		Method to create subsets: random [0], homology [1], common motif [2]. Random splits the data into 'n' partitions, where 'n' is specified with the -a option. Homology clustering partitions the data based on percent similarity between sequences (specify the identity threshold using the -t option). Common motif clusters the data based on contiguous stretches of identical amino acids (specify the maximum overlap with the -I option).
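
Random partitioning (-H 0) can be pictured with the short Python sketch below. The exact procedure used by NNAlign is not described in this README, so the shuffle followed by round-robin assignment, the function name random_partitions and the seed argument are all assumptions made for illustration.

```python
import random


def random_partitions(records, n_folds=5, seed=0):
    """Split records into n_folds random subsets of near-equal size.

    Assumed scheme (shuffle, then deal out round-robin); not taken
    from the NNAlign source code.
    """
    if n_folds < 2:
        raise ValueError("at least two folds are needed for cross-validation")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Fold i receives items i, i + n_folds, i + 2 * n_folds, ...
    return [shuffled[i::n_folds] for i in range(n_folds)]
```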

-t [float]		Threshold for homology clustering (default 0.8). Sequences with identity higher than -t are clustered in the same partition for cross-validation.

-I [int]		Maximum overlap length for common-motif clustering (default 5). Sequences with more than -I identical contiguous amino acids are clustered in the same partition for cross-validation.
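
The common-motif criterion can be sketched in Python as a longest-common-stretch test. The function names are hypothetical, and grouping transitively linked sequences into one cluster (single linkage) is an assumption; the README only states that sequences sharing more than -I identical contiguous amino acids end up in the same partition.

```python
def longest_common_stretch(seq_a, seq_b):
    """Length of the longest contiguous stretch shared by two sequences."""
    best = 0
    previous = [0] * (len(seq_b) + 1)
    for char_a in seq_a:
        current = [0] * (len(seq_b) + 1)
        for j, char_b in enumerate(seq_b, start=1):
            if char_a == char_b:
                # Extend the stretch that ended at the previous positions.
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def share_common_motif(seq_a, seq_b, max_overlap=5):
    """True if the sequences share MORE than max_overlap contiguous residues."""
    return longest_common_stretch(seq_a, seq_b) > max_overlap


def cluster_by_common_motif(sequences, max_overlap=5):
    """Group sequences linked by a shared stretch (assumed single linkage)."""
    parent = list(range(len(sequences)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(sequences)):
        for j in range(i + 1, len(sequences)):
            if share_common_motif(sequences[i], sequences[j], max_overlap):
                parent[find(j)] = find(i)

    clusters = {}
    for index, sequence in enumerate(sequences):
        clusters.setdefault(find(index), []).append(sequence)
    return list(clusters.values())
```

Each resulting cluster would then be assigned as a whole to one cross-validation subset, so that near-identical peptides never appear on both sides of a split.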

-E				Remove homologous sequences from dataset (switch). With -H 1 or -H 2, only one representative sequence per cluster is preserved.

-e [0,1]		Cross-validation method (default 0). With -e 0, simple cross-validation is used: n-1 subsets are used for training and 1 subset for evaluation, so that each of the n subsets serves as the evaluation set once. With -e 1, exhaustive (nested) cross-validation is used: networks are trained on n-2 subsets, training is stopped on 1 subset, and performance is evaluated on the remaining subset. This is more conservative but also more time-demanding.
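
The two cross-validation schemes differ in how the n subsets are assigned to roles. The Python sketch below enumerates the assignments by subset index. It is illustrative only: the function names are hypothetical, and the assumption that the exhaustive scheme visits every ordered pair of stop and evaluation subsets is not confirmed by this README.

```python
def simple_cv_splits(n_folds=5):
    """(train, evaluation) index pairs: n-1 subsets train, 1 evaluates."""
    splits = []
    for evaluation in range(n_folds):
        train = [fold for fold in range(n_folds) if fold != evaluation]
        splits.append((train, evaluation))
    return splits


def nested_cv_splits(n_folds=5):
    """(train, stop, evaluation) triples: n-2 train, 1 stops, 1 evaluates.

    Assumes every ordered (stop, evaluation) pair is used.
    """
    splits = []
    for evaluation in range(n_folds):
        for stop in range(n_folds):
            if stop == evaluation:
                continue
            train = [fold for fold in range(n_folds)
                     if fold not in (stop, evaluation)]
            splits.append((train, stop, evaluation))
    return splits
```

Under these assumptions, 5 folds give 5 training runs per network in the simple scheme and 20 in the nested scheme, which is one way to see why -e 1 is more time-demanding.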


### NEURAL NETWORKS TRAINING AND ARCHITECTURE: ##########################################

-C [int]		Number of training cycles (default 500). The number of times each datapoint is presented to the neural networks in the ensemble.

-n [int or list]	Number of hidden neurons (default 3). The size of the NN hidden layer. It can also be specified as a list of values separated by commas (e.g. -n 3,8,15), in which case several networks with alternative architectures are created.

-y				Stop training on best test set performance (switch). It uses early stopping to prevent overfitting on the training data.

-B [0,1,2]		Amino acid encoding (default 1). Use Sparse encoding [0] or Blosum encoding [1] or both Sparse and Blosum [2].

-j [int]		Size of the PFR in NN encoding (default 0). It is possible to encode the amino acid composition of the regions flanking the binding core (PFR). This has proven useful for MHC class II binding where the PFR composition appears to influence peptide binding. The size of the PFR regions is specified using option -j.
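
To make the terms concrete, the Python sketch below cuts a peptide into left PFR, binding core and right PFR for a given core position. It only shows which residues fall into each region; how NNAlign turns the PFR composition into network inputs is not covered here. The function name and the 0-based core_start argument are hypothetical.

```python
def split_core_and_flanks(peptide, core_start, motif_length=9, pfr_size=3):
    """Return (left_pfr, core, right_pfr) for a core at core_start (0-based).

    Flanks are truncated when the core lies close to a peptide end.
    Illustrative helper, not part of NNAlign.
    """
    core_end = core_start + motif_length
    if core_start < 0 or core_end > len(peptide):
        raise ValueError("binding core does not fit inside the peptide")
    left_pfr = peptide[max(0, core_start - pfr_size):core_start]
    core = peptide[core_start:core_end]
    right_pfr = peptide[core_end:core_end + pfr_size]
    return left_pfr, core, right_pfr
```

For a 15-residue peptide with -m 9 and -j 3, a core starting at the fourth residue has three residues in each PFR, while a core starting at the second residue has a left PFR of a single residue.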

-L				Encode peptide length (switch). Use sequence length as input to the ANNs.

-F				Encode PFR length (switch). Use PFR length as input to the ANNs.

-s [int]		Number of seeds for each network architecture (default 10). Start ANNs from -s different initial weight configurations and build a network ensemble.

-b [int]		Number of networks in the final ensemble (default 20). Select only the -b networks with highest performance for the final ensemble.
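
The selection step behind -b amounts to ranking the trained networks and keeping the best ones. A minimal Python sketch, assuming each network is represented by a (name, performance) pair where a higher performance value is better; the function name and the data layout are hypothetical.

```python
def select_final_ensemble(networks, ensemble_size=20):
    """Keep the ensemble_size networks with the highest performance.

    networks is an iterable of (name, performance) pairs; both the
    names and the performance measure are placeholders.
    """
    ranked = sorted(networks, key=lambda item: item[1], reverse=True)
    return ranked[:ensemble_size]
```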

-p				Preference for hydrophobic amino acids at P1 (switch). Toggle this option if you wish to favor hydrophobic amino acids in the first position of the binding core.

### OFFSET CORRECTION SETTINGS ######################################

-O				DO NOT re-align networks with offset correction (switch). By default, offset correction is used to optimize the combined information content of the ANN ensemble. If you wish to alter this behaviour, toggle the option -O.

-u [float]		Start temperature for PSSM alignment (default 0.10). PSSMs derived from each network are re-aligned using Gibbs sampling in offset correction; the start temperature sets the initial condition of the system in the Gibbs sampling descent.

-Z [int]		Number of iterations per temperature step (default 2000). This option sets the depth of the Gibbs sampling used to optimize the combined information content of the ensemble.


### SEQUENCE LOGOS ##################################################

-U [int]		Number of bits for sequence logos (default 4). Set the maximum value on the y-axis of sequence logos.

-Q				Make logos of all networks in the final ensemble (switch). Make individual logos of the motifs found by all networks in the ensemble.


### EVALUATION DATA and FILTERING OF RESULTS ##########################################

-x [file]		Upload evaluation file. It can be used in conjunction with either the -f or the -M option (training data or pre-trained model). The file can be a list of peptides (optionally annotated with numerical values) or a protein sequence in FASTA format.

-T [float]		Threshold on evaluation file predictions (default 0). If you submit a large evaluation file, for example a complete proteome, you may want to filter the results and only show the most relevant predictions. Only peptides with predicted values higher than -T will be displayed.

-S				Sort results by prediction value (switch).
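
The combined effect of -T and -S on a list of predictions can be sketched in Python as follows. The function name and the (peptide, score) layout are hypothetical, and sorting from highest to lowest score is an assumption, since the sort order is not stated above.

```python
def filter_predictions(predictions, threshold=0.0, sort_by_score=False):
    """Keep predictions scoring strictly above threshold, optionally sorted.

    predictions is a list of (peptide, score) pairs. Descending sort
    order is assumed.
    """
    kept = [(peptide, score) for peptide, score in predictions
            if score > threshold]
    if sort_by_score:
        kept.sort(key=lambda item: item[1], reverse=True)
    return kept
```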


### PATHS and SYSTEM SETTINGS ###########################################

-c [int]		Number of parallel processes for ANN training (default 10). The program can split the workload over several processors, with high gain in execution speed. Adjust according to your system. 

-R [path]		Path to R. Generation of graphics depends on the R software (www.r-project.org). If you have R installed, specify the path here or in the preamble of the nnalign-1.4.SA.pl code.

-W [path]		Path to WebLogo. See the GRAPHICS section above for instructions.

-d [path]		Directory with the executables (default ./bin). If you change the location of the NNAlign executables you may have to update the path here.

-r [path]		Results directory. This is where all of your result files will be saved.


####################################################################################

For more details, visit also: http://www.cbs.dtu.dk/services/NNAlign

Updated in January 2014, M. Andreatta


