IEDB Next-Generation Tools PepX - version 0.1 beta
==================================================

Introduction
------------
The Peptide eXpression annotator (pepX) takes a peptide as input,
identifies from which proteins the peptide can be derived, and returns 
an estimate of the expression level of those source proteins from selected 
public databases.

This package contains a mixture of Python scripts and AMD64 Linux-specific 
binaries and Docker containers.  

This standalone tool handles most of the data processing for the PepX
tool at https://nextgen-tools.iedb.org/pipeline?tool=pepx.


Release Notes
-------------
v0.1 beta - Initial public beta release


Prerequisites
-------------

The following prerequisites must be met before installing the tools:

+ Linux 64-bit environment
  * http://www.ubuntu.com/
    - This distribution has been tested on a 64-bit Ubuntu Linux system.

+ Python 3.9 or higher
  * http://www.python.org/

+ wget
  * Installable via apt/yum/dpkg under most Linux distributions

+ 120GB free space
  * Needed to download and store the pepX database.


Optional:

+ Docker
  Docker Engine is required for running the PepX tool inside a container.
  * https://docs.docker.com/engine/install/
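
The required prerequisites above can be checked before starting the download.
The following is a minimal sketch, not part of the PepX package: the function
name `check_prerequisites` and its parameters are hypothetical, and the 120 GB
threshold is taken from the list above (interpreted here as decimal gigabytes).

```python
import shutil
import sys

REQUIRED_FREE_GB = 120  # from the prerequisites list; decimal GB is an assumption


def check_prerequisites(db_dir=".", required_gb=REQUIRED_FREE_GB):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper for illustration only; it is not shipped with PepX.
    """
    problems = []

    # Python 3.9 or higher is required.
    if sys.version_info < (3, 9):
        problems.append("Python 3.9 or higher is required")

    # wget must be available on PATH.
    if shutil.which("wget") is None:
        problems.append("wget was not found on PATH")

    # Enough free disk space must be available where the database will live.
    free_gb = shutil.disk_usage(db_dir).free / 1e9
    if free_gb < required_gb:
        problems.append(
            f"only {free_gb:.0f} GB free in {db_dir!r}; {required_gb} GB needed"
        )

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Pass the directory that will hold the database as `db_dir` so that the free
space is measured on the correct filesystem.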


Installation
------------

Below, we will use the example of installing to /opt/iedb_tools.

1. Extract the code and change directory: 
  $ mkdir /opt/iedb_tools
  $ tar -xvzf IEDB_NG_PEPX-VERSION.tar.gz -C /opt/iedb_tools
  $ cd /opt/iedb_tools/ng_pepx-VERSION

2. Optionally, create and activate a Python 3.9+ virtual environment using your favorite virtual environment manager.  
Here, we will assume the virtual environment is at ~/venvs/pepx:
  $ python3 -m venv ~/venvs/pepx
  $ source ~/venvs/pepx/bin/activate

3. Install Python requirements:
  $ pip install -r requirements.txt

4. Download the PepX Database.  This is a large file (~120GB), so this step may take more than an hour to complete:
  $ ./download_pepx_db.sh


Help
----
> python3 src/run_pepx.py -h


Available Datasets
------------------
> python3 src/run_pepx.py -q gene -s CCLE -p PATH_TO_DATABASE


Usage
-----
* Basic example
> python3 src/run_pepx.py -q gene -s CCLE -d 329 -p <PATH_TO_DB>  FVQMMTAK MRYVASYL EDIISFIK

* Submit input sequences as TSV
> python3 src/run_pepx.py -i examples/example.tsv -q gene -s CCLE -d 329 -p <PATH_TO_DB>

* Use a JSON file containing the parameters
> python3 src/run_pepx.py -j examples/example.json

* Output tabular result to terminal
> python3 src/run_pepx.py -q gene -s CCLE -d 329 -f tsv -p <PATH_TO_DB> FVQMMTAK MRYVASYL EDIISFIK

* Output result in JSON format to terminal
> python3 src/run_pepx.py -q gene -s CCLE -d 329 -f json -p <PATH_TO_DB> FVQMMTAK MRYVASYL EDIISFIK

* Save the result into a TSV file
> python3 src/run_pepx.py -q gene -s CCLE -d 329 -o pepx_result -f tsv -p <PATH_TO_DB> FVQMMTAK MRYVASYL EDIISFIK
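
When calling the tool from another Python program, the command lines above can
be assembled as an argument list and passed to `subprocess`. This is only a
sketch: the helper names `build_pepx_command` and `run_pepx` are hypothetical,
and only the flags shown in the examples above (-q, -s, -d, -o, -f, -p) are
used. It assumes the working directory is the extracted package directory.

```python
import subprocess


def build_pepx_command(peptides, db_path, qlevel="gene", source="CCLE",
                       dataset_id="329", output_format=None,
                       output_prefix=None):
    """Build an argument list mirroring the usage examples above.

    Hypothetical helper; defaults are copied from the README examples.
    """
    cmd = ["python3", "src/run_pepx.py",
           "-q", qlevel, "-s", source, "-d", str(dataset_id)]
    if output_prefix is not None:
        cmd += ["-o", output_prefix]   # save the result to a file
    if output_format is not None:
        cmd += ["-f", output_format]   # "tsv" or "json", as in the examples
    cmd += ["-p", db_path]
    cmd += list(peptides)              # peptides are positional arguments
    return cmd


def run_pepx(peptides, db_path, **options):
    """Run the command and return whatever it printed to the terminal."""
    cmd = build_pepx_command(peptides, db_path, **options)
    completed = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return completed.stdout


# Example (not executed here; the database path is a placeholder):
# text = run_pepx(["FVQMMTAK", "MRYVASYL"], "<PATH_TO_DB>", output_format="tsv")
```

Passing the arguments as a list avoids shell quoting problems with paths that
contain spaces.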


Input formats
-------------
Inputs may be specified in JSON format.  See the JSON files in the 'examples'
directory.  When multiple methods are selected, jobs are run serially and
the outputs are concatenated.  This can be avoided with the '--split' and
'--aggregate' workflow described below, which is currently supported only
for internal IEDB usage.

Here is an example JSON that illustrates the basic format:

{
    "input_sequence_text": "FVQMMTAK\r\nMRYVASYL\r\nEDIISFIK\r\nYEVSQLKD\r\nYISEHEHF\r\n",
    "qlevel": ["gene"],
    "dataset_id": ["1723"],
    "database": "/mnt/c/Users/USER/Downloads/pepX-prod-20231018.indexed.sqlite"
}

* input_sequence_text: Peptide sequences to annotate, one per line (separated by "\r\n" in the example above).
* qlevel: Specifies which TPM values (gene-level or transcript-level) should be used.
* dataset_id: ID(s) of the public expression dataset(s) to query.
* database: Path to the database.
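
A parameter file in this format can also be generated programmatically. The
sketch below mirrors the four keys of the example JSON above; the function
names and the database path in the usage comment are hypothetical, and the
"\r\n" terminator after each peptide simply copies the example.

```python
import json


def build_pepx_params(peptides, dataset_ids, database, qlevel="gene"):
    """Return a dict with the same keys as the example JSON above.

    Hypothetical helper for illustration; not part of the PepX package.
    """
    return {
        # One peptide per line, each terminated by "\r\n" as in the example.
        "input_sequence_text": "".join(p + "\r\n" for p in peptides),
        "qlevel": [qlevel],
        # Dataset IDs are written as strings, as in the example.
        "dataset_id": [str(d) for d in dataset_ids],
        "database": database,
    }


def write_pepx_params(path, peptides, dataset_ids, database, qlevel="gene"):
    """Write the parameters to a JSON file that can be passed with -j."""
    params = build_pepx_params(peptides, dataset_ids, database, qlevel)
    with open(path, "w") as handle:
        json.dump(params, handle, indent=4)
    return params


# Example (placeholder database path):
# write_pepx_params("my_job.json", ["FVQMMTAK", "MRYVASYL"], [1723],
#                   "/path/to/pepx.sqlite")
# then: python3 src/run_pepx.py -j my_job.json
```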


Job splitting and aggregation
-----------------------------
> python3 src/run_pepx.py [-j] <input_json_file> [-p] <database_path> [-o] <output_prefix> [-f] <output_format>

*NOTE that this is an experimental workflow and this package does not contain
the code to automate job submission and aggregation.  Currently, this is for
IEDB internal usage only.*

Using the '--split' option, a job will be decomposed into units that are efficient 
for processing.

> python3 src/run_pepx.py -j examples/example.json -p <PATH_TO_DB> --split --split-dir=examples/job/parameter_units

A 'job_description.json' file will be created that will describe the commands
needed to run each individual job, its dependencies, and the expected outputs. 
Each job can be executed as job dependencies are satisfied.  The job description
file will also contain an aggregation job, that will combine all of the
individual outputs into one JSON file.


Caveats
-------
All IEDB next-generation standalones have been developed with the primary
focus of supporting the website. Some user-facing features may be lacking,
but will be improved as these tools mature.


Contact
-------
Please contact us with any issues encountered or questions about the software
through any of the channels listed below.

IEDB Help Desk: https://help.iedb.org/
Email: help@iedb.org