# Allele Validator

Allele Validator provides utilities for retrieving allele data from its data sources: validating alleles, converting between labels, synonyms, and MRO IDs, and looking up related information. The library relies on several data files that must be preprocessed first. Follow the steps below to prepare them.
> **NOTE** : Allele Validator also prepares the data source file used by **[Allele Autocomplete](https://gitlab.lji.org/iedb/tools/tools-redesign/api-dependencies/allele-autocomplete)**.

<br>

## Required Initial Files
- **Tools_MRO_mapping.xlsx**: Static file that shouldn't be modified (default).
- **mro_molecules.tsv**: File that should be refreshed periodically from **[MRO Github](https://github.com/IEDB/MRO)**.
- **method-table.xlsx**: Table containing method names, versions, and the default version. This file needs to be updated when a new method or version is added.
- **allele-lengths.xlsx**: Table containing methods and their available allele lengths.
- **Additional_netMHCpanRV.xlsx**: File containing 94 netMHCpan 4.1 alleles that were newly mapped and do not appear in <i>Tools_MRO_mapping.xlsx</i>.
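
Before running any of the scripts, it can help to confirm that the files listed above are in place. The following is a minimal sketch, not part of the library: the function name `missing_required_files` is hypothetical, and it assumes all five files sit in a single directory.

```python
from pathlib import Path

# File names taken from the "Required Initial Files" list above.
REQUIRED_FILES = [
    "Tools_MRO_mapping.xlsx",
    "mro_molecules.tsv",
    "method-table.xlsx",
    "allele-lengths.xlsx",
    "Additional_netMHCpanRV.xlsx",
]


def missing_required_files(data_dir="."):
    """Return the required file names that are not present in data_dir.

    Hypothetical helper: it assumes every required file lives in one directory.
    """
    base = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]
```

Calling `missing_required_files()` from the directory that holds the data files returns an empty list when everything is present.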

<br>

> The <i>**unmapped_alleles.txt**</i> and <i>**unmapped_98_alleles.txt**</i> files can be disregarded. They were mainly used for analysis.<br>More information can be found in [issue #346](https://gitlab.lji.org/iedb/tools/tools-redesign/ar-redesign-prototype/-/issues/346#note_26488).

<br>

## Data Prep

Run the following command:
```
python map_missing_alleles_from_allelenames.py
```
The script ultimately outputs a new file called <i>**Tools_MRO_Mapping_VFYD.xlsx**</i>.<br>

The script performs the following 5 steps:
1. Walks through <i>**netMHCpan-4.1 allelenames**</i> and adds any unmapped alleles to the Tools_MRO_mapping dataframe (this adds the following 14 alleles).
    ```
    BoLA-NC1:00101
    BoLA-NC1:00201
    BoLA-NC1:00301
    BoLA-NC1:00401
    BoLA-NC2:00101
    BoLA-NC2:00102
    BoLA-NC3:00101
    BoLA-NC4:00101
    BoLA-NC4:00201
    H-2-Dq
    H-2-Kq
    H-2-Lq
    HLA-A30:14L
    SLA-3:0402
    ```
2. Replaces the <i>term</i> column of the Tools_MRO_mapping dataframe with labels from the allelenames file.
3. Removes alleles belonging to tools that are going to be deprecated for the new site.
    ```
    deprecated_tools_list = [
        'ann-3.4',
        'netmhcpan',
        'arb',
        'comblib',
        'netmhccons',
        'netmhcstabpan',
        'recommended',
        'nn_align'
    ]
    ```
4. Renames the columns to: Tool, Tool Label, MRO Name, MRO ID.
5. Writes the final dataframe to an Excel file called <i>**Tools_MRO_Mapping_VFYD.xlsx**</i>.
<br><br>
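
Steps 1 to 4 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `map_missing_alleles_from_allelenames.py`: it uses plain lists of dicts instead of a dataframe so that it runs without dependencies, and the function name `prepare_mapping`, the input column names (`tool`, `term`, `mro_name`, `mro_id`), the tool key `netmhcpan-4.1`, and the shape of `allelenames` (a dict from allele name to label) are all hypothetical. Writing the Excel file (step 5) is omitted.

```python
# Tools deprecated for the new site (list taken from step 3 above).
DEPRECATED_TOOLS = {
    "ann-3.4", "netmhcpan", "arb", "comblib",
    "netmhccons", "netmhcstabpan", "recommended", "nn_align",
}

# Hypothetical input column names mapped to the output names from step 4.
COLUMN_RENAMES = {
    "tool": "Tool",
    "term": "Tool Label",
    "mro_name": "MRO Name",
    "mro_id": "MRO ID",
}


def prepare_mapping(rows, allelenames, tool="netmhcpan-4.1"):
    """Apply steps 1-4 to a list of mapping rows (dicts).

    allelenames is assumed to map an allele name to its label.
    """
    rows = [dict(row) for row in rows]  # work on copies, leave the input untouched

    # Step 1: add alleles from allelenames that are not yet mapped for this tool.
    mapped = {row["term"] for row in rows if row["tool"] == tool}
    for name in allelenames:
        if name not in mapped:
            rows.append({"tool": tool, "term": name, "mro_name": "", "mro_id": ""})

    # Step 2: replace the term with the label from allelenames where one exists.
    for row in rows:
        if row["tool"] == tool:
            row["term"] = allelenames.get(row["term"], row["term"])

    # Step 3: drop rows that belong to deprecated tools.
    rows = [row for row in rows if row["tool"] not in DEPRECATED_TOOLS]

    # Step 4: rename the columns; unknown columns keep their name.
    return [{COLUMN_RENAMES.get(key, key): value for key, value in row.items()}
            for row in rows]
```

Newly added alleles get empty MRO fields in this sketch; how the real script fills them in is not covered here.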

## Initializing Data (one-time process)
Assuming we start from scratch, this stage creates the first version of the data files that Allele Validator needs.<br><br>

First, retrieve the latest version of `mro_molecules.tsv` from **[MRO Github](https://github.com/IEDB/MRO)**. <br>
Run the following script, which automatically pulls the latest file from the MRO GitHub repository.
```
sh get_latest_MRO.sh
```

Then run a series of data preprocessing steps to create the files that serve as the basis for Allele Validator.
```
python initiate_tools_mapping.py
```
When the script finishes, it should have created two additional files:
- **tools-mapping.tsv**
- **mhc_alleles.tsv**

That's it! This whole stage should be a one-time process to kick off the initial data processing.

## Regular Build
Because **mro_molecules.tsv** may be updated on GitHub, the data needs to be pulled down, processed, and re-pickled regularly.
Run the following script to do so:
```
sh weekly_build.sh
```
The weekly build script consists of three stages:
1. Retrieving the latest MRO data from GitHub using the `get_latest_MRO.sh` script.
2. Running the regular build, which adds extra or newly added tool labels as synonyms, populates the predictor availability column and the tool_group column, and rebuilds the data source for Allele Autocomplete.
3. Generating pickle files of the data files.
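
To illustrate the third stage, here is a minimal sketch of turning a TSV data file into a pickle file. It is an assumption about the general idea rather than the actual build code: the helper name `pickle_tsv` is hypothetical, and it pickles a plain list of row dicts, whereas the real script may pickle dataframes or other structures.

```python
import csv
import pickle
from pathlib import Path


def pickle_tsv(tsv_path, pickle_path=None):
    """Read a TSV file into a list of row dicts and write it out as a pickle.

    Hypothetical helper; by default the pickle is written next to the TSV
    with a .pkl extension. Returns the path of the pickle file.
    """
    tsv_path = Path(tsv_path)
    pickle_path = Path(pickle_path) if pickle_path else tsv_path.with_suffix(".pkl")

    # Every value is kept as a string, exactly as it appears in the TSV.
    with tsv_path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))

    with pickle_path.open("wb") as handle:
        pickle.dump(rows, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return pickle_path
```

For example, `pickle_tsv("mhc_alleles.tsv")` would write `mhc_alleles.pkl` in the same directory.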




