# Reference Databases

A reference database is required to run DME+ in DRAGEN. The databases are stored remotely and must be downloaded prior to running an analysis. A shell script is provided to facilitate the download.

## Directory Setup

Before downloading the databases, create a directory dedicated to storing them. It is recommended that the directory be on a disk with at least 150 GB of free space. The path to this directory is passed to the `-d` parameter when the download script is run in subsequent steps; the examples below use `databases/`.
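The setup step can be sketched as follows. This is a minimal example, assuming a POSIX shell with `df` and `awk` available; the `DB_DIR` variable and the warning message are illustrative and are not part of the download script.

```shell
# Illustrative variable name; any path can be used as long as it matches -d later.
DB_DIR="databases"

# Create the directory if it does not already exist.
mkdir -p "$DB_DIR"

# Read the available space (in 1K blocks) on the file system holding DB_DIR.
avail_kb=$(df -Pk "$DB_DIR" | awk 'NR==2 {print $4}')
avail_gb=$((avail_kb / 1024 / 1024))

# Warn if less than the recommended 150 GB is free.
if [ "$avail_gb" -lt 150 ]; then
  echo "Warning: only ${avail_gb} GB free in ${DB_DIR}; at least 150 GB is recommended" >&2
fi
```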

## Obtaining the Download Script

Download and management of the reference databases is handled by a shell script. The script can be downloaded with the following command:

```shell
wget -O explify-dbs.sh https://illumina-databases.s3.us-east-1.amazonaws.com/explify-dbs.sh
chmod +x explify-dbs.sh
```

## Seeing What Databases are Available for Download

The `search` subcommand lists the databases that are available for download:

```shell
$ ./explify-dbs.sh search -d databases/
4 database(s) found meeting those criteria:
- Custom-1.0.0
- RPIP-6.7.0
- UPIP-8.8.0
- VSPv2-2.9.0
```

* The `-d` argument is the base directory used for storage of the databases
* Optionally, when a test panel name is specified with the `-p` argument, the results will be limited to that panel
* Optionally, setting the `-n` argument will filter the search to databases that have not already been downloaded

## Downloading a Database

The `download` subcommand is used to download the database files for a test panel:

```shell
./explify-dbs.sh download -d databases/ -p UPIP -v 8.8.0 -n 20
```

* The `-d` argument is the base directory used for storage of the databases
* The `-p` argument is the test panel name
* The `-v` argument is the test panel version
* The `-n` argument is the number of CPUs that can be used to download the files (defaults to 1)

Additional notes:

* In this example, after the UPIP-8.8.0 database files are downloaded, additional required files will be downloaded to a subdirectory named "common"
* After the files are downloaded, their checksums will be automatically checked
* Due to the size of some of the files, this command will take some time. It is best to run it via `screen` or `nohup`
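One way to keep the download running after the terminal session ends is to launch it with `nohup`, as sketched below. The log file name `download.log` is an arbitrary choice for illustration; the download command itself is the one shown above.

```shell
# Run the download in the background, detached from the terminal.
# Both standard output and standard error are written to download.log
# (an illustrative file name, not something the script requires).
nohup ./explify-dbs.sh download -d databases/ -p UPIP -v 8.8.0 -n 20 > download.log 2>&1 &

# $! holds the process ID of the background job just started.
echo "Download started with PID $!"
```

Progress can then be followed with `tail -f download.log`.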

## Listing Downloaded Databases

The `list` subcommand is used to view the databases that have already been downloaded:

```shell
./explify-dbs.sh list -d databases/
```

* The `-d` argument is the base directory used for storage of the databases
* Optionally, when a test panel name is specified with the `-p` argument, the results will be limited to that panel

## Checking Database Integrity

The `download` subcommand will automatically check the file checksums after download. The `check` subcommand can also be used on its own to check the files:

```shell
./explify-dbs.sh check -d databases/ -p UPIP -v 8.8.0 -n 20
```

* The `-d` argument is the base directory used for storage of the databases
* The `-p` argument is the test panel name
* The `-v` argument is the test panel version
* The `-n` argument is the number of CPUs that can be used to check the files (defaults to 1)

## Using the Databases with the DME+ Pipeline

The database files should be organized under a root directory first by test panel type, then by test panel version. Assuming the root directory is `databases/`, its organization should look like this:

```
databases/
    Custom/
        1.0.0/
    RPIP/
        6.7.0/
    UPIP/
        8.8.0/
    VSPv2/
        2.9.0/
```

To run an analysis with RPIP 6.7.0, for example, the following inputs would be needed:

```shell
--explify-ref-db-dir databases/
--explify-test-panel-name RPIP
--explify-test-panel-version 6.7.0
```

The DME+ pipeline will use these inputs to navigate to the specified database location, namely `databases/RPIP/6.7.0`.
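Before launching an analysis, it can be useful to confirm that the expected directory exists. The following is a minimal sketch of that check; the variable names are hypothetical and simply mirror the three inputs above.

```shell
# Hypothetical variables mirroring the three pipeline inputs.
REF_DB_DIR="databases/"
PANEL_NAME="RPIP"
PANEL_VERSION="6.7.0"

# Strip any trailing slash from the root, then join root/panel/version.
PANEL_DIR="${REF_DB_DIR%/}/${PANEL_NAME}/${PANEL_VERSION}"

if [ -d "$PANEL_DIR" ]; then
  echo "Found database directory: $PANEL_DIR"
else
  echo "Missing database directory: $PANEL_DIR" >&2
fi
```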

If the databases are stored on a regular file system, it is recommended that you set `--explify-load-db-ram=true`. This tells the pipeline to load the databases into memory for faster analysis. The databases can also be stored on a RAM disk, which reduces load time across many pipeline runs. In that case, it is recommended to set `--explify-load-db-ram=false`.
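The choice of value can be sketched as a small shell check. This example assumes a Linux system with GNU coreutils `stat`, and it assumes that a RAM disk is mounted as `tmpfs` or `ramfs`; the `DB_DIR` and `LOAD_DB_RAM` variable names are hypothetical.

```shell
# Hypothetical variable: the database root directory.
DB_DIR="databases"
mkdir -p "$DB_DIR"

# GNU stat: -f reports on the file system, %T prints its type in readable form.
fs_type=$(stat -f -c %T "$DB_DIR" 2>/dev/null)

# Assumption: tmpfs and ramfs are treated as RAM disks; anything else is a regular file system.
case "$fs_type" in
  tmpfs|ramfs) LOAD_DB_RAM=false ;;
  *)           LOAD_DB_RAM=true ;;
esac

echo "--explify-load-db-ram=${LOAD_DB_RAM}"
```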
