NAME

pheno-ranker: A script that performs semantic similarity analysis on PXF/BFF data structures and beyond (JSON|YAML)

SYNOPSIS

pheno-ranker -r <individuals.json> -t <patient.json> [-options]

  Arguments:
    * Cohort mode:
      -r, --reference <file>         JSON/YAML BFF/PXF file(s) (array/object), supports .gz

    * Patient mode:
      -t, --target <file>            JSON/YAML BFF/PXF file (object or single-object array), supports .gz 

  Options:
    -age                             Include age-related variables; excludes agent-like terms (BFF/PXF-only) [>no-age|age]
    -a, --align [path/basename]      Write alignment file(s). If not specified, default filenames are used [default: alignment.*]
    -append-prefixes <prefixes>      Prefixes for primary_key when #cohorts >= 2 [default: C]
    -config <file>                   YAML config file to modify default parameters [default: share/conf/config.yaml]
    -cytoscape-json [file]           Writes an undirected graph in Cytoscape-compatible JSON [default: graph.json]
    -e, --export [path/basename]     Export miscellaneous JSON files. If not specified, default filenames are used [default: export.*]
    -exclude-terms <terms>           Exclude BFF/PXF terms (e.g., --exclude-terms sex id) or column names in JSON-derived from CSV 
    -graph-stats [file]              Generates a text file with key graph metrics, for use with <-cytoscape-json> [default: graph_stats.txt]
    -graph-min-weight <number>       Keep graph edges with weight greater than or equal to this value
    -graph-max-weight <number>       Keep graph edges with weight less than or equal to this value
    -include-hpo-ascendants          Include ascendant terms from the Human Phenotype Ontology (HPO)
    -include-terms <terms>           Include BFF/PXF terms (e.g., --include-terms diseases) or column names in JSON-derived from CSV
    -max-matrix-records-in-ram <number> In cohort mode, set max records before switching to RAM-efficient mode [default: 5000]
    -matrix-format <format>          Matrix output format in cohort mode [>dense|mtx]
    -max-number-vars <number>        Maximum number of variables for binary string [default: 10000]
    -max-out <number>                Print only N comparisons [default: 50]
    -o, --out-file <file>            Output file path [default: -r matrix.txt | -t rank.txt]
    -poi, --patients-of-interest <id_list>   Export JSON files for the selected individual IDs during a dry-run
    -poi-out-dir <directory>         Directory for JSON files (used with --poi)
    -prp, --precomputed-ref-prefix [path/basename]   Use precomputed data for the reference cohort(s). No need to use -r
    -retain-excluded-phenotypicFeatures     Retains features set to "excluded": true by appending '_excluded' to their IDs
    -similarity-metric-cohort <metric>  Similarity metric for cohort mode [>hamming|jaccard]
    -sort-by <metric>                Sort by Hamming distance or Jaccard index [>hamming|jaccard]
    -w, --weights <file>             YAML file with weights

  Generic Options:
    -debug <level>                   Print debugging (from 1 to 5, being 5 max)
    -h, --help                       Brief help message
    -log                             Save log file [default: pheno-ranker-log.json]
    -man                             Full documentation
    -no-color                        Toggle color output [>color|no-color]
    -v, --verbose                    Verbosity on
    -V, --version                    Print version

SUMMARY

Pheno-Ranker is a lightweight, easy-to-install tool for performing semantic similarity analysis on phenotypic data in JSON/YAML formats, including Beacon v2 Models and Phenopackets v2. It also supports pre-processed CSV files prepared using the included csv2pheno-ranker utility.

INSTALLATION

If you only plan to use the pheno-ranker CLI, we recommend installing it via CPAN. See details below.

Non-containerized

The Perl command-line interface is tested on Linux, macOS, and Windows via GitHub Actions. The commands below focus on Debian-based Linux systems, where Perl 5 is typically available by default and extra CPAN modules can be installed with cpanminus. On Windows, use Docker, WSL, or a Perl environment such as Strawberry Perl.

Method 1: From CPAN

First, install the system-level dependencies:

sudo apt-get install cpanminus libperl-dev gcc make

We will install Pheno-Ranker and its dependencies in ~/perl5:

cpanm --local-lib=~/perl5 local::lib && eval $(perl -I ~/perl5/lib/perl5/ -Mlocal::lib)
cpanm --notest Pheno::Ranker
pheno-ranker --help

To ensure Perl recognizes your local modules every time you start a new terminal, you should type:

echo 'eval $(perl -I ~/perl5/lib/perl5/ -Mlocal::lib)' >> ~/.bashrc
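To confirm that the local::lib environment is active in the current shell, a quick check such as the sketch below may help. It only asks Perl to load a module and reports whether that worked; the helper name perl_module_available is hypothetical and not part of Pheno-Ranker.

```shell
# Hypothetical helper: succeeds if Perl can load the named module
perl_module_available() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

if perl_module_available Pheno::Ranker; then
    echo "Pheno::Ranker is visible to Perl"
else
    echo "Pheno::Ranker not found; is local::lib loaded in this shell?" >&2
fi
```

If the check fails in a new terminal but works after re-running the eval line above, the ~/.bashrc entry is probably not being read by that shell.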

To update to the newest version:

cpanm Pheno::Ranker

Method 2: From CPAN in a CONDA environment

Please follow these instructions.

Method 3: From GitHub

To clone the repository for the first time:

git clone https://github.com/cnag-biomedical-informatics/pheno-ranker.git
cd pheno-ranker

To update an existing clone, navigate to the repository folder and run:

git pull

Install the system-level dependencies:

sudo apt-get install cpanminus libperl-dev

Now choose one of the two options below:

Option 1: Install dependencies (they're harmless to your system) as sudo:

cpanm --notest --sudo --installdeps .
bin/pheno-ranker --help            

Option 2: Install the dependencies at ~/perl5:

cpanm --local-lib=~/perl5 local::lib && eval $(perl -I ~/perl5/lib/perl5/ -Mlocal::lib)
cpanm --notest --installdeps .
bin/pheno-ranker --help

To ensure Perl recognizes your local modules every time you start a new terminal, you should type:

echo 'eval $(perl -I ~/perl5/lib/perl5/ -Mlocal::lib)' >> ~/.bashrc

Optional: If you want to use utils/barcode or utils/bff_pxf_plot:

sudo apt-get install python3-pip libzbar0
pip3 install -r requirements.txt

Containerized

Method 4: From Docker Hub

(Estimated Time: Approximately 10 seconds)

Download the latest version of the Docker image (supports both amd64 and arm64 architectures) from Docker Hub by executing:

docker pull manuelrueda/pheno-ranker:latest
docker image tag manuelrueda/pheno-ranker:latest cnag/pheno-ranker:latest

See additional instructions below.

Method 5: With Dockerfile

(Estimated Time: Approximately 1 minute)

Please download the Dockerfile from the repo:

wget https://raw.githubusercontent.com/cnag-biomedical-informatics/pheno-ranker/main/Dockerfile

And then run:

# Docker Version 19.03 and Above (Supports buildx)
docker buildx build -t cnag/pheno-ranker:latest .

# Docker Version Older than 19.03 (Does Not Support buildx)
docker build -t cnag/pheno-ranker:latest .

Additional instructions for Methods 4 and 5

To run the container (detached) execute:

docker run -tid -e USERNAME=root --name pheno-ranker cnag/pheno-ranker:latest

To enter:

docker exec -ti pheno-ranker bash

The command-line executable can be found at:

/usr/share/pheno-ranker/bin/pheno-ranker

The default container user is root, but you can also run the container as UID 1000 (dockeruser):

docker run --user 1000 -tid --name pheno-ranker cnag/pheno-ranker:latest

Mounting volumes

Docker containers are fully isolated. If you need to mount a volume in the container, use the following syntax (-v host:container). Find an example below (note that you need to change the paths to match yours):

docker run -tid --volume /media/mrueda/4TBT/data:/data --name pheno-ranker-mount cnag/pheno-ranker:latest

Then I will do something like this:

# First I create an alias to simplify invocation (from the host)
alias pheno-ranker='docker exec -ti pheno-ranker-mount /usr/share/pheno-ranker/bin/pheno-ranker'

# Now I use the alias to run the command (note that I use the flag -o to specify the filepath)
pheno-ranker -r /data/individuals.json -o /data/matrix.txt
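
In scripts, a shell function can be more convenient than an alias, since aliases are normally not expanded in non-interactive shells and "docker exec -t" fails when no terminal is attached. The following is a minimal sketch under those assumptions; the function name pheno_ranker_docker and the DOCKER override variable are hypothetical, while the container name and executable path are the ones used above.

```shell
# Hypothetical wrapper around the containerized executable.
# Set DOCKER=echo to preview the command instead of running it.
pheno_ranker_docker() {
    # Request a pseudo-TTY only when stdin is a terminal
    if [ -t 0 ]; then tty_flags="-ti"; else tty_flags="-i"; fi
    "${DOCKER:-docker}" exec "$tty_flags" pheno-ranker-mount \
        /usr/share/pheno-ranker/bin/pheno-ranker "$@"
}

# Preview the command that would be executed inside the container
DOCKER=echo pheno_ranker_docker -r /data/individuals.json -o /data/matrix.txt </dev/null
```

Note that the paths passed to -r and -o are container paths (under the mounted /data), not host paths.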

System requirements

* OS/ARCH supported: linux/amd64 and linux/arm64.
* Ideally a Debian-based distribution (Ubuntu or Mint), but any other (e.g., CentOS, OpenSUSE) should work as well (untested).
  The Perl CLI is also tested on macOS and Windows; container images are Linux-based.
* Perl 5 (>= 5.26 core; installed by default in most Linux distributions). Check the version with "perl -v".
* >= 4GB of RAM
* 1 core
* At least 16GB HDD
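
The Perl version requirement above can also be checked from a script. The sketch below compares a dotted version string against the 5.26 minimum; the function name perl_version_ok is hypothetical, and the check only looks at the major and minor components.

```shell
# Hypothetical helper: succeeds if a dotted version (e.g. "5.36.0") is >= 5.26
perl_version_ok() {
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}   # second component
    [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 26 ]; }
}

# Feed it the version of the perl found in PATH, if there is one
if command -v perl >/dev/null 2>&1; then
    perl_version_ok "$(perl -e 'printf "%vd", $^V')" \
        || echo "Perl 5.26 or newer is required" >&2
fi
```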

HOW TO RUN PHENO-RANKER

To run pheno-ranker you will need one or more PXF/BFF files in JSON|YAML format. The reference cohort must be a JSON array in which each individual's data are consolidated into a single object.
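
As an illustration of the expected shape, the sketch below writes a tiny, made-up BFF-style cohort (one array, one object per individual) and applies a crude check that the file starts with an array. The ids, the file name, and the helper looks_like_json_array are hypothetical; the check is not a JSON validator and does not handle .gz or YAML input.

```shell
# Toy cohort with invented individuals, for illustration only
toy="${TMPDIR:-/tmp}/toy_individuals.json"
cat > "$toy" <<'EOF'
[
  { "id": "toy_1", "sex": { "id": "NCIT:C20197", "label": "male" } },
  { "id": "toy_2", "sex": { "id": "NCIT:C16576", "label": "female" } }
]
EOF

# Crude shape check: the first non-whitespace character must be "["
looks_like_json_array() {
    first=$(tr -d '[:space:]' < "$1" | cut -c1)
    [ "$first" = "[" ]
}

looks_like_json_array "$toy" && echo "top-level array found in $toy"
```

Real data will carry many more terms per individual; the downloadable examples are a better reference for actual content.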

You can download examples from this location.

There are two modes of operation:

Cohort mode:

Intra-cohort: With the -r argument and 1 cohort.

Inter-cohort: With -r and multiple cohort files. It can be used in combination with --append-prefixes to add prefixes to each individual id.

Patient mode:

With -r reference cohort(s) and -t patient data.

Examples:

$ bin/pheno-ranker -r phenopackets.json  # intra-cohort

$ bin/pheno-ranker -r phenopackets.yaml -o my_matrix.txt # intra-cohort

$ bin/pheno-ranker -r phenopackets.json -w weights.yaml --exclude-terms sex ethnicity exposures # intra-cohort with weights

$ $path/pheno-ranker -r individuals.json others.yaml --append-prefixes CANCER CONTROL  # inter-cohort

$ $path/pheno-ranker -r individuals.json -t patient.yaml -max-out 100 # patient mode

COMMON ERRORS AND SOLUTIONS

* Error message: R plotting
    Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
    line 1 did not have X elements
    Calls: as.matrix -> read.table -> scan
    Execution halted
  Solution: Make sure that the values of your primary key (e.g., "id") do not contain spaces (e.g., "my fav id" must be "my_fav_id")
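
  The renaming suggested in this solution can be sketched as a small helper that replaces spaces and tabs with underscores. The name sanitize_id is hypothetical, and the function handles a single id string; applying it to the ids inside a JSON file is left to whatever tool generates that file.

```shell
# Hypothetical helper: replace each space or tab in an id with an underscore
sanitize_id() {
    printf '%s' "$1" | tr ' \t' '__'
}

sanitize_id "my fav id"   # prints my_fav_id
```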

CITATION

The author requests that any published work that uses Pheno-Ranker include a citation of the following reference:

Leist, I.C. et al., (2024). Pheno-Ranker: a toolkit for comparison of phenotypic data stored in GA4GH standards and beyond. BMC Bioinformatics. DOI: 10.1186/s12859-024-05993-2

AUTHOR

Written by Manuel Rueda, PhD. Info about CNAG can be found at https://www.cnag.eu.

COPYRIGHT AND LICENSE

This Perl file is copyrighted. See the LICENSE file included in this distribution.