NAME

Bio::WGS2NCBI - module to assist in submitting whole genome sequencing projects to NCBI

DESCRIPTION

This module documents the actions (prepare, process, convert, trim, prune and compress) that are available to users of the wgs2ncbi script. Each of these steps is configured by one or more configuration files. In the documentation below, the relevant fields from these configuration files are listed. To understand how the configuration system itself works, consult the documentation at Bio::WGS2NCBI::Config.

prepare

The prepare action takes the annotation file (in GFF3 format) and extracts the relevant information out of it, writing it to a potentially large set of files. This is done because GFF3 annotation files can become quite large, so that finding the annotations for any particular scaffold might take a long time if this is done by scanning through the whole file. Instead, the set of annotations is reduced, by taking the following steps:

Remove all sequence data from the file - Embedded FASTA data is permissible according to the GFF3 standard, but this makes files needlessly bulky if the FASTA data are available separately as well.
Remove all annotations from unrecognized sources - GFF3 files can contain annotations from sources you don't particularly trust and want to ignore in your submission. This is configurable.
Remove all irrelevant features - any sequence features in GFF3 that are not recognized in NCBI feature tables are discarded. This is configurable.

Subsequently, the annotations are split such that there is a separate file for each scaffold (or chromosome, if that is how your annotations are organized). This way, the relevant information for any scaffold can be found much more quickly through the file system rather than by scanning through a large file.
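
The filtering and splitting steps above can be sketched as follows. This is a minimal illustration in Python, not the module's own Perl implementation. The function names, the `<seqid>.gff3` file naming, and the example source `maker` are assumptions made for the sake of the example.

```python
import os
from collections import defaultdict


def split_gff3(lines, sources, features):
    """Group GFF3 feature lines by sequence ID, keeping only trusted
    sources and wanted feature types, and ignoring embedded FASTA."""
    per_scaffold = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence data
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and other directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 feature line
        seqid, source, ftype = cols[0], cols[1], cols[2]
        if source in sources and ftype in features:
            per_scaffold[seqid].append(line)
    return dict(per_scaffold)


def write_split(per_scaffold, outdir):
    """Write one file per scaffold (hypothetical '<seqid>.gff3' naming)."""
    os.makedirs(outdir, exist_ok=True)
    for seqid, rows in per_scaffold.items():
        with open(os.path.join(outdir, seqid + ".gff3"), "w") as handle:
            handle.write("\n".join(rows) + "\n")


# Made-up input: two sources, embedded FASTA at the end.
example = [
    "##gff-version 3",
    "scaffold1\tmaker\tgene\t1\t100\t.\t+\t.\tID=g1",
    "scaffold1\trepeatmasker\tmatch\t5\t50\t.\t+\t.\tID=r1",
    "scaffold2\tmaker\tCDS\t10\t90\t.\t-\t0\tID=c1",
    "##FASTA",
    ">scaffold1",
    "ACGT",
]
result = split_gff3(example, sources={"maker"}, features={"gene", "CDS"})
```

Here `result` holds one list of retained lines per scaffold, which `write_split` would then write to separate files, so that looking up a scaffold becomes a file system lookup rather than a scan.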

In order for this action to succeed, the following configuration values need to be provided:

gff3file

The location of the input annotation file in GFF3 format.

gff3dir

The location of the output directory for the split annotation files.

source

Which annotation source to trust.

feature

Which features to retain.

process

The process action takes the sequence file (in FASTA format, with a record for each scaffold or chromosome) and the pre-processed annotations and converts these into feature table files and (masked) sequence files.

In order for this action to succeed, the following configuration values need to be provided:

datafile

The location of the input FASTA file, with a record for each scaffold or chromosome.

info

An INI-style configuration file that contains key/value pairs that will be embedded in the FASTA sequence headers of the produced output files. This is typically used for metadata about the sampled organism, such as its sex, collection locality, collected cell type, etc.

masks

An INI-style configuration file that contains the coordinates of sequence segments to mask. This may be needed because NCBI performs a strict screen for unclipped adaptor sequences and contaminants. The report that NCBI returns states the sequence coordinates of segments that it will not accept in a submission. By putting these coordinates in this file, the offending segments will be replaced with NNNs.

products

An INI-style configuration file that contains the corrected names for protein products. The rationale is that your genome annotation process may introduce protein names that NCBI would like to deprecate, such as names that include molecular weights, database identifiers, references to 'homology', and so on. The discrepancy report that is produced in the convert step will be a first guide in composing corrected names, but the validation that NCBI will perform will likely point out additional errors.

gff3dir

The location of the pre-processed GFF3 files as produced by the prepare action.

datadir

The location of the output dir where the (potentially 'chunked', see below) sequence files and feature tables will be written.

prefix

A short character sequence that is prefixed to every sequence record identifier that is generated. NCBI will provide submitters with this prefix when the submission is initialized.

authority

This is a naming authority that will be applied to all sequence record identifiers. A reasonable value for this could be the name of the lab or institution that leads the project resulting in the submission. NCBI intends this authority, in combination with the prefix, as a way to ensure that sequences are globally uniquely identifiable.

minlength

The minimum length of a scaffold to be retained in a submission. This should be 200 or above.

minintron

The minimum length of an intron to be retained in a submission. Introns shorter than this are interpreted (by NCBI) to be spurious and should therefore be discarded. As a consequence, the gene that contains such an intron will be annotated as a pseudogene. This value must be 10 or above.

chunksize

The output that is produced can be combined into chunks of more than one scaffold per file. To keep the number of files manageable it is convenient to set this to a large value, but less than or equal to 10,000.

limit

This parameter allows you to run the process on only a limited set of scaffolds. This is provided for testing, "dry run" purposes. For real usage this value must be set to 0.
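
To make the masks, minlength and chunksize settings more concrete, here is a small Python sketch of the three operations they describe. It is not the module's implementation. The function names are made up, and the sketch assumes that mask coordinates are 1-based and inclusive, which the documentation above does not state.

```python
def mask_segments(sequence, segments):
    """Replace each (start, end) segment with N's.

    Assumption: coordinates are 1-based and inclusive.
    """
    bases = list(sequence)
    for start, end in segments:
        for i in range(max(start, 1) - 1, min(end, len(bases))):
            bases[i] = "N"
    return "".join(bases)


def long_enough(sequence, minlength=200):
    """True if a scaffold meets the minimum length for submission."""
    return len(sequence) >= minlength


def chunk(records, chunksize):
    """Yield successive lists of at most `chunksize` records."""
    for i in range(0, len(records), chunksize):
        yield records[i:i + chunksize]


masked = mask_segments("ACGTACGTAC", [(3, 5)])
chunks = list(chunk(["s1", "s2", "s3", "s4", "s5"], 2))
```

With these inputs, positions 3 to 5 of the sequence are replaced by N, and the five scaffold names are grouped into three chunks of at most two records each.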

convert

The convert action runs the tbl2asn program provided by NCBI with the right settings. This requires the following configuration settings:

datadir

The location of the directory where the (potentially 'chunked', see chunksize under process) sequence files and feature tables were written by the process action.

template

The location of the template file produced with the form at: http://www.ncbi.nlm.nih.gov/WebSub/template.cgi

outdir

The location where to write the resulting ASN.1 files.

discrep

The location where to write the discrepancy report.

tbl2asn

The location where the tbl2asn executable is located.

trim

The trim action trims stretches of leading or trailing NNNs from sequence records, and updates the coordinates in the associated feature tables accordingly. In cases where a feature falls within a trimmed region, the feature is removed entirely. This requires the following configuration setting:

datadir

The location of the directory where the (potentially 'chunked', see chunksize under process) sequence files and feature tables were written by the process action.
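
The coordinate bookkeeping that trimming involves can be sketched in Python as follows. This is an illustration only, not the module's code. It assumes 1-based, inclusive (start, end) pairs with start no greater than end, and it drops any feature that extends into a trimmed region, which is one possible reading of the rule described for the trim action.

```python
def trim_n(sequence, features):
    """Strip leading and trailing N's and shift feature coordinates.

    `features` is a list of 1-based, inclusive (start, end) pairs.
    Features that reach into a trimmed region are dropped.
    """
    lead = len(sequence) - len(sequence.lstrip("Nn"))
    trimmed = sequence.strip("Nn")
    kept = []
    for start, end in features:
        new_start, new_end = start - lead, end - lead
        if new_start >= 1 and new_end <= len(trimmed):
            kept.append((new_start, new_end))
    return trimmed, kept


trimmed, kept = trim_n("NNNACGTACNN", [(1, 2), (4, 7), (8, 11)])
```

In this example three leading and two trailing N's are removed. The feature at 4..7 shifts to 1..4, while the features at 1..2 and 8..11 are dropped because they touch the trimmed ends.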

prune

The prune action reads a discrepancy file as supplied by NCBI, parses out errors that have locations in them, which are then pruned from the table files in $config->datadir.

This requires the following configuration settings:

datadir

The location of the directory where the (potentially 'chunked', see chunksize under process) sequence files and feature tables were written by the process action.

validation

The location where to read the validation report from NCBI.

prefix

The ID prefix that was assigned to you by NCBI when you created your submission, something like 'CR513_'.

authority

The naming authority prefix that you chose for your identifiers, something like 'gnl|aceprd|'.

compress

The compress action bundles the ASN.1 files produced by the convert action into a .tar.gz archive that can be uploaded to NCBI. This requires the following configuration settings:

outdir

The location where the ASN.1 files were written.

archive

The name and location of the archive to produce.
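
For readers who want to see what this bundling amounts to, the following Python sketch builds a gzipped tar archive from a directory of files. It is not the module's implementation. The function name and the `.sqn` file suffix are assumptions, so adjust the suffix to whatever your convert step actually writes.

```python
import os
import tarfile


def bundle(outdir, archive, suffix=".sqn"):
    """Add every file in `outdir` ending in `suffix` to a gzipped tar.

    Returns the sorted list of file names that were archived.
    """
    names = sorted(n for n in os.listdir(outdir) if n.endswith(suffix))
    with tarfile.open(archive, "w:gz") as tar:
        for name in names:
            # arcname keeps the archive flat, without the outdir path
            tar.add(os.path.join(outdir, name), arcname=name)
    return names
```

A call such as `bundle("asn1", "submission.tar.gz")` (both paths hypothetical) mirrors the roles of the outdir and archive settings.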

help

Displays module documentation (which you are reading now).