Reference Indexes

From UFRC

Many of the bioinformatics tools available on our system, whether they are run on the command line through the batch system or in Galaxy, require reference indexes built for the respective tool from the appropriate genomic sequence data. This page gives a short overview of our reference index building practices and requirements.

Source data

If your project requires a specific version of a genome, we need to know which genome build to use for reference index building. When you request a reference build, please point us to the fasta file, located either in your scratch space or in one of your Galaxy histories, or give us the exact URI from which we can download it. If you are not sure which build you need, please list the full name of the genome and its preferred provenance, because the nomenclature and structure of data from UCSC, JGI, NCBI, and Ensembl differ, sometimes subtly and sometimes drastically.

If we are to choose the build, we will likely download the hard-masked sequence, i.e. fasta data in which repeats and low-complexity regions have been identified with RepeatMasker or a similar program and replaced with 'N' symbols. Occasionally, if the genome is at an early stage of assembly, the sequence left unmasked after hard-masking is too short for most reference indexes to be built. In that case we will use the non-masked sequence.
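To make the hard-masking point concrete, the following is a minimal sketch, not our actual tooling, that reports how much of each fasta record consists of 'N' symbols and how long the longest remaining unmasked stretch is. The function names `read_fasta` and `masking_summary` are hypothetical and were chosen for this illustration only, and any threshold for "too short" is left to the reader.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from fasta-formatted lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Use the first word of the header as the record name.
            name = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def masking_summary(sequence: str) -> Dict[str, float]:
    """Summarize hard-masking: fraction of N symbols and longest unmasked run."""
    total = len(sequence)
    masked = sequence.upper().count("N")
    # Every maximal stretch without N/n counts as one unmasked run.
    runs = re.findall(r"[^Nn]+", sequence)
    longest = max((len(run) for run in runs), default=0)
    return {
        "length": total,
        "masked_fraction": masked / total if total else 0.0,
        "longest_unmasked_run": longest,
    }


# Tiny made-up example; real input would be the lines of a fasta file.
example = [">contig1 demo", "ACGTNNNNAC", "GTNN", ">contig2", "NNNN"]
for record_name, record_seq in read_fasta(example):
    print(record_name, masking_summary(record_seq))
```

A record whose longest unmasked run is very short, as in early draft assemblies, is the kind of case where the non-masked sequence would be the better source.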

Reference Indexes

The list of reference indexes we build for Galaxy usually includes the following:

  • Fasta
  • BLAST
  • Bowtie
  • Bowtie2
  • BWA
  • 2bit
  • Samtools
  • Picard/GATK

Less commonly used indexes, which we build on request, are:

  • Bfast
  • Lastz
  • Mosaik

We also stage a chromosome build for Galaxy's visualization tools when possible.
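For several of the commonly built indexes listed above, the build step is a single command run against the fasta file. The sketch below only assembles those command lines and does not run them. It illustrates typical invocations rather than our exact build procedure. The function name `index_commands` and the output directory layout are hypothetical, and the use of GATK4's `CreateSequenceDictionary` for the Picard/GATK dictionary is an assumption.

```python
import shlex
from pathlib import PurePosixPath
from typing import Dict, List


def index_commands(fasta: str, out_dir: str) -> Dict[str, List[str]]:
    """Return one typical build command per index type (nothing is executed)."""
    fa = PurePosixPath(fasta)
    out = PurePosixPath(out_dir)  # hypothetical layout: one subdirectory per tool
    stem = fa.stem
    return {
        # Writes <fasta>.fai next to the fasta file.
        "Samtools": ["samtools", "faidx", str(fa)],
        "BWA": ["bwa", "index", "-p", str(out / "bwa" / stem), str(fa)],
        "Bowtie": ["bowtie-build", str(fa), str(out / "bowtie" / stem)],
        "Bowtie2": ["bowtie2-build", str(fa), str(out / "bowtie2" / stem)],
        "BLAST": ["makeblastdb", "-in", str(fa), "-dbtype", "nucl",
                  "-out", str(out / "blast" / stem)],
        "2bit": ["faToTwoBit", str(fa), str(out / "2bit" / (stem + ".2bit"))],
        # Assumption: GATK4 wrapper; the dictionary sits next to the fasta file.
        "Picard/GATK": ["gatk", "CreateSequenceDictionary",
                        "-R", str(fa), "-O", str(fa.with_suffix(".dict"))],
    }


# Hypothetical paths, for illustration only.
for tool, command in index_commands("/data/genome.fa", "/data/indexes").items():
    print(f"{tool}: {shlex.join(command)}")
```

Keeping the commands as argument lists makes them easy to review before passing them to a batch job or to `subprocess.run`.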

Database Keys

To identify the reference indexes for a particular genome build, Galaxy usually relies on two identifiers, which are written into its configuration files:

dbkey - generic identifier of a particular genome

and

unique build id - specific identifier of a particular build. This ID distinguishes between UCSC and Ensembl builds of the same genome, and between successive versions of the index builds in case we need to do additional processing or use both masked and unmasked data.
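As an illustration of how the two identifiers fit together, the sketch below models one line of a tab-separated Galaxy index location (.loc) file. It assumes the four-column layout used by many of Galaxy's sample .loc files (unique build id, dbkey, display name, path). The class and function names, build IDs, display names, and paths are all hypothetical.

```python
from typing import Iterable, List, NamedTuple


class LocEntry(NamedTuple):
    """One row of a tab-separated .loc file (assumed four-column layout)."""
    unique_build_id: str
    dbkey: str
    display_name: str
    path: str

    def to_line(self) -> str:
        return "\t".join(self)


def parse_loc_line(line: str) -> LocEntry:
    """Parse one non-comment .loc line into a LocEntry."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
    return LocEntry(*fields)


def builds_for_dbkey(entries: Iterable[LocEntry], dbkey: str) -> List[str]:
    """List the unique build ids registered under one dbkey."""
    return [entry.unique_build_id for entry in entries if entry.dbkey == dbkey]


# Hypothetical entries: one genome (dbkey) with a masked and an unmasked build.
entries = [
    LocEntry("hg19_ucsc_hardmasked", "hg19", "Human hg19 (UCSC, hard-masked)",
             "/path/to/bowtie2/hg19_ucsc_hardmasked"),
    LocEntry("hg19_ucsc_unmasked", "hg19", "Human hg19 (UCSC, unmasked)",
             "/path/to/bowtie2/hg19_ucsc_unmasked"),
]
for entry in entries:
    print(entry.to_line())
print(builds_for_dbkey(entries, "hg19"))
```

Both entries share the same dbkey, so tools see them as the same genome, while the unique build id keeps the two index builds apart.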