Sailfish¶
Sailfish is a tool for transcript quantification from RNA-seq data. It requires a set of target transcripts (either from a reference or de-novo assembly) to quantify. All you need to run sailfish is a fasta file containing your reference transcripts and a (set of) fasta/fastq file(s) containing your reads. Sailfish runs in two phases; indexing and quantification. The indexing step is independent of the reads, and only needs to be run once for a particular set of reference transcripts and choice of k (the k-mer size). The quantification step, obviously, is specific to the set of RNA-seq reads and is thus run more frequently.
Indexing¶
To generate the sailfish index for your reference set of transcripts, you should run the following command:
> sailfish index -t <ref_transcripts> -o <out_dir> -k <kmer_len>
This will build a sailfish index using k-mers of length <kmer_len>
for the
reference transcripts provided in the file <ref_transcripts>
and place the
index under the directory <out_dir>
. There are additional options that can
be passed to the sailfish indexer (e.g. the number of threads to use). These
can be seen by executing the command sailfish index -h
.
Note that, as of v0.7.0, the meaning of the -k
parameter has changed slightly.
Rather than the k-mer size on which Sailfish will quantify abundances, it becomes
the minimum match size that will be considered in the quasi-mapping
procedure during quantification. For sufficiently long (e.g. 75bp or greater)
reads, the default should be acceptable. If you have substantially shorter
reads, you may want to consider a smaller -k
.
Note
values of k
The k
value used to build the Sailfish index must be an odd number. Using an
even value for k
will raise an error and the full index will not be built.
Quantification¶
Now that you have generated the sailfish index (say that it’s the directory
<index_dir>
— this corresponds to the <out_dir> argument provided in the
previous step), you can quantify the transcript expression for a given set of
reads. To perform the quantification, you run a command like the following:
> sailfish quant -i <index_dir> -l "<libtype>" {-r <unmated> | -1 <mates1> -2 <mates2>} -o <quant_dir>
Where <index_dir>
is, as described above, the location of the sailfish
index, <libtype>
is a string describing the format of the fragment (read)
library (see Fragment Library Types), <unmated>
is a list of files
containing unmated reads, <mates{1,2}>
are lists of files containg,
respectively, the first and second mates of paired-end reads. Finally,
<quant_dir>
is the directory where the output should be written. Just like the
indexing step, additional options are available, and can be viewed by running
sailfish quant -h
.
When the quantification step is finished, the directory <quant_dir>
will
contain a file named “quant.sf” (and, if bias correction is enabled, an
additional file names “quant_bias_corrected.sf”). This file contains the
result of the Sailfish quantification step. This file contains a number of
columns (which are listed in the last of the header lines beginning with ‘#’).
Specifically, the columns are (1) Transcript ID, (2) Transcript Length, (3)
Transcripts per Million (TPM) and (6) Estimated number of reads (an estimate
of the number of reads drawn from this transcript given the transcript’s
relative abundance and length). The first two columns are self-explanatory,
the next four are measures of transcript abundance and the final is a commonly
used input for differential expression tools. The Transcripts per Million
quantification number is computed as described in [1], and is meant as an
estimate of the number of transcripts, per million observed transcripts,
originating from each isoform. Its benefit over the F/RPKM measure is that it
is independent of the mean expressed transcript length (i.e. if the mean
expressed transcript length varies between samples, for example, this alone can
affect differential analysis based on the K/RPKM.).
Description of important options¶
Sailfish exposes a number of useful optional command-line parameters to the user.
The particularly important ones are explained here, but you can always run
sailfish quant -h
to see them all.
-p
/ --numThreads
¶
The number of threads that will be used for quasi-mapping, quantification, and bootstrapping / posterior sampling (if enabled). Sailfish is designed to work well with many threads, so, if you have a sufficient number of processors, larger values here can speed up the run substantially.
--useVBOpt
¶
Use the variational Bayesian EM algorithm rather than the “standard” EM algorithm to optimize abundance estimates. The details of the VBEM algorithm can be found in [2], and the details of the variant over fragment equivalence classes that we use can be found in [3]. While both the standard EM and the VBEM produce accurate abundance estimates, those produced by the VBEM seem, generally, to be a bit more accurate. Further, the VBEM tends to converge after fewer iterations, so it may result in a shorter runtime; especially if you are computing many bootstrap samples.
--numBootstraps
¶
Sailfish has the ability to optionally compute bootstrapped abundance estimates. This is done by resampling (with replacement) from the counts assigned to the fragment equivalence classes, and then re-running the optimization procedure, either the EM or VBEM, for each such sample. The values of these different bootstraps allows us to assess technical variance in the main abundance estimates we produce. Such estimates can be useful for downstream (e.g. differential expression) tools that can make use of such uncertainty estimates. This option takes a positive integer that dictates the number of bootstrap samples to compute. The more samples computed, the better the estimates of varaiance, but the more computation (and time) required.
--numGibbsSamples
¶
Just as with the bootstrap procedure above, this option produces samples that allow
us to estimate the variance in abundance estimates. However, in this case the
samples are generated using posterior Gibbs sampling over the fragment equivalence
classes rather than bootstrapping. We are currently analyzing these different approaches
to assess the potential trade-offs in time / accuracy. The --numBootstraps
and
--numGibbsSamples
options are mutually exclusive (i.e. in a given run, you must
set at most one of these options to a positive integer.)
References¶
[1] | Li, Bo, et al. “RNA-Seq gene expression estimation with read mapping uncertainty.” Bioinformatics 26.4 (2010): 493-500. |
[2] | Nariai, Naoki, et al. “TIGAR: transcript isoform abundance estimation method with gapped alignment of RNA-Seq data by variational Bayesian inference.” Bioinformatics (2013): btt381. |
[3] | Rob Patro, Geet Duggal & Carl Kingsford “Accurate, fast, and model-aware transcript expression quantification with Salmon” bioRxiv doi: http://dx.doi.org/10.1101/021592 |