Snakemake and the Scientific Filesystem

Jan 30, 2018. | By: @vsoch and @fbartusch

Here is a story of snakemake, the workflow manager that could, and his partner in crime, the Scientific Filesystem!

What do you get when you combine a Scientific Filesystem, an organized, programmatically accessible, and discoverable specification for scientific applications, and Snakemake, a workflow management system for creating scalable data analyses? In a container? Reproducibility, of course! Here we will go through a small tutorial to generate:

  • a scientific filesystem
  • in a container (Docker and Singularity!)
  • with Snakemake as a workflow manager.
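
As a tiny taste of the combination, a scif recipe can expose Snakemake itself as an app. This is only a hedged sketch, not the tutorial's actual recipe: the app name, the Snakefile location, and the use of SCIF_APPDATA are illustrative assumptions.

%apprun snakemake
    # run the workflow shipped in this app's data folder (location assumed)
    cd "$SCIF_APPDATA"
    exec snakemake "$@"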

[Read More]

CarrierSeq: sequence analysis workflow with SCIF

Jan 29, 2018. | By: Vanessasaurus

Some time ago we featured the CarrierSeq workflow run with Singularity. At the time, SCIF was available only for Singularity, and with great feedback from reviewers and the community, SCIF (the Scientific Filesystem) was refashioned to be installable in any container technology (Docker or Singularity) or on your host!

Just to hammer it into your brain, I’ll remind you what the Scientific Filesystem, referred to as SCIF, is. SCIF is a specification for a filesystem organization, a set of environment variables, and functions that control the two in order to optimize the usability and discoverability of scientific applications. Thus, in this second tutorial we are again going to install a Scientific Filesystem into a Docker container, and the same one into a Singularity container, using the same scif recipe file.
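
To give a flavor of "one recipe, any technology," the same pattern with the scif client looks roughly like the following. This is a hedged sketch: the recipe filename carrierseq.scif and the app name mapping are assumptions, not the tutorial's actual files.

pip install scif                      # the scif client (host, or inside any container base)
scif install carrierseq.scif          # creates /scif with the apps defined in the recipe
scif apps                             # discover what was installed
scif run mapping                      # run one of the apps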

[Read More]

An Introduction to SCI-F Apps

Jan 29, 2018. | By: Vanessasaurus

In this tutorial, we are going to get started with the Scientific Filesystem (SCIF), which is a specification for a filesystem organization, a set of environment variables, and functions that control the two in order to optimize the usability and discoverability of scientific applications. Specifically, we will build a “foobar” container, and a “cowsay” container that prints colored fortunes. We don’t even have to make containers (we can install on the host), but we will for this tutorial to keep the applications isolated from the host.
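
As a minimal sketch of what the “foobar” example boils down to (the echoed messages here are made up for illustration), two apps are simply two named sections in the recipe:

%apprun foo
    echo "Hello from foo"

%apprun bar
    echo "Hello from bar"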

[Read More]

CarrierSeq: sequence analysis workflow with Singularity + SCIF

Sep 26, 2017. | By: Vanessasaurus

Singularity is a container technology, similar to Docker, that is designed to run securely in HPC environments. For this pipeline, we take a previously implemented pipeline (in Docker and for the host) and provide two example Singularity containers, each of which provides modular access to the different software inside. By way of using the Scientific Filesystem (SCI-F) with Singularity, we have a lot of freedom in deciding what level of functions we want to expose to the user. A developer will want easy access to the core tools (e.g., bwa, seqtk), while a user will likely want one level up, at the level of a collection of steps associated with some task (e.g., mapping). These two use cases enhance our traditional use of a container as a single executable. We will walk through the steps of building and using each one.

Setup

You first need to install Singularity. For this tutorial, we are using the development branch with the upcoming version 2.4. You can install it as follows.

git clone -b development https://www.github.com/singularityware/singularity.git
cd singularity
./autogen.sh
./configure --prefix=/usr/local
make
sudo make install

Now you are ready to go!

Get Data

If you aren’t familiar with genomic analysis (as I’m not), you will be utterly confused about how to download data. The reference is provided, but the input data isn’t. The container was originally designed to download its own data, but due to a change in the sra-toolkit, the original data is provided as a manual download link from Dropbox. The good news is, if/when the data download works, we can provide another app in the container to get it, without needing to understand how the sra-toolkit works.
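
For intuition, such a download app is just another section in the recipe. This is a hedged sketch only: READS_URL is a placeholder, and the actual download app in the container may look different.

%apprun download
    # fetch the raw reads into the shared data folder instead of asking the user
    # to fiddle with the sra-toolkit manually (READS_URL is hypothetical)
    wget -P "$SINGULARITY_DATA" "$READS_URL"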

Build the image

Building looks like this:

sudo singularity build --writable carrierseq.img Singularity

If you prefer to have a “sandbox” directory, you can do this:

sudo singularity build --sandbox carrierseq.img Singularity

I don’t fully understand why we have both, and I’m ok with that. If you don’t need a writable image (for testing), just build a final one:

sudo singularity build carrierseq.img Singularity

This will build an essentially frozen image, so your pipeline is forever preserved.

Listing Pipeline Apps

If you didn’t know anything about the image, you would want to explore it. SCI-F provides easy commands to do that.

Global Help

singularity help carrierseq.img

    CarrierSeq is a sequence analysis workflow for low-input nanopore 
    sequencing which employs a genomic carrier.

    Github Contributors: Angel Mojarro (@amojarro), 
                         Srinivasa Aditya Bhattaru (@sbhattaru),   
                         Christopher E. Carr (@CarrCE),
                         Vanessa Sochat (@vsoch).

    fastq-filter from: https://github.com/nanoporetech/fastq-filter
    see:
           singularity run --app readme carrierseq.img | less for more detail

    To run a typical pipeline, you might do:

    # Download data
    singularity run --app mapping --bind data:/scif/data carrierseq.img
    singularity run --app poisson --bind data:/scif/data carrierseq.img
    singularity run --app sorting --bind data:/scif/data carrierseq.img
   

SCI-F Apps

You can see apps in the image, as instructed:

singularity apps carrierseq.img
download
mapping
poisson
readme
reference
sorting

The readme command mentioned in the help output above is an internal app provided only to add, preserve, and make the README accessible. Cool!

singularity run --app readme carrierseq.img | less

# CarrierSeq

## About

bioRxiv doi: https://doi.org/10.1101/175281

CarrierSeq is a sequence analysis workflow for low-input nanopore sequencing which employs a genomic carrier.

Github Contributors: Angel Mojarro (@amojarro), Srinivasa Aditya Bhattaru (@sbhattaru), Christopher E. Carr (@CarrCE), and Vanessa Sochat (@vsoch).
fastq-filter from: https://github.com/nanoporetech/fastq-filter

.......etc

And then we can ask for help for any of the pipeline steps:

singularity help --app mapping carrierseq.img
singularity help --app poisson carrierseq.img
singularity help --app sorting carrierseq.img

We can also look at metadata for the entire image, or for an app. The inspect command can expose environment variables, labels, the definition file, tests, and runscripts.

See singularity inspect --help for more examples:

singularity inspect carrierseq.img

{
    "org.label-schema.usage.singularity.deffile.bootstrap": "docker",
    "org.label-schema.usage.singularity.deffile": "Singularity",
    "org.label-schema.usage": "/.singularity.d/runscript.help",
    "org.label-schema.schema-version": "1.0",
    "org.label-schema.usage.singularity.deffile.from": "ubuntu:14.04",
    "org.label-schema.build-date": "2017-09-20T18:16:50-07:00",
    "BIORXIV_DOI": "https://doi.org/10.1101/175281",
    "org.label-schema.usage.singularity.runscript.help": "/.singularity.d/runscript.help",
    "org.label-schema.usage.singularity.version": "2.3.9-development.gaaab272",
    "org.label-schema.build-size": "1419MB"
}


singularity inspect --app mapping carrierseq.img
{
    "FQTRIM_VERSION": "v0.9.5",
    "SEQTK_VERSION": "v1.2",
    "BWA_VERSION": "v0.7.15",
    "SINGULARITY_APP_NAME": "mapping",
    "SINGULARITY_APP_SIZE": "9MB"
}

Overall Strategy

Since the data is rather large, we are going to bind a folder in our present working directory ($PWD) to the container’s data base, where the analysis is to run. This directory, just like the modular applications, has a known and predictable location. So our steps are going to look like this:

# 0. Make an empty random folder to bind to for data.
mkdir data

# 1. Download data, binding the data base to an empty folder on our local machine
#    (we will actually do this from the browser, as the sra toolkit has changed).
# 2. Perform the mapping step of the pipeline, binding the same folder.
# 3. Perform poisson regression on filtered reads.
# 4. Finally, sort results.

# Download data from https://www.dropbox.com/sh/vyor82ulzh7n9ke/AAC4W8rMe4z5hdb7j4QhF_IYa?dl=0
# See issue --> https://github.com/amojarro/carrierseq/issues/1
singularity run --app mapping --bind data:/scif/data carrierseq.img
singularity run --app poisson --bind data:/scif/data carrierseq.img
singularity run --app sorting --bind data:/scif/data carrierseq.img

For any of the above steps, see the singularity help --app <appname> carrierseq.img for how to customize settings and environment variables. This demo is intended to use the defaults.
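
If you do want to override a setting, the usual Singularity pattern is to pass it in through a SINGULARITYENV_-prefixed variable on the host. A hedged example (THREADS here is hypothetical; the per-app help tells you which variables each step actually reads):

SINGULARITYENV_THREADS=8 singularity run --app mapping --bind data:/scif/data carrierseq.img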

CarrierSeq Pipeline

The common user might want access to the larger-scoped pipeline that the software provides. In the case of CarrierSeq, this means (optionally) download, then mapping, poisson, and sorting. If the image isn’t provided (e.g., via a Singularity Registry or Singularity Hub), the user can build from the build recipe file, Singularity:

sudo singularity build carrierseq.img Singularity

The author is still working on updating the automated download; for now, download the data from the Dropbox link above and then move it into the data folder you intend to mount:

mv $HOME/Downloads/all_reads.fastq data/

And then the various steps of the pipeline are run as was specified above:

singularity run --app mapping --bind data:/scif/data carrierseq.img
singularity run --app poisson --bind data:/scif/data carrierseq.img
singularity run --app sorting --bind data:/scif/data carrierseq.img

1. Mapping

To run mapping, bind the data folder to the image, and specify the app to be mapping:

singularity run --app mapping --bind data:/scif/data carrierseq.img

2. Poisson

singularity run --app poisson --bind data:/scif/data carrierseq.img

3. Sorting

singularity run --app sorting --bind data:/scif/data carrierseq.img

How Can I Change It?

Given two containers that share the same input and output schema, I could swap out one of the steps:

...
singularity run --app sorting --bind data:/scif/data carrierseq.img
singularity run --app sorting --bind data:/scif/data another.img

or even provide a single container with multiple options for the same step:

...
singularity run --app sorting1 --bind data:/scif/data sorting.img
singularity run --app sorting2 --bind data:/scif/data sorting.img

As a user, you want a container that exposes enough metadata to run different steps of the pipeline, but you don’t want to need to know the specifics of each command call or path within the container. In the above, I can direct the container to my mapped input directory and specify a step in the pipeline, and I don’t need to understand how to use bwa or grep or seqtk, or any of the other software that makes up each step.

CarrierSeq Development

The developer has a different use case: to have easy command line access to the lowest level of executables installed in the container. Given a global install of all software, without SCI-F I would need to look at $PATH to see what has been added, and then list executables in the path locations to find new software installed to, for example, /usr/bin. There is no way to easily and programmatically “sniff” a container to understand the changes, and the container is a black development box, perhaps only understood by the creator or with careful inspection of the build recipe.

In this use case, we want to build the development container.

sudo singularity build carrierseq.dev.img Singularity.devel

Now when we look at apps, we see all the core software that can be combined in specific ways to produce a pipeline step like “mapping”.

singularity apps carrierseq.dev.img
bwa
fqtrim
python
seqtk
sra-toolkit

each of which can be used with run, exec (to activate the app environment), or shell to shell into the container under the context of a specific app:

# Open interactive python
singularity run --app python carrierseq.dev.img

# Load container with bwa on path
$ singularity shell --app bwa carrierseq.dev.img
$ which bwa
/scif/apps/bwa/bin/bwa

These two images, which serve equivalent software, are a powerful example of the flexibility of SCI-F. The container creator can choose the level of detail to expose to a user that doesn’t know how it was created, or who perhaps has varying levels of expertise. A lab that is using core tools for working with sequence data might prefer the development container, while a finalized pipeline distributed with a publication would favor the first. In both cases, the creator doesn’t need to write custom scripts for a container to run a particular app, or to expose environment variables, tests, or labels. By way of using SCI-F, this happens automatically.

See the original repository CarrierSeq here or the Singularity CarrierSeq here.

[Read More]

Singularity Scientific Workflow with SCI-F Example

Sep 25, 2017. | By: Vanessasaurus

Here we are going to do a comparison between a scientific analysis provided in a standard Singularity container and the same analysis implemented with the Scientific Filesystem (SCI-F). This analysis was originally used to compare Singularity vs. Docker on cloud providers via this container, with interesting results pertaining to resource usage between the different cases. If you are interested in a Docker vs. Singularity implementation, see that project. For this small example, we want to give the rationale for taking a SCI-F apps approach over a traditional Singularity image. We compare the following equivalent (but different!) implementations:



The original image is served on Singularity Hub:

singularity pull shub://vsoch/singularity-scientific-example

And the image with SCI-F app support must be built with the (not yet released) Singularity 2.4:

sudo singularity build scif.img Singularity

If you want to build locally, you can install the development branch:

git clone -b development https://www.github.com/singularityware/singularity.git
cd singularity
./autogen.sh
./configure --prefix=/usr/local
make
sudo make install

Evaluation

The containers use the same software to perform the same functions, but notably, the software and executables are organized differently, and called differently. Singularity standard (the first without SCI-F) relies on external scripts, and the container is a bit of a black box. Singularity with SCI-F has no external dependencies beyond data, and is neatly organized. So how do we evaluate this?

The aim here is to qualitatively evaluate SCI-F on its ability to expose container metadata, and information about the pipeline and executables inside. As each evaluation is scoped to the goals of the container, for this example we focus on the purpose of deploying a set of steps that encompass a pipeline to perform variant calling.

First, the goals of SCI-F:

  • Containers are consistent to allow for comparison. I am able to easily discover relevant software and data for one or more applications defined by the container creator.
  • Containers are transparent. If I discover a container and do not have any prior knowledge or metadata, the important executables and metadata are revealed to me.
  • Container contents are easily available for introspection because they are programmatically parseable. I can run a function over a container, and know exactly the functions available to me, ask for help, and know where to interact with inputs and outputs (a short sketch of this follows the list below).
  • Container internal infrastructure is modular. Given a set of SCI-F apps from different sources, I can import different install routines and have assurance that environment variables defined for each are sourced correctly for each, and that associated content does not overwrite previous content. Each software and data module must carry, minimally, a unique name and install location in the system.
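
Here is a quick sketch of what "programmatically parseable" buys you: loop over the apps a container advertises and print the help for each, with no prior knowledge of the contents. (Only the image name is illustrative; the commands are the same ones used throughout this post.)

for app in $(singularity apps scif.img); do
    echo "== ${app} =="
    singularity help --app "${app}" scif.img
done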

For each of the above, let’s define some basic tests.

1. Development Evaluation

For this use case, we are a container developer, and we are writing a singularity build recipe.

Can I easily define multiple entrypoints?

Standard: Singularity standard defaults to a single runscript per container. If I need to expose multiple functions, I either need to write a more complicated single entrypoint to direct the call, or I need to write detailed docs on different exec commands (and executables inside) for the user to run. For this real world use case, at the time when this runscript was written, SCI-F was not yet conceptualized. Given the sheer number of tools in the container, the runscript served to print a long list of executables, written to a text file at build time and added to the path:

%runscript

    if [ $# -eq 0 ]; then
        echo "\nThe following software is installed in this image:"
        column -t /Software/.info | sort -u --ignore-case        
        echo "\Note that some Anaconda in the list are modules and note executables."
        echo "Example usage: analysis.img [command] [args] [options]"  
    else
        exec "$@"
    fi

Given that this container was intended to run a scientific workflow, this list doesn’t help to make its usage transparent. It would be useful for an individual familiar with the core software, perhaps developing a specific workflow with it. Arguably, this complete listing should be provided, perhaps not as a main entrypoint, but as a separate app to print metadata, or with the software names and versions added as labels to the app(s) where they are relevant. It’s important to note that this first container did not include any logic for performing the analysis; that logic was provided by accompanying scripts. If the scripts are separated from the container, reproducing the analysis is no longer possible.

SCI-F: SCI-F has the advantage here because I can give names to different apps, and write a different executable runscript for each. I can still provide a main runscript, and it could either give the user a listing of possible options (below) or run the entire workflow the container provides. Here is a main runscript that instructs the user on how to use the apps, dynamically generating the list:

%runscript

    if [ $# -eq 0 ]; then
        echo "\nThe following software is installed in this image:"
        ls /scif/apps | sort -u --ignore-case        
        echo "Example usage: singularity --app <name> <container> [command] [args] [options]"  
    else
        exec "$@"
    fi

And here is an example app, specifically to download a component of the data:

%apprun download-reference
    mkdir -p $REF_DIR
    wget -P $REF_DIR ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/gencode.v25.transcripts.fa.gz
    gzip -d $REF_DIR/gencode.v25.transcripts.fa.gz
    wget -P $REF_DIR ftp://ftp.ensembl.org/pub/release-85/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    gzip -d $REF_DIR/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

Importantly, the logic that previously lived in external scripts is now cleanly organized into sections in the build recipe. This modular organization and easy accessibility would have been very challenging given the first container’s organization: the runscript would have needed to be long and complicated to infer what the user wanted to do, or the same functionality achieved by executing different scripts (inside the container), which is non-trivial (or at minimum more annoying to write) compared to simply writing lines into sections.


Can I easily install content scoped to modules?

Given that I have two functions for my container to perform, foo and bar, can I generate an install routine that will allow for both shared dependencies (i.e. global) and distinct dependencies?

Standard: Singularity standard has one mode to install dependencies globally: everything goes into the %post script, and any organization of required data, files, and libraries is scattered around the image. Other than coming up with a manual organization and documenting it, there is no way to cleanly define boundaries that will be easily discovered by the user. If you take a look at the standard Singularity recipe, you will see this reflected in one huge, single install procedure. As an example, for this container the tools bwa and samtools were generally used for an alignment step, and there is no way of knowing this. They are installed globally:

cd /Software
su -c 'git clone https://github.com/Linuxbrew/brew.git' singularity
su -c '/Software/brew/bin/brew install bsdmainutils parallel util-linux' singularity
su -c '/Software/brew/bin/brew tap homebrew/science' singularity
su -c '/Software/brew/bin/brew install art bwa samtools' singularity
su -c 'rm -r $(/Software/brew/bin/brew --cache)' singularity

In fact, the art tools are installed with the same manager (brew), but they belong to an entirely different step. If a research scientist (or user) were parsing this build recipe, or using an NLP algorithm that took distance into account, there would be no strong signal about how these tools are used or how they relate to one another.

SCI-F: With SCI-F, by simply defining an environment, labels, install, or runscript to be in the context of an app, the modularity is automatically generated. When I add a list of files to an app foo, I know they are added to the container’s predictable location for foo. If I add a file to bin I know it goes into foo’s bin, and is added to the path when foo is run. If I add a library to lib, I know it is added to LD_LIBRARY_PATH when foo is run. I don’t need to worry about equivalently named files under different apps getting mixed up, or being called incorrectly because both are on the path. For example, in writing these sections, a developer can make it clear that bwa and samtools are used together for alignment:

# =======================
# bwa index and align
# =======================

%appinstall bwa-index-align
    git clone https://github.com/lh3/bwa.git build
    cd build && git checkout v0.7.15 && make
    mv -t ../bin bwa bwakit

    apt-get install -y liblzma-dev
    cd .. && wget https://github.com/samtools/samtools/releases/download/1.5/samtools-1.5.tar.bz2
    tar -xvjf samtools-1.5.tar.bz2
    cd samtools-1.5 && ./configure --prefix=${SINGULARITY_APPROOT}
    make && make install

%apprun bwa-index-align
    mkdir -p $DATADIR/Bam
    bwa index -a bwtsw $DATADIR/Reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa
    bwa mem -t $NUMCORES $DATADIR/Reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa $DATADIR/Fastq/dna_1.fq.gz $DATADIR/Fastq/dna_2.fq.gz | samtools view -bhS - > $DATADIR/Bam/container.bam  

%applabels bwa-index-align
    bwa-version v0.7.15
    samtools-version v1.5

and that art is used to simulate reads:

# =======================
# simulate reads
# =======================

%apphelp simulate-reads
    Optionally set any of the following environment variables (defaults shown)
    READS (100000000)
    READ_LEN (150)
    GENOME_SIZE (3400000000)

%appenv simulate-reads
    READS=${READS:-100000000}
    READ_LEN=${READ_LEN:-150}
    GENOME_SIZE=${GENOME_SIZE:-3400000000}
    export GENOME_SIZE READ_LEN READS

%appinstall simulate-reads   
    wget https://www.niehs.nih.gov/research/resources/assets/docs/artbinmountrainier20160605linux64tgz.tgz
    tar -xzvf artbinmountrainier20160605linux64tgz.tgz 
    mv art_bin_MountRainier/* bin/
    chmod u+x bin/art_*

%apprun simulate-reads
    GENOME="$REF_DIR/Homo_sapiens.GRCh38.dna.primary_assembly.fa"
    FOLD_COVERAGE=$(python -c "print($READS*$READ_LEN/$GENOME_SIZE)")
    art_illumina --rndSeed 1 --in $GENOME --paired --len 75 --fcov $FOLD_COVERAGE --seqSys HS25 --mflen 500 --sdev 20 --noALN --out $FASTQ_DIR/dna_ && gzip $FASTQ_DIR/dna_1.fq && gzip $FASTQ_DIR/dna_2.fq

Whether I glanced at the recipe, or did some kind of text processing, I could easily see the relationships and purpose of the software in the container.


Can I associate environment and metadata with modules?

Given two different functions for my container to perform, foo and bar, can I define environment variables and labels (metadata) that I know will be sourced (environment) or exposed (inspect labels) in the context of the app?

Standard: Singularity standard also has one global, shared %environment and %labels section. If two functions in the container share the same environment variable and the value is different, this must be resolved manually. For this example, the first container didn’t have any labels or environment; in practice these global sections are usually used for high-level, global variables like the author and version (of the container). When I run the container, regardless of whether different contexts or variables are needed for executables inside, I get the same environment.

SCI-F: With SCI-F, I simply write the different variables to their sections, and have confidence that they will be sourced (environment) or inspected (labels) with clear association to the app.


%appenv run-rtg
    MEM=${MEM:-4g}
    THREADS=${THREADS:-2}
    export MEM THREADS

%applabels run-rtg
    rtg-version 3.6.2

Do I need to know standard locations in advance?

Given that a container has conformance to SCI-F, do I need to know how it works to use it?

Standard: We would be relying on Linux filesystem conventions (e.g., installation of content to /usr/local), or intuitively inferring that a folder called /Software (as with this scientific example) or /code is likely where the creator put important content.

SCI-F: Instead of requiring the user to know that an app’s base might be at /scif/apps/foo, we instead expose environment variables (e.g., SINGULARITY_APPBASE) that can be referenced at runtime to refer to different apps. This is especially important if, for example, I need to reference the output of one app as input for another, or simply know its install location. Regardless of which app is running, the container also exposes the top-level folders for all app installations and for data, SINGULARITY_APPS and SINGULARITY_DATA, at /scif/apps and /scif/data, respectively.
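
For example, a downstream app can reference the shared data folder and its own install base through these variables rather than hard-coded paths. This is only a hedged sketch: the app name, the helper script, and the output folder are illustrative, not the actual workflow's; only the bam path echoes the bwa-index-align step shown earlier.

%apprun call-variants
    # read the bam written earlier to the shared data folder, write results beside it
    INPUT="$SINGULARITY_DATA/Bam/container.bam"
    mkdir -p "$SINGULARITY_DATA/Vcf"
    bash "$SINGULARITY_APPBASE/bin/call_variants.sh" "$INPUT" "$SINGULARITY_DATA/Vcf"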

2. Production Evaluation

For this use case, we operate under the scenario that we are familiar with Singularity and the commands to use SCI-F, but we know nothing about the containers. We are a user.

Do I know what the container does?

The most natural thing to do with a Singularity container, knowing that it is possible to execute, is to do exactly that. For this evaluation, we want to assess how well executing the container reveals the intentions of the creator.

Standard: From the runscript we evaluated earlier, we are presented with a list of software and versions installed in the image, without detail about where or for what purpose. While this listing is comprehensive, it is more appropriate for a developer than for a scientific workflow. In this listing, it isn’t clear how the software is used or combined in the analysis. We are reliant on some external script or controller that drives the container. We don’t have any hints about possible analysis steps the container can serve.

The following software is installed in this image:
alabaster           0.7.9     Anaconda
anaconda            4.3.0     Anaconda
anaconda-client     1.6.0     Anaconda
anaconda-navigator  1.4.3     Anaconda
argcomplete         1.0.0     Anaconda
art                 20160605  Homebrew
astroid             1.4.9     Anaconda
astropy             1.3       Anaconda
babel               2.3.4     Anaconda
backports           1.0       Anaconda
beautifulsoup4      4.5.3     Anaconda
bitarray            0.8.1     Anaconda
blaze               0.10.1    Anaconda
bokeh               0.12.4    Anaconda
boto                2.45.0    Anaconda
bottleneck          1.2.0     Anaconda
bsdmainutils        9.0.10    Homebrew
bwa                 0.7.15    Homebrew
cairo               1.14.8    Anaconda
cffi                1.9.1     Anaconda
chardet             2.3.0     Anaconda
chest               0.2.3     Anaconda
click               6.7       Anaconda
cloudpickle         0.2.2     Anaconda
clyent              1.2.2     Anaconda
colorama            0.3.7     Anaconda

...

sqlalchemy          1.1.5     Anaconda
sqlite              3.13.0    Anaconda
statsmodels         0.6.1     Anaconda
sympy               1.0       Anaconda
terminado           0.6       Anaconda
tk                  8.5.18    Anaconda
toolz               0.8.2     Anaconda
tornado             4.4.2     Anaconda
traitlets           4.3.1     Anaconda
unicodecsv          0.14.1    Anaconda
util-linux          2.29      Homebrew
wcwidth             0.1.7     Anaconda
werkzeug            0.11.15   Anaconda
wheel               0.29.0    Anaconda
widgetsnbextension  1.2.6     Anaconda
wrapt               1.10.8    Anaconda
xlrd                1.0.0     Anaconda
xlsxwriter          0.9.6     Anaconda
xlwt                1.2.0     Anaconda
xz                  5.2.2     Anaconda
xz                  5.2.3     Homebrew
yaml                0.1.6     Anaconda
zeromq              4.1.5     Anaconda
zlib                1.2.8     Anaconda
\Note that some Anaconda in the list are modules and note executables.
Example usage: analysis.img [command] [args] [options]

SCI-F: When we run the SCI-F container, a different kind of information is presented. By listing the container contents at /scif/apps, the user knows what pipeline steps are included in the analysis.

singularity run scif.img

And in fact, this basic listing is generally useful, so it’s provided as its own command, even if the user doesn’t write a runscript at all:

singularity apps scif.img

The runscript also hints that I can ask for help for a specific app to better understand it. While both SCI-F and standard Singularity allow for specification of a global %help section, providing documentation at the level of the application is more focused and offers specificity to the user. Both also allow for global %labels that might point to documentation, code repositories, or other useful information.


Does modularity come naturally?

For this metric, we want to know if using different functions of the container (modules) is intuitive. As we have seen above, the definition of a “module” could be anything from a series of script calls that perform a pipeline step (alignment using bwa and samtools), to a single script call (such as a module just for bwa), to the same function applied to a specific set of content (e.g., download “A” vs. download “B”).

Standard: Singularity proper represents a module on the level of the container. The entire container is intended for one entrypoint, and any deviation from that requires customization of that entrypoint, or special knowledge about other executables in the container to call with exec. The container is modular only in context of being a single step in a pipeline.

SCI-F: SCI-F, in that the software inside is defined and installed in a modular fashion, makes it easy to find three different modules:

  • download-fastq
  • download-rtg
  • download-reference

and without looking further, infer that these three downloads can likely be run in parallel. While this doesn’t represent any kind of statement or assurance, it allows the container to have a natural modularity. Consider steps that are named according to an order:

  • 1-download
  • 2-preprocess
  • 3-analysis

The creator of the container, in choosing a careful naming, can deliver a huge amount of information about different modules.
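
With naming like that, driving the whole pipeline in order becomes a short loop. A hedged sketch (pipeline.img is an illustrative image name; singularity apps lists the app names, which sort in the intended order):

for app in $(singularity apps pipeline.img); do
    singularity run --app "$app" --bind data:/scif/data pipeline.img
done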


Do I know what executables are important in the container?

Without much effort, I should have a high-level understanding of the different functions that the container performs, as defined by the creator. For example, a container intended for development of variant calling will expose low-level tools (e.g., bwa, seqtk), while a container intended for running an analysis will expose pipeline steps (e.g., mapping).

Standard: For Singularity standard, if the container performs one function (and one only), then a single runscript / entrypoint is sufficient. Having multiple functions is completely reliant on the programming ability of the creator. If the creator is able to write a more complex runscript that explains different use cases, the container can satisfy this goal. If not, the container is a black box.

SCI-F: For SCI-F, without any significant programming ability, the different apps are easily exposed to the user.


Can I easily get help for an executable?

I should be able to run a command to get help or a summary of what the container does (introspection).

Standard: For Singularity standard, you can ask a container for help, but it’s a single global help.

SCI-F: For SCI-F apps, you can define help sections for all of the functions that your container serves, along with a global help.
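
A per-app help is just another section in the build recipe. A hedged sketch (the wording and example invocation are illustrative, not the container's actual help text):

%apphelp bwa-index-align
    Index the reference and align reads with bwa; a bam is written to the data folder.
    Example: singularity run --app bwa-index-align --bind data:/scif/data scif.img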

3. Research Evaluation

An important attribute of having modular software apps is that it allows for separation of files and executables for research from those that belong to the base system. From a machine learning standpoint, it provides labels for some subset of content in the container that might be used to better understand how different software relates to a pipeline. Minimally, it separates important content from the base, allowing, for example, a recursive tree generated at /scif to capture a large majority of additions to the container, or simple parsing of the build recipe to see what software (possibly outside of this location) was intended for each app. Equally important, having container software installed globally in %post also says something about it: it is perhaps needed by more than one software module, or is more of a system library.
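
In practice, that means a one-line listing recovers the research-relevant additions without knowing anything about the software inside. A hedged sketch (the image name is illustrative):

singularity exec scif.img find /scif/apps -maxdepth 1 -type d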

Conclusion

SCI-F clearly has the advantage when it comes to a container creator embedding his or her work with implied metadata about software and container contents. SCI-F also makes it easier to package different run scripts with the container, and expose them easily to the user. However, this does not mean that the standard approach of using a container as a general toolbox and distributing it with a series of external callers is bad or wrong. The choice to use (or not use) SCI-F apps is largely dependent on the goals of the creator, and the intended users.

[Read More]

Introducing SCI-F Apps

Sep 15, 2017. | By: Vanessasaurus

SCI-F refers to the Scientific Filesystem specifically designed to allow for internal modularity and reproducibility of scientific containers (or anywhere else a filesystem can be installed).

[Read More]

Container Metrics: Evaluating a container from Different Angles via SCIF Apps

Sep 13, 2017. | By: Vanessasaurus

For this next use case, a scientist is interested in running a series of metrics over an analysis of interest (the primary entry point for a reproducible container). He has been given a container with a runscript and several installed supporting metrics (SCIF apps also in the container), and knows nothing beyond that. You can find the dependency files at https://github.com/sci-f/metrics.scif.
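
In practice that looks like running the primary analysis once, then running each metric app against the same container. A hedged sketch (the image name and the app name evaluate-time are made up for illustration, not taken from metrics.scif):

singularity run metrics.img
singularity run --app evaluate-time metrics.img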

[Read More]

Hello World, Using SCI-F for Modular Software Evaluation

Sep 13, 2017. | By: Vanessasaurus

A common question pertains to evaluation of different solutions toward a common goal. An individual might ask “How does implementation ‘A’ compare to implementation ‘B’ as evaluated by one or more metrics?” For a systems admin, the metric might pertain to running times, resource usage, or efficiency. For a researcher, he or she might be interested in looking at the variability (or consistency) of outputs. Importantly, it should be possible to give a container serving such a purpose to a third party that does not know locations of executables or environment variables to load, and have the container run equivalently. SCI-F allows for this by way of providing modular software applications, each corresponding to a custom environment, libraries, and potentially files.
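
Concretely, packaging implementation “A” and implementation “B” as two apps in the same container lets a third party compare them with nothing more than the app names. A hedged sketch (the image and app names are illustrative):

time singularity run --app sort-a hello-world.img
time singularity run --app sort-b hello-world.img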

[Read More]

About

Singularity Apps are modular SCI-F apps that can be added as helpers or wrappers for Singularity containers