Summary and Setup

Data Carpentry’s aim is to teach researchers basic concepts, skills, and tools for working with data so that they can get more done in less time, and with less pain. This workshop teaches data management and analysis for genomics research including: best practices for organization of bioinformatics projects and data, use of command-line utilities, use of command-line tools to analyze sequence quality and perform variant calling, and connecting to and using cloud computing. This workshop is designed to be taught over two full days of instruction.

Please note that workshop materials for working with Genomics data in R are in “alpha” development. These lessons are available for review and for informal teaching experiences, but are not yet part of The Carpentries’ official lesson offerings.

Interested in teaching these materials? We have an onboarding video and accompanying slides available to prepare Instructors to teach these lessons. After watching this video, please contact team@carpentries.org so that we can record your status as an onboarded Instructor. Instructors who have completed onboarding will be given priority status for teaching at centrally-organized Data Carpentry Genomics workshops.

Frequently Asked Questions

Read our FAQ to learn more about Data Carpentry’s Genomics workshop, as an Instructor or a workshop host.

Getting Started

This lesson assumes that learners have no prior experience with the tools covered in the workshop. However, learners are expected to have some familiarity with biological concepts, including the concept of genomic variation within a population. Participants should bring their own laptops and plan to participate actively.

To get started, follow the directions in the Setup tab to get access to the required software and data for this workshop.

Data

This workshop uses data from a long-term evolution experiment published in 2016: Tempo and mode of genome evolution in a 50,000-generation experiment by Tenaillon O, Barrick JE, Ribeck N, Deatherage DE, Blanchard JL, Dasgupta A, Wu GC, Wielgoss S, Cruveiller S, Médigue C, Schneider D, and Lenski RE. (doi: 10.1038/nature18959)

All of the data used in this workshop can be downloaded from Figshare. More information about this data is available on the Data page.

Workshop Overview

Lesson	Overview
Project organization and management	Learn how to structure your metadata, organize and document your genomics data and bioinformatics workflow, and access data on the NCBI sequence read archive (SRA) database.
Introduction to the command line	Learn to navigate your file system, create, copy, move, and remove files and directories, and automate repetitive tasks using scripts and wildcards.
Data wrangling and processing	Use command-line tools to perform quality control, align reads to a reference genome, and identify and visualize between-sample variation.
Introduction to cloud computing for genomics	Learn how to work with Amazon AWS cloud computing and how to transfer data between your local computer and cloud resources.

Optional Additional Lessons

Lesson	Overview
Intro to R and RStudio for Genomics	Use R to analyze and visualize between-sample variation.

Teaching Platform

This workshop is designed to be run on pre-imaged Amazon Web Services (AWS) instances. All the software and data used in the workshop are hosted on an Amazon Machine Image (AMI). If you want to run your own instance of the server used for this workshop, follow the directions in the Setup tab.

Common Schedules

This page lists three schedules: Schedule A (2 days or 4 half days), Schedule B (2 days or 4 half days), and Schedule C (3 days or 6 half days).

Overview

This workshop is designed to be run on pre-imaged Amazon Web Services (AWS) instances. We will be providing you with an AWS instance. To access your AWS instance, some additional software, detailed below, may need to be installed on your computer.

Required additional software

This lesson requires a working spreadsheet program. If you have a working spreadsheet program installed on your computer, such as Microsoft Excel or LibreOffice (a free, open source spreadsheet program), you can use that. Otherwise, you can use Google Sheets. Either option will work well for this workshop.
On Windows, you will also need to install one of Git Bash, PuTTY, or the Ubuntu Subsystem. Instructions are below.

Windows users only: Setting up software you can use to connect to your cloud computer

Open your Command Prompt app by searching for “cmd”. At the command prompt, type ssh. Confirm that this prints out the usage information for the ssh command. If the result is “Command not found” then you have a few options:

Set up the Ubuntu Subsystem for Windows. This option is only available for Windows 10; detailed instructions are available at https://docs.microsoft.com/en-us/windows/wsl/install.
Download the Git for Windows installer. Run the installer and follow the steps below:
- Click on “Next” four times (two times if you’ve previously installed Git). You don’t need to change anything in the Information, location, components, and start menu screens.
- From the dropdown menu select “Use the Nano editor by default” (NOTE: you will need to scroll up to find it) and click on “Next”.
- On the page that says “Adjusting the name of the initial branch in new repositories”, ensure that “Let Git decide” is selected. This will ensure the highest level of compatibility for our lessons.
- Ensure that “Git from the command line and also from 3rd-party software” is selected and click on “Next”. (If you don’t do this, Git Bash will not work properly, and you will need to remove the Git Bash installation, re-run the installer, and select the “Git from the command line and also from 3rd-party software” option.)
- Ensure that “Use the native Windows Secure Channel Library” is selected and click on “Next”.
- Ensure that “Checkout Windows-style, commit Unix-style line endings” is selected and click on “Next”.
- Ensure that “Use Windows’ default console window” is selected and click on “Next”.
- Ensure that “Default (fast-forward or merge)” is selected and click on “Next”.
- Ensure that “Git Credential Manager Core” is selected and click on “Next”.
- Ensure that “Enable file system caching” is selected and click on “Next”.
- Click on “Install”.
- Click on “Finish”.
- Check the settings for your “HOME” environment variable.
  - If your “HOME” environment variable is not set (or you don’t know what this is):
    - Open command prompt (open the Start Menu, then type cmd and press [Enter]).
    - Type the following line into the command prompt window exactly as shown: setx HOME "%USERPROFILE%"
    - Press [Enter]; you should see SUCCESS: Specified value was saved.
    - Quit command prompt by typing exit and then pressing [Enter].
Another option is to install the MobaXterm desktop app. Please follow the download instructions at mobaxterm.mobatek.net to download the free edition.

Data

The data used in this workshop is available on FigShare. Because this workshop works with real data, be aware that file sizes for the data are large. Please read the FigShare page for information about the data and access to the data files.

More information about these data will be presented in the first lesson of the workshop.

Software

Software	Version	Manual	Available for	Description
FastQC	0.11.9	Link	Linux, MacOS, Windows	Quality control tool for high throughput sequence data.
Trimmomatic	0.39	Link	Linux, MacOS, Windows	A flexible read trimming tool for Illumina NGS data.
BWA	0.7.17	Link	Linux, MacOS	Mapping DNA sequences against reference genome.
SAMtools	1.9	Link	Linux, MacOS	Utilities for manipulating alignments in the SAM format.
BCFtools	1.9	Link	Linux, MacOS	Utilities for variant calling and manipulating VCFs and BCFs.
IGV	Link	Link	Linux, MacOS, Windows	Visualization and interactive exploration of large genomics datasets.

QuickStart Software Installation Instructions

These are the QuickStart installation instructions. They assume familiarity with the command line and with installation in general. As there are different operating systems and many different versions of operating systems and environments, these may not work on your computer. If an installation doesn’t work for you, please refer to the user guide for the tool, listed in the table above.

We have installed software using Conda. Conda is a package manager that simplifies the installation process. Please first install Conda through the Miniconda installer (see below) before proceeding to the installation of individual tools. For more information on Miniconda, please refer to the Conda documentation.

Conda

Linux

To install Conda, type:

BASH

$ curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh

Then, follow the instructions that you are prompted with on the screen to install Conda.

MacOS

To install Conda, type:

BASH

$ curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
$ bash Miniconda3-latest-MacOSX-x86_64.sh

Then, follow the instructions that you are prompted with on the screen to install Conda.
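
The Linux and MacOS commands above differ only in the installer file name. As a rough sketch, a small helper could pick the file name from the output of `uname -s`. The function name `miniconda_installer` is hypothetical, it covers only the two installers named on this page, and it prints the download command instead of running it.

```shell
# Hypothetical helper: map an OS name (as printed by `uname -s`)
# to one of the two installer files named on this page.
miniconda_installer() {
    case "$1" in
        Linux)  echo "Miniconda3-latest-Linux-x86_64.sh" ;;
        Darwin) echo "Miniconda3-latest-MacOSX-x86_64.sh" ;;
        *)      echo "unsupported OS: $1" >&2; return 1 ;;
    esac
}

# Show (but do not run) the download command for the current machine.
installer=$(miniconda_installer "$(uname -s)") &&
    echo "curl -O https://repo.anaconda.com/miniconda/$installer"
```

On other systems the helper reports an unsupported OS, in which case the Conda documentation is the place to look.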

FastQC

MacOS

To install FastQC, type:

BASH

$ conda install -c bioconda fastqc=0.11.9

FastQC Source Code Installation

If you prefer to install from source, follow the directions below:

BASH

$ cd ~/src
$ curl -O http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
$ unzip fastqc_v0.11.9.zip

Link the fastqc executable to the ~/bin folder that you have already added to the path.

BASH

$ ln -sf ~/src/FastQC/fastqc ~/bin/fastqc
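
The source installation commands assume that ~/src and ~/bin already exist and that ~/bin is on your path, but this page does not show that step. A minimal sketch of one way to prepare them, run before the commands above, could look like this. It changes the path for the current session only.

```shell
# Create the directories that the source installation steps assume.
mkdir -p "$HOME/src" "$HOME/bin"

# Add ~/bin to PATH for this session, unless it is already listed.
case ":$PATH:" in
    *":$HOME/bin:"*) ;;                          # already present, nothing to do
    *) PATH="$HOME/bin:$PATH"; export PATH ;;
esac

echo "PATH now starts with: ${PATH%%:*}"
```

To keep the change across sessions, the same `PATH` line would need to go into a shell startup file such as ~/.bashrc.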

Due to what seems to be a packaging error, the executable flag on the fastqc program is not set, so we need to set it ourselves.

BASH

$ chmod +x ~/bin/fastqc

Test your installation by running:

BASH

$ fastqc -h

Trimmomatic

MacOS

BASH

$ conda install -c bioconda trimmomatic=0.39

Trimmomatic Source Code Installation

If you prefer to install from source, follow the directions below:

BASH

$ cd ~/src
$ curl -O http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
$ unzip Trimmomatic-0.39.zip

The program can be invoked via:

$ java -jar ~/src/Trimmomatic-0.39/trimmomatic-0.39.jar

The ~/src/Trimmomatic-0.39/adapters/ directory contains Illumina specific adapter sequences.

BASH

$ ls ~/src/Trimmomatic-0.39/adapters/

Test your installation by running the following (assuming Trimmomatic is installed in ~/src):

BASH

$ java -jar ~/src/Trimmomatic-0.39/trimmomatic-0.39.jar

Simplify the Invocation, or Test Your Installation if You Installed with Miniconda3:

To simplify the invocation you could also create a script in the ~/bin folder:

BASH

$ echo '#!/bin/bash' > ~/bin/trimmomatic
$ echo 'java -jar ~/src/Trimmomatic-0.39/trimmomatic-0.39.jar "$@"' >> ~/bin/trimmomatic
$ chmod +x ~/bin/trimmomatic

Test your script by running:

BASH

$ trimmomatic

BWA

MacOS

BASH

$ conda install -c bioconda bwa=0.7.17=ha92aebf_3

BWA Source Code Installation

If you prefer to install from source, follow the instructions below:

BASH

$ cd ~/src
$ curl -OL http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.17.tar.bz2
$ tar jxvf bwa-0.7.17.tar.bz2
$ cd bwa-0.7.17
$ make
$ export PATH=~/src/bwa-0.7.17:$PATH

Test your installation by running:

BASH

$ bwa

SAMtools

MacOS

BASH

$ conda install -c bioconda samtools=1.9=h8ee4bcc_1

SAMtools Versions

SAMtools has changed its command-line invocation (for the better), which means that many tutorials on the web show older, obsolete usage.

Use SAMtools version 1.9 so that the commands presented in these lessons work as written.
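
One way to confirm which version is installed is to read the first line of `samtools --version`, which has the form "samtools 1.9". The sketch below assumes that output format. The helper name `samtools_version` is hypothetical.

```shell
# Hypothetical helper: read `samtools --version` output on stdin and
# print the version number from its first line (e.g. "samtools 1.9").
samtools_version() {
    sed -n '1s/^samtools \([0-9][0-9.]*\).*/\1/p'
}

if command -v samtools >/dev/null 2>&1; then
    found=$(samtools --version | samtools_version)
    if [ "$found" = "1.9" ]; then
        echo "samtools $found matches the lesson version"
    else
        echo "samtools $found found; the lessons were written for 1.9"
    fi
else
    echo "samtools is not on the PATH"
fi
```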

SAMtools Source Code Installation

If you prefer to install from source, follow the instructions below:

BASH

$ cd ~/src
$ curl -OkL https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2
$ tar jxvf samtools-1.9.tar.bz2
$ cd samtools-1.9
$ make

Add directory to the path if necessary:

BASH

$ echo 'export PATH="$HOME/src/samtools-1.9:$PATH"' >> ~/.bashrc
$ source ~/.bashrc

Test your installation by running:

BASH

$ samtools

BCFtools

MacOS

BASH

$ conda install -c bioconda bcftools=1.9

BCFtools Source Code Installation

If you prefer to install from source, follow the instructions below:

BASH

$ cd ~/src
$ curl -OkL https://github.com/samtools/bcftools/releases/download/1.9/bcftools-1.9.tar.bz2
$ tar jxvf bcftools-1.9.tar.bz2
$ cd bcftools-1.9
$ make

Add directory to the path if necessary:

BASH

$ echo 'export PATH="$HOME/src/bcftools-1.9:$PATH"' >> ~/.bashrc
$ source ~/.bashrc

Test your installation by running:

BASH

$ bcftools
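
Once everything is installed, a single check across all of the command-line tools can save running each test separately. The following is a sketch only. `check_tools` is a hypothetical helper, and it assumes that each tool is reachable under the command name used on this page (for Trimmomatic, that means the Conda install or the wrapper script in ~/bin).

```shell
# Hypothetical helper: print one status line per tool and
# return non-zero if any of them is not on the PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok       $tool"
        else
            echo "MISSING  $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools fastqc trimmomatic bwa samtools bcftools ||
    echo "Some tools are not on the PATH yet."
```

A tool reported as MISSING may still be installed; it may only need its directory added to the path, as described in the sections above.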
