1.1 Lab Exercises

Overview

In this lab, we will get familiar with our new Virtual Machine (VM) terminal. If you’ve got unix and command line experience, great! If you don’t, or it’s been a while, don’t stress. Think of this VM like a personal sandbox. It’s yours, it never goes away, and you can’t break anything (too badly).

We will do four major things in this lab:

  • Get familiar with your new VM

  • Learn some basic unix commands: ls, cd, mkdir, pwd

  • Download a genome from TAIR

  • Explore that genome with grep

Give it a go, be patient, and ask questions.

“Roads were made for journeys, not destinations” - Confucius

Task A: Get comfortable on the command line

Step 1. Where am I?

Help! I’m lost! Where am I? The UNIX command pwd is your lighthouse.

pwd

If this tree-like directory structure doesn’t make sense, stop and review Module 1 of LinuxSurvival

Use ls to list the files in your current directory:

ls

Add a “flag” to ls to see more information about every file. -l stands for “long format”.

ls -l

What are all the flags you can use for a given command? Read the manual for every UNIX command by using the man command:

man ls

Step 2. Make a new directory for this lab

We use mkdir to make a new directory (folder). The usage is two parts:

mkdir <dirname>

Replace <dirname> with what you want to call this directory. Make sure it is is one word, no spaces. I’ll use “Lab1” but organize your life however you’d like to.

mkdir Lab1

There’s a nice trick we can use to speed up our command line life, called tab completion. The tab key is your best friend in UNIX; it is similar to how Google will try and autocomplete text for you while you’re typing into the search bar. If you start typing a filename in UNIX, and press the tab key, UNIX will try to complete the filename or path for you as long as it is unique.

We want to change directories into Lab1 now using the cd command, but we also want to be lazy. We could type out the full command:

cd Lab1

Or, we could just type:

cd La

and then press the tab key to complete the word. Try it, and press enter to execute the cd command.

Did it work? Use pwd to see where you are.

This trick works with just about anything you’re typing, like programs, filenames, scripts, and commands.

Task B: Download the Arabidopsis thaliana genome from TAIR

Arabidopsis is a powerful model for plant biology. It is not perfect, and is not useful in every situation. After all, there are >300,000 species of land plants on the planet, so how could one species possibly be useful to understanding another?

Arabidopsis thaliana plant

Image source: Plantlet.org, Credit: Eric Belfield

Step 1. Download the genome for Arabidopsis thaliana

The unix command wget allows us to fetch data from servers. Not every UNIX command means something, but wget’s name is derived from World Wide Web + get = wget. Here’s how we use it:

wget https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas

That’s it, just two parts: wget [path-to-what-we-want-to-fetch]

Step 2. Let’s see what the genome looks like

Use the command less to open up the FASTA file:

less TAIR10_chr_all.fas

This is what FASTA format looks like. FASTA format contains two major parts:

  1. A header that starts with “>” and includes information about

  2. The sequence on the next line(s). Sometimes the header can have information about the chromosome number (as you see here). Other genomes are not so perfect, and might be in hundreds or thousands of pieces.

Just like in Microsoft Word, you can use another UNIX program to find words or characters. This is really helpful if we just want to look at every line that has a FASTA header with the “>” character.

grep ">" TAIR10_chr_all.fas

The Arabidopsis genome is incredibly high quality, since people have been improving it for nearly 20 years. You should see FASTA headers for 5 nuclear chromosomes, one chloroplast genome, and one mitochondrial genome.

Step 3. View gene annotation sequences in a FASTA file

Use your new set of UNIX vocabulary to download the peptide sequences for Arabidopsis. Here’s the link:

https://www.arabidopsis.org/download_files/Sequences/Araport11_blastsets/Araport11_genes.202106.pep.fasta.gz

This file ends in “.gz”. This means that it is compressed using a program called gzip. This is a very common and nifty compression tool, just like .zip files on Windows and MacOS. To decompress this file, all we need to do is:

gzip -d filename

The -d flag means “decompress”. What if we want to compress something?

gzip filename

Mastering Content

Step 1

Count the number of genes in the Arabidopsis peptide fasta file.

Hint: You know how to use grep now. Is there a flag you can add to grep that will count things for you? Use man and/or Google. If you get stuck, rely on your colleagues, friends, and classmates in the discussion forum — this is real life, after all.

Step 2

Plants have canonical repeat motifs at their telomeres, usually “CCCTAAA” for most monocots and eudicots (side note: monocots in the Asparagales order often have “CCCTAA” telomere repeats, like humans).

Count the number of times that the string “CCCTAAA” occurs in the genome fasta file. Is this a robust way to measure of the length of telomeres in Arabidopsis?