An Introduction to Linux & Commandline

Molb 4485/5485 -- Computers in Biology

Nicolas Blouin and Vikram Chhatre

Wyoming INBRE Bioinformatics Core
Dept. of Molecular Biology
University of Wyoming
nblouin@uwyo.edu
vchhatre@uwyo.edu
http://www.uwyo.edu/ibc

Table of Contents

  1. The Terminal
  2. Connecting to Remote Server
  3. Where Am I?
  4. Linux Directory Structure
  5. Changing Directories
  6. Making Directories
  7. Making & Editing Files
  8. Moving Directories & Files
  9. Copying Directories
  10. Viewing Directory Contents
  11. Removing Files and Directories
  12. Display File Contents
  13. File & Folder Permissions
  14. Your First Shell Script
  15. Unix Power Commands
  16. Saving Your Work

1. The Terminal

This is the common name for the application that gives you text-based access to the computer's operating system: it lets you type commands and see the output of the programs you run. Unix-based machines always include a terminal program.

Some brief tips before we move further:

2. Connecting To Remote Server

Often in bioinformatics you will be performing tasks on a remote network server, one much more powerful than your workstation. Let's use the terminal program you just learned about to connect to such a server. We will use the Secure SHell (SSH) protocol to talk to the remote server. In the following example, replace username with the user name provided to you.

Two-Factor Authentication

To improve computer security, the University of Wyoming now requires two-factor authentication to grant access to network servers. Two-factor means a password plus a second type of authentication. You all have YubiKeys (small USB devices) that generate the token needed for this.

    $ ssh username@mtmoran.uwyo.edu
                         TWO-FACTOR AUTHENTICATION
=============================================================================
This system requires two-factor authentication.

The password requirement is your UWYO domain password.

The token can be generated by your registered YubiKey or manually input with
the Duo mobile app. If you have questions about using this implementation of
two-factor authentication, contact the ARCC team at arcc-info@uwyo.edu

Please enter the two-factor password in the form:

                            <password>,<token>

=============================================================================
    $ wyoinbre,<Press YubiKey Gold Button>
    [username@mmmlog1 ~]$

3. Where Am I (on the filesystem)?

As you will learn, everything in Linux is relative to where you are in the file system, so knowing where you are before launching a command is valuable information. Luckily, there are built-in commands for exactly this. Understanding the location of files will be a key part of success.
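The built-in command for this is pwd (print working directory). It takes no arguments and prints the absolute path of the directory you are currently in; right after logging in, it will report your home directory.

```shell
# Print the absolute path of the current working directory.
pwd
```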

4. Linux Directory Structure

Linux files are arranged in a hierarchical structure, or a directory tree. From the root directory (/) there are many subdirectories. Each subdirectory can contain files or other subdirectories, and so on. Whenever you're using the terminal you are always 'in' a directory. Opening a terminal window, or logging into a remote computer, places you in your 'home' directory by default; this is what happened when we logged into Mt Moran. The home directory contains files and directories that only you can modify; we will get to those permissions later.

To see what files, or directories, you have in your home directory we will use the ls command.

5. Changing Directories

To move between directories (folders) we use the cd (change directory) command. We are currently in our home directory. Let's move to /project/inbre-train/username/LearnLinux. The cd command uses the following syntax:

    $ cd DIRECTORY
    $ cd /project/inbre-train/username/LearnLinux
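Paths given to cd can be absolute (starting from the root /) or relative to where you are. A few forms worth knowing, sketched here with /tmp simply because it exists on every Unix system:

```shell
cd /tmp   # absolute path: starts from the root (/)
pwd
cd ..     # relative path: .. means the parent directory, here /
pwd
cd        # cd with no argument always returns you to your home directory
pwd
```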

6. Making Directories

Creating directories in Unix is done with the mkdir (make directory) command.

    $ mkdir DirectoryName

Using spaces in directory names, as you might on your desktop, is not advised in the Unix file system; this is why you see _ used in place of spaces. You can escape a space in Unix, but doing so means extra typing and can cause problems when running certain programs. In general, avoid spaces in file and directory names.
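A short sketch (the directory names here are just examples). The -p flag is also worth knowing: it creates nested directories in one step.

```shell
# Underscores instead of spaces keep the name a single argument.
mkdir My_Sequences

# -p creates intermediate directories as needed and does not
# complain if a directory already exists.
mkdir -p Project/Data/Raw
```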

7. Making and Editing Files

In this section you will learn the basics of making files and putting things into those files. There are a variety of ways we can accomplish this as Unix has built in multiple editors for these tasks. We will review a few here.

    $ touch FILENAME

This will create a new, empty file.

    $ nano FILENAME 

This is a built in text editor that will allow us to put information into a file.

8. Moving Directories and Files

To move a file or directory, the mv (move) command is used. This is the first command we have seen that requires two arguments: you must specify a source and a destination.

    $ mv SOURCE DESTINATION
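When the source and destination sit in the same directory, mv is also how files are renamed. A minimal sketch with made-up file names:

```shell
touch notes.txt             # an empty file to practice on
mkdir Archive               # and a destination directory

mv notes.txt notes_old.txt  # same directory: this renames the file
mv notes_old.txt Archive/   # different directory: this moves it
ls Archive
```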

9. Copying Directories

To copy a file or directory, the cp (copy) command is used. Just like mv, it takes a source and a destination.

    $ cp SOURCE DESTINATION
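One difference from mv: copying a directory requires the -r (recursive) flag, so that everything inside it is copied too. A sketch with made-up names:

```shell
mkdir Results
touch Results/run1.log

# Without -r, cp refuses to copy a directory.
cp -r Results Results_backup
ls Results_backup
```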

10. Viewing Directory Contents

To view the contents of directories we use the ls (list) command.

    $ ls DIRECTORY

If no directory is provided ls will list the contents of the current directory.
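A few commonly used ls flags (these can be combined):

```shell
ls -l    # long listing: permissions, owner, size, modification time
ls -a    # also show hidden files (names starting with a dot, e.g. .bashrc)
ls -lh   # long listing with human-readable sizes (K, M, G)
```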

11. Removing Files and Directories

Caution: This is a dangerous command. File/folder deletion in Unix is permanent and irreversible.

If you run ls on your LearnLinux/Work/ directory, it is probably full of empty files and directories by this point. Wouldn't it be nice if there were a way to clean that up? There is, but it can be dangerous. To delete files and directories from the system we have two options: the rm (remove) and rm -r commands.

    $ rm FILE

One more time, just to be clear: it is possible to delete EVERY file you have ever created with the rm command. Thankfully there is a way to make rm a bit safer, and on DT2 this is the default setting. With the -i flag, rm will ask for confirmation before deleting anything.
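A sketch with throwaway names (the confirmation prompt shown in the comment is typical GNU wording and may differ slightly on other systems):

```shell
touch junk.txt
mkdir Old_Work

rm junk.txt      # deletes the file immediately, with no confirmation
rm -r Old_Work   # -r (recursive) is required to delete a directory

# With -i, rm asks first; answer y or n at the prompt:
#   rm -i important.txt
#   rm: remove regular file 'important.txt'? n
```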

12. Display File Contents

There are various commands available to display the contents of a file; by default they all print to the terminal. These commands are less, cat, head, and tail.

    $ less FILENAME 

Displays file contents on the screen with line scrolling (to scroll you can use 'arrow' keys, 'PgUp/PgDn' keys, 'space bar' or 'Enter' key). Press 'q' to exit.

    $ cat FILENAME

The simplest way of displaying contents: cat (concatenate) prints the entire file to the screen. For large files, the whole file will scroll past without pausing.

    $ head FILENAME

By default, displays only the first 10 lines of a file. A different number of lines can be requested with the -n flag followed by the number of lines.

    $ tail FILENAME 

As the name implies, this is the opposite of head: it displays the last 10 lines. Again, the -n option can be used to change the number.
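A quick way to see head and tail in action is to make a small numbered file first (seq prints a sequence of numbers, one per line):

```shell
seq 1 5 > numbers.txt    # a five-line file: 1 2 3 4 5

head -n 2 numbers.txt    # first two lines: 1 and 2
tail -n 2 numbers.txt    # last two lines: 4 and 5
```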

13. File & Folder Permissions

All files in any operating system have a set of permissions that define what can be done with the file and by whom. What = read, write (modify), and/or execute. Whom = user, group, or others. These permissions are denoted with the following syntax:

Permissions
Read: r
Write: w
Execute: x

Relations
User: u
Group: g
Others: o
All users: a

Changing permissions is done via the chmod (change mode) command:

    $ chmod [OPTIONS] RELATIONS[+ or -]PERMISSIONS FILE
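For example, to let yourself execute a script (the file name here is hypothetical):

```shell
touch analysis.sh

chmod u+x analysis.sh   # add (+) execute (x) permission for the user (u)
chmod go-w analysis.sh  # remove (-) write (w) permission from group and others
ls -l analysis.sh       # the first column now includes an x for the user
```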

14. Your First Shell Script

Just like Perl, Python, R, or C++, BASH (Bourne Again SHell) is a programming language that works on Unix and Unix-like computers (Linux, macOS, BSD, etc.). All the commands you have been passing to the terminal are in fact being executed by bash, the command shell. A shell script is simply a collection of bash commands that are executed sequentially. To make a script, we write shell commands into a file and then treat that file like any other program or command.

  # This is my first shell script.
  echo "Hello World!"
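To run the script, save it to a file (hello.sh is our choice of name here), then either hand it to bash or make it executable. The #!/bin/bash first line, the "shebang", tells the system which interpreter should run the file.

```shell
# Write the script to a file called hello.sh.
printf '#!/bin/bash\necho "Hello World!"\n' > hello.sh

bash hello.sh        # run it through bash directly...
chmod u+x hello.sh   # ...or make it executable
./hello.sh           # and run it like any other program
# Hello World!
```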

15. Unix Power Commands

The commands you have learned so far are essential for working in Unix, but on their own they don't let you do anything very powerful. The following sections introduce commands that begin to show the real power of Unix.

15.1 Pipes & Redirects

Everything we have done so far has sent the result of the command to the screen. This is fine when the data being displayed is small enough to fit the screen, or when it is the endpoint of your analysis. But for large outputs, or when you need a new file, printing to the screen isn't very useful. Unix has built-in methods to handle command output using the > (greater than), < (less than), and >> (append) operators.

    # Creates a new file (file2) with same contents as old file (file1) 
    $ cat FILE1 > FILE2 

    
    # Appends the contents for file1 to file2, equivalent to opening file1, 
    # copying all the contents, pasting the copied contents to the end of 
    # the file2 and saving it! 
    $ cat FILE1 >> FILE2 

    $ cat FILE1 | less 

Here, the cat command reads the contents of FILE1, but instead of sending them to standard output (the screen) it sends them through the pipe (|) to the next command, less, so that the contents are displayed with line scrolling.

From the LearnLinux/Data/ directory

    $ cat seq.fasta

    $ head seq.fasta > new.txt

    $ cat new.txt

    $ tail seq.fasta > new.txt

    $ cat new.txt

Now lets try that with the append option.

    $ head -n 1 seq.fasta > new.txt

    $ tail -n 1 seq.fasta >> new.txt

15.2 Searching with grep

grep (globally search a regular expression and print) is one of the most useful commands in Unix. It is commonly used to filter a file or input, line by line, against a pattern.

    $ grep [OPTIONS] PATTERN FILENAME 

Like any other command, grep has many options; run man grep for the full list. The most useful include -c (count matching lines instead of printing them), -i (ignore case), -v (invert the match, printing non-matching lines), and --color (highlight the matched text).

Here is a typical scenario for grep:

You might already know that FASTA header lines must start with a > character, with the DNA or protein sequence on the lines that follow. To find only the header lines in a FASTA file, we can use grep.
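A self-contained sketch (the two-record FASTA file here is made up). Note that the > must be quoted: unquoted, the shell would treat it as a redirect and overwrite a file.

```shell
# Build a tiny FASTA file to search.
printf '>seq1 gene A\nATGCGT\n>seq2 gene B\nTTAGCC\n' > demo.fasta

grep ">" demo.fasta      # print only the two header lines
grep -c ">" demo.fasta   # -c prints the count of matching lines: 2
```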

15.3 Regular Expressions

grep + regular expressions (also called regex) = power! Before we get into this let's start with a task.

TASK

The '.' and '*' characters are also special characters that form part of the regular expression. Try to understand how the following patterns all differ. Try using each of these patterns with grep -c against any one of the sequence files. Can you predict which of the four patterns will generate the most matches?

    ACGT 
    AC.GT 
    AC*GT 
    AC.*GT 

The asterisk in a regular expression is similar to, but NOT the same as, the other asterisks that we have seen so far. An asterisk in a regular expression means: match zero or more of the preceding character or pattern. Try searching for the following patterns to ensure you understand what '.' and '*' are doing:

    A...T 
    AG*T 
    A*C*G*T*

When working with sequences (protein or DNA) we are often interested in whether a particular feature is present: a start codon, a restriction site, or a motif. In Unix, any string of text that follows a pattern can be searched for using regular expressions. As you learned above, regular expressions consist of normal and meta characters. Commonly used characters include:


Expression   Function
.            matches any single character
$            matches the end of a line
^            matches the beginning of a line
*            matches zero or more occurrences of the preceding character
\            quoting character; treat the character that follows as an ordinary character
[]           matches any one of the characters between the brackets
[range]      matches any character in the range
[^range]     matches any character except those in the range
\{N\}        matches exactly N occurrences of the preceding character, where N is a number
\{N1,N2\}    matches at least N1 but not more than N2 occurrences of the preceding character
?            matches zero or one occurrence of the preceding character (extended regex)
|            alternation: (this|that) matches either this or that (extended regex)



Here are some common regex patterns for Nucleotide/Protein searches:


Pattern                         Matches
^ATG                            a line starting with ATG
TAG$                            a line ending with TAG
^A[TGC]G                        ATG, AGG, or ACG at the start of a line
TA[GA]$                         TAG or TAA at the end of a line
^A[TGC]G.*TGTGAACT.*TA[GA]$     a gene-like line containing a specific motif (here TGTGAACT)
[YXN][MPR]_[0-9]\{4,9\}         NCBI RefSeq transcript IDs (e.g. XM_012345)
\(NP\|XP\)_[0-9]\{4,9\}         NCBI RefSeq protein IDs

Let's use grep to find a zinc finger motif. For simplicity, let's assume the zinc finger motif to be CXXCXXXXXXXXXXXXHXXXH. You can either use dots to represent any amino acid, or use more complex regular expressions to come up with a more representative pattern.

    $ grep --color "C..C............H...H" At_proteins.fasta 
    $ grep --color "C.\{2\}C.\{12\}H.\{3\}H" At_proteins.fasta
    $ grep --color "C[A-Z][A-Z]C[A-Z]\{12\}H[A-Z][A-Z][A-Z]H" At_proteins.fasta

These all do exactly the same thing. As you can see, regular expressions can be very useful for finding patterns of all kinds.

UNIX Tip: You can use regular expressions in grep, sed, awk, less, perl, python, and certain text editors; almost any programming language or tool can use the power of regex.



15.4 SED Stream Editor

sed is a stream editor: it reads one or more text files, makes the requested changes or edits, and writes the results to standard output. The basic syntax for sed is:

    $ sed 'OPERATION/REGEXP/REPLACEMENT/FLAGS' FILENAME

Above, / is the delimiter, but you can use _, |, or : as well.

OPERATION = the action to be performed, the most common being s which is for substitution.

REGEXP and REPLACEMENT = the search term and the substitution for the operation be executed.

FLAGS = additional parameters that control the operation.

Common FLAGS include:

g   replace all the instances of REGEXP with REPLACEMENT (globally)
n   (n=any number) replace nth instance of the REGEXP with REPLACEMENT
p   If substitution was made, then prints the new pattern space
i   ignores case for matching REGEXP
w   If substitution was made, write out the result to the given file
d   used as an operation ('/REGEXP/d') rather than a flag: deletes lines matching REGEXP

These two commands do the same thing with different delimiters: we changed "Chr1" to "Chromosome_1" in the file. However, the change was not permanent. To keep it, we must either write to a new file or use a flag within sed.

    $ touch greetings.txt
    $ echo "Hello there" >> greetings.txt
    $ head greetings.txt

Now we have our file to manipulate with sed. We have three options for altering and saving the file:

Option 1: Make a new file

    $ sed 's/Hello/Hi/g' greetings.txt > greetings_short.txt
    $ head greetings*

Option 2: Edit in place but make a backup of the original with the given extension

    $ sed -i.bak 's/Hello/Hi/g' greetings.txt
    $ head greetings*

Option 3: Edit in place without a backup. NOTE: if the system runs out of memory or an error occurs mid-edit, the original file can be lost for good.

    $ sed -i 's/Hello/Hi/g' greetings.txt.bak
    $ head greetings*


15.5 Word Count

wc (word count) is a useful command in bioinformatics because it quickly reports how many lines, words, and bytes a file contains.

    $ wc FILENAME
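By default wc reports lines, words, and bytes; individual counts can be requested with -l, -w, or -c. A sketch on a generated file:

```shell
seq 1 100 > lines.txt   # a file with exactly 100 lines

wc lines.txt      # lines, words, bytes, and the file name
wc -l lines.txt   # -l reports just the line count
```

A common bioinformatics use: a FASTQ file stores four lines per read, so wc -l divided by four gives the number of reads.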


15.6 Sorting Files

The sort command arranges the lines of a file. The simplest way to use it is:

    $ sort FILE1 > SORTED_FILE1

sort has these commonly used flags: -n (sort numerically), -r (reverse the order), -k (sort by a given field/column), and -t (set the field separator).

TASK

The LearnLinux/Data/Sequences directory contains numerically labeled files. By default Unix sorts alphabetically, not numerically, so they are listed as Seq1.fa, Seq10.fa, Seq11.fa, etc. To sort them in an easy-to-read order, try:

    $ ls | sort -t 'q' -k 2n

This command lists all the files in the Sequences/ directory and passes the list to sort. With -t 'q' each file name is split at the letter q, and -k 2n sorts numerically on the second field, i.e. the number that follows 'Seq'.

Try using sort on Data/Arabidopsis/At_genes.gff

    $ sort -r -k 1 At_genes.gff 
    $ sort -r -k 4 At_genes.gff


15.7 Uniq

The uniq (unique) command removes duplicate lines from a sorted file, keeping only one copy of each run of matching lines. Optionally, it can show only lines that appear exactly once, or only lines that are duplicated. uniq requires sorted input because it compares only consecutive lines.

    $ uniq [OPTIONS] INFILE OUTFILE
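Because uniq only compares neighboring lines, it is almost always fed sorted input through a pipe. A sketch on an invented list:

```shell
printf 'apple\nbanana\napple\napple\n' > fruit.txt

sort fruit.txt | uniq      # one copy of each distinct line
sort fruit.txt | uniq -c   # -c prefixes each line with its count
sort fruit.txt | uniq -d   # -d prints only the duplicated lines
```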

Useful options include -c (prefix each line with its count), -d (print only duplicated lines), and -u (print only lines that occur exactly once).

TASK

From Data/


15.8 Dividing Files by Columns

cut extracts entire columns of data from files. By default it assumes that columns are tab-delimited, but this is not always the case. If your data file contains columns (called 'fields' here) separated by another delimiter, e.g. a space or comma, you must tell cut about it with the -d option.

The following example assumes that the fields are separated by tabs. This will print the first column from the input file to the screen.

    $ cut -f1 FILE

Here is an example of a .csv file (comma separated values). The following command will display columns 2 through 4 from this file.

    $ cut -d ',' -f2-4 FILE

Another example where the delimiter is a pipe (|) and the command will display 1st and 9th column.

    $ cut -d '|' -f1,9 FILE
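A self-contained sketch using a made-up, tab-separated line in the style of a GFF annotation:

```shell
printf 'Chr1\tTAIR10\tgene\t3631\t5899\n' > demo.tsv

cut -f 1 demo.tsv     # first field: Chr1
cut -f 3,4 demo.tsv   # fields 3 and 4, still tab-separated
```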

TASK



16. Saving Your Work

i. On the Mt. Moran Remote Server

    $ history > netid_week1_history.sh

ii. Homework