An Introduction to Linux & Commandline

Molb 4485/5485 -- Computers in Biology

Nicolas Blouin and Vikram Chhatre

Wyoming INBRE Bioinformatics Core
Dept. of Molecular Biology
University of Wyoming
nblouin@uwyo.edu
vchhatre@uwyo.edu
http://www.uwyo.edu/ibc

Table of Contents

  1. The Terminal
  2. Connecting to Remote Server
  3. Where Am I?
  4. Linux Directory Structure
  5. Changing Directories
  6. Making Directories
  7. Making & Editing Files
  8. Moving Directories & Files
  9. Copying Directories
  10. Viewing Directory Contents
  11. Removing Files and Directories
  12. Display File Contents
  13. File & Folder Permissions
  14. Your First Shell Script
  15. Unix Power Commands
  16. Saving Your Work

1. The Terminal

This is the common name for the application that gives you text-based access to the computer's operating system: it lets you type commands and see the output of the programs you run. Unix-based machines always include a terminal program.

Some brief tips before we move further:

2. Connecting To Remote Server

Often in bioinformatics you will be performing tasks on a remote network server, one much more powerful than your workstation. Let's use the terminal program you just learned about to connect to such a server. We will use the Secure SHell (SSH) protocol to talk to the remote server. In the following example, replace username with the user name provided to you.

Two-Factor Authentication

To improve computer security, the University of Wyoming now requires two-factor authentication to grant access to network servers. Two-factor means a password plus a second type of authentication. You all have YubiKeys (small USB devices) that generate the token needed for this.

    $ ssh username@mtmoran.uwyo.edu
                         TWO-FACTOR AUTHENTICATION
=============================================================================
This system requires two-factor authentication.

The password requirement is your UWYO domain password.

The token can be generated by your registered YubiKey or manually input with
the Duo mobile app. If you have questions about using this implementation of
two-factor authentication, contact the ARCC team at arcc-info@uwyo.edu

Please enter the two-factor password in the form:

                            <password>,<token>

=============================================================================
    $ wyoinbre,<Press YubiKey Gold Button>
    [username@mmmlog1 ~]$

3. Where Am I (on the filesystem)?

As you will learn, everything in Linux is relative to where you are in the file system, so knowing where you are before launching a command is valuable information. Luckily, there are built-in commands for exactly this. Understanding the location of files will be a key part of success.
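The built-in command for this is pwd (print working directory). It takes no arguments and prints the absolute path of the directory you are currently in; right after logging in, it will report your home directory.

```shell
# Print the absolute path of the current working directory.
pwd
```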

4. Linux Directory Structure

Linux files are arranged in a hierarchical structure, or a directory tree. From the root directory (/) there are many subdirectories. Each subdirectory can contain files or other subdirectories, and so on. Whenever you're using the terminal you are always 'in' a directory. Opening a terminal window, or logging into a remote computer, places you in your 'home' directory by default; this is what happened when we logged into Mt Moran. The home directory contains files and directories that only you can modify; we will get to those permissions later.

To see what files, or directories, you have in your home directory we will use the ls command.

5. Changing Directories

To move between directories (folders) we use the cd (change directory) command. We are currently in our home directory. Let's move to /project/inbre-train/username/LearnLinux. The cd command uses the following syntax:

    $ cd DIRECTORY
    $ cd /project/inbre-train/username/LearnLinux
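Paths given to cd can be absolute (starting from the root /) or relative to where you are. A few forms worth knowing, sketched here with /tmp simply because it exists on every Unix system:

```shell
cd /tmp   # absolute path: starts from the root (/)
pwd
cd ..     # relative path: .. means the parent directory, here /
pwd
cd        # cd with no argument always returns you to your home directory
pwd
```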

6. Making Directories

Creating directories in Unix is done with the mkdir (make directory) command.

    $ mkdir DirectoryName

Using spaces in directory names, as you might on your desktop, is not advised in the Unix file system; this is why you see _ used in place of spaces. You can escape a space in Unix, but doing so means extra typing and can cause problems when running certain programs. In general, avoid spaces in file and directory names.
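A short sketch (the directory names here are just examples). The -p flag is also worth knowing: it creates nested directories in one step.

```shell
# Underscores instead of spaces keep the name a single argument.
mkdir My_Sequences

# -p creates intermediate directories as needed and does not
# complain if a directory already exists.
mkdir -p Project/Data/Raw
```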

7. Making and Editing Files

In this section you will learn the basics of making files and putting things into those files. There are a variety of ways we can accomplish this as Unix has built in multiple editors for these tasks. We will review a few here.

    $ touch FILENAME

This will create a new, empty file.

    $ nano FILENAME 

This is a built in text editor that will allow us to put information into a file.

8. Moving Directories and Files

To move a file or directory, the mv (move) command is used. This is the first command we have seen that requires two arguments: you must specify a source and a destination.

    $ mv SOURCE DESTINATION
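When the source and destination sit in the same directory, mv is also how files are renamed. A minimal sketch with made-up file names:

```shell
touch notes.txt             # an empty file to practice on
mkdir Archive               # and a destination directory

mv notes.txt notes_old.txt  # same directory: this renames the file
mv notes_old.txt Archive/   # different directory: this moves it
ls Archive
```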

9. Copying Directories

To copy a file or directory, the cp (copy) command is used. Just like mv, it takes a source and a destination.

    $ cp SOURCE DESTINATION
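One difference from mv: copying a directory requires the -r (recursive) flag, so that everything inside it is copied too. A sketch with made-up names:

```shell
mkdir Results
touch Results/run1.log

# Without -r, cp refuses to copy a directory.
cp -r Results Results_backup
ls Results_backup
```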

10. Viewing Directory Contents

To view the contents of directories we use the ls (list) command.

    $ ls DIRECTORY

If no directory is provided ls will list the contents of the current directory.
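A few commonly used ls flags (these can be combined):

```shell
ls -l    # long listing: permissions, owner, size, modification time
ls -a    # also show hidden files (names starting with a dot, e.g. .bashrc)
ls -lh   # long listing with human-readable sizes (K, M, G)
```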

11. Removing Files and Directories

Caution: This is a dangerous command. File/folder deletion in Unix is permanent and irreversible.

If you run ls on your LearnLinux/Work/ directory, it is probably full of empty files and directories by this point. Wouldn't it be nice if there were a way to clean that up? There is, but it can be dangerous. To delete files and directories from the system we have two options: the rm (remove) and rm -r commands.

    $ rm FILE

One more time, just to be clear: it is possible to delete EVERY file you have ever created with the rm command. Thankfully there is a way to make rm a bit safer, and on DT2 this is the default setting. With the -i flag, rm will ask for confirmation before deleting anything.
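A sketch with throwaway names (the confirmation prompt shown in the comment is typical GNU wording and may differ slightly on other systems):

```shell
touch junk.txt
mkdir Old_Work

rm junk.txt      # deletes the file immediately, with no confirmation
rm -r Old_Work   # -r (recursive) is required to delete a directory

# With -i, rm asks first; answer y or n at the prompt:
#   rm -i important.txt
#   rm: remove regular file 'important.txt'? n
```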

12. Display File Contents

There are various commands available to display the contents of a file; by default they all print to the terminal. These commands are less, cat, head, and tail.

    $ less FILENAME 

Displays file contents on the screen with line scrolling (to scroll you can use 'arrow' keys, 'PgUp/PgDn' keys, 'space bar' or 'Enter' key). Press 'q' to exit.

    $ cat FILENAME

The simplest way of displaying contents: cat (concatenate) prints the entire file to the screen. For large files, the whole file will scroll past without pausing.

    $ head FILENAME

By default, displays only the first 10 lines of a file. A different number of lines can be requested with the -n flag followed by the number of lines.

    $ tail FILENAME 

As the name implies, this is the opposite of head: it displays the last 10 lines. Again, the -n option can be used to change the number.
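A quick way to see head and tail in action is to make a small numbered file first (seq prints a sequence of numbers, one per line):

```shell
seq 1 5 > numbers.txt    # a five-line file: 1 2 3 4 5

head -n 2 numbers.txt    # first two lines: 1 and 2
tail -n 2 numbers.txt    # last two lines: 4 and 5
```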

13. File & Folder Permissions

All files in any operating system have a set of permissions that define what can be done with the file and by whom. What = read, write (modify), and/or execute. Whom = user, group, or others. These permissions are denoted with the following syntax:

Permissions
Read: r
Write: w
Execute: x

Relations
User: u
Group: g
Others: o
All users: a

Changing permissions is done via the chmod (change mode) command:

    $ chmod [OPTIONS] RELATIONS[+ or -]PERMISSIONS FILE
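For example, to let yourself execute a script (the file name here is hypothetical):

```shell
touch analysis.sh

chmod u+x analysis.sh   # add (+) execute (x) permission for the user (u)
chmod go-w analysis.sh  # remove (-) write (w) permission from group and others
ls -l analysis.sh       # the first column now includes an x for the user
```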

14. Your First Shell Script

Just like Perl, Python, R, or C++, BASH (Bourne Again SHell) is a programming language that works on Unix and Unix-like computers (Linux, macOS, BSD, etc.). All the commands you have been passing to the terminal are in fact being executed by bash, the command shell. A shell script is simply a collection of bash commands that are executed sequentially. To make a script, we write shell commands into a file and then treat that file like any other program or command.

  # This is my first shell script.
  echo "Hello World!"
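To run the script, save it to a file (hello.sh is our choice of name here), then either hand it to bash or make it executable. The #!/bin/bash first line, the "shebang", tells the system which interpreter should run the file.

```shell
# Write the script to a file called hello.sh.
printf '#!/bin/bash\necho "Hello World!"\n' > hello.sh

bash hello.sh        # run it through bash directly...
chmod u+x hello.sh   # ...or make it executable
./hello.sh           # and run it like any other program
# Hello World!
```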

15. Unix Power Commands

The commands you have learned so far are essential for working in Unix, but on their own they don't let you do anything very powerful. The following sections introduce commands that begin to show the real power of Unix.

15.1 Pipes & Redirects

Everything we have done so far has sent the result of the command to the screen. This is fine when the data being displayed is small enough to fit the screen, or when it is the endpoint of your analysis. But for large outputs, or when you need a new file, printing to the screen isn't very useful. Unix has built-in methods to handle command output using the > (greater than), < (less than), and >> (append) operators.

    # Creates a new file (file2) with same contents as old file (file1) 
    $ cat FILE1 > FILE2 

    
    # Appends the contents for file1 to file2, equivalent to opening file1, 
    # copying all the contents, pasting the copied contents to the end of 
    # the file2 and saving it! 
    $ cat FILE1 >> FILE2 

    $ cat FILE1 | less 

Here, the cat command reads the contents of FILE1, but instead of sending them to standard output (the screen) it sends them through the pipe (|) to the next command, less, so that the contents are displayed with line scrolling.

From the LearnLinux/Data/ directory

    $ cat seq.fasta

    $ head seq.fasta > new.txt

    $ cat new.txt

    $ tail seq.fasta > new.txt

    $ cat new.txt

Now lets try that with the append option.

    $ head -n 1 seq.fasta > new.txt

    $ tail -n 1 seq.fasta >> new.txt

15.2 Searching with grep

grep (globally search a regular expression and print) is one of the most useful commands in Unix. It is commonly used to filter a file or input, line by line, against a pattern.

    $ grep [OPTIONS] PATTERN FILENAME 

Like any other command, grep has many options; run man grep for the full list. The most useful include -c (count matching lines instead of printing them), -i (ignore case), -v (invert the match, printing non-matching lines), and --color (highlight the matched text).

Here is a typical scenario for grep:

You might already know that FASTA header lines must start with a > character, with the DNA or protein sequence on the lines that follow. To find only the header lines in a FASTA file, we can use grep.
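A self-contained sketch (the two-record FASTA file here is made up). Note that the > must be quoted: unquoted, the shell would treat it as a redirect and overwrite a file.

```shell
# Build a tiny FASTA file to search.
printf '>seq1 gene A\nATGCGT\n>seq2 gene B\nTTAGCC\n' > demo.fasta

grep ">" demo.fasta      # print only the two header lines
grep -c ">" demo.fasta   # -c prints the count of matching lines: 2
```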

15.3 Regular Expressions

grep + regular expressions (also called regex) = power! Before we get into this let's start with a task.

TASK

The '.' and '*' characters are also special characters that form part of the regular expression. Try to understand how the following patterns all differ. Try using each of these patterns with grep -c against any one of the sequence files. Can you predict which of the four patterns will generate the most matches?

    ACGT 
    AC.GT 
    AC*GT 
    AC.*GT 

The asterisk in a regular expression is similar to, but NOT the same as, the other asterisks that we have seen so far. An asterisk in a regular expression means: match zero or more of the preceding character or pattern. Try searching for the following patterns to ensure you understand what '.' and '*' are doing:

    A...T 
    AG*T 
    A*C*G*T*

When working with sequences (protein or DNA) we are often interested in whether a particular feature is present: a start codon, a restriction site, or a motif. In Unix, any string of text that follows a pattern can be searched for using regular expressions. As you learned above, regular expressions consist of normal and meta characters. Commonly used characters include:


Expression   Function
.            matches any single character
$            matches the end of a line
^            matches the beginning of a line
*            matches zero or more occurrences of the preceding character
\            quoting character; treat the character that follows as an ordinary character
[]           matches any one of the characters between the brackets
[range]      matches any character in the range
[^range]     matches any character except those in the range
\{N\}        matches exactly N occurrences of the preceding character, where N is a number
\{N1,N2\}    matches at least N1 but not more than N2 occurrences of the preceding character
?            matches zero or one occurrence of the preceding character (extended regex)
|            alternation: (this|that) matches either this or that (extended regex)



Here are some common regex patterns for Nucleotide/Protein searches:


Pattern                         Matches
^ATG                            a line starting with ATG
TAG$                            a line ending with TAG
^A[TGC]G                        ATG, AGG, or ACG at the start of a line
TA[GA]$                         TAG or TAA at the end of a line
^A[TGC]G.*TGTGAACT.*TA[GA]$     a gene-like line containing a specific motif (here TGTGAACT)
[YXN][MPR]_[0-9]\{4,9\}         NCBI RefSeq transcript IDs (e.g. XM_012345)
\(NP\|XP\)_[0-9]\{4,9\}         NCBI RefSeq protein IDs

Let's use grep to find a zinc finger motif. For simplicity, let's assume the zinc finger motif to be CXXCXXXXXXXXXXXXHXXXH. You can either use dots to represent any amino acid, or use more complex regular expressions to come up with a more representative pattern.

    $ grep --color "C..C............H...H" At_proteins.fasta 
    $ grep --color "C.\{2\}C.\{12\}H.\{3\}H" At_proteins.fasta
    $ grep --color "C[A-Z][A-Z]C[A-Z]\{12\}H[A-Z][A-Z][A-Z]H" At_proteins.fasta

These all do exactly the same thing. As you can see, regular expressions can be very useful for finding patterns of all kinds.

UNIX Tip: You can use regular expressions in grep, sed, awk, less, perl, python, and certain text editors; almost any programming language or tool can use the power of regex.



15.4 SED Stream Editor

sed is a stream editor: it reads one or more text files, makes the requested changes or edits, and writes the results to standard output. The basic syntax for sed is:

    $ sed 'OPERATION/REGEXP/REPLACEMENT/FLAGS' FILENAME

Above, / is the delimiter, but you can use _, |, or : as well.

OPERATION = the action to be performed, the most common being s which is for substitution.

REGEXP and REPLACEMENT = the search term and the substitution for the operation be executed.

FLAGS = additional parameters that control the operation.

Common FLAGS include:

g   replace all the instances of REGEXP with REPLACEMENT (globally)
n   (n=any number) replace nth instance of the REGEXP with REPLACEMENT
p   If substitution was made, then prints the new pattern space
i   ignores case for matching REGEXP
w   If substitution was made, write out the result to the given file
d   used as an operation ('/REGEXP/d') rather than a flag: deletes lines matching REGEXP

These two commands do the same thing with different delimiters: we changed "Chr1" to "Chromosome_1" in the file. However, the change was not permanent. To keep it, we must either write to a new file or use a flag within sed.

    $ touch greetings.txt
    $ echo "Hello there" >> greetings.txt
    $ head greetings.txt

Now we have our file to manipulate with sed. We have three options for altering and saving the file:

Option 1: Make a new file

    $ sed 's/Hello/Hi/g' greetings.txt > greetings_short.txt
    $ head greetings*

Option 2: Edit in place but make a backup of the original with the given extension

    $ sed -i.bak 's/Hello/Hi/g' greetings.txt
    $ head greetings*

Option 3: Edit in place without a backup. NOTE: if the system runs out of memory or an error occurs mid-edit, the original file can be lost for good.

    $ sed -i 's/Hello/Hi/g' greetings.txt.bak
    $ head greetings*


15.5 Word Count

wc (word count) is a useful command in bioinformatics because it quickly reports how many lines, words, and bytes a file contains.

    $ wc FILENAME
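By default wc reports lines, words, and bytes; individual counts can be requested with -l, -w, or -c. A sketch on a generated file:

```shell
seq 1 100 > lines.txt   # a file with exactly 100 lines

wc lines.txt      # lines, words, bytes, and the file name
wc -l lines.txt   # -l reports just the line count
```

A common bioinformatics use: a FASTQ file stores four lines per read, so wc -l divided by four gives the number of reads.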


15.6 Sorting Files

The sort command arranges the lines of a file. The simplest way to use it is:

    $ sort FILE1 > SORTED_FILE1

sort has these commonly used flags: -n (sort numerically), -r (reverse the order), -k (sort by a given field/column), and -t (set the field separator).

TASK

The LearnLinux/Data/Sequences directory contains numerically labeled files. By default Unix sorts alphabetically, not numerically, so they are listed as Seq1.fa, Seq10.fa, Seq11.fa, etc. To sort them in an easy-to-read order, try:

    $ ls | sort -t 'q' -k 2n

This command lists all the files in the Sequences/ directory and passes the list to sort. With -t 'q' each file name is split at the letter q, and -k 2n sorts numerically on the second field, i.e. the number that follows 'Seq'.

Try using sort on Data/Arabidopsis/At_genes.gff

    $ sort -r -k 1 At_genes.gff 
    $ sort -r -k 4 At_genes.gff


15.7 Uniq

The uniq (unique) command removes duplicate lines from a sorted file, keeping only one copy of each run of matching lines. Optionally, it can show only lines that appear exactly once, or only lines that are duplicated. uniq requires sorted input because it compares only consecutive lines.

    $ uniq [OPTIONS] INFILE OUTFILE
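Because uniq only compares neighboring lines, it is almost always fed sorted input through a pipe. A sketch on an invented list:

```shell
printf 'apple\nbanana\napple\napple\n' > fruit.txt

sort fruit.txt | uniq      # one copy of each distinct line
sort fruit.txt | uniq -c   # -c prefixes each line with its count
sort fruit.txt | uniq -d   # -d prints only the duplicated lines
```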

Useful options include -c (prefix each line with its count), -d (print only duplicated lines), and -u (print only lines that occur exactly once).

TASK

From Data/


15.8 Dividing Files by Columns

cut extracts entire columns of data from files. By default it assumes that columns are tab-delimited, but this is not always the case. If your data file contains columns (called 'fields' here) separated by another delimiter, e.g. a space or comma, you must tell cut about it with the -d option.

The following example assumes that the fields are separated by tabs. This will print the first column from the input file to the screen.

    $ cut -f1 FILE

Here is an example of a .csv file (comma separated values). The following command will display columns 2 through 4 from this file.

    $ cut -d ',' -f2-4 FILE

Another example where the delimiter is a pipe (|) and the command will display 1st and 9th column.

    $ cut -d '|' -f1,9 FILE
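A self-contained sketch using a made-up, tab-separated line in the style of a GFF annotation:

```shell
printf 'Chr1\tTAIR10\tgene\t3631\t5899\n' > demo.tsv

cut -f 1 demo.tsv     # first field: Chr1
cut -f 3,4 demo.tsv   # fields 3 and 4, still tab-separated
```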

TASK



16. Saving Your Work

i. On the Mt. Moran Remote Server

    $ history > netid_week1_history.sh

ii. Homework