Perl One-Liners: Bridging the Gap Between Large Data Sets and Analysis Tools

  • Protocol
Celiac Disease

Part of the book series: Methods in Molecular Biology (MIMB, volume 1326)

Abstract

Computational analyses of biological data are becoming increasingly powerful, and researchers intending to carry out their own analyses can often choose from a wide array of tools and resources. However, applying these tools can be obstructed by the wide variety of data formats in use, from standard, commonly used formats to output files from high-throughput analysis platforms. The latter are often too large to be opened, viewed, or edited by standard programs, potentially leading to a bottleneck in the analysis. Perl one-liners provide a simple way to quickly reformat, filter, and merge data sets in preparation for downstream analyses. This chapter presents example code that can be easily adjusted to meet individual requirements. An online version is available at http://bioinf.gen.tcd.ie/pol.

Author information

Corresponding author

Correspondence to Karsten Hokamp.

Appendix

1.1 Perl Basics

Knowledge of a few basic concepts in Perl will lead to a better comprehension of the instructions contained in the one-liners below. It will also allow the user to modify the code and make adjustments according to individual needs. This appendix explains some basic concepts of Perl that are used in Subheading 3.

1.2 Perl Variables

A Perl command can contain several elements, such as variables, operators, built-in functions, and keywords. Variables provide storage containers for data and come in three types: scalars (e.g., numbers, letters, or strings of characters), arrays (lists of scalars), and hashes (lists of scalars organized into key-value pairs). Very complex constructs are possible, but to keep it simple only the most basic aspects are presented here. Each variable is given a name that starts with a symbol ($ for scalars, @ for arrays, and % for hashes) and is followed by alphanumerical characters (a–z, A–Z, 0–9) or underscores. Below are some simple examples of assigning and accessing variables:

$attempt = 1;

$date = '11/12/2013';

print "attempt $attempt on $date\n";

@elements = ('CDS', 'mRNA', 'tRNA');

print "First element: $elements[0]\n";

%roman = (1, 'I', 2, 'II', 3, 'III');

print "Roman for 3: $roman{3}\n";

An easy way to try out Perl code is the debugger. It can be started by typing “perl -d -e 42” at the command line. This will give a new prompt (“DB<1>”) after which Perl statements can be typed for testing. The debugger provides extra functionality, for example examining the content of variables, which can be particularly useful for beginners.

A couple of rules are worth noting from the lines above:

  1. Perl statements end with a semicolon.

  2. Value assignments happen from right to left; that is, the value to be assigned is on the right-hand side of the equal sign.

  3. Text needs to be enclosed in double or single quotes.

  4. Variables and special characters (e.g., “\n”) are evaluated within double quotes but not within single quotes.

  5. Lists are enclosed in round brackets with elements separated by commas.

  6. List indices start at position zero.

  7. To access a single element of an array, the symbol at the start of the variable changes to “$” and the index is specified in square brackets, “[]”.

  8. To access a specific value in a hash, the symbol at the start of the variable changes to “$” and the lookup key is specified in curly brackets, “{}”.
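
The rules above can be tried out in a short script. The following is a minimal sketch with hypothetical sample values (the gene name and variable names are illustrative only). Unlike the short snippets in this appendix, it enables "use strict" and "use warnings" and declares variables with "my", which is common practice in scripts but not required in one-liners. It also uses "=>", an alternative to the comma that makes key-value pairs easier to read, and a negative array index, which counts from the end of the array.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my $gene     = 'HLA-DQA1';
my @features = ('CDS', 'mRNA', 'tRNA');
my %roman    = (1 => 'I', 2 => 'II', 3 => 'III');

# Double quotes interpolate variables and escapes; single quotes do not.
my $interpolated = "gene: $gene";
my $literal      = 'gene: $gene';

# Single elements use the "$" symbol with [] for arrays and {} for hashes.
my $first  = $features[0];     # indices start at zero
my $last   = $features[-1];    # negative indices count from the end
my $number = $roman{3};

print "$interpolated\n";
print $literal, "\n";
print "first: $first, last: $last, roman: $number\n";
```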

1.3 Perl Operators

The next lines of code demonstrate some example use of operators in Perl (some comments are added, starting with “#”):

# some standard mathematical operations

print 3 * (5 + 10) - 2**4;

# processing the content of variables

$total_error = $fp + $fn;

# increase value in $minutes by 30

$minutes += 30;

# increase value in variable $hour by one

$hour++;

# decrease value in variable $remaining by one

$remaining--;

# repeat 'CG' 12 times

$motif = 'CG' x 12;

# the dot concatenates strings and content of variables

$chr = 'chr' . $roman{$chr_number};

# two dots create lists by expanding from lower to higher border

@hex = (0..9, 'a'..'f');
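
To see what these operators produce, the sketch below applies them to hypothetical values (all numbers and names are made up for illustration) and prints the results on one line.

```perl
use strict;
use warnings;

# Hypothetical counts, for illustration only.
my ($fp, $fn) = (4, 6);
my $total_error = $fp + $fn;         # sum of both error types

my $minutes = 45;
$minutes += 30;                      # add 30 to the current value

my $result = 3 * (5 + 10) - 2**4;    # ** is exponentiation: 45 - 16

my $motif = 'CG' x 3;                # repetition operator
my $label = 'chr' . 7;               # concatenation; the number becomes text
my @hex   = (0..9, 'a'..'f');        # range operator on numbers and letters

print "$total_error $minutes $result $motif $label ", scalar(@hex), "\n";
# prints: 10 75 29 CGCGCG chr7 16
```

The call to scalar(@hex) returns the number of elements in the array rather than the elements themselves.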

1.4 Perl Functions

Perl provides many functions that can be applied to the different variable types. A few are listed below and shown with examples:

# functions for scalars

$seq_len = length($seq);

$rev_seq = reverse($seq);

$upper_case = uc($seq);

$lower_case = lc($seq);

$codon = substr $seq, 0, 3;

# remove the trailing newline from the end of a line

chomp $input_line;

# functions for arrays

@array = split //, $string;

$first_element = shift @array;

$last_element = pop @array;

unshift @array, $first_element;

push @array, $last_element;

@alphabetically_sorted = sort @names;

@numerically_sorted = sort { $a <=> $b } @values;

# functions for hashes

if (defined $description{$gene}) { print $description{$gene} } else { print 'not available'; }

foreach (keys %headers) { print ">$_\n$headers{$_}\n"; }
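
Several of these functions are often combined. The sketch below, which uses a hypothetical input sequence, cuts a sequence into codons with substr and push, and counts bases with split and a hash. The join function, not listed above, concatenates list elements with a separator.

```perl
use strict;
use warnings;

# Hypothetical input sequence, for illustration only.
my $seq = 'atggcctaa';

# Step through the sequence three letters at a time.
my @codons;
for (my $pos = 0; $pos + 3 <= length($seq); $pos += 3) {
    push @codons, uc(substr $seq, $pos, 3);
}

# Count how often each base occurs, using a hash keyed by base.
my %count;
foreach my $base (split //, uc $seq) {
    $count{$base}++;
}

print join(' ', @codons), "\n";
foreach my $base (sort keys %count) {
    print "$base: $count{$base}\n";
}
```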

1.5 Loops and Branches

The last two examples introduced the concept of loops and branches. These operate on lists and Boolean expressions, respectively.

A loop is carried out for each element in a list, and an if-statement is executed if a test condition is true. Any Perl expression that evaluates to something other than 0, the string '0', an empty string, or an undefined value is considered true. Comparison operators are available for tests, such as “>”, “<”, “==”, “>=”, and “<=” for numbers and “gt”, “lt”, and “eq” for strings. A common mistake is to use a single equal sign to check whether two numbers are equal; a double equal sign is needed to distinguish the comparison from an assignment. See below for examples:

# a progress meter for reading in long files:

if ($line % 1000 == 0) { print STDERR " $line "; }

# collect lines of sequence into one long lower-case string:

while (<>) { chomp; $seq .= lc $_; }

# exact motif search

if (substr($seq, $pos, 10) eq $motif) { print "Motif found at position $pos!\n"; }

# pad number with zeros at the front

$num = '0'.$num until (length($num) >= $max_len);

The line “while (<>) {}” is a special Perl construct that reads input line by line and stores each line in the special variable “$_”. If file names are specified on the command line, Perl opens them automatically and reads them in turn; otherwise, it reads from standard input.
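
As a sketch of how a loop and a branch work together, the following example reads FASTA-formatted text line by line and records the length of each sequence. To keep it self-contained, the hypothetical records are held in a string that is opened like a file; with real data, “while (<>)” and a file name on the command line would take the place of the explicit file handle.

```perl
use strict;
use warnings;

# Hypothetical FASTA records held in a string so the sketch runs without a file.
my $fasta = ">seq1\nACGT\nacg\n>seq2\nTTGA\n";
open(my $fh, '<', \$fasta) or die "Cannot open in-memory file: $!";

my (%seq, $id);
while (<$fh>) {
    chomp;
    if (substr($_, 0, 1) eq '>') {    # header line: remember the identifier
        $id = substr($_, 1);
    } elsif (defined $id) {           # sequence line: append in lower case
        $seq{$id} .= lc $_;
    }
}
close $fh;

foreach my $name (sort keys %seq) {
    print "$name\t", length($seq{$name}), "\n";
}
```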

1.6 Regular Expressions

One of the most powerful features of Perl is its implementation of regular expressions, which allow matching not only exact text strings but also variable classes of text. Whole books have been written about this topic and a full explanation would go beyond the scope of this chapter. Therefore, only a few basic concepts are explained and demonstrated in the form of examples.

Regular expressions are specified within delimiters (“/” by default) and applied to the content of a variable with the “=~” operator. In a substitution (“s/pattern/replacement/”), a second expression is provided and text matching the first pattern is replaced with it. In addition, modifiers can be used, such as “i” for case-insensitive matches and “g” for global matches, which act on all matches instead of just the first one. Special characters are available to match groups of characters, such as “\w” for any alphanumerical character or underscore, “\d” for digits, and “\s” for white space. The negated class, e.g., not a digit, can be accessed through capital letters, such as “\D”, “\W”, and “\S”. Occurrences can be specified through numbers in curly brackets, e.g., {3} for exactly 3, {4,10} for 4–10, or {2,} for two or more occurrences of a pattern. Special cases are “+” for one or more matches, “*” for zero or more matches, and “?” for zero or one match. To refer to matched patterns afterwards, round brackets are placed around them, and their content becomes available in the special variables $1, $2, …, depending on how many bracketed patterns are specified. The examples below illustrate their usage:

# search $_ for the word "regulator" (ignoring case) and print if found

if (/regulator/i) { print; }

# check for non-numerical input

if ($input =~ /\D/) { warn "Non-numerical input in '$input'\n"; }

# remove all white space

$input =~ s/\s//g;

# find a pattern that is repeated at least 3 times and print

if ($input =~ /((?:CG){3,})/) { print "Found motif $1!\n"; }

# split a string at tabulators and collect the elements in an array

@list = split /\t/, $input;
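
The sketch below combines several of these features on a hypothetical tab-separated annotation line (the line content and variable names are made up for illustration). Note that a quantifier applies only to the item directly before it, so a multi-letter motif has to be grouped before it can be repeated; “(?:…)” groups without storing the match in $1.

```perl
use strict;
use warnings;

# Hypothetical annotation line (tab-separated), for illustration only.
my $input = "chr7\t  Transcriptional REGULATOR \tCGCGCGCGTT\t1500";

my @fields = split /\t/, $input;

# Copy the description, then trim leading and trailing white space.
(my $description = $fields[1]) =~ s/^\s+|\s+$//g;

# Case-insensitive match.
my $is_regulator = $description =~ /regulator/i ? 1 : 0;

# Capture a run of three or more consecutive "CG" repeats.
my $repeat = $fields[2] =~ /((?:CG){3,})/ ? $1 : '';

# Capture the chromosome number; \d+ matches one or more digits.
my ($chr_number) = $fields[0] =~ /^chr(\d+)$/;

print "$description | $is_regulator | $repeat | $chr_number\n";
# prints: Transcriptional REGULATOR | 1 | CGCGCGCG | 7
```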

There is plenty of literature available for more information on learning Perl. A good starting point is the online library at perl.org: http://www.perl.org/books/library.html.

Copyright information

© 2015 Springer Science+Business Media New York

About this protocol

Cite this protocol

Hokamp, K. (2015). Perl One-Liners: Bridging the Gap Between Large Data Sets and Analysis Tools. In: Ryan, A. (eds) Celiac Disease. Methods in Molecular Biology, vol 1326. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-2839-2_15

  • DOI: https://doi.org/10.1007/978-1-4939-2839-2_15

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-2838-5

  • Online ISBN: 978-1-4939-2839-2

  • eBook Packages: Springer Protocols
