doi: 10.1101/gr.2079204
PMID: 15140831
Abstract
Classification of proteins into families is one of the main goals of functional analysis. Proteins are usually assigned to a family on the basis of the presence of family-specific patterns, domains, or structural elements. Whereas proteins belonging to the same family are generally similar to each other, the extent of similarity varies widely across families. Some families are characterized by short, well-defined motifs, whereas others contain longer, less-specific motifs. We present a simple method for visualizing such differences. We applied our method to the Arabidopsis thaliana families listed at The Arabidopsis Information Resource (TAIR) Web site and for 76% of the nontrivial families (families with more than one member), our method identifies simple similarity measures that are necessary and sufficient to cluster members of the family together. Our visualization method can be used as part of an annotation pipeline to identify potentially incorrectly defined families. We also describe how our method can be extended to identify novel families and to assign unclassified proteins into known families.
Genome projects () are generating sequence data at a much faster rate than can be effectively analyzed. The goal of functional genomics is to determine the function of proteins predicted by these sequencing projects (; ; ). Because experimental evidence about individual proteins is difficult to obtain, a common strategy is to classify proteins into families on the basis of the presence of shared features or by clustering using some similarity measure. The underlying assumption is that members of the same family may possess similar or identical biochemical functions () and that one can assign the functions of well-characterized members of a family to other members whose functions are not known or not well understood ().
The simplest methods for clustering proteins into families rely on sequence-similarity measures, such as those obtained by BLAST (). More sophisticated approaches detect domains using domain databases (; ; ), optionally use the order of domains as a fingerprint for the protein, and classify proteins into families on the basis of the presence of shared domains or similar domain architecture (). Classification of proteins into families using structural similarities () is, at present, limited by the relatively small number of structures available in PDB ()—only 22,874 as of Oct 16th, 2003.
Similarity-based clustering is a two-step process—one first needs to determine pairwise similarities between all pairs of proteins and then apply a clustering method that uses the similarity matrix to group proteins into clusters. However, methods that quantify similarity by using some attribute of the best BLAST hit and use single-linkage clustering are not always successful. One problem such methods face is the detection of the multidomain structure of many protein families. Ideally, proteins should be classified into a single family only if they exhibit highly similar domain architecture. Best hit-based approaches may group together different multidomain proteins that share a common domain () and are prone to mistakes in the presence of promiscuous domains (; ). Several graph-based clustering methods have been proposed to overcome some of the limitations of single-linkage clustering (Matsuda et al. 1999; ; ). We show that some of the shortcomings of single-linkage clustering can be overcome by post-processing (and, if possible, grouping) BLAST hits into matches.
In this study, we test our methods on the protein families of Arabidopsis thaliana. The Arabidopsis thaliana genome was fully sequenced in 2000 (), and the predicted proteome contains 28,995 annotated proteins. However, as of Jan 7th, 2004, only 5473 proteins have been classified into 741 families. The gene family information page maintained at The Arabidopsis Information Resource (TAIR) () lists the different research groups involved in Arabidopsis thaliana gene-family identification, and provides references to publications describing the properties and construction of the gene families. In several cases, the construction of the family is fairly complicated and is based on an in-depth understanding of the properties of similar well-characterized families in other sequenced genomes. The computational methods utilized include scanning the protein sequences for known domains or motifs, identifying transmembrane regions, analyzing hydropathy plots, detecting homologs of characterized proteins from other species, etc. Phylogenetic analysis or clustering based on domain architecture is usually used to further divide large clusters into smaller families.
In this work, we study whether Arabidopsis thaliana families constructed by such diverse methods can be characterized by a small set of biologically meaningful parameters. In other words, we do not attempt to discover families ab initio; rather, we show that most discovered families can be described by one or two parameters. We consider two different parameter schemes. In the first scheme, similarity between two proteins is measured in terms of the fraction of the proteins participating in a gapped alignment (cover) and the percentage identity of such an alignment. We also analyze a second scheme in which similarity is measured in terms of relative score, that is, the ratio of the score of the alignment to the self-similarity score (score of a protein with itself).
In either scheme, we say that a family is clusterable if carrying out single-linkage clustering with some particular threshold value for the parameter(s) groups members of that family into a single cluster. Carrying out the clustering operation with a lower threshold usually results in the cluster becoming corrupted by members of other families, whereas raising the threshold may split the family across multiple clusters. We describe a novel method for visualizing the variation in clusterability with choice of parameters. Our method identifies the parameter values that best characterize a family, and thereby provides ready answers to questions of the form “How similar are members of family X?”
One result of our work is the discovery that, despite the wide variety of methods used in the construction of protein families, 76% of all analyzed Arabidopsis thaliana families are fully clusterable by the proposed simple parameter schemes. Our results, available online at http://warta.bio.psu.edu/htt_doc/ArabCluster, also show relationships between families that share members, and help identify potentially incorrect family assignments. We also show how our results could be used to identify novel families and assign unclassified proteins to known families.
METHODS
Constructing Matches From Hits
Let A be the set of all protein sequences. We compare the proteins of A against each other by running BLASTp with e-value 0.0001. The result is a set of hits, in which each hit is a local alignment that aligns a region of one protein sequence with a region from another protein sequence with a particular score. By parsing the BLAST output, we can define, for each hit, location attributes that specify which regions of the proteins are participating in the local alignment and quality attributes that indicate how good the hit is. More formally, a hit h that aligns region [x1, x2] of protein x with region [y1, y2] of protein y has the following location attributes:
- start(h, x) = x1, end(h, x) = x2, location(h, x) = [x1, x2]
- start(h, y) = y1, end(h, y) = y2, location(h, y) = [y1, y2]
and the following quality attributes:
- identity(h)—the percentage identity of the hit
- aln_len(h)—the length of the alignment
- cover(h, x)—the % of protein x participating in the hit
- cover(h, y)—the % of protein y participating in the hit
- score(h)—the bit score of the hit as reported by BLAST
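As an illustration of the parsing step, the following minimal sketch turns one line of BLAST output into a hit with the attributes listed above. It assumes the 12-column tabular format written by current BLAST+ releases, which is not necessarily the format the authors parsed, and it assumes protein lengths are available in a separate dictionary. The names `Hit`, `parse_hit`, and `lengths` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    x: str           # query protein
    y: str           # subject protein
    x_start: int
    x_end: int
    y_start: int
    y_end: int
    identity: float  # percentage identity of the hit
    aln_len: int     # length of the alignment
    score: float     # bit score
    cover_x: float   # % of protein x participating in the hit
    cover_y: float   # % of protein y participating in the hit


def parse_hit(line, lengths):
    """Parse one 12-column tabular BLAST line (assumed format) into a Hit."""
    f = line.rstrip("\n").split("\t")
    x, y = f[0], f[1]
    x1, x2, y1, y2 = (int(v) for v in f[6:10])
    return Hit(
        x=x, y=y, x_start=x1, x_end=x2, y_start=y1, y_end=y2,
        identity=float(f[2]), aln_len=int(f[3]), score=float(f[11]),
        cover_x=100.0 * (x2 - x1 + 1) / lengths[x],
        cover_y=100.0 * (y2 - y1 + 1) / lengths[y],
    )


# Hypothetical protein lengths and a made-up hit line.
lengths = {"p1": 200, "p2": 400}
line = "p1\tp2\t35.0\t100\t60\t5\t11\t110\t201\t300\t1e-20\t85.5"
hit = parse_hit(line, lengths)
print(hit.cover_x, hit.cover_y, hit.score)
```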
We term the hit that aligns the entire length of a protein sequence p against itself as a self-hit and use the notation self-score(p) to refer to the score of such a hit. On the basis of these self scores, we can define two relative score (quality) attributes for any hit h involving distinct proteins x, y:
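The two definitions are not reproduced in this copy of the text. A reconstruction consistent with the earlier description of relative score (the ratio of the alignment score to the self-similarity score) is given below. The attribute name `rel_score` and the scaling to a percentage, chosen to match the 0 to 90 thresholds used for clustering, are assumptions.

```latex
\begin{align*}
\mathrm{rel\_score}(h, x) &= 100 \times \frac{\mathrm{score}(h)}{\mathrm{self\text{-}score}(x)} \\
\mathrm{rel\_score}(h, y) &= 100 \times \frac{\mathrm{score}(h)}{\mathrm{self\text{-}score}(y)}
\end{align*}
```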
If there are multiple hits between a pair of proteins, the best hit alone may not represent the full extent of similarity between the proteins. At the same time, it may not be possible to take all of the hits into consideration, as a single domain in one protein can match multiple occurrences of a repetitive motif in the other protein. A common strategy is to summarize the similarity using a compatible set of hits. We say that a set of hits between a pair of proteins is compatible if the regions participating in the alignments are nonoverlapping, and if the lines representing the hits do not intersect in a pictorial representation of the hits (see Fig. 1). More formally, hits h1, h2 between a pair of proteins x, y, are compatible if:
- location(h1, x) ∩ location(h2, x) = ∅ and location(h1, y) ∩ location(h2, y) = ∅, and
- either (end(h1, x) < start(h2, x)) and (end(h1, y) < start(h2, y)), or (end(h2, x) < start(h1, x)) and (end(h2, y) < start(h1, y))
Figure 1. Three hits between proteins p1, p2 are shown at left. Hits h1, h2 are incompatible, as the participating regions are in the opposite order. Thus, if score(h1) > score(h2), the best match will be constructed from h1 and h3; otherwise, it will be constructed from h2 and h3.
A set of hits H between a pair of proteins x, y is compatible if all pairs of hits in H are compatible by the above definition. Such a compatible set of hits can be grouped into a match, m. A match has the same quality attributes as a hit. Percentage identity is computed by taking the weighted percentage identity across the hits in H, that is,
whereas all other quality attribute values can be obtained by adding up the corresponding values across the hits in H. Thus, a match can be thought of as a type of global alignment constructed from several local alignments. We define the best match, m(x, y), between distinct proteins x, y as the match with the highest score. A more formal treatment of compatible hits, matches, and simple methods for calculating the best match are available in Veeramachaneni (2002) and Zhang (). In the remainder of this work, we use the term “match” to refer to the best match between a pair of proteins.
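The construction of a best match can be sketched as follows. This is a simple quadratic-time chaining illustration, not the algorithm from the cited references. It assumes that the weights in the weighted percentage identity are the alignment lengths, which this copy of the text does not confirm. The names `Hit`, `compatible`, `best_match`, and `summarize` are hypothetical.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "x1 x2 y1 y2 identity aln_len score")


def compatible(a, b):
    """Hits are compatible if they appear in the same order on both proteins.

    Strict ordering on both proteins already implies that the regions
    do not overlap, so a separate overlap test is not needed.
    """
    return (a.x2 < b.x1 and a.y2 < b.y1) or (b.x2 < a.x1 and b.y2 < a.y1)


def best_match(hits):
    """Return the highest-scoring compatible subset of hits, in order."""
    hits = sorted(hits, key=lambda h: h.x1)
    if not hits:
        return []
    best = [h.score for h in hits]  # best chain score ending at hit i
    prev = [None] * len(hits)       # predecessor of hit i in that chain
    for i, h in enumerate(hits):
        for j in range(i):
            g = hits[j]
            if g.x2 < h.x1 and g.y2 < h.y1 and best[j] + h.score > best[i]:
                best[i] = best[j] + h.score
                prev[i] = j
    i = max(range(len(hits)), key=best.__getitem__)
    chain = []
    while i is not None:
        chain.append(hits[i])
        i = prev[i]
    return chain[::-1]


def summarize(chain):
    """Combine hit attributes into match attributes (length-weighted identity)."""
    total_len = sum(h.aln_len for h in chain)
    return {
        "score": sum(h.score for h in chain),
        "aln_len": total_len,
        "identity": sum(h.identity * h.aln_len for h in chain) / total_len,
    }


# Made-up hits arranged like Figure 1: h1 and h2 cross, h3 follows both.
h1 = Hit(10, 50, 60, 100, 40.0, 41, 80.0)
h2 = Hit(60, 100, 10, 50, 35.0, 41, 70.0)
h3 = Hit(120, 160, 120, 160, 30.0, 41, 50.0)
match = best_match([h1, h2, h3])
print(match == [h1, h3], summarize(match))
```

Because score(h1) > score(h2) in this made-up example, the best match is built from h1 and h3, as described in the caption of Figure 1.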
Clustering
In this study, we consider two different similarity measures; the first measure, based on percentage identity (i) and percentage cover (c) is called the (i, c)-similarity measure, and the second measure, based on relative score (r) is termed the r-similarity measure. We describe in detail clustering based on the (i, c)-similarity measure only, as the actual clustering algorithm used is independent of the similarity measure.
We represent the similarity relationships in our protein data set by an undirected weighted graph, G. The nodes of G correspond to the set of all proteins A, and edges connect proteins x, y if, and only if, there is some hit with x as the query and y as the subject (or vice-versa). The weight of an edge represents the extent of similarity between the proteins connected by the edge. In the case of (i, c) clustering, the weight of the edge is given by a pair—the first element of the pair is the percentage identity of the best match between the proteins and the second element is the percentage of the proteins participating in the match (cover). More formally,
where m is the best match of proteins x, y. In a similar manner, the weight used in the case of r-clustering is given by
The graph representation of similarity data is amenable to several graph-based clustering algorithms including single-linkage clustering, k-means (Michalski et al. 1998) and MCL (). We used single-linkage clustering, which is equivalent to finding connected components in the similarity graph, as it is the simplest of all clustering methods, and more importantly, because it has no hidden parameters.
To observe the effect of using different percentage identity and cover thresholds on the formation of clusters, we carried out (i, c)-clustering 100 times by varying percentage identity i and percentage cover c independently in increments of 10, from 0 to 90. For a particular choice of (i, c), we first construct a restricted graph Gi,c from G by retaining only those edges with weight at least (i, c). We then identify clusters by computing connected components of Gi,c (see Fig. 2). It is easy to see that G0,0, which is identical to G, will be a dense graph that yields a few large clusters, and that G90,90 will be a relatively sparse graph that yields several small clusters.
Figure 2. A similarity graph G of eight proteins is shown at left. The weights on the edges show the percentage identity and cover of the best match between the pairs of proteins. When clustering with threshold (30, 20), G30,20 is created from G by removing edges c--e, c--f, and d--g. G30,20 contains three connected components that form the clusters C1, C2, and C3 shown at right.
Relative score-based clustering is carried out in a similar manner by varying the threshold r from 0 to 90, in increments of 10.
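The thresholded clustering described above can be sketched with a union-find structure. The graph below is a made-up toy example, not the graph of Figure 2, and the function name `single_linkage` is hypothetical. An edge is kept when both its identity and its cover reach the thresholds.

```python
def single_linkage(nodes, edges, i_min, c_min):
    """Connected components of the graph restricted to edges >= (i_min, c_min)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for (a, b), (identity, cover) in edges.items():
        if identity >= i_min and cover >= c_min:
            parent[find(a)] = find(b)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(clusters.values(), key=sorted)


# Toy similarity graph: edge -> (percentage identity, percentage cover).
nodes = ["a", "b", "c", "d", "e"]
edges = {
    ("a", "b"): (80, 90),
    ("b", "c"): (35, 60),
    ("c", "d"): (25, 70),
    ("d", "e"): (50, 15),
}

# Sweep both thresholds from 0 to 90 in steps of 10 (100 clusterings).
grid = {
    (i, c): single_linkage(nodes, edges, i, c)
    for i in range(0, 100, 10)
    for c in range(0, 100, 10)
}
print(grid[(30, 20)])
```

For relative-score clustering, the same routine would apply with a single weight per edge and a single threshold.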
Measuring Cluster Quality
Let P ⊆ A be the set of proteins that have been classified into a set of families F (some proteins may belong to more than one family). We are interested in checking whether the clusters produced by our method for a particular choice of (i, c) (or r) correspond to the protein families, F, defined by experts. In this respect, we are only interested in how well our method clusters known family members, not whether it accurately identifies unclassified proteins with similar properties. Therefore, we remove from our clusters all proteins that are unclassified (A–P). We are now left with a partition of P into clusters that we shall denote by Ci,c.
Ideally, each family of F will correspond to a single cluster of Ci,c. However, the more likely scenario is that some families will be spread across several clusters, or that several families will be grouped into a single cluster. Intuitively, we would consider the clustering parameters (i, c) to be “good” with respect to a family F if
- the majority of the members of F are in a single cluster (concentration)
- in each cluster that contains members of F, the majority of proteins belong to family F (purity)
Note that these two measures are orthogonal: if all of the classified proteins P are placed in a single cluster, then concentration is high, but purity is low. On the other hand, if each protein of P is placed in an individual cluster of size 1, then purity is high, but concentration is low. Concentration and purity reflect the sensitivity and specificity, respectively, of the clustering with respect to the family under consideration. Another measure of clustering quality, which attempts to combine concentration and purity, is matching rate ().
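We measure the concentration, purity, and matching rate of family F in a particular cluster C ∈ Ci,c as follows:
- concentration(F,C) = |F ∩ C| / |F|
- purity(F,C) = |F ∩ C| / |C|
- match_rate(F,C) = |F ∩ C| / |F ∪ C|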
In other words, concentration measures the fraction of the family present in the cluster, whereas purity corresponds to the fraction of the cluster that belongs to the family. It is easy to see that the matching rate measure, which combines these two measures, satisfies the condition match_rate(F,C) ≤ min(concentration(F,C), purity(F,C)) and, therefore, cannot distinguish clusters with high concentration, low purity from clusters with low concentration, high purity.
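The inequality can be checked in one step. The derivation assumes the set-based definitions implied by the description above and by the worked example in Figure 2: concentration is |F ∩ C| / |F|, purity is |F ∩ C| / |C|, and matching rate is |F ∩ C| / |F ∪ C|.

```latex
\begin{align*}
|F \cup C| &\ge \max(|F|, |C|) \\
\mathrm{match\_rate}(F, C) = \frac{|F \cap C|}{|F \cup C|}
  &\le \frac{|F \cap C|}{\max(|F|, |C|)}
  = \min\!\left(\frac{|F \cap C|}{|F|}, \frac{|F \cap C|}{|C|}\right) \\
  &= \min\bigl(\mathrm{concentration}(F, C),\ \mathrm{purity}(F, C)\bigr)
\end{align*}
```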
We now extend these definitions to a set of clusters as:
- concentration(F, Ci,c) = 100 × maxC∈Ci,c concentration(F,C)
- purity(F,Ci,c) = 100 × ΣC∈Ci,c[purity(F,C) × concentration(F,C)]
- match_rate(F,Ci,c) = 100 × maxC∈Ci,c match_rate(F,C)
When measuring quality in terms of concentration and purity, we say that a family F is (x, y) clusterable by parameters (i, c) if concentration(F,Ci,c) ≥ x and purity(F,Ci,c) ≥ y. Similarly, if matching rate is the measure of clustering quality, we say that a family F is x clusterable if match_rate(F,Ci,c) ≥ x.
In the example shown in Figure 2, the proteins belong to two families: the B family with five members is shown in black, and the W family with three members is shown in white. The computation of concentration, purity, and matching rate for the two families is summarized in the table below:
family | measure | C1 | C2 | C3 | overall |
---|---|---|---|---|---|
B | concentration | 2/5 | 1/5 | 2/5 | 40 |
B | purity | 2/4 | 1/2 | 2/2 | 70 |
B | match rate | 2/7 | 1/6 | 2/5 | 40 |
W | concentration | 2/3 | 1/3 | 0/3 | 66 |
W | purity | 2/4 | 1/2 | 0/2 | 50 |
W | match rate | 2/5 | 1/4 | 0/2 | 40 |
Although (30, 20) may not be the right clustering parameters for families B, W, this does not mean that the families are not clusterable. In fact, family B is (100, 100) clusterable by parameters (0, 50) and family W is (100, 100) clusterable by parameters (60, 0).
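The overall values for the (30, 20) clustering of the Figure 2 example can be reproduced with a short sketch. The protein labels are placeholders and the function name `quality` is hypothetical. The per-cluster measures follow the set-based definitions implied by the worked example (shared members divided by family size, by cluster size, and by the size of the union).

```python
def quality(family, clusters):
    """Return (concentration, purity, match_rate) of a family on a 0-100 scale."""
    conc = pur = match = 0.0
    for cluster in clusters:
        shared = len(family & cluster)
        c_conc = shared / len(family)   # fraction of the family in this cluster
        c_pur = shared / len(cluster)   # fraction of this cluster in the family
        conc = max(conc, c_conc)
        pur += c_pur * c_conc           # purity weighted by concentration
        match = max(match, shared / len(family | cluster))
    return 100 * conc, 100 * pur, 100 * match


# Families and clusters as in the worked example (labels are placeholders).
B = {"b1", "b2", "b3", "b4", "b5"}
W = {"w1", "w2", "w3"}
clusters = [{"b1", "b2", "w1", "w2"}, {"b3", "w3"}, {"b4", "b5"}]

for name, fam in (("B", B), ("W", W)):
    print(name, [round(v, 1) for v in quality(fam, clusters)])
```

Up to rounding, this gives 40, 70, and 40 for family B and about 66.7, 50, and 40 for family W; the worked example reports the concentration of W truncated to 66.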
Displaying Clustering Quality
For a particular family, we display the variation in clustering quality as a function of the clustering parameters (i, c) in the form of a 10 × 10 grid (see Fig. 3). If the quality is measured in terms of concentration and purity, each grid element is shown as an RGB color triple, in which the red component corresponds to the purity and the green component corresponds to the concentration (blue is always set to 0.0). When matching rate is used as the quality measure, the grid element is shown in shades of gray, with white representing match rate 100, and black representing match rate 0. In the interest of conciseness, these Variation in Clustering Quality pictures will be referred to as VCQ pictures in the rest of this work.
Figure 3. Clustering quality of the B family from Figure 2 is shown at left. The quality picture for the MDR family of the ABC superfamily of Arabidopsis thaliana is shown at right.
The clustering quality of family B, which consists of the black nodes from Figure 2, is shown on the left hand side of Figure 3. In the top left corner, where i = 0, c = 0, all members of the B family are in the same cluster (high concentration, hence strong green), but the cluster also contains all members of the W family (low purity, hence weak red). This leads to a strong green color. At the opposite end of the picture, each member of the B family is in its own trivial cluster of size 1 (high purity, low concentration), leading to the red color. As indicated by the calculations shown in the table, the grid element corresponding to i = 30, c = 20 is filled with a color that is 40% green and 70% red, resulting in a slightly reddish color. Also note that because the B family is fully clusterable by parameters (0, 50), the grid element at that location is 100% red, 100% green, that is, yellow. A small blue dot is used to indicate such perfect concentration and purity.
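The color mapping can be sketched as below. The function names and the `quality_at` callback are hypothetical. The grid orientation (identity increasing to the right, cover increasing downward) is an assumption inferred from the description of where regions appear in the pictures.

```python
def vcq_color(concentration, purity):
    """Map concentration and purity (0-100) to an (r, g, b) triple in [0, 1]."""
    return (purity / 100.0, concentration / 100.0, 0.0)


def vcq_grid(quality_at):
    """Build the 10 x 10 color grid.

    quality_at(i, c) must return (concentration, purity) for thresholds (i, c).
    Rows correspond to cover thresholds, columns to identity thresholds.
    """
    steps = range(0, 100, 10)
    return [[vcq_color(*quality_at(i, c)) for i in steps] for c in steps]


# Family B at (30, 20): concentration 40, purity 70.
print(vcq_color(40, 70))
# Perfect concentration and purity give yellow.
print(vcq_color(100, 100))

# Made-up quality function, for illustration only.
grid = vcq_grid(lambda i, c: (100 - i, i))
```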
The results for the MDR family of proteins () are also shown in Figure 3. This family clusters perfectly when percentage identity is chosen between 30 and 40 and percentage cover at least 60. The perfect clusterability at high cover indicates that members of the family are of approximately the same length, and that a low-percentage identity extends across almost the entire length of the proteins.
Notes on Clusterability
Because every protein matches itself with 100% identity and cover, it is easy to see that any family of size 1 is (100, 100) clusterable. We call such families trivial families.
We classify nontrivial families into several categories on the basis of the extent of shared family members. The categories can be described without ambiguity in set theoretic terms; however, we choose to illustrate them with the help of Figure 4 due to space constraints.
Figure 4. Possible relationships between families on the basis of shared members.
- atomic family: no members are shared (A)
- subset family: all members are shared with some family (B)
- superset family: contains a subset family (C)
- intersected family: some members are shared (D, E)
In reality, the picture can be more complicated, as a family can fall into more than one category, for example, a superset family can itself be a subset or intersected family, etc. However, even with this simple picture, one can see that our expectations regarding the clusterability of a family vary with the category in which the family falls. For instance, we would expect family A to be more clusterable than the other families.
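One possible set-theoretic reading of these categories is sketched below. The exact definitions used by the authors are not given in the text, so the rules here are assumptions: a family is a subset if all of its members lie in some other family, a superset if it contains all members of some other family, intersected if it partially overlaps another family, and atomic if it shares no members at all. The function name `categorize` and the example families are hypothetical.

```python
def categorize(name, families):
    """Return the set of categories of families[name] based on shared members."""
    members = families[name]
    cats = set()
    for other, other_members in families.items():
        if other == name or not (members & other_members):
            continue  # skip the family itself and families sharing nothing
        if members <= other_members:
            cats.add("subset")
        if other_members <= members:
            cats.add("superset")
        if not (members <= other_members or other_members <= members):
            cats.add("intersected")
    return cats or {"atomic"}


# Made-up families arranged like Figure 4.
families = {
    "A": {1, 2},
    "B": {3},
    "C": {3, 4, 5},
    "D": {6, 7},
    "E": {7, 8},
}
for name in families:
    print(name, sorted(categorize(name, families)))
```

Returning a set of categories reflects the remark that a family can fall into more than one category.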
RESULTS
The complete set of 28,581 Arabidopsis thaliana protein sequences from TIGR formed the set A. Gene family information downloaded from http://www.arabidopsis.org on July 28, 2003 helped us classify 4241 of these proteins into 571 families. A total of 119 families are trivial and 345 are atomic. The classification of the remaining 107 families is shown in Figure 5.
Figure 5. Venn diagram showing the classification of the nonatomic families as of July 28, 2003. A total of 22 families can be classified as subset and intersected, whereas one family falls into all of the three shown categories.
The entire set of proteins A was compared against itself using BLASTp with an e-value threshold of 0.0001. The distribution of the resulting 2,254,453 hits is shown in Figure 6. A total of 8.6% of proteins participate in no hits at all, whereas 1.3% participate in more than 1000 hits. A total of 19 nontrivial families defined by experts contain proteins that have no hits to any other proteins; clearly these families will not be (100, 100) clusterable for any choice of clustering parameters.
Figure 6. Distribution of the number of BLAST hits per protein.
In 76% of the cases, there is exactly one hit between a pair of proteins, so the best match is identical to this hit. In the other cases, where there are multiple hits—due to repeated motifs or conserved domains separated by a distance—we compute the compatible set of hits with the maximum score and create the best match.
Clusters were determined using single-linkage clustering. Graph G0,0, in which no edges are discarded, contains 238 connected components (clusters), whereas G90,90, in which all edges with percentage identity and cover less than 90 are removed, yields 3961 clusters.
Finally, unclassified proteins were removed from the computed clusters, and the clustering quality for each family was computed for all choices of clustering parameters. Overall, 86% of atomic families are at least (90, 90) clusterable for some choice of clustering parameters, whereas only 64% of nonatomic families are similarly clusterable. The variation of clusterability with family size and classification is shown in Figure 7. The results for r clustering are almost as good (within 2%).
Figure 7. The graph at top shows the variation of clusterability with family size for atomic families. A similar graph for nonatomic families is shown at bottom. Please note that the scales used are different.
VCQ pictures similar to Figure 3 for each family and superfamily are available at http://warta.bio.psu.edu/htt_doc/ArabCluster. All of the pictures and the Web pages are constructed on demand by Perl scripts querying a MySQL database that stores the necessary information.
DISCUSSION
Match as Unit of Similarity
In this study, we use single-linkage clustering as the mechanism for grouping similar proteins. The potential drawbacks of using single-linkage clustering have been documented in several papers that propose more sophisticated clustering methods. However, our goal in this study was not to discover families, but rather to characterize existing families by meaningful attributes such as identity, cover, and relative score. We avoided the use of biologically unmeaningful parameters such as inflation value (), connectivity ratio (Matsuda et al. 1999), and z-score cutoff value (), which are used in the automated detection of families by other similarity graph-based clustering methods. Another reason for using single-linkage clustering is that it was the most common clustering method used by researchers involved in the creation of Arabidopsis families listed at http://www.arabidopsis.org.
In an effort to overcome some of the problems associated with using single-linkage clustering for grouping members of multidomain families, we use the notion of a match that can be thought of as a form of gapped alignment composed of possibly multiple BLAST hits. Note that the concept of a match is not novel; it has been used implicitly by programs such as Sim4 (), est_genome (), and Spidey () to align mRNA sequences to genomic sequences. In fact, even the construction of a gapped BLAST hit from ungapped HSPs embodies this concept (although, of course, there are additional parameters like gap penalties at work in this case). It has also been used as a measure of similarity by programs such as XDOM (), and in the creation of the HOVERGEN () and HOBACGEN () databases.
Figure 8 shows an instance where our usage of match as the basic unit of similarity helps distinguish members of two different families in the ABC superfamily (). In this particular case, all hits have very similar identities (≈30%), cover (≈40%) and score. Thus, single-linkage clustering based on the best hit alone would have grouped all three proteins together. However, when we compute the best match, the cover (and relative score) between the two MDR family proteins doubles, and this helps separate them from the ATH family. A similar process helps distinguish the MDR proteins from those of the PMP, ATM, and TAP families of the ABC superfamily (see http://warta.bio.psu.edu/htt_doc/ArabCluster/sfams/sf2.html).
Figure 8. MDR family proteins contain two transmembrane domains, whereas ATH family proteins contain only one. All of the hits between the MDR proteins and the ATH protein are shown at left as lines connecting the transmembrane regions. The hits that form the best matches are shown at right.
Overall, only 2% of the matches computed are composed of multiple hits. One reason for this unexpectedly small number could be that our criterion for hits to be compatible is too stringent: we require hits not to overlap at all. It is possible that allowing for small overlaps between hits, as is done in XDOM (), will permit more nontrivial matches. A second reason for the small number of matches with multiple hits is that in many cases, multidomain proteins are connected by a single hit. For instance, proteins PHYB_ARATH and PHYD_ARATH of the Histidine Kinase family () have identical domain architecture comprising five full-length, nonoverlapping Pfam () domains. However, the BLAST comparison results in a single hit between the proteins that encompasses all five domains. Overall, the matches formed by a single hit are always likely to be a significant majority, as the number of multidomain proteins is exponentially smaller than the number of single domain proteins ().
Usefulness of VCQ Pictures
At present, the usual manner of describing the sequence level similarity of a family is by statements of the form “amino acid identity of family F ranges from 20%–80%”. However, such statements are not very helpful in understanding what distinguishes family F from other families at the sequence level, that is, it is possible for a protein to match a member of F with identity 30% and still not be a member of F. Our VCQ pictures provide this information, as the underlying method takes into consideration all known protein families. Thus, if family F clusters perfectly for all (i, c) parameter combinations from, say, (30, 30) to (50, 80), then one can be confident that no (classified) protein not belonging to F matches any member of F with similarity (30, 30) or higher; (30, 30) is the parameter that distinguishes F from other families, whereas the overall yellow region in the picture gives an idea of the similarity within the family.
VCQ pictures (like Fig. 3) can give a rough idea of the nature and extent of conserved domains in a family. Families with small, unique domains are clusterable by a high-identity, low-cover threshold that is visible as a yellow region in the top right-hand side of the (i, c)-clustering VCQ picture, whereas multidomain families are likely to be clusterable by low-identity, high-cover thresholds.
One can also use the pictures to identify families that have been defined too broadly (concentration is unusually low, even at low thresholds), or too narrowly (purity is unusually low, even at high thresholds).
Note that the VCQ picture of a family may change as more proteins are classified and novel families are created. However, updating the pictures is fairly simple, as the time-consuming steps of measuring similarities and carrying out the clustering with different thresholds are independent of the classification of proteins into families. When family definitions are added or modified, we simply have to filter the precomputed clusters to discard unclassified proteins and remeasure the quality.
Comparison of Clustering Schemes
Our first clustering scheme uses percentage identity and cover as the similarity measure. We analyzed our (i, c)-clustering results to measure how effective these parameters were individually. The results summarized in Table 1 show that using these parameters in combination improves the clusterability results significantly. Figure 9 shows the number of families that are clusterable for different (i, c) parameter combinations. Because low values of threshold can decrease purity and high values can decrease concentration, it comes as no surprise that intermediate values of parameters i, c are most effective at clustering families; in particular, the parameter combination (i = 30, c = 50) alone is capable of clustering 252 (56%) of the nontrivial families.
Figure 9. Contour plot showing, for each choice of identity and cover, the number of nontrivial families that are (90, 90) clusterable.
Table 1.
Clustering Quality Results for the 452 Nontrivial Families
clustering quality | (i, c) | i | c | r |
---|---|---|---|---|
(100,100) | 340 (75%) | 274 (61%) | 229 (51%) | 332 (73%) |
(90,90) | 369 (82%) | 290 (64%) | 256 (57%) | 362 (80%) |
The columns represent different clustering schemes — column labeled i refers to clustering using percentage identity alone, column labeled c refers to clustering using percentage cover alone, etc. The first row lists families that are (100,100) clusterable, whereas the second includes families that are at least (90,90) clusterable.
The second clustering scheme uses relative score as a measure of similarity. Relative score-based clustering is computationally simpler, as it needs to be carried out only 10 times as opposed to 100 times for (i, c)-clustering. The results shown in Table 1 indicate that it is almost as effective as (i, c)-clustering. However, as high-identity, low-cover matches and low-identity, high-cover matches can have the same relative score, it is harder to gain an understanding regarding the nature of similarity within a family by viewing the relative score-based clustering quality picture. Analogous to Figure 9, we show in Figure 10 the number of families that are clusterable at different relative score levels.
Figure 10. Number of families that are (90, 90) clusterable at different levels of relative score thresholds.
Factors Affecting Clusterability
As can be inferred from the results presented in the previous section, small families have a higher chance of being clusterable. However, equally important is the type of the family: atomic families are much more likely to be clusterable than subset, superset, or intersected families. One should also keep in mind that the same family is sometimes independently listed by several groups. For instance, the PDR family appears three times: as a member of the ABC superfamily (), as a member of the ABC Transporters superfamily, and yet again, independently, as the ABC transporter PDR subfamily (). Only the final version, which is a superset of the other two, is fully clusterable. Due to such inconsistencies, it is natural that some nonatomic families will not be clusterable. Our Web site displays for each family all other related families (families with which members are shared), and thus makes it easier to spot such inconsistencies.
We now list some of the reasons why an atomic family may not be clusterable in our analysis:
- Idiosyncrasies in the family: One example is the structure of the two members of the PMP family (ABC superfamily) shown in Figure 11. The PMP proteins are supposed to be half-molecule ABC transporters (); however, Q94FB9 is a full-molecule transporter with each half being PMP-like. This reduces the cover of the match between the two proteins by 50%. Attempts to cluster them together by lowering the threshold for cover will only gather other ABC proteins with two transmembrane domains.
The domain structure of two PMP proteins is shown in the figure. The transmembrane domains are colored black, and the nucleotide-binding factors are shown in gray. The two hits between the proteins are shown by black lines.
- Very similar families: Two families of the Eukaryotic Initiation Factors Gene superfamily are eIF4A and eIF4A-like (). The former family is fully clusterable, but the latter consists of five members that, by all quantitative measures of similarity, are as similar to each other as they are to members of the eIF4A family. The main reason for the proteins to be in different families seems to be historical; the members of the eIF4A family were the first ones of the superfamily to be characterized and studied, whereas the members of the eIF4A-like family have not been studied completely. Note that the two families taken together are clusterable, so it is still possible that experimental validation will result in the families being merged at some later point in time. In that case, the resulting family will be fully clusterable.
- Level of grouping: Proteins can be classified into groups that are variously labeled as classes, subfamilies, families, superfamilies, etc. In general, it is expected that members of the same family share significant sequence similarity, whereas members of a superfamily may share structural similarity. However, these criteria are not rigid and can be interpreted differently by different groups. For instance, the plant U-box proteins are classified into a single family with five different classes on the basis of their domain architecture (). However, concentration(F, C_{0,0}) is <100; that is, not all proteins of the U-box family fall into one cluster, even when none of the edges in the similarity graph are discarded! This indicates that the overall level of similarity is not very high.
- Incorrect data at TAIR: We mined the tabular data at TAIR for information about protein families. Occasionally, the data is inconsistent with the literature. For instance, the 67 members of the Core Cell Cycle gene superfamily that fall into seven families () are listed in a single family. Again, due to the overall low level of similarity, the members fail to cluster together, even when no threshold is applied. We indicate such cases by drawing an X in the grid element corresponding to (i = 0, c = 0). Overall, there are 22 such atomic families.
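The U-box and Core Cell Cycle cases above both rest on the concentration of a family being below 100 even at the (0, 0) level. The sketch below illustrates that check under an assumed reading of the measure: concentration is taken to be the percentage of family members found in the single cluster that holds the most of them. The function name, protein identifiers and clusters are hypothetical.

```python
def concentration(family, clusters):
    """Percentage of family members in the cluster holding the most of them.

    Assumed definition, for illustration only.
    """
    best = max((len(family & cluster) for cluster in clusters), default=0)
    return 100.0 * best / len(family)


# Hypothetical family and clustering obtained with no edges discarded
family = {"U1", "U2", "U3", "U4"}
clusters_at_0_0 = [{"U1", "U2", "U3", "X1"}, {"U4"}]

value = concentration(family, clusters_at_0_0)
print(value)  # prints: 75.0

# A family like this would be marked with an X at (i = 0, c = 0)
mark_with_x = value < 100
```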
The one nonbiological parameter that affects our results slightly is the e-value that was chosen for the initial BLAST run. All of the results described in this study were obtained by running BLAST with an e-value threshold of 0.0001. This somewhat stringent e-value is responsible for some low-similarity families not being clusterable. When we repeated our analysis with the e-value set to 1, the number of nontrivial families that had proteins with 0 hits fell from 19 to 7. This resulted in a small increase in the overall number of families that were clusterable.
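The effect of the e-value cutoff can be illustrated by counting proteins left with no hits at two thresholds. This is a sketch, not the pipeline used in the study. It assumes the standard 12-column tabular BLAST output, in which the e-value is the 11th column, and treats a hit as counting for both proteins involved. The rows, protein names and helper functions are made up.

```python
def proteins_without_hits(tabular_text, proteins, max_evalue):
    """Return the proteins with no non-self hit at or below max_evalue."""
    have_hit = set()
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query != subject and evalue <= max_evalue:
            # treat the hit as an undirected edge: both ends have a hit
            have_hit.update((query, subject))
    return set(proteins) - have_hit


def row(query, subject, evalue):
    # Hypothetical 12-column row; only columns 1, 2 and 11 matter here
    return "\t".join(
        [query, subject, "50.0", "100", "0", "0", "1", "100", "1", "100", evalue, "60.0"]
    )


blast_output = "\n".join(
    [
        row("A", "B", "1e-30"),
        row("B", "A", "1e-30"),
        row("C", "A", "0.003"),  # lost at the stringent cutoff
        row("D", "D", "0.0"),    # self hit, ignored
    ]
)
family = ["A", "B", "C", "D"]

strict = proteins_without_hits(blast_output, family, 1e-4)
relaxed = proteins_without_hits(blast_output, family, 1.0)
print(sorted(strict), sorted(relaxed))  # prints: ['C', 'D'] ['D']
```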
Identifying New Families
As indicated by Figure 9, the parameters at which a family forms a distinct cluster can vary widely. At one extreme, we have the MLO () and MRS2 () families, which are so distinctive that they cluster perfectly at the (0, 0) level; at the other extreme, we have the families of the ABC superfamily, which, because of the presence of common domains, form distinct clusters only when the threshold is raised to (50, 30). Clearly, there is no magic parameter combination at which the clusters are guaranteed to form a complete family.
The only fact we can be sure of is that clusters that form at higher thresholds are purer than those that form at lower thresholds. For instance, consider Figure 12, which shows the distribution of the number of clusters (of size at least 5) with respect to relative score threshold. For the purpose of this figure, each cluster was classified into one of four categories:
- T1: pure, fully classified (all members of the cluster belong to the same family)
- T2: pure, partially classified (all of the classified members of the cluster belong to the same family)
- T3: impure
- T4: none of the members of the cluster have family annotations

Distribution of the number of clusters of size at least five at different relative score thresholds. The clusters are further classified on the basis of their purity, etc.
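The four categories can be expressed as a small decision function. This is a sketch under a simplifying assumption that each protein carries at most one family annotation (the families discussed earlier can overlap, which this ignores). The function name, the `family_of` mapping and the example identifiers are hypothetical.

```python
def classify_cluster(members, family_of):
    """Assign a cluster to T1, T2, T3 or T4 from its members' annotations.

    family_of maps a protein to its family name; proteins that are missing
    or mapped to None are treated as unclassified.
    """
    families = {family_of[m] for m in members if family_of.get(m) is not None}
    has_unclassified = any(family_of.get(m) is None for m in members)

    if not families:
        return "T4"  # no member has a family annotation
    if len(families) > 1:
        return "T3"  # impure: classified members disagree
    return "T2" if has_unclassified else "T1"


# Hypothetical annotations
family_of = {"p1": "MLO", "p2": "MLO", "p3": "MRS2", "p4": None}

print(classify_cluster(["p1", "p2"], family_of))        # prints: T1
print(classify_cluster(["p1", "p2", "p4"], family_of))  # prints: T2
print(classify_cluster(["p1", "p3"], family_of))        # prints: T3
print(classify_cluster(["p4", "p5"], family_of))        # prints: T4
```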
The negligible number of clusters of type T3 when a relative score threshold of 50 (or greater) is used indicates that, at this level, almost all clusters are likely to be pure. Thus, one can choose a cluster of type T4, align its member sequences, detect conserved blocks in the multiple alignment, and construct a new family by identifying all unclassified proteins that contain the blocks. Although T4 clusters formed with a relative score threshold of 90 are also going to be pure, they are not appropriate seeds for the discovery of new families, as the sequences in those clusters are likely to be almost identical, making it impossible to extract functionally relevant blocks from the alignment. In many cases, one can also predict the family of unclassified members of clusters of type T2 on the basis of the classified members.
However, any such predictions or new family definitions need to be followed with more comprehensive work to identify the functional role of the conserved regions. One should also note that the relative score threshold of 50 may not be appropriate in the case of other genomes—only after a significant number of protein families are defined, can we calibrate a suitable threshold that can aid in the detection of the remaining families.
Applicability to Other Species Data
The genomes of complex eukaryotes like human, mouse, and rat have recently been completed. The proteomes of these organisms differ in domain complexity from that of Arabidopsis thaliana. A preliminary analysis of InterPro () domain matches to each of these proteomes indicates that, on average, each Arabidopsis protein matches 4.5 InterPro domains, whereas the corresponding number for human proteins is 9. Given that protein families usually consist of proteins with similar domain architectures, we believe that the larger number of domains per protein actually improves the clusterability of the protein families. For instance, consider two families: F1, defined by domain architecture Dx.Dy, and F2, with domain architecture Dy.Dz. Under the simplistic assumption that the domains are distinct, but of equal length, one can see that F1 and F2 will separate into different clusters only when the cover (or relative score) threshold is >50. On the other hand, if the domain architecture consisted of 10 distinct domains, and the two families shared only one of them, this separation of the families can be accomplished with any cover (or relative score) threshold >10. Note that because clusters may become pure at lower thresholds, the best choice of clustering parameters is likely to be different for these proteomes.
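The arithmetic behind the two thresholds can be checked with a short calculation. The sketch below makes the same simplistic assumptions as the text (distinct domains of equal length, architectures of equal size) and further assumes that a cross-family match covers exactly the shared domains. The function name and domain labels are hypothetical.

```python
def max_cross_family_cover(architecture_a, architecture_b):
    """Largest cover (%) a match between the two architectures can reach,
    assuming equal-length domains and that only shared domains align."""
    shared = len(set(architecture_a) & set(architecture_b))
    longest = max(len(architecture_a), len(architecture_b))
    return 100.0 * shared / longest


# Two-domain families sharing Dy
f1 = ["Dx", "Dy"]
f2 = ["Dy", "Dz"]
print(max_cross_family_cover(f1, f2))  # prints: 50.0

# Ten-domain families sharing a single domain
g1 = [f"A{i}" for i in range(9)] + ["S"]
g2 = ["S"] + [f"B{i}" for i in range(9)]
print(max_cross_family_cover(g1, g2))  # prints: 10.0
```

Any cover threshold strictly above the returned value discards every cross-family edge, so the families fall into different clusters.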
Conclusion
In this study, we describe a similarity measure that is more comprehensive than simply choosing an attribute of the best BLAST hit. We show that this similarity measure can help overcome some of the limitations of single-linkage clustering with regard to multidomain protein families. We present a novel method for visualizing the sequence similarity within protein families. This is accomplished by showing, in a color plot, how the clusterability of a family varies with the choice of clustering parameters. Families that cluster with highly specific small domains display a different pattern in their clusterability plot from families with large, but variable domains. We applied our method to visualize the protein families of Arabidopsis thaliana and make the results available through a Web interface. Our display method provides answers to questions of the form “What is the similarity of members of family X?” and thus helps reveal some of the parameters that might have been used in the creation of the family. We show how our method can be used to detect possibly incorrect family assignments. Finally, we describe how our method can be used to assign families to some unclassified proteins and how novel families can be discovered.
Acknowledgments
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.
Notes
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2079204. Article published online before print in May 2004.
References
- Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403–410.
- Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796–815.
- Azevedo, C., Santos-Rosa, M.J., and Shirasu, K. 2001. The U-box protein family in plants. Trends Plant Sci. 6: 354–358.
- Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths-Jones, S., Howe, K.L., Marshall, M., and Sonnhammer, E.L. 2002. The Pfam protein families database. Nucleic Acids Res. 30: 276–280.
- Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235–242.
- Bernal, A., Ear, U., and Kyrpides, N. 2001. Genomes OnLine Database (GOLD): A monitor of genome projects world-wide. Nucleic Acids Res. 29: 126–127.
- Bork, P., Dandekar, T., Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M., and Yuan, Y. 1998. Predicting function: From genes to genomes and back. J. Mol. Biol. 283: 707–725.
- Devoto, A., Hartmann, H.A., Piffanelli, P., Elliott, C., Simmons, C., Taramino, G., Goh, C.S., Cohen, F.E., Emerson, B.C., Schulze-Lefert, P., et al. 2003. Molecular phylogeny and evolution of the plant-specific seven-transmembrane MLO family. J. Mol. Evol. 56: 77–88.
- Doolittle, R.F. 1995. The multiplicity of domains in proteins. Annu. Rev. Biochem. 64: 287–314.
- Duret, L., Mouchiroud, D., and Gouy, M. 1994. HOVERGEN, database of homologous vertebrate genes. Nucleic Acids Res. 22: 2360–2365.
- Eisenberg, D., Marcotte, E.M., Xenarios, I., and Yeates, T.O. 2000. Protein function in the post-genomic era. Nature 405: 823–826.
- Enright, A.J. and Ouzounis, C.A. 2000. Generage: A robust algorithm for sequence clustering and domain detection. Bioinformatics 16: 451–457.
- Enright, A.J., Van Dongen, S., and Ouzounis, C.A. 2002. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30: 1575–1584.
- Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., and Miller, W. 1998. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8: 967–974.
- Geer, L.Y., Domrachev, M., Lipman, D.J., and Bryant, S.H. 2002. CDART: Protein homology by domain architecture. Genome Res. 12: 1619–1623.
- Gouzy, J., Eugene, P., Greene, E.A., Kahn, D., and Corpet, F. 1997. XDOM, a graphical tool to analyze domain arrangements in any set of protein sequences. Comput. Appl. Biosci. 13: 601–608.
- Heger, A. and Holm, L. 2000. Towards a covering set of protein family profiles. Prog. Biophys. Mol. Biol. 73: 321–337.
- Hegyi, H. and Gerstein, M. 1999. The relationship between protein structure and function: A comprehensive survey with application to the yeast genome. J. Mol. Biol. 288: 147–164.
- Holm, L. and Sander, S. 1996. Mapping the protein universe. Science 273: 595–602.
- Hwang, I., Chen, H.C., and Sheen, J. 2002. Two-component signal transduction pathways in Arabidopsis. Plant Physiol. 129: 500–515.
- Kawaji, H., Yamaguchi, Y., Matsuda, H., and Hashimoto, A. 2001. A graph-based clustering method for a large set of sequences using a graph partitioning algorithm. Genome Inform. Ser. Workshop Genome Inform. 12: 93–102.
- Li, L., Tutone, A.F., Drummond, R.S., Gardner, R.C., and Luan, S. 2001. A novel family of magnesium transport genes in Arabidopsis. Plant Cell 13: 2761–2775.
- Marcotte, E.M., Pellegrini, M., Ng, H.L., Rice, D.W., Yeates, T.O., and Eisenberg, D. 1999. Detecting protein function and protein–protein interactions from genome sequences. Science 285: 751–753.
- Matsuda, H., Ishihara, T., and Hashimoto, A. 1999. Classifying molecular sequences using a linkage graph with their pairwise similarities. Theor. Comput. Sci. 210: 305–325.
- Metz, A.M., Timmer, R.T., and Browning, K.S. 1992. Sequences for two cDNAs encoding Arabidopsis thaliana eukaryotic protein synthesis initiation factor 4A. Gene 120: 313–314.
- Michalski, R.S., Bratko, I., and Kubat, M. 1998. Machine learning and data mining. Wiley, New York.
- Mott, R. 1997. Estgenome: A program to align spliced DNA sequences to unspliced genomic DNA. Comput. Appl. Biosci. 13: 477–478.
- Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Barrell, D., Bateman, A., Binns, D., Biswas, M., Bradley, P., Bork, P., et al. 2003. The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res. 31: 315–318.
- Perriere, G., Duret, L., and Gouy, M. 2000. HOBACGEN: Database system for comparative genomic in bacteria. Genome Res. 10: 379–385.
- Rhee, S.Y., Beavis, W., Berardini, T.Z., Chen, G., Dixon, D., Doyle, A., Garcia-Hernandez, M., Huala, E., Lander, G., Montoya, M., et al. 2003. The Arabidopsis Information Resource (TAIR): A model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res. 31: 224–228.
- Sanchez-Fernandez, R., Davies, T.G., Coleman, J.O., and Rea, P.A. 2001. The Arabidopsis thaliana ABC protein superfamily, a complete inventory. J. Biol. Chem. 276: 30231–30244.
- Servant, F., Bru, C., Carrere, S., Courcelle, E., Gouzy, J., Peyruc, D., and Kahn, D. 2002. Prodom: Automated clustering of homologous domains. Brief Bioinform. 3: 246–251.
- Smith, T.F. and Zhang, X. 1997. The challenges of genome sequence annotation or “the devil is in the details”. Nat. Biotechnol. 15: 1222–1223.
- Tsoka, S. and Ouzounis, C.A. 2000. Recent developments and future directions in computational genomics. FEBS Lett. 480: 42–48.
- van den Brule, S. and Smart, C.C. 2002. The plant PDR family of ABC transporters. Planta 216: 95–106.
- Vandepoele, K., Raes, J., De Veylder, L., Rouze, P., Rombauts, S., and Inze, D. 2002. Genome-wide analysis of core cell cycle genes in Arabidopsis. Plant Cell 14: 903–916.
- Veeramachaneni, V. 2002. “Aligning fragmented sequences.” Ph.D. thesis, The Pennsylvania State University, University Park, PA.
- Wheelan, S.J., Church, D.M., and Ostell, J.M. 2001. Spidey: A tool for mRNA-to-genomic alignments. Genome Res. 11: 1952–1957.
- Wolf, Y.I., Brenner, S.E., Bash, P.A., and Koonin, E.V. 1999. Distribution of protein folds in the three superkingdoms of life. Genome Res. 9: 17–26.
- Zhang, H. 2003. Alignment of BLAST high-scoring segment pairs based on the longest increasing subsequence algorithm. Bioinformatics 19: 1391–1396.
WEB SITE REFERENCES
- http://www.arabidopsis.org/; The Arabidopsis Information Resource (TAIR).
- http://warta.bio.psu.edu/htt_doc/ArabCluster; Arabidopsis families similarity pictures.
Articles from Genome Research are provided here courtesy of Cold Spring Harbor Laboratory Press
Abstract
Advancements in high-throughput nucleotide sequencing techniques have brought with them state-of-the-art bioinformatics programs and software packages. Given the importance of molecular sequence data in contemporary life science research, these software suites are becoming an essential component of many labs and classrooms, and as such are frequently designed for non-computer specialists and marketed as one-stop bioinformatics toolkits. Although beautifully designed and powerful, user-friendly bioinformatics packages can be expensive and, as more arrive on the market each year, it can be difficult for researchers, teachers and students to choose the right software for their needs, especially if they do not have a bioinformatics background. This review highlights some of the currently available and most popular commercial bioinformatics packages, discussing their prices, usability, features and suitability for teaching. Although several commercial bioinformatics programs are arguably overpriced and overhyped, many are well designed, sophisticated and, in my opinion, worth the investment. Whether you are just beginning your foray into molecular sequence analysis or are an experienced genomicist, I encourage you to explore proprietary software bundles. They have the potential to streamline your research, increase your productivity, energize your classroom and, if nothing else, add a bit of zest to the often dry, detached world of bioinformatics.
bioinformatics software, CLC bio, Geneious, genome assembly, nucleotide alignment, phylogenetics software
INTRODUCTION
Most mornings I wake up to a slew of spam email from biotech companies offering unbeatable bargains on next-generation sequencing (NGS). Yesterday, for example, Beckman Coulter kindly offered to ‘take the stress out of sequencing’ for only a few thousand dollars. Illumina recently provided me with ‘a glimpse into the future of genomics’, just by clicking on their buyer’s guide. And Macrogen, a South Korean sequencing conglomerate, dared me to race the HiSeq ‘Xpressway to the $1000 genome’. These irritating emails underscore an important point: massively parallel sequencing has reached the masses. NGS is now standard fare in almost all facets of life science research [1]. It is also big business and intimately tied to another burgeoning industry: bioinformatics [2].
Anyone who has ever had something sequenced, such as a genome, transcriptome, gene or PCR product, or used nucleotide or protein sequence data in their research has probably dabbled in bioinformatics. Not long after scientists started generating molecular sequence information, computer-savvy biologists and biology-savvy computer scientists began developing programs to analyse those data [3]. Given the breadth and depth of questions that can be addressed with primary biological sequence information, many of these programs have become immensely popular. For example, the journal article describing the basic local alignment search tool (BLAST), which allows a query nucleotide or amino acid sequence to be compared against a database of sequences, has been cited >50 000 times [4].
Today’s omics-obsessed scientific marketplace is overflowing with bioinformatics programs. Whatever your sequence analysis problem (assembling, aligning, annotating, folding, etc.), there is probably a program or online application to solve it—skim through the community-maintained list of bioinformatics software at SEQanswers.com to see what I mean: http://seqanswers.com/wiki/Software/list. The majority of these tools are open source, but they can be difficult to learn, install and run; some require an in-depth knowledge of computers [5]. There are, however, various commercial alternatives, which bring together multiple bioinformatics programs into user-friendly stand-alone packages. Although beautifully designed, these software suites can come with a hefty price tag, meaning that most researchers, teachers and students are lucky if they can afford just one. Like buying a car, choosing between different suites can be challenging, and there is surprisingly little information appraising the different programs. Here, I describe my own experiences with using commercial bioinformatics packages, focusing on their cost, functions and educational utility.
I have no affiliation, past or present, with any of the programs, software or companies described in this manuscript, but being a longtime genomics enthusiast, I use many of these applications daily, and I am a strong proponent of ease of use and accessibility in bioinformatics [5]. Although the focus of this article is commercial software, there are a number of free browser-based bioinformatics toolkits worth considering, e.g. [6, 7]. Two toolkits that I use regularly and recommend are MEGA [8] and Unipro UGENE [9]. See Vincent and Charette [10] for a succinct but compelling summary of the drawbacks of commercial tools and arguments for freedom in bioinformatics.
A bioinformatics magic bullet
During my PhD I spent hours a day at the computer assembling and analysing organelle genomes. Friends and colleagues would poke their heads into my cubbyhole of an office and recoil at the sight of my bloodshot eyes and the genomic chaos playing out on the high-definition dual monitors that surrounded me. ‘Dave, how many analyses are you running?’ they would ask, gaping at the mosaic disarray of program windows scattered across the screens. Like any decent genomics junkie, I usually had half a dozen different bioinformatics applications running concurrently. I would be desperately editing and assembling Sanger sequences with Phred, Phrap and Consed [11, 12] while blasting the resulting contigs locally against custom databases and annotating the output on an ongoing GenBank entry. A chug of coffee and I would switch to gene and genome alignments with ClustalW [13] and MAUVE [14], which were pumped directly into PAML [15] to measure genetic diversity and substitution rates. A quick blink of the eyes and I was plotting sliding-window GC contents across entire organelle chromosomes, all while mainlining a medley of phylogenetic and tree-building programs, from MrBayes [16] to PhyML [17] to PAUP [18] to MacClade [19]. And because some of these applications worked only on a PC and others on a Mac, I had Windows and Apple operating systems running at the same time, with command-line terminals piled on top of graphical user interfaces (GUIs).
‘There’s got to be an easier, more efficient way of doing this’, I would say to myself, as I tossed another empty coffee cup into the trash. Sensing my angst, a colleague recommended that I invest in a commercial, cross-platform, GUI-based bioinformatics package, arguing that it would streamline and simplify my work. I was reluctant to take his advice. I felt that paying for such programs went against the spirit of academic research and that using GUI software would weaken my computational skills. However, after failing for the fourth time to correctly install and run an open-source genome assembly algorithm, I gave in and bought a user-friendly bioinformatics bundle, and have not regretted it.
Show me the money
In 2007, with the grant support of my former PhD supervisor, I purchased my first bioinformatics software package. At the time, there was a small but strong cohort of commercial options available, most of which offered a free 30-day trial—a practice that, with few exceptions, continues today, although in some cases the trial period has been reduced to ≤2 weeks. After testing an assortment of programs, I decided on Geneious (Biomatters Ltd., Auckland, New Zealand), which was first released in 2005 and is now among the more widely used cross-platform commercial bioinformatics packages (Table 1). I chose Geneious not because it was necessarily better than other software, but because the company offered, and continues to offer, student discounts. Seven years ago, I paid approximately $200 (all prices in US dollars) for a student license of Geneious, which allowed me to install the software on a single computer. As Geneious increases in popularity, so does its price tag. As of May 2014, a student license costs $395 (a standard academic one is $795), which still makes it among the least expensive all-in-one commercial suites on the market. In comparison, stand-alone academic licenses of the Lasergene Genomics Suite (DNASTAR, Madison, USA) and Sequencher (Gene Codes, Ann Arbor, USA) are around $6000 and $2500, respectively (Table 1).
Table 1. Examples, features and comparisons of some commonly used commercial bioinformatics software suites
Software | Company | Cost (USD)a | Free trial (days) | Platformb | NGS analysesc | Evolutionary analysesd | Database searchinge | Plug-ins | Workflows | Teaching suitability |
---|---|---|---|---|---|---|---|---|---|---|
Avadis NGS | Strand Scientific Intelligence | $4500 | 20 | M, W, L | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ |
CLC Genomics Workbench | CLC bio, Qiagen | $5500 | 30 | M, W, L | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
CodonCode Aligner | CodonCode | $720 | 30 | M, W | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ |
Genamics Expression | Genamics | $295 | 30 | W | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ |
Geneious | Biomatters | $795 | 14 | M, W, L | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Full Lasergene Suite | DNASTAR | $5950 | 30 | M, W | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
MacVector & Assembler | MacVector | $300 | 21 | M | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ |
NextGENe | Softgenetics | $4049 | 35 | W | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
Sequencher | Gene Codes | $2500 | 30 | M, W | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ |
VectorNTI Advance | Life Technologies | $600 | 30 | W | ✗ | ✓ | ✓ | ✗ | ✓ | ✓ |

a Approximate price of a single-user academic license. Prices were taken directly from company websites (as of 1 June 2014) or were obtained from sales representatives between January and June 2014. Many companies offer a range of pricing and licensing options and frequently have promotional deals.
b Runs on the following platforms: Mac (M), Windows (W) and Linux (L).
c Can store, organize and analyse (e.g. assemble or map to a reference sequence) next-generation sequencing data. In some cases, de novo assembly features are missing.
d Contains some tools for studying molecular evolution, such as those for performing multiple sequence alignments, phylogenetic analyses and/or repeat identification.
e Is able to connect and interact with online sequence databases, such as GenBank.
✓ = yes, ✗ = no
I have since gone on to test, and in some instances purchase, a multitude of other commercial bioinformatics platforms, which have varied widely in price, usability and quality. In several cases, the costs of these software suites were not listed on the company websites or anywhere else online. To get pricing details, I had to request quotes from sales representatives. For example, after a successful 30-day trial of CLC Genomics Workbench (CLC bio, Qiagen, Aarhus, Denmark), I filled out an online pricing request form and was contacted 2 days later by a sales agent who provided me with a formal quote (an estimated $5500 for a standard academic license). I went through similar processes to get pricing on Lasergene, Sequencher and various other bioinformatics programs. Since I started this article, CLC bio has begun posting some of its prices online (www.clcbio.com; accessible through the ‘buy online’ icon), but many companies still require potential customers to contact sales reps, making it difficult and time-consuming to compare prices of different software packages. On a number of occasions, after requesting quotes or free trial access, I was bombarded with emails and phone calls by sales agents asking whether I had come to any decisions about purchasing the software or whether I needed more information; one time a representative even called a laboratory where I used to work, asking for my current contact details. So if you request a quote, be prepared to be pestered.
What do you get for your money and for how long?
Purchasing a commercial sequence analysis suite is not as simple as a one-time payment followed by a lifetime of bioinformatics bliss. There can be hidden unexpected costs and clauses associated with running the software and continuing to use it in the future. Most commercial packages include 12 months of free maintenance, upgrades and support. Shortly after I bought my student license for Geneious, the firm released a new version of the software. Because this occurred within 1 year of my purchasing the program, I was able to upgrade to the newest version for free. Geneious and other bioinformatics manufacturers have recently switched to ‘version-based licensing’, meaning that users receive free updates for their version of the software (e.g. switching from v1.1 to v1.2), no matter when they are released, but access to newer versions (e.g. switching from v1 to v2) requires an upgrade, which typically costs anywhere from 25 to 75% of the software list price.
Last year, for approximately $6000, I purchased as part of a package deal a single academic license of CLC Genomics Workbench and a genome finishing plug-in (more on plug-ins later). Enrollment in the maintenance, upgrade and support program for the first 12 months, which was mandatory, was an additional $1500, making the initial cost of the software $7500. Renewal of the maintenance program was 25% of the purchase price per year, and, most importantly, was automatic, ‘unless terminated in writing by one of the involved parties (CLC bio or the customer) not later than 3 months before the beginning of the next calendar year’. In other words, 9 months after buying the software, I was sent an invoice for $1500, with 2% interest per month.
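The arithmetic in this example can be laid out in a few lines of Python. This is a back-of-the-envelope sketch, not a vendor pricing formula: it assumes the maintenance fee is a flat percentage of the list price charged once per year, it ignores late-payment interest and discounts, and the function names are invented for illustration. The 25% rate and the 25 to 75% upgrade range are the figures quoted in the text.

```python
def first_year_cost(list_price, maintenance_rate=0.25):
    """Purchase price plus the mandatory first year of maintenance."""
    return list_price + list_price * maintenance_rate


def cumulative_cost(list_price, years, maintenance_rate=0.25):
    """Total spent after `years` of ownership, assuming maintenance
    is renewed every year at a flat rate (an assumption)."""
    if years < 1:
        raise ValueError("years must be at least 1")
    return list_price + list_price * maintenance_rate * years


def upgrade_cost_range(list_price, low=0.25, high=0.75):
    """Rough bounds on the price of a major-version upgrade."""
    return (list_price * low, list_price * high)


if __name__ == "__main__":
    price = 6000  # package deal quoted in the text
    print("Year 1 outlay:", first_year_cost(price))
    print("After 3 years of renewals:", cumulative_cost(price, 3))
    print("Major upgrade, low/high:", upgrade_cost_range(price))
```

Under these assumptions, a $6000 purchase costs $7500 in the first year and $10 500 after three years of automatic renewals, which is why it pays to read the termination clause before signing.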
Although costly, subscribing to the maintenance agreement can be wise. Commercial bioinformatics programs (Table 1), such as Geneious, CLC Genomics Workbench and Lasergene, frequently undergo major changes, which can significantly improve the software. In the past, I have regretted not renewing certain software, and more than once I have bought programs anew at full price because I let the maintenance period expire.
Before investing in a bioinformatics package, there are other important details to consider. I suggest asking about the rules on moving the software to another computer, in case, for example, you buy a new laptop or your old one breaks down. I have found that most companies allow users to transfer their software license to a different computer. But doing so normally requires contacting user support for a new software activation key, and if you have let your maintenance agreement expire, then you might have to renew it before being able to migrate the software. Similarly, if you update your computer operating system—from Apple OS X 10.8 to 10.9, for instance—your bioinformatics package might have to be upgraded as well. Most bioinformatics companies offer their software for both Windows and Apple platforms, and some, including Geneious and CLC bio, have Linux versions too, so in most cases, it is possible to switch operating systems completely and continue running the program.
Things get even more complicated when purchasing network (or ‘floating’) licenses of bioinformatics programs. Unlike a single computer license, which works only on one computer, a network/floating license allows multiple people to use a bioinformatics package simultaneously by logging on to a network computer (e.g. a powerful computer housed in the lab) and running the program from it. The number of people that can log on depends on the number of floating licenses that were purchased. Network/floating licenses are more expensive (typically twice the price) than their single-computer counterparts, but they can be more economical for big labs or classroom settings, where purchasing multiple single-user licenses makes less sense. Floating licenses can also be convenient for groups that have a high turnover—such as those with a lot of summer students and undergraduate volunteers—as they allow software key codes to be issued to individual lab members and then taken back once the member leaves. Sequencher (Table 1) offers a ‘hardkey’ option, whereby the user is sent a USB dongle after purchasing the software. Sequencher can then be loaded onto as many computers as the owner wants—all that is required to activate the software is plugging in the USB key. But, as I can attest, USB dongles are easy to misplace (and, if issued from Sequencher, expensive and inconvenient to replace).
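The seat-counting idea behind a floating license can be illustrated with a small Python sketch. It is a toy model only: real license servers are considerably more involved, and the class and method names below are hypothetical rather than taken from any vendor's software.

```python
class FloatingLicensePool:
    """Toy model of a network license with a fixed number of seats."""

    def __init__(self, seats):
        if seats < 1:
            raise ValueError("a pool needs at least one seat")
        self.seats = seats
        self.active = set()  # users currently holding a seat

    def check_out(self, user):
        """Give `user` a seat if one is free; return True on success."""
        if user in self.active:
            return True  # already holds a seat
        if len(self.active) >= self.seats:
            return False  # every purchased seat is in use
        self.active.add(user)
        return True

    def check_in(self, user):
        """Release the seat held by `user`, if any."""
        self.active.discard(user)


if __name__ == "__main__":
    pool = FloatingLicensePool(seats=2)
    for name in ("postdoc", "summer_student", "volunteer"):
        print(name, "granted" if pool.check_out(name) else "refused")
    pool.check_in("summer_student")  # the student leaves the lab
    print("volunteer", "granted" if pool.check_out("volunteer") else "refused")
```

With two seats, the third person is refused until someone else logs off, which is the behaviour that makes floating licenses attractive for groups with high turnover.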
Cloud computing has also arrived in bioinformatics [20]. Companies such as DNAnexus and InterpretOmics are selling bioinformatics as a service, whereby consumers buy online access to powerful computers and their associated software tools, analysis pipelines and data storage and sharing capabilities. The sequencing giant Illumina sells online access to its genomics cloud-computing infrastructure, BaseSpace; 10 terabytes of storage will run you $12,000 per year. Alternatively, the popular web-based platform Galaxy is a free, open-source, cloud-based bioinformatics tool. It is safe to assume that bioinformatics clouds will only grow larger and more popular over the next few years and are where the most innovative new software will be based.
But what does the software actually do?
You have paid your money and decided on the best maintenance and licensing options for your needs. Now what? Well, it is time to start examining molecular sequence data and making some big discoveries, of course. Commercial bioinformatics packages bring together, into a single browser-based platform, a diversity of nucleotide and protein analysis tools (Figure 1). These tools do everything from simple pairwise alignments to restriction site and gene predictions to whole genome and transcriptome assemblies. Given the prevalence of high-throughput sequencing in life science research, many of the tools are designed for analysing, visualizing and arranging NGS information.
The tools and features commonly found in commercial bioinformatics software packages, and what to keep in mind when purchasing one.
One of the most sought after and marketed features of commercial bioinformatics software is their ability to perform fast, efficient and high-quality de novo assemblies of NGS data—taking millions, even billions, of single or paired-end sequencing reads and assembling them into contigs. Go to any of the big bioinformatics software websites and you will find statements like ‘Dominating the high-throughput sequencing data analysis challenge’, ‘Quick and accurate de novo assembly on a desktop computer’ and ‘Next-gen sequence assembly with a clear graphical interphase’. These kinds of claims are often associated with a white paper describing the software’s de novo assembler, including its algorithm, speed and accuracy, how well it performs on standard datasets, such as the human genome, and how it stacks up against other brand-name and open-source assemblers. White papers, however, do tend to present commercial software in an overly positive light and—unlike open-source programs—only a few of the widely used proprietary tools have undergone peer review.
Commercial browser-based assemblers once had a reputation for being slow, memory-expensive and inferior to the free open-source alternatives. Early on, I admittedly struggled to generate quality assemblies, even of small genomes, using commercial programs. In recent years, however, proprietary assembly algorithms have improved immensely and are now used by some of the top academic and industrial research laboratories in the world. With software like CLC Genomics Workbench v7, I have been able to assemble draft genome and transcriptome sequences of microalgae from my laptop computer, which has 16 GB of memory and an Intel Core i7 processor. Many teams are using proprietary tools to assemble complex eukaryotic nuclear genomes, including those of land plants. But these kinds of assemblies require large amounts of time, resources and computing power.
Commercial assemblers, unlike certain open-source ones, are also great at handling data from different sequencing platforms, such as assembling a mixture of Illumina, 454, PacBio and Sanger reads (Table 1); in fact, for many researchers, this is a key selling point. In March 2014, for example, Northwestern University purchased an organization-wide license of Lasergene, providing all faculty, staff and students with access to the software [21]. Similarly, the J. Craig Venter Institute has been using ‘CLC bio’s enterprise platform since 2009 and currently uses it on more than 30 research grants, including their work as part of the Human Microbiome Project’ [22].
Read mapping, which is when sequencing reads are aligned to a reference, such as an entire chromosome or genome, is another core feature of commercial bioinformatics packages. Like with the de novo assemblers, bioinformatics companies regularly boast about their highly tuned, ultra-fast mapping algorithms for reference-guided alignments. CLC bio maintains that their ‘read mapper not only maps more than 1.3 billion Illumina reads (100 nt, paired-end) in less than 5 hours, but [that it] also achieves consistently high mapping accuracy even for complex read data, such [as those] originating from the PacBioRS system’ [23]. They go on to argue that the CLC ‘mapper consistently outperforms the market in all major disciplines’, including the open-source peer-reviewed mapping algorithms Bowtie 2 and BWA [23]. Geneious makes similar claims about their proprietary mapper: ‘Six read mapping algorithms were evaluated on Illumina HiSeq and Ion Torrent sequence data from an Escherichia coli—BWA (0.6.2-r126), Bowtie 1 (0.12.8), Bowtie 2 (2.0.0-beta7), SMALT (0.6.4), SOAP2 (2.20) and Geneious (6.0.3). The results demonstrate that the Geneious Read Mapper produces superior results to the other mapping algorithms on these data sets’ [24]. The claims can be overstated, but in my experience commercial read mappers are as good as or outperform many of the open-source alternatives.
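For readers unfamiliar with the idea, reference-guided read mapping can be sketched in Python as a seed-and-extend search. This is a deliberately naive illustration, not the algorithm used by CLC bio, Geneious, Bowtie 2 or BWA (production mappers rely on compressed indexes and handle gaps and base qualities). The function names, the toy reference and the parameters `k` and `max_mismatches` are all assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Record every start position of every k-mer in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Seed with the read's first k-mer, then count mismatches over the
    full read at each seed hit. Returns a list of (position, mismatches).
    Reads with an error inside the seed are missed by this toy version."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGCTA"  # made-up sequence
    index = build_kmer_index(reference, k=4)
    for read in ("TAGCCGAT", "TAGCCGAA", "GGGGGGGG"):
        print(read, map_read(read, reference, index, k=4))
```

The first read maps exactly, the second maps to the same position with one mismatch, and the third finds no seed and is left unmapped.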
The ultimate test for any assembler or read mapper is whether it is cited in peer-reviewed journals. There is no question that open-source programs are cited more than proprietary ones. The paper presenting the mapper Bowtie 2, for instance, has received 700 citations in just 2 years [25]. But citations for commercial software suites, especially their assembly and mapping algorithms, are on the rise and catching up to their open-source counterparts. A keyword search of ‘CLC Genomics’ in Google Scholar returns >2000 hits. Visit the Geneious blog (http://blog.geneious.com) and you will find a section called ‘Citation Sunday’, highlighting peer-reviewed research that used Geneious. Click the ‘publications’ link on the DNASTAR homepage (www.dnastar.com) and you will see a long list of papers and the following bold statement: ‘Every year for the last 28 years, more researchers have cited DNASTAR's software in scientific journals than any other sequence analysis software’ (italics their own). Skimming through these publications, it is obvious that most papers citing proprietary programs reference a range of open-source ones as well, and that contemporary genomics research often involves a hodgepodge of commercial and free bioinformatics software. Lizzy Sollars, a PhD student at Queen Mary University of London, put it best when describing her work on the Ash Tree Genome Project: ‘Using CLC bio's de novo assembler, along with the open-source scaffolding tool SSPACE, we produced our best de novo assembly so far’ [26]. Visit the Broad Institute Software Archive (www.broadinstitute.org/scientific-community/software) for a list of widely used open-source tools for analysing large genome-related datasets.
More than just browser-based assemblers and mappers
Commercial sequence analysis suites, in addition to assembling and mapping NGS data, are designed to carry out the day-to-day bioinformatics tasks involved in molecular, evolutionary and genome biology (Figure 1). Although it might sound trivial, one of the more useful features of commercial packages is visualizing, organizing and storing molecular sequence information. The intuitive graphical interfaces of commercial software allow users to easily build folder hierarchies and drop-down lists of sequence data, move or export these data to different folders and change file formats for use in other applications. In most cases, the software can connect to online resources, such as the National Center for Biotechnology Information (NCBI) and UniProt, providing quick direct access to vast amounts of nucleotide and protein sequence information, which can then be downloaded, interpreted and analysed through interactive sequence viewers. Many commercial programs also give users the ability to BLAST [4] their data directly against NCBI and UniProt databases, or custom databases, and view and analyse the results through GUIs. My research on organelle DNA has benefited greatly from these types of search tools: in minutes, using commercial software, I can download all of the completely sequenced mitochondrial and chloroplast genomes from GenBank, extract their annotations, sort and search them based on a range of features and transfer them to subfolders for downstream analyses.
The applications within commercial bioinformatics suites that I tend to use most often are for evolutionary analyses and comparative genomics. Most packages come with software for aligning nucleotide and amino acid sequences (and entire chromosomes) as well as tools for inferring evolutionary relationships among sequences and constructing phylogenetic trees and distance matrices. Other useful tools include protein structure prediction, nucleotide repeat and motif finders and primer prediction software. An advantage to performing these kinds of analyses within commercial software is that the results (be they genome maps, alignments, nucleotide sequence dot plots or phylogenetic trees) are depicted in colourful and editable graphics, which can be exported and used for figures in lectures and publications. I regularly build genome maps with Geneious and then export them to a graphics-editing program for further polishing. All of the genome maps in Smith et al. [27, 28], for example, were constructed with Geneious. The interactive graphical visualization tools of commercial suites are excellent for exploring large genomic data sets (often depicted in stacked views) and allow for quick navigation to regions or contigs of interest. Many of these features parallel those of popular freely available NGS viewers, like the Integrative Genomics Viewer [29] and Tablet [30].
If you purchase a bioinformatics package and discover that a particular function is missing, do not panic because there is probably a ‘plug-in’ that can do the job. Plug-ins are downloadable applications that provide additional features to software packages, similar to apps for smartphones and tablets. For bioinformatics software, plug-ins add an array of new sequence analysis tools (ones that complement existing tools or that add novel functions), greatly improving the package. Companies are constantly designing new plug-ins for their software, which means that the repertoire of tools within bioinformatics packages is continually expanding. Plug-ins work in two ways: they allow users to add more features to the software, but they also allow developers to design their own apps for the software. Bioinformatics plug-ins can bring some of the most commonly used open-source software to proprietary programs, giving users the benefits of a user-friendly GUI and the power of peer-reviewed algorithms. A cursory scan through the plug-in list for Geneious reveals programs for phylogenetics (e.g. GARLI [31], MrBayes [16] and RAxML [32]), NGS assembly and mapping (e.g. Velvet [33], TopHat [34] and Bowtie [25]), sequence alignment (e.g. ClustalW [13], MAUVE [14] and Muscle [35]) and other molecular analysis procedures (e.g. Glimmer Gene Prediction [36], Phobos Tandem Repeat Finder (e.g. [37]) and DualBrothers Recombination Detection [38]). More plug-ins means more functions and sometimes more money. CLC bio provides a wide range of plug-ins for their Genomics Workbench package (www.clcbio.com/clc-plugin), many of which are free, but some can cost hundreds, even thousands, of dollars; the Shannon Human Splicing Pipeline plug-in, for example, is around $4000.
Once you have found the tools and plug-ins to suit your needs, you can start linking them together into ‘workflows’ and pipelines. As CLC bio puts it: ‘A workflow consists of a series of tools where the output of one tool is connected as the input to another tool. This way you can set up a workflow to go through (for example) read mapping, using the mapped reads as input for variant detection, and perform filtering of the variant track’. Workflows can save researchers huge amounts of time and are becoming more widespread among commercial bioinformatics packages. If you do not want to fork out the big bucks, check out The Galaxy Project (http://galaxyproject.org)—a free, web-based and user-friendly bioinformatics workflow management system, which provides access to a large number of data integration and analysis programs.
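The workflow idea quoted above, in which the output of one tool becomes the input of the next, can be sketched in Python. This is a toy illustration under loud assumptions: the three stage functions, the 12-nt reference and the four reads are invented for the example, and nothing here reflects how CLC Genomics Workbench or Galaxy implement workflows internally.

```python
from collections import Counter
from functools import reduce

REFERENCE = "ACGTTGCAAGCT"  # made-up toy reference


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_reads(reads):
    """Stage 1: place each read at the ungapped offset with fewest mismatches."""
    placed = []
    for read in reads:
        offsets = range(len(REFERENCE) - len(read) + 1)
        best = min(offsets, key=lambda o: hamming(read, REFERENCE[o:o + len(read)]))
        placed.append((best, read))
    return placed


def detect_variants(placed):
    """Stage 2: count reads supporting each (position, ref_base, alt_base)."""
    support = Counter()
    for offset, read in placed:
        for i, base in enumerate(read):
            ref_base = REFERENCE[offset + i]
            if base != ref_base:
                support[(offset + i, ref_base, base)] += 1
    return support


def filter_variants(support, min_support=2):
    """Stage 3: keep variants seen in at least `min_support` reads."""
    return {v: n for v, n in support.items() if n >= min_support}


def run_workflow(steps, data):
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda result, step: step(result), steps, data)


if __name__ == "__main__":
    reads = ["GTTGCA", "GTAGCA", "TAGCAA", "GCAAGG"]
    print(run_workflow([map_reads, detect_variants, filter_variants], reads))
```

Because each stage only needs to accept the previous stage's output, steps can be swapped or reordered without touching `run_workflow`, which is much of the appeal of graphical workflow builders.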
Bringing bioinformatics into the classroom
Students today are reared on a digital diet of smartphones, tablets and ultra-sleek retina-display laptops filled with intuitive software apps, which integrate seamlessly across platforms and devices. Thus, when these students are introduced to bioinformatics and molecular evolution, one would expect them to engage more easily and enthusiastically with easy-to-use GUI software than with barebones command-line-driven tools.
Commercial bioinformatics suites, given their browser-based point-and-click interface, lend themselves to teaching and learning. From a lecturer’s perspective, the high-end graphics, visual aids and tutorials built into proprietary software are great for communicating bioinformatics topics, themes and procedures, from sequence alignments to contig assemblies to blasting proteins against GenBank. I regularly incorporate bioinformatics software suites into my undergraduate lectures and conference presentations. With my notebook computer connected to a projector, I can use a program like Geneious to effectively communicate to a large audience the procedures and output of various bioinformatics analyses. For example, using a bioinformatics package, it takes me ∼10 min to import a set of Illumina sequencing reads, download a reference genome from GenBank, map the reads to the reference and then zoom in to the resulting alignment, showing the class where the reads mapped onto the genome, the polymorphic sites, paired-end distances and an assortment of other statistics. With the same software, I can design, distribute and evaluate bioinformatics assignments to be completed inside or outside of the classroom. These assignments typically involve a range of sequence analysis tools where the results of one tool are used as input for another. I almost always receive positive feedback from students when using user-friendly bioinformatics—some students have even said that it has inspired them to pursue a career in bioinformatics.
Obviously, the biggest barrier to bringing commercial software into the classroom is the high financial cost of the programs. It is unreasonable to ask students to pay hundreds of dollars for proprietary software, and most undergraduate departments are unable or unwilling to invest thousands of dollars into bioinformatics teaching resources—although with institutes like Northwestern buying campus-wide access to proprietary programs, this might be changing.
One strategy for using commercial bioinformatics in a course is to get all of the students to apply for a free trial version of the software. Their access to the software will be limited to ≤30 days, but this should be long enough for them to complete a few assignments or workshops. Alternatively, some commercial bioinformatics packages can be downloaded and used for free on a ‘basic’ or ‘test’ mode, which means that certain operations are turned off (e.g. assemblies cannot be exported or saved). However, even with limited functions, the software can still provide enough processes for teaching and developing assignments [39]. Again, there is nothing preventing instructors from investing in a personal copy of the software and using it for lectures.
Give it a try and give us your feedback
Going forward, innovations in molecular sequencing techniques will result in ever more sophisticated bioinformatics programs, and it is crucial that these programs are accessible to a broad range of users. We might soon be at a point where walk-in medical clinics have genome sequencing and bioinformatics desks, where patients can play an active role in interpreting their gene sequences and contributing to genetic treatments, and where high-school students assemble and analyse genomes for homework. The increasingly integral role of bioinformatics in research, medicine and society also means that it will become an increasingly larger, more lucrative industry and one where users will have to pay for the best products.
My own experiences with proprietary bioinformatics software have been positive. The tools I have purchased have made my laboratory group and me more productive, and I certainly enjoy using stand-alone GUI-based programs more than command-line driven ones. This productivity and ease of use, however, have come at a cost, both intellectually and financially. Although I use sequence analysis tools almost every day, my bioinformatics skills, in certain respects, have plateaued. Moreover, the licensing and upgrading costs of using commercial software represent a significant proportion of my laboratory’s operating budget. Another downside to commercial bioinformatics is that the user can lose touch with what the programs/algorithms are actually doing (they can be a ‘black box’), whereas it is simple to look ‘under the hood’ of open-source tools, which makes them easy to modify and develop. But as bioinformatics software and algorithms become increasingly complex, it might be unrealistic to expect students to have a strong grasp of the math, theory and computer science that underpin those processes.
If you are considering commercial programs, I recommend taking advantage of the free trials that most of the bioinformatics companies offer. You may find that these programs streamline your research and invigorate your classroom, or that they are a waste of time and resources and you are better off using open-source and/or freeware alternatives. Wherever you stand on the topic, I urge you to share your opinions and experiences with others—and best of luck with all of your bioinformatics endeavours.
- Innovations in molecular sequencing techniques, and the popular use of these technologies, have given rise to a range of user-friendly commercial bioinformatics software suites.
- Often marketed as one-stop bioinformatics toolkits, these software packages can be expensive, and it can be difficult for consumers to choose between the different programs.
- This review explores some of the currently available proprietary bioinformatics packages, comparing their prices, usability, functions and suitability for teaching.
- Some commercial bioinformatics programs are arguably overpriced and overhyped, but many are well designed, sophisticated and, in my opinion, worth the investment.
- I encourage readers to explore commercial bioinformatics packages; they have the potential to streamline your research, increase your productivity and energize your classroom.
Acknowledgment
The author thanks four anonymous reviewers whose feedback greatly improved the manuscript.
Funding
This work was supported by a Discovery Grant to DRS from the Natural Sciences and Engineering Research Council (NSERC) of Canada.
References
1. ML. Nat Rev Genet, vol. 11 (pg. -46)
2. S, J. Bioinformatics software for biologists in the genomics era, 2007 (pg. 1713-)
3. G. Digital Code of Life: How Bioinformatics is Revolutionizing Science, Medicine, and Business. Hoboken
4. SF, W, W, et al. J Mol Biol, vol. 215 (pg. -10)
5. DR. Front Genet, vol. 4
6. T, SR, M, et al. Artemis: an integrated platform for visualization and analysis of high-throughput sequence-based experimental data, 2012 (pg. 464-)
7. S, A, M, et al. A survey of tools for variant analysis of next-generation genome sequencing data, 2013 (pg. 256-)
8. K, G, D, et al. MEGA6: Molecular Evolutionary Genetics Analysis version 6.0, 2013 (pg. 2725-)
9. K, O, M. Bioinformatics, vol. 28 (pg. -7)
10. AT, SJ. Front Genet, vol. 5
11. B, P. Basecalling of automated sequencer traces using phred. II. Error probabilities, 1998 (pg. 186-)
12. D, C, P. Genome Res, vol. 8 (pg. -202)
13. MA, G, NP, et al. Bioinformatics, vol. 23 (pg. -8)
14. AC, B, FR, et al. Mauve: multiple alignment of conserved genomic sequence with rearrangements, 2004 (pg. 1394-)
15. Z. PAML 4: Phylogenetic analysis by maximum likelihood, 2007 (pg. 1586-)
16. F, JP. MrBayes 3: Bayesian phylogenetic inference under mixed models, 2003 (pg. 1572-)
17. S, O. A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood, 2003 (pg. 696-)
18. DL. PAUP* Phylogenetic Analysis Using Parsimony (*and other methods) Version 4.0b10a. Sunderland, MA
19. WP, DR. 1992. Sinauer Associates
20. LD. The case for cloud computing in genome informatics, 2010 (pg. 207)
21. DNASTAR press release, 31 March 2014: Northwestern University adopts DNASTAR Lasergene software. http://www.dnastar.com/t-NorthwesternPress.aspx (1 June 2014, date last accessed)
22. CLC bio press release, 8 Jan 2013: J. Craig Venter Institute extends CLC bio site license through 2017. http://www.clcbio.com/news/jcvi-extends-site-license/ (1 June 2014, date last accessed)
23. CLC bio White Paper, Read Mapping. 2012. http://www.clcbio.com/files/whitepapers/whitepaper-on-CLC-read-mapper.pdf (1 June 2014, date last accessed)
24. M, S, P. The Geneious 6.0.3 read mapper. http://assets.geneious.com/documentation/geneious/GeneiousReadMapper.pdf (1 June 2014, date last accessed)
25. B, SL. Nat Methods, vol. 9 (pg. -9)
26. CLC bio press release, 26 Sep 2013: CLC bio and UK scientists assemble ash tree genome. http://www.clcbio.com/news/clc-bio-and-uk-scientists-assemble-ash-tree-genome/ (1 June 2014, date last accessed)
27. DR, E, AA, et al. First complete mitochondrial genome sequence from a box jellyfish reveals a highly fragmented linear architecture and insights into telomere evolution, 2012 (pg. 52-)
28. DR, J, et al. Palindromic genes in the linear mitochondrial genome of the nonphotosynthetic green alga Polytomella magna, 2013 (pg. 1661-)
29. H, JT, JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration, 2012 (pg. 178-)
30. I, M, L, et al. Tablet—next generation sequence assembly visualization, 2010 (pg. 401-)
31. DJ. Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. PhD diss., The University of Texas at Austin, 2006
32. A. RAxML Version 8: a tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies, 2014 (pg. 1312-)
33. DR, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs, 2008 (pg. 821-)
34. C, L, SL. Bioinformatics, vol. 25 (pg. -11)
35. RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput, 2004 (pg. 1792-)
36. AL, KA, EC, et al. Identifying bacterial genes and endosymbiont DNA with Glimmer, 2007 (pg. 673-)
37. C, F, R. Genome-wide analysis of tandem repeats in Daphnia pulex-a comparative approach, 2010 (pg. 277)
38. VN, KS, F, et al. Dual multiple change-point model leads to more accurate recombination detection, 2005 (pg. 3034-)
39. M, R, A, et al. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data, 2012 (pg. 1647-)
© The Author 2014. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]