
Automation of Web Applications and Iterative Searching for Post-Translational Modifications


Simon Chiang & Kirk C. Hansen
Biomolecular Structure Program, University of Colorado Denver
Abstract

Data-intensive fields like proteomics require researchers to interact with a wide variety of software that is increasingly web-based. Web applications pose special challenges to programmers seeking to automate their execution. Although web applications provide a relatively standard interface to users, the programmatic interfaces span numerous protocols and frequently do not exist at all.

We present Tap-Mechanize, a system to easily capture the output of web forms for resubmission using a standard programmatic interface. By capturing web forms into a standard format, Tap-Mechanize enables many web applications to be used in automated workflows. Such workflows drastically reduce the time required to analyze large datasets, facilitate reproducibility, and enable more complicated techniques to be used during analysis.

We have used Tap-Mechanize to implement iterative searching of MS/MS proteomics data. Iterative searching uses a quick, general search to filter spectra of unmodified peptides, and then performs more time-consuming searches on the remaining spectra. Using iterative searching we are able to identify peptides with post-translational modifications (PTMs) that normally are missed. These peptides are of particular interest because PTMs frequently regulate the function of proteins, and are implicated in many disease states.

Implementation

Tap-Mechanize is written in the programming language Ruby and utilizes two distinct libraries, Tap and Mechanize. Tap (Task Application[3]) is a software framework that we developed to standardize our interaction with diverse software tools, and to facilitate the construction of automated workflows. Mechanize[4] is a library that emulates human interactions with web forms and, although we use the Ruby version, originates from the Perl open-source community.

Tap-Mechanize captures configurations by redirecting the HTTP output of a web form to a local server that parses the request into a configuration file. The redirection occurs via JavaScript that rewrites the action of the form upon submission. Multiple page requests, requests across HTTPS, and page requests using links may all be captured using this method.

The redirection script is injected into the DOM using the Firefox plugin Ubiquity[5]. Redirection from other browsers is currently not supported.

Automated Analytical Workflow

[Figure: workflow diagram with steps numbered 0 to 7, shown alongside a chart titled "Gene Ontology, GO Slims : Biological Process - Weighted".]

0. Input data
1,2. Search using Mascot, export results
3,4. Search using GPM, export results
5. Generate intersection of results
6. Map result accession numbers using PIR
7. Generate graphic using GoGetter

Discussion
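The FDR values considered in this section follow from the definition given with the results table, FDR = decoy hits / peptide hits. The following Ruby sketch recomputes them from the reported counts. The `Search` struct and `fdr_percent` method are illustrative names only and are not part of Tap-Mechanize.

```ruby
# Counts copied from the poster's results table.
# Search and fdr_percent are illustrative names, not Tap-Mechanize API.
Search = Struct.new(:name, :spectra, :peptide_hits, :decoy_hits)

SEARCHES = [
  Search.new("Primary (non-PTM)",    1293,  49, 1),
  Search.new("Secondary (PTM)",      1244, 326, 2),
  Search.new("Non-Iterative Search", 1293, 373, 2)
]

# FDR as defined on the poster: decoy hits / peptide hits, as a percentage.
def fdr_percent(search)
  (100.0 * search.decoy_hits / search.peptide_hits).round(2)
end

SEARCHES.each do |s|
  puts format("%-22s %d/%d = %.2f%%",
              s.name, s.decoy_hits, s.peptide_hits, fdr_percent(s))
end

# Total identifications from the two iterative searches combined.
ITERATIVE_HITS = SEARCHES[0].peptide_hits + SEARCHES[1].peptide_hits
puts "Iterative total: #{ITERATIVE_HITS} peptide hits"
```

Summing the two iterative searches gives 49 + 326 = 375 peptide hits, compared with 373 for the non-iterative search.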
Our experiments using Tap-Mechanize to iteratively search a collagen sample for hydroxylation of proline illustrate that the partitioning of search results can inflate false discovery rates (FDRs). The effect is purely mathematical in nature. During the non-PTM search, the modified peptides are unidentified and therefore absent from the denominator of the FDR equation; as a result the decoy hits have a disproportionately high effect and the FDR increases. During the subsequent PTM search, the unmodified peptides are now absent from the denominator and, again, the FDR increases.

In this example, the lowest FDRs were observed when searching for the modified and unmodified peptides together, without iterative searching. However, the total number of identifications between the non-PTM and PTM searches was greater than the total without iterative searching.

Collagen has an unusually high rate of modification and the observed effect should be less severe for most proteins. Moreover, this experiment does not prove or disprove the utility of iterative searching. Mostly it illustrates that the partitioning step used to select spectra for secondary searching must be executed carefully, and that there is great utility in exploring search results under many conditions.

Without automation it is difficult to pursue studies such as this. Tap-Mechanize helps researchers to automate web applications by capturing configurations from web forms and resubmitting them within workflows. This technique preserves the functionality built into the web interface. Moreover, it allows web applications to be utilized as-is, without requiring developers to provide a separate programmatic interface.

At the most basic level, automation allows researchers to be more productive. More significantly, automation gives researchers an opportunity to examine how their tools work. Analytical software is complex; each configuration is meaningful, even though the exact consequences of a configuration are often unclear. The same can be said of the many numeric results produced during analysis. It is, as always, through trial and error that we enrich our appreciation of what our results mean.

Iterative Search Workflow

[Figure: workflow diagram with steps numbered 0 to 6; web applications are shown in green. Spectra are partitioned into strong and weak identifications, illustrated for CO1A2_RAT.]

0. Input data
1,2. Search without PTMs, export results
3. Partition spectra by identification
4,5. Search weak/unidentified spectra for PTMs
6. Collate results
Partition threshold: exp > 0.05

Introduction

The past several years have seen a proliferation of analytical tools for proteomics data. Several major search engines exist and proteomics toolkits are now available in many programming languages. The next challenge of the proteomics community will be finding ways to utilize these tools more effectively.

Studies have shown, for instance, that searching MS/MS proteomics data using multiple search engines increases the number and quality of peptide/protein identifications[1]. Variations of this approach, including iterative searching[2], also hold promise for improving search results but, as a practical matter, these techniques require automation. A significant barrier to automation is simply interfacing with the various analytical tools, the vast majority of which are online.

Web applications provide a fairly standard interface to humans, the web form, but typically they do not provide a programmatic interface. Moreover, analytical applications can be quite complex; most require a large number of configurations where the allowable values are hard to predict. As a result, web applications can be difficult to automate. Even with a program that emulates a web form, the work required to generate configurations is prohibitive.

Tap-Mechanize facilitates the automation of web applications by providing a simple and robust way to capture configurations directly from web forms. Once captured, the configurations may be resubmitted programmatically, or used as a template to run the application in a batched fashion. Using Tap-Mechanize, most web applications are easy to incorporate into automated workflows.

We have used Tap-Mechanize to automate several workflows related to data preparation and processing, and to experiment with iterative searching. Iterative searching can take many forms. The most basic type of iterative searching simply partitions spectra by the strength of their identifications, then re-searches the weak or unidentified spectra using additional techniques. This type of searching is thought to be useful when analyzing proteins with numerous post-translational modifications (PTMs).
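The basic partitioning step can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: the `Spectrum` struct and `partition_spectra` method are hypothetical names, and "exp" in the poster's partition threshold (exp > 0.05) is read here as the expectation value of a spectrum's best identification.

```ruby
# Hypothetical record: a spectrum id and the expectation value of its best
# hit (nil when the primary search produced no identification).
Spectrum = Struct.new(:id, :expect)

# Taken from the poster's "Partition Threshold: exp > 0.05"; interpreting
# "exp" as an expectation value is an assumption.
THRESHOLD = 0.05

# Returns [weak_or_unidentified, confident]. The first group would be
# re-searched with PTMs enabled.
def partition_spectra(spectra, threshold = THRESHOLD)
  spectra.partition { |s| s.expect.nil? || s.expect > threshold }
end

spectra = [
  Spectrum.new("scan_1", 0.001),
  Spectrum.new("scan_2", 0.3),
  Spectrum.new("scan_3", nil)
]

research, confident = partition_spectra(spectra)
puts "re-search: #{research.map(&:id).join(', ')}"
puts "keep:      #{confident.map(&:id).join(', ')}"
```

Treating unidentified spectra the same as weak ones follows the workflow's "weak/unidentified" wording. Where the threshold is set directly changes which spectra reach the secondary search.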

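Capturing a form submission and reusing it as a template for batched runs, as described in the Introduction, might look like the following sketch. It uses only the Ruby standard library rather than the Tap or Mechanize APIs, and the form field names (`enzyme`, `mods`, `file`) are invented for illustration. It also assumes a URL-encoded request body; multipart submissions would need different parsing.

```ruby
require "uri"
require "yaml"

# Parse a URL-encoded form body (as a browser would POST it) into a hash.
# Repeated keys, such as multi-select fields, are collected into arrays.
def parse_form(body)
  URI.decode_www_form(body).each_with_object({}) do |(key, value), config|
    config[key] = config.key?(key) ? Array(config[key]) << value : value
  end
end

# Use a captured configuration as a template: one copy per value of `key`.
def batch_configs(template, key, values)
  values.map { |value| template.merge(key => value) }
end

# Hypothetical captured request body; the field names are illustrative.
body = "enzyme=Trypsin&mods=Oxidation+%28M%29&mods=Hydroxylation+%28P%29&file=a.mgf"

captured = parse_form(body)
puts captured.to_yaml

batch_configs(captured, "file", %w[a.mgf b.mgf]).each do |config|
  puts "would resubmit with file=#{config['file']}"
end
```

Keeping the captured configuration as plain data means the same template can be stored in a file, inspected, and varied one setting at a time.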
One protein with numerous PTMs is collagen. Collagen consists primarily of a GXY repeat where X and Y can be any amino acid. Normally X is proline and Y is hydroxyproline, meaning collagen is modified at approximately every third residue. As a consequence, hydroxylation of proline must be specified as a PTM to identify the majority of collagen peptides. These peptides are physiologically very relevant; hydroxyproline allows collagen molecules to wind into a tight triple helix and ultimately stabilizes collagen fibrils. In the absence of hydroxylation, collagen degrades easily and the disease scurvy results.

Using rat tail collagen as a sample, we explored the consequences of using iterative searching to identify PTMs, in particular the effects of partitioning spectra on the false discovery rate (FDR).

Sequence coverage: 2% after the primary search (without PTMs); 52% after the secondary search (for PTMs).

FDR = Decoy Hits / Peptide Hits

Search                 N Spectra   Peptide Hits   Decoy Hits   Decoy/Peptide   FDR (%)
Primary (non-PTM)      1293        49             1            1/49            2.04
Secondary (PTM)        1244        326            2            2/326           0.61
Non-Iterative Search   1293        373            2            2/373           0.54

References

1. Searle, B.C., Turner, M. & Nesvizhskii, A.I. Improving sensitivity by probabilistically combining results from multiple MS/MS search methodologies. J Proteome Res 7, 245-53 (2008).
2. Nesvizhskii, A.I. et al. Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data: toward more efficient identification of post-translational modifications, sequence polymorphisms, and novel peptides. Mol Cell Proteomics 5, 652-70 (2006).
3. Tap Website <http://tap.rubyforge.org/>
4. Mechanize <http://mechanize.rubyforge.org/mechanize/>
5. Ubiquity <http://labs.mozilla.com/projects/ubiquity/>
