George Ruban¹ and Joachim Reidl²
Based on the weight matrices of von Heijne¹, we have developed a computer program to screen whole chromosomal DNA fragments for putative signal sequences of exported or secreted proteins. The program is written in C. It automatically translates large DNA fragments (small chromosomes) in all six reading frames, using the universal genetic code. Subsequent automated analysis of the N-terminal signal peptide yields a compiled sub-library for further analysis with the "fasta" program. Using the complete Haemophilus influenzae Rd chromosome², we demonstrated that numerous experimentally characterized secreted or exported proteins can be detected by this purely computational approach. This indicates that the tool could be used to detect exported gene products (e.g. virulence factors) based on DNA sequence information only.
Most secretory proteins carry an N-terminal signal sequence that initiates protein export across the inner membrane (in prokaryotes) or the endoplasmic reticulum (in eukaryotes). The specific features of a typical signal peptide, well investigated for prokaryotes, are reflected by its secondary structure:
Given the increasing amount of sequence information from complete chromosomes, especially those of pathogenic prokaryotes, we set out to develop a computational approach based on the well-established signal sequence evaluation matrix of von Heijne¹. The resulting program provides a method to screen for putative exported proteins that depend on a signal sequence. Using the completed DNA sequence of the H. influenzae chromosome², we have verified the program and show that it facilitates the detection of naturally secreted proteins.
In essence, the program converts a DNA file into a file of likely proteins. As illustrated in figure 1, the program first translates a given DNA file in all six possible reading frames. Second, a sliding window of N residues is fitted to each possible protein, optionally in a loop, to calculate the best value (significance value) from the von Heijne weight matrix¹. Third, the recognized N-terminal signal sequence is written, along with the corresponding reading frame, location information, protein length, and weight factor, into a formatted database (Fasta format).
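The first step can be illustrated with a minimal sketch. It is written in Python for brevity and is not the authors' C source; the function names (six_frames, candidate_orfs) are hypothetical, and it assumes that each stop-delimited stretch is reported from its first methionine, which the text does not specify.

```python
from itertools import product

BASES = "TCAG"
# Standard (universal) genetic code, codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate both strands at offsets 0, 1 and 2 (frames +1..+3, -1..-3)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def candidate_orfs(protein, min_length=20):
    """Yield (start_index, peptide) for each stop-delimited stretch,
    starting at its first methionine (an assumption of this sketch)."""
    pos = 0
    for segment in protein.split("*"):
        m = segment.find("M")
        if m != -1 and len(segment) - m >= min_length:
            yield pos + m, segment[m:]
        pos += len(segment) + 1  # +1 skips the stop symbol
```

Calling six_frames on a DNA string returns one translated string per frame, with "*" marking stop codons; candidate_orfs then extracts the deduced peptides that meet the minimum length.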
Figure 2 shows the basic function, using the beta-lactamase protein encoded by blaM of E. coli as an example. In step 1, translation of the open reading frame starts at the initiator codon ATG. In step 2, a sliding window of N=15 amino acids is compared against the weight matrix¹, and the window is moved in a loop until the highest score above the minimum significance value parameter is found. In step 3, the best candidates are saved as described above.
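The sliding-window comparison in step 2 can be sketched as follows. This Python fragment is illustrative and is not the authors' implementation. TOY_MATRIX is a made-up three-position placeholder and does not contain von Heijne's published weights; a real run would use a 15-column matrix. The names best_window_score and has_signal are hypothetical.

```python
# Placeholder weights for illustration only (NOT von Heijne's values).
# One dict per window position, mapping residue -> weight.
TOY_MATRIX = [
    {"A": 2.0, "L": 1.0},
    {"L": 2.0},
    {"A": 3.0},
]


def best_window_score(peptide, matrix, max_start=None):
    """Slide a len(matrix)-residue window along the peptide and return
    (best_score, start) of the highest-scoring window, or None if the
    peptide is shorter than the window."""
    n = len(matrix)
    last = len(peptide) - n
    if max_start is not None:
        last = min(last, max_start)
    best = None
    for start in range(last + 1):
        window = peptide[start:start + n]
        # Residues absent from a column contribute 0 in this sketch.
        score = sum(col.get(aa, 0.0) for col, aa in zip(matrix, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


def has_signal(peptide, matrix, min_weight):
    """True if the best window reaches the minimum significance value."""
    best = best_window_score(peptide, matrix)
    return best is not None and best[0] >= min_weight
```

The max_start argument restricts the search to the N-terminal region, which corresponds to running the loop only over the first few window positions.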
Figure 3 shows the parameter input information. The program accepts DNA sequence files in plain ASCII format. The user can then choose several options:
The source code for the program was written in ANSI C; the only departure is the "//" comment style borrowed from C++. It should therefore compile under most C/C++ compilers and operating systems without modification. The executable used here was compiled with Borland Turbo C++ for Windows 3.1, and has been run under Windows 3.1 and on a Power Macintosh running SoftWindows.
As reported recently², the complete chromosome of H. influenzae has been sequenced and is provided by the TIGR organization on the World Wide Web (http://www.tigr.org). To evaluate the properties of the program, we ran it on the complete H. influenzae chromosome with different input parameters. The 1,830,137 bp chromosome encodes 1743 predicted open reading frames². With the minimum weight parameter set to zero and the minimum deduced protein length set to twenty amino acids, the chromosome has the theoretical capacity to encode 12421 open reading frames (orfs), regardless of numerous possible start sites located within existing genes or in transcriptional control regions. As shown in Figure 4, the proteins are sorted into sub-libraries depending on the weight factor (0-160) and the sliding window (4-15). With a weight factor of zero, 12421 hypothetical orfs are generated and saved into a sub-library. As the weight factor increases (80, 90, 100, 120, 160), the number of orfs containing possible signal sequences decreases significantly in the respective sub-libraries. Remarkably, if the minimum length of the deduced orfs is set to at least 100 amino acids, the output library with weight factor 0 contains about 1708 possible reading frames, which closely matches the actual deduced number of 1743 orfs. It can also be observed that the respective sub-libraries with minimum lengths of 20 or 100 amino acids do not differ significantly in the number of saved orfs.
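The sorting into sub-libraries amounts to filtering scored candidates by the two thresholds and writing the survivors in Fasta format. The Python sketch below illustrates this under stated assumptions: the record fields, the header layout, and the DEMO entries are hypothetical and are not taken from the program's actual output or from the H. influenzae data.

```python
def sub_library(candidates, min_weight, min_length):
    """Keep candidates whose score and peptide length meet both thresholds."""
    return [
        c for c in candidates
        if c["score"] >= min_weight and len(c["peptide"]) >= min_length
    ]


def fasta_record(c):
    """Format one candidate as a Fasta entry. The header layout (frame,
    position, length, weight) is an assumed one for illustration."""
    header = (
        f">{c['frame']}|pos={c['start']}|len={len(c['peptide'])}"
        f"|weight={c['score']}"
    )
    return header + "\n" + c["peptide"] + "\n"


# Made-up candidates used only to exercise the functions above.
DEMO = [
    {"frame": "+1", "start": 10, "peptide": "M" * 25, "score": 130},
    {"frame": "-2", "start": 40, "peptide": "M" * 120, "score": 95},
    {"frame": "+3", "start": 70, "peptide": "M" * 30, "score": 10},
]

for weight in (0, 90, 120):
    kept = sub_library(DEMO, min_weight=weight, min_length=20)
    print(weight, len(kept))
```

Raising either threshold can only shrink the sub-library, which is the behaviour described for increasing weight factors.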
To verify the obtained sub-libraries, we investigated whether a defined subset of experimentally characterized and predicted secreted proteins can actually be identified. For this purpose we assembled a test-set from the characterized secreted proteins of H. influenzae (HI0693 e(P4), HI0401 P1, HI0139 P2, HI1164 P5, HI0381 P6, HI0689 Hpd, HI0990 IgA-protease, HI0994 transferrin bdg. Tbp1, HI0995 transferrin bdg. Tbp2, HI0251 TonB, HI0113 HxuC, HI0263 HxuB) and from predicted precursors (HI1111 XylBP, HI0504 RibBP, HI1579 Lpp, HI0620 HlpA, HI0302 Cute, HI1567 IroA, HI0703 LppB, HI0256 Lipo-34), and determined how frequently these proteins occur in a variety of sub-libraries generated by the program. As shown in figure 5, the program produces a 60- to 70-fold increase in the relative accumulation of the test-set when a sliding window of 11 to 15 and weight factors of 100 to 120 are used. The results are calculated as:
To obtain specific information about the generated sub-libraries, we examined the content of one sub-library containing 48 orfs, generated with weight factor 120, sliding window 15 and a minimum protein length of 20 amino acids. This sub-library contained 25% of the test-set proteins. To characterize its content further, we submitted each peptide sequence to a Blast search on the NCBI network server. We found that 35 orfs encode putative or experimentally characterized precursor proteins, 4 are homologous to membrane-associated transport proteins, and 9 correspond to proteins with no defined export characteristic or gave no database hit. The results are summarized in table 1.
|Table 1: Identification of Sub-Library Content|
|No.||HI identifier||Identification|
|3||HI0131||AfuA, iron uptake outer membrane|
|4*||HI0139||P2, outer membrane protein|
|11*||HI0401||P1, outer membrane protein|
|12*||HI0504||RbsB, ribose bdg.-protein, periplasm|
|13||HI0507||hypothetical signal sequence|
|14||||potE, putative putrescine antiporter|
|15||||macB, sigma E homologue|
|17||HI0661||HhuA, hemoglobin bdg., outer membrane|
|20||||MglB, methyl-galactoside bdg. precursor|
|22||HI0852||possible drug translocase|
|23||||DsbD, C-type cytochrome biog., precursor|
|24||||KefC, potassium efflux system, precursor|
|27||||MerP, mercury scavenger prot., precursor|
|28||HI1090||hemin export protein|
|32||||OppA, oligo peptide bdg.-prot., precursor|
|35*||HI1161||P5, outer membrane protein, precursor|
|43||HI1586||hypothetical protein, no precursor|
|44||HI1591||outer membrane lipoprotein carrier|
Identification was determined by homology analysis, using the Blast search engine of the NCBI network server. HI identifiers are included where available. Underlined results indicate proteins that are potentially not secreted (see text). Asterisks (*) mark orfs that belong to the test-set.
In summary, by running the program on the H. influenzae genome, we have demonstrated that it is a simple tool for a first-step analysis that identifies signal-sequence-dependent secreted proteins from DNA information only. The tool makes it possible to skip the time-consuming step of precisely dissecting the deduced coding regions of large bacterial chromosomes before they become accessible for further characterization, for example in the search for secreted proteins. Furthermore, the program can be used to generate user-friendly databases of individually composed sub-libraries, which can subsequently be used as fasta-formatted libraries for a fast homology search for suspected or homologous forms of already characterized proteins or secreted proteins (e.g. virulence factors).