Table of Contents
1. Introduction
1.1. Cellular compartmentalization
1.2. Nuclear localization signal (NLS)
1.2.1. Monopartite NLS
1.2.2. Bipartite NLS
1.2.3. PY-NLS
1.3. Nuclear export signal (NES)
1.4. NLSdb - Database of nuclear localization signals 1.0
1.5. Motivation
2. Materials and Methods
2.1. Collection of experimentally verified nuclear transport signals
2.1.1. NLSs
2.1.1.1. The database NLSdb1.0
2.1.1.2. Publication of Lange et al.
2.1.1.3. Prediction tool SeqNLS
2.1.1.4. The Swiss-Prot database
2.1.1.5. PY-NLS sources
2.1.1.6. Others
2.1.2. NESs
2.1.2.1. The database ValidNESs
2.1.2.2. The NESdb
2.1.2.3. NESbase database
2.1.2.4. The Swiss-Prot database
2.1.2.5. The prediction tool NESMapper
2.1.2.6. Others
2.1.3. Test set - unannotated Swiss-Prot proteins
2.2. In silico mutagenesis
2.2.1. Sets of nuclear and non-nuclear proteins
2.2.2. Mutagenesis approach
2.3. Data analysis
2.3.1. Data pre-processing tools
2.3.2. Protein function and NLS prediction tools
3. Results and Discussion
3.1. Experimental development dataset
3.2. Sequence properties of nuclear localization signals and their proteins
3.2.1. Signal length
3.2.2. Organism of origin
3.2.3. Sequence similarity
3.2.4. Subcellular localization
3.2.5. Clustering of signals
3.3. 4301 novel potential NLSs through mutagenesis
3.3.1. Characterization of potential NLSs
3.3.2. Increasing coverage from 9% to 43%
3.4. Benchmark - NLSdb1.0 vs. NLSdb2.0
3.4.1. 38% of proteins with novel potential NLSs in NLSdb1.0
3.4.2. 100% overlap between NLSdb1.0 and NLSdb2.0
4. Conclusion
5. Outlook
6. Appendix
7. References
Short summary
Analysis of nuclear transport signals
Nuclear protein transport is a fundamental cellular mechanism that precedes a range of biological processes. Classical transport of nuclear proteins is carried out by so-called karyopherins, which shuttle the proteins into and out of the nucleus. In the field of nuclear protein transport, three main types of nuclear import signals (NLS) are known: monopartite, bipartite and PY-NLS. Studies on nuclear export signals (NES) focus on the specific type of leucine-rich signals.
The first goal of this study was to bring NLSdb up to the current state of research. NLSdb is a database of nuclear localization signals and contains a collection of 114 experimental and 194 potential import signals. A set of 2452 new signals with published experimental evidence from the literature was used as the development set. An in silico mutagenesis approach was applied to this development set. This led to 4301 new potential NLSs, which were found in nuclear proteins.
In addition to the data collection, protein sequence analysis was used to examine the properties (function, location and occurrence) of the signals and of the proteins they transport. A close look at the signal properties allows a possible sub-grouping and a more precise specification of sequence patterns for the different signal types. Various tests were carried out to evaluate the data used. They show that the update improves on the old version of NLSdb and highlight the usefulness of being able to predict new signals.
The results of this study reflect the progress in science, add further knowledge in the area of nuclear transport and emphasize the value of bioinformatics for gaining new insights in biology. Nuclear protein transport plays a role in many interesting research areas, for example allergies, cancer and other diseases. The results of this work offer a good starting point for further research.
Abstract
Analysis of Nuclear Transport Signals
Nuclear transport of proteins is a basic cellular mechanism that precedes many biological processes. In the classical transport mechanism, karyopherins import and export nuclear proteins. The karyopherins typically recognize nuclear transport signals in the protein sequence. Research on nuclear protein import focuses on three main types of nuclear localization signals (NLS): monopartite, bipartite and PY-NLS. Studies on nuclear export signals (NES) mostly investigate the leucine-rich type.
The first goal of this thesis was to update NLSdb, a database containing 114 experimental and 194 potential NLS, to the current state of available data. To this end, a set of 2452 novel signals with published experimental evidence was extracted from the literature and used as the development set. An in silico mutagenesis approach was applied to this set and detected 4301 novel potential NLS in nuclear proteins. We matched these potential NLS against protein sequences without annotated subcellular localization to identify nuclear proteins, and we were able to confirm the predicted localization in the literature.
In addition to the data collection, an extensive analysis of protein sequences containing NLS and NES was performed to provide insights into the subcellular localization of these proteins and their occurrence in various organisms. Clustering the NLS sequences separated the signals into distinct sub-groups, each with a clearly defined consensus sequence. Aligning potential NLS against the sub-groups refined the consensus sequences.
The results of this study reflect the scientific progress, add to the knowledge in the field of nuclear transport and highlight the usefulness of bioinformatics methods for gaining new insights in biology. Nuclear transport is relevant to many areas of research, for example allergic reactions, cancer and other diseases. The outcome of this work provides a good foundation for other studies of nuclear transport signals.
1. Introduction
1.1. Cellular compartmentalization
The analysis of nuclear transport signals opens up a biological field that rests on fundamental processes of cellular function. Every eukaryotic cell is divided into compartments. Compartmentalization by the cellular organelles creates physical barriers and thereby separate areas with optimal conditions for particular biological processes and molecular functions [Verkman2002]. The nucleus is one of these organelles. Like many of them, it is enclosed by its own membrane and several layers. The nucleus itself is divided into sub-compartments. Similar to the membranes of other cellular compartments, the outermost nuclear sub-compartment, the nuclear envelope, constitutes a barrier for all molecules exceeding the diffusion limit. It is built from two nuclear membranes. The outer membrane is connected to the endoplasmic reticulum, and the inner membrane interacts with the nuclear lamina. Nuclear pore complexes (NPCs) are the gates through which molecules larger than the diffusion limit (about 40-65 kDa) pass the nuclear envelope [Cautain2014].
In this study, the molecules of interest traveling into and out of the nucleus were proteins. Once a gene has been transcribed in the nucleus, the mRNA exported and translated in the cytoplasm, and the resulting protein folded into its final shape, the protein can be moved to its functional destination. For some proteins this destination is the nucleus, and the protein sequence carries special nuclear localization signals for transport into it.
1.2. Nuclear localization signal (NLS)
There are many ways of transporting a protein into the nucleus [Wagstaff2009]. One of them, the classical nuclear import mechanism, depends on specific nuclear localization signals that are recognized by karyopherins, which carry the proteins through the NPCs. The transporter proteins are called importin-α and importin-β. In the classical pathway, the signal is bound by the binding pockets of importin-α. Importin-β binds to that complex and targets it to the NPCs. After passage through the nuclear pore complex, the GTP-bound form of the Ran GTPase binds to importin-β, which causes the complex to release the NLS-containing protein [Curmi2010]. This study focuses on three main types of signals: monopartite, bipartite and PY-NLS. The first two mainly follow the classical import pathway. PY-NLSs are transported into the nucleus by the karyopherin-β2 pathway [Lange2008].
1.2.1. Monopartite NLS
The first described and most investigated NLS is the monopartite signal. It was first described as a nuclear import signal in simian virus 40 by Colledge et al. [Colledge1986]. Since 1986, this signal has continued to attract attention in the field of nuclear localization signals. Monopartite NLSs are short signals of 4-10 amino acids, most of which are basic. The consensus sequence for the most conserved part of this pattern is K-(K/R)-X-(K/R) [Leung2003].
1.2.2. Bipartite NLS
Bipartite signals can be described as two monopartite-like basic stretches joined by a linker of 10-12 amino acids. It is still not completely resolved what this signal sequence looks like and which of its regions are functional. The literature describes putative consensus sequences for the bipartite signal as (K/R)(K/R)X10-12K(K/R)(K/R) and (K/R)(K/R)X10-12K(K/R)X(K/R), where X10-12 denotes 10 to 12 arbitrary residues [Kosugi2008]. In addition, examples have been found in which the linker region itself influences the activity of the signal [Kosugi2008].
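The consensus patterns quoted for the classical signals translate directly into regular expressions. The following Python sketch is an illustration only and not part of the thesis pipeline: the function name `find_classical_nls`, the dictionary labels and the bipartite test peptide are hypothetical, and X is read as "any residue".

```python
import re

# Consensus patterns from the text, with X read as "any amino acid".
# The dictionary keys are illustrative labels, not established names.
CLASSICAL_PATTERNS = {
    "monopartite": r"K[KR].[KR]",                  # K-(K/R)-X-(K/R)
    "bipartite_a": r"[KR][KR].{10,12}K[KR][KR]",   # (K/R)(K/R)X10-12K(K/R)(K/R)
    "bipartite_b": r"[KR][KR].{10,12}K[KR].[KR]",  # (K/R)(K/R)X10-12K(K/R)X(K/R)
}


def find_classical_nls(sequence):
    """Return (label, start, matched_text) tuples; start is 1-based."""
    hits = []
    for label, pattern in CLASSICAL_PATTERNS.items():
        # Wrapping the pattern in a lookahead also reports overlapping matches.
        for match in re.finditer(f"(?=({pattern}))", sequence):
            hits.append((label, match.start() + 1, match.group(1)))
    return hits


# The SV40 large T antigen signal PKKKRKV contains two overlapping
# monopartite cores.
print(find_classical_nls("PKKKRKV"))
# A synthetic peptide: two basic stretches around a 10-residue linker.
print(find_classical_nls("KRAAAAAAAAAAKKK"))
```

A pattern this short matches many random sequences, which is why a consensus match alone is not treated as evidence for a functional signal.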
1.2.3. PY-NLS
A more recently described signal is the PY-NLS. As mentioned above, proteins with these signals are imported through the karyopherin-β2 pathway. Unlike classical signals, the PY-NLS is bound directly by karyopherin-β2, which was first described in 2006 [Lee2006]. The signals have a strongly conserved C-terminal region and a long, variable N-terminal region. The C-terminal end of the pattern is usually given as (R/H/K)X2-5PY, characterized by a proline followed by a tyrosine. PY-NLSs can be divided into two types according to their N-terminal amino acids: hydrophobic or basic [Süel2008]. A consensus sequence is defined for each type: (LIMHFYVPQ)-(GAS)-(LIMHFYVPRQK)-(LIMHFYVPRQK)-X7-12-(RKH)-X2-5PY for hydrophobic PY-NLSs and (KR)-X0-2-(KR)-(KR)-X3-10-(RKH)-X1-5PY for basic PY-NLSs [Lange2008]. According to these two classes, the expected signal length is 16-24 amino acids for hydrophobic PY-NLSs and 11-23 residues for basic PY-NLSs.
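The expected length ranges follow from adding up the shortest and longest form of each element in a consensus sequence. The sketch below redoes that arithmetic; the decomposition into `(min, max)` pairs and all names are illustrative assumptions. It also shows that the basic consensus as written above, with X1-5 before PY, would allow a minimum of 10 residues, while the stated minimum of 11 corresponds to the C-terminal motif (R/H/K)X2-5PY.

```python
# Each consensus is written as a list of (min_length, max_length) elements.
# Single residue classes such as (KR) count as (1, 1), "PY" as (2, 2).
HYDROPHOBIC = [(1, 1), (1, 1), (1, 1), (1, 1), (7, 12), (1, 1), (2, 5), (2, 2)]
BASIC_WITH_X1_5 = [(1, 1), (0, 2), (1, 1), (1, 1), (3, 10), (1, 1), (1, 5), (2, 2)]
BASIC_WITH_X2_5 = [(1, 1), (0, 2), (1, 1), (1, 1), (3, 10), (1, 1), (2, 5), (2, 2)]


def length_range(elements):
    """Return the (shortest, longest) total length of a consensus."""
    shortest = sum(low for low, _ in elements)
    longest = sum(high for _, high in elements)
    return shortest, longest


print(length_range(HYDROPHOBIC))      # (16, 24)
print(length_range(BASIC_WITH_X1_5))  # (10, 23)
print(length_range(BASIC_WITH_X2_5))  # (11, 23)
```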
1.3. Nuclear export signal (NES)
Some proteins also need to be exported from the nucleus to carry out further functions elsewhere in the cell. For this purpose, their sequences carry nuclear export signals. Research on NESs started as early as the research on NLSs. The first nuclear export signal was described in HIV-1 [Fischer1995]. Later, interest in NESs grew alongside the publications on PY-NLSs. The best described type of nuclear export signal is the leucine-rich NES, but a consensus sequence for it tends to match random regions of a sequence as well [Dong2009]. Similar to the nuclear import signals, the export signals can be classified into more conserved groups of NESs [Kosugi3.2008]. The common pathway for nuclear export is the CRM1-dependent pathway [Fornerod1997]. CRM1 belongs to the importin-β protein family and acts as an exportin. Similar to the classical import pathway, the export protein binds Ran GTPase and then binds the NES of the cargo protein. The complex is then able to exit the nucleus [Kosugi2.2008].
1.4. NLSdb - Database of nuclear localization signals 1.0
The starting point for this thesis was NLSdb1.0, the database of nuclear localization signals, which contains 114 literature-curated, experimentally verified NLSs and 194 potential NLSs found through iterated ‘in silico mutagenesis’ [Nair2003]. It was published in 2003 and is still a frequently cited and used database. NLSdb1.0 is a reliable source for finding NLSs, but the developments and newly discovered signal types described above indicate that an update of the database is necessary.
1.5. Motivation
To understand the impact of this cellular mechanism, it helps to look briefly at the function of proteins entering the nucleus. Nuclear proteins can affect the regulation of gene expression. Nuclear import of proteins is therefore involved in essentially everything related to DNA regulation. For example, allergic reactions, which result from excessive inflammation, involve cytokines. Cytokines can be traced back to an imported protein that specifically changes the DNA replication rate [Aggarwal2014].
Given the importance of basic knowledge about nuclear transport signals, this Bachelor thesis had two main goals:
First, we aimed to update the first version of NLSdb. NLSdb1.0 is a reliable source for finding NLSs and their proteins. The database provided 114 experimentally verified and 194 computationally predicted potential NLSs. However, it was last updated in 2003. Much new research on NLSs and NESs has been done since then, and the database did not distinguish between different types of NLSs/NESs. We therefore updated NLSdb1.0 to the current state of available signal data.
Secondly, various protein sequence analyses were performed on the dataset of 2452 novel experimental and 4301 potential NLSs and their proteins to provide insights into their biology. To this end we asked several questions:
- Do homologous protein sequences have similar signals?
- From which organism are the proteins with known nuclear localization signals?
- What are the subcellular location annotations of proteins containing NLSs?
- Can we predict subcellular location using the potential signals?
- Can we define sub-groups within the signal types and refine consensus sequences for the different groups of signals?
Together with the new potential signals, these analyses can lead to a more comprehensive understanding of one of the cell's basic mechanisms, which affects cellular activity in many ways. They can also serve as a foundation for further studies.
2. Materials and Methods
2.1. Data collection of experimentally verified nuclear transport signals
NLSdb1.0 was released in 2003 [Nair2003]. To update it, experimental NLSs and NESs were collected from literature and databases published after 2003.
Different keywords were used to search for publications listing experimental NLSs: “importin binding signals”, “In vitro NLS”, “nuclear localization”, “nuclear localization signal datasets”, “nuclear localization signals review”, “bipartite NLS”, “PY-NLS”, “signal peptides nuclear import” and “signals for nuclear transport”. For finding NESs, the keyword “nuclear export signals” was used. In addition to providing lists of experimentally verified NLSs and NESs, the sources found gave a good overview of the biological background of nuclear transport signals, of the most recent and usable NLS databases and of computational tools for signal prediction. The publication of Marfori et al. was especially useful in this respect [Marfori2011].
To be accepted as a reliable source, a publication had to report experimentally verified signals. Our definition of reliable evidence conformed to the definition used for the experimentally verified signals curated for NLSdb1.0: “Our main criteria for ‘accepting’ NLSs were that the signal was proven sufficient to mediate the nuclear transport of a non-nuclear protein to the nucleus and that deleting the NLS prevented the nuclear import.” [Nair2003].
In addition to the literature curation, proteins and signals were extracted from Swiss-Prot entries [Swiss-Prot2004] containing an experimentally annotated NLS. Compared to the other sources for NLSs, Swiss-Prot was incomplete in its annotation of nuclear proteins and their NLSs. Some of the experimentally verified proteins from publications were filed in TrEMBL instead of Swiss-Prot, and some nuclear localization annotations and NLSs proven in the literature were missing in Swiss-Prot. For this reason, UniProt [UniProt2015] was the database used to extract annotations and accession numbers (ACs) for proteins from all sources.
2.1.1. NLSs
The sources used for the collection of nuclear localization signals are given below. From each source, the signals and the UniProt ACs of the proteins carrying them were extracted, together with information on organism, location annotation and GO terms. Table 1 shows the sources and the number of experimentally verified signals found in each of them.
Table 1: Number of experimental NLSs extracted from the different sources. The signal extraction procedure is described in sections 2.1.1.1.-2.1.1.6
2.1.1.1. The database NLSdb1.0
As mentioned in 1.4., NLSdb1.0 is a database containing 114 experimentally validated and 194 potential NLSs [Nair2003]. The 114 experimentally verified signals were collected by searching the literature. The 194 potential NLSs were created by an iterated in silico mutagenesis approach. In the first step of the algorithm, the amino acid at every position in the sequence of each experimental signal was mutated to another amino acid or removed. The mutated signals were then tested for matching exclusively in a dataset of nuclear proteins. Since its publication in 2003, NLSdb1.0 has been cited on average 10 times per year (Figure 1). Its web server was accessed by about 100 unique IP addresses per month. For each signal, NLSdb1.0 listed the UniProt [UniProt2015] ID of the protein that carries the signal.
Figure 1: Yearly citations of the NLSdb paper [Nair2003] since its publication in 2003 (source: Google.analytics.com).
2.1.1.2. Publication of Lange et al.
The publication of Lange et al. [Lange2008] provided monopartite and bipartite NLSs potentially imported via the classical pathway with importin-α [Lange2007]. In their study, the prediction tool PSORT 2 [Nakai1999] was used to find classical monopartite and bipartite signals in the yeast proteome from GenBank [Benson2013]. The predicted proteins were tested in vivo to obtain experimental evidence for them. A GFP-fusion screen was carried out to prove the nuclear localization of proteins containing such an NLS. All proteins using the classical pathway of nuclear import interact with importin-α [Curmi2010]. To find these interactions, the proteins of interest were looked up in the interaction database BioGRID [Aryamontri2015]. Only signals from proteins with both a nuclear localization and an interaction with importin-α were used for this study. After filtering the data according to these restrictions, 68 yeast proteins with multiple signals were left. In these proteins, 70 unique monopartite and 35 unique bipartite NLSs were listed.
2.1.1.3. Prediction tool SeqNLS
SeqNLS is a tool that predicts NLSs using pattern matching and a scoring scheme [Lin2013]. The tool was trained on the experimental signals of NLSdb1.0. Its test dataset was compiled from two subsets:
First, the yeast dataset of 43 proteins with 51 experimentally verified signals, collected from the literature for NLStradamus (another NLS prediction tool, based on Hidden Markov Models) [Ba2009].
Secondly, a hybrid dataset of 57 proteins from different organisms with 72 annotated NLSs, curated from literature published after 2010.
Altogether, 122 unique signals and 93 unique UniProt ACs were collected from SeqNLS.
2.1.1.4. The Swiss-Prot database
The source with the highest count of proteins used for this project was the Swiss-Prot database [Bairoch2004]. Information on the presence of a nuclear localization signal in a protein (location in the sequence, type of signal and evidence) was provided in the “Motif” or “Region” section of the “Family and Domain” annotation of its Swiss-Prot entry. For a signal to be included in our dataset, the “Motif” and “Region” sections were screened for the following keywords:
- Nuclear localization signal
- Bipartite nuclear localization signal
- Nuclear import signal
Note that during later stages of the work we discovered the following additional keywords for nuclear localization signals, which were not included in our dataset:
- Unconventional nuclear localization signal
- Required for nuclear localization
- Required (and sufficient) for nuclear import
- Sufficient for nuclear import
- Required for nucleolar localization
Note that the annotations of NLSs in the “Motif” and “Region” sections refer to the same kind of feature. Personal communication with the UniProt consortium revealed that the NLSs in the “Region” section were annotation errors and had been intended for the “Motif” section.
An additional criterion for including a nuclear localization signal was the evidence for its annotation. The evidence was given as codes of the “Evidence Codes Ontology” [Chibucos2014], ECOs for short.
These annotations were classified into manual or automatic assertions. Manual assertions were given by the following four ECOs:
- “ECO:0000269”, manually curated information with published experimental evidence.
- “ECO:0000250”, manually curated information propagated from a related experimentally characterized protein.
- “ECO:0000255”, manual assertions for information generated by the UniProtKB automatic annotation system (e.g. with Prosite-Rule [Sigrist2010] as source). This was also used for information generated by various sequence analysis programs during the manual curation process and verified by a curator.
- “ECO:0000305”, manually curated information inferred by a curator based on his/her scientific knowledge or on the scientific content of an article.
In total 3874 unique protein sequences with 2243 unique signals in either “Motif” or “Region” section were extracted from Swiss-Prot.
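The screening described in this section combines three conditions: the annotation section, the keyword and the evidence code. A minimal Python sketch of such a filter is given below. It is not the code used for the thesis: the dictionary layout of a feature record and the function name are hypothetical, and it assumes that keywords were compared as whole descriptions (ignoring case) and that exactly the four manual-assertion codes listed above were accepted.

```python
# Keywords screened for in the "Motif" and "Region" sections (from the text).
NLS_KEYWORDS = {
    "nuclear localization signal",
    "bipartite nuclear localization signal",
    "nuclear import signal",
}
# The four manual-assertion evidence codes listed in the text.
MANUAL_ECOS = {"ECO:0000269", "ECO:0000250", "ECO:0000255", "ECO:0000305"}
SECTIONS = {"Motif", "Region"}


def is_accepted_nls_feature(feature):
    """Check one feature record (a hypothetical dict layout) against all three criteria."""
    return (
        feature["section"] in SECTIONS
        # Assumption: the whole description must equal a keyword, so that
        # "Unconventional nuclear localization signal" is not picked up.
        and feature["description"].strip().lower() in NLS_KEYWORDS
        and feature["evidence"] in MANUAL_ECOS
    )


example = {
    "section": "Motif",
    "description": "Nuclear localization signal",
    "evidence": "ECO:0000269",
}
print(is_accepted_nls_feature(example))
```

Whole-description matching is chosen here because it reproduces the behaviour reported in the text, where annotations such as “Unconventional nuclear localization signal” were missed.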
2.1.1.5. PY-NLS sources
Two publications were used as sources for PY-NLSs. The first was published in 2006 by Lee et al. [Lee2006]. They focused on signals transported by the karyopherin-β2 pathway and defined the sequence of PY-NLSs as having an “overall basic character, and possess a central hydrophobic or basic motif followed by a C-terminal R/H/KX2-5PY consensus sequence” [Lee2006]. They provided 7 already known experimental NLSs and proved some of their 81 predicted PY-NLSs to be functional NLSs [Lee2006]. We extracted 9 signals with experimental evidence from this publication.
The other source was the publication of Süel et al. from 2008 [Süel2008], which collected experimentally validated and computationally predicted PY-NLSs. The authors tested the functionality of these signals as nuclear import signals with different in vivo and in vitro methods. Only signals with very reliable evidence were taken as data for this study [Süel2008]. We extracted 10 PY-NLSs with in vivo and in vitro evidence from this list.
Most of the signals were listed as consensus sequences together with their position in the protein sequence. To obtain comparable data, the signals were extracted from the protein sequences according to these positions. This resulted in 19 unique signals in 17 unique proteins.
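Extracting a signal by position amounts to slicing the protein sequence. The short sketch below illustrates this under the assumption that the published positions are 1-based and inclusive; the function name and the example sequence are hypothetical.

```python
def extract_signal(protein_sequence, start, end):
    """Return residues start..end of a protein (assumed 1-based, inclusive)."""
    if not 1 <= start <= end <= len(protein_sequence):
        raise ValueError("signal position lies outside the protein sequence")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return protein_sequence[start - 1:end]


# Made-up sequence with a signal annotated at positions 3-9.
print(extract_signal("MAPKKKRKVED", 3, 9))
```

Checking the coordinates against the sequence length catches position lists that refer to a different isoform or sequence version than the one retrieved.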
2.1.1.6. Others
Some other sources were discovered during the literature search for signals. They were not used directly, but they were informative and are mentioned here as a pointer for further studies.
The nuclear protein database (NPD) [Bickmore2002] is a well-structured database of known nuclear proteins from vertebrates. The database can be searched by sequence motifs, including monopartite NLSs. The results showed the start and stop positions of the signals, but not the evidence for these signals. NPD listed 1443 proteins with a monopartite NLS.
Another source of knowledge about sequence differences between NLSs was the work of Kosugi et al. [Kosugi2008]. Through an in vitro NLS screen in random peptide libraries, they provided about 500 sequences containing monopartite NLSs. Additionally, a mutational sequence analysis resulted in artificial NLSs that were experimentally verified by an in vivo approach. A consensus sequence tends to be either too general, matching random sequences, or too specific, missing putative NLSs [Kosugi2008]. In their study, Kosugi et al. characterized six different groups of nuclear localization signals imported through the classical pathway and provided a consensus sequence for each of them (Table 2). This classification resolved the problem that arises from a single consensus sequence. Additionally, the binding position on importin-α was determined for each signal type. Another observation from their study was the regulatory activity of the linker region in bipartite signals [Kosugi2008].
Table 2: Consensus sequences of six classes of importin-α-dependent NLSs taken from Kosugi et al. [Kosugi2008]
2.1.2. NESs
Five sources for collecting nuclear export signals were used. Table 3 lists the number of protein sequences with a NES found in each of the sources. Only the column for NESMapper shows the number of signals.
Table 3: Number of NESs extracted from the different sources
These sources focus on NESs exported via the common CRM1-dependent pathway and mostly contain leucine-rich NESs.
2.1.2.1. The database ValidNESs
The database ValidNESs contained 262 leucine-rich nuclear export signals in 221 protein sequences curated from the literature [Fu2012]. Its authors also provided NESsential, a tool that predicts nuclear export signals based on sequence, structure, disorder and solvent accessibility [Fu2011].
2.1.2.2. The NESdb
Another database for NESs was NESdb [Xu2012]. NESdb listed 175 proteins containing a NES with experimental evidence and a published reference. In addition, NESdb provided 196 proteins with putative NESs that had not yet been proven to be functional. It was published in 2012 and was another recent online source for nuclear proteins and NESs. Altogether, we extracted the signals of the 175 unique proteins.
2.1.2.3. The NESbase database
The third database specialized in nuclear export signals was NESbase [Cour2003]. Published in 2003, it was the earliest of the NES databases used here, and its authors were pioneers in collecting NESs. The signals listed in this database were also manually curated from the literature and show a leucine-rich motif. The database contained 75 proteins with one or more NESs, as well as information on whether the signals were necessary and sufficient for nuclear protein transport [Cour2003].
2.1.2.4. The Swiss-Prot database
As described in 2.1.1.4., NESs were searched for in the Swiss-Prot database [Bairoch2004]. The keywords for nuclear export signals in the “Motif” and “Region” sections were:
- Nuclear export signal
- nuclear export sequence.
Two additional keywords for nuclear export signals were discovered at a later stage of the work:
- required (and sufficient) for nuclear export
- Sufficient for nuclear export.
The evidence codes were the same as in 2.1.1.4. In total, 971 proteins containing 433 NESs were extracted from the Swiss-Prot “Motif” and “Region” sections.
2.1.2.5. The prediction tool NESMapper
The last source for NESs in Table 3 was NESMapper, another prediction tool for leucine-rich NESs [Kosugi2014]. Its prediction algorithm was based on profiles created by scoring activity-affecting residues in the signal sequence. The authors used datasets from three different sources to develop NESMapper:
- First, 205 NESs from ValidNESs
- Second, 32 signals from DUB NES (signals of the human deubiquitinase protein family) [Santisteban2012]
- Third, 311 artificial NESs from their own study.
Similar to their study described in 2.1.1.6., they screened for signals in random peptide libraries and applied a mutational approach to create the 311 artificial signals. The functionality of some of the artificial NESs was proven in vivo. Altogether, this source listed 343 experimentally verified nuclear export signals.
2.1.2.6. Others
As in 2.1.1.6., the nuclear protein database (NPD) [Bickmore2002] can be listed as an additional source. It was not included in the current datasets because of time constraints. The NPD listed 297 proteins containing a NES.
2.1.3. Test set - unannotated Swiss-Prot proteins
All eukaryotic proteins without a subcellular location annotation were collected from Swiss-Prot. This set was redundancy-reduced with UniqueProt [Mika2003] and then tested once with the signals of NLSdb1.0 and once with those of NLSdb2.0 to benchmark the two versions.
NLSdb1.0 is based on eukaryotic organisms: human, mouse, fly, worm, yeast and cress. NLSdb2.0 used all organisms in UniProt found to have an NLS, so a comparison based on eukaryotic proteins was suitable for both versions.
2.2. In silico mutagenesis
The signals and sources described so far served to collect an experimentally verified set of NLSs. An update of NLSdb1.0 also required potential signals. These were created by a mutagenesis approach similar to that of NLSdb1.0 [Nair2003].
2.2.1. Sets of nuclear and non-nuclear proteins
Before the mutated signals could be used, protein sets to match them against were needed. A nuclear and a non-nuclear protein set were created from proteins annotated in Swiss-Prot. Whether a protein was included in one of the sets depended on the subcellular location annotation in its Swiss-Prot file. Section 2.1.1.4. specifies the different ECOs and their meaning. Besides these ECOs, a location annotation without any evidence code also counts as reliable. The UniProt web page on evidence [Evidence2014] explained that not all annotations had yet been moved into the ECO system, but that annotations without a code are still backed by evidence. A protein was sorted into the nuclear set if it had a nucleus- or chromosome-related subcellular annotation either with “ECO:0000269” (manually curated information with published experimental evidence) or as plain text without an evidence code. The same restrictions on ECOs applied to the non-nuclear dataset. Every protein with an annotation other than “nucleus” or “chromosome” was sorted into the non-nuclear set.
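The sorting rule can be summarized in a few lines of Python. The sketch below is a simplified illustration and not the thesis code: the representation of annotations as `(location, evidence)` pairs, the substring test for nucleus- or chromosome-related terms, and the decision to put a protein with both nuclear and other trusted annotations into the nuclear set are all assumptions.

```python
# Accepted evidence: published experimental evidence, or no code at all
# (represented here as None).
ACCEPTED_EVIDENCE = {"ECO:0000269", None}


def is_nuclear_term(location):
    """Assumed test for a nucleus- or chromosome-related location term."""
    text = location.lower()
    return "nucle" in text or "chromosom" in text


def assign_set(annotations):
    """Map a list of (location, evidence) pairs to a set name or None."""
    trusted = [loc for loc, evidence in annotations if evidence in ACCEPTED_EVIDENCE]
    if not trusted:
        return None  # no annotation with accepted evidence: protein is skipped
    if any(is_nuclear_term(loc) for loc in trusted):
        return "nuclear"  # assumption: one trusted nuclear term is enough
    return "non-nuclear"


print(assign_set([("Nucleus", "ECO:0000269")]))
print(assign_set([("Cytoplasm", None)]))
print(assign_set([("Nucleus", "ECO:0000250")]))
```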
2.2.2. Mutagenesis approach
The development set of 2452 experimentally verified NLSs was used as training set for the iterative in silico mutagenesis approach. The algorithm was divided into three main steps:
First, the development set was reduced to keep only experimental NLSs that are found in proteins with an annotated nuclear location in Swiss-Prot. Only signals that did not occur in protein sequences of the non-nuclear set were taken. These signals were then tested for occurrence in the protein sequences of the nuclear dataset.
Secondly, we performed a mutational step, using the signals of the reduced development set as input. Figure 2 visualizes the in silico mutation with an example. Every signal was mutated at each position into all 20 amino acids. All possible mutations of every signal were tested again for their occurrence in the protein sequences of the non-nuclear and the nuclear dataset.
The last step was an iteration over the mutated signals. Only mutated signals matching in the nuclear proteins, but not in the non-nuclear proteins, were sorted into the result set and shortened by one position at the end of the signal. The shorter signals that still matched exclusively in the sequences of the nuclear protein set were shortened further. This was repeated until the resulting sequence matched in neither or in both of the two protein sets. All resulting signals formed the set of potential NLSs.
illustration not visible in this excerpt
Figure 2: In silico mutagenesis approach. First, every position in the initial NLS was mutated into all 20 amino acids. Secondly, the mutated signals matching exclusively in the nuclear protein sequences were shortened at their last position. The shortened signals were tested again to match only in the nuclear protein set. The matching signals were iteratively shortened until they did not match in the nuclear set, or matched in both the nuclear and the non-nuclear protein set. All mutated and shortened signals found to match the nuclear set only were included in a set of potential NLSs.
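The three steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the original implementation: matching is plain substring search, the unmutated residue is skipped in the mutation step, and `min_len` is a hypothetical lower bound on signal length that the text does not specify. All function names are invented for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def matches_any(signal, sequences):
    """True if the signal occurs as a substring in at least one sequence."""
    return any(signal in seq for seq in sequences)


def nuclear_only(signal, nuclear, non_nuclear):
    """True if the signal matches the nuclear set but not the non-nuclear set."""
    return matches_any(signal, nuclear) and not matches_any(signal, non_nuclear)


def point_mutants(signal):
    """All single-position substitutions of the signal (original residue skipped)."""
    for i in range(len(signal)):
        for aa in AMINO_ACIDS:
            if aa != signal[i]:
                yield signal[:i] + aa + signal[i + 1:]


def potential_signals(experimental, nuclear, non_nuclear, min_len=1):
    # Step 1: keep experimental signals found exclusively in nuclear proteins.
    reduced = [s for s in experimental if nuclear_only(s, nuclear, non_nuclear)]
    results = set()
    for signal in reduced:
        # Step 2: mutate every position into the other amino acids.
        for mutant in point_mutants(signal):
            # Step 3: shorten from the end while the match stays nuclear-only.
            current = mutant
            while len(current) >= min_len and nuclear_only(current, nuclear, non_nuclear):
                results.add(current)
                current = current[:-1]
    return results
```

Because a prefix of a matching signal always still matches the nuclear set, the loop in this sketch effectively ends when the shortened signal also appears in a non-nuclear protein or when `min_len` is reached.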
Because the signals were first mutated and then iteratively shortened, supersets appeared within the list of potential signals. A superset was defined as every set of potential signals containing another potential signal (see Table 4). Not all potential signals had supersets. To resolve these supersets, only the shortest signal of each superset was kept.
illustration not visible in this excerpt
Table 4: Example superset of potential signals. The experimental signal from the development dataset was mutated. All potential signals (the ones in the superset and the final signal) were matched only in protein sequences of nuclear proteins. We kept only the shortest potential signal for the final dataset of potential signals.
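A minimal sketch of this filtering step is given below. It assumes that "containing" means substring containment and that signals are compared as plain strings; the function name is hypothetical.

```python
def keep_shortest(signals):
    """Drop every signal that contains another, shorter potential signal."""
    kept = []
    # Visit short signals first so that longer ones can be checked against them.
    for signal in sorted(set(signals), key=lambda s: (len(s), s)):
        if not any(shorter in signal for shorter in kept):
            kept.append(signal)
    return kept
```

With the toy input `["KKARK", "KKAR", "PKKARK", "RRSS"]`, only `KKAR` and `RRSS` remain, since the other two signals contain `KKAR`.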
2.3. Data analyses
Due to time limits, all work described here was based on NLSs only.
2.3.1. Data pre-processing tools
Blast
Blast is the standard program for sequence comparison based on local alignments. We compared the protein sequences with Blast to infer the homology of the proteins in our datasets. The algorithm finds sequences in a database that are either identical or similar to a query sequence [Altschul1990]. For our purpose the PSI-Blast variant was used. In PSI-Blast the search is iterated based on position-specific scoring matrices (PSSMs). We chose an e-value below 0.001 and 2 iterations as appropriate parameters for significant alignments.
Cd-hit
Cd-hit is a tool we used for redundancy reduction of sequences [Li2001]. Cd-hit sorts the sequences by their length. The longest sequence becomes the representative of a cluster; other sequences are compared to the representative by sequence similarity and sorted into its cluster if the similarity is above a given threshold. If the similarity is lower, the compared sequence becomes the representative of a new cluster. In this way, a very fast grouping of all sequences is achieved. To speed up the algorithm, short words of 2 to 5 letters, representing high-identity segments, are compared between two sequences first. The number of shared short words corresponds to the sequence identity, so only proteins sharing more identical short words than a threshold are aligned.
We used Cd-hit on the sequences of proteins containing NLSs to identify sets with a sequence similarity of 100%, 80%, 60% and 40% (c = 1.0, 0.8, 0.6, 0.4, respectively). We chose a word length of 5 (n = 5) for 100% and 80% and of 4 (n = 4) for 60% and 40%. We matched the experimental NLSs to the sequences of the redundancy-reduced sets to see whether proteins with homologous sequences carry similar signals.
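The greedy, longest-first grouping described above can be illustrated with a short sketch. It is not Cd-hit itself: the short-word filter is omitted, and the `identity` function is a crude stand-in built on Python's `difflib` (matched characters divided by the shorter length) rather than a real alignment. Function names are hypothetical, and empty sequences are not handled.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Crude identity stand-in: matched characters divided by the shorter length."""
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    return matched / min(len(a), len(b))


def greedy_cluster(sequences, threshold):
    """Longest-first greedy clustering; returns {representative: [members]}."""
    clusters = {}
    for seq in sorted(sequences, key=len, reverse=True):
        for rep in clusters:
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No representative was similar enough: seq starts a new cluster.
            clusters[seq] = [seq]
    return clusters
```

The `threshold` argument plays the role of the identity cut-off (c) mentioned in the text; keeping one representative per cluster gives the redundancy-reduced set.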
Uniqueprot
Cd-hit is limited to a minimum of 40% sequence identity. Therefore, we used Uniqueprot, another tool for redundancy reduction of protein sequences [Mika2003], to compute protein sets with 20% sequence identity and to find the signals occurring in this set. The algorithms of Uniqueprot and Cd-hit differ. Uniqueprot uses Blast to align the sequences. A similarity score for each pair of sequences, called the HSSP value or HVAL (see Formula below), is calculated from the Blast alignment [Sander1991] [Rost1999]. Proteins whose pairwise HVAL is higher than a threshold (t), in our case t = 0, are grouped. For each group a greedy algorithm keeps one protein and removes the others. Afterwards, the HVAL between all remaining proteins is lower than the threshold and the sequence similarity is reduced. A threshold of HVAL = 0 corresponds to a sequence similarity of at most 20% for sequences with more than 450 aligned residues.
Formula: HSSP value formula, where PID = the percentage of identical residues in the alignment (number of identical residues reported by Blast * 100 / L) and L = length of the alignment (without gaps).
We applied the formula for the HSSP value to the alignment results from Blast. We investigated the distribution of similar protein sequences containing NLSs using the calculated HVALs to infer the homology of proteins carrying an NLS.
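Since the formula itself is not reproduced in this excerpt, the sketch below uses the HSSP curve as it is commonly stated following [Rost1999]; this is an assumption about the exact form used in the study. The function name and argument names are hypothetical. PID is the percentage of identical residues and L the ungapped alignment length.

```python
import math


def hssp_value(pid, length):
    """Distance of an alignment from the HSSP curve (assumed form, after Rost 1999).

    pid:    percentage of identical residues in the alignment (0 to 100)
    length: alignment length without gaps
    """
    if length <= 11:
        return pid - 100.0
    if length <= 450:
        exponent = -0.32 * (1.0 + math.exp(-length / 1000.0))
        return pid - 480.0 * length ** exponent
    return pid - 19.5
```

Under this form, an alignment longer than 450 residues reaches HVAL = 0 at 19.5% identity, which agrees with the roughly 20% limit quoted above for t = 0.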
Clustering approach from Mikael Bodén’s lab at the University of Queensland
To identify unique groups within the sets of each type of NLS, the signal sequences were clustered using an approach developed by Mikael Bodén’s lab at the University of Queensland. The advantage of this method over other clustering approaches is its ability to process a large number of sequences (> 4000 sequences). Working with sequence motifs always calls for a consensus rule. For this reason, the clustering and grouping of the NLS types was used to refine the consensus sequences.
We performed pairwise alignments of all NLS sequences of each signal type to calculate the most likely evolutionary distances between them (in the form of a distance matrix). We used these distances to construct a phylogenetic tree for each type of NLS with the hierarchical clustering method UPGMA (unweighted pair-group method using arithmetic averages) [Sneath1973]. From the phylogenetic trees we could identify distinct sub-groups of NLSs, inspired by a large-scale clustering approach developed by Krause et al. [Krause2005]. Consensus patterns in the form of position-weight matrices (PWMs) were derived for the sub-groups. We visualized these consensus sequences as sequence logos. We scored alignments of the set of potential NLSs against the PWMs of the monopartite sub-groups. To find the most similar sub-groups for each potential NLS, we ranked the PWM match scores of the potential sequences against those of a background composed of 10000 completely random sequences. The monopartite signals in high-ranking groups were aligned with all the best-matching potential sequences using Clustal Omega [Sievers2011]. We visualized these alignments by creating additional sequence logos with Weblogo [Crooks2004]. This allowed us to extend and refine the consensus sequences for each sub-group of monopartite NLSs.
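The PWM scoring and background ranking can be illustrated with a simplified sketch. The details of the actual method are in the supplementary material and are not reproduced here, so everything below is an assumption made for illustration: log-odds scores against a uniform background, a pseudocount of 1, ungapped window scoring, and random sequences of the same length as the query. Only the 20 standard amino acid letters are supported.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pwm(aligned, pseudocount=1.0):
    """Log-odds PWM from equal-length sequences against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    pwm = []
    for pos in range(len(aligned[0])):
        column = [seq[pos] for seq in aligned]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pwm.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pwm


def best_score(pwm, seq):
    """Best ungapped window score of the PWM along seq (-inf if seq is too short)."""
    width = len(pwm)
    scores = [
        sum(pwm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    ]
    return max(scores, default=float("-inf"))


def background_rank(pwm, seq, n_random=10000, seed=0):
    """Fraction of random sequences scoring at least as high as seq."""
    rng = random.Random(seed)
    observed = best_score(pwm, seq)
    hits = 0
    for _ in range(n_random):
        rand = "".join(rng.choice(AMINO_ACIDS) for _ in range(len(seq)))
        if best_score(pwm, rand) >= observed:
            hits += 1
    return hits / n_random
```

A small `background_rank` means that few random sequences match the sub-group's PWM as well as the potential NLS does; the sub-group with the smallest value would be the most similar one in this sketch.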
A detailed explanation of the formulas and methods used in the clustering approach can be found in the supplementary material (on the DVD) - provided by Julian Zaugg, a PhD student in the group of Mikael Bodén.
2.3.2. Protein function and NLS prediction tools
PredictProtein
PredictProtein [Yachdav2014] is, as the name implies, a collection of prediction tools for structural and functional protein properties. We were interested in the subcellular location of proteins for which Swiss-Prot lacks an annotation. Loctree3 [Goldberg2014], a prediction tool for subcellular locations [Nair2005], was used to predict the location annotations of these proteins.
PredictNLS
PredictNLS [Cokol2000] is a tool that searches for the signals of NLSdb1.0 in a query sequence. The output file contains a list of the signals found in the sequence together with their positions. Additionally, it reports whether the protein binds DNA or not. PredictNLS was used to compare the signals from this study to NLSdb1.0.
3. Results and Discussion
The data from all sources were extracted and separated into three sets: monopartite, bipartite and PY-NLSs. The corresponding UniprotKB-IDs and gene names of proteins with NLSs were mapped to Uniprot-ACs for further analyses.
3.1. Experimental development dataset
illustration not visible in this excerpt
Table 5: Comparison of experimental NLSs from different sources. Combining the signals of all sources we got 2603 NLSs. Between the sources some of the signals overlapped. Of the signals of NLSdb1.0, 48 NLSs were found in the data collected from Swiss-Prot. The NLSs from Lange et al. and SeqNLS overlapped in 6 NLSs and both overlapped with the signals extracted from Swiss-Prot, in 16 and 19 signals, respectively. The field “Additional to NLSdb1.0” lists 2452 unique signals collected in this study. This number is the number of all signals without the signals of NLSdb1.0 and without the overlapping signals between all sources (in total 37 unique signals overlapped).
Table 5 lists the number of NLSs extracted from each source and the overlap of signals between the sources. The overlap was determined by a simple sequence comparison. On the right side, the column “All unique signal” shows the 2603 signals extracted from all sources. The row “Additional to NLSdb1.0” shows that 2452 new experimental signals were collected in this study, in addition to the signals of NLSdb1.0. The first comparison row shows that the signals in NLSdb1.0 had no overlap with the signals from Lange et al. or Lin et al. Furthermore, only 48 signals of NLSdb1.0 were in the dataset collected from Swiss-Prot. A closer check of the 114 signals and their proteins confirmed that the proteins in NLSdb1.0 had published evidence for their NLSs, but as yet no signal annotation in Swiss-Prot. The other two published sources also had a low overlap of 6 NLSs between their experimentally verified signals. Both overlapped with the signals extracted from the Swiss-Prot motif annotations: 16 signals from Lange et al. and 19 signals from SeqNLS were included in the Swiss-Prot data. Some of the 6 signals overlapping between SeqNLS and Lange et al. also overlapped with the signals from Swiss-Prot. In total, the number of unique overlapping signals is 37 (without considering the signals of NLSdb1.0). At first glance, the collection of new signals was a more than 20-fold increase over the data of NLSdb1.0. The results of the further analysis steps shed light on the quality of the new signals. It should be mentioned that the 114 signals of NLSdb1.0 included consensus patterns. These could not be matched by a simple string comparison and were therefore compared manually to the other signals. None of the manually compared signals was found in the datasets from other sources.
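The simple sequence comparison behind Table 5 amounts to set operations on signal strings. The sketch below is illustrative only: the function names and the dictionary layout (source name mapped to a set of signal strings) are assumptions, and consensus patterns, which were compared manually, are not covered. The reported counts are consistent with 2603 - 114 - 37 = 2452.

```python
from itertools import combinations


def pairwise_overlaps(sources):
    """Number of identical signal strings shared by each pair of sources."""
    return {
        (a, b): len(sources[a] & sources[b])
        for a, b in combinations(sorted(sources), 2)
    }


def additional_to(sources, reference_name):
    """Signals found in any other source but not in the reference source."""
    others = set().union(
        *(signals for name, signals in sources.items() if name != reference_name)
    )
    return others - sources[reference_name]
```

Called on a dictionary of signal sets per source, `pairwise_overlaps` gives the off-diagonal counts of such a table and `additional_to` the signals that are new relative to one chosen reference source.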
[...]
- Quote paper
- Silvana Wolf (Author), 2015, Analysis of Nuclear Transport Signals, Munich, GRIN Verlag, https://www.grin.com/document/365482