Visualising how correlation networks change over time
BSc. Thesis
H.G.E. Bongers
Department of Data Science and Knowledge Engineering, Faculty of Humanities and Sciences, Maastricht University
June
Abstract
This thesis researches and develops a tool for visualising gene expression data. It describes and illustrates how correlations between genes change over a time course. The tool can be used widely in biology (e.g. toxicology, studying the cell cycle, studying the progression of diseases), and the 3D environment aims to give better insight into the gene expression data.
For this research the given data is first preprocessed, after which the unexpressed genes are filtered out. The remaining genes are used to build a correlation matrix. The correlation matrix is clustered into smaller matrices to make it easier for algorithms to analyse the data. An advantage of clustering is that both the inter- and intra-cluster correlations can be visualised. After the clustering step, two filtering algorithms are applied to each cluster to obtain the final correlation networks. The clustering and filtering are done for every time point, and the data is interpolated between time points for a smooth simulation. All clusters are visualised in a 3D environment and made interactive by using virtual reality.
The goal of this thesis is to develop a tool that gives insight into how correlation networks change over a time course, and to answer the following questions: Which simulation techniques work best? Which layout algorithm gives the best visual outcome? Will a 3D environment improve readability of the data? Which filters are applicable to the data?
[...] kinds of diseases in the future. This thesis focuses on gene expression data, but other kinds of data (even non-biological data) also work in this application.
This thesis first discusses the methods used and then presents the results. The results are discussed in the conclusion, followed by suggestions for future research.
1.1 Research questions
- Will a 3D environment improve readability of the data compared to a 2D environment?
- Which simulation technique works best?
- Which layout algorithm is easiest to read?
- How can large data sets be filtered for visualisation?
- How many nodes can a network have before it becomes unmanageable?
2 Methods
This section describes the steps taken to reach the results in section 3. The data set used throughout is a gene expression data set, and all methods described below are carried out by the application.
2.1 Preprocessing and filtering
Clustering
The first step of preprocessing is clustering the data. Clustering divides the input data into a number of clusters chosen by the user (the application offers a range of cluster counts to choose from). Its advantage is that the data is split into several smaller data sets, which is useful when processing a very large data set. The clustering algorithm used in this application is k-means.
K-means clustering divides the data into k smaller data sets (clusters). The clusters are obtained by the following steps:
- Step 1: Initialise k means at random within the domain of the data.
- Step 2: Form k clusters by assigning every data point in the data set to the cluster whose mean is closest to it.
- Step 3: Recalculate the mean of each cluster from the clusters obtained in step 2.
- Step 4: Repeat steps 2 and 3 until the clusters have converged.
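The four steps can be sketched in Python with NumPy as follows. This is a minimal illustration and not the code of the application: the function name `k_means`, the parameters `max_iter` and `seed`, the use of Euclidean distance and the handling of empty clusters are all assumptions made for this sketch.

```python
import numpy as np


def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` (n_samples x n_features) into k clusters.

    Returns (labels, means). Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)

    # Step 1: draw k initial means at random inside the bounding box of the data.
    low, high = points.min(axis=0), points.max(axis=0)
    means = rng.uniform(low, high, size=(k, points.shape[1]))

    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Step 2: assign every point to its closest mean (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)

        # Step 4: stop as soon as no assignment changes any more.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Step 3: recompute each mean from its members.
        # Assumption: an empty cluster simply keeps its previous mean.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                means[j] = members.mean(axis=0)
    return labels, means
```

Because the initial means are random, different seeds can give different clusterings; running the sketch with several seeds and keeping the tightest result is a common remedy.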
illustration not visible in this excerpt
Figure 1: An example of k-means clustering on an arbitrary 3D data set
In other words, the k-means algorithm assigns each unclassified data point to the cluster whose mean is at minimum distance, which can be described mathematically as
illustration not visible in this excerpt
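The equation itself is not reproduced in this excerpt. As an assumption about what it expressed, the standard k-means objective that matches the description above is given here. The symbols are chosen for this sketch and are not taken from the thesis: S = {S_1, ..., S_k} denotes the clusters and \mu_i the mean of cluster S_i.

```latex
\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{\lvert S_i \rvert} \sum_{x \in S_i} x
```

Step 2 of the algorithm minimises this sum over the assignments for fixed means, and step 3 minimises it over the means for fixed assignments.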
The k-means algorithm is fast, simple to implement and handles large data sets easily. It can cluster both 2D and 3D data. The resulting clusters are significantly smaller than the full data set, which makes the calculations in the following methods faster. Since the input data can be arbitrary, the k-means algorithm is well suited to dividing our data.
Expressed and unexpressed genes
When the input is a gene expression data set, it can be useful to filter out all unexpressed genes, which are most likely of little interest in the experiment. The unexpressed genes are filtered out by the following steps:
- Step 1: Take log2(x) for every value x in the data set.
- Step 2: Make a histogram of the logarithmic data set.
- Step 3: Decide the cutoff value based on the histogram. The histogram will show two peaks. The cutoff value is the value where the two distributions cross.
- Step 4: Filter out every value under the cutoff value, since those are the unexpressed genes. The remaining genes are the expressed genes.
illustration not visible in this excerpt
Figure 2: This histogram is an ideal example of how expressed and unexpressed genes separate. A left distribution of unexpressed genes and a right distribution of expressed genes can clearly be observed. The red line represents the chosen cutoff value.
The histogram shows a bimodal distribution (see Figure 2) from which the cutoff value can be read. The right peak, which is normally distributed, contains the truly expressed genes. The difficulty lies in choosing a cutoff that keeps as many of these genes as possible while including as few values as possible from the left peak. The two peaks also overlap to a large extent, and the ratio of unexpressed to expressed genes differs between data sets (see Figure 3). There is no single optimal cutoff value. To cut down the number of genes drastically, it is recommended to be as strict as possible.
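The filtering of unexpressed genes can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function name `filter_expressed` and the parameter `bins` are hypothetical, the cutoff is supplied by the user after inspecting the histogram, and a gene is kept when its mean log2 expression over all samples reaches the cutoff (the per-gene averaging is an assumption of this sketch).

```python
import numpy as np


def filter_expressed(expression, cutoff, bins=50):
    """Return (kept_gene_indices, histogram_counts, bin_edges).

    `expression` is a genes x samples array of strictly positive values.
    """
    expression = np.asarray(expression, dtype=float)

    # Step 1: log2-transform every value in the data set.
    log_expr = np.log2(expression)

    # Step 2: histogram of all log-transformed values; in practice this is
    # plotted so that the cutoff between the two peaks can be chosen.
    counts, edges = np.histogram(log_expr, bins=bins)

    # Steps 3-4: keep the genes whose mean log2 expression reaches the cutoff.
    # Assumption: averaging per gene; the thesis does not specify this detail.
    keep = np.flatnonzero(log_expr.mean(axis=1) >= cutoff)
    return keep, counts, edges
```

A stricter (higher) cutoff removes more genes, which follows the recommendation above when the data set has to be reduced drastically.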
Variance and correlation filtering
The next two filters used to reduce the size of the data set are similar. The variance and correlation filters both remove genes based on the size of the data set they receive as input, because the resulting matrix must stay under a certain size for the computations on it to run in reasonable time.
The variance filter keeps the genes that have a high variance in the data set. These genes are more interesting because a high variance indicates that a gene reacts differently from other genes. The formula for the threshold value of the variance filter is as follows:
illustration not visible in this excerpt
where n is the size of the data set, k is a tuning variable and x_i is a data value in the data set.
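A variance filter of this kind can be sketched as below. The threshold formula of the thesis is not visible in this excerpt, so the sketch substitutes a hypothetical rule, k times the mean variance over all genes, purely for illustration. The function name `variance_filter` is also an assumption.

```python
import numpy as np


def variance_filter(expression, k=1.0):
    """Keep the genes (rows) whose variance across samples is high.

    Returns (kept_gene_indices, threshold).
    """
    expression = np.asarray(expression, dtype=float)

    # Variance of every gene over its samples / time points.
    variances = expression.var(axis=1)

    # Hypothetical threshold, NOT the formula from the thesis:
    # the tuning variable k scales the mean variance of all genes.
    threshold = k * variances.mean()

    keep = np.flatnonzero(variances >= threshold)
    return keep, threshold
```

Raising k makes the filter stricter and the resulting data set smaller.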
illustration not visible in this excerpt
Figure 3: This second example is gene expression data from the Saccharomyces cerevisiae oxidative stress response. In this example it is much harder to see a difference between the distributions. The red line represents the chosen cutoff value.
Correlation filtering works in the same way as variance filtering. The correlation filter sets a threshold on a gene's correlation with the other genes. Genes with a low average correlation might be less interesting to look at than the highly correlated genes. This is especially important when examining which genes correlate with each other in a disease. The formula for the threshold value of the correlation filter is as follows:
illustration not visible in this excerpt
where n is the size of the data set, k is a tuning variable and x_i is a data value in the data set.
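A correlation filter along these lines can be sketched as follows. As with the variance filter, the threshold formula is not visible in this excerpt, so the rule used here (k times the mean of the per-gene average correlations) is a hypothetical stand-in. The use of Pearson correlation, of absolute values, and the function name `correlation_filter` are likewise assumptions. Genes with constant expression must be removed beforehand, since their correlation is undefined.

```python
import numpy as np


def correlation_filter(expression, k=1.0):
    """Keep the genes (rows) with a high average correlation to the other genes.

    Returns (kept_gene_indices, correlation_matrix, threshold).
    """
    expression = np.asarray(expression, dtype=float)

    # Gene x gene Pearson correlation matrix (rows are treated as variables).
    corr = np.corrcoef(expression)
    n_genes = corr.shape[0]

    # Average absolute correlation of each gene with all OTHER genes.
    abs_corr = np.abs(corr)
    np.fill_diagonal(abs_corr, 0.0)
    mean_abs = abs_corr.sum(axis=1) / (n_genes - 1)

    # Hypothetical threshold, NOT the formula from the thesis.
    threshold = k * mean_abs.mean()

    keep = np.flatnonzero(mean_abs >= threshold)
    return keep, corr, threshold
```

The returned correlation matrix can be reused for building the correlation network of the remaining genes.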
The correlation and variance filters are used to reduce the size of the data set so that further calculations become easier. The application allows these filters to be turned off. If they are turned off, the data sets will be significantly larger and computation time will rise. This is discussed in the experiments section.
Discussion
Preprocessing and filtering the data is an important step towards a good, robust solution. It is important that the input data set is in a standard format. The format should include multiple time points and multiple replicates of the experiment. The title of the data should be in the format: "[name experiment] + t = [timepoint in hours]". How strictly the data is preprocessed and filtered depends on how exact the resulting data set needs to be. The tuning variables let the user decide how exact the resulting data must be. The more precise the data, the more computationally intensive the program becomes and the longer it takes to compute.
[...]
- Quote paper
- Hendricus Bongers (Author), 2016, Visualising how correlation networks change over time, Munich, GRIN Verlag, https://www.grin.com/document/370173