Abstract
The Spearman Correlation is a well-known approach to assessing the rank correlation of two data sets. One advantage of choosing Spearman over other correlation coefficients, such as Pearson’s, is that differences in the original value series matter less, while the relative rank of each value matters most. The Spearman correlation coefficient is often used to assess and validate the performance of models that require less accuracy in the absolute value of the estimate, e.g. loss prediction models or exposure models. Although other measures such as Kendall’s τ and Somers’ D are used to measure rank ordering with tied observations, Spearman’s ρ is often calculated as an initial step of correlation analysis. In this short note we look into tied observations in the target data set and investigate their impact on the Spearman Correlation coefficient in three different scenarios: single value ties, random multi-value ties, and bounded random ties.
Keywords: Spearman Correlation, Tied Observations, Sensitivity Analysis.
Introduction
The Spearman Correlation, or Spearman’s rank correlation coefficient, is considered the nonparametric version of the Pearson correlation and an appropriate measure of both the strength and the direction of association between ranked data sets.
It is worth mentioning that although the Spearman correlation describes the strength and direction of a monotonic relationship, more often than not the data sets being compared do not have a significant monotonic relationship between the value series of interest, e.g. the loss and exposure estimates in finance. In this case, performing a Spearman correlation analysis will also help to find out whether there is a monotonic relationship between the data series.
Because the Spearman Correlation is a well-known and widely accepted concept, we omit the technical details and the mathematical formulation of the Spearman Correlation and the accompanying statistical significance test in the rest of this paper. For details, see references 1 and 2.
To find out how the correlation coefficient and the level of significance change as both the value and the percentage number of ties change, we look into the following three scenarios in the next few sections: a) Single value ties. b) Random multi-value ties. c) Bounded random ties.²
For each analysis we set the value range to (0, 1), randomly generate 2000 observations, and randomly select the number of tied observations in the sample. The Spearman Correlation is calculated between the original values and the adjusted series with tied observations. We repeat this calculation a large number of times (5000 for all results in this paper). The final result under each scenario is summarized in the following sections.
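The note does not say which software was used for the calculation. The following is a minimal sketch in plain Python, assuming the usual tie-aware form of Spearman’s ρ: tied values receive the average of the ranks they span, and ρ is the Pearson correlation of the two rank series. The function names and the fixed seed are illustrative choices, not taken from the paper.

```python
import random
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the average ranks; NaN if either series is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


# One sample as described above: 2000 observations in the range (0, 1).
rng = random.Random(0)  # fixed seed, chosen only for reproducibility
original = [rng.random() for _ in range(2000)]
rho_untied = spearman_rho(original, original)  # no ties introduced yet
```

In a full experiment, `spearman_rho` would be called on the original series and on a copy in which some observations have been overwritten with tied values, and the call would be repeated across trials.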
Single value ties
We start with the simple case in which all tied observations tie into the same value. Here we run the tests with the tied value located at 0%, 5%, 10%, ..., 100% of the value range, in fixed increments of 5%. In Figure 1 below, we show some examples of the random samples in different trials.
Figure 1: Example of single value ties
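How the tied series is built is not spelled out in the text, so the sketch below is only one plausible reading: a random subset of the observations is overwritten with a single shared value, and that value is moved across the range in 5% steps. The function name `apply_single_value_ties` and its parameters are hypothetical.

```python
import random


def apply_single_value_ties(values, tie_fraction, tie_value, rng):
    """Copy `values`, overwriting a random `tie_fraction` of entries with `tie_value`."""
    n_tied = round(tie_fraction * len(values))
    tied_indices = rng.sample(range(len(values)), n_tied)
    adjusted = list(values)
    for i in tied_indices:
        adjusted[i] = tie_value
    return adjusted


rng = random.Random(1)  # fixed seed, chosen only for reproducibility
original = [rng.random() for _ in range(2000)]

# Tie locations at 0%, 5%, ..., 100% of the value range (0, 1).
tie_locations = [step * 0.05 for step in range(21)]

# Example: half of the observations tied at the centre of the range.
adjusted = apply_single_value_ties(original, 0.5, tie_locations[10], rng)
```

The non-tied entries of `adjusted` are identical to those of `original`, which matches the statement that the non-tied values are kept the same in both series.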
For any fixed number of ties within the sample (50% in the figure shown below), one can observe how the Spearman correlation coefficient changes as the location of the ties moves in steps of 5% of the value range. It is also clear that the same trend can be found for all percentages of ties, as shown in Figure 2.
Figure 2: Sensitivity of Spearman Correlation as location of ties change.
We see that, for a fixed number of ties, ties located at around 50% of the value range always achieve the maximum correlation. In Figure 3 we also plot the results by percentage number of tied samples, grouped by the mean, maximum and minimum correlations obtained.
Figure 3: Spearman Correlation: percentage number of ties vs. location of ties
The view in Figure 4 provides full details of the results. It is clear that the Spearman Correlation coefficients obtained when the ties are located in the centre of the value range are the most consistent and the highest in numeric value, whereas the coefficients obtained when the ties are located at the far ends of the value range are the most volatile and lower in numeric value.
Figure 4: Significant only: Correlation by percentage number of ties and location of ties
Random multi-value ties
The random multi-value case, in which the tied observations tie into a number of different values, is more intuitive from a model estimation point of view and closer to real observations. Here we assume that ties can occur at up to 10 different values within the value range. In Figure 5 below, we show some examples of the random samples in different trials.
Figure 5: Example of random multi-value ties
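A sketch of how such a sample could be generated is given below. It assumes that the number of tie locations is drawn uniformly between 1 and 10, that each location is drawn uniformly from the value range, and that each tied observation picks one of the locations at random. None of these choices is stated in the note, and the function name is hypothetical.

```python
import random


def apply_multi_value_ties(values, tie_fraction, max_locations, rng):
    """Overwrite a random `tie_fraction` of entries with one of a few shared values."""
    n_locations = rng.randint(1, max_locations)  # how many distinct tie values
    tie_values = [rng.random() for _ in range(n_locations)]  # locations in (0, 1)
    n_tied = round(tie_fraction * len(values))
    adjusted = list(values)
    for i in rng.sample(range(len(values)), n_tied):
        adjusted[i] = rng.choice(tie_values)
    return adjusted, tie_values


rng = random.Random(2)  # fixed seed, chosen only for reproducibility
original = [rng.random() for _ in range(2000)]
adjusted, tie_values = apply_multi_value_ties(original, 0.3, 10, rng)
```

Because the tie locations are drawn afresh in every trial, results at a given percentage of ties mix many different locations together.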
In this scenario, the observations at each percentage number of ties are mixed together, as we allow up to 10 different locations of ties within the value range. The maximum, minimum and mean values are calculated across all trials, as we only want to highlight the effect of the percentage number of ties here.
Also, for each correlation coefficient calculated, we consider only the significant results, i.e. those with a p-value of less than 5%. The result is shown in Figure 6.
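The significance test itself is not described in this note. As an illustration only, the sketch below assumes the common t-type statistic for a correlation coefficient and, because n = 2000 is large, replaces the t distribution with a standard normal. The two-sided alternative and the function names are assumptions as well.

```python
from statistics import NormalDist


def spearman_p_value(rho, n):
    """Approximate two-sided p-value for H0: no monotonic association.

    Uses a large-sample normal approximation to the t distribution,
    which is an assumption of this sketch.
    """
    if abs(rho) >= 1.0:
        return 0.0
    # t-type statistic for a correlation coefficient with n observations.
    t_stat = rho * ((n - 2) / (1.0 - rho * rho)) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))


def keep_significant(coefficients, n, alpha=0.05):
    """Keep only the coefficients whose approximate p-value is below `alpha`."""
    return [rho for rho in coefficients if spearman_p_value(rho, n) < alpha]
```

Applying `keep_significant` to the coefficients from all trials at a given percentage of ties would give a "significant only" view, while skipping the filter would give an "all results" view.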
Figure 6: Significant only: Random multi-value ties
To demonstrate the impact of insignificant correlation coefficients, Figure 7 illustrates that the significance test indeed helps to identify the minimum value, especially at the lower end.
Figure 7: All results: Random multi-value ties
One may also notice that, because the random locations of ties across trials are symmetrical around the centre of the value range at 50%, we obtain a linear relationship between the Spearman coefficient and the percentage number of ties in the sample. If the location of ties is unknown to the observer, Figure 6 illustrates the expected range of the coefficient value given the number of ties within the underlying data set. However, we would also highlight that this analysis does not consider model estimation errors: as mentioned at the beginning of this paper, the non-tied idiosyncratic values in the data set are kept the same in both data series involved in the calculation of this coefficient.
Bounded random ties
The bounded random ties scenario is a special case of random multi-value ties in which a large number of observations are found at the extreme values. Here we assume that at most 2 other tied values exist between the boundaries. In Figure 8 below, we show some examples of the random samples in different trials.
Figure 8: Example of bounded random ties
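One way to generate such a sample is sketched below. It assumes that the two boundaries of the value range are always tie locations, that 0, 1 or 2 further locations are drawn uniformly inside the range, and that each tied observation picks one of these locations with equal probability. The split between boundary and interior ties is not given in the note, so this is an illustrative choice, and the function name is hypothetical.

```python
import random


def apply_bounded_ties(values, tie_fraction, rng, lower=0.0, upper=1.0, max_interior=2):
    """Overwrite a random share of entries with the bounds or a few interior tie values."""
    n_interior = rng.randint(0, max_interior)  # 0, 1 or 2 extra tie values
    interior = [rng.uniform(lower, upper) for _ in range(n_interior)]
    tie_values = [lower, upper] + interior
    n_tied = round(tie_fraction * len(values))
    adjusted = list(values)
    for i in rng.sample(range(len(values)), n_tied):
        adjusted[i] = rng.choice(tie_values)
    return adjusted, tie_values


rng = random.Random(3)  # fixed seed, chosen only for reproducibility
original = [rng.random() for _ in range(2000)]
adjusted, tie_values = apply_bounded_ties(original, 0.7, rng)
```

With this construction a large share of the tied observations sits at the lower or upper limit, which is the feature that distinguishes this scenario from the random multi-value case.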
Similar to the analysis in previous sections, we look into both the ‘significant only’ results, in Figure 9, and ‘all’ results in Figure 10.
Figure 9: Significant only: Bounded random ties
Comparing the bounded results with the random multi-value ties, we see that with the boundary effect we lose the linearity in the mean value of the coefficient. This result is in fact in line with what was shown in Figure 4 for cases where the tied values are located at 0% or 100% of the value range.
Figure 10: All results: Bounded random ties
As for the impact of the significance test, in Figure 9 and Figure 10 we see that the minimum coefficient is found at around 70% tied samples, whereas in Figures 6 and 7 for the random multi-value scenario the lower limit is found at around 85%.
The lower limit of the coefficient found in both cases (bounded and random multi-value ties) is 3.68%, with small differences at the fifth decimal place. While the numerical value might be a result of the value range assumptions and sampling methods used in this analysis, one can conclude that the boundaries at the upper and lower limits of the range lead to the convex relationship between the percentage of ties and the correlation coefficient. In short, the expected strength and direction of the monotonic relationship are relatively lower when a large number of ties are observed at the upper and lower boundaries.
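One possible reading of the 3.68% lower limit is offered here as an assumption, not as something stated in this note. If the significance test is the usual t-type test for a correlation coefficient, applied one-sided at the 5% level, then with n = 2000 observations the smallest coefficient that can pass the test is about 0.0368. In the worked equation below, t_α ≈ 1.645 is the large-sample one-sided 5% critical value and ρ_crit is a symbol introduced only for this illustration.

```latex
t = \rho \sqrt{\frac{n-2}{1-\rho^{2}}}
\quad\Longrightarrow\quad
\rho_{\text{crit}} = \frac{t_{\alpha}}{\sqrt{n-2+t_{\alpha}^{2}}}
\approx \frac{1.645}{\sqrt{1998 + 1.645^{2}}}
\approx 0.0368
```

Under that assumption the limit depends only on the sample size and the significance level, not on the tie scenario, which would be consistent with both scenarios reaching the same value.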
Conclusions
In this paper, we revisited the calculation of the well-known Spearman Correlation coefficient under three different scenarios of underlying data with tied observations. All of the analysis is done without consideration of model estimation errors: the non-tied values are kept the same. Note, however, that the original rank ordering in the data changes as observations tie into different locations during this analysis.
For the scenario in which ties are observed at only one location, we have shown that the maximum correlation can be expected when the ties are located in the middle of the value range, as shown in Figure 2. By contrast, a more volatile and numerically lower coefficient is observed when the ties are located at the upper or lower limit of the value range.
For random multi-value ties, we have shown that there is a linear relationship between the mean coefficient calculated and the percentage number of ties. Also, as a result of the significance test, the lower limit of the coefficient is observed at around 85% of ties.
When the ties are bounded at both ends of the value range, we show two things. First, the linear relationship between the mean coefficient and the percentage of ties is replaced with a convex relationship; in other words, the strength and direction of correlation are expected to decrease faster than in the random multi-value scenario. Second, the lower limit of the significant correlation coefficient is found at 70% instead of 85% as in the random multi-value case, which is in line with the convex relationship mentioned above.
Comparing the ‘width’ between the minimum and maximum values at each percentage number of ties, we see that this measure appears to be stable in the random multi-value scenario, but increases with the percentage number of ties in the bounded scenario.
In summary, without consideration of estimation error, this paper has provided a scenario-based analysis of the Spearman Correlation coefficient relative to the percentage number of ties and the location of tied observations. Our results suggest that ‘where’ and ‘how’ the tied data points are observed in the underlying data set are important factors to investigate and should be reported together with the assessment of rank ordering ability using the Spearman Correlation test.
References
1 Myers, Jerome L.; Well, Arnold D. Research Design and Statistical Analysis (2nd ed.). Lawrence Erlbaum, p. 508. ISBN 0-8058-4037-0.
2 Yule, G. U.; Kendall, M. G. An Introduction to the Theory of Statistics (14th ed.). Charles Griffin & Co., p. 268.
[...]
1 Yang Liu is a quantitative specialist at the HSBC Bank. Before joining the HSBC, he worked for the Barclays Investment Bank. Yang holds a doctorate in quantitative finance from Cass Business School, City University of London. He has published a number of papers on quantitative methods in risk and finance. The opinions expressed in this paper are those of the author and do not necessarily reflect views of any of the aforementioned Companies and Entities. HSBC, 8CS Canary Wharf, London E14 5HQ, UK
Frequently asked questions
What is Spearman Correlation and when is it used?
Spearman Correlation, also known as Spearman's rank correlation coefficient, is a non-parametric measure used to assess the strength and direction of association between two ranked data sets. It's a suitable alternative to Pearson correlation, particularly when data might not follow a normal distribution. Spearman correlation is often used to validate the performance of models where absolute accuracy is less critical than relative ranking, such as loss prediction or exposure models.
What are tied observations and why are they important in Spearman Correlation?
Tied observations occur when two or more data points have the same value. These ties can affect the Spearman Correlation coefficient. This document investigates the impact of tied observations on the Spearman Correlation coefficient under three different scenarios: single value ties, random multi-value ties, and bounded random ties.
What is the difference between single value ties, random multi-value ties, and bounded random ties?
Single value ties: All tied observations tie into the same value. Random multi-value ties: Tied observations can tie into a number of different values within the value range (up to 10 in this analysis). Bounded random ties: A special case of random multi-value ties where a large number of observations are found at the extreme values (boundaries), with only up to two other tied values existing between the boundaries.
How does the location of a single value tie affect the Spearman Correlation coefficient?
The Spearman Correlation coefficient is maximized when the ties are located in the middle of the value range (around 50%). The coefficient is more volatile and numerically lower when the ties are located at the upper or lower limits of the value range.
What is the relationship between the Spearman Correlation coefficient and the percentage number of ties in random multi-value ties?
There is a linear relationship between the mean Spearman Correlation coefficient and the percentage number of ties. The lower limit of the significant correlation coefficient is observed at around 85% of ties.
How does the bounded random ties scenario affect the relationship between the Spearman Correlation coefficient and the percentage of ties?
The linear relationship observed in the random multi-value ties scenario is replaced with a convex relationship. This means the strength and direction of correlation are expected to decrease faster than in the random multi-value scenario. Also, the lower limit of the significant correlation coefficient is found at 70% of ties instead of 85%.
What is the impact of the significance test on the results?
The significance test (using a p-value less than 5%) helps identify the minimum correlation coefficient value, especially at the lower end, and helps to differentiate between significant and insignificant relationships.
What is the main conclusion of the analysis?
The analysis showed how the Spearman Correlation coefficient can change under different tie scenarios. ‘Where’ and ‘how’ the tied data points are observed in the underlying data set are important factors to investigate and should be reported alongside the Spearman Correlation test results.
- Cite this work
- Yang Liu (Author), 2017, A Short Note on Spearman Correlation: Impact of Tied Observations, Munich, GRIN Verlag, https://www.grin.com/document/356821