In mathematics and statistics, Spearman's rank correlation coefficient is a measure of correlation. It is named after Charles Spearman, who devised it. It is written in short as the Greek letter rho (ρ) or sometimes as r_s. It is a number that shows how closely two sets of data are linked. It can only be used for data which can be put in order, such as highest to lowest.

The general formula for ρ is

ρ = 1 − (6 Σd²) / (n(n² − 1)).

For example, if you have data for how expensive different computers are, and data for how fast the computers are, you could use ρ to see if they are linked, and how closely they are linked.
Working it out
Step one
To work out ρ, you first have to rank each piece of data. We are going to use the example from the intro of computers, their prices and their speeds.

So, the computer with the lowest price gets rank 1. The next cheapest gets rank 2. Then it goes up until everything is ranked. You have to do this to both sets of data.
| PC | Price ($) | Rank | Speed (GHz) | Rank |
|---|---|---|---|---|
| A | 200 | 1 | 1.80 | 2 |
| B | 275 | 2 | 1.60 | 1 |
| C | 300 | 3 | 2.20 | 4 |
| D | 350 | 4 | 2.10 | 3 |
| E | 600 | 5 | 4.00 | 5 |
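As a sketch, the ranking in this step could be done in Python like this (the `rank` helper and the variable names are ours, not part of the article; it assumes no two values are the same):

```python
def rank(values):
    """Give each value its position in sorted order, smallest = rank 1.

    Only works when there are no ties (tied values are covered later).
    """
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

prices = [200, 275, 300, 350, 600]       # PCs A to E
speeds = [1.80, 1.60, 2.20, 2.10, 4.00]

price_ranks = rank(prices)  # [1, 2, 3, 4, 5]
speed_ranks = rank(speeds)  # [2, 1, 4, 3, 5]
```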
Step two
Next, we have to find the difference between the two ranks for each PC. Then, we multiply the difference by itself, which is called squaring. The difference is called d, and the number you get when you square d is called d².
| Rank (price) | Rank (speed) | d | d² |
|---|---|---|---|
| 1 | 2 | -1 | 1 |
| 2 | 1 | 1 | 1 |
| 3 | 4 | -1 | 1 |
| 4 | 3 | 1 | 1 |
| 5 | 5 | 0 | 0 |
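This step is simple to sketch in Python too (the variable names are ours):

```python
price_ranks = [1, 2, 3, 4, 5]
speed_ranks = [2, 1, 4, 3, 5]

# d is the difference between the two ranks; d2 is d squared.
d = [p - s for p, s in zip(price_ranks, speed_ranks)]
d2 = [x * x for x in d]
# d  -> [-1, 1, -1, 1, 0]
# d2 -> [1, 1, 1, 1, 0]
```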
Step three
Count how many pieces of data we have. This data has ranks 1 to 5, so we have 5 pieces of data. This number is called n.
Step four
Finally, use everything we have worked out so far in this formula:

ρ = 1 − (6 Σd²) / (n(n² − 1)).

Σd² means that we take the total of all the numbers that were in the column d². This is because Σ (sigma) means total.

So, Σd² is 1 + 1 + 1 + 1 + 0, which is 4. The formula says to multiply it by 6, which is 24. n(n² − 1) is 5 × (25 − 1), which is 120.

So, to find ρ, we simply do 1 − 24/120 = 1 − 0.2 = 0.8.
Therefore, Spearman's rank correlation coefficient is 0.8 for this set of data.
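The whole of step four fits in a few lines of Python (a sketch using the numbers worked out above; the variable names are ours):

```python
d2 = [1, 1, 1, 1, 0]  # the d² column from step two
n = 5                 # how many pieces of data, from step three

# Spearman's formula: rho = 1 - (6 * sum of d²) / (n * (n² - 1))
rho = 1 - (6 * sum(d2)) / (n * (n**2 - 1))
print(rho)  # 0.8
```

For real data sets, `scipy.stats.spearmanr` computes the same coefficient (and handles tied ranks for you).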
What the numbers mean
ρ always gives an answer between −1 and 1. The numbers in between are like a scale, where −1 is a very strong link, 0 is no link, and 1 is also a very strong link. The difference between 1 and −1 is that 1 is a positive correlation, and −1 is a negative correlation. On a scatter graph, data with a ρ value of 1 would slope from bottom left to top right, while data with a ρ value of −1 would slope from top left to bottom right.
For example, for the data that we worked out above, ρ was 0.8. So this means that there is a positive correlation. Because it is close to 1, the link between the two sets of data is strong: they go up together. If it were −0.8, the link would be just as strong, but as one went up, the other would go down.
If two numbers are the same
Sometimes, when ranking data, there are two or more numbers that are the same. When this happens in ρ, we take the mean or average of the ranks that are the same. These are called tied ranks. To do this, we first rank the tied numbers as if they were not tied. Then, we add up all the ranks that they would have, and divide the total by how many there are.[1] For example, say we were ranking how well different people did in a spelling test.
| Test score | Rank | Rank (with tied) |
|---|---|---|
| 4 | 1 | 1 |
| 6 | 2 | 3 |
| 6 | 3 | 3 |
| 6 | 4 | 3 |
| 8 | 5 | 5.5 |
| 8 | 6 | 5.5 |
These numbers are used in exactly the same way as normal ranks.
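The tied-ranks rule can be sketched in Python as well (the `tied_ranks` helper is ours; `scipy.stats.rankdata` does the same job for real data):

```python
def tied_ranks(values):
    """Rank values, giving tied values the mean of the ranks they share."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        # The 1-based positions this value would occupy in sorted order.
        positions = [i + 1 for i, x in enumerate(ordered) if x == v]
        # Tied values all get the average of those positions.
        ranks.append(sum(positions) / len(positions))
    return ranks

scores = [4, 6, 6, 6, 8, 8]
print(tied_ranks(scores))  # [1.0, 3.0, 3.0, 3.0, 5.5, 5.5]
```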
Related pages
Notes and references