Abstract
Background: Indirect reference interval and biological variation studies rely heavily on statistical methods to separate pathological and non-pathological subpopulations within the same dataset. We therefore compared the performance of eight univariate statistical methods for identifying and excluding values that originate from pathological subpopulations. Methods: The eight approaches examined were Tukey's rule with and without Box-Cox transformation; median absolute deviation; double median absolute deviation; Gaussian mixture models; van der Loo (Vdl) methods 1 and 2; and the Kosmic approach. Four scenarios, including lognormal distributions, were simulated, and the conditions were varied through the number of pathological populations and their central location, spread and proportion, giving a total of 256 simulated mixed populations. A performance criterion of ± 0.05 fractional error from the true underlying lower and upper limits of the reference interval was chosen. Results: Overall, the Kosmic method performed best, with the highest number of scenarios lying within the acceptable error, followed by Vdl method 1 and Tukey's rule. Kosmic and Vdl method 1 appeared to discriminate the non-pathological reference population better when the data were lognormally distributed. When the proportion and spread of pathological subpopulations were high, the performance of statistical exclusion deteriorated considerably. Discussions: It is important that laboratories use a priori defined clinical criteria to minimise the proportion of pathological subpopulations in a dataset prior to analysis. The curated dataset should then be carefully examined so that the appropriate statistical method can be applied.
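
The evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It simulates one mixed population, applies two of the simpler exclusion rules named above (Tukey's rule without transformation and the median absolute deviation), and scores the resulting reference limits against the ± 0.05 fractional error criterion. The population parameters, the random seed, the cutoffs (`k=1.5`, `cutoff=3.0`), the nonparametric 2.5th and 97.5th percentile estimate, and the definition of fractional error as (estimate − true) / true are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def tukey_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule, assumed k)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return values[(values >= q1 - k * iqr) & (values <= q3 + k * iqr)]


def mad_filter(values, cutoff=3.0):
    """Keep values within `cutoff` scaled MADs of the median (assumed cutoff)."""
    median = np.median(values)
    # 1.4826 makes the MAD consistent with the SD for Gaussian data.
    mad = 1.4826 * np.median(np.abs(values - median))
    return values[np.abs(values - median) <= cutoff * mad]


def reference_interval(values):
    """Nonparametric central 95% interval (2.5th and 97.5th percentiles)."""
    return np.percentile(values, [2.5, 97.5])


def fractional_error(estimate, truth):
    """Signed error relative to the true limit (assumed definition)."""
    return (np.asarray(estimate) - np.asarray(truth)) / np.asarray(truth)


# Hypothetical mixed population: 90% non-pathological, 10% pathological.
rng = np.random.default_rng(0)
healthy = rng.normal(loc=100.0, scale=10.0, size=9000)
pathological = rng.normal(loc=150.0, scale=15.0, size=1000)
mixed = np.concatenate([healthy, pathological])

# True limits of the non-pathological Gaussian population.
true_limits = np.array([100.0 - 1.96 * 10.0, 100.0 + 1.96 * 10.0])

err_raw = fractional_error(reference_interval(mixed), true_limits)
err_tukey = fractional_error(reference_interval(tukey_filter(mixed)), true_limits)
err_mad = fractional_error(reference_interval(mad_filter(mixed)), true_limits)

for name, err in [("none", err_raw), ("Tukey", err_tukey), ("MAD", err_mad)]:
    within = bool(np.all(np.abs(err) <= 0.05))
    print(f"{name:>5}: lower {err[0]:+.3f}, upper {err[1]:+.3f}, acceptable={within}")
```

Repeating this over a grid of pathological locations, spreads and proportions would give a comparison in the spirit of the 256 simulated populations, although the grid used in the study is not reproduced here.
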
Original language | English |
---|---|
Pages (from-to) | 16-24 |
Number of pages | 9 |
Journal | Clinical Biochemistry |
Volume | 103 |
Early online date | 15 Feb 2022 |
Publication status | Published - May 2022 |
Keywords
- Biological variation
- Data mining
- Indirect approach
- Outlier
- Outlier exclusion
- Reference intervals