An outlier is a value or an observation that is distant from other observations, that is to say, a data point that differs significantly from other data points.
In this tutorial, I present several approaches to detect outliers in R, from simple techniques such as descriptive statistics (including minimum, maximum, histogram, boxplot and percentiles) to more formal techniques such as the Hampel filter, the Grubbs, the Dixon and the Rosner tests for outliers.
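To make two of these approaches concrete before going into detail, here is a minimal sketch in base R. It assumes a small made-up vector `x` (hypothetical data, not the dataset used in this tutorial). It applies the boxplot (IQR) criterion and a Hampel filter defined as the median plus or minus 3 median absolute deviations. Using `constant = 1` in `mad()` is an assumption made here for simplicity; R's default is 1.4826.

```r
# Hypothetical sample data: nine ordinary values and one extreme value
x <- c(10, 12, 11, 13, 12, 11, 10, 12, 13, 50)

# Descriptive statistics: minimum, quartiles, mean and maximum
summary(x)

# Boxplot (IQR) criterion: values lying more than 1.5 * IQR beyond the hinges
iqr_outliers <- boxplot.stats(x)$out

# Hampel filter: values outside median +/- 3 * MAD
# (constant = 1 returns the raw MAD; this is an assumption, the default is 1.4826)
lower <- median(x) - 3 * mad(x, constant = 1)
upper <- median(x) + 3 * mad(x, constant = 1)
hampel_outliers <- x[x < lower | x > upper]

iqr_outliers
hampel_outliers
```

With this toy vector, both criteria flag the value 50 as a potential outlier. On real data the two methods can disagree, which is one reason to verify flagged values rather than remove them automatically.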
This tutorial will not tell you whether to remove outliers (or whether to impute them with the median, mean, mode or any other value), but it will help you detect them so that, as a first step, you can verify them. Once they are verified, it is your choice to exclude or include them in your analyses, which usually requires careful reflection on the researcher’s side.
Removing or keeping outliers mostly depends on three factors:
- The domain/context of your analyses and the research question.
- Whether the tests you are going to apply are robust to the presence of outliers or not.
- How distant the outliers are from other observations.