Trimming vs. Winsorizing Outliers
Outliers are a recurring problem for data scientists, much like missing values. There are various popular methods for identifying outliers, such as John Tukey’s fences or the standard score method. My focus here is not on algorithms for identifying outliers, but on algorithms for dealing with them once they’re identified. Two standard approaches are trimming and Winsorizing. Trimming amounts to simply removing the outliers from the dataset. Winsorizing, on the other hand, amounts to changing the value of each outlier to that of the nearest inlier.¹
Sometimes the term “Winsorizing” refers to the more specific method of clipping outliers to minimum and maximum percentiles, as illustrated by the above figure. An “X% Winsorization” refers to a Winsorization in which the central X% of the distribution are regarded as inliers. I’m using the term in a more general sense here, which does not require that percentiles be used to demarcate inliers from outliers.
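As a rough illustration of the percentile-based sense of the term, here is a minimal sketch using only the Python standard library. The function name `percentile_winsorize`, its parameters, and the data are made up for this example, and it assumes whole-number percentiles computed with linear interpolation.

```python
from statistics import quantiles


def percentile_winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values to the given lower and upper percentiles (whole numbers, 1-99)."""
    # quantiles(..., n=100) returns the 99 cut points for percentiles 1..99.
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    # No points are removed; values beyond the limits are moved onto them.
    return [min(max(v, low), high) for v in values]


# Made-up data: the integers 1..100 plus one extreme value.
data = list(range(1, 101)) + [1000]

# The central 90% are treated as inliers, i.e. a "90% Winsorization".
winsorized = percentile_winsorize(data, lower_pct=5, upper_pct=95)

print(len(data), len(winsorized))        # same number of points before and after
print(min(winsorized), max(winsorized))  # the new extremes are the percentile limits
```

The dataset keeps its length; only the values in the two tails change.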
Winsorizing is conceptually similar to signal clipping, as when the crests and troughs of a sound wave are cut flat at certain limits.
Notice that there are no gaps in the wave where the clipping occurs, because no points are removed. Instead, points that would otherwise be out of bounds are squeezed to the limit.
One subtle difference between signal clipping and Winsorizing is that with Winsorizing, outliers are squeezed to the nearest inlier rather than to the limits themselves. In some situations, squeezing outliers to the limits themselves creates strange artificial values. For example, in this King County Real Estate dataset, the number of bathrooms is measured at a granularity of 0.25. A house can have 2.25, 2.5, 2.75, or 3 bathrooms, but it can’t have 2.333 bathrooms or 2.008 bathrooms. If you use Tukey’s fences to demarcate outliers, you will find that the fences are located at 0.625 and 3.625. Squeezing directly to the Tukey fences would therefore create strange, inscrutable data points, such as houses with 0.625 bathrooms.
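The difference between the two kinds of squeezing can be sketched in a few lines of standard-library Python. The bathroom counts below are invented for illustration (they are not the King County data), chosen so that the fences land at 0.625 and 3.625. The helper names are hypothetical, and the sketch assumes the usual 1.5 × IQR multiplier with linearly interpolated quartiles.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip_to_fences(values, lower, upper):
    """Signal-clipping style: move outliers onto the fences themselves."""
    return [min(max(v, lower), upper) for v in values]


def winsorize_to_nearest_inlier(values, lower, upper):
    """Move each outlier to the nearest value that lies within the fences."""
    inliers = [v for v in values if lower <= v <= upper]
    low_inlier, high_inlier = min(inliers), max(inliers)
    return [min(max(v, low_inlier), high_inlier) for v in values]


# Made-up bathroom counts at 0.25 granularity.
bathrooms = [0.5, 1.0, 1.75, 1.75, 2.0, 2.25, 2.5, 2.5, 2.5, 3.5, 4.5]

lower, upper = tukey_fences(bathrooms)
clipped = clip_to_fences(bathrooms, lower, upper)
winsorized = winsorize_to_nearest_inlier(bathrooms, lower, upper)

print(lower, upper)                  # fences: 0.625 3.625
print(clipped[0], clipped[-1])       # artificial values: 0.625 3.625
print(winsorized[0], winsorized[-1]) # values that occur in the data: 1.0 3.5
```

Clipping to the fences invents the values 0.625 and 3.625, while squeezing to the nearest inlier only ever produces bathroom counts that already appear in the data.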
Though trimming is the most common approach to dealing with outliers, Winsorizing is worth discussing because there’s a potential drawback to trimming: collateral damage. The most straightforward way to trim outliers is to drop those observations (rows) from the dataset. If you drop the outliers from one feature (column), you drop those observations from every feature. Since observations which are outliers for one feature may be inliers for others, you could end up trimming more inliers than outliers (in a sense). You could avoid collateral damage by setting outliers to null values instead of dropping them. However, the problem with the null values approach is that you then have to work with nulls in your dataset. Since a lot of modeling software can’t handle null values, you’ve likely just traded one problem for another.
One advantage of Winsorizing is that there is no collateral damage. You can Winsorize outliers for every feature of a 2,000-column dataset without dropping a single row.
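To make the collateral damage concrete, here is a small sketch comparing row-wise trimming with per-feature Winsorizing. The rows, feature names, and helper functions are all hypothetical, and outliers are demarcated with Tukey's fences under the assumed 1.5 × IQR multiplier.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def trim_rows(rows, features):
    """Drop every row that is an outlier on at least one feature."""
    fences = {f: tukey_fences([r[f] for r in rows]) for f in features}
    return [
        r for r in rows
        if all(fences[f][0] <= r[f] <= fences[f][1] for f in features)
    ]


def winsorize_rows(rows, features):
    """Move each outlier to its feature's nearest inlier; keep every row."""
    out = [dict(r) for r in rows]  # copy so the input rows are untouched
    for f in features:
        lower, upper = tukey_fences([r[f] for r in rows])
        inliers = [r[f] for r in rows if lower <= r[f] <= upper]
        low_in, high_in = min(inliers), max(inliers)
        for r in out:
            r[f] = min(max(r[f], low_in), high_in)
    return out


# Made-up rows: the first is extreme on price only, the last on sqft only.
sqft = [1000, 1100, 1200, 1300, 1400, 1500, 1600, 9000]
price = [900, 210, 220, 230, 240, 250, 260, 270]
rows = [{"sqft": s, "price": p} for s, p in zip(sqft, price)]

trimmed = trim_rows(rows, ["sqft", "price"])
winsorized = winsorize_rows(rows, ["sqft", "price"])

print(len(rows), len(trimmed), len(winsorized))
```

In this toy data each feature has a single outlier, yet trimming removes two full rows, including an inlying `sqft` value and an inlying `price` value. Winsorizing changes the same two cells and keeps all eight rows.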
But perhaps the concern about collateral damage is overstated. One should avoid hastily applying the same outlier algorithm to every feature of a large dataset. Each feature demands an independent justification for how to handle its outliers. Are a certain feature’s outliers less important than its inlying data points? Do you suspect the outliers to be erroneous? Does it make sense to discard all observations which are outliers relative to a certain feature?
An important property of Winsorizing is that it preserves some of the original information. Since Winsorizing reduces the weight of outliers without eliminating them, the former outliers still have influence in models or statistical calculations. Preserving more of the original information comes at a cost, however, as illustrated by the second histogram below. I use Tukey’s fences, located 1.5 times the interquartile range below the first quartile and above the third quartile, to demarcate outliers.
Notice the giant artificial spike around $1.2M made up mostly of former outliers. You may recall similar spikes in the percentile Winsorization figure. The former outliers have reduced weight, but now there’s an artificial cluster of points at the maximum inlying value. This spike at the positive extreme may be undesirable for modeling purposes for any number of reasons. For one thing, you may feel that the former outliers still have too much influence. Regarding the above figure, perhaps you are just not interested in ultra-expensive houses with prices above $1.2M. You don’t want them to have any weight at all. If that’s the case, then you should trim those observations from the dataset.
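The spike is easy to reproduce on a toy example. The sketch below uses invented, right-skewed prices (in thousands of dollars, not the real estate data) and a hypothetical helper `winsorize_tukey`, again assuming the 1.5 × IQR multiplier. It simply counts how many points sit at the maximum inlying value before and after Winsorizing.

```python
from collections import Counter
from statistics import quantiles


def winsorize_tukey(values, k=1.5):
    """Move values outside Tukey's fences to the nearest inlier."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if lower <= v <= upper]
    low_in, high_in = min(inliers), max(inliers)
    return [min(max(v, low_in), high_in) for v in values]


# Made-up, right-skewed prices in thousands of dollars.
prices = [200, 250, 300, 320, 350, 380, 400, 420,
          450, 500, 550, 600, 1500, 2000, 3000, 5000]

winsorized = winsorize_tukey(prices)
top = max(winsorized)  # the maximum inlying value

before = Counter(prices)[top]
after = Counter(winsorized)[top]
print(top, before, after)
```

Every former high outlier now shares the same value, so the count at the maximum inlier grows by exactly the number of outliers that were squeezed down to it.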
Spikes at the extremes may cause other problems for you as well. If you are building a linear regression model and have high standards for the homoscedasticity of the residuals, you may be disappointed by the results of Winsorization. Consider the following plots of residuals vs. predicted values for two multiple regression models where house price is the prediction target. Both models have the same predictors, but the model on the left has outliers trimmed from all features while the model on the right has outliers Winsorized.
As you can see, there is a strange linear artifact on the upper right of the Winsorization scatterplot. A homoscedastic plot would show residual variance that stays constant across the x-axis. While the linear artifact caused by Winsorization doesn’t introduce major heteroscedasticity, it does muddy the plot somewhat. It would be interesting to try to measure the effect of Winsorization on homoscedasticity, but unfortunately such a task is beyond the scope of this blog post.
The most important consideration in the choice between Winsorizing and trimming is whether you want to reduce the weight of outliers or eliminate them completely. Winsorizing seems to really shine when it doesn’t produce artificial spikes at the extremes, or when such spikes don’t cause problems. I find that discrete numeric variables often don’t suffer from the spike effect, although I haven’t made any attempt to formally investigate this. Trimming is the method of choice if you suspect the outlying observations are erroneous, and especially if you suspect them to be erroneous along more than one feature. Trimming is also a good choice if extreme values are irrelevant for some reason. For example, if the outliers are celebrities, and your modeling task concerns ordinary people, you may want to completely eliminate the celebrities.
- John W. Tukey, “The Future of Data Analysis,” The Annals of Mathematical Statistics 33(1), 1–67 (March 1962).