Outlier removal is common in hormonal research. Here we investigated the extent to which removing outliers from hormonal data leads to divergent statistical conclusions. We first show that the most common outlier detection rule is based on a number of standard deviations (SD) from the mean. Next, we used simulations to examine how often statistical conclusions diverge, that is, how often a test with outlier exclusion yields a statistically significant result whereas the same test with outlier inclusion does not, or vice versa (at a significance threshold of p = .05). Simulations were run in duplicate for independent samples t-tests and repeated measures ANOVA designs, and were based on real testosterone (T) data and on a theoretical gamma distribution of T data. We ran simulations for different sample sizes (30 to 100) and outlier removal rules (2.5 SD and 3 SD). For t-tests, between 14 % and 55 % of the significant cases were divergent: the test with outlier exclusion was statistically significant whereas the test with outlier inclusion was not, or vice versa (median p difference: .03–.06). For repeated measures ANOVAs, between 7 % and 28 % of the significant cases were divergent in the same way (median p difference: .01–.03). When reporting any test that yielded a statistically significant result (the test with outlier inclusion, the test with outlier exclusion, or both), between 5.15 % and 6.89 % of the independent samples t-tests were statistically significant; for the repeated measures ANOVA design this was between 6.32 % and 7.62 % of the tests. Our results suggest that outlier handling can have a substantial impact on significance testing. We suggest several potential solutions for handling outliers and argue for a careful assessment of outlier handling in hormonal data.
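
As a rough illustration of the t-test comparison described above, the following Python sketch draws two groups from a gamma distribution, runs an independent samples t-test with and without SD-based outlier exclusion, and counts how often the two conclusions diverge. It is not the simulation code used in this study. The gamma shape and scale, the absence of a true group difference, the per-group application of the SD rule, and the names `remove_outliers` and `simulate` are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def remove_outliers(x, n_sd):
    """Drop values lying more than n_sd sample SDs from the sample mean."""
    mean, sd = x.mean(), x.std(ddof=1)
    return x[np.abs(x - mean) <= n_sd * sd]


def simulate(n_per_group=30, n_sd=2.5, n_sims=2000, alpha=0.05, seed=0):
    """Compare t-test conclusions with and without outlier exclusion.

    Assumptions (not taken from the paper): both groups are drawn from
    the same gamma(shape=2, scale=1) distribution, and the SD rule is
    applied separately within each group.
    """
    rng = np.random.default_rng(seed)
    n_incl = n_excl = n_either = n_divergent = 0

    for _ in range(n_sims):
        a = rng.gamma(shape=2.0, scale=1.0, size=n_per_group)
        b = rng.gamma(shape=2.0, scale=1.0, size=n_per_group)

        # Test on the full data (outliers included).
        sig_incl = stats.ttest_ind(a, b).pvalue < alpha
        # Test after applying the SD rule (outliers excluded).
        sig_excl = stats.ttest_ind(
            remove_outliers(a, n_sd), remove_outliers(b, n_sd)
        ).pvalue < alpha

        n_incl += sig_incl
        n_excl += sig_excl
        n_either += sig_incl or sig_excl
        n_divergent += sig_incl != sig_excl

    return {
        "incl_rate": n_incl / n_sims,
        "excl_rate": n_excl / n_sims,
        # Rate of significance if either analysis could be reported.
        "either_rate": n_either / n_sims,
        # Share of significant cases in which the two analyses disagree.
        "divergent_share": n_divergent / n_either if n_either else 0.0,
    }


if __name__ == "__main__":
    for n in (30, 100):
        for rule in (2.5, 3.0):
            print(n, rule, simulate(n_per_group=n, n_sd=rule))
```

In this sketch, `either_rate` corresponds to reporting whichever analysis reaches significance, and `divergent_share` to the proportion of significant cases in which inclusion and exclusion disagree. The values it produces depend on the assumed distribution and should not be expected to match the percentages reported here.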