How we handle Outliers

What are Outliers

Outliers are data points that are significantly different from the rest of the data in a dataset. These points can sometimes be the result of measurement errors or other factors that make them not representative of the underlying population.

One way to identify outliers is to use the mean and standard deviation of the data. If a data point is more than three standard deviations away from the mean, it is considered an outlier. This method is known as the "mean +/- 3 * standard deviation" rule.
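As a minimal sketch of this rule, the bounds could be computed as follows in Python. The function names `outlier_bounds` and `is_outlier` are hypothetical, and the sketch assumes the population standard deviation, since the article does not say whether the population or sample formula is used.

```python
from statistics import mean, pstdev


def outlier_bounds(values, k=3.0):
    """Return (lower, upper) as mean - k * sd and mean + k * sd."""
    m = mean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    return m - k * sd, m + k * sd


def is_outlier(value, lower, upper):
    """A point is an outlier if it lies strictly outside the bounds."""
    return value < lower or value > upper
```

A value exactly on a bound is treated as inside the range here; that boundary choice is also an assumption.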

The outliers option is enabled by default once 1000 transactions have been collected in the project. From that point on, any data point that falls outside the mean +/- 3 * standard deviation range is considered an outlier and is removed from any analysis or visualization of the data.
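The behaviour described above could be sketched like this. It is an illustration, not the product's actual implementation: `filter_outliers`, its `enabled` flag, and `MIN_TRANSACTIONS` are hypothetical names, "at least 1000 transactions" is an assumed reading of the threshold, and the population standard deviation is assumed.

```python
from statistics import mean, pstdev

MIN_TRANSACTIONS = 1000  # threshold mentioned in the article


def filter_outliers(costs, enabled=True, min_count=MIN_TRANSACTIONS, k=3.0):
    """Return the costs to analyse, dropping outliers when filtering applies."""
    # Filtering only applies when the option is on and enough data exists.
    if not enabled or len(costs) < min_count:
        return list(costs)
    m = mean(costs)
    sd = pstdev(costs)  # assumption: population standard deviation
    lower, upper = m - k * sd, m + k * sd
    return [c for c in costs if lower <= c <= upper]
```

With fewer than `min_count` values, or with `enabled=False`, the input is returned unchanged, so every transaction stays in the analysis.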

The user can turn the outliers option on or off in the statistics view, which determines whether the outlier data points are included in the analysis. This lets the user decide whether to include or exclude these potentially misleading data points.

Example of how the "mean +/- 3 * standard deviation" rule is applied:

Suppose we have a dataset containing the costs of 100 transactions. After calculating the mean and standard deviation of the data, we find that the mean cost is 170€ and the standard deviation is 5€. Using the "mean +/- 3 * standard deviation" rule, any data point that falls outside the range of 155€ to 185€ is an outlier.

Suppose two transactions in the dataset have costs of 200€ and 160€. The 200€ transaction is considered an outlier because it is more than three standard deviations away from the mean (170€ +/- 3 * 5€ gives a range of 155€ to 185€). The 160€ transaction is not considered an outlier because it falls within that range.
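The arithmetic in this example can be checked with a few lines of Python. The variable names are illustrative only; the numbers come from the example above.

```python
mean_cost = 170.0  # EUR, from the example
std_dev = 5.0      # EUR, from the example

lower = mean_cost - 3 * std_dev  # 155.0
upper = mean_cost + 3 * std_dev  # 185.0

results = {}
for cost in (200.0, 160.0):
    # Distance from the mean, measured in standard deviations.
    z = (cost - mean_cost) / std_dev
    results[cost] = cost < lower or cost > upper
    print(f"{cost:.0f} EUR: {z:+.1f} standard deviations, outlier={results[cost]}")
```

The 200€ transaction lies 6 standard deviations above the mean, while the 160€ transaction lies 2 standard deviations below it.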

In this example, the outliers option determines whether the 200€ transaction is included in the analysis of the data. If the option is turned on, the 200€ transaction is excluded, and all calculations and visualizations are based only on the non-outlier data points.
