Missing values in olive oil dataset

Question

I have a dataset of olive oil samples and want to build a classification model for oil quality. I'm having trouble deciding how to deal with missing data. The data is available here if you want to take a look: https://data.mendeley.com/datasets/thkcz3h6n6/6.

The issue is that the data is systematically missing for low-quality oil samples. It seems that the company that collected the data skipped testing UV absorption and FAEES for samples already deemed poor. I can't impute from other samples categorised as poor quality ("Lampante oil") because there are none: the values are missing for all of them. I have looked at using "regression-based imputation", but there is no strong relationship between UV or FAEES and the other columns.

So what should my course of action be for the missing values? I don't want to remove the columns completely, and I can't remove the rows, since that would mean removing all of the Lampante (poor-quality) oil samples.

Peter · Accepted Answer · 2025-04-28 12:18:28Z

You can take the parameter limits for low-quality oil and generate random values that fall within them. It's a simple method, and I don't know if it's going to solve your problem.
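A minimal sketch of that idea in pandas, under stated assumptions: the column name `FAEES`, the helper `fill_uniform`, and the bounds `40.0` and `80.0` are all hypothetical placeholders, not values taken from the dataset or from any olive oil standard. Real limits would have to come from domain knowledge.

```python
import numpy as np
import pandas as pd


def fill_uniform(df, column, low, high, seed=0):
    """Return a copy of df with NaNs in `column` replaced by
    uniform random draws from [low, high)."""
    rng = np.random.default_rng(seed)  # fixed seed so the fill is reproducible
    out = df.copy()
    mask = out[column].isna()
    out.loc[mask, column] = rng.uniform(low, high, size=int(mask.sum()))
    return out


# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "quality": ["Extra virgin", "Lampante", "Lampante"],
    "FAEES": [12.0, np.nan, np.nan],
})

# Hypothetical bounds for the low-quality class.
filled = fill_uniform(toy, "FAEES", low=40.0, high=80.0)
```

Note that the filled values are random draws, not measurements, so a model can only learn from them whatever the chosen bounds already encode.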

Useful link: https://www.datacamp.com/pt/tutorial/techniques-to-handle-missing-data-values

Good luck!

Sahan Randika · Accepted Answer · 2025-04-28 18:27:52Z

Since the missing data in the olive oil dataset is systematic, with UV absorption and FAEES values missing specifically for poor-quality ("Lampante") samples, traditional imputation methods are not appropriate. Instead, the best strategy is to treat the missingness itself as an informative signal. This can be done by creating new binary features indicating whether each value is missing (e.g., UV_missing, FAEES_missing).

For the missing UV and FAEES values, a conservative imputation method should be used, such as filling with a constant outside the normal range (e.g., -1), ensuring that no false patterns are introduced. This approach preserves all samples, retains important features, and allows classification models (especially tree-based ones) to learn from both the observed values and the missingness patterns. Dropping rows or columns would discard valuable information and significantly weaken the model’s ability to classify Lampante oils.

Therefore, I think creating missingness indicators combined with conservative imputation offers the most robust and informative solution.
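A rough sketch of how this could look in pandas. The column names `UV` and `FAEES`, the helper `add_missing_flags`, and the toy values are assumptions for illustration; the actual dataset may name its columns differently. The fill value `-1` is the constant suggested above.

```python
import numpy as np
import pandas as pd


def add_missing_flags(df, columns, fill_value=-1.0):
    """Return a copy of df where each listed column gets a
    `<col>_missing` indicator (1 = was NaN) and its NaNs are
    replaced by a constant outside the normal range."""
    out = df.copy()
    for col in columns:
        out[f"{col}_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(fill_value)
    return out


# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "quality": ["Extra virgin", "Virgin", "Lampante"],
    "UV": [2.1, 2.4, np.nan],
    "FAEES": [10.0, 25.0, np.nan],
})

prepared = add_missing_flags(toy, ["UV", "FAEES"])
```

A constant like `-1` tends to be easy for tree-based models to split on, but it can distort linear or distance-based models, which would treat it as a real measurement.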

Yes, thanks, this was a suggestion I came across when researching. In a real-world situation, would using missingness as information be a bad idea? We don't want a model that incorrectly learns a relationship between incomplete samples and poor quality. That would be a model that learns a particular company's data collection method instead of the intrinsic characteristics of olive oil. Hope that makes sense. - BOBTHEBUILDER, Commented 15 hours ago
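One way to gauge how real this concern is for a given dataset is to cross-tabulate the quality label against the missingness flag. The sketch below uses made-up toy data and a hypothetical `UV` column name; if the flag lines up exactly with the Lampante label, any model given that flag can classify Lampante from the collection process alone.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "quality": ["Extra virgin", "Virgin", "Lampante", "Lampante"],
    "UV": [2.1, 2.4, np.nan, np.nan],
})

# Boolean flag: True where the UV measurement was never taken.
uv_missing = toy["UV"].isna().rename("UV_missing")

# Counts of each quality class by missing / not missing.
table = pd.crosstab(toy["quality"], uv_missing)

# True when the flag alone reproduces the Lampante label exactly.
is_lampante = toy["quality"].eq("Lampante")
flag_matches_label = bool((uv_missing == is_lampante).all())
```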
