I have a dataset of olive oil samples and want to build a classification model for oil quality. I'm having trouble deciding how to deal with missing data. The data is available here if you'd like to take a look: https://data.mendeley.com/datasets/thkcz3h6n6/6.

The issue is that the data is missing systematically from low-quality oil samples. It seems that the company that collected the data skipped testing UV absorption and FAEES for samples already deemed poor. I can't impute based on other samples categorised as poor quality ("Lampante oil") because there are none: these values are missing for every Lampante sample. I have looked at regression-based imputation, but there is no strong relationship between UV or FAEES and the other columns.

So what should my course of action be for the missing values? I don't want to remove the columns completely, and I can't remove the rows, since that would mean removing all the Lampante (poor-quality) oil samples.

2 Answers

You could take the range of values expected for low-quality oil on each parameter and fill the gaps with random values drawn from within that range. It's a simple method, and I don't know whether it will solve your problem.
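As a rough illustration of that idea, here is a minimal sketch using pandas and NumPy. The column names `UV` and `FAEES`, the toy rows, and especially the bounds in `ASSUMED_RANGES` are placeholders made up for the example, not values from the dataset or from any standard; you would need to justify real bounds from domain knowledge.

```python
import numpy as np
import pandas as pd

# Hypothetical bounds for the poor-quality class (placeholders, not real limits).
ASSUMED_RANGES = {"UV": (2.6, 4.0), "FAEES": (35.0, 150.0)}


def fill_random_in_range(df, ranges, seed=0):
    """Fill NaNs in each listed column with uniform draws from (low, high)."""
    rng = np.random.default_rng(seed)  # fixed seed so the fill is reproducible
    out = df.copy()
    for col, (low, high) in ranges.items():
        mask = out[col].isna()
        out.loc[mask, col] = rng.uniform(low, high, size=int(mask.sum()))
    return out


# Toy data: measurements are missing only for the Lampante rows.
toy = pd.DataFrame({
    "UV": [1.8, 2.1, np.nan, np.nan],
    "FAEES": [12.0, 20.0, np.nan, np.nan],
    "quality": ["EVOO", "EVOO", "Lampante", "Lampante"],
})

filled = fill_random_in_range(toy, ASSUMED_RANGES)
print(filled)
```

Keep in mind that every filled value here is synthetic, so anything a model learns from those columns for Lampante samples reflects the chosen bounds rather than real measurements.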

Useful link: https://www.datacamp.com/pt/tutorial/techniques-to-handle-missing-data-values

Good luck!


Since the missing data in the olive oil dataset is systematic, with UV absorption and FAEES values missing specifically for poor-quality ("Lampante") samples, traditional imputation methods are not appropriate. Instead, the best strategy is to treat the missingness itself as an informative signal. This can be done by creating new binary features indicating whether each value is missing (e.g., UV_missing, FAEES_missing).

For the missing UV and FAEES values themselves, a conservative imputation method should be used, such as filling with a constant outside the normal range (e.g., -1), so that the filled values cannot be mistaken for real measurements. This approach preserves all samples, retains important features, and allows classification models (especially tree-based ones) to learn from both the observed values and the missingness patterns. Dropping rows or columns would discard valuable information and significantly weaken the model’s ability to classify Lampante oils.

Therefore, I think creating missingness indicators combined with conservative imputation offers the most robust and informative solution.
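A minimal sketch of what this could look like in pandas is below. The column names `UV` and `FAEES`, the toy rows, and the helper `add_missing_indicators` are assumptions for illustration; the sentinel of -1 is only safe if it really lies outside the valid range of both measurements.

```python
import numpy as np
import pandas as pd

SENTINEL = -1.0  # assumed to be outside the valid range of UV and FAEES


def add_missing_indicators(df, cols, sentinel=SENTINEL):
    """Add a 0/1 `<col>_missing` flag per column, then fill NaNs with a constant."""
    out = df.copy()
    for col in cols:
        out[f"{col}_missing"] = out[col].isna().astype(int)
        out[col] = out[col].fillna(sentinel)
    return out


# Toy data: measurements are missing only for the Lampante rows.
toy = pd.DataFrame({
    "UV": [1.8, 2.1, np.nan, np.nan],
    "FAEES": [12.0, 20.0, np.nan, np.nan],
    "quality": ["EVOO", "EVOO", "Lampante", "Lampante"],
})

prepared = add_missing_indicators(toy, ["UV", "FAEES"])
print(prepared)
```

A tree-based classifier can split on either the indicator or the sentinel value. If every Lampante sample has the flag set, as in this toy frame, the flag alone separates that class, so it is worth checking how much of the model's performance comes from the indicator columns.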

Yes, thanks, this was a suggestion I came across when researching. In a real-world situation, would using missingness as information be a bad idea? We don't want a model that incorrectly learns a relationship between incomplete samples and poor quality. That would be a model that learns one particular company's data collection practices rather than the intrinsic characteristics of olive oil. Hope that makes sense. Commented 15 hours ago
