
ENH Sample weights for median_absolute_error #17225

Merged
merged 49 commits into scikit-learn:master from lucyleeow:median_abs_err on May 27, 2020

Conversation

@lucyleeow lucyleeow (Member) commented May 14, 2020

Reference Issues/PRs

Follows from #6217
Addresses last item in #3450

What does this implement/fix? Explain your changes.

Add sample_weight to median_absolute_error.
Use sklearn.utils.stats._weighted_percentile to calculate the weighted median, as suggested here: #6217 (comment)
Amended _weighted_percentile to calculate the weighted percentile along axis=0 (i.e., for each column) when the input is a 2D array. This allows multioutput support. The behaviour of _weighted_percentile for 1D arrays is unchanged. Not sure if this is the best way to implement it, so I'll wait for comments before fixing the tests.
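
For example, after this change the metric could be called like this (a usage sketch; the expected outputs assume the usual lower weighted-median convention and may differ in edge cases):

import numpy as np
from sklearn.metrics import median_absolute_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Unweighted: the median of |y_pred - y_true| = median([0.5, 0.5, 0.0, 1.0])
print(median_absolute_error(y_true, y_pred))  # 0.5

# Putting all the weight on the last sample makes the weighted median equal
# to that sample's absolute error.
w = np.array([0.0, 0.0, 0.0, 1.0])
print(median_absolute_error(y_true, y_pred, sample_weight=w))  # 1.0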

Any other comments?

@lucyleeow lucyleeow changed the title from "ENH Sample weights for median_absolute_error" to "[WIP] ENH Sample weights for median_absolute_error" May 14, 2020
@lucyleeow lucyleeow changed the title from "[WIP] ENH Sample weights for median_absolute_error" to "ENH Sample weights for median_absolute_error" May 14, 2020
@lucyleeow lucyleeow changed the title from "ENH Sample weights for median_absolute_error" to "WIP ENH Sample weights for median_absolute_error" May 14, 2020
@lucyleeow lucyleeow changed the title from "WIP ENH Sample weights for median_absolute_error" to "ENH Sample weights for median_absolute_error" May 15, 2020

@lucyleeow lucyleeow (Member, Author) commented May 15, 2020

Used np.take_along_axis to amend _weighted_percentile so it works for 2D arrays. As np.take_along_axis was introduced in numpy v1.15, added a function to fixes.py that implements a simplified version of numpy.take_along_axis when the numpy version is < 1.15.
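
For illustration, a minimal sketch of the axis=0 generalisation using np.take_along_axis (this approximates the idea only; the helper name and details are assumptions, not the merged implementation):

import numpy as np


def weighted_percentile_axis0(array, sample_weight, percentile=50):
    # Weighted percentile of each column of a 2D array; sample_weight has
    # the same shape as array (one weight per element).
    sorted_idx = np.argsort(array, axis=0)
    sorted_weights = np.take_along_axis(sample_weight, sorted_idx, axis=0)
    # Column-wise cumulative weights and the requested threshold.
    weight_cdf = np.cumsum(sorted_weights, axis=0)
    threshold = percentile / 100.0 * weight_cdf[-1]
    # Per column, the first sorted row whose cumulative weight reaches it.
    rows = np.array([np.searchsorted(weight_cdf[:, j], threshold[j])
                     for j in range(array.shape[1])])
    cols = np.arange(array.shape[1])
    return array[sorted_idx[rows, cols], cols]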

test_scorer_sample_weight doesn't work for neg_median_absolute_error due to the binary targets (also noted in the original PR #6217 (comment)), so it is skipped for this scorer. The test works for neg_median_absolute_error if the targets are continuous (e.g., using make_regression to generate data). Not sure of the best way to add/amend the test to also check neg_median_absolute_error.

ping @glemaitre :p

@NicolasHug NicolasHug (Member) left a comment

Thanks @lucyleeow, made a quick (and incomplete) pass.

I'm wondering whether it's worth extending _weighted_percentile, or if it'd be easier and faster to just have our own version of weighted_median? I would assume that computing a weighted median is simpler and faster than having a general algorithm that works for every percentile, though I haven't looked into the details.
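
For context, a dedicated weighted median could be as small as this (a hypothetical standalone helper, not code from the PR):

import numpy as np


def weighted_median(values, weights):
    # Sort the values and accumulate the weights in that order.
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum_weights = np.cumsum(weights)
    # First value whose cumulative weight reaches half the total weight.
    return values[np.searchsorted(cum_weights, 0.5 * weights.sum())]


print(weighted_median([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 5.0]))  # 4.0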

sklearn/metrics/_regression.py (two outdated review threads, resolved)
output_errors = _weighted_percentile(np.abs(y_pred - y_true),
                                     sample_weight=sample_weight)

@NicolasHug NicolasHug (Member) May 15, 2020

I think we should avoid calling _weighted_percentile if sample_weight is None and still rely on np.median. The reason being that np.median is probably much faster than our more general _weighted_percentile (could be wrong on that, but I'd be surprised).
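
A minimal sketch of that branching, reusing the names from the snippet quoted above (a fragment that would sit inside median_absolute_error, not standalone code):

# Fall back to np.median when no weights are given; otherwise use the
# weighted percentile at 50%.
if sample_weight is None:
    output_errors = np.median(np.abs(y_pred - y_true), axis=0)
else:
    output_errors = _weighted_percentile(np.abs(y_pred - y_true),
                                         sample_weight=sample_weight)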

err_msg="scorer {0} behaves differently when "
"ignoring samples and setting sample_weight to"
" 0: {1} vs {2}".format(name, weighted,
if name != 'neg_median_absolute_error':

@NicolasHug NicolasHug (Member) May 15, 2020

Can you help me understand why you'd need this?

Also, if we need to ignore neg_median_absolute_error, it would be preferable to filter it out before the for name, scorer in SCORERS.items(): loop.

@lucyleeow lucyleeow (Member, Author) May 16, 2020

Since our data is binary classification data, with the target being 1 or 0, there are only 2 possible values of np.abs(y_pred - y_true), 1 or 0. With only 2 possible values, it's likely that the median after 'randomly' removing 10 of the 25 elements (or weighting them as zero) is the same as the median with all 25 elements.
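
A tiny numeric illustration of this effect (made-up numbers, just for illustration):

import numpy as np

# 25 absolute errors that can only be 0 or 1, as with binary targets.
errors = np.array([0, 1] * 12 + [1])
print(np.median(errors))        # 1.0
print(np.median(errors[10:]))   # still 1.0 after dropping 10 elements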

To filter before the loop, would you make a copy of SCORERS and remove neg_median_absolute_error? Is there a better way?

@lucyleeow lucyleeow (Member, Author) May 18, 2020

I could instead use make_regression to create a y just for neg_median_absolute_error?

@jnothman jnothman (Member) May 19, 2020

I think using regression targets here for regression metrics seems reasonable...?

TBH, I think this test is weird. Why are we testing the metric properties of scorers when really the unit that this file should be testing is the wrapper around the metric? We should only be testing here that sample_weight is correctly passed to the metric, regardless of the parameters to make_scorer.

@lucyleeow lucyleeow (Member, Author) May 19, 2020

Fair point. Though I would like to add more tests for _weighted_percentile with respect to the weighted median. Specifically, that _weighted_percentile(percentile=50) with equal weights is the same as np.median, and also that the difference between the sums of the weights to the left and right of the weighted median is the smallest possible. Reference: https://en.wikipedia.org/wiki/Weighted_median#Properties
If you agree, should I add the tests to this PR or a different one?
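
A sketch of the first of those tests (assuming _weighted_percentile stays importable from sklearn.utils.stats as referenced above; the test actually added may differ):

import numpy as np
import pytest

from sklearn.utils.stats import _weighted_percentile


def test_weighted_percentile_equal_weights():
    # With equal weights, the weighted median should match np.median
    # (odd sample size, so the unweighted median is unambiguous).
    rng = np.random.RandomState(42)
    x = rng.randn(11)
    weights = np.ones(x.shape)
    assert np.median(x) == pytest.approx(_weighted_percentile(x, weights))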

Also, is test_gradient_boosting_loss_functions.py the best place for these tests? It makes them a bit difficult to find.

With regard to this PR, should I use regression targets just for neg_median_absolute_error or all regression metrics?

@jnothman jnothman (Member) May 19, 2020

With regard to this PR, should I use regression targets just for neg_median_absolute_error or all regression metrics?

if it works for all, that would be more elegant.

Also is test_gradient_boosting_loss_functions.py the best place for these tests? Makes it a bit difficult to find.

Not sklearn/metrics/tests/test_regression.py?

@lucyleeow lucyleeow (Member, Author) May 19, 2020

Also is test_gradient_boosting_loss_functions.py the best place for these tests? Makes it a bit difficult to find.

Whoops, I meant the tests for _weighted_percentile, which are in test_gradient_boosting_loss_functions.py. I would also like to add to the tests for _weighted_percentile.

@lucyleeow lucyleeow (Member, Author) May 19, 2020

starting here:

@lucyleeow lucyleeow (Member, Author) commented May 21, 2020

ping @jnothman

@@ -1,18 +1,36 @@
import numpy as np

from .extmath import stable_cumsum
from sklearn.utils.fixes import _take_along_axis

@glemaitre glemaitre (Contributor) May 26, 2020

It should be a relative import here

@glemaitre glemaitre (Contributor) May 26, 2020

Suggested change:
-from sklearn.utils.fixes import _take_along_axis
+from .fixes import _take_along_axis

@glemaitre glemaitre self-requested a review May 26, 2020
@@ -335,7 +336,8 @@ def mean_squared_log_error(y_true, y_pred, *,


 @_deprecate_positional_args
-def median_absolute_error(y_true, y_pred, *, multioutput='uniform_average'):
+def median_absolute_error(y_true, y_pred, *, multioutput='uniform_average',
+                          sample_weight=None,):

@glemaitre glemaitre (Contributor) May 26, 2020

Either remove the last comma or do

def median_absolute_error(
    y_true, y_pred, *, multioutput='uniform_average',
    sample_weight=None,
):

@lucyleeow lucyleeow (Member, Author) May 26, 2020

comma removed

sklearn/metrics/_regression.py (review thread resolved)

def _take_along_axis(arr, indices, axis):
    """Implements a simplified version of numpy.take_along_axis if numpy
    version < 1.15"""
    import numpy

@glemaitre glemaitre (Contributor) May 26, 2020

Numpy is already imported as np

    version < 1.15"""
    import numpy

    if numpy.__version__ >= LooseVersion('1.15'):

@glemaitre glemaitre (Contributor) May 26, 2020

np_version is already containing the version

Suggested change:
-    if numpy.__version__ >= LooseVersion('1.15'):
+    if np_version > (1, 14):
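
Putting these review comments together, the fixes.py helper might end up looking roughly like this (a sketch under the assumption that np_version is the version tuple already defined in sklearn.utils.fixes; a stand-in definition is shown so the snippet is self-contained, and the merged code may differ):

import numpy as np

# Stand-in for the np_version tuple that sklearn.utils.fixes already defines.
np_version = tuple(int(x) for x in np.__version__.split('.')[:2])


def _take_along_axis(arr, indices, axis=0):
    # Use the real implementation when available (numpy >= 1.15).
    if np_version > (1, 14):
        return np.take_along_axis(arr, indices, axis=axis)
    # Simplified fallback covering only the axis=0 cases needed here.
    if arr.ndim == 1:
        return arr[indices]
    return arr[indices, np.arange(arr.shape[1])]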

w_median = _weighted_percentile(x_2d, w_2d)

for i, value in enumerate(w_median):
    assert(value == _weighted_percentile(x_2d[:, i], w_2d[:, i]))

@glemaitre glemaitre (Contributor) May 26, 2020

assert should not take parentheses:

Suggested change:
-    assert(value == _weighted_percentile(x_2d[:, i], w_2d[:, i]))
+    p = _weighted_percentile(x_2d[:, i], w_2d[:, i])
+    assert value == pytest.approx(p)

@glemaitre glemaitre (Contributor) May 26, 2020

or

p_axis_0 = [
    _weighted_percentile(x_2d[:, i], w_2d[:, i]) for i in range(len(w_median))
]
assert_allclose(w_median, p_axis_0)



def test_weighted_percentile_2d():
    # Check for when array is 2D

@glemaitre glemaitre (Contributor) May 26, 2020

I think that we should also test the case where the array is 2-D and sample_weight is 1-D.
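
A sketch of such a test (shapes and names are illustrative; the final test added in this PR may differ):

import numpy as np
from numpy.testing import assert_allclose

from sklearn.utils.stats import _weighted_percentile


def test_weighted_percentile_2d_with_1d_weights():
    # 2-D data with a single 1-D weight vector shared by all columns.
    rng = np.random.RandomState(0)
    x_2d = rng.randint(20, size=(10, 3)).astype(np.float64)
    w_1d = rng.randint(1, 5, size=10).astype(np.float64)

    w_median = _weighted_percentile(x_2d, w_1d)
    # Each column should match the 1-D computation with the same weights.
    p_axis_0 = [_weighted_percentile(x_2d[:, i], w_1d)
                for i in range(x_2d.shape[1])]
    assert_allclose(w_median, p_axis_0)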

sw = np.ones(102, dtype=np.float64)
sw[-1] = 0.0
score = _weighted_percentile(y, sw, 50)
assert score == 1

@glemaitre glemaitre (Contributor) May 26, 2020

if we have a floating point number, just use pytest.approx(...) here and after.

 # Make estimators that make sense to test various scoring methods
 sensible_regr = DecisionTreeRegressor(random_state=0)
 # some of the regressions scorers require strictly positive input.
-sensible_regr.fit(X_train, y_train + 1)
+if y_reg_train is None:
+    sensible_regr.fit(X_train, y_train + 1)

@glemaitre glemaitre (Contributor) May 26, 2020

y_train + 1 is to get only positive values? It seems that _require_positive_y would be much better, no?

@lucyleeow lucyleeow (Member, Author) May 26, 2020

I think so; I'm not sure exactly what the purpose is of sometimes using values >= 0 and other times using values >= 1. Will amend. Maybe this way is faster?

@lucyleeow lucyleeow (Member, Author) May 26, 2020

@glemaitre Amended but maybe + 1 is faster?

_, y_ml = make_multilabel_classification(n_samples=X.shape[0],
                                         random_state=0)
_, y_reg = make_regression(n_samples=X.shape[0], n_features=X.shape[1],

@glemaitre glemaitre (Contributor) May 26, 2020

It seems that test_scorer_sample_weight was designed to test classification scorers.

I think it would be best to create two test functions, one dedicated to classification and one dedicated to regression. It will be easier to follow what is going on.

@glemaitre glemaitre (Contributor) commented May 26, 2020

@lucyleeow lucyleeow (Member, Author) commented May 26, 2020

Thanks for the thorough review @glemaitre. I've added the test for 2D array and 1D sample weight and split test_scorer_sample_weight into reg and clf versions.

err_msg="scorer {0} behaves differently when "
"ignoring samples and setting sample_weight to"
" 0: {1} vs {2}".format(name, weighted,
if name not in REGRESSION_SCORERS:

@lucyleeow lucyleeow (Member, Author) May 26, 2020

Wasn't sure how to do this outside of the for loop without copying the SCORERS dict (@glemaitre)

@glemaitre glemaitre (Contributor) May 27, 2020

I think that this is fine. Another way could have been

if name in CLF_SCORERS:

@lucyleeow lucyleeow (Member, Author) May 27, 2020

Annoyingly CLF_SCORERS does not overlap with CLUSTER_SCORERS and MULTILABEL_ONLY_SCORERS.

(though REGRESSION_SCORERS overlaps with REQUIRE_POSITIVE_Y_SCORERS so I could adjust the classification test)

@glemaitre glemaitre (Contributor) May 27, 2020

Ah ok so it is good.

I might have done the following to avoid a level of indentation and to include an explanatory comment:

if name in REGRESSION_SCORERS:
    # skip the regression scores since we evaluate the classification scores
    continue 

@glemaitre glemaitre (Contributor) left a comment

Apart from style, it looks good to me.

sklearn/metrics/_regression.py (review threads resolved)

sklearn/metrics/tests/test_score_objects.py (eight outdated review threads, resolved)

@glemaitre glemaitre (Contributor) commented May 27, 2020

The last changes. I think that we do not strictly check this part of the string, so the tests did not fail, but we would not print the right error message (it would be missing the variable values).

@glemaitre glemaitre merged commit f93f560 into scikit-learn:master May 27, 2020
21 checks passed
@lucyleeow lucyleeow deleted the median_abs_err branch Jun 10, 2020
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this issue Jun 26, 2020
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this issue Oct 22, 2020