ENH Sample weights for median_absolute_error #17225
Conversation
ping @glemaitre :p

Thanks @lucyleeow, made a quick (and incomplete) pass.
I'm wondering whether it's worth extending `_weighted_percentile`, or if it'd be easier and faster to just have our own version of `weighted_median`? I would assume that computing a weighted median is simpler and faster than having a general algorithm that works for every percentile, though I haven't looked into the details.
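For reference, a dedicated weighted (lower) median only needs one sort and a cumulative sum. This is a hypothetical sketch of the `weighted_median` idea floated above, not code from the PR:

```python
import numpy as np

def weighted_median(values, weights):
    """Hypothetical dedicated weighted median (lower-median convention)."""
    order = np.argsort(values)
    values = np.asarray(values, dtype=float)[order]
    weights = np.asarray(weights, dtype=float)[order]
    # First index where the cumulative weight reaches half the total weight.
    cutoff = np.searchsorted(np.cumsum(weights), 0.5 * weights.sum())
    return values[cutoff]

# e.g. weighted_median([1, 2, 3], [1, 1, 4]) == 3
```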
sklearn/metrics/_regression.py
output_errors = _weighted_percentile(np.abs(y_pred - y_true),
                                     sample_weight=sample_weight)
I think we should avoid calling `_weighted_percentile` if `sample_weight` is None and still rely on `np.median`. The reason being that `np.median` is probably much faster than our more general `_weighted_percentile` (could be wrong on that, but I'd be surprised).
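A minimal sketch of the dispatch being suggested, wrapped in a hypothetical helper (names assumed, not the PR's final code):

```python
import numpy as np
from sklearn.utils.stats import _weighted_percentile

def _median_abs_error_per_output(y_true, y_pred, sample_weight=None):
    # Fast path: plain np.median when no weights are given; fall back to
    # the general weighted helper only when sample_weight is provided.
    if sample_weight is None:
        return np.median(np.abs(y_pred - y_true), axis=0)
    return _weighted_percentile(np.abs(y_pred - y_true),
                                sample_weight=sample_weight)
```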
err_msg="scorer {0} behaves differently when "
        "ignoring samples and setting sample_weight to"
        " 0: {1} vs {2}".format(name, weighted,

if name != 'neg_median_absolute_error':
Can you help me understand why you'd need this? Also, if we need to ignore `neg_median_absolute_error`, it would be preferable to filter it out before the loop `for name, scorer in SCORERS.items():`.
Since our data is binary classification data, with the target being 1 or 0, there are only 2 possible values of `np.abs(y_pred - y_true)`: 1 or 0. With only 2 possible values, it's likely that the median after 'randomly' removing/weighting as zero 10 elements (out of 25) is the same as the median with all 25 elements.
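A toy illustration of that point (numbers invented):

```python
import numpy as np

errors = np.array([0] * 20 + [1] * 5)   # 25 binary errors, mostly 0
assert np.median(errors) == 0
# Dropping 10 of the zero-error samples leaves the median unchanged.
assert np.median(errors[10:]) == 0
```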
To filter before the loop, would you make a copy of `SCORERS` and remove `neg_median_absolute_error` before the loop? Is there a better way?
I could instead use `make_regression` to create a `y` just for `neg_median_absolute_error`?
I think using regression targets here for regression metrics seems reasonable...?

TBH, I think this test is weird. Why are we testing the metric properties of scorers when really the unit that this file should be testing is the wrapper around the metric? We should only be testing here that `sample_weight` is correctly passed to the metric, regardless of the parameters to `make_scorer`.
Fair point. Though I would like to add more tests for `_weighted_percentile` with respect to the weighted median. Specifically, that `_weighted_percentile(percentile=50)` with equal weights is the same as `np.median`, and also that the difference between the sums of weights left and right of the weighted median is the smallest possible. Reference: https://en.wikipedia.org/wiki/Weighted_median#Properties

If you agree, should I add the tests to this PR or a different one?
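A sketch of the two checks described above (assuming the current `_weighted_percentile(array, sample_weight, percentile)` signature; an odd sample count keeps `np.median` interpolation-free):

```python
import numpy as np
from sklearn.utils.stats import _weighted_percentile

rng = np.random.RandomState(0)
x = rng.randn(101)          # odd length: unambiguous middle element
w = np.ones_like(x)

# 1) Equal weights reduce to the plain median.
wm = _weighted_percentile(x, w, percentile=50)
assert wm == np.median(x)

# 2) Weighted-median property: the total weight strictly below the
# weighted median and strictly above it are each at most half the total.
assert w[x < wm].sum() <= 0.5 * w.sum()
assert w[x > wm].sum() <= 0.5 * w.sum()
```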
Also, is `test_gradient_boosting_loss_functions.py` the best place for these tests? It makes them a bit difficult to find.

With regard to this PR, should I use regression targets just for `neg_median_absolute_error` or for all regression metrics?
> With regard to this PR, should I use regression targets just for neg_median_absolute_error or all regression metrics?

If it works for all, that would be more elegant.

> Also is test_gradient_boosting_loss_functions.py the best place for these tests? Makes it a bit difficult to find.

Not `sklearn/metrics/tests/test_regression.py`?
> Also is test_gradient_boosting_loss_functions.py the best place for these tests? Makes it a bit difficult to find.

Whoops, I meant the tests for `_weighted_percentile`, which are in `test_gradient_boosting_loss_functions.py`. I would also like to add to the tests for `_weighted_percentile`,
starting here:

def test_weighted_percentile():
ping @jnothman
sklearn/utils/stats.py
@@ -1,18 +1,36 @@
import numpy as np

from .extmath import stable_cumsum
from sklearn.utils.fixes import _take_along_axis
It should be a relative import here
- from sklearn.utils.fixes import _take_along_axis
+ from .fixes import _take_along_axis
sklearn/metrics/_regression.py
@@ -335,7 +336,8 @@ def mean_squared_log_error(y_true, y_pred, *,

 @_deprecate_positional_args
-def median_absolute_error(y_true, y_pred, *, multioutput='uniform_average'):
+def median_absolute_error(y_true, y_pred, *, multioutput='uniform_average',
+                          sample_weight=None,):
Either remove the last comma or do
def median_absolute_error(
y_true, y_pred, *, multioutput='uniform_average',
sample_weight=None,
):
comma removed
sklearn/utils/fixes.py
def _take_along_axis(arr, indices, axis):
    """Implements a simplified version of numpy.take_along_axis if numpy
    version < 1.15"""
    import numpy
NumPy is already imported as np.
sklearn/utils/fixes.py
    version < 1.15"""
    import numpy

    if numpy.__version__ >= LooseVersion('1.15'):
`np_version` already contains the version:

- if numpy.__version__ >= LooseVersion('1.15'):
+ if np_version > (1, 14):
sklearn/utils/tests/test_stats.py
w_median = _weighted_percentile(x_2d, w_2d)

for i, value in enumerate(w_median):
    assert(value == _weighted_percentile(x_2d[:, i], w_2d[:, i]))
assert should not take parentheses:

- assert(value == _weighted_percentile(x_2d[:, i], w_2d[:, i]))
+ p = _weighted_percentile(x_2d[:, i], w_2d[:, i])
+ assert value == pytest.approx(p)
or
p_axis_0 = [
_weighted_percentile(x_2d[:, i], w_2d[:, i]) for i in range(len(w_median))
]
assert_allclose(w_median, p_axis_0)
sklearn/utils/tests/test_stats.py
def test_weighted_percentile_2d():
    # Check for when array is 2D
I think that we should also test the case where `array` is 2-D and `sample_weight` is 1-D.
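A sketch of what such a test might look like, assuming the PR applies a 1-D `sample_weight` to every column of the 2-D array:

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.utils.stats import _weighted_percentile

def test_weighted_percentile_2d_1d_weights():
    # One weight per row, shared by all columns of the 2-D array.
    rng = np.random.RandomState(0)
    x_2d = rng.randn(10, 3)
    w_1d = rng.rand(10)
    w_median = _weighted_percentile(x_2d, w_1d)
    expected = [_weighted_percentile(x_2d[:, i], w_1d)
                for i in range(x_2d.shape[1])]
    assert_allclose(w_median, expected)
```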
sklearn/utils/tests/test_stats.py
sw = np.ones(102, dtype=np.float64)
sw[-1] = 0.0
score = _weighted_percentile(y, sw, 50)
assert score == 1
If we have a floating point number, just use `pytest.approx(...)` here and after.
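For the snippet above, that would look something like:

- assert score == 1
+ assert score == pytest.approx(1)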
# Make estimators that make sense to test various scoring methods
sensible_regr = DecisionTreeRegressor(random_state=0)
# some of the regression scorers require strictly positive input.
-sensible_regr.fit(X_train, y_train + 1)
+if y_reg_train is None:
+    sensible_regr.fit(X_train, y_train + 1)
`y_train + 1` is to get only positive values? It seems that `_require_positive_y` would be much better, no?
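For context, roughly what such a helper does; a paraphrased sketch, not the exact source of `_require_positive_y`:

```python
import numpy as np

def _require_positive_y(y):
    """Make targets strictly positive (paraphrased sketch of the helper)."""
    offset = abs(y.min()) + 1  # shift so the smallest target becomes 1
    return y + offset

# e.g. _require_positive_y(np.array([-2, 0, 3])) -> array([1, 3, 6])
```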
I think so; I'm not sure exactly what the purpose is of sometimes using values >= 0 and other times values >= 1. Will amend. Maybe this way is faster?
@glemaitre Amended, but maybe `+ 1` is faster?
_, y_ml = make_multilabel_classification(n_samples=X.shape[0],
                                         random_state=0)
_, y_reg = make_regression(n_samples=X.shape[0], n_features=X.shape[1],
It seems that `test_scorer_sample_weight` was designed to test classification scores. I think it would be best to create two test functions, one dedicated to classification and one dedicated to regression. It will be easier to follow what is going on.
Yes it would be best
…On Tue, 26 May 2020 at 11:08, Lucy Liu wrote:

> In sklearn/utils/stats.py (#17225 (comment)):
>
>       """
>   -   sorted_idx = np.argsort(array)
>   +   n_dim = array.ndim
>   +   if n_dim == 0:
>   +       return array[()]
>   +   if array.ndim == 1:
>   +       array = array.reshape((-1, 1))
>   +   if array.shape != sample_weight.shape:
>   +       sample_weight = sample_weight.reshape(array.shape)
>
> Thanks I didn't know this. Should I add a test for this case?
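One possible test for the 0-dimensional path quoted above (hypothetical; under that handling, `array[()]` unwraps a 0-d array to a plain scalar):

```python
import numpy as np
from sklearn.utils.stats import _weighted_percentile

def test_weighted_percentile_0d():
    # A 0-d input short-circuits and is returned as a plain scalar.
    x = np.array(2.5)
    w = np.array(1.0)
    assert _weighted_percentile(x, w) == 2.5
```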
Thanks for the thorough review @glemaitre. I've added the test for a 2D array with 1D sample weight and split the scorer sample-weight test into classification and regression versions.
err_msg="scorer {0} behaves differently when "
        "ignoring samples and setting sample_weight to"
        " 0: {1} vs {2}".format(name, weighted,

if name not in REGRESSION_SCORERS:
Wasn't sure how to do this outside of the for loop without copying the `SCORERS` dict (@glemaitre).
I think that this is fine. Another way could have been
if name in CLF_SCORERS:
Annoyingly, `CLF_SCORERS` does not overlap with `CLUSTER_SCORERS` and `MULTILABEL_ONLY_SCORERS`. (Though `REGRESSION_SCORERS` overlaps with `REQUIRE_POSITIVE_Y_SCORERS`, so I could adjust the classification test.)
Ah OK, so it is good. I might have done the following to avoid a level of indent and put a nice comment:

if name in REGRESSION_SCORERS:
    # skip the regression scores since we evaluate the classification scores
    continue
err_msg="scorer {0} behaves differently when "
        "ignoring samples and setting sample_weight to"
        " 0: {1} vs {2}".format(name, weighted,

if name not in REGRESSION_SCORERS:
The last changes: I think that we do not strictly check this part of the string, so the tests did not fail, but we would not print the right error message (without the variable values).
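Concretely, the truncated `.format(...)` call means the placeholders would never be filled; restoring it would look something like this (`ignored` is an assumed name for the score computed on the reduced sample, and the values here are dummies):

```python
# "ignored" is an assumed name; dummy values make the fragment runnable.
name, weighted, ignored = "neg_median_absolute_error", 0.5, 0.4
err_msg = ("scorer {0} behaves differently when ignoring samples and "
           "setting sample_weight to 0: {1} vs {2}"
           .format(name, weighted, ignored))
```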
Reference Issues/PRs
Follows from #6217
Addresses last item in #3450
What does this implement/fix? Explain your changes.
- Add `sample_weight` to `median_absolute_error`.
- Use `sklearn.utils.stats._weighted_percentile` to calculate the weighted median, as suggested here: #6217 (comment)
- Amended `_weighted_percentile` to calculate the weighted percentile along axis=0 (for each column) for 2D array input. This is to allow multioutput. It does not change the behaviour of `_weighted_percentile` when the array is 1D. Not sure if this is the best way to implement it, so I will wait for comments before fixing tests.

Any other comments?
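An illustrative example of the axis=0 behaviour described above (shapes invented, not taken from the PR's tests):

```python
import numpy as np
from sklearn.utils.stats import _weighted_percentile

rng = np.random.RandomState(0)
abs_err = rng.rand(20, 3)   # |y_pred - y_true| for a 3-output problem
sw = rng.rand(20, 3)        # per-sample weights, one column per output

# One weighted median per column -> shape (3,), enabling multioutput support.
per_output = _weighted_percentile(abs_err, sw, percentile=50)
assert per_output.shape == (3,)
```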