
ENH Sample weights for median_absolute_error #17225

Merged
merged 49 commits into scikit-learn:master from lucyleeow:median_abs_err on May 27, 2020

Conversation

@lucyleeow lucyleeow (Member) commented May 14, 2020

Reference Issues/PRs

Follows from #6217
Addresses last item in #3450

What does this implement/fix? Explain your changes.

Add sample_weight to median_absolute_error.
Use sklearn.utils.stats._weighted_percentile to calculate the weighted median, as suggested here: #6217 (comment)
Amended _weighted_percentile to calculate the weighted percentile along axis=0 (i.e., for each column) when the input is a 2D array. This allows multioutput support. The behaviour of _weighted_percentile for 1D arrays is unchanged. Not sure if this is the best way to implement it, so I'll wait for comments before fixing the tests.
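
For example, after this change the metric could be called like this (a usage sketch; the expected outputs assume the usual lower weighted-median convention and may differ in edge cases):

import numpy as np
from sklearn.metrics import median_absolute_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Unweighted: the median of |y_pred - y_true| = median([0.5, 0.5, 0.0, 1.0])
print(median_absolute_error(y_true, y_pred))  # 0.5

# Putting all the weight on the last sample makes the weighted median equal
# to that sample's absolute error.
w = np.array([0.0, 0.0, 0.0, 1.0])
print(median_absolute_error(y_true, y_pred, sample_weight=w))  # 1.0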

Any other comments?

@lucyleeow lucyleeow changed the title from "ENH Sample weights for median_absolute_error" to "[WIP] ENH Sample weights for median_absolute_error" May 14, 2020
@lucyleeow lucyleeow changed the title from "[WIP] ENH Sample weights for median_absolute_error" to "ENH Sample weights for median_absolute_error" May 14, 2020
@lucyleeow lucyleeow changed the title from "ENH Sample weights for median_absolute_error" to "WIP ENH Sample weights for median_absolute_error" May 14, 2020
@lucyleeow lucyleeow changed the title from "WIP ENH Sample weights for median_absolute_error" to "ENH Sample weights for median_absolute_error" May 15, 2020

@lucyleeow lucyleeow (Member, Author) commented May 15, 2020

Used np.take_along_axis to amend _weighted_percentile so it works for 2D arrays. As np.take_along_axis was introduced in numpy v1.15, added a function to fixes.py that implements a simplified version of numpy.take_along_axis when the numpy version is < 1.15.
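
For illustration, a minimal sketch of the axis=0 generalisation using np.take_along_axis (this approximates the idea only; the helper name and details are assumptions, not the merged implementation):

import numpy as np


def weighted_percentile_axis0(array, sample_weight, percentile=50):
    # Weighted percentile of each column of a 2D array; sample_weight has
    # the same shape as array (one weight per element).
    sorted_idx = np.argsort(array, axis=0)
    sorted_weights = np.take_along_axis(sample_weight, sorted_idx, axis=0)
    # Column-wise cumulative weights and the requested threshold.
    weight_cdf = np.cumsum(sorted_weights, axis=0)
    threshold = percentile / 100.0 * weight_cdf[-1]
    # Per column, the first sorted row whose cumulative weight reaches it.
    rows = np.array([np.searchsorted(weight_cdf[:, j], threshold[j])
                     for j in range(array.shape[1])])
    cols = np.arange(array.shape[1])
    return array[sorted_idx[rows, cols], cols]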

test_scorer_sample_weight doesn't work for neg_median_absolute_error due to the binary targets (also noted in the original PR #6217 (comment)), so it is skipped for this scorer. The test works for neg_median_absolute_error if the targets are continuous (e.g., using make_regression to generate data). Not sure of the best way to add/amend the test to also check neg_median_absolute_error.

ping @glemaitre :p

@NicolasHug NicolasHug (Member) left a comment

Thanks @lucyleeow, made a quick (and incomplete) pass.

I'm wondering whether it's worth extending _weighted_percentile, or if it'd be easier and faster to just have our own version of weighted_median? I would assume that computing a weighted median is simpler and faster than having a general algorithm that works for every percentile, though I haven't looked into the details.
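
For context, a dedicated weighted median could be as small as this (a hypothetical standalone helper, not code from the PR):

import numpy as np


def weighted_median(values, weights):
    # Sort the values and accumulate the weights in that order.
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum_weights = np.cumsum(weights)
    # First value whose cumulative weight reaches half the total weight.
    return values[np.searchsorted(cum_weights, 0.5 * weights.sum())]


print(weighted_median([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 5.0]))  # 4.0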

sklearn/metrics/_regression.py (two outdated review threads, resolved)
output_errors = _weighted_percentile(np.abs(y_pred - y_true),
                                     sample_weight=sample_weight)

@NicolasHug NicolasHug (Member) May 15, 2020

I think we should avoid calling _weighted_percentile if sample_weight is None and still rely on np.median. The reason being that np.median is probably much faster than our more general _weighted_percentile (could be wrong on that, but I'd be surprised).
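
A minimal sketch of that branching, reusing the names from the snippet quoted above (a fragment that would sit inside median_absolute_error, not standalone code):

# Fall back to np.median when no weights are given; otherwise use the
# weighted percentile at 50%.
if sample_weight is None:
    output_errors = np.median(np.abs(y_pred - y_true), axis=0)
else:
    output_errors = _weighted_percentile(np.abs(y_pred - y_true),
                                         sample_weight=sample_weight)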

err_msg="scorer {0} behaves differently when "
"ignoring samples and setting sample_weight to"
" 0: {1} vs {2}".format(name, weighted,
if name != 'neg_median_absolute_error':

@NicolasHug NicolasHug (Member) May 15, 2020

Can you help me understand why you'd need this?

Also, if we need to ignore neg_median_absolute_error, it would be preferable to filter it out before the for name, scorer in SCORERS.items(): loop.

@lucyleeow lucyleeow (Member, Author) May 16, 2020

Since our data is binary classification data, with the target being 1 or 0, there are only 2 possible values of np.abs(y_pred - y_true), 1 or 0. With only 2 possible values, it's likely that the median after 'randomly' removing 10 of the 25 elements (or weighting them as zero) is the same as the median with all 25 elements.
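
A tiny numeric illustration of this effect (made-up numbers, just for illustration):

import numpy as np

# 25 absolute errors that can only be 0 or 1, as with binary targets.
errors = np.array([0, 1] * 12 + [1])
print(np.median(errors))        # 1.0
print(np.median(errors[10:]))   # still 1.0 after dropping 10 elements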

To filter before the loop, would you make a copy of SCORERS and remove neg_median_absolute_error? Is there a better way?

@lucyleeow lucyleeow (Member, Author) May 18, 2020

I could instead use make_regression to create a y just for neg_median_absolute_error?

@jnothman jnothman (Member) May 19, 2020

I think using regression targets here for regression metrics seems reasonable...?

TBH, I think this test is weird. Why are we testing the metric properties of scorers when really the unit that this file should be testing is the wrapper around the metric? We should only be testing here that sample_weight is correctly passed to the metric, regardless of the parameters to make_scorer.

@lucyleeow lucyleeow (Member, Author) May 19, 2020

Fair point. Though I would like to add more tests for _weighted_percentile with respect to the weighted median. Specifically, that _weighted_percentile(percentile=50) with equal weights is the same as np.median, and also that the difference between the sums of the weights to the left and right of the weighted median is the smallest possible. Reference: https://en.wikipedia.org/wiki/Weighted_median#Properties
If you agree, should I add the tests to this PR or a different one?
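
A sketch of the first of those tests (assuming _weighted_percentile stays importable from sklearn.utils.stats as referenced above; the test actually added may differ):

import numpy as np
import pytest

from sklearn.utils.stats import _weighted_percentile


def test_weighted_percentile_equal_weights():
    # With equal weights, the weighted median should match np.median
    # (odd sample size, so the unweighted median is unambiguous).
    rng = np.random.RandomState(42)
    x = rng.randn(11)
    weights = np.ones(x.shape)
    assert np.median(x) == pytest.approx(_weighted_percentile(x, weights))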

Also, is test_gradient_boosting_loss_functions.py the best place for these tests? It makes them a bit difficult to find.

With regard to this PR, should I use regression targets just for neg_median_absolute_error or all regression metrics?

@jnothman jnothman (Member) May 19, 2020

With regard to this PR, should I use regression targets just for neg_median_absolute_error or all regression metrics?

if it works for all, that would be more elegant.

Also is test_gradient_boosting_loss_functions.py the best place for these tests? Makes it a bit difficult to find.

Not sklearn/metrics/tests/test_regression.py?

@lucyleeow lucyleeow (Member, Author) May 19, 2020

Also is test_gradient_boosting_loss_functions.py the best place for these tests? Makes it a bit difficult to find.

Whoops, I meant the tests for _weighted_percentile, which are in test_gradient_boosting_loss_functions.py. I would also like to add to the tests for _weighted_percentile.

@lucyleeow lucyleeow (Member, Author) May 19, 2020

starting here:

@lucyleeow lucyleeow (Member, Author) commented May 21, 2020

ping @jnothman

@@ -1,18 +1,36 @@
import numpy as np

from .extmath import stable_cumsum
from sklearn.utils.fixes import _take_along_axis

@glemaitre glemaitre (Contributor) May 26, 2020

It should be a relative import here

@glemaitre glemaitre (Contributor) May 26, 2020

Suggested change:
-from sklearn.utils.fixes import _take_along_axis
+from .fixes import _take_along_axis

@glemaitre glemaitre self-requested a review May 26, 2020
@@ -335,7 +336,8 @@ def mean_squared_log_error(y_true, y_pred, *,


 @_deprecate_positional_args
-def median_absolute_error(y_true, y_pred, *, multioutput='uniform_average'):
+def median_absolute_error(y_true, y_pred, *, multioutput='uniform_average',
+                          sample_weight=None,):

@glemaitre glemaitre (Contributor) May 26, 2020

Either remove the last comma or do

def median_absolute_error(
    y_true, y_pred, *, multioutput='uniform_average',
    sample_weight=None,
):

@lucyleeow lucyleeow (Member, Author) May 26, 2020

comma removed

sklearn/metrics/_regression.py (review thread resolved)

def _take_along_axis(arr, indices, axis):
    """Implements a simplified version of numpy.take_along_axis if numpy
    version < 1.15"""
    import numpy

@glemaitre glemaitre (Contributor) May 26, 2020

Numpy is already imported as np

    version < 1.15"""
    import numpy

    if numpy.__version__ >= LooseVersion('1.15'):

@glemaitre glemaitre (Contributor) May 26, 2020

np_version is already containing the version

Suggested change:
-    if numpy.__version__ >= LooseVersion('1.15'):
+    if np_version > (1, 14):
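
Putting these review comments together, the fixes.py helper might end up looking roughly like this (a sketch under the assumption that np_version is the version tuple already defined in sklearn.utils.fixes; a stand-in definition is shown so the snippet is self-contained, and the merged code may differ):

import numpy as np

# Stand-in for the np_version tuple that sklearn.utils.fixes already defines.
np_version = tuple(int(x) for x in np.__version__.split('.')[:2])


def _take_along_axis(arr, indices, axis=0):
    # Use the real implementation when available (numpy >= 1.15).
    if np_version > (1, 14):
        return np.take_along_axis(arr, indices, axis=axis)
    # Simplified fallback covering only the axis=0 cases needed here.
    if arr.ndim == 1:
        return arr[indices]
    return arr[indices, np.arange(arr.shape[1])]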

w_median = _weighted_percentile(x_2d, w_2d)

for i, value in enumerate(w_median):
    assert(value == _weighted_percentile(x_2d[:, i], w_2d[:, i]))

@glemaitre glemaitre (Contributor) May 26, 2020

assert should not take parentheses:

Suggested change:
-    assert(value == _weighted_percentile(x_2d[:, i], w_2d[:, i]))
+    p = _weighted_percentile(x_2d[:, i], w_2d[:, i])
+    assert value == pytest.approx(p)

@glemaitre glemaitre (Contributor) May 26, 2020

or

p_axis_0 = [
    _weighted_percentile(x_2d[:, i], w_2d[:, i]) for i in range(len(w_median))
]
assert_allclose(w_median, p_axis_0)



def test_weighted_percentile_2d():
    # Check for when array is 2D

@glemaitre glemaitre (Contributor) May 26, 2020

I think that we should also test the case where the array is 2-D and sample_weight is 1-D.
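
A sketch of such a test (shapes and names are illustrative; the final test added in this PR may differ):

import numpy as np
from numpy.testing import assert_allclose

from sklearn.utils.stats import _weighted_percentile


def test_weighted_percentile_2d_with_1d_weights():
    # 2-D data with a single 1-D weight vector shared by all columns.
    rng = np.random.RandomState(0)
    x_2d = rng.randint(20, size=(10, 3)).astype(np.float64)
    w_1d = rng.randint(1, 5, size=10).astype(np.float64)

    w_median = _weighted_percentile(x_2d, w_1d)
    # Each column should match the 1-D computation with the same weights.
    p_axis_0 = [_weighted_percentile(x_2d[:, i], w_1d)
                for i in range(x_2d.shape[1])]
    assert_allclose(w_median, p_axis_0)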

sw = np.ones(102, dtype=np.float64)
sw[-1] = 0.0
score = _weighted_percentile(y, sw, 50)
assert score == 1

@glemaitre glemaitre (Contributor) May 26, 2020

if we have a floating point number, just use pytest.approx(...) here and after.

 # Make estimators that make sense to test various scoring methods
 sensible_regr = DecisionTreeRegressor(random_state=0)
 # some of the regressions scorers require strictly positive input.
-sensible_regr.fit(X_train, y_train + 1)
+if y_reg_train is None:
+    sensible_regr.fit(X_train, y_train + 1)

@glemaitre glemaitre (Contributor) May 26, 2020

y_train + 1 is to get only positive values? It seems that _require_positive_y would be much better, no?

@lucyleeow lucyleeow (Member, Author) May 26, 2020

I think so; I'm not sure exactly what the purpose is of sometimes using values >= 0 and other times using values >= 1. Will amend. Maybe this way is faster?

@lucyleeow lucyleeow (Member, Author) May 26, 2020

@glemaitre Amended but maybe + 1 is faster?

_, y_ml = make_multilabel_classification(n_samples=X.shape[0],
                                         random_state=0)
_, y_reg = make_regression(n_samples=X.shape[0], n_features=X.shape[1],

@glemaitre glemaitre (Contributor) May 26, 2020

It seems that test_scorer_sample_weight was designed to test classification scorers.

I think it would be best to create two test functions, one dedicated to classification and one dedicated to regression. It will be easier to follow what is going on.

@glemaitre glemaitre (Contributor) commented May 26, 2020

@lucyleeow lucyleeow (Member, Author) commented May 26, 2020

Thanks for the thorough review @glemaitre. I've added the test for 2D array and 1D sample weight and split test_scorer_sample_weight into reg and clf versions.

err_msg="scorer {0} behaves differently when "
"ignoring samples and setting sample_weight to"
" 0: {1} vs {2}".format(name, weighted,
if name not in REGRESSION_SCORERS:

@lucyleeow lucyleeow (Member, Author) May 26, 2020

Wasn't sure how to do this outside of the for loop without copying the SCORERS dict (@glemaitre)

@glemaitre glemaitre (Contributor) May 27, 2020

I think that this is fine. Another way could have been

if name in CLF_SCORERS:

@lucyleeow lucyleeow (Member, Author) May 27, 2020

Annoyingly CLF_SCORERS does not overlap with CLUSTER_SCORERS and MULTILABEL_ONLY_SCORERS.

(though REGRESSION_SCORERS overlaps with REQUIRE_POSITIVE_Y_SCORERS so I could adjust the classification test)

@glemaitre glemaitre (Contributor) May 27, 2020

Ah ok so it is good.

I might have done the following to avoid a level of indentation and to include an explanatory comment:

if name in REGRESSION_SCORERS:
    # skip the regression scores since we evaluate the classification scores
    continue 

@glemaitre glemaitre (Contributor) left a comment

Apart from style, it looks good to me.

sklearn/metrics/_regression.py (review threads resolved)

sklearn/metrics/tests/test_score_objects.py (eight outdated review threads, resolved)

@glemaitre glemaitre (Contributor) commented May 27, 2020

The last changes. I think that we do not strictly check this part of the string, so the tests did not fail, but we would not print the right error message (it would be missing the variable values).

@glemaitre glemaitre merged commit f93f560 into scikit-learn:master May 27, 2020
21 checks passed
@lucyleeow lucyleeow deleted the median_abs_err branch Jun 10, 2020
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this issue Jun 26, 2020
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this issue Oct 22, 2020