
ENH Bayesian regression model dtype consistency #22525

Merged

merged 27 commits into scikit-learn:main from Micky774:bayesian_reg on Feb 27, 2022
Conversation

Contributor

@Micky774 Micky774 commented Feb 17, 2022

Reference Issues/PRs

Resolves #9087 (Stalled)
Contributes to #11000

What does this implement/fix? Explain your changes.

PR #9087: prevents Bayesian regression from aggressively casting the data to np.float64 when np.float32 is supplied.
This PR: reconciles that work with main and fixes remaining gaps in dtype conversion.
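For context, the usual scikit-learn pattern for preserving float32 input (a sketch of the general approach, not the PR's exact diff) is to list both dtypes in the validation helper:

```python
import numpy as np
from sklearn.utils import check_array

# Listing both dtypes keeps float32 input as float32,
# while everything else (int, float64, ...) is cast to float64.
X32 = np.ones((5, 3), dtype=np.float32)
X_checked = check_array(X32, dtype=[np.float64, np.float32])
print(X_checked.dtype)  # float32
```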

Any other comments?

Member

@jeremiedbb jeremiedbb left a comment

Thanks @Micky774. Here are a few comments. Also, please add a changelog entry.

sklearn/linear_model/tests/test_bayes.py: 3 resolved review threads (outdated)
Member

@jeremiedbb jeremiedbb commented Feb 21, 2022

The error between float32 and float64 is just above 1e-5. I guess it would be fine to set rtol=1e-4. The purpose of the test is to check that they are not completely different. It's hard to guarantee a specific relative diff, due to the difficulty of tracking and controlling the propagation of rounding errors.
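The kind of tolerance check under discussion can be sketched as follows (hypothetical test code, not the PR's actual test; the data and model setup are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Fit on float64 data and on a float32 view of the same data,
# then compare the coefficients loosely.
rng = np.random.RandomState(0)
X64 = rng.uniform(size=(100, 5))
y64 = X64 @ np.arange(1.0, 6.0) + 0.01 * rng.standard_normal(100)

coef64 = BayesianRidge().fit(X64, y64).coef_
coef32 = BayesianRidge().fit(X64.astype(np.float32), y64.astype(np.float32)).coef_

# Only checks the two runs are not completely different.
rel_err = np.max(np.abs(coef64 - coef32) / np.abs(coef64))
print(rel_err)
```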

Contributor Author

@Micky774 Micky774 commented Feb 22, 2022

The error between float32 and float64 is just above 1e-5. I guess it would be fine to set rtol=1e-4. The purpose of the test is to check that they are not completely different. It's hard to guarantee a specific relative diff, due to the difficulty of tracking and controlling the propagation of rounding errors.

I actually ended up allowing the internal matrices (e.g. sigma_) to be np.float64 to maintain precision in calculation, then just cast the outputs to the appropriate dtype in predict and (fit_)transform. Now all tests are passing.
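The approach described here, computing internally in float64 and casting outputs back to the input dtype, could look roughly like this (`predict_sketch` is a hypothetical illustration, not the PR's actual code):

```python
import numpy as np

def predict_sketch(X, coef64, intercept64):
    # Compute in float64 for precision...
    y_mean = X.astype(np.float64) @ coef64 + intercept64
    # ...then cast the output back to the input dtype on the way out.
    return y_mean.astype(X.dtype, copy=False)

X32 = np.ones((4, 3), dtype=np.float32)
y = predict_sketch(X32, np.arange(3, dtype=np.float64), 0.5)
print(y.dtype)  # float32
```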

Member

@jeremiedbb jeremiedbb commented Feb 23, 2022

I actually ended up allowing the internal matrices (e.g. sigma_) to be np.float64 to maintain precision in calculation, then just cast the outputs to the appropriate dtype in predict and (fit_)transform. Now all tests are passing.

The issue with that approach is that it can lead to many extra copies internally, which is something we want to avoid because it increases memory usage and slows down execution. Here is an example in _update_sigma:

def _update_sigma(self, X, alpha_, lambda_, keep_lambda):
    # See slides as referenced in the docstring note
    # this function is used when n_samples >= n_features and will
    # invert a matrix of shape (n_features, n_features)
    X_keep = X[:, keep_lambda]  # float32
    gram = np.dot(X_keep.T, X_keep)  # float32
    eye = np.eye(gram.shape[0])  # float64
    sigma_inv = lambda_[keep_lambda] * eye + alpha_ * gram  # float64 + float32 -> float64  <= copy
    sigma_ = pinvh(sigma_inv)
    return sigma_

Adding a float32 array and a float64 array makes an extra copy and so is slower than adding two float32 or two float64 arrays. See:

import numpy as np

X64 = np.random.uniform(size=(1000, 1000))
X32 = X64.astype(np.float32)
Y64 = np.random.uniform(size=(1000, 1000))
Y32 = Y64.astype(np.float32)

%timeit X32 + Y32
441 ns ± 7.83 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit X64 + Y64
442 ns ± 7.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit X32 + Y64
762 ns ± 69.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit X64 + Y32
761 ns ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Thus I think your first implementation was preferable.
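Keeping _update_sigma entirely in the input dtype, as argued for above, could look roughly like this (a sketch under the assumption that lambda_ is already stored in X's dtype; not the PR's exact diff):

```python
import numpy as np
from scipy.linalg import pinvh

def _update_sigma_sketch(X, alpha_, lambda_, keep_lambda):
    # Build every intermediate with X.dtype so no mixed-dtype
    # promotion (and hence no extra copy) occurs.
    X_keep = X[:, keep_lambda]                  # X.dtype
    gram = np.dot(X_keep.T, X_keep)             # X.dtype
    eye = np.eye(gram.shape[0], dtype=X.dtype)  # X.dtype instead of float64
    sigma_inv = lambda_[keep_lambda] * eye + alpha_ * gram
    return pinvh(sigma_inv)

X = np.random.rand(10, 4).astype(np.float32)
lam = np.ones(4, dtype=np.float32)
sigma = _update_sigma_sketch(X, 1.0, lam, np.ones(4, dtype=bool))
print(sigma.dtype)  # float32
```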

Contributor Author

@Micky774 Micky774 commented Feb 25, 2022

@jeremiedbb Thank you for your very helpful feedback -- I indeed underestimated the difference between the implementation paths. I have now reverted to the prior implementation and instead made the test more tolerant per your suggestions!

Member

@jeremiedbb jeremiedbb left a comment

LGTM. Thanks @Micky774 !

doc/whats_new/v1.1.rst Outdated Show resolved Hide resolved
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Member

@thomasjpfan thomasjpfan left a comment

As a quick check to make sure we have no regressions, can you run the following and compare the results between main and this PR?

from sklearn.linear_model import BayesianRidge
from sklearn.datasets import make_regression
from time import perf_counter

X, y = make_regression(n_samples=500_000, n_features=100, random_state=0)

reg = BayesianRidge()

start = perf_counter()
reg.fit(X, y)
end = perf_counter()
print(f"{end - start} s")

@thomasjpfan thomasjpfan changed the title [MRG] Bayesian regression model dtype consistency ENH Bayesian regression model dtype consistency Feb 25, 2022
Contributor Author

@Micky774 Micky774 commented Feb 25, 2022

As a quick check to make sure we have no regressions, can you run the following and compare the results between main and this PR?

On my system over 10 trials:

  • This PR: 4.062 ± 0.473 s
  • Main: 4.636 ± 0.615 s

Member

@jeremiedbb jeremiedbb commented Feb 25, 2022

The changes in this PR have no impact on float64 input, since they only add dtype in places where float64 is the default. Could you also check the snippet with float32 input?

Contributor Author

@Micky774 Micky774 commented Feb 25, 2022

The changes in this PR have no impact on float64 input, since they only add dtype in places where float64 is the default. Could you also check the snippet with float32 input?

On my system over 10 trials:

  • This PR: 3.596 ± 0.782 s
  • Main: 5.625 ± 0.620 s

Member

@thomasjpfan thomasjpfan left a comment

Minor comments otherwise LGTM

sklearn/linear_model/tests/test_bayes.py: 4 resolved review threads (outdated)
Micky774 and others added 2 commits Feb 26, 2022
Member

@thomasjpfan thomasjpfan left a comment

One last thought/suggestion

model.fit(X, y)
y_mean, y_std = model.predict(X, return_std=True)
assert model.coef_.dtype == X.dtype
Member

@thomasjpfan thomasjpfan Feb 27, 2022


If we are checking coef_, then I think it's good to check all the other attributes too:

Suggested change
-    model.fit(X, y)
-    y_mean, y_std = model.predict(X, return_std=True)
-    assert model.coef_.dtype == X.dtype
+    model.fit(X, y)
+    attributes = ["coef_", "alpha_", "lambda_", "sigma_", "intercept_"]
+    for attribute in attributes:
+        getattr(model, attribute).dtype == X.dtype
+    y_mean, y_std = model.predict(X, return_std=True)

Copy link
Contributor Author

@Micky774 Micky774 Feb 27, 2022


That's a good idea -- I did actually run into a problem with this before but forgot to modify the tests. Thanks for the catch!

Copy link
Contributor Author

@Micky774 Micky774 Feb 27, 2022


Actually, @thomasjpfan do we really want to check this for alpha_, lambda_, and intercept_ since they're scalars?

Went ahead and adjusted for them too.

Copy link
Member

@thomasjpfan thomasjpfan Feb 27, 2022


You did not need to make any changes to the code. The dtype checks should pass without changing the implementation.

Edit: Never mind, the scalars fail.

Copy link
Contributor Author

@Micky774 Micky774 Feb 27, 2022


It was failing for me after I included the assert statement in the loop.

Copy link
Member

@thomasjpfan thomasjpfan Feb 27, 2022


I do not think it's worth trying to get the scalars to be the same type, especially since all the scalar math is done in float64.

Type promotion with Python floats works differently for NumPy scalars and NumPy arrays:

import numpy as np

# NumPy scalar + Python scalar
print((np.float32(1.0) + 1.0).dtype)
# float64

# NumPy array + Python scalar
print((np.array([1.0, 2.3], dtype=np.float32) + 1.0).dtype)
# float32

If we want to do this "correctly" we would need to cast all the Python scalars into NumPy scalars of the correct dtype before any computation takes place. i.e. alpha_1, etc.

With that in mind, I am okay with not casting the scalars and only checking the arrays.
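Casting the Python scalars up front, as described, might look like this (a hypothetical sketch; the helper name `_cast_hyperparams` is invented, and the PR ultimately keeps the scalars in float64):

```python
import numpy as np

def _cast_hyperparams(X, alpha_1, lambda_1):
    # Convert Python floats to NumPy scalars matching X's dtype so
    # that later scalar arithmetic does not promote to float64.
    return X.dtype.type(alpha_1), X.dtype.type(lambda_1)

X32 = np.ones((3, 2), dtype=np.float32)
a1, l1 = _cast_hyperparams(X32, 1e-6, 1e-6)
print((a1 + np.float32(1.0)).dtype)  # float32: no promotion occurs
```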

sklearn/linear_model/_bayes.py: resolved review thread (outdated)

@thomasjpfan thomasjpfan merged commit 7a2a0f7 into scikit-learn:main Feb 27, 2022
31 checks passed
@Micky774 Micky774 deleted the bayesian_reg branch Feb 27, 2022