ENH Bayesian regression model dtype consistency #22525
Conversation
This does not include the changes to `bayes.py` present in the original PR, since the file has since been changed to `_bayes.py` and must be reconciled.
Thanks @Micky774. Here are a few comments. Also, please add a changelog entry.
The error between float32 and float64 is just above 1e-5. I guess it would be fine to set …

I actually ended up allowing the internal matrices (e.g. …
The issue with that approach is that it can lead to many extra copies internally, which is something we want to avoid because it increases memory usage and slows down execution. Here is an example in `_update_sigma`:

```python
def _update_sigma(self, X, alpha_, lambda_, keep_lambda):
    # See slides as referenced in the docstring note.
    # This function is used when n_samples >= n_features and will
    # invert a matrix of shape (n_features, n_features).
    X_keep = X[:, keep_lambda]  # float32
    gram = np.dot(X_keep.T, X_keep)  # float32
    eye = np.eye(gram.shape[0])  # float64
    sigma_inv = lambda_[keep_lambda] * eye + alpha_ * gram  # float64 + float32 -> float64 <= copy
    sigma_ = pinvh(sigma_inv)
    return sigma_
```

The addition of a float32 array and a float64 array makes an extra copy and so is slower than adding two float32 or two float64 arrays. See:

```python
import numpy as np

X64 = np.random.uniform((1000, 1000))
X32 = X64.astype(np.float32)
Y64 = np.random.uniform((1000, 1000))
Y32 = Y64.astype(np.float32)
```

```
%timeit X32 + Y32
441 ns ± 7.83 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit X64 + Y64
442 ns ± 7.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit X32 + Y64
762 ns ± 69.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit X64 + Y32
761 ns ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```

Thus I think your first implementation was preferable.
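To illustrate the point, here is a minimal sketch (not the PR's actual code) of how the promotion copy in `_update_sigma` above can be avoided by creating the identity matrix in `X`'s dtype, assuming `alpha_` and `lambda_` have already been cast to that dtype:

```python
import numpy as np
from scipy.linalg import pinvh


def update_sigma_consistent(X, alpha_, lambda_, keep_lambda):
    # Keep every intermediate in X's dtype so no float32 + float64
    # promotion (and its extra copy) occurs.
    X_keep = X[:, keep_lambda]
    gram = X_keep.T @ X_keep                    # X.dtype
    eye = np.eye(gram.shape[0], dtype=X.dtype)  # match X.dtype explicitly
    sigma_inv = lambda_[keep_lambda] * eye + alpha_ * gram
    return pinvh(sigma_inv)


rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10)).astype(np.float32)
alpha_ = np.float32(1.0)
lambda_ = np.ones(10, dtype=np.float32)
keep_lambda = np.ones(10, dtype=bool)
sigma = update_sigma_consistent(X, alpha_, lambda_, keep_lambda)
print(sigma.dtype)
```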
@jeremiedbb Thank you for your very helpful feedback -- I indeed underestimated the difference between the implementation paths. I have now reverted to the prior implementation and instead made the test more tolerant per your suggestions!
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
As a quick check to make sure we have no regressions, can you run the following and compare the results between main and this PR?

```python
from sklearn.linear_model import BayesianRidge
from sklearn.datasets import make_regression
from time import perf_counter

X, y = make_regression(n_samples=500_000, n_features=100, random_state=0)
reg = BayesianRidge()
start = perf_counter()
reg.fit(X, y)
end = perf_counter()
print(f"{end - start} s")
```
On my system over 10 trials:
The changes in this PR have no impact on float64 input, since it only adds dtype in places where float64 is the default. Could you also check the snippet on float32 input?
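For reference, the float32 check could look like the variant below; only the `astype` casts are new relative to the snippet above. The sizes are reduced here so it runs quickly; bump `n_samples` back to `500_000` to reproduce the original comparison.

```python
import numpy as np
from time import perf_counter
from sklearn.datasets import make_regression
from sklearn.linear_model import BayesianRidge

# Smaller than the benchmark above for a quick run.
X, y = make_regression(n_samples=5_000, n_features=100, random_state=0)
X32, y32 = X.astype(np.float32), y.astype(np.float32)

reg = BayesianRidge()
start = perf_counter()
reg.fit(X32, y32)
elapsed = perf_counter() - start
print(f"{elapsed:.3f} s, coef dtype: {reg.coef_.dtype}")
```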
On my system over 10 trials:
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
```python
model.fit(X, y)
y_mean, y_std = model.predict(X, return_std=True)
assert model.coef_.dtype == X.dtype
```
If we are checking `coef_`, then I think it's good to check all the other attributes too:
Suggested change:

```python
model.fit(X, y)
attributes = ["coef_", "alpha_", "lambda_", "sigma_", "intercept_"]
for attribute in attributes:
    getattr(model, attribute).dtype == X.dtype
y_mean, y_std = model.predict(X, return_std=True)
```
That's a good idea -- I did actually run into a problem with this before but forgot to modify the tests. Thanks for the catch!
Actually, @thomasjpfan do we really want to check this for `alpha_`, `lambda_`, and `intercept_`, since they're scalars?
Went ahead and adjusted for them too.
You did not need to make any changes to the code. The dtype checks should pass without changing the implementation.

Edit: Never mind, the scalars fail.
It was failing for me after I included the assert statement in the loop.
I do not think it's worth trying to get the scalars to be the same type, especially if all the scalar math is done in float64. Type promotion with Python floats works differently for NumPy scalars and NumPy arrays:
```python
import numpy as np

# NumPy scalar + Python scalar
print((np.float32(1.0) + 1.0).dtype)
# float64

# NumPy array + Python scalar
print((np.array([1.0, 2.3], dtype=np.float32) + 1.0).dtype)
# float32
```
If we want to do this "correctly" we would need to cast all the Python scalars into NumPy scalars of the correct dtype before any computation takes place, i.e. `alpha_1`, etc. With that in mind, I am okay with not casting the scalars and only checking the arrays.
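The casting described above could be sketched as follows; `cast_hyperparams` is a hypothetical helper (not part of the PR), shown only to illustrate converting Python floats to NumPy scalars of the input's dtype:

```python
import numpy as np


def cast_hyperparams(X, alpha_1=1e-6, lambda_1=1e-6):
    # Hypothetical helper: cast Python-float hyperparameters to
    # NumPy scalars of X's dtype so later arithmetic with float32
    # arrays does not promote the result to float64.
    dtype = X.dtype
    return dtype.type(alpha_1), dtype.type(lambda_1)


X = np.ones((3, 2), dtype=np.float32)
a1, l1 = cast_hyperparams(X)
# NumPy float32 scalar + float32 array stays float32.
print((a1 + X).dtype)
# float32
```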
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Reference Issues/PRs

Resolves #9087 (stalled)
Contributes to #11000

What does this implement/fix? Explain your changes.

PR #9087: avoids Bayesian regression aggressively casting the data to `np.float64` when `np.float32` is supplied.
This PR: reconciles with main and fixes gaps in `dtype` conversion.

Any other comments?