Eliminate duplicated calculations and unnecessary work for linear regression #25922

rhettinger · 2021-05-05T17:16:51Z

The current code, while pretty, repeats calculations and does unnecessary work:

* covariance() and variance() both divide by n - 1, a factor that cancels out in the slope calculation. The two divisions also cause two unnecessary roundings.
* covariance(x, y) and variance(x) both compute fmean(x). This doesn't need to be done twice.
* variance(x) uses the extremely slow internal _ss(), _sum(), and _convert() functions, whose purpose is to preserve type information. However, that type information is thrown away by linear_regression(x, y), which always returns a pair of floats:

    >>> from statistics import linear_regression
    >>> from fractions import Fraction as F
    >>> linear_regression([F(1,2), F(2,3)], [F(5,7), F(8,9)])
    LinearRegression(intercept=0.19047619047619047, slope=1.0476190476190477)

* The intercept calculation makes two more redundant fmean() calls.
* The inlined code makes the actual calculation clearer. It matches this typical presentation: slope = s_{x,y} / s^2_x
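The points above can be illustrated with a minimal sketch of an inlined calculation. This is not the merged patch: the function name `linear_regression_sketch`, its plain-tuple return value, and the use of `ValueError` are assumptions made for illustration. Each mean is computed once, and the n - 1 factor is never applied because it cancels in the ratio of the two sums.

```python
from fractions import Fraction
from math import fsum


def linear_regression_sketch(x, y):
    """Return (slope, intercept) as floats (hypothetical helper, not the stdlib API)."""
    n = len(x)
    if len(y) != n:
        raise ValueError("x and y must have the same number of data points")
    if n < 2:
        raise ValueError("at least two data points are required")
    # Each mean is computed exactly once and reused below.
    xbar = fsum(x) / n
    ybar = fsum(y) / n
    # Unscaled sums: dividing both by n - 1 would cancel in the ratio.
    sxy = fsum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = fsum((xi - xbar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x is constant")
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return slope, intercept


# Fraction inputs still yield floats, as in the example from the description.
slope, intercept = linear_regression_sketch(
    [Fraction(1, 2), Fraction(2, 3)], [Fraction(5, 7), Fraction(8, 9)]
)
print(type(slope).__name__, type(intercept).__name__)
```

Because fsum() converts its inputs to float, no type-preserving helpers are needed in this sketch.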

pablogsal

LGTM!

pablogsal · 2021-05-06T13:58:38Z

Lib/statistics.py

+    x, y = regressor, dependent_variable
+    xbar = fsum(x) / n
+    ybar = fsum(y) / n
+    sxy = fsum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))


pablogsal (Member) · May 6, 2021

Question: isn't the generator + zip going to make it slightly slower?

rhettinger (Author) · May 6, 2021

That was an existing line taken from covariance(). I think it is the fastest way to run this computation.
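One way to examine the question is to time the generator + zip form against a couple of alternatives. The following is a rough sketch only: the alternative functions, the data, and the repetition count are all assumptions chosen for illustration, and timings depend on the machine and the Python version, so no result is claimed here.

```python
from math import fsum
from timeit import timeit


def sxy_generator(x, y, xbar, ybar):
    # The form used in the diff: a generator expression over zip().
    return fsum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))


def sxy_listcomp(x, y, xbar, ybar):
    # Hypothetical alternative: build a list first, then sum it.
    return fsum([(xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)])


def sxy_indexed(x, y, xbar, ybar):
    # Hypothetical alternative: index both sequences instead of using zip().
    return fsum((x[i] - xbar) * (y[i] - ybar) for i in range(len(x)))


# Illustrative data; real workloads may behave differently.
x = [float(i) for i in range(1000)]
y = [2.0 * xi + 1.0 for xi in x]
xbar = fsum(x) / len(x)
ybar = fsum(y) / len(y)

timings = {
    func.__name__: timeit(lambda func=func: func(x, y, xbar, ybar), number=200)
    for func in (sxy_generator, sxy_listcomp, sxy_indexed)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 200 runs")
```

All three variants feed the same products to fsum() in the same order, so they return the same value and only their speed differs.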

miss-islington · 2021-05-06T14:43:16Z

Thanks @rhettinger for the PR 🌮🎉.. I'm working now to backport this PR to: 3.10.
🐍🍒⛏🤖

miss-islington · 2021-05-06T14:43:18Z

Sorry @rhettinger, I had trouble checking out the 3.10 backport branch.
Please backport using cherry_picker on command line.
cherry_picker 55b78ce3c4e23abe4f27bf16d7968f8851532e47 3.10

miss-islington · 2021-05-06T14:44:10Z

Thanks @rhettinger for the PR 🌮🎉.. I'm working now to backport this PR to: 3.10.
🐍🍒⛏🤖

bedevere-bot · 2021-05-06T14:44:18Z

GH-25945 is a backport of this pull request to the 3.10 branch.

…ression (pythonGH-25922) (cherry picked from commit 55b78ce) Co-authored-by: Raymond Hettinger <rhettinger@users.noreply.github.com>

…ression (GH-25922) (GH-25945)

rhettinger added 10 commits Mar 16, 2021

Merge pull request #1 from python/master

bbd2da9

Update to 15 March

Merge branch 'master' of github.com:python/cpython

74bdf1b

Merge branch 'master' of github.com:python/cpython

6c53f1a

.

a487c4f

Merge branch 'master' of github.com:python/cpython

.

eb56423

Merge branch 'master' of github.com:python/cpython

.

cc7ba06

Merge branch 'master' of github.com:python/cpython

.

d024dd0

Merge branch 'master' of github.com:python/cpython

merge

b10f912

Merge branch 'main' of github.com:python/cpython into main

Avoid repeated calculations, unnecessary scaling, and type preservation

c6dcb38

Improve variable names

125a09f

rhettinger requested a review from pablogsal May 5, 2021

bedevere-bot added the awaiting core review label May 5, 2021

the-knights-who-say-ni added the CLA signed label May 5, 2021

rhettinger added needs backport to 3.10 skip issue skip news type-performance labels May 5, 2021

rhettinger changed the title ~~Inline the calculations for linear regression~~ Eliminate duplicated calculations and unnecessary work for linear regression May 5, 2021

pablogsal approved these changes May 6, 2021

bedevere-bot added awaiting merge and removed awaiting core review labels May 6, 2021

pablogsal reviewed May 6, 2021

bedevere-bot removed the awaiting merge label May 6, 2021

miss-islington assigned rhettinger May 6, 2021

rhettinger added needs backport to 3.10 and removed needs backport to 3.10 labels May 6, 2021

bedevere-bot removed the needs backport to 3.10 label May 6, 2021
