
BLD Reduce binary size of wheels #25063

Merged

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Nov 29, 2022

This PR reduces the wheel size back to the sizes seen in 1.1.3. The changes include:

  • Removes the .c, .cpp, and .pyx files from the wheel

  • Uses -O2 optimization to reduce the size of the compiled files (a sketch of how this flag is wired in follows below). It is documented that -O3 is not always beneficial:

Compiling with -O3 is not a guaranteed way to improve performance, and in fact, in many cases, can slow down a system due to larger binaries and increased memory usage.
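
For illustration, a minimal sketch of how an optimization flag is typically attached to every compiled extension in a setuptools-based setup.py; the extension list below is a placeholder, not the actual scikit-learn build configuration:

# Illustrative only: one Extension entry per compiled submodule, all sharing
# the same default optimization flag (mirroring the default_extra_compile_args
# variable discussed later in this thread).
from setuptools import Extension

default_extra_compile_args = ["-O2"]  # was ["-O3"] before this PR

ext_modules = [
    Extension(
        "sklearn._loss._loss",
        sources=["sklearn/_loss/_loss.pyx"],
        extra_compile_args=default_extra_compile_args,
    ),
    # ... repeated for the other Cython/C/C++ extensions
]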

CC @jeremiedbb

@thomasjpfan thomasjpfan added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Nov 29, 2022
@thomasjpfan thomasjpfan added this to the 1.2 milestone Nov 29, 2022
@glemaitre
Contributor

The clang documentation mentions that the larger binary size is intended to speed up execution.

@ogrisel
Member

ogrisel commented Nov 29, 2022

The Linux binaries are built with gcc, not clang, though. The size problem only affects the Linux wheels, right?

@ogrisel
Member

ogrisel commented Nov 29, 2022

It would be great to trigger an ASV run of this PR vs main, but this might be expensive. @jeremiedbb, assuming you have easy access to the machine that runs the nightly benchmarks, can you please give it a try?

Alternatively, we can use another machine, but it needs to be dedicated to the benchmark for a few hours to make the results reliable enough.

@ogrisel
Member

ogrisel commented Nov 29, 2022

Actually I found a machine, let me try to do this.

@thomasjpfan
Member Author

thomasjpfan commented Nov 29, 2022

Looking at the log from the 1.1.X build, O3 was enabled by default, even though the tree's setup.py was the only one that was configured for O3.

To be safe, I reverted this PR back to O3.

Edit: The revert is in 3e605e5 (#25063) (The commit message has a typo, it should be "Revert back to O3")

@ogrisel
Member

ogrisel commented Nov 29, 2022

I ran a bunch of benchmarks, both on Linux with gcc and on macOS with clang, on various Cython / C++ intensive estimators, and I do not observe a significant performance degradation when building scikit-learn from this PR. It could even be the case that there is some slight improvement (although there is some variability in the results from run to run, especially for the hist gradient boosting models).

EDIT: this comment is about the state of the PR that used the -O2 flag.

@ogrisel
Member

ogrisel commented Nov 29, 2022

Looking at the log from the 1.1.X build, O3 was enabled by default, even though the tree's setup.py was the only one that was configured for O3.

But even in 1.1.3, the Linux binaries are already much bigger than the macOS and Windows ones. I checked that the number of files is the same between a macOS wheel and a Linux wheel; the only difference is the shipped dynamic libraries.
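
(For reference, a quick sketch of how such a file-by-file comparison can be done with Python's zipfile module; the wheel file names below are placeholders:)

# Rough sketch for comparing the extension modules shipped in two wheels.
from zipfile import ZipFile

def so_sizes(wheel_path):
    # Return {member name: uncompressed size} for the shared libraries in a wheel.
    with ZipFile(wheel_path) as zf:
        return {i.filename: i.file_size
                for i in zf.infolist() if i.filename.endswith(".so")}

mac = so_sizes("macwheel.whl")  # placeholder file names
lin = so_sizes("linwheel.whl")
print(len(mac), len(lin))  # expect the same number of extension modules
for name, size in sorted(lin.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(size, name)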

@ogrisel
Member

ogrisel commented Nov 29, 2022

Here are the sizes in bytes of the biggest .so files in the Python 3.11 wheels for x86_64 macOS and Linux for scikit-learn 1.1.3:

  • macos
dev ❯ find macwheel/ | grep .so | xargs ls -l | awk '{print $5, $9}' | sort -nr | head
2279968 macwheel//sklearn/_loss/_loss.cpython-311-darwin.so
852832 macwheel//sklearn/utils/sparsefuncs_fast.cpython-311-darwin.so
629184 macwheel//sklearn/tree/_tree.cpython-311-darwin.so
572096 macwheel//sklearn/neighbors/_ball_tree.cpython-311-darwin.so
550640 macwheel//sklearn/neighbors/_kd_tree.cpython-311-darwin.so
513968 macwheel//sklearn/metrics/_pairwise_distances_reduction.cpython-311-darwin.so
507728 macwheel//sklearn/linear_model/_cd_fast.cpython-311-darwin.so
456768 macwheel//sklearn/metrics/_dist_metrics.cpython-311-darwin.so
441680 macwheel//sklearn/cluster/_k_means_common.cpython-311-darwin.so
440576 macwheel//sklearn/cluster/_k_means_elkan.cpython-311-darwin.so
  • linux
dev ❯ find linwheel | grep .so | xargs ls -l | awk '{print $5, $9}' | sort -nr | head
14352152 linwheel/sklearn/_loss/_loss.cpython-311-x86_64-linux-gnu.so
5857096 linwheel/sklearn/utils/sparsefuncs_fast.cpython-311-x86_64-linux-gnu.so
4110592 linwheel/sklearn/tree/_tree.cpython-311-x86_64-linux-gnu.so
3591664 linwheel/sklearn/metrics/_pairwise_distances_reduction.cpython-311-x86_64-linux-gnu.so
3477360 linwheel/sklearn/linear_model/_cd_fast.cpython-311-x86_64-linux-gnu.so
3383200 linwheel/sklearn/neighbors/_ball_tree.cpython-311-x86_64-linux-gnu.so
3313568 linwheel/sklearn/neighbors/_kd_tree.cpython-311-x86_64-linux-gnu.so
3084064 linwheel/sklearn/utils/_cython_blas.cpython-311-x86_64-linux-gnu.so
2789408 linwheel/sklearn/cluster/_k_means_common.cpython-311-x86_64-linux-gnu.so
2638896 linwheel/sklearn/cluster/_k_means_elkan.cpython-311-x86_64-linux-gnu.so

@ogrisel
Member

ogrisel commented Nov 29, 2022

The vendored OpenMP runtimes are comparatively small.

@thomasjpfan
Member Author

I ran a bunch of benchmarks, both on Linux with gcc and on macOS with clang, on various Cython / C++ intensive estimators, and I do not observe a significant performance degradation when building scikit-learn from this PR.

In that case, I do like having O2 for the smaller binary size. I updated this PR to O2.

@ogrisel
Member

ogrisel commented Nov 29, 2022

I now realize that the benchmarking I did for macOS was with the -O3 version of this PR, hence the lack of performance effect was expected. I confirm that the Linux benchmarks, on the other hand, actually compared the -O2 of this PR to the -O3 in main, and so far I did not spot any regression (but I still have other benchmarks running at this time).

@adrinjalali
Member

A note on O2 vs O3: at least on GCC, the kernel community has seen behavior with O3 that resulted in wrong output. I have personally observed optimizations that resulted in a SIGTERM on O3 with gcc (albeit that was a long time ago, and it had to do with copy constructors). So I'd say that if we can, we should stick with O2.

@ogrisel
Member

ogrisel commented Nov 29, 2022

Hmm, this must be something else: the generated wheels for Linux in the last CI build are still much bigger than for macOS and Windows... 30 MB or more for Linux vs less than 10 MB for the others.

Maybe this is related to the configuration of debug symbols?

@ogrisel
Member

ogrisel commented Nov 29, 2022

Actually, when I look at the build output of main on my local machines, I see -O2 (twice) despite the fact that we have default_extra_compile_args = ["-O3"] in the setup.py file... This makes no sense.

For instance, on my Linux machine with gcc I get lines such as:

ccache gcc -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /data/parietal/store3/work/ogrisel/mambaforge/envs/dev/include -fPIC -O2 -isystem /data/par$etal/store3/work/ogrisel/mambaforge/envs/dev/include -fPIC -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -I/data/parietal/store3/work/ogrisel/mambaforge/envs/dev/lib/python3.10/site-pa$kages/numpy/core/include -I/data/parietal/store3/work/ogrisel/mambaforge/envs/dev/include/python3.10 -c sklearn/_loss/_loss.c -o build/temp.linux-x86_64-3.10/sklearn/_loss/_loss.o -O$
 -fopenmp

@ogrisel
Member

ogrisel commented Nov 29, 2022

stripping the debug symbols does not make a big difference:

(dev) du -hs sklearn
147M    sklearn
(dev) strip sklearn/**/*.so
(dev) du -hs sklearn
143M    sklearn

@jeremiedbb
Member

jeremiedbb commented Nov 29, 2022

I checked, and on Windows I see the -O3:

"C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.26.28801\bin\HostX86\x64\cl.exe" /c
/nologo /O2 /W3 /GL /DNDEBUG /MD -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -IC:\Users\J\miniconda3
\envs\dev\lib\site-packages\numpy\core\include -IC:\Users\J\miniconda3\envs\dev\include -IC:\Users\J\miniconda3\envs\dev
\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.26.28801\include" "-IC:\Program 
Files (x86)\Windows Kits\10\include\10.0.18362.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\shared"
 "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include
\10.0.18362.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\cppwinrt" /Tcsklearn\utils\_heap.c 
/Fobuild\temp.win-amd64-cpython-39\Release\sklearn\utils\_heap.obj -O3 /openmp
                                                                    ^

IIRC, flags that appear later in the command override flags that appear earlier, so O3 should be used instead of O2 (which appears earlier in the command above because it is the default set by setuptools).

However, this is not the appropriate flag for MSVC; it should be /O3:
Ligne de commande warning D9002 : option '-O3' inconnue ignorée
(Yes, my MSVC is in French :) It means that this flag is not recognized and thus ignored.)

@thomasjpfan
Member Author

thomasjpfan commented Nov 29, 2022

For reference, here are the artifacts and the total size of the compressed artifacts:

In all cases, the Linux wheels are much bigger than the macOS and Windows ones. The large Linux wheel size has existed for many versions, including PyPI 0.24.2, PyPI 1.0.2, and PyPI 1.1.3.

@jeremiedbb From looking at your output and the CI for Windows, /O2 is added automatically and -O2 is ignored.
Edit: I see you had the same point.

@thomasjpfan
Member Author

thomasjpfan commented Nov 29, 2022

Actually, when I look at the build output of main on my local machines, I see -O2 (twice) despite the fact that we have default_extra_compile_args = ["-O3"] in the setup.py file... This makes no sense.

From looking at your output, maybe the O3 is cut off at the end? On my local Linux machine and my macOS machine, I see "O2" appear multiple times, but at the very end there is -O3 -fopenmp.

On the wheel builder CI in this PR for Linux, the default is O3 but O2 is passed in later, which means O2 is used.

In summary, if we want O2, then this PR is doing the correct thing. On Linux and macOS, O2 is passed in and overrides O3. For Windows, I updated this PR to be more explicit and pass in /O2 correctly.
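
For illustration, roughly the shape of such a per-compiler switch in a setuptools-based setup.py; the exact flag handling in this PR may differ:

# Illustrative sketch only: pick the optimization flag spelling that the
# platform's default compiler actually understands.
import sys

if sys.platform == "win32":
    # MSVC ignores "-O3"/"-O2" style flags (warning D9002); it expects "/O2".
    default_extra_compile_args = ["/O2"]
else:
    # gcc/clang: -O2 keeps the binaries noticeably smaller than -O3.
    default_extra_compile_args = ["-O2"]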

@ogrisel
Member

ogrisel commented Nov 29, 2022

From looking at your output, maybe the O3 is cut off at the end? On my local Linux machine and my macOS machine, I see "O2" appear multiple times, but at the very end there is -O3 -fopenmp.

Indeed.

Member

@ogrisel ogrisel left a comment


I think this is already a net improvement, even if it does not explain the big increase in size of the Linux binaries compared to the other systems and older scikit-learn releases.

We can probably explore how to improve this further later (e.g. maybe try to use clang on Linux? Unfortunately, just setting clang in the CC and CXX variables does not seem to work).

@ogrisel
Member

ogrisel commented Nov 29, 2022

Also, after merging this, we should check the impact on the next data point on https://scikit-learn.org/scikit-learn-benchmarks/#ensemble.HistGradientBoostingClassifierBenchmark.time_fit and friends to see whether the switch to -O2 introduces a significant regression.

@thomasjpfan
Member Author

thomasjpfan commented Nov 29, 2022

On a 16 core machine on Linux, I ran:

asv continuous -b HistGradientBoostingClassifier upstream/main HEAD

And got the following results:

This PR + O2
[ 0.00%] · For scikit-learn commit be995ce1 <optimization_flag_setuptools_pr_v2> (round 1/1):
[ 0.00%] ·· Benchmarking conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl
[ 8.33%] ··· Setting up ensemble:103                                                                                                    ok
[ 8.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_fit                                                             109M
[16.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_predict                                                        99.2M
[25.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_fit                                                             993±5ms
[33.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_predict                                                      19.0±0.2ms
[41.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_test_score                                          0.7230709112942986
[50.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_train_score                                         0.9812160155622751
[50.00%] · For scikit-learn commit 808c32df (round 1/1):
[50.00%] ·· Building for conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl.
[50.00%] ·· Benchmarking conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl
[58.33%] ··· Setting up ensemble:103                                                                                                    ok
[58.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_fit                                                             110M
[66.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_predict                                                        99.1M
[75.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_fit                                                            973±10ms
[83.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_predict                                                     19.1±0.09ms
[91.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_test_score                                          0.7230709112942986
[100.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_train_score                                         0.9812160155622751

BENCHMARKS NOT SIGNIFICANTLY CHANGED.
This PR + O3
[ 0.00%] · For scikit-learn commit 163e0322 <optimization_flag_setuptools_pr_v3> (round 1/1):
[ 0.00%] ·· Benchmarking conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl
[ 8.33%] ··· Setting up ensemble:103                                                                                                    ok
[ 8.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_fit                                                             109M
[16.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_predict                                                        99.4M
[25.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_fit                                                             1.00±0s
[33.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_predict                                                      19.2±0.1ms
[41.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_test_score                                          0.7230709112942986
[50.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_train_score                                         0.9812160155622751
[50.00%] · For scikit-learn commit 808c32df (round 1/1):
[50.00%] ·· Building for conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl.
[50.00%] ·· Benchmarking conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl
[58.33%] ··· Setting up ensemble:103                                                                                                    ok
[58.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_fit                                                             110M
[66.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_predict                                                        99.4M
[75.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_fit                                                             984±9ms
[83.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_predict                                                      19.2±0.1ms
[91.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_test_score                                          0.7230709112942986
[100.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_train_score                                         0.9812160155622751

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

In all cases, the time_fit and time_predict are about the same for O2 and O3.

@ogrisel
Member

ogrisel commented Nov 29, 2022

This is compatible with what I observed. I also tried SVC and KMeans and got similar (lack of) results.

@jeremiedbb
Member

Unfortunately, just setting clang in the CC and CXX variables does not seem to work.

@ogrisel I think you also need LDSHARED="clang -shared"

Member

@jeremiedbb jeremiedbb left a comment


LGTM. Thanks @thomasjpfan

@jeremiedbb jeremiedbb merged commit f7239bf into scikit-learn:main Nov 30, 2022
40 checks passed
@ogrisel
Member

ogrisel commented Nov 30, 2022

@ogrisel I think you also need LDSHARED="clang -shared"

Indeed, I forgot about that. We can try that in a follow-up experiment. I am not sure about the performance impact and potential compatibility problems w.r.t. multithreading and OpenMP when scikit-learn is imported in a Python process that also imports other OpenMP / multithreaded libraries...

jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Dec 6, 2022