
BLD Reduce binary size of wheels #25063

Merged

Conversation

thomasjpfan
Member

@thomasjpfan thomasjpfan commented Nov 29, 2022

This PR reduces the wheel size back to the sizes seen in 1.1.3. The changes include:

  • Removes the .c, .cpp, and .pyx files from the wheel

  • Uses -O2 optimization to reduce the size of the compiled files (a sketch of how this flag is wired in follows below). It is documented that -O3 is not always beneficial:

Compiling with -O3 is not a guaranteed way to improve performance, and in fact, in many cases, can slow down a system due to larger binaries and increased memory usage.
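
For illustration, a minimal sketch of how an optimization flag is typically attached to every compiled extension in a setuptools-based setup.py; the extension list below is a placeholder, not the actual scikit-learn build configuration:

# Illustrative only: one Extension entry per compiled submodule, all sharing
# the same default optimization flag (mirroring the default_extra_compile_args
# variable discussed later in this thread).
from setuptools import Extension

default_extra_compile_args = ["-O2"]  # was ["-O3"] before this PR

ext_modules = [
    Extension(
        "sklearn._loss._loss",
        sources=["sklearn/_loss/_loss.pyx"],
        extra_compile_args=default_extra_compile_args,
    ),
    # ... repeated for the other Cython/C/C++ extensions
]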

CC @jeremiedbb

@thomasjpfan thomasjpfan added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Nov 29, 2022
@thomasjpfan thomasjpfan added this to the 1.2 milestone Nov 29, 2022
@glemaitre
Contributor

The clang documentation mentions that the larger binary size is intended to speed up execution.

@ogrisel
Member

ogrisel commented Nov 29, 2022

The Linux binaries are built with gcc, not clang, though. The size problem only affects the Linux wheels, right?

@ogrisel
Member

ogrisel commented Nov 29, 2022

It would be great to trigger an ASV run of this PR vs main, but this might be expensive. @jeremiedbb, assuming you have easy access to the machine that runs the nightly benchmarks, can you please give it a try?

Alternatively, we can use another machine, but it needs to be dedicated to the benchmark for a few hours to make the results reliable enough.

@ogrisel
Member

ogrisel commented Nov 29, 2022

Actually I found a machine, let me try to do this.

@thomasjpfan
Member Author

thomasjpfan commented Nov 29, 2022

Looking at the log from the 1.1.X build, O3 was enabled by default, even though the tree's setup.py was the only one that was configured for O3.

To be safe, I reverted this PR back to O3.

Edit: The revert is in 3e605e5 (#25063) (The commit message has a typo, it should be "Revert back to O3")

@ogrisel
Member

ogrisel commented Nov 29, 2022

I ran a bunch of benchmarks, both on Linux with gcc and on macOS with clang, on various Cython / C++ intensive estimators, and I do not observe a significant performance degradation when building scikit-learn from this PR. It could even be the case that there is some slight improvement (although there is some variability in the results from run to run, especially for the hist gradient boosting models).

EDIT: this comment is about the state of the PR that used the -O2 flag.

@ogrisel
Member

ogrisel commented Nov 29, 2022

Looking at the log from the 1.1.X build, O3 was enabled by default, even though the tree's setup.py was the only one that was configured for O3.

But even in 1.1.3, the Linux binaries are already much bigger than the macOS and Windows ones. I checked that the number of files is the same between a macOS wheel and a Linux wheel; the only difference is the shipped dynamic libraries.
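
(For reference, a quick sketch of how such a file-by-file comparison can be done with Python's zipfile module; the wheel file names below are placeholders:)

# Rough sketch for comparing the extension modules shipped in two wheels.
from zipfile import ZipFile

def so_sizes(wheel_path):
    # Return {member name: uncompressed size} for the shared libraries in a wheel.
    with ZipFile(wheel_path) as zf:
        return {i.filename: i.file_size
                for i in zf.infolist() if i.filename.endswith(".so")}

mac = so_sizes("macwheel.whl")  # placeholder file names
lin = so_sizes("linwheel.whl")
print(len(mac), len(lin))  # expect the same number of extension modules
for name, size in sorted(lin.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(size, name)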

@ogrisel
Member

ogrisel commented Nov 29, 2022

Here are the sizes in bytes of the biggest .so files in the Python 3.11 wheels for x86_64 macOS and Linux for scikit-learn 1.1.3:

  • macos
dev ❯ find macwheel/ | grep .so | xargs ls -l | awk '{print $5, $9}' | sort -nr | head
2279968 macwheel//sklearn/_loss/_loss.cpython-311-darwin.so
852832 macwheel//sklearn/utils/sparsefuncs_fast.cpython-311-darwin.so
629184 macwheel//sklearn/tree/_tree.cpython-311-darwin.so
572096 macwheel//sklearn/neighbors/_ball_tree.cpython-311-darwin.so
550640 macwheel//sklearn/neighbors/_kd_tree.cpython-311-darwin.so
513968 macwheel//sklearn/metrics/_pairwise_distances_reduction.cpython-311-darwin.so
507728 macwheel//sklearn/linear_model/_cd_fast.cpython-311-darwin.so
456768 macwheel//sklearn/metrics/_dist_metrics.cpython-311-darwin.so
441680 macwheel//sklearn/cluster/_k_means_common.cpython-311-darwin.so
440576 macwheel//sklearn/cluster/_k_means_elkan.cpython-311-darwin.so
  • linux
dev ❯ find linwheel | grep .so | xargs ls -l | awk '{print $5, $9}' | sort -nr | head
14352152 linwheel/sklearn/_loss/_loss.cpython-311-x86_64-linux-gnu.so
5857096 linwheel/sklearn/utils/sparsefuncs_fast.cpython-311-x86_64-linux-gnu.so
4110592 linwheel/sklearn/tree/_tree.cpython-311-x86_64-linux-gnu.so
3591664 linwheel/sklearn/metrics/_pairwise_distances_reduction.cpython-311-x86_64-linux-gnu.so
3477360 linwheel/sklearn/linear_model/_cd_fast.cpython-311-x86_64-linux-gnu.so
3383200 linwheel/sklearn/neighbors/_ball_tree.cpython-311-x86_64-linux-gnu.so
3313568 linwheel/sklearn/neighbors/_kd_tree.cpython-311-x86_64-linux-gnu.so
3084064 linwheel/sklearn/utils/_cython_blas.cpython-311-x86_64-linux-gnu.so
2789408 linwheel/sklearn/cluster/_k_means_common.cpython-311-x86_64-linux-gnu.so
2638896 linwheel/sklearn/cluster/_k_means_elkan.cpython-311-x86_64-linux-gnu.so

@ogrisel
Member

ogrisel commented Nov 29, 2022

The vendored OpenMP runtimes are comparatively small.

@thomasjpfan
Member Author

I ran a bunch of benchmarks, both on Linux with gcc and on macOS with clang, on various Cython / C++ intensive estimators, and I do not observe a significant performance degradation when building scikit-learn from this PR.

In that case, I do like having O2 for the smaller binary size. I updated this PR to O2.

@ogrisel
Member

ogrisel commented Nov 29, 2022

I now realize that the benchmarking I did for macOS was with the -O3 version of this PR, hence the lack of performance effect was expected. I confirm that the Linux benchmarks, on the other hand, actually compared the -O2 of this PR to the -O3 in main, and so far I did not spot any regression (but I still have other benchmarks running at this time).

@adrinjalali
Member

A note on O2 vs O3: at least on GCC, the kernel community has seen behavior with O3 that resulted in wrong output. I have personally observed optimizations that resulted in a SIGTERM on O3 with gcc (albeit that was a long time ago, and it had to do with copy constructors). So I'd say that if we can, we should stick with O2.

@ogrisel
Member

ogrisel commented Nov 29, 2022

Hmm, this must be something else: the generated wheels for Linux in the last CI build are still much bigger than for macOS and Windows... 30 MB or more for Linux vs less than 10 MB for the others.

Maybe this is related to the configuration of debug symbols?

@ogrisel
Member

ogrisel commented Nov 29, 2022

Actually, when I look at the build output of main on my local machines, I see -O2 (twice) despite the fact that we have default_extra_compile_args = ["-O3"] in the setup.py file... This makes no sense.

For instance, on my Linux machine with gcc I get lines such as:

ccache gcc -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /data/parietal/store3/work/ogrisel/mambaforge/envs/dev/include -fPIC -O2 -isystem /data/par$etal/store3/work/ogrisel/mambaforge/envs/dev/include -fPIC -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -I/data/parietal/store3/work/ogrisel/mambaforge/envs/dev/lib/python3.10/site-pa$kages/numpy/core/include -I/data/parietal/store3/work/ogrisel/mambaforge/envs/dev/include/python3.10 -c sklearn/_loss/_loss.c -o build/temp.linux-x86_64-3.10/sklearn/_loss/_loss.o -O$
 -fopenmp

@ogrisel
Member

ogrisel commented Nov 29, 2022

stripping the debug symbols does not make a big difference:

(dev) du -hs sklearn
147M    sklearn
(dev) strip sklearn/**/*.so
(dev) du -hs sklearn
143M    sklearn

@jeremiedbb
Member

jeremiedbb commented Nov 29, 2022

I checked, and on Windows I see the -O3:

"C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.26.28801\bin\HostX86\x64\cl.exe" /c
/nologo /O2 /W3 /GL /DNDEBUG /MD -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -IC:\Users\J\miniconda3
\envs\dev\lib\site-packages\numpy\core\include -IC:\Users\J\miniconda3\envs\dev\include -IC:\Users\J\miniconda3\envs\dev
\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.26.28801\include" "-IC:\Program 
Files (x86)\Windows Kits\10\include\10.0.18362.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\shared"
 "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include
\10.0.18362.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\cppwinrt" /Tcsklearn\utils\_heap.c 
/Fobuild\temp.win-amd64-cpython-39\Release\sklearn\utils\_heap.obj -O3 /openmp
                                                                    ^

IIRC, flags that appear later in the command override flags that appear earlier, so O3 should be used instead of O2 (which appears earlier in the command above because it is the default set by setuptools).

However, this is not the appropriate flag for MSVC; it should be /O3:
Ligne de commande warning D9002 : option '-O3' inconnue ignorée
(Yes, my MSVC is in French :) It means that this flag is not recognized and thus ignored.)

@thomasjpfan
Member Author

thomasjpfan commented Nov 29, 2022

For reference, here are the artifacts and the total size of the compressed artifacts:

In all cases, the Linux wheels are much bigger than the macOS and Windows ones. The large Linux wheel size has existed for many versions, including PyPI 0.24.2, PyPI 1.0.2, and PyPI 1.1.3.

@jeremiedbb From looking at your output and the CI for Windows, /O2 is added automatically and -O2 is ignored.
Edit: I see you had the same point.

@thomasjpfan
Member Author

thomasjpfan commented Nov 29, 2022

Actually, when I look at the build output of main on my local machines, I see -O2 (twice) despite the fact that we have default_extra_compile_args = ["-O3"] in the setup.py file... This makes no sense.

From looking at your output, maybe the O3 is cut off at the end? On my local Linux machine and my macOS machine, I see "O2" appear multiple times, but at the very end there is -O3 -fopenmp.

On the wheel builder CI in this PR for Linux, the default is O3 but O2 is passed in later, which means O2 is used.

In summary, if we want O2, then this PR is doing the correct thing. On Linux and macOS, O2 is passed in and overrides O3. For Windows, I updated this PR to be more explicit and pass in /O2 correctly.
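
For illustration, roughly the shape of such a per-compiler switch in a setuptools-based setup.py; the exact flag handling in this PR may differ:

# Illustrative sketch only: pick the optimization flag spelling that the
# platform's default compiler actually understands.
import sys

if sys.platform == "win32":
    # MSVC ignores "-O3"/"-O2" style flags (warning D9002); it expects "/O2".
    default_extra_compile_args = ["/O2"]
else:
    # gcc/clang: -O2 keeps the binaries noticeably smaller than -O3.
    default_extra_compile_args = ["-O2"]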

@ogrisel
Member

ogrisel commented Nov 29, 2022

From looking at your output, maybe the O3 is cut off at the end? On my local Linux machine and my macOS machine, I see "O2" appear multiple times, but at the very end there is -O3 -fopenmp.

Indeed.

Member

@ogrisel ogrisel left a comment


I think this is already a net improvement, even if it does not explain the big increase in size of the Linux binaries compared to the other systems and older scikit-learn releases.

We can probably explore how to improve this further later (e.g. maybe try to use clang on Linux? Unfortunately, just setting clang in the CC and CXX variables does not seem to work).

@ogrisel
Member

ogrisel commented Nov 29, 2022

Also, after merging this, we should check the impact on the next data point on https://scikit-learn.org/scikit-learn-benchmarks/#ensemble.HistGradientBoostingClassifierBenchmark.time_fit and friends to see whether the switch to -O2 introduces a significant regression.

@thomasjpfan
Member Author

thomasjpfan commented Nov 29, 2022

On a 16 core machine on Linux, I ran:

asv continuous -b HistGradientBoostingClassifier upstream/main HEAD

And got the following results:

This PR + O2
[ 0.00%] · For scikit-learn commit be995ce1 <optimization_flag_setuptools_pr_v2> (round 1/1):
[ 0.00%] ·· Benchmarking conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl
[ 8.33%] ··· Setting up ensemble:103                                                                                                    ok
[ 8.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_fit                                                             109M
[16.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_predict                                                        99.2M
[25.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_fit                                                             993±5ms
[33.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_predict                                                      19.0±0.2ms
[41.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_test_score                                          0.7230709112942986
[50.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_train_score                                         0.9812160155622751
[50.00%] · For scikit-learn commit 808c32df (round 1/1):
[50.00%] ·· Building for conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl.
[50.00%] ·· Benchmarking conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl
[58.33%] ··· Setting up ensemble:103                                                                                                    ok
[58.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_fit                                                             110M
[66.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_predict                                                        99.1M
[75.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_fit                                                            973±10ms
[83.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_predict                                                     19.1±0.09ms
[91.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_test_score                                          0.7230709112942986
[100.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_train_score                                         0.9812160155622751

BENCHMARKS NOT SIGNIFICANTLY CHANGED.
This PR + O3
[ 0.00%] · For scikit-learn commit 163e0322 <optimization_flag_setuptools_pr_v3> (round 1/1):
[ 0.00%] ·· Benchmarking conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl
[ 8.33%] ··· Setting up ensemble:103                                                                                                    ok
[ 8.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_fit                                                             109M
[16.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_predict                                                        99.4M
[25.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_fit                                                             1.00±0s
[33.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_predict                                                      19.2±0.1ms
[41.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_test_score                                          0.7230709112942986
[50.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_train_score                                         0.9812160155622751
[50.00%] · For scikit-learn commit 808c32df (round 1/1):
[50.00%] ·· Building for conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl.
[50.00%] ·· Benchmarking conda-py3.10-cython-joblib-numpy-pandas-scipy-threadpoolctl
[58.33%] ··· Setting up ensemble:103                                                                                                    ok
[58.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_fit                                                             110M
[66.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.peakmem_predict                                                        99.4M
[75.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_fit                                                             984±9ms
[83.33%] ··· ensemble.HistGradientBoostingClassifierBenchmark.time_predict                                                      19.2±0.1ms
[91.67%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_test_score                                          0.7230709112942986
[100.00%] ··· ensemble.HistGradientBoostingClassifierBenchmark.track_train_score                                         0.9812160155622751

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

In all cases, the time_fit and time_predict are about the same for O2 and O3.

@ogrisel
Member

ogrisel commented Nov 29, 2022

This is compatible with what I observed. I also tried SVC and KMeans and got similar (lack of) results.

@jeremiedbb
Member

Unfortunately, just setting clang in the CC and CXX variables does not seem to work.

@ogrisel I think you also need LDSHARED="clang -shared"

Member

@jeremiedbb jeremiedbb left a comment


LGTM. Thanks @thomasjpfan

@jeremiedbb jeremiedbb merged commit f7239bf into scikit-learn:main Nov 30, 2022
40 checks passed
@ogrisel
Member

ogrisel commented Nov 30, 2022

@ogrisel I think you also need LDSHARED="clang -shared"

Indeed, I forgot about that. We can try that in a follow-up experiment. I am not sure about the performance impact and potential compatibility problems w.r.t. multithreading and OpenMP when scikit-learn is imported in a Python process that also imports other OpenMP / multithreaded libraries...

jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Dec 6, 2022