Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python: Automated subclass models #15044

Merged
merged 103 commits into from Jan 5, 2024

Conversation

RasmusWL
Copy link
Member

@RasmusWL RasmusWL commented Dec 8, 2023

Overview

This PR adds automatically captured subclass information for a the majority of interesting PyPI packages. As an example of what this PR achieves:

We've traditionally had automatic dependency installation available on the codeql-action, but are moving to a solution without this (see https://github.blog/changelog/2023-07-12-code-scanning-with-codeql-no-longer-installs-python-dependencies-automatically-for-new-users/).

We've previously relied on analyzing installed dependencies to reach this conclusion (by following subclass relationship). This PR is the solution to still be able to reach the same conclusion when we stop installing dependencies.

We achieve this by using the extensible type-models to ahead-of-time record important subclass/aliasing information. For example, see the very first commit (2f17d2f)

Our internal testing shows that for all but a few cases, we end up with a solution comparable or better to what we had before, even when narrowing our focus to repos where dependency installation was successful before.

(Thanks to @tausbn for helping with the modeling ❤️)

Reviewing this PR

What a mess. Sorry. It's a mix of working on the tooling process-mrva-results.py/SubclassFinder.qll and enabling subclasses/aliases to be found in the actual modeling. The latter bit required removing the private annotations for much of our modeling. I think we'll just have to live with that.

The only commits I've found that don't follow this pattern, and that could have been made into separate PRs, are:

Notes

  1. The tooling to generate these subclass-capture models automatically only lives internally.
  2. to ensure this automated modeling could still be recreated once we don't do dependency installation (if we wanted to use a different format say), the actual modeling with MRVA has only been done while I made sure we wouldn't make use of any dependencies that might have been installed (specifically by this commit).

RasmusWL and others added 30 commits December 8, 2023 11:27
Based on some DBs I had that contained dependencies
Also makes `empty.model.yml` empty once again
(makes future diffing much easier)
This is important to model mixins correctly, for example when they help
handle incoming requests, and therefore need to know that `self.kwargs`
contains data controlled by a user.
:thinkies: turns out that .getASubclass*() had to be applied everywhere...
This required making some of the relevant bits public, but they are marked as internal anyway.
Same trick as 'generate-code-scanning-query-list.py'
yoff and others added 23 commits December 19, 2023 17:07
for module entry definitions from the dataflow graph.
mostly removing of nodes from the graph.
One result lost:
```
check("submodule.submodule_attr", submodule.submodule_attr, "submodule_attr", globals()) #$ MISSING:prints=submodule_attr
```
Co-authored-by: Taus <tausbn@github.com>
Co-authored-by: Taus <tausbn@github.com>
these changes took performance for loading and writing all files locally
29.60s to 3.17s

(that is, using `gather_from_existing`)
Verified by joining all files, splitting again, and observing no diff in
git.

(these operations only take a few seconds on my local machine, so
shouldn't be too much of an issue)
Although it might be hidden by github UI by default, it could be
interesting for a reviewer to notice the effect changes in the modeling
query has to the results in this file.
In reality, we only want to model this as a `rest_framework.response.Response`, since our .qll modeling is more precise for rest-framework responses than if we also modeled it as a basic django http response. (specifically, that default mime-type handling is way different).
A little more explicit, so less prone to be overlooked when adding a new spec
In the final git history this only deletes one file, but when working
locally I deleted ALL files.
(locally done with split + 5 x modeling runs + join, but squashed into one commit)
This reverts commit 0ed363bd79f9d3f9e9a905c1192adfe88f1faffb.
@RasmusWL
Copy link
Member Author

RasmusWL commented Dec 19, 2023

I've had to fix the query to find new subclasses, since we did not account for rest_framework.response.Response being a subclass of django.http.response.HttpResponse in the spec definition. This meant that in the .yml files for any subclass of the rest_framework Response, it would be modeled as BOTH rest_framework and django responses.

However, since the rest_framework modeling of responses is a more specific subclass of django responses (see code below), we actually only want our .yml modeling to contain the row for rest_framework responses.

/** A direct instantiation of `rest_framework.response.Response`. */
private class ClassInstantiation extends PrivateDjango::DjangoImpl::DjangoHttp::Response::HttpResponse::InstanceSource,
DataFlow::CallCfgNode
{
ClassInstantiation() { this = classRef().getACall() }
override DataFlow::Node getBody() { result in [this.getArg(0), this.getArgByName("data")] }
override DataFlow::Node getMimetypeOrContentTypeArg() {
result in [this.getArg(5), this.getArgByName("content_type")]
}
override string getMimetypeDefault() { none() }
}

I also realized that we had missed a few important repos in our automatic analysis.

After fixing those two problems, I have regenerated all the modeling, and force pushed some updates to this branch rewriting history to remove the old modeling commits. I guess I am worried about adding multiple commits that each changes 80k lines, so everything has been squashed together instead. I can refer to individual SHAs or restore the actual history if a reviewer would find that helpful 😊

The git diff --shortstat for the old vs. new modeling is 419 files changed, 12226 insertions(+), 12955 deletions(-). I did some looking through the diff manually, and it seems like quite a few is the more precise modeling of django rest framework response (so it's not ALSO modeled as a pure Django Response in the .yml files). However, there is also quite a few other changes, which seems to be due to what repos actually had DBs available for MRVA analysis. This suggest it would be smart to keep track of exactly what repos were analyzed at what SHAs, so it's easier to debug such situations in the future.

Internal note: The results were achieved after 5 rounds of analysis of the list of repos (which is internal only). In the last round, only 1 additional model was added, so that's the amount of work to expect next time building all the models from scratch 😊

Copy link
Contributor

@tausbn tausbn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think this looks good to me. There's a few places where we still use classRef or cls instead of subclassRef, but fixing this inconsistency is probably best left for a follow-up PR.

@RasmusWL RasmusWL marked this pull request as ready for review January 4, 2024 15:10
@RasmusWL RasmusWL merged commit 95c2427 into github:main Jan 5, 2024
13 of 15 checks passed
@RasmusWL RasmusWL deleted the automated-subclass-models branch January 5, 2024 09:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants