Python: Automated subclass models #15044

RasmusWL · 2023-12-08T10:46:20Z

Overview

This PR adds automatically captured subclass information for the majority of interesting PyPI packages. As an example of what this PR achieves:

the PyPI package flask-restplus defines flask_restful.Resource (src), which is a subclass of flask.views.MethodView.
Our modeling of flask-based remote-flow-sources ultimately depends on being able to figure out whether a class is a subclass of flask.views.MethodView,
so knowing that a subclass of flask_restful.Resource is also a flask.views.MethodView allows us to model the remote-flow-sources properly.

We've traditionally had automatic dependency installation available on the codeql-action, but are moving to a solution without this (see https://github.blog/changelog/2023-07-12-code-scanning-with-codeql-no-longer-installs-python-dependencies-automatically-for-new-users/).

We've previously relied on analyzing installed dependencies to reach this conclusion (by following subclass relationships). This PR lets us still reach the same conclusion once we stop installing dependencies.

We achieve this by using the extensible type-models to record important subclass/aliasing information ahead of time. For example, see the very first commit (2f17d2f).
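To make the idea concrete, here is a minimal Python sketch of answering "is this class a subclass of X?" from a table recorded ahead of time, instead of from installed dependencies. The table layout and the names `RECORDED_SUPERCLASSES` and `is_subclass_of` are assumptions made for illustration only; this is not the format of the actual `.model.yml` files or the API of the QL library.

```python
from typing import Dict, Set

# Hypothetical ahead-of-time table (NOT the real .model.yml format):
# fully qualified class name -> its directly recorded superclasses.
RECORDED_SUPERCLASSES: Dict[str, Set[str]] = {
    "flask.views.MethodView": {"flask.views.View"},
    "flask_restful.Resource": {"flask.views.MethodView"},
}


def is_subclass_of(name: str, base: str,
                   table: Dict[str, Set[str]] = RECORDED_SUPERCLASSES) -> bool:
    """Return True if `name` is `base` or a (transitive) recorded subclass of it."""
    seen: Set[str] = set()
    stack = [name]
    while stack:
        current = stack.pop()
        if current == base:
            return True
        if current in seen:
            continue
        seen.add(current)
        # Follow every recorded superclass edge; unknown classes have none.
        stack.extend(table.get(current, ()))
    return False
```

With such a table, `is_subclass_of("flask_restful.Resource", "flask.views.MethodView")` holds without the package ever being installed, which is the property the recorded models are meant to provide.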

Our internal testing shows that for all but a few cases, we end up with a solution comparable to or better than what we had before, even when narrowing our focus to repos where dependency installation was successful before.

(Thanks to @tausbn for helping with the modeling ❤️)

Reviewing this PR

What a mess. Sorry. It's a mix of working on the tooling (process-mrva-results.py and SubclassFinder.qll) and enabling subclasses/aliases to be found in the actual modeling. The latter required removing the private annotations from much of our modeling. I think we'll just have to live with that.

The only commits I've found that don't follow this pattern, and that could have been made into separate PRs, are:

Notes

The tooling to generate these subclass-capture models automatically only lives internally.
To ensure this automated modeling can still be recreated once we no longer install dependencies (if, say, we wanted to use a different format), the actual modeling with MRVA was only done after I made sure we wouldn't make use of any dependencies that might have been installed (specifically by this commit).

In reality, we only want to model this as a `rest_framework.response.Response`, since our .qll modeling is more precise for rest-framework responses than if we also modeled it as a basic django http response. (specifically, that default mime-type handling is way different).

RasmusWL · 2023-12-19T16:17:05Z

I've had to fix the query to find new subclasses, since we did not account for rest_framework.response.Response being a subclass of django.http.response.HttpResponse in the spec definition. This meant that in the .yml files for any subclass of the rest_framework Response, it would be modeled as BOTH rest_framework and django responses.

However, since the rest_framework modeling of responses is a more specific subclass of django responses (see code below), we actually only want our .yml modeling to contain the row for rest_framework responses.

codeql/python/ql/lib/semmle/python/frameworks/RestFramework.qll

Lines 312 to 325 in 43fe9ca

    
    /** A direct instantiation of `rest_framework.response.Response`. */
    private class ClassInstantiation extends PrivateDjango::DjangoImpl::DjangoHttp::Response::HttpResponse::InstanceSource,
      DataFlow::CallCfgNode
    {
      ClassInstantiation() { this = classRef().getACall() }

      override DataFlow::Node getBody() { result in [this.getArg(0), this.getArgByName("data")] }

      override DataFlow::Node getMimetypeOrContentTypeArg() {
        result in [this.getArg(5), this.getArgByName("content_type")]
      }

      override string getMimetypeDefault() { none() }
    }
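The "only keep the most specific row" rule described above can be sketched in Python as follows. This is an illustration under assumptions, not the actual query logic: `SPEC_PARENTS`, `ancestors` and `most_specific_specs` are hypothetical names, and the single hierarchy entry is just the relationship mentioned in this comment.

```python
from typing import Dict, Iterable, Set

# Assumed spec hierarchy (hypothetical data structure):
# spec name -> specs it is declared to be a subclass of.
SPEC_PARENTS: Dict[str, Set[str]] = {
    "rest_framework.response.Response": {"django.http.response.HttpResponse"},
}


def ancestors(spec: str, parents: Dict[str, Set[str]] = SPEC_PARENTS) -> Set[str]:
    """All specs that `spec` is a (transitive) subclass of."""
    result: Set[str] = set()
    stack = list(parents.get(spec, ()))
    while stack:
        current = stack.pop()
        if current not in result:
            result.add(current)
            stack.extend(parents.get(current, ()))
    return result


def most_specific_specs(matched: Iterable[str],
                        parents: Dict[str, Set[str]] = SPEC_PARENTS) -> Set[str]:
    """Drop every matched spec that is an ancestor of another matched spec."""
    matched = set(matched)
    covered: Set[str] = set()
    for spec in matched:
        covered |= ancestors(spec, parents)
    return matched - covered
```

A class matching both specs would then produce a single row for `rest_framework.response.Response`, while a class matching only the Django spec keeps its Django row.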

I also realized that we had missed a few important repos in our automatic analysis.

After fixing those two problems, I have regenerated all the modeling, and force pushed some updates to this branch rewriting history to remove the old modeling commits. I guess I am worried about adding multiple commits that each changes 80k lines, so everything has been squashed together instead. I can refer to individual SHAs or restore the actual history if a reviewer would find that helpful 😊

The git diff --shortstat for the old vs. new modeling is 419 files changed, 12226 insertions(+), 12955 deletions(-). I looked through the diff manually, and quite a few of the changes come from the more precise modeling of django rest framework responses (so they are not ALSO modeled as a pure Django Response in the .yml files). However, there are also quite a few other changes, which seem to be due to which repos actually had DBs available for MRVA analysis. This suggests it would be smart to keep track of exactly which repos were analyzed at which SHAs, so it's easier to debug such situations in the future.
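As a rough sketch of what such tracking could look like, the Python snippet below compares two manifests that map a repo name to the analyzed commit SHA. No such manifest exists in this PR; the function name `diff_manifests` and the manifest shape are assumptions for illustration.

```python
from typing import Dict, List


def diff_manifests(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two hypothetical {repo: analyzed_sha} manifests.

    Returns the repos that were added, removed, or analyzed at a different SHA,
    which are the usual suspects when regenerated models differ unexpectedly.
    """
    old_repos, new_repos = set(old), set(new)
    return {
        "added": sorted(new_repos - old_repos),
        "removed": sorted(old_repos - new_repos),
        "changed": sorted(r for r in old_repos & new_repos if old[r] != new[r]),
    }
```

Storing one such manifest next to each regenerated set of models would make it possible to tell model changes caused by query changes apart from changes caused by a different set of analyzed databases.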

Internal note: The results were achieved after 5 rounds of analysis of the list of repos (which is internal only). In the last round, only 1 additional model was added, so that's the amount of work to expect next time building all the models from scratch 😊
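The round-based process can be pictured as a fixed-point iteration: each round looks for subclasses of everything known so far, and the process stops when a round adds nothing new. The Python sketch below illustrates this with a toy discovery function; `run_until_stable` and `find_new_models` are hypothetical names, and the real rounds are MRVA runs, not in-memory calls.

```python
from typing import Callable, Iterable, Set, Tuple


def run_until_stable(find_new_models: Callable[[Set[str]], Iterable[str]],
                     known: Iterable[str],
                     max_rounds: int = 10) -> Tuple[Set[str], int]:
    """Repeat discovery rounds until no new models are found.

    `find_new_models` stands in for one analysis round: given the models
    known so far, it returns the models it can find. Returns the final set
    and the number of rounds that actually added something.
    """
    known = set(known)
    productive_rounds = 0
    for _ in range(max_rounds):
        new = set(find_new_models(known)) - known
        if not new:
            break  # fixed point reached: another round would add nothing
        known |= new
        productive_rounds += 1
    return known, productive_rounds


# Toy usage: each round finds the direct subclasses of already-known classes.
toy_parent = {"B": "A", "C": "B", "D": "C"}
models, rounds = run_until_stable(
    lambda known: {cls for cls, parent in toy_parent.items() if parent in known},
    {"A"},
)
```

In the toy example a chain of depth three needs three productive rounds, which matches the intuition that the number of rounds is bounded by the depth of the deepest subclass chain across the analyzed repos.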

tausbn

Overall I think this looks good to me. There are a few places where we still use classRef or cls instead of subclassRef, but fixing this inconsistency is probably best left for a follow-up PR.

RasmusWL and others added 30 commits December 8, 2023 11:27

WIP: Flask View class modeling for restplus

2f17d2f

Based on some DBs I had that contained dependencies

WIP rest of modeling done so far

f06bbd2

Python: Improve docs/names around already modeled classes

bb3ced0

Python: Adjust test-code predicate

ba0a5b1

Python: Streamline what modules to allow for now

b66dd23

Python: Add query metadata

b1f5dea

Python: Remove query predicate annotation

451a210

Python: Add script to process results from MRVA (bqrs files)

5e98ff4

Also makes `empty.model.yml` empty once again

FIXME already fixed

1c43d11

Python: Sort MaD rows

734dcb1

(makes future diffing much easier)

Python: Make Django use auto-modeling

d6fec9e

Ooops

Python: Automodel for tornado

eb97a79

Python: Automodel for WSGIServer

ec38464

Python: Improve import * handling

77a4d81

Python: Allow any results.bqrs file

dfdb66f

Python: Improve SelfRefMixin

ba19f95

This is important to model mixins correctly, for example when they help handle incoming requests, and therefore need to know that `self.kwargs` contains data controlled by a user.

Python: Enable auto-model BaseHttpRequestHandler

af6c5cc

Python: More import fixes

1e69762

:thinkies: turns out that .getASubclass*() had to be applied everywhere...

Python: Enable auto-model for cgi.FieldStorage

bff7ae2

Python: Enable auto-model for Django Model

d622d87

Python: Add Django response models

7b1c6b0

Python: Add Flask response model

cb1efa9

Python: Add Requests response model

1d4b4ee

This required making some of the relevant bits public, but they are marked as internal anyway.

Python: Add http.client.HTTPResponse model

750f14f

Python: Improve speed of process-mrva-results.py

7d86a8d

Same trick as 'generate-code-scanning-query-list.py'

Python: Add test of find-subclass code

e7d5573

Python: Also capture alias with new name

f19b672

Python: Add starlette.websocket model

83e6e51

Python: Add clickhouse_driver model

f5bed2d

Python: Add aiohttp.ClientSession model

947aa09

yoff and others added 23 commits December 19, 2023 17:07

Python: remove control flow nodes

c563c7f

for module entry definitions from the dataflow graph.

Python: adjust test expectations

75f9eeb

Mostly removal of nodes from the graph. One result lost:
```
check("submodule.submodule_attr", submodule.submodule_attr, "submodule_attr", globals()) #$ MISSING:prints=submodule_attr
```

Python: Recover subclass finder .expected after cherry picking commit…

0fe29b6

…s from github#15030

Apply suggestions from code review

937af90

Co-authored-by: Taus <tausbn@github.com>

Python: treat auto subclass capture models as auto-generated

2f5d51c

Co-authored-by: Taus <tausbn@github.com>

Python: Update a few QLdocs

13c2378

Python: Script: Improve performance by using C++ impl

f30a3b0

These changes took the time for loading and writing all files locally from 29.60s down to 3.17s (that is, using `gather_from_existing`).

Python: Add ability to split and join autogenerated yml files

3e6423a

Verified by joining all files, splitting again, and observing no diff in git. (these operations only take a few seconds on my local machine, so shouldn't be too much of an issue)

Python: Make rest_framework tests runnable again

933938d

Python: Highlight split/join subclass files usage

cfd3f89

Python: Make split/join executable (chmod +x)

ee3319b

Python: Add the rest_framework models for demonstration purposes

5c89c38

Although it might be hidden by the GitHub UI by default, it could be interesting for a reviewer to notice the effect that changes in the modeling query have on the results in this file.

Python: Model django response subclass relationship

3e878f5

Python: Regenerate rest_framework models

24a3a23

Python: Ignore known subclass models

a78f13c

Python: Fill getFullyQualifiedName for rest of subclassing specs

32251a0

Python: refactor how subclasses are specified

bf271d7

A little more explicit, so less prone to be overlooked when adding a new spec

Python: Delete old auto subclass capture files

de2a563

In the final git history this only deletes one file, but when working locally I deleted ALL files.

NEVER MERGE: Ensure we don't use site-packages stuff

ca7b69e

Python: auto subclass capture

9863309

(locally done with split + 5 x modeling runs + join, but squashed into one commit)

Revert "NEVER MERGE: Ensure we don't use site-packages stuff"

56d86f9

This reverts commit 0ed363bd79f9d3f9e9a905c1192adfe88f1faffb.

Merge branch 'main' into automated-subclass-models

72687e0

RasmusWL dismissed yoff’s stale review via 72687e0 December 19, 2023 16:16

RasmusWL force-pushed the automated-subclass-models branch from 3220c9e to 72687e0 Compare December 19, 2023 16:16

tausbn approved these changes Jan 4, 2024

RasmusWL marked this pull request as ready for review January 4, 2024 15:10

RasmusWL merged commit 95c2427 into github:main Jan 5, 2024
13 of 15 checks passed

RasmusWL deleted the automated-subclass-models branch January 5, 2024 09:43