Skip to content

Tree sitter update #8909

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
May 11, 2022
Merged

Tree sitter update #8909

merged 10 commits into from
May 11, 2022

Conversation

aibaars
Copy link
Contributor

@aibaars aibaars commented Apr 27, 2022

This pull request updates the tree-sitter-ruby grammar used by our extractor. The pull request implements up- and downgrade scripts that are semantics preserving except in case of complex literals (1i, 3.14i, 3ri,etc). In the new grammar complex literals are no longer a simple token but a node with a float, integer, or rational literal as child. As a result an upgrade script would have to create new nodes, which is not possible. Instead the upgrade script simply converts complex literals to float literals. This is wrong, but preserves the structure of the AST so there are no gaps in the control and data flow graphs.

@github-actions github-actions bot added the Ruby label Apr 27, 2022
Copy link

@github-advanced-security github-advanced-security bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 20 vulnerabilities.

@aibaars aibaars force-pushed the tree-sitter-update branch 2 times, most recently from bb52ca1 to 99b77c6 Compare April 27, 2022 19:37
@aibaars aibaars force-pushed the tree-sitter-update branch 2 times, most recently from 925cf49 to ccc1864 Compare April 29, 2022 14:04
@aibaars aibaars marked this pull request as ready for review April 29, 2022 14:06
@aibaars aibaars requested a review from a team as a code owner April 29, 2022 14:06
Co-authored-by: intrigus-lgtm <60750685+intrigus-lgtm@users.noreply.github.com>
@@ -11,7 +11,7 @@ flate2 = "1.0"
node-types = { path = "../node-types" }
tree-sitter = "0.19"
tree-sitter-embedded-template = "0.19"
tree-sitter-ruby = { git = "https://github.com/tree-sitter/tree-sitter-ruby.git", rev = "1ebfdb288842dae5a9233e2509a135949023dd82" }
tree-sitter-ruby = { git = "https://github.com/tree-sitter/tree-sitter-ruby.git", rev = "1a3936a3545c0bd9344a0bf983fafc7e17443e39" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, we could just remove the rev field here and rely on Cargo.lock to pin the exact revision. Assuming we want to stay relatively up-to-date with tree-sitter-ruby changes, this would help us do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would work, although I'd like to be explicit. Things won't actually work if you refreshed the lock file because a minor grammar change might require dbscheme, library and upgrade/downgrade scripts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah but then tests would fail right? Either approach works, but I do worry that we will forget to keep this up-to-date over time unless there's some automation around it.

@@ -252,10 +252,24 @@ class Lambda extends Callable, BodyStmt, TLambda {

/** A block. */
class Block extends Callable, StmtSequence, Scope, TBlock {
/**
* Get a local variable declared by this block.
* For example `local` in `{ | param; local| puts param }`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had no idea this was a feature. Is it new?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a very old feature, just rarely used.

@@ -248,9 +252,6 @@ private module Cached {
casePattern(g)
)
} or
TScopeResolutionMethodCall(Ruby::ScopeResolution g, Ruby::Identifier i) {
isScopeResolutionMethodCall(g, i)
} or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we removing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parser now correctly distinguishes between calls using :: and scope resolutions for constants, so this code is now dead. expr::identifier is a call with :: as the call operator, and expr::Constant is a scope resolution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah so expr::identifer is now considered equivalent to expr.identifier?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes indeed. Both are a Call but their getOperator() will return a different token.

or
toGenerated(result) = g.getMethod().(Ruby::ArgumentList).getChild(n)
}
final override Expr getArgumentImpl(int n) { toGenerated(result) = g.getArguments().getChild(n) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a nice cleanup - weird that Ruby::ArgumentList was previously considered a "method"...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that made no sense at all, and had been on my list of things to fix for a long time ;-)

# 6| getStmt: [BooleanLiteral] true
# 7| getStmt: [BooleanLiteral] TRUE
# 7| getStmt: [ConstantReadAccess] TRUE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 👍

I didn't know we were incorrectly interpreting these as booleans before. Good to see it fixed.

@hmac
Copy link
Contributor

hmac commented May 10, 2022

On the upgrade/downgrade scripts: I have a sense of what they're doing but don't know how to ensure that they work properly. Do we have any tests that ensure they give the expected result? We have a "Check DB upgrade/downgrade script" test but I don't know if it does much more than just run them and make sure nothing throws an error.

@aibaars
Copy link
Contributor Author

aibaars commented May 10, 2022

On the upgrade/downgrade scripts: I have a sense of what they're doing but don't know how to ensure that they work properly. Do we have any tests that ensure they give the expected result? We have a "Check DB upgrade/downgrade script" test but I don't know if it does much more than just run them and make sure nothing throws an error.

The Check DB upgrade/downgrade script doesn't check that much. I think it only verifies that the upgrades form a chain from the initial dbscheme to the current and the upgrade/downgrade script can be compiled.

I validated the behaviour of the upgrade scripts by building the extractor-pack from the main branch and run the QL tests. This causes QL test to build databases with the old extractor but run the test cases with the new library and dbscheme. In this case CodeQL runs the upgrade scripts because the databases are older than the new dbscheme.

These are the steps I used:

git checkout upstream/main .
./scripts/create-extractor-pack.sh
git checkout HEAD -f
codeql test run --learn  --search-path extractor-pack/ --search-path ql/lib ql/test
git difftool ql/test

For testing the downgrade scripts you can do the opposite, compile a new extractor, but run the QL tests with the old library.

./scripts/create-extractor-pack.sh
git checkout upstream/main ql/lib
codeql test run --learn  --search-path extractor-pack/ ql/test
git difftool ql/test

Note that there are some differences in the test output, but these should be benign or otherwise expected.

Comment on lines +2 to +5
RubyScopeResolution() {
exists(RubyConstant name | ruby_scope_resolution_def(this, name)) and
not ruby_call_def(_, this)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting that we'd simply drop rows where the second column is not a @ruby_token_constant, so this charpred seems like it could be deleted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The not ruby_call_def(_, this) is still important. It says that things like expr::Foo() are not scope resolutions. Even though Foo looks like a constant name, it is not.

Comment on lines +30 to +44
ruby_ast_node_info(node, _, _, loc) and
parent instanceof ScopeResolutionMethodCall and
node =
rank[index + 1](RubyAstNode child, int x, int oldIndex |
exists(RubyAstNodeParent oldParent |
ruby_ast_node_info(child, oldParent, oldIndex, _) and
child != parent.(ScopeResolutionMethodCall).getScopeResolution()
|
oldParent = parent and x = 1
or
oldParent = parent.(ScopeResolutionMethodCall).getScopeResolution() and x = 0
)
|
child order by x, oldIndex
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what this is doing, and why?

Copy link
Contributor Author

@aibaars aibaars May 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A call like expr::Foo(args) block used to be represented as Call{ method: ScopeResolution(expr, Foo), arguments: args, block: block}. The children of the Call node were:

  1. ScopeResolution(expr, Foo)
  2. args
  3. block

The new parse tree is like Call{ receiver: expr, method: Foo, arguments: args } . For this Call the children should be:

  1. expr
  2. Foo
  3. args
  4. block

To fix the parent child relation for the Call we simply take the children of the ScopeResolution first (x=0) ordered by oldIndex followed by the other children (x=1) of the original Call in their oldIndex order. The rank is used to ensure the new index is consecutive even if some of the children are missing.

@aibaars aibaars force-pushed the tree-sitter-update branch from 042bd24 to ba5b715 Compare May 10, 2022 12:34
Co-authored-by: Nick Rolfe <nickrolfe@github.com>
@aibaars aibaars force-pushed the tree-sitter-update branch from ba5b715 to 907c3db Compare May 11, 2022 07:59
Copy link
Contributor

@nickrolfe nickrolfe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

@aibaars aibaars merged commit a47e429 into github:main May 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants