32 cpp string concatenation library #14954

Open
wants to merge 2 commits into
base: main
from

Conversation

bdrodes
Contributor

@bdrodes bdrodes commented Nov 29, 2023

Adding a new general-purpose StringConcatenation library that lets us locate string concatenation operations, get their operands, and get the dataflow node for the result.

@bdrodes bdrodes requested a review from a team as a code owner November 29, 2023 21:05
@github-actions github-actions bot added the C++ label Nov 29, 2023
class StringConcatenation extends Call {
StringConcatenation() {
// sprintf-like functions, i.e., concat through formating
exists(FormattingFunctionCall fc | this = fc)

Check warning

Code scanning / CodeQL

Expression can be replaced with a cast Warning

The assignment to fc in the exists(..) can be replaced with an instanceof expression.
Comment on lines +66 to +68
this.(FormattingFunctionCall)
.getTarget()
.(FormattingFunction)

Check warning

Code scanning / CodeQL

Redundant cast Warning

Redundant cast to FormattingFunction.

class StringConcatenation extends Call {
StringConcatenation() {
// sprintf-like functions, i.e., concat through formating
Contributor

Suggested change
// sprintf-like functions, i.e., concat through formating
// sprintf-like functions, i.e., concat through formatting

}

/**
* Gets the operands of this concatenation (one of the string operands being
Contributor

Suggested change
* Gets the operands of this concatenation (one of the string operands being
* Gets an operand of this concatenation (one of the string operands being

(both are reasonable explanations of what this predicate does, but we've standardized on "Gets a" / "Gets an" when there are multiple return values)

// The result is an argument of 'this' (a call)
result = this.getAnArgument() and
// addresses odd behavior with overloaded operators
// i.e., "call to operator+" appearing as an operand
Contributor

Is this where you have more than one concatenation, e.g. "a" + "b" + "c"? I think in that case we would expect one of the concatenations to have the other (a "call to operator+") as an argument.

Contributor

Yeah, I'm not sure what the "odd behavior" is here. If you have a call such as:

std::string s1, s2, s3;
std::string s = s1 + s2 + s3;

then this will be represented as:

std::string s = (s1.operator+(s2)).operator+(s3);

and hence the "call to operator+" is simply the qualifier. Should that call to operator+ not be an operand according to this library?
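To make the nesting concrete, here is a minimal C++ sketch, assuming plain std::string operands; the function names viaInfix and viaExplicitCalls are hypothetical. One detail that may matter for the library: in the standard library, operator+ for std::string is a non-member function, so in this particular case the inner call shows up as the first argument of the outer call rather than as a qualifier. For a class that declares operator+ as a member, the qualifier form written above would apply.

```cpp
#include <string>

// Infix form: the expression groups left to right as (s1 + s2) + s3.
std::string viaInfix(const std::string& s1, const std::string& s2,
                     const std::string& s3) {
  return s1 + s2 + s3;
}

// Explicit form: the inner operator+ call is an operand of the outer call.
// std::operator+ for std::string is a non-member function template, so the
// inner call is passed as the first argument.
std::string viaExplicitCalls(const std::string& s1, const std::string& s2,
                             const std::string& s3) {
  return std::operator+(std::operator+(s1, s2), s3);
}
```

Either way, the outer concatenation has another "call to operator+" as one of its operands, which is the situation both comments above describe.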

this.(FormattingFunctionCall)
.getTarget()
.(FormattingFunction)
.getFirstFormatArgumentIndex()
Contributor

We would ideally exclude strings passed into a formatting function call where the format specifier isn't a variation of %s. For example, I think you can output a char * string with %p, and it does not concatenate the contents of the string.
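A small C++ sketch of that difference, assuming a fixed 64-byte buffer and the hypothetical helper names formatAsString and formatAsPointer. The exact text produced by %p is implementation-defined, so the sketch only relies on it being an address representation rather than the string contents.

```cpp
#include <cstdio>
#include <string>

// %s copies the characters of the argument into the output buffer.
std::string formatAsString(const char* s) {
  char buf[64];
  std::snprintf(buf, sizeof buf, "name=%s", s);
  return buf;
}

// %p prints an implementation-defined representation of the pointer value;
// the characters the pointer refers to are not read or copied.
std::string formatAsPointer(const char* s) {
  char buf[64];
  std::snprintf(buf, sizeof buf, "name=%p", static_cast<const void*>(s));
  return buf;
}
```

Only the first call concatenates the string contents into the output, which is why a %p argument arguably should not count as an operand.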

[result.asExpr(), result.asIndirectExpr()] =
this.(FormattingFunctionCall).getOutputArgument(_)
else [result.asExpr(), result.asIndirectExpr()] = this.(Call)
}
Contributor

I think this predicate could be written as a sequence of expressions joined by or, rather than if-then-else. It might be easier to read, and it also avoids some potential performance problems that can occur with nested if-then-else.

Contributor

Yeah, I think there's a misunderstanding of the semantics of QL here. Every time you write something like:

if x instanceof Foo then
  result = x.(Foo).bar()
else if x instanceof Baz then
  result = x.(Baz).qux()
else ...

you can simplify this to be

result = x.(Foo).bar()
or
result = x.(Baz).qux()
or
...

since x.(Foo) won't have any result when x isn't a Foo. And so result = x.(Foo).bar() will contribute 0 tuples to the final predicate when x isn't a Foo.
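As a rough analogy only (this is C++, not QL, and it does not model the QL evaluator), the same idea can be sketched with dynamic_cast, where a failed cast contributes no results. The types Node, Foo, and Baz and both function names are hypothetical stand-ins. Note that the two forms agree here because Foo and Baz are disjoint; if a value could be both, the if/else chain would keep only the first branch while the disjunction would keep both.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the QL classes in the comment above.
struct Node { virtual ~Node() = default; };
struct Foo : Node { std::string bar() const { return "bar"; } };
struct Baz : Node { std::string qux() const { return "qux"; } };

// Explicit type tests followed by casts, as in the if-then-else version.
std::vector<std::string> viaIfElse(const Node& x) {
  std::vector<std::string> results;
  if (const Foo* foo = dynamic_cast<const Foo*>(&x)) {
    results.push_back(foo->bar());
  } else if (const Baz* baz = dynamic_cast<const Baz*>(&x)) {
    results.push_back(baz->qux());
  }
  return results;
}

// Independent branches, as in the "or" version: a failed cast simply adds
// nothing to the result set.
std::vector<std::string> viaDisjunction(const Node& x) {
  std::vector<std::string> results;
  if (const Foo* foo = dynamic_cast<const Foo*>(&x)) results.push_back(foo->bar());
  if (const Baz* baz = dynamic_cast<const Baz*>(&x)) results.push_back(baz->qux());
  return results;
}
```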

Contributor

Just to be clear: there are no inherent performance problems with many nested if-then-elses in QL. What @geoffw0 is referring to is that QL desugars the formula if x then y else z to the formula:

x and y
or
not x and z

and, if you're not careful, the not x part can produce quite a lot of tuples. For instance, if you have two nested if-then-elses such as:

if x then (if y then z else t) else w

then this desugars to:

x and (if y then z else t)
or
not x and w

which desugars to

x and
(
  y and z
  or
  not y and t
)
or
not x and w

and you need to be sure that none of these terms generate a large number of tuples.
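The desugaring above can be sanity-checked for a single fixed binding of the variables, where each formula is just true or false. The following C++ sketch (function names are hypothetical) compares the nested conditional with the fully desugared form over all 32 boolean assignments. It only checks logical equivalence; it says nothing about how many tuples each term produces when evaluated as a relation, which is the actual performance concern.

```cpp
// Nested conditional: if x then (if y then z else t) else w.
constexpr bool nestedIfThenElse(bool x, bool y, bool z, bool t, bool w) {
  return x ? (y ? z : t) : w;
}

// Fully desugared form, using only and / or / not.
constexpr bool desugared(bool x, bool y, bool z, bool t, bool w) {
  return (x && ((y && z) || (!y && t))) || (!x && w);
}

// Enumerates every assignment of the five variables and compares both forms.
bool agreeOnAllInputs() {
  for (unsigned bits = 0; bits < 32; ++bits) {
    const bool x = (bits & 1u) != 0;
    const bool y = (bits & 2u) != 0;
    const bool z = (bits & 4u) != 0;
    const bool t = (bits & 8u) != 0;
    const bool w = (bits & 16u) != 0;
    if (nestedIfThenElse(x, y, z, t, w) != desugared(x, y, z, t, w)) {
      return false;
    }
  }
  return true;
}
```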

Contributor

Thanks for explaining. To be honest I couldn't really remember why we avoid excessive if-then-else, only that it's usually preferable when you have a choice. 👍

Contributor

@MathiasVP MathiasVP left a comment

First round of comments. This kinda reminds me of https://github.com/github/codeql/blob/main/go/ql/lib/semmle/go/StringOps.qll#L462. Do you think it's worth taking inspiration from that Go library?

import cpp
import semmle.code.cpp.models.implementations.Strcat
import semmle.code.cpp.models.interfaces.FormattingFunction
import semmle.code.cpp.dataflow.new.DataFlow
Contributor

I don't think we should have people rely on dataflow implicitly being imported when they import this library. Ideally, the other imports should also be private, but if you're using a library for string concatenation I think it's fair that you also implicitly import those other libraries.

Suggested change
import semmle.code.cpp.dataflow.new.DataFlow
private import semmle.code.cpp.dataflow.new.DataFlow

class StringConcatenation extends Call {
StringConcatenation() {
// sprintf-like functions, i.e., concat through formating
exists(FormattingFunctionCall fc | this = fc)
Contributor

As Code Scanning is suggesting.

Suggested change
exists(FormattingFunctionCall fc | this = fc)
this instanceof FormattingFunctionCall

(
result.getUnderlyingType().stripType().getName() = "char"
or
result.getUnderlyingType().getName() = "string"
Contributor

What string type is this supposed to represent? Since getUnderlyingType strips away typedefs this can't be std::string (which is typedef'd as a version of a basic_string that is also handled in the next disjunct).
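On the C++ side, the typedef relationship can be checked directly. This is a minimal sketch; the alias MyString and the constant names are hypothetical, and it only shows that the names denote the same type, not how CodeQL reports type names.

```cpp
#include <string>
#include <type_traits>

// std::string is an alias for std::basic_string<char>, so once typedefs are
// resolved there is no separate type named "string".
constexpr bool stringIsBasicString =
    std::is_same<std::string, std::basic_string<char>>::value;

// A user-level typedef resolves to the same underlying type as well.
typedef std::string MyString;
constexpr bool aliasIsBasicString =
    std::is_same<MyString, std::basic_string<char>>::value;

static_assert(stringIsBasicString, "std::string is basic_string<char>");
static_assert(aliasIsBasicString, "typedefs of std::string resolve to it");
```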

this.getArgument(this.getTarget().(StrcatFunction).getParamDest())
or
// Hardcoding it is also the return
[result.asExpr(), result.asIndirectExpr()] = this.(Call)
Contributor

Why do you need both asExpr() and asIndirectExpr() here? If you want the pointer to the string to be the result, then it should be asExpr(), and if you want the actual char data to be the result then it should be asIndirectExpr().

I would imagine that you want to limit this to simply asExpr()?
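The pointer-versus-data distinction is visible in plain C++ too. In this sketch (the struct StrcatObservation and the function observeStrcat are hypothetical names), strcat returns the destination pointer, whose value is the same before and after the call, while the characters it points to are what actually change. How that maps onto asExpr() and asIndirectExpr() follows the description in the comment above; the sketch does not model the CodeQL API.

```cpp
#include <cstring>
#include <string>

// What a caller can observe about one strcat call.
struct StrcatObservation {
  bool returnsDest;      // the returned pointer equals the destination pointer
  std::string contents;  // the character data the pointer refers to afterwards
};

StrcatObservation observeStrcat() {
  char buf[16] = "foo";                    // room for "foobar" plus the terminator
  char* ret = std::strcat(buf, "bar");     // appends in place, returns buf
  return StrcatObservation{ret == buf, std::string(buf)};
}
```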

Comment on lines +89 to +90
[result.asExpr(), result.asIndirectExpr()] =
this.getArgument(this.getTarget().(StrlcatFunction).getParamDest())
Contributor

Should this not be the output argument of calling strlcat? That is, I would expect this to be:

result.asDefiningArgument() = this.getArgument(this.getTarget().(StrlcatFunction).getParamDest())

similarly to how you did the StrcatFunction case?

Comment on lines +94 to +95
[result.asExpr(), result.asIndirectExpr()] =
this.(FormattingFunctionCall).getOutputArgument(_)
Contributor

Same here: I think this should be the node representing the output argument. That is, this should probably be:

result.asDefiningArgument() = this.(FormattingFunctionCall).getOutputArgument(_)

right?

3 participants