32 cpp string concatenation library #14954
class StringConcatenation extends Call {
StringConcatenation() {
// sprintf-like functions, i.e., concat through formating
exists(FormattingFunctionCall fc | this = fc)

Check warning (Code scanning / CodeQL): Expression can be replaced with a cast (fc)

this.(FormattingFunctionCall)
.getTarget()
.(FormattingFunction)

Check warning (Code scanning / CodeQL): Redundant cast (FormattingFunction)

class StringConcatenation extends Call {
StringConcatenation() {
// sprintf-like functions, i.e., concat through formating

Suggested change:
- // sprintf-like functions, i.e., concat through formating
+ // sprintf-like functions, i.e., concat through formatting

}

/**
* Gets the operands of this concatenation (one of the string operands being

Suggested change:
- * Gets the operands of this concatenation (one of the string operands being
+ * Gets an operand of this concatenation (one of the string operands being
(Both are reasonable explanations of what this predicate does, but we've standardized on "Gets a" / "Gets an" when there are multiple return values.)

// The result is an argument of 'this' (a call)
result = this.getAnArgument() and
// addresses odd behavior with overloaded operators
// i.e., "call to operator+" appearing as an operand

Is this where you have more than one concatenation, e.g. `"a" + "b" + "c"`? I think in that case we would expect one of the concatenations to have the other (a "call to operator+") as an argument.

Yeah, I'm not sure what the "odd behavior" is here. If you have a call such as:
std::string s1, s2, s3;
string s = s1 + s2 + s3;
then this will be represented as:
string s = (s1.operator+(s2)).operator+(s3);
and hence the "call to operator+" is simply the qualifier. Should that call to `operator+` not be an operand according to this library?
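
For reference, the nesting described in the comment above can be reproduced in plain C++. This is only a minimal sketch and says nothing about how the extractor models the call. It assumes the standard library's `operator+` for `std::string`, which is a non-member function, so the explicit spelling uses `std::operator+(...)` rather than the member-style notation in the comment. The function names are hypothetical.

```cpp
#include <cassert>
#include <string>

// Infix spelling: `+` is left-associative, so this parses as (s1 + s2) + s3.
std::string concat_infix(const std::string& s1, const std::string& s2,
                         const std::string& s3) {
    return s1 + s2 + s3;
}

// Explicit spelling of the same expression: the inner call to operator+
// is itself an operand of the outer call.
std::string concat_explicit(const std::string& s1, const std::string& s2,
                            const std::string& s3) {
    return std::operator+(std::operator+(s1, s2), s3);
}

// Sanity check that both spellings build the same string.
void check_concat_spellings() {
    assert(concat_infix("a", "b", "c") == concat_explicit("a", "b", "c"));
}
```

Either way the outer concatenation has an inner "call to operator+" as one of its inputs, which is the case the question above is about.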
this.(FormattingFunctionCall)
.getTarget()
.(FormattingFunction)
.getFirstFormatArgumentIndex()

We would ideally exclude strings passed into a formatting function call where the format specifier isn't a variation of `%s`. For example, I think you can output a `char *` string with `%p` and it does not concatenate the contents of the string.
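
The difference between the two specifiers can be seen in a small C++ sketch. It assumes nothing beyond `std::snprintf`; the helper names are hypothetical, and the exact text produced by `%p` is implementation-defined, so the check only asserts that the string's characters do not appear in it.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Formats `s` with %s: the characters of the string end up in the output.
std::string format_as_string(const char* s) {
    char buf[128];
    std::snprintf(buf, sizeof buf, "[%s]", s);
    return std::string(buf);
}

// Formats `s` with %p: only the pointer value is printed (in an
// implementation-defined form); the characters it points to are not copied.
std::string format_as_pointer(const char* s) {
    char buf[128];
    std::snprintf(buf, sizeof buf, "[%p]", static_cast<const void*>(s));
    return std::string(buf);
}

// Checks that %s embeds the contents while %p does not.
void check_format_specifiers() {
    const char* word = "hello";
    assert(format_as_string(word) == "[hello]");
    assert(format_as_pointer(word).find("hello") == std::string::npos);
}
```

Only the `%s` call behaves like a concatenation of the argument's contents, which is why filtering on the specifier would make the library more precise.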
[result.asExpr(), result.asIndirectExpr()] =
this.(FormattingFunctionCall).getOutputArgument(_)
else [result.asExpr(), result.asIndirectExpr()] = this.(Call)
}

I think this predicate could be written as a sequence of expressions joined by `or`, rather than if-then-else. It might be easier to read, and it also avoids some potential performance problems that can occur with nested if-then-else.

Yeah, I think there's a misunderstanding of the semantics of QL here. Every time you write something like:
if x instanceof Foo then
result = x.(Foo).bar()
else if x instanceof Baz then
result = x.(Baz).qux()
else ...
you can simplify this to be
result = x.(Foo).bar()
or
result = x.(Baz).qux()
or
...
since `x.(Foo)` won't have any result when `x` isn't a `Foo`. And so `result = x.(Foo).bar()` will contribute 0 tuples to the final predicate when `x` isn't a `Foo`.

Just to be clear: there are no inherent performance problems with many nested if-then-elses in QL. What @geoffw0 is referring to is that QL desugars the formula `if x then y else z` to the formula:
x and y
or
not x and z
and if you're not careful, the `not x` part can result in quite a lot of tuples. For instance, if you have two nested if-then-elses such as:
if x then (if y then z else t) else w
then this desugars to:
x and (if y then z else t)
or
not x and w
which desugars to
x and
(
y and z
or
not y and t
)
or
not x and w
and you need to be sure that none of these terms generate a large number of tuples.
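
The shape of this desugaring can be checked exhaustively with plain booleans. The C++ sketch below is only an illustration of the propositional identity: QL formulas denote sets of tuples rather than single truth values, so it says nothing about tuple counts or performance. All function names are hypothetical.

```cpp
#include <cassert>

// `if x then y else z`, read as a plain boolean conditional.
bool if_then_else(bool x, bool y, bool z) { return x ? y : z; }

// The desugared form: (x and y) or (not x and z).
bool desugared(bool x, bool y, bool z) { return (x && y) || (!x && z); }

// `if x then (if y then z else t) else w`.
bool nested(bool x, bool y, bool z, bool t, bool w) {
    return x ? (y ? z : t) : w;
}

// Fully desugared nested form, term by term as in the comment above.
bool nested_desugared(bool x, bool y, bool z, bool t, bool w) {
    return (x && ((y && z) || (!y && t))) || (!x && w);
}

// Returns true when each conditional agrees with its desugared form
// on every combination of inputs.
bool desugaring_agrees_everywhere() {
    for (int bits = 0; bits < 32; ++bits) {
        const bool x = (bits & 1) != 0;
        const bool y = (bits & 2) != 0;
        const bool z = (bits & 4) != 0;
        const bool t = (bits & 8) != 0;
        const bool w = (bits & 16) != 0;
        if (if_then_else(x, y, z) != desugared(x, y, z)) return false;
        if (nested(x, y, z, t, w) != nested_desugared(x, y, z, t, w)) return false;
    }
    return true;
}

// Convenience check for use in a test or a debug build.
void check_desugaring() { assert(desugaring_agrees_everywhere()); }
```

The negated terms (`!x`, `!y`) are the ones that correspond to the `not x` and `not y` parts mentioned above, which is where an unbounded negation could become expensive in QL.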
Thanks for explaining. To be honest I couldn't really remember why we avoid excessive if-then-else, only that it's usually preferable when you have a choice. 👍
First round of comments. This kinda reminds me of https://github.com/github/codeql/blob/main/go/ql/lib/semmle/go/StringOps.qll#L462. Do you think it's worth taking inspiration from that Go library?
import cpp
import semmle.code.cpp.models.implementations.Strcat
import semmle.code.cpp.models.interfaces.FormattingFunction
import semmle.code.cpp.dataflow.new.DataFlow

I don't think we should have people rely on dataflow implicitly being imported when they import this library. Ideally, the other imports should also be private, but if you're using a library for string concatenation I think it's fair that you also implicitly import those other libraries.

Suggested change:
- import semmle.code.cpp.dataflow.new.DataFlow
+ private import semmle.code.cpp.dataflow.new.DataFlow

class StringConcatenation extends Call {
StringConcatenation() {
// sprintf-like functions, i.e., concat through formating
exists(FormattingFunctionCall fc | this = fc)

As Code Scanning is suggesting.

Suggested change:
- exists(FormattingFunctionCall fc | this = fc)
+ this instanceof FormattingFunctionCall

(
result.getUnderlyingType().stripType().getName() = "char"
or
result.getUnderlyingType().getName() = "string"

What `string` type is this supposed to represent? Since `getUnderlyingType` strips away typedefs, this can't be `std::string` (which is typedef'd as a version of a `basic_string` that is also handled in the next disjunct).

this.getArgument(this.getTarget().(StrcatFunction).getParamDest())
or
// Hardcoding it is also the return
[result.asExpr(), result.asIndirectExpr()] = this.(Call)

Why do you need both `asExpr()` and `asIndirectExpr()` here? If you want the pointer to the string to be the result, then it should be `asExpr()`, and if you want the actual char data to be the result then it should be `asIndirectExpr()`.
I would imagine that you want to limit this to simply be `asExpr()`?
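
The pointer-versus-data distinction in the comment above has a direct counterpart at the C level. The sketch below is a minimal illustration under stated assumptions, not part of the library under review: it uses only `std::strcat`, a fixed 64-byte buffer chosen for the example, and a hypothetical `observe_strcat` helper. It shows that the call returns the destination pointer while the characters behind that pointer change.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// What a single strcat call did, as seen by the caller.
struct StrcatObservation {
    bool returned_dest;   // did strcat return the destination pointer?
    std::string before;   // destination contents before the call
    std::string after;    // destination contents after the call
};

// Copies `initial` into a local buffer, appends `suffix` with strcat, and
// records the result. Inputs that would not fit are rejected up front.
StrcatObservation observe_strcat(const char* initial, const char* suffix) {
    char dest[64];
    StrcatObservation obs{false, "", ""};
    if (std::strlen(initial) + std::strlen(suffix) >= sizeof dest) {
        return obs;  // would overflow the buffer; nothing observed
    }
    std::strcpy(dest, initial);
    obs.before = dest;
    char* ret = std::strcat(dest, suffix);
    obs.returned_dest = (ret == dest);  // same pointer comes back
    obs.after = dest;                   // but the pointed-to data changed
    return obs;
}

// Checks the behaviour for one small input pair.
void check_strcat_observation() {
    const StrcatObservation obs = observe_strcat("foo", "bar");
    assert(obs.returned_dest);
    assert(obs.before == "foo" && obs.after == "foobar");
}
```

The returned pointer and the concatenated characters are two different things to track, which is why the comment asks for one of `asExpr()` or `asIndirectExpr()` rather than both.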

[result.asExpr(), result.asIndirectExpr()] =
this.getArgument(this.getTarget().(StrlcatFunction).getParamDest())

Should this not be the output argument of calling `strlcat`? That is, I would expect this to be:
result.asDefiningArgument() = this.getArgument(this.getTarget().(StrlcatFunction).getParamDest())
similarly to how you did the `StrcatFunction` case?

[result.asExpr(), result.asIndirectExpr()] =
this.(FormattingFunctionCall).getOutputArgument(_)

Same here: I think this should be the node representing the output argument. That is, this should probably be:
result.asDefiningArgument() = this.(FormattingFunctionCall).getOutputArgument(_)
right?

Adds a new general-purpose StringConcatenation library that locates string concatenation operations and exposes their operands and the resulting dataflow node.