Missing Source Files from Database #39

creative16 · 2020-06-26T09:28:19Z

When trying to build a database for Chromium on Windows, I found that a significant number of source files were missing from the database. One specific example would be src/services/network/p2p/socket_manager.cc.

I'm not sure if its relevant, but during the build process, the extractor crashed many times, in what appears to be an access violation.

I'm wondering if this is something the developers would be willing to look into?

nickrolfe · 2020-06-26T10:59:36Z

Hi @creative16 - I work on our C/C++ support, and I am absolutely interested in finding and fixing extractor crashes. I'll build a Chromium/Windows database to try and reproduce them and track them down.

Thanks!

nickrolfe · 2020-07-01T12:18:39Z

A quick update: we've made a change to address this, so the next version (2.2.5) should fix the problem for you. I expect that release will happen in approximately 2 weeks' time.

Thanks again for letting us know.

creative16 · 2020-07-02T01:56:07Z

Thanks for looking into it promptly! I'll look forward to the release.

sad-dev · 2020-08-17T11:48:13Z

The chromium on linux build also misses many files, including the aforementioned src/services/network/p2p/socket_manager.cc. It might be a regression, as building with 2.1.0 actually results in a larger database with more files included.

sad-dev · 2020-08-23T12:17:04Z

I noticed 2.25 came out and rebuilt chromium on Linux. Compared to my builds from 2.10-2.14 with 73k files, the 2.24 and 2.25 dbs return about 70k files. Running comm between them flagged up some very important files. I'll hence be moving back to 2.14 for now.

I have attached the filelists (from File f select f.getRelativePath())) here @nickrolfe. Please let me know if there is anything else in the log files (4 GB, sorry) to look out for in helping to troubleshoot
old_files.txt
new_files.txt

sad-dev · 2020-08-23T18:18:11Z

One thing that occurred to me was that my old, bigger databases were created without setting --threads. Codeql could be causing extra crashes in the extractor due to this; I'll rebuild on 2.14 and 2.25 with/without the --threads and see how it goes. Really hoping that this isn't the root cause though, especially if it leads to larger compile times overall.

sad-dev · 2020-08-24T09:00:32Z

Have 4 builds for 2.14 and 2.25 with/without --threads, all of which still have around 70k files. The only two variables left that I can think of which vary as compared to my bigger 73k builds are different versions of github/codeql in my codeql-home, and varying power/clock speed configurations of my CPU (the bigger builds were done on CPUs with less cores and lower clock speeds) . Does the version of codeql affect the cli/extractor?

I did notice that identical (same codeql-cli, same github/codeql, same codebase) builds can result in slightly different database contents, but am at my wits end right now. It is looking like instability is the likely cause, but havn't quite pinned down exactly why.

nickrolfe · 2020-08-24T12:39:45Z

Hi @sad-dev - sorry to hear that the 2.2.5 release did not fix things for you, and thanks for letting us know. I will build a Chromium snapshot locally. Hopefully I can reproduce your problem and I will track down the cause of the missing files.

Usually it's because of either a segfault or catastrophic error in the extractor. In general, build-system parallelism shouldn't affect this, unless the crashes are related to out-of-memory errors. But the more likely cause is that the extractor is choking on some construct in a header file included by all the files that didn't get extracted. I'll let you know what I find.

Does the version of codeql affect the cli/extractor?

Yes, each new release of the CodeQL CLI includes the latest extractor changes.

sad-dev · 2020-08-24T12:49:03Z

Hi @sad-dev - sorry to hear that the 2.2.5 release did not fix things for you, and thanks for letting us know. I will build a Chromium snapshot locally. Hopefully I can reproduce your problem and I will track down the cause of the missing files.

Usually it's because of either a segfault or catastrophic error in the extractor. In general, build-system parallelism shouldn't affect this, unless the crashes are related to out-of-memory errors. But the more likely cause is that the extractor is choking on some construct in a header file included by all the files that didn't get extracted. I'll let you know what I find.

Does the version of codeql affect the cli/extractor?

Yes, each new release of the CodeQL CLI includes the latest extractor changes.

Thank you for taking a look! With regards to the version of codeql, I was referring to the repos for the queries rather than the CLI (i.e. those found by codeql resolve languages). I deleted them and was still able to build with the cli, so that answers my question, I guess :)

sad-dev · 2020-08-25T12:14:02Z

@nickrolfe I made three builds of chromium with codeql-cli 2.25 on three machines with --threads=0 passed to codeql database create. The specs and results are below:

6C/12T, 32gb ram - 74.6k files
8C/16T, 32gb ram - 73k files
16C/32T, 32gb ram - 70k files

From the looks of it, more threads are leading to stability issues in the extractor, possibly due to a larger 'ripple effect'.

nickrolfe · 2020-08-25T14:00:32Z

Hi @sad-dev. The Chromium build I'm doing here (also using CodeQL 2.2.5) on an old Linux laptop still hasn't quite finished, but so far I see only 5 files (all in Skia) that failed to extract. And since you are seeing different behaviours on different machines, I'd like to see if we can find what's going wrong with the extractor from your build-tracer.log file.

Assuming that the failed extractions are caused by the extractor segfaulting or otherwise hitting a catastrophic error, do you think you could take a look at places where CodeQL C++ extractor: Backtrace occurs in the log and see if there's a common pattern in the backtraces you find? Note that the top few functions will always be in the exception-handling code, but the following entries will show what was executing when the exception happened.

If the log file is too big to open in a text editor, you could do something like:

grep -A 12 "CodeQL C++ extractor: Backtrace:" log/build-tracer.log

sad-dev · 2020-08-25T14:58:57Z

The greps (with and without the -A12) for "CodeQL C++ extractor" are attached for my 6,8, and 16 core builds, as well as the filelists.

extractor_6c.txt
extractor_6c_full.txt
extractor_8c.txt
extractor_8c_full.txt
extractor_16c.txt
extractor_16c_full.txt
f6c.txt
f8c.txt
f16c.txt

nickrolfe · 2020-08-25T15:52:39Z

Thanks for sending those. We (the CodeQL team) will need to fix those crashes, but they don't explain why you're missing so many files.

Can you confirm that you're building Chromium from scratch? CodeQL will only extract source files that it observes being compiled, so if the build system thinks foo.o is already up to date, it won't recompile foo.cc and then it won't get extracted. Also, are you using ninja to build? Ninja should be fine, but some other build systems compile code from a daemon process that CodeQL won't 'see'.

The next thing I would do is pick a file that's missing, say foobar.cc, and grep build-tracer.log for that name. If there are no results, then CodeQL did not observe that file being compiled. If there are some results, then they should give clues as to what happened when trying to extract that file.

The other possibility is that extraction was successful, but something went wrong during database finalization. Once my local build finishes, I can investigate whether anything goes wrong there for me, but it might be worth checking the database-create-*.log file for errors around the line Importing .*foobar.cc.*, if that exists.

sad-dev · 2020-08-25T17:01:45Z

I edited the previous post to update with filelists and extractor crashes. Also can confirm that all three were build with autoninja -C out/Default chrome with a empty (rm -rf out) output folder

Picking v8/src/wasm/wasm-objects.cc at random from comm f6.txt f16.txt,

1.) The build-tracer.log files are slightly broken up (even on 6c, it appears that multiple threads write to it at the same time) but they all show :
a) the extractor being called to 'mimic' clang on v8/src/wasm/wasm-objects.cc to create obj/v8/v8_base_without_compiler/wasm-objects.o
b) Creating trap tarball /...v8/src/wasm/wasm-objects.cc....trap.tar.br
c) Emitting trap files for ../../v8/src/wasm/wasm-objects.cc
d) Archiving /home/user/codeql_chromium_db/src/home/user/chromium/src/v8/src/wasm/wasm-objects.cc

only my 6core and 8core logs contain lines of the form
e) Wrote XXX files to ...v8/src/wasm/wasm-objects.cc....trap.tar.br
f) 11 errors detected in the compilation of "../../v8/src/wasm/wasm-objects.cc"

2.) the database-create-*log does not mention v8/src/wasm/wasm-objects.cc on 6c and 8c, but the 16c log has the line:
[ERROR] Skipping import of TRAP files in /home/user/codeql_latest_db/trap/cpp/tarballs/home/user/chromium/src/v8/src/wasm/wasm-objects.cc.506d6f4e_0.trap.tar.br as the manifest could not be read.

grep "manifest could not be read" on the logs throws up 118, 321, and 769 errors on the 6c, 8c, and 16c builds respectively.Some quick sed commands show they they all correspond to missing files in the respective databases

Further observations:
3.) On the 16c build, a large number of files do not even reach step (b) - creating the trap tarball.
4.) the manifest reading error almost perfectly corresponds with the difference in number of files appearing in steps (b) and steps (e).
5.) The files reaching step (e) corresponds almost (> 99%) perfectly to the different .cc files appearing between builds
6.) .h files only appear in step d, if at all (and not in steps a to c), and I observed missing files reaching and not reaching this step.

nickrolfe · 2020-08-25T17:46:01Z

Thanks, that's really useful.

My guess is that, on the 16c machine, you have a file at /home/user/codeql_latest_db/trap/cpp/tarballs/home/user/chromium/src/v8/src/wasm/wasm-objects.cc.506d6f4e_0.trap.tar.br.manifest.tmp file that is partially or completely truncated, and was not renamed (to strip the .tmp extension).

That's consistent with the extractor terminating prematurely, but I don't yet know why that would happen, and without any kind of backtrace. One guess is that it's the OOM killer.

However, I now see that it's also happening on my machine, so I have a good chance of tracking this down. I'll let you know how I get on, but thanks again for your help and patience!

nickrolfe · 2020-08-25T17:51:16Z

.h files only appear in step d, if at all (and not in steps a to c), and I observed missing files reaching and not reaching this step.

That is normal - the log entries in steps a to c derive from the name of the file being compiled. Step d corresponds to copying any observed source or header file to the database archive.

I also note the extractor running out of memory would be consistent with your observation that more threads means more failures. I will try to confirm if that's what's happening and, if so, what steps we can take to reduce the memory usage.

sad-dev · 2020-08-26T01:58:09Z

.h files only appear in step d, if at all (and not in steps a to c), and I observed missing files reaching and not reaching this step.

That is normal - the log entries in steps a to c derive from the name of the file being compiled. Step d corresponds to copying any observed source or header file to the database archive.

I also note the extractor running out of memory would be consistent with your observation that more threads means more failures. I will try to confirm if that's what's happening and, if so, what steps we can take to reduce the memory usage.

Instead of reducing the memory usage (or if it proves unfeasible to do so / fails to completely fix extractor crash problems), is it possible to rerun the extractor, either immediately upon failure or just before finalizing? One obvious thing to look out for is that build files might be temporary, thus necessitating an archival if the extractor is to be rerun at the end.

creative16 · 2020-08-26T04:07:50Z

Hello @nickrolfe ,

just an update after trying version 2.2.5 to build Chromium on Windows. Generally, it definitely looks like an improvement over 2.2.4. The table below provides some summarised numbers from my build:

	2.2.4	2.2.5
Total files in database	63356	73672
Total `.c`, `.cc`, `.cpp`, `.cxx` files	26248	31670
Total `.h`, `.hxx`, `.hpp`, `.hh` files	36787	41680

Grepping for "manifest could not be read" from database-create*.log only resulted in 76 results, as compared to 5555 for version 2.2.4. These files do correspond to files missing in the database.

There is, however, no significant difference when looking for "CodeQL C++ extractor: Backtrace" in build-tracer.log for both versions of CodeQL CLI (about 10 to 11 results each). The files indicated are also not present in database-create*.log, although I'm guessing perhaps this is to be expected. So there are still at least 76 + 11 = 87 missing source files.

Finally, as compared to version 2.2.4, the extractor for version 2.2.5 only crashed 7 times due to access violations as compared to hundreds (or thousands) of times for version 2.2.4.

To get a rough idea of how many files are missing, I'm wondering if you or @sad-dev have any idea how to get a full list of files that will be built by ninja?

sad-dev · 2020-08-26T06:18:10Z

I monitored my peak memory usage on a smaller codebase that uses ninja: v8

-j = 34 (default with ninja), no codeql: 10GB
-j = 12, codeql --threads=1 (default): 18GB
-j = 12, codeql --threads=0 : 18GB
-j = 18, codeql --threads=0 : 28GB
As long as memory is not exhausted, I have consistent builds on v8. Remove the limiter and:

-j = 34, codeql --threads=0 : maxes out on my 32gb system, and throws about 150 manifest errors as compared to 0 otherwise. About 5% less files in the database.

We are thus looking at a roughly 5x multiplier on peak memory usage that codeql imposes when compiling v8, and threads used for importing doesn't seem to affect that. Given chromium's higher complexity and much larger (about 30x) size, I wouldn't be surprised if larger multipliers of 10x or so are observed on the memory requirements.

creative16 · 2020-08-28T02:25:58Z

After comparing the list of files in the database, and a list of possible files I believe should have been compiled, I think the database is still missing about 1000 source files (i.e. .cc, .cpp, .cxx, .c) in the Windows build.

I did a grep for some of the missing files in build-tracer.log, and managed to find results for them that appears to show they were intercepted by the extractor. However, there seemed to be errors indicated. I include three of the most common ones I found below. Could these errors be the reason why the files were not included in the database? I checked the compiled binary using IDA Pro and verified that the files were compiled into the binary during the build.

"../../third_party/ffmpeg/libavutil/cpu.c", line 21: catastrophic error: cannot open source file "stdatomic.h"
  #include <stdatomic.h>

"../../content/browser/gpu/gpu_process_host.cc": using precompiled header file "obj/content/browser/browser_cc.pch.semmle"
"../../base/trace_event/trace_category.h", line 34: error: expression must have a constant value
          offsetof(TraceCategory, state_) == 0,
          ^
"../../base/trace_event/trace_category.h", line 34: note: conversion from "unsigned char" to "char" is invalid in constant-expression evaluation
          offsetof(TraceCategory, state_) == 0,
          ^

"../../buildtools/third_party/libc++/trunk/include/exception", line 216: error: expected a ")"
  _LIBCPP_FUNC_VIS exception_ptr __copy_exception_ptr(void *__except, const void* __ptr);
                                                            ^

nickrolfe · 2020-08-28T11:45:17Z

Catastrophic errors will cause that source file to be missing from the database; regular errors will not. I've created an internal issue to investigate why we don't find stdatomic.h on Windows.

Regarding the excessive memory consumption, we've identified one of the major causes. In general, memory consumption should only be affected by the complexity of the one source file being extracted, but there is one feature that causes memory usage to scale with the overall size of the entire project. For the vast majority of projects, this is never a problem, but with Chromium and its tens of thousands of source files, it is. Of course we want to reliably build databases of Chromium and other massive C++ projects, so we're looking for a solution.

creative16 · 2020-09-16T06:23:47Z

Hello,

I've tried a Chromium build on Windows using version 2.2.6. The database had about 1000 more source files than the one I created using 2.2.5, which is great. Thanks for the improvements!

The catastrophic error regarding the missing stdatomic.h still appears in the log file. Another error I noticed that popped up 173 times in the log file was the following:

ERROR: Uncaught Windows exception (EXCEPTION_STACK_OVERFLOW). Exception code: 0xc00000fd

I checked a few of the source files where this error popped up and indeed they were not included in the database. Would this be a cause for the missing files as well?

nickrolfe · 2020-09-16T10:41:16Z

Hi @creative16, thanks for letting us know.

Yes, the stack overflow exception would also cause the files to be missing from the database. That was the same underlying issue for the problem reported at the start of this thread, but since then we increased the extractor's stack size while we investigate a longer-term fix for the excessive stack usage. If you're seeing that in 2.2.6, then even the increased stack size is not sufficient.

I haven't observed this problem on my own Windows machine while building V8 databases, but I'll do another build of Chromium to try and reproduce the exception.

adityasharad added the CLI label Jul 31, 2020

github / codeql-cli-binaries

Missing Source Files from Database #39

Missing Source Files from Database #39

creative16 commented Jun 26, 2020

nickrolfe commented Jun 26, 2020

nickrolfe commented Jul 1, 2020

creative16 commented Jul 2, 2020

sad-dev commented Aug 17, 2020 •

edited

sad-dev commented Aug 23, 2020 •

edited

sad-dev commented Aug 23, 2020 •

edited

sad-dev commented Aug 24, 2020 •

edited

nickrolfe commented Aug 24, 2020

sad-dev commented Aug 24, 2020

sad-dev commented Aug 25, 2020 •

edited

nickrolfe commented Aug 25, 2020

sad-dev commented Aug 25, 2020 •

edited

nickrolfe commented Aug 25, 2020

sad-dev commented Aug 25, 2020 •

edited

nickrolfe commented Aug 25, 2020

nickrolfe commented Aug 25, 2020

sad-dev commented Aug 26, 2020 •

edited

creative16 commented Aug 26, 2020 •

edited

sad-dev commented Aug 26, 2020 •

edited

creative16 commented Aug 28, 2020 •

edited

nickrolfe commented Aug 28, 2020

creative16 commented Sep 16, 2020

nickrolfe commented Sep 16, 2020

github / codeql-cli-binaries

Join GitHub today

Missing Source Files from Database #39

Missing Source Files from Database #39

Comments

creative16 commented Jun 26, 2020

nickrolfe commented Jun 26, 2020

nickrolfe commented Jul 1, 2020

creative16 commented Jul 2, 2020

sad-dev commented Aug 17, 2020 • edited

sad-dev commented Aug 23, 2020 • edited

sad-dev commented Aug 23, 2020 • edited

sad-dev commented Aug 24, 2020 • edited

nickrolfe commented Aug 24, 2020

sad-dev commented Aug 24, 2020

sad-dev commented Aug 25, 2020 • edited

nickrolfe commented Aug 25, 2020

sad-dev commented Aug 25, 2020 • edited

nickrolfe commented Aug 25, 2020

sad-dev commented Aug 25, 2020 • edited

nickrolfe commented Aug 25, 2020

nickrolfe commented Aug 25, 2020

sad-dev commented Aug 26, 2020 • edited

creative16 commented Aug 26, 2020 • edited

sad-dev commented Aug 26, 2020 • edited

creative16 commented Aug 28, 2020 • edited

nickrolfe commented Aug 28, 2020

creative16 commented Sep 16, 2020

nickrolfe commented Sep 16, 2020

Essential cookies

Always active

Analytics cookies

sad-dev commented Aug 17, 2020 •

edited

sad-dev commented Aug 23, 2020 •

edited

sad-dev commented Aug 23, 2020 •

edited

sad-dev commented Aug 24, 2020 •

edited

sad-dev commented Aug 25, 2020 •

edited

sad-dev commented Aug 25, 2020 •

edited

sad-dev commented Aug 25, 2020 •

edited

sad-dev commented Aug 26, 2020 •

edited

creative16 commented Aug 26, 2020 •

edited

sad-dev commented Aug 26, 2020 •

edited

creative16 commented Aug 28, 2020 •

edited