How arangodb serves the query request if data not present in main memory #9531

ImSajeed · 2019-07-22T04:27:25Z

ArangoDB Version: 3.4.4
Storage Engine : RocksDB
Deployment Mode: Single node
Deployment Strategy : Manual Start
Infrastructure: AWS
Operating System: Ubuntu 16.04
Total RAM in your machine: 32 GB

Hi @jsteemann @graetzer @lservini
As per my understanding from this FAQ:

ArangoDB stores the working set (the set of pages that are frequently accessed) in main memory. It's left to the operating system to determine the working set and to transfer pages between main memory and secondary storage. The data that is currently not needed is kept only on secondary storage.

Our total data size is 50 GB and the RAM size is 32 GB.

I have a few questions that need to be addressed:

1) For a particular query, if the data is not present in the working set (main memory), will ArangoDB read from secondary storage to fetch the data and return the result set?

2) If so, will the disk reads on the machine spike drastically and the CPU usage increase during that time?

3) After serving the query, will the disk reads come down, or will it try to swap the working set with a new working set from secondary storage?

4) If it tries to swap the working set, how much time will that take (in my case the whole data set is 50 GB and the RAM is 32 GB), and will disk reads stay high and constant until the swap completes?

5) Is there any way to schedule the swap interval?

dothebart · 2019-07-24T12:00:53Z

Hi,
I'm sorry, but you've stumbled upon a piece of unmaintained documentation.

The 'mostly in memory' statement is only true if you choose the MMFiles storage engine. Its name originates from 'memory-mapped files', meaning that on-disk files are mapped into memory.
This also means that if you access parts of the files that the kernel has decided are not part of the "hot set" it keeps buffered in RAM, it will load them from the file on disk. The query engine working on top of this doesn't notice any of it, except perhaps that it's not as fast.
Yes, if you go outside of your hot set, performance is not as good and resource usage increases.
The operating system kernel decides what to do here.
You can influence the resource usage of your collections. If a collection needs to be 'loaded', its index has to be built up in memory, which means looking at all documents and adding references to them to the index.
By 'loading' a collection you make it available in RAM.
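
To illustrate the memory-mapping behavior described above, here is a minimal Python sketch. It is not ArangoDB code; the temporary file, the page layout, and the helper name `read_through_mapping` are assumptions made up for the example. It only shows that a read through a mapping looks like a plain memory access, while the kernel decides whether the page is already in RAM or has to be fetched from disk.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def read_through_mapping(path, offsets, length=16):
    """Read `length` bytes at each offset via a read-only mapping of `path`."""
    chunks = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            for offset in offsets:
                # Slicing looks like an in-memory read. If the page is not
                # resident, the kernel loads it from the file on disk first.
                chunks.append(mapped[offset:offset + length])
    return chunks


# Build a small hypothetical data file: 64 pages, each starting with its index.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for page in range(64):
        tmp.write(f"page-{page:04d}".encode().ljust(PAGE, b"."))
    data_path = tmp.name

result = read_through_mapping(data_path, [0, 10 * PAGE, 63 * PAGE], length=9)
os.unlink(data_path)
print(result)
```

The calling code never asks for disk I/O explicitly, which is why the layer on top only notices a cold page as extra latency.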

For the RocksDB storage engine we explain what's actually happening in this blog article:
https://www.arangodb.com/2019/03/small-steps-reduce-arangodb-resource-footprint/

ImSajeed · 2019-07-24T15:47:44Z

Hi @dothebart ,

Is there any specific reason why the disk reads on ArangoDB would suddenly spike? Is it due to data indexing from main memory to secondary storage?

Attached are the CPU, disk read, and memory utilization graphs of our read-only ArangoDB instance.

Disk reads are almost constant throughout the day, but they suddenly spike drastically and return to normal after some time.

Could you please explain this behavior?

dothebart · 2019-07-25T15:57:50Z

Do you also see high(er) write activity in that period? If yes, the reason could be the RocksDB compaction, which has to reorganize your data.
The database itself stores writes and updates as new documents at the time they are executed. Compaction is the later step in which these are brought together and the slack space inside the database files is filled. It happens during off-hours.
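
As a rough illustration of the idea behind the compaction described above, here is a toy Python sketch. It is not RocksDB's actual algorithm; the list-of-tuples log and the `compact` helper are assumptions for the example. Updates are appended as new entries, and compaction later keeps only the newest version of each key.

```python
def compact(log):
    """Keep only the newest version of each key and drop deleted keys."""
    latest = {}
    for key, value in log:
        # Later entries overwrite earlier ones for the same key.
        latest[key] = value
    # A value of None marks a deletion in this toy model.
    return [(key, value) for key, value in latest.items() if value is not None]


# Writes and updates are appended; nothing is modified in place.
log = [("a", 1), ("b", 2), ("a", 3), ("b", None), ("c", 4)]
compacted = compact(log)
print(compacted)
```

Even in this toy form, compaction has to read the whole log before it can write the smaller result, which is why it shows up as both read and write activity.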

ImSajeed · 2019-07-26T07:30:51Z

It's a read-only machine that we create daily by taking an AMI backup of the master DB machine, so there are no disk writes on this machine.

We observe this behavior daily. Due to the high disk reads (on the read-only machine), all queries get queued up and are reported as slow queries, and we have to restart the machine.

We have increased the IOPS from 1.5k to 3k, but we still observe this behavior.

Is it due to compaction? If so, is there any way to schedule it during off-peak hours?

dothebart · 2019-07-26T08:26:10Z

Well, if the master has had writes, the copy inherits the state of the master's data structures, right?
Yes, if you run out of burst credits or the like, everything comes to a grinding halt.

ImSajeed · 2019-07-30T05:08:44Z

Is it due to compaction? If so, is there any way to schedule it during off-peak hours?

ImSajeed · 2020-02-27T15:00:17Z

@dothebart can you provide an update on this?

dothebart · 2020-09-02T13:45:47Z

I'm closing this since ArangoDB 3.4 has reached end of life in the meantime.

The best way to observe what the process is doing in such a situation is to run gcore (please note that gdb 9 is required with recent arangods, at least) and top to inspect which threads (they should have names) are actually using the CPU.
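
As a complement to top, the per-thread CPU numbers can also be read directly from `/proc` on Linux. The following Python sketch is only an assumption-laden illustration: the helper names `parse_thread_stat` and `busiest_threads` are made up for this example, and the field positions follow the proc(5) layout of the per-thread `stat` file. It lists the threads of a process ordered by accumulated CPU ticks.

```python
import os


def parse_thread_stat(stat_line):
    """Return (thread_name, cpu_ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The name sits in parentheses and may itself contain spaces or ')'.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    name = stat_line[open_paren + 1:close_paren]
    fields = stat_line[close_paren + 2:].split()
    # fields[0] is the state (field 3 in proc(5)); utime and stime are
    # fields 14 and 15, so they sit at indexes 11 and 12 here.
    utime, stime = int(fields[11]), int(fields[12])
    return name, utime + stime


def busiest_threads(pid, limit=10):
    """List (ticks, tid, name) for the threads of `pid`, busiest first."""
    task_dir = f"/proc/{pid}/task"
    usage = []
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                name, ticks = parse_thread_stat(f.read())
        except OSError:
            continue  # the thread exited while we were scanning
        usage.append((ticks, int(tid), name))
    return sorted(usage, reverse=True)[:limit]


# Demo on the current process; pass the arangod PID instead in practice.
if __name__ == "__main__" and os.path.isdir("/proc/self/task"):
    for ticks, tid, name in busiest_threads(os.getpid()):
        print(f"{tid:>8} {name:<16} {ticks} ticks")
```

The tick counts are cumulative since each thread started, so sampling twice a few seconds apart and comparing the values shows which threads are busy right now.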

Please note that ArangoDB 3.7 is available.

dothebart added the 1 Question, 3 Documentation, 3 MMFiles, 3 RocksDB, and performance labels Jul 24, 2019

dothebart closed this Sep 2, 2020

dothebart added the 2 Out Of Date label Sep 2, 2020
