Attempt to handle a stream being silently killed in the background #2121
Conversation
await pool.query('SELECT NOW()')
await pool.query('SELECT NOW()')
await pool.query('SELECT NOW()')
briandamaged
Feb 26, 2020
Perhaps you should also add something like:
const client2 = await pool.connect();
try {
  // Verify that `client` was removed from the pool and replaced w/ a new client
  expect(client2).to.not.equal(client);
} finally {
  await client2.release();
}
briandamaged
Feb 26, 2020
(Otherwise, it's possible that client just magically fixed itself through some other means. So, adding this check helps to rule out that possibility.)
brianc (Author, Owner)
Feb 26, 2020
yah that's not a bad idea! I'd like to hold off on merging this until I get some confirmation it's actually fixing things in AWS Lambda. It also unfortunately doesn't work w/ the native bindings yet, so I'll need to figure that out somehow.
briandamaged
Feb 26, 2020
Sounds good.
As I mentioned in this thread, I suspect that both "pg" and "knex" were encountering the same issue within their respective pool implementations. So, we'll most likely need to implement a similar fix on the "knex" side.
// remove this client, it's died in the background - this can happen
// in aws lambdas - they go idle and the streams are closed without an error
if (client.connection && client.connection.stream.readyState !== 'open') {
  this._remove(client)
briandamaged
Feb 26, 2020
In the current logic: if it notices that the next-available client is already dead, then it removes it from the pool. Afterwards, it instantiates a new client to replace it.
Would there be any advantage to looping on the client eviction step? In pseudo-code:
while (this._idle.length) {
  const client = this._idle.pop();
  // ... Other important details I'm ignoring for the moment...
  if (isDead(client)) {
    this.remove(client);
  } else {
    return client;
  }
}

if (!this._isFull()) {
  return this.newClient(...);
}

throw new Error('unexpected condition');
This seems like it helps in 2 ways:
- It clears out the dead clients more quickly, and
- It avoids the need to wait for this.newClient(..) when there is already another client available.
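For illustration, here is a slightly fuller JavaScript sketch of that eviction loop. This is a hypothetical sketch, not pg-pool's actual code: isDead wraps the stream check from this PR's diff, and _idle, _isFull, _remove, and newClient stand in for internals whose real names and signatures may differ.

// Hypothetical sketch of the eviction loop suggested above; names mirror the
// pseudo-code and this PR's diff, not necessarily real pg-pool internals.
function isDead(client) {
  // same check the PR introduces: the underlying socket is no longer open
  return client.connection && client.connection.stream.readyState !== 'open'
}

function acquireIdleClient(pool) {
  while (pool._idle.length) {
    const client = pool._idle.pop()
    if (isDead(client)) {
      // evict silently-killed clients and keep scanning the idle list
      pool._remove(client)
    } else {
      // hand back the first idle client that still looks alive
      return client
    }
  }
  // no live idle clients left; create a new one if the pool has room
  if (!pool._isFull()) {
    return pool.newClient()
  }
  throw new Error('unexpected condition: pool is full but has no live idle clients')
}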
592da
Feb 26, 2020
hi guys, I want to test this branch, but npm says it's missing the version... tried to fork it, put the version (https://github.com/592da/node-postgres) and then compiling failed: Cannot find module 'pg'
briandamaged
Feb 26, 2020
Hmm... I suspect that's because this is a mono-repo that is built via Lerna. You might need to build it locally first and then link to the build artifacts. (@brianc might know another approach. I haven't actually worked with this codebase before.)
I don't remember the exact build steps for Lerna off the top of my head, but I think it's something like:
- Install Lerna
- Navigate to the root of the repo
- Run the command: lerna bootstrap
brianc (Author, Owner)
Feb 26, 2020
> hi guys, I want to test this branch, but npm says it's missing the version... tried to fork it, put the version (https://github.com/592da/node-postgres) and then compiling failed: Cannot find module 'pg'

if you wanna test it, it's probably easiest to just apply the diff of the 1 file I changed inside your node_modules/pg-pool folder to index.js (or just copy/paste the changes).
592da
Feb 26, 2020
thanks for your reply @briandamaged. Unfortunately, it didn't work, maybe because it's also imported from typeorm
592da
Feb 26, 2020
@brianc, having the same issue...
2020-02-26T21:38:37.302Z da5bbf66-06b7-4c69-9c85-ab84b1b777a1 INFO Error: Connection terminated unexpectedly
at Connection.<anonymous> (/var/task/src/rest/webpack:/node_modules/pg/lib/client.js:255:1)
at Object.onceWrapper (events.js:312:28)
at Connection.emit (events.js:223:5)
at Connection.EventEmitter.emit (domain.js:475:20)
at Socket.<anonymous> (/var/task/src/rest/webpack:/node_modules/pg/lib/connection.js:139:1)
at Socket.emit (events.js:228:7)
at Socket.EventEmitter.emit (domain.js:475:20)
at endReadableNT (_stream_readable.js:1185:12)
at processTicksAndRejections (internal/process/task_queues.js:81:21)
return this._acquireClient(client, pendingItem, idleListener, false)
// remove this client, it's died in the background - this can happen
// in aws lambdas - they go idle and the streams are closed without an error
if (client.connection && client.connection.stream.readyState !== 'open') {
briandamaged
Feb 27, 2020
I'm seeing some mention online that stream.readyState is deprecated/unsupported. It might be safer to do this instead:
(client.connection.stream.readable && client.connection.stream.writable)
When experimenting w/ things locally, I'm finding that these properties change as soon as you call stream.end().
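For what it's worth, that alternative could be wrapped in a tiny helper along these lines (the helper name is made up, not part of pg-pool, and the comment simply restates the observation above):

// Hypothetical helper for the readable/writable check suggested above.
function streamLooksAlive(stream) {
  // both flags drop once stream.end() has been called
  return Boolean(stream && stream.readable && stream.writable)
}

// which would slot into the PR's guard roughly like this:
// if (client.connection && !streamLooksAlive(client.connection.stream)) {
//   this._remove(client)
// }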
brianc (Author, Owner)
Feb 27, 2020
yeah, those properties don't exist in all versions of node that pg supports. stream.readyState does exist in all supported versions.
brianc (Author, Owner)
Feb 27, 2020
> When experimenting w/ things locally, I'm finding that these properties change as soon as you call stream.end().

you should check if readyState changes as well (I think it does) - reason is any official patch is gonna need to work in node 8, 10, 12, and 13. So if it's "officially deprecated" in node but still works, I'd use it (as it probably wasn't deprecated in older versions of node). Typically I do that, write a bunch of tests for the feature, and then if a test breaks on an edge release of node I either decide to drop old version support at that point or make a compatibility branch in the code that will handle both old and new versions of node.
briandamaged
Feb 27, 2020
Sure -- I'll give that a try after the initial experiment. Speaking of which: you can find the experiment within the knex pooling here:
https://github.com/briandamaged/knex/tree/experiment/aws-disconnections
And, here's the link to the specific commit:
So, I think this roughly aligns w/ your implementation. (Aside from the readyState piece.)
brianc (Author, Owner)
Feb 27, 2020
yah totally, actually I don't wanna burden you w/ backwards compatibility - if you find a way to make it work in lambda w/ a version of node, I'll do my best to back-port it and "production-ify" it for this library. Mostly it's just hard for me to isolate & test since I don't have a lambda nor much experience w/ them, so setting all that testing infrastructure up is a huge time sink. But if you figure out the root cause, I'll help any way I can getting a fix out.
briandamaged
Feb 27, 2020
Ah -- yeah, I currently don't have a way to recreate the issue, either. I just know that Knex's current implementation relies upon Events to detect the closed connections. So, this just seems like a possible way to work around that issue.
Hopefully we'll hear some good news from this experiment! If so, then we can figure out how to formulate the concepts into something that is production-worthy. (One thought: we might even need to swap out implementations depending upon Node version. Ex: use readable / writable if the Node version supports those properties; otherwise, fall back to readyState.)
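A minimal sketch of that version-dependent fallback, assuming a hypothetical helper name and that the stream exposes the flags as plain booleans on newer Node versions:

// Hypothetical feature-detecting check: prefer readable/writable where the
// stream exposes them, otherwise fall back to the older readyState property.
function isStreamOpen(stream) {
  if (typeof stream.readable === 'boolean' && typeof stream.writable === 'boolean') {
    return stream.readable && stream.writable
  }
  // older Node versions: readyState stays 'open' while the socket is connected
  return stream.readyState === 'open'
}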
I put this up as a proof-of-concept fix. If this doesn't fix the issue, please lemme know & I'll close it. If it does fix the issue, let me know & I'll clean it up and land it for real. If this doesn't end up fixing it, I hope this can be a starting point for someone to do more investigation and eventually submit a pull request. As I don't have a lambda or a way to reproduce the problem, I don't think my time is best spent trying to track this particular issue down. So... I'm open to a PR to fix the issue & supplying ideas and guidance, but I'm likely not going to push on fixing this issue myself until I take care of other higher-priority issues for the project.
michaelseibt commented Feb 28, 2020
I'll give this a try. Since we're not able to reproduce the issue reliably, it could take a while to come back to you. Thanks for your patience!
michaelseibt commented Feb 28, 2020
Here is what I tried: I've added a couple of logging statements around the lines of code touched by this PR:
Built it, shipped it, waited. After a while, these logs came up:
Is there something else that I could provide to help debug this?
Edit: I noticed that two different Lambdas (not two instances of the same Lambda) had the same error logged at the same time: 09:09:24.78 and 09:09:24.518. Two messages are also logged in RDS:
592da commented Mar 5, 2020
any news regarding this?
osdiab commented Mar 31, 2020
looks like this PR is stalling? any way to help?
I don't really use lambda - the best way to help would be to contribute a PR to fix the issue - without me having access to lambda I'm likely not going to be very helpful in finding a fix. Will definitely be down to review & merge any fix if you find it!
dgobaud commented May 7, 2020
we are having this same problem in Google Cloud, it seems
brianc commented Feb 26, 2020
I think this will fix the issue outlined in #2112
I just pounded out this initial version pretty quickly, as I'm not 100% sure it'll fix the issue yet... I'm not sure my test is exactly replicating what's going on in lambda. In the test I'm "cleanly" severing the connected stream to the backend, which doesn't emit an error anywhere, so the client/pool still thinks it's okay.
Could someone from #2112 try applying this patch to their code in lambda and see if it fixes the issue? If so I'll polish this up, write a couple more tests, and get a patch version fix out.
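For anyone who wants to try reproducing this locally, a rough sketch of that kind of test might look like the following. It is only an illustration of the idea, not the exact test in this PR, and it reaches into pg internals (client.connection.stream) to end the socket without emitting an error.

const { Pool } = require('pg')

it('recovers when an idle connection is silently closed', async () => {
  const pool = new Pool({ max: 1 })
  const client = await pool.connect()
  await client.query('SELECT NOW()')
  client.release()

  // simulate the lambda scenario: end the socket "cleanly" so no error is emitted
  client.connection.stream.end()

  // with the patch, the pool should notice the dead stream, evict the client,
  // and serve this query from a fresh connection
  await pool.query('SELECT NOW()')
  await pool.end()
})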