Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apf: add additional latency into work estimate #103240

Merged
merged 1 commit into from Aug 13, 2021

Conversation

@tkashem
Copy link
Contributor

@tkashem tkashem commented Jun 27, 2021

What type of PR is this?

/kind feature

What this PR does / why we need it:

add additional latency into width for a request

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@tkashem
Copy link
Contributor Author

@tkashem tkashem commented Jun 27, 2021

/assign @wojtek-t

@tkashem
Copy link
Contributor Author

@tkashem tkashem commented Jun 27, 2021

/assign @MikeSpreitzer

@tkashem tkashem force-pushed the tkashem:apf-additional-latency branch 2 times, most recently from c725c82 to a3af0e5 Jun 27, 2021
@tkashem tkashem force-pushed the tkashem:apf-additional-latency branch from a3af0e5 to 2a23c50 Jun 27, 2021
@MikeSpreitzer
Copy link
Member

@MikeSpreitzer MikeSpreitzer commented Jun 28, 2021

Extra latency should not be part of width.
Before we make a big divergence between counting requests and counting seats in the metrics, we should finish adding all the seat-counting metrics.
It would be easier to impose the extra latency in apf_filter.go than inside the QueueSet. The one advantage I see for imposing extra latency inside the QueueSet is the ability to have a divergence between the request-couting metrics and the seat-counting metrics.
A sleep in the request-handling goroutine will not get the desired behavior, which is to get a response to the client before the extra latency expires.
A PassiveClock does not have a sleep method.
Remember that PassiveClock is designed to facilitate testing, and QueueSet is tested with a FakeEventClock.
Remember that we have two clock packages.

@MikeSpreitzer
Copy link
Member

@MikeSpreitzer MikeSpreitzer commented Jun 28, 2021

#94738 - two clock packages with divergent history

@MikeSpreitzer
Copy link
Member

@MikeSpreitzer MikeSpreitzer commented Jun 28, 2021

To do a testable sleep we would promote EventClock into the clock package and add a Sleep method, which the fake one implements like the existing ClockWait function in queueset_test.go.

@fedebongio
Copy link
Contributor

@fedebongio fedebongio commented Jun 29, 2021

/triage accepted

@tkashem
Copy link
Contributor Author

@tkashem tkashem commented Aug 11, 2021

The unit test is failing, I am going to investigate it...

// latency has no impact on the user experience.
additionalLatency := r.workEstimate.AdditionalLatency
go func() {
qs.clock.EventAfterDuration(func(_ time.Time) {

This comment has been minimized.

@wojtek-t

wojtek-t Aug 12, 2021
Member

EventAfterDuration spins up a goroutine itself - so you don't want the outer one.

This comment has been minimized.

@tkashem

tkashem Aug 12, 2021
Author Contributor

good catch, that was causing the test to fail.

// Sleep has been invoked appropriately.
if additionalLatencyExpected := test.workEstimate.AdditionalLatency; additionalLatencyExpected > 0 {
// advance the fake clock so the events can transpire
clk.SetTime(now.Add(additionalLatencyExpected + time.Hour))

This comment has been minimized.

@wojtek-t

wojtek-t Aug 12, 2021
Member

I think you can just do clk.Run(nil)

This comment has been minimized.

@wojtek-t

wojtek-t Aug 12, 2021
Member

[and do this unconditionally to ensure that everything fired and finished]

This comment has been minimized.

@tkashem

tkashem Aug 12, 2021
Author Contributor

I prefer SetTime because this is a way for the test to say - i expect the event to happen exactly after AdditionalLatency


func() {
qs.lock.Lock()
defer qs.lock.Unlock()

This comment has been minimized.

@wojtek-t

wojtek-t Aug 12, 2021
Member

[and if you do above, I don't think you need locking]

This comment has been minimized.

@tkashem

tkashem Aug 12, 2021
Author Contributor

yes, removed.

func() {
qs.lock.Lock()
defer qs.lock.Unlock()

This comment has been minimized.

@wojtek-t

wojtek-t Aug 12, 2021
Member

And following on @MikeSpreitzer comment, you can also check time now - it should be now+additional_latency at this point

@tkashem tkashem force-pushed the tkashem:apf-additional-latency branch 2 times, most recently from 1b4d093 to bdd4c57 Aug 12, 2021
Copy link
Member

@MikeSpreitzer MikeSpreitzer left a comment

newTestableQueueSetFactory is intended to be the start of how a test func creates a queueSet.
Note that when doing goroutine counting, you need to use a promiseFactoryFactory that participates.

@tkashem tkashem changed the title [WIP] apf: add additional latency into work estimate apf: add additional latency into work estimate Aug 12, 2021
// TODO: for now we keep the logic localized so it is easier to see
// how the counters are tracked for queueset and queue, in future we
// can refactor to move this function.
releaseSeatsFn := func() {

This comment has been minimized.

@MikeSpreitzer

MikeSpreitzer Aug 12, 2021
Member

How about calling this releaseSeatsLocked ?

@tkashem tkashem force-pushed the tkashem:apf-additional-latency branch from bdd4c57 to 0fbcd81 Aug 12, 2021
@tkashem tkashem force-pushed the tkashem:apf-additional-latency branch from 0fbcd81 to d681864 Aug 12, 2021
Copy link
Member

@wojtek-t wojtek-t left a comment

The prod code looks fine to me.

The test still requires some though - see also Mike`s comments.

qs.finishRequestLocked(r)

// as soon as AdditionalLatency elapses we expect the seats to be released
clk.SetTime(now.Add(test.workEstimate.AdditionalLatency))

This comment has been minimized.

@wojtek-t

wojtek-t Aug 12, 2021
Member

Why not run clk.Run(nil)

[but then Mike`s comment about proper initialization of QS is a valid one]

This comment has been minimized.

@tkashem

tkashem Aug 12, 2021
Author Contributor

Why not run clk.Run(nil)

@wojtek-t for the test at hand, clk.SetTime is more concise in expressing the expectation that the event function should be executed immediately after AdditionalLatency elapses. So this is a verification step.

Alternatively, we can achieve this with clk.Run(nil) too:

clk.Run(nil)
get the time when the event fired and assert that it happened after AdditionalLatency elapsed

I think SetTime does the job concisely, thoughts?

This comment has been minimized.

@MikeSpreitzer

MikeSpreitzer Aug 12, 2021
Member

In my opinion, using SetTime to the precise time needed is sufficient for a test of finishRequestLocked. Using clk.Run(nil) would also be OK, and adding a test that Run advanced the fake time to the exact correct value would be better.

@MikeSpreitzer
Copy link
Member

@MikeSpreitzer MikeSpreitzer commented Aug 12, 2021

Rather than a checksum test, I recommend updating the behavioral tests.
Add width and duration padding to uniformClient.
Update the calculation of the expected results.
Add scenarios with width other than 1 and with non-zero duration padding.

@MikeSpreitzer
Copy link
Member

@MikeSpreitzer MikeSpreitzer commented Aug 12, 2021

demands[i] = float64(nThreads) * float64(uc.execDuration) / float64(uc.thinkDuration+uc.execDuration)

would become

demands[i] = float64(nThreads) * float64(uc.width) * float64(uc.execDuration) / float64(max(uc.thinkDuration, uc.padDuration )+ uc.execDuration)
@tkashem
Copy link
Contributor Author

@tkashem tkashem commented Aug 12, 2021

Rather than a checksum test, I recommend updating the behavioral tests.

this test is a behavioral test, but at a unit level, so if there is a regression due to a bug in finishRequestLocked this test will alert us by failing.
The behavioral test you are referring to is exercising the code from a much higher level. So I think this test in this PR is essential.

I agree that we should update the behavioral tests with variable work estimates. Do we want to do it in a follow-up PR? I am trying to expedite this to unblock #103539. Otherwise, happy to update the behavioral tests in this PR.

@tkashem
Copy link
Contributor Author

@tkashem tkashem commented Aug 12, 2021

newTestableQueueSetFactory is intended to be the start of how a test func creates a queueSet.
Note that when doing goroutine counting, you need to use a promiseFactoryFactory that participates.

@MikeSpreitzer do you mean this initialization pattern?

	clk, counter := testeventclock.NewFake(now, 0, nil)
	qsf := newTestableQueueSetFactory(clk, countingPromiseFactoryFactory(counter))
	qCfg := fq.QueuingConfig{
		Name:             "TestTotalRequestsExecutingWithPanic",
		DesiredNumQueues: 0,
		RequestWaitLimit: 15 * time.Second,
	}
	qsc, err := qsf.BeginConstruction(qCfg, newObserverPair(clk))
	if err != nil {
		t.Fatal(err)
	}
	qs := qsc.Complete(fq.DispatchingConfig{ConcurrencyLimit: 1})
	counter.Add(1) // account for the goroutine running this test
@tkashem
Copy link
Contributor Author

@tkashem tkashem commented Aug 12, 2021

@MikeSpreitzer this test exercises finishRequestLocked (a unit at a much lower level), not qs.StartRequest - so i don't think we need to use the above pattern, it adds unnecessary steps for the test:

                        now := time.Now()
			clk, counter := testeventclock.NewFake(now, 0, nil)
			qsf := newTestableQueueSetFactory(clk, countingPromiseFactoryFactory(counter))
			qCfg := fq.QueuingConfig{
				DesiredNumQueues: 1,
				QueueLengthLimit: 8,
				HandSize:         1,
			}
			qsc, err := qsf.BeginConstruction(qCfg, newObserverPair(clk))
			if err != nil {
				t.Fatal(err)
			}
			o := qsc.Complete(fq.DispatchingConfig{ConcurrencyLimit: 1})
			counter.Add(1) // account for the goroutine running this test

			qs, ok := o.(*queueSet)
			if !ok {
				t.Fatal("Not a queueSet")
			}

			queue := qs.queues[0]
			r := &request{
				qs:           qs,
				queue:        queue,
				workEstimate: test.workEstimate,
			}

                        ...

			qs.finishRequestLocked(r)
			counter.Add(-1) // completion of main activity of goroutine running this test

I am fine making this change if we want to stick to creating a fully initialized QueueSet in every unit test (as opposed to initialize what is needed), thoughts?

@MikeSpreitzer
Copy link
Member

@MikeSpreitzer MikeSpreitzer commented Aug 12, 2021

Yes, that's the initialization pattern I meant.

I suppose "checksum test" means "test of a small unit". My point is that tests of the whole queueset unit are necessary and sufficient. I am not a fan of maintaining tests of smaller units. If you want to put in a test of finishRequestLocked with non-zero duration padding in this PR and switch to whole-package testing in a follow-on PR, that is OK with me. I wiill prepare that follow-on.


qs.finishRequestLocked(r)

// as soon as AdditionalLatency elapses we expect the seats to be released

This comment has been minimized.

@MikeSpreitzer

MikeSpreitzer Aug 13, 2021
Member

If we really want to lock down the behavior of finishRequestLocked, we could also check the values one nanosecond before the proper release time.

This comment has been minimized.

@tkashem

tkashem Aug 13, 2021
Author Contributor

Yes, I will update the test to assert on this.

return
}

additionalLatency := r.workEstimate.AdditionalLatency

This comment has been minimized.

@MikeSpreitzer

MikeSpreitzer Aug 13, 2021
Member

I do not understand why local copy of this value is being created here. As far as i can tell, it is harmless but not helpful.

This comment has been minimized.

@tkashem

tkashem Aug 13, 2021
Author Contributor

r represents request scoped data and should not be accessed from another active goroutine, but just to be on the safe side i made a copy here.

This comment has been minimized.

@MikeSpreitzer

MikeSpreitzer Aug 13, 2021
Member

Neither the original nor the copy is accessed from the forked goroutine.

Copy link
Member

@MikeSpreitzer MikeSpreitzer left a comment

I made some independent comments.
I do not want to hold up this improvement due to wrangling over the test; that can be worked in follow-up.

@k8s-ci-robot
Copy link
Contributor

@k8s-ci-robot k8s-ci-robot commented Aug 13, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MikeSpreitzer, tkashem

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@MikeSpreitzer
Copy link
Member

@MikeSpreitzer MikeSpreitzer commented Aug 13, 2021

/LGTM

@k8s-ci-robot k8s-ci-robot merged commit eba1632 into kubernetes:master Aug 13, 2021
14 checks passed
14 checks passed
@thelinuxfoundation
cla/linuxfoundation tkashem authorized
Details
@k8s-ci-robot
pull-kubernetes-conformance-kind-ga-only-parallel Job succeeded.                     BaseSHA:f466d9278f2f0421a85b3f907e2b849b68141d8b
Details
@k8s-ci-robot
pull-kubernetes-dependencies Job succeeded.                     BaseSHA:f466d9278f2f0421a85b3f907e2b849b68141d8b
Details
@k8s-ci-robot
pull-kubernetes-e2e-gce-100-performance Job succeeded.                     BaseSHA:f466d9278f2f0421a85b3f907e2b849b68141d8b
Details
@k8s-ci-robot
pull-kubernetes-e2e-gce-ubuntu-containerd Job succeeded.                     BaseSHA:f466d9278f2f0421a85b3f907e2b849b68141d8b
Details
@k8s-ci-robot
pull-kubernetes-e2e-kind Job succeeded.                     BaseSHA:f466d9278f2f0421a85b3f907e2b849b68141d8b
Details
@k8s-ci-robot
pull-kubernetes-e2e-kind-ipv6 Job succeeded.                     BaseSHA:f466d9278f2f0421a85b3f907e2b849b68141d8b
Details
@k8s-ci-robot
pull-kubernetes-integration Job succeeded.                     BaseSHA:f466d9278f2f0421a85b3f907e2b849b68141d8b
Details
@k8s-ci-robot
pull-kubernetes-node-e2e-containerd Job succeeded.                     BaseSHA:f466d9278f2f0421a85b3f907e2b849b68141d8b
Details
@k8s-ci-robot
pull-kubernetes-typecheck Job succeeded.                     BaseSHA:f466d9278f2f0421a85b3f907e2b849b68141d8b
Details
@k8s-ci-robot
pull-kubernetes-unit Job succeeded.                     BaseSHA:f466d9278f2f0421a85b3f907e2b849b68141d8b
Details
@k8s-ci-robot
pull-kubernetes-verify Job succeeded.                     BaseSHA:f466d9278f2f0421a85b3f907e2b849b68141d8b
Details
@k8s-ci-robot
pull-kubernetes-verify-govet-levee Job succeeded.                     BaseSHA:f466d9278f2f0421a85b3f907e2b849b68141d8b
Details
@k8s-ci-robot
tide
Details
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Aug 13, 2021
@MikeSpreitzer
Copy link
Member

@MikeSpreitzer MikeSpreitzer commented Aug 13, 2021

Thanks!

@tkashem
Copy link
Contributor Author

@tkashem tkashem commented Aug 13, 2021

I do not want to hold up this improvement due to wrangling over the test; that can be worked in follow-up.

Thanks @MikeSpreitzer, I will be working on the follow-on PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment