HomeMesosphereNo notifications. 4 unresolved issues.

Do a better job at maintaining task failure rate limiting values per RunSpec
ClosedAll Users

Authored by ichernetsky on Aug 21 2017, 8:40 PM.

Details

Summary

There was --minimum_viable_task_execution_duration command-line argument was
introduced which is 60s by default. It doesn't respect maxLaunchDelay completely,
and looks like a hack. For instance, if maxLaunchDelay is more than 60s for
some RunSpec, the corresponding rate limiting can be reset/removed, while
the deployment of RunSpec is being delayed.

In addition to that, on conditions like Running, Created, existing delays are
advanced to make sure that delays are applied to failures, and time taken by
provisioning containers doesn't get subtracted from them.

In the future we might implement removal of rate-limiting delays when
a corresponding RunSpec becomes healthy.

Test Plan

sbt test

Diff Detail

Repository
rMARATHON marathon
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
ichernetsky created this revision.Aug 21 2017, 8:40 PM
jenkins requested changes to this revision.Aug 21 2017, 8:46 PM
This revision now requires changes to proceed.Aug 21 2017, 8:46 PM
✗ Build of 4090 failed jenkins-public-marathon-phabricator-836.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

✗ Build of 4090 failed jenkins-public-marathon-phabricator-837.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

I like it... restarting tests

✗ Build of 4090 failed jenkins-public-marathon-phabricator-839.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

✗ Build of 4090 failed jenkins-public-marathon-phabricator-841.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

✗ Build of 4090 failed jenkins-public-marathon-phabricator-842.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

✗ Build of 4090 failed jenkins-public-marathon-phabricator-845.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

jenkins requested changes to this revision.Aug 22 2017, 5:49 AM
This revision now requires changes to proceed.Aug 22 2017, 5:49 AM
✗ Build of 4093 failed jenkins-public-marathon-phabricator-846.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

ichernetsky updated this revision to Diff 4104.Aug 24 2017, 1:31 AM
  • Remove minimum_viable_task_execution_duration option from MarathonTest
jenkins requested changes to this revision.Aug 24 2017, 1:37 AM
This revision now requires changes to proceed.Aug 24 2017, 1:37 AM
jenkins accepted this revision.Aug 24 2017, 1:52 AM
This revision is now accepted and ready to land.Aug 24 2017, 1:52 AM
✔ Build of 4104 completed jenkins-public-marathon-phabricator-854.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

\\ ٩( ᐛ )و //

kensipe accepted this revision.Aug 24 2017, 1:53 AM
timcharper requested changes to this revision.Aug 25 2017, 12:35 AM

Seems okay. I think in fixing the test the semantics have been affected detrimentally.

src/test/scala/mesosphere/marathon/core/launchqueue/impl/RateLimiterTest.scala
35

I'm finding myself wishing this test were clarified. What, exactly, is it testing" "does X thing after calling resetDelaysOfViableTasks".

further, the stillWaiting variable name seems misleading.

Presume clock is 00:00:00

viable has a back off strategy of 10 seconds, with a max launch delay of 60 seconds. We invoke a delay. This presumably puts the deadline to 00:00:10.

stillWaiting has a back off strategy of 20 seconds, with a max launch delay of 70 seconds. We invoke a delay. This presumably puts the deadline to 00:00:20.

We then advance 61 seconds. The clock is now 00:01:01.

viable's deadline is completely reset, so the current time is returned because it has been in a delayed state for more than 60 seconds. This is sensible.

stillWaiting's deadline is the same as the last time delay was called: 41 seconds ago. I might be misreading this, but that does not seem "stillWaiting".

Is this a realistic simulation for how the component will actually be used?

Would it be better to simulate stepping through time, calling delay multiple times, perhaps 10 seconds at a time, and then asserting in a frame that a delay call, followed by resetDelaysOfViableTasks ceases to have an effect once maxLaunchDelay is reached?

I'm unsure if this would a better way to express the test; but, as it stands, the test is quite confusing.

This revision now requires changes to proceed.Aug 25 2017, 12:35 AM

"Seams okay" didn't come across right. This change seems okay :) I'm not entirely sure why we are removing the param that Matthias introduced, nor why he introduced it in the first place. And, the test is in a really weird state.

ichernetsky marked an inline comment as done.Aug 29 2017, 7:49 PM
ichernetsky added inline comments.
src/test/scala/mesosphere/marathon/core/launchqueue/impl/RateLimiterTest.scala
35

Thanks for commenting on this. I agree that the variable names are confusing. The purpose of this test is to ensure that the existing delays as the time flies, get deleted.

ichernetsky updated this revision to Diff 4150.Aug 29 2017, 7:50 PM
ichernetsky marked an inline comment as done.
  • Make resetDelaysOfViableTasks test case less confusing
jenkins requested changes to this revision.Aug 29 2017, 7:52 PM
This revision now requires changes to proceed.Aug 29 2017, 7:52 PM

I think Matthias introduced that parameter just to make sure that delays get GCed. But the implementation of clean-up and resetting delays was buggy.

jenkins accepted this revision.Aug 29 2017, 8:06 PM
✔ Build of 4150 completed jenkins-public-marathon-phabricator-901.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

\\ ٩( ᐛ )و //

timcharper accepted this revision.Aug 30 2017, 12:07 AM
This revision is now accepted and ready to land.Aug 30 2017, 12:07 AM
This revision was automatically updated to reflect the committed changes.