HomeMesosphereNo notifications. 4 unresolved issues.

Handle OfferMatching timeouts as well as some performance improvements
ClosedAll Users

Authored by aquamatthias on Dec 9 2016, 4:02 PM.

Details

Summary

The complete offer gets rejected, if one matcher runs into a timeout.
Remove custom ctor, that translates from Seq to Map.
LaunchQueue.notifyInstanceUpdate should not send queue info in return, since this is very expensive and not needed
Log debug messages on debug.
It does not make sense to start an offer match, if there is minimal time left.

Test Plan

Launch Simulator and start a simple app with 50K tasks
Launch Simulator and start 5000 apps with one task

Diff Detail

Repository
rMARATHON marathon
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
aquamatthias retitled this revision from to Handle OfferMatching timeouts as well as some performance improvements.Dec 9 2016, 4:02 PM
aquamatthias updated this object.
aquamatthias edited the test plan for this revision. (Show Details)

Mostly questions.

I'd like to see the test plan include full scale testing with caps, e.g.:

1 app max # of instances
N apps with 1, 10, 100, 1000 instances per app
Largest group able to be deployed with 1, 10, 100, 1000 instances per app

src/main/scala/mesosphere/marathon/core/launchqueue/LaunchQueue.scala
80

Out of context for this review, but shouldn't this come on event bus or a Source that we give to the queue so it can be informed? Seems like we could do better here basically.

src/main/scala/mesosphere/marathon/core/matcher/base/util/ActorOfferMatcher.scala
26

Should the ask also be using the deadline?

43

Seems a little short to me considering we still have to send the match to mesos right?

src/main/scala/mesosphere/marathon/core/task/update/impl/steps/NotifyLaunchQueueStepImpl.scala
21–22

Why not Done instead of a Future?

This revision is now accepted and ready to land.Dec 9 2016, 6:56 PM
meichstedt requested changes to this revision.Dec 9 2016, 7:19 PM
meichstedt added inline comments.
src/main/scala/mesosphere/marathon/core/launchqueue/LaunchQueue.scala
80

Hm, risky. This function is called from within the taskStatusUpdateSteps which are processed sequentially:

ContinueOnErrorStep(notifyHealthCheckManagerStepImpl),
ContinueOnErrorStep(notifyRateLimiterStepImpl),
ContinueOnErrorStep(notifyLaunchQueueStepImpl),
ContinueOnErrorStep(taskStatusEmitterPublishImpl),
ContinueOnErrorStep(postToEventStreamStepImpl),
ContinueOnErrorStep(scaleAppUpdateStepImpl)

If the notifyLaunchQueueStepImpl returns immediately without waiting until the TaskLauncherActor has updated its internal state, there might be a race at least with the scaleAppUpdateStepImpl. This will ask the launchQueue for it's state and react on that. The message delivery from two different senders is not guaranteed to be in order, so if the TaskLauncherActor receives the ask from the SchedulerActor before the update message, it will respond with incorrect values.

That whole construct is not very nice and we can do better; for now I think the future is the way to go ;)

src/main/scala/mesosphere/marathon/core/launchqueue/impl/LaunchQueueDelegate.scala
34–35

See above, I'm afraid we have to ask :|

This revision now requires changes to proceed.Dec 9 2016, 7:19 PM
  • NotifyLaunchQueueStep: make sure the update is processed in the TaskLauncherActor
meichstedt accepted this revision.Dec 12 2016, 2:54 PM

Thanks!

src/main/scala/mesosphere/marathon/core/launcher/InstanceOpFactory.scala
34

This is another candidate, but this can be cleaned up ins a subsequent patch

This revision is now accepted and ready to land.Dec 12 2016, 2:54 PM
This revision was automatically updated to reflect the committed changes.
aquamatthias marked 4 inline comments as done.Dec 12 2016, 4:00 PM
aquamatthias added inline comments.
src/main/scala/mesosphere/marathon/core/launchqueue/LaunchQueue.scala
80

That's probably a perfect example for a Stream.

src/main/scala/mesosphere/marathon/core/matcher/base/util/ActorOfferMatcher.scala
26

Yes it does

43

This is an optimization. If the match will take longer, a timeout is raised and interpreted as empty match. This does not take time to send to mesos into account.