
Rework DeploymentManager logic
Closed · All Users

Authored by zen-dog on Nov 18 2016, 5:53 PM.

Details

Reviewers
aquamatthias
meichstedt
Commits
rMARATHON95969d4ec96d: DeploymentManagers handles leader abdication correctly
rMARATHONa8bac97b8bf0: Rework DeploymentManager logic
rMARATHON3edb7b76bb6c: Remove SchedulerDriver from the PerformDeployment message
rMARATHON1a4d91cddd63: Rework DeploymentManager logic
rMARATHON07a96bff38a4: Fixed small typo in the DeploymentManagerTest
rMARATHON77480be2f189: Implemented part of the review change requests
rMARATHON0757f9ae7cfc: Rework DeploymentManager logic
rMARATHON6ac141df7dff: Remove SchedulerDriver from the PerformDeployment message
rMARATHONc571775db076: Fix MarathonSchedulerActor "Cancellation timeout" test
rMARATHON1b3b8ee4dd79: Fix MarathonSchedulerActor "Cancellation timeout" test
rMARATHONfcf377b9f45f: Rework DeploymentManager logic
rMARATHON0512601d1f99: DeploymentManagers handles leader abdication correctly
rMARATHON25c3b470fae0: Fix MarathonSchedulerActor "Cancellation timeout" test
rMARATHON4ff8a81fb223: Rework DeploymentManager logic
rMARATHON902525004ad1: Fix MarathonSchedulerActor "Cancellation timeout" test
rMARATHONe21dfeab7024: Rebased on master and implemented review change requests
rMARATHON11769f986b47: Fixed small typo in the DeploymentManagerTest
rMARATHON5ef326a717de: Rebased on master and implemented review change requests
rMARATHON29a981004e3c: Rework DeploymentManager logic
rMARATHONe0aa5d87671c: Fix MarathonSchedulerActor "Cancellation timeout" test
rMARATHONe7b3f8652e08: Fix MarathonSchedulerActor "Cancellation timeout" test
rMARATHON50c5b624725c: Implemented part of the review change requests
rMARATHONba74e901da0b: Rework DeploymentManager logic
JIRA Issues
JIRA MARATHON_EE-1042 Move deployment and logic out of MarathonSchedulerActor -> DeploymentManager
Summary

Rework DeploymentManager logic

  • Deployments are started with the new StartDeployment message
  • Removed MarathonSchedulerActor "awaitCancelation" logic - forced deployments are now completely handled by the DeploymentManager
  • DeploymentManager holds its own deployment repository reference and performs all repository operations
  • DeploymentManager also handles failing a deployment that arrives via a StartDeployment message while the affected run specs are locked (force = false)
  • Fixed unit tests to reflect the changes and added new tests
  • Correct handling of leader abdication/reelection: DeploymentManager now goes into a suspended receive mode where it ignores all messages except "RecoverDeployments", which signals leader election (see the sketch after this list)
  • Added documentation on deployment manager message flow
  • Removed deployment repository reference from MarathonSchedulerActor
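For illustration, the suspended receive mode described above could look roughly like the following Akka sketch. Only the RecoverDeployments and StartDeployment message names come from this revision; the placeholder types and helper bodies are invented for the example and are not the actual implementation.

import akka.actor.{ Actor, ActorRef }

// Placeholder types; the real DeploymentPlan and message classes live in the Marathon codebase.
trait DeploymentPlan
case class RecoverDeployments(deployments: Seq[DeploymentPlan])
case class StartDeployment(plan: DeploymentPlan, origSender: ActorRef)

class DeploymentManagerSketch extends Actor {
  // While suspended (e.g. after losing leadership), ignore everything except the recovery signal.
  def suspended: Receive = {
    case RecoverDeployments(deployments) =>
      deployments.foreach(schedule) // placeholder for re-scheduling the stored deployments
      context.become(active)
    case _ => // silently drop everything else until we are leader again
  }

  def active: Receive = {
    case StartDeployment(plan, _) =>
      schedule(plan) // the real logic also handles conflicts, locks and persistence
  }

  override def receive: Receive = suspended

  private def schedule(plan: DeploymentPlan): Unit = () // stub
}
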
Test Plan

unit tests, integration tests

Diff Detail

Repository
rMARATHON marathon
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
zen-dog added inline comments.Nov 28 2016, 6:18 PM
src/main/scala/mesosphere/marathon/MarathonSchedulerActor.scala
264–265

I left it there because this way the event publishing happens strictly after the lock is removed. Otherwise there is no guarantee of that.

zen-dog updated this revision to Diff 815.Nov 28 2016, 8:44 PM

Implemented review change requests

  • Minor improvement in the MarathonSchedulerActor logic
zen-dog marked 2 inline comments as done.Nov 28 2016, 8:46 PM
zen-dog added inline comments.
src/main/scala/mesosphere/marathon/MarathonSchedulerActor.scala
246–247

It is.

263–264

I simplified the calling logic and added a comment to it.

aquamatthias requested changes to this revision.Nov 29 2016, 2:24 PM
aquamatthias added inline comments.
src/main/scala/mesosphere/marathon/MarathonSchedulerActor.scala
263–264

Oversimplified. Now this is possible:

  • Deployment A starts --> lock is acquired by A
  • Deployment B starts with force=true --> lock is acquired by B
  • Deployment A fails --> this will remove the lock acquired by B. Deployment B is now in progress, and no lock is held for any of the apps of A and B.
This revision now requires changes to proceed.Nov 29 2016, 2:24 PM
zen-dog updated this revision to Diff 883.Nov 30 2016, 12:09 PM
zen-dog marked 2 inline comments as done.

Fixed incorrect lock handling in the MarathonSchedulerActor

zen-dog marked 2 inline comments as done.Nov 30 2016, 12:15 PM
zen-dog added inline comments.
src/main/scala/mesosphere/marathon/MarathonSchedulerActor.scala
263–264

Right. Thanks for the catch. I've made lockedRunSpecs a Map[PathId, Int] to store the number of locks held. This makes things better, but the solution is still not very robust. Since DeploymentManager has all the information about running deployments, I would suggest removing the locks from MarathonSchedulerActor completely and asking DeploymentManager instead. This should be fairly simple to implement, but I would wait with it until our ITs are green again.
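For illustration only, a reference-counted lockedRunSpecs along those lines could be sketched as below; PathId is a stand-in for Marathon's own type and the helper names are invented, not taken from the diff.

// Sketch of reference-counted run spec locks (Map[PathId, Int]).
object LockCountingSketch {
  type PathId = String // stand-in for Marathon's PathId

  var lockedRunSpecs: Map[PathId, Int] = Map.empty

  def addLocks(ids: Seq[PathId]): Unit =
    ids.foreach(id => lockedRunSpecs += id -> (lockedRunSpecs.getOrElse(id, 0) + 1))

  // Releasing one deployment's locks frees a run spec only when no other holder remains,
  // so a failing deployment A can no longer drop the lock a forced deployment B still holds.
  def removeLocks(ids: Seq[PathId]): Unit =
    ids.foreach { id =>
      val remaining = lockedRunSpecs.getOrElse(id, 0) - 1
      if (remaining <= 0) lockedRunSpecs -= id
      else lockedRunSpecs += id -> remaining
    }

  def isLocked(id: PathId): Boolean = lockedRunSpecs.getOrElse(id, 0) > 0
}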

aquamatthias requested changes to this revision.Dec 1 2016, 11:23 AM

Thanks. Mostly minor comments.

src/main/scala/mesosphere/marathon/MarathonSchedulerActor.scala
52

According to the code, this comment is misleading:

  • a lock is acquired if deployment is started
  • a lock is acquired if a kill operation is executed
  • a lock is acquired if a scale operation is executed

This basically means:

  • a kill/scale operation should not be performed, while a deployment is in progress
  • a deployment should not be started, if a scale/kill operation is in progress

Asking the deployment manager would not be sufficient.

254

Please add a sentence to the documentation of force. If a deployment is forced:

  • the new deployment will be started
  • the old deployment will be cancelled and release all claimed locks
  • only in this case, one RunSpec can have 2 locks!
src/main/scala/mesosphere/marathon/upgrade/DeploymentManager.scala
150

The DeploymentManager is started with Marathon even if we are not elected as leader.
That is the reason why this is an Option.
You cannot safely call get here.

193–198

Just for clarity: please add: if !isScheduledDeployment(plan.id)

323

You not only mark this deployment, but also start a deployment actor.
Suggestion: startDeployment??

src/test/scala/mesosphere/marathon/MarathonSchedulerActorTest.scala
509–510

yes.

This revision now requires changes to proceed.Dec 1 2016, 11:23 AM
zen-dog updated this revision to Diff 1037 (edited).Dec 4 2016, 8:13 PM
zen-dog marked 5 inline comments as done.
  • Rebased on master and implemented review change requests
  • Handled one case where DeploymentManager should send MarathonSchedulerActor a notification about a canceled deployment
  • Removed SchedulerDriver from the DeploymentManager since it wasn't really needed anywhere
  • Removed "Cancellation timeout" MarathonSchedulerActor test since it wasn't needed anymore
  • Fixed some comments
zen-dog updated this revision to Diff 1048.Dec 5 2016, 12:03 PM

Retest this please

zen-dog updated this revision to Diff 1058.Dec 5 2016, 2:56 PM

Yet another rebase

aquamatthias accepted this revision.Dec 5 2016, 3:32 PM

@zen-dog Thanks. A step in the right direction. This definitely needs further cleanup.

This revision is now accepted and ready to land.Dec 5 2016, 3:32 PM
zen-dog updated this revision to Diff 1062.Dec 5 2016, 3:42 PM

Fixed scapegoat warnings

zen-dog updated this revision to Diff 1074.Dec 5 2016, 7:49 PM
  • Fixed small typo in the DeploymentManagerTest
  • Implemented part of the review change requests
  • Rebased on master and implemented review change requests
meichstedt requested changes to this revision.Dec 6 2016, 2:11 PM
meichstedt added inline comments.
src/main/scala/mesosphere/marathon/MarathonSchedulerActor.scala
86

The name is not clear imo. What the message does is basically a GET, although I understand that it should only be used when trying to continue deployments that have been suspended because of a loss of leadership, which should be reflected in the name of this message. What about RetrieveDeployments or RetrieveSuspendedDeployments?

The associated response should be named accordingly, DeploymentsRecovered is not good name either (imo).

88

See above: not a good name. DeploymentsRecovered implies that deployments have been recovered, which is not the case here. Rather DeploymentsToRecover or SuspendedDeployments or something like that.

91

Not introduced with this diff, but why do we send from deadLetters?

223

Not high prio for this PR, but couldn't we extract all the lock handling into something that only handles locks, thus moving all of these helpers out of this actor?

src/main/scala/mesosphere/marathon/upgrade/DeploymentManager.scala
213

If storing the plan fails, it is still marked as scheduled. It should only be marked as scheduled if storing succeeded, otherwise the actor's state and the repo are out of sync.

I also wonder why you wait for the plan to be stored and then send a message to self. Wouldn't it be clearer to (roughly as in the sketch below this comment)

  1. store the new plan and map into that future with self ! LaunchDeploymentActor(plan, origSender)
  2. do the internal state mutation on receiving that message, after the plan has been persisted:
case LaunchDeploymentActor(plan, origSender) => {
  markScheduled(plan)
  origSender ! DeploymentStarted(plan)
  startDeployment(plan, origSender)
}

?
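A minimal sketch of step 1, assuming a deploymentRepository-style store function and the LaunchDeploymentActor message from this comment; the names are illustrative, not verbatim from the diff.

import akka.actor.{ Actor, ActorRef }
import akka.pattern.pipe
import scala.concurrent.Future

// Placeholder types standing in for the real Marathon ones.
trait DeploymentPlan
case class LaunchDeploymentActor(plan: DeploymentPlan, origSender: ActorRef)

trait StoreThenLaunchSketch extends Actor {
  import context.dispatcher

  // Stand-in for deploymentRepository.store(plan).
  def store(plan: DeploymentPlan): Future[Unit]

  // Step 1: persist first, then tell ourselves to mutate internal state and launch the actor;
  // step 2 is the LaunchDeploymentActor case shown in the comment above.
  def scheduleDeployment(plan: DeploymentPlan, origSender: ActorRef): Unit =
    store(plan).map(_ => LaunchDeploymentActor(plan, origSender)).pipeTo(self)
}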

233

It's easier to understand that way now, but it's easy to mess up with permutations in the future. The current state is exhaustive wrt StartDeployment; if one of these cases is slightly changed, there is a chance that the PF no longer catches certain messages. How about catching the message in one case and calling distinct methods from there? Note that all 3 cases compute the conflicting deployments to check whether the case matches, and in 2 of them the conflicting deployments are computed again in the function body. When matching once, you can store the result of this computation.

Also, case matches with more than 2 lines hurt readability imo.
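One possible shape of that suggestion, sketched with invented method names: conflicts are computed exactly once and each situation is delegated to its own method.

import akka.actor.{ Actor, ActorRef }

// Placeholder message; the force flag is folded into it for brevity.
trait DeploymentPlan
case class StartDeployment(plan: DeploymentPlan, origSender: ActorRef, force: Boolean)

trait SingleCaseDispatchSketch extends Actor {
  def conflictingDeployments(plan: DeploymentPlan): Seq[DeploymentPlan]
  def startNonConflicting(plan: DeploymentPlan, origSender: ActorRef): Unit
  def cancelConflictsAndStart(plan: DeploymentPlan, conflicts: Seq[DeploymentPlan], origSender: ActorRef): Unit
  def rejectLocked(plan: DeploymentPlan, conflicts: Seq[DeploymentPlan], origSender: ActorRef): Unit

  // One case for StartDeployment; the computed conflicts are reused in every branch.
  def receive: Receive = {
    case StartDeployment(plan, origSender, force) =>
      val conflicts = conflictingDeployments(plan)
      if (conflicts.isEmpty) startNonConflicting(plan, origSender)
      else if (force) cancelConflictsAndStart(plan, conflicts, origSender)
      else rejectLocked(plan, conflicts, origSender)
  }
}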

253

Same as above: the internal state is changed before persisting the change. Please persist first and then change state.

This revision now requires changes to proceed.Dec 6 2016, 2:11 PM
zen-dog added inline comments.Dec 6 2016, 5:04 PM
src/main/scala/mesosphere/marathon/upgrade/DeploymentManager.scala
213

Consider the following scenario with two deployments A and B (same pathId) where B is forced and ZK storing is slow:

  • StartDeployment A is coming in
  • A is stored (wait for store to succeed)
  • StartDeployment B is coming in and is forced
  • B is stored too since it's not in the internal state yet and therefore there are no conflicts
  • LaunchDeploymentActor A comes in and proceeds
  • LaunchDeploymentActor B comes in and... proceeds too!

The whole point of saving deployments in the internal state first was to avoid such conflicts.
The case where storing fails is not handled (and wasn't handled in the previous version) since in that case the master fails over and the new master picks it up from there with a fresh state. This feels like a longer discussion, so maybe we should sit down together and talk about it.

meichstedt added inline comments.Dec 7 2016, 1:27 PM
src/main/scala/mesosphere/marathon/upgrade/DeploymentManager.scala
213

I don't see why starting DeploymentActor B couldn't mean stopping DeploymentActor A when DeploymentPlan B overrules A. But I see that this probably complicates things on another level; I'm open for discussion.

It might be that failing futures from persisting the plans weren't handled before, but that's no excuse :P
And no – Marathon does not fail over just because of a failed write. It fails over when the ZK connection is lost, not when a read or write operation fails. This should definitely be handled here, no matter in which order the steps are performed.

219

Could we stop doing that then? If there is one place in the code where we send with deadLetters as sender, we basically need to check for noSender everywhere, as it's really hard to track original senders throughout the codebase.

zen-dog updated this revision to Diff 1146.Dec 8 2016, 2:29 AM
  • Implemented basic recovery from failing repository operations
zen-dog updated this revision to Diff 1147.Dec 8 2016, 11:01 AM

Rebased and removed an unused import

zen-dog updated this revision to Diff 1149.Dec 8 2016, 11:34 AM
  • Fixed scapegoat async/await warnings
zen-dog updated this revision to Diff 1150.Dec 8 2016, 12:25 PM

Fixed scaladoc error

meichstedt requested changes to this revision.Dec 8 2016, 4:06 PM

Thanks, just 3 super mini comments :)

src/main/scala/mesosphere/marathon/upgrade/DeploymentManager.scala
249

NonFatal(e)

266

NonFatal(e)

306

NonFatal(e)
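For context, NonFatal here is the standard scala.util.control.NonFatal extractor; a minimal, illustrative recovery using it (method name and fallback behaviour are made up, not the diff's actual code):

import scala.concurrent.{ ExecutionContext, Future }
import scala.util.control.NonFatal

object NonFatalRecoverySketch {
  // Catch only non-fatal errors from a repository operation instead of all Throwables,
  // so fatal errors such as OutOfMemoryError are not swallowed.
  def storeWithRecovery[T](op: => Future[T], fallback: T)(implicit ec: ExecutionContext): Future[T] =
    op.recover {
      case NonFatal(e) =>
        // the real code would log e, revert internal state and/or notify the sender here
        fallback
    }
}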

This revision now requires changes to proceed.Dec 8 2016, 4:06 PM
zen-dog updated this revision to Diff 1158.Dec 8 2016, 4:38 PM
zen-dog marked 3 inline comments as done.

Implemented feedback: catching NonFatal instead of Throwable

This comment was removed by zen-dog.
meichstedt accepted this revision.Dec 8 2016, 4:42 PM

Thanks!

This revision is now accepted and ready to land.Dec 8 2016, 4:42 PM
This revision was automatically updated to reflect the committed changes.