HomeMesosphereNo notifications. 4 unresolved issues.

Reset root group cache when elected as leader
ClosedAll Users

Authored by unterstein on Jun 9 2017, 9:33 AM.

Details

Summary

During the replacement of GroupManagerActor to GroupManagerImpl, the following behavior was introduced:
When taking over leadership, a marathon instance will not load current root group from zookeeper, instead the old (probably outdated) root group will used to operate. This leads wrong behavior, because this root group is not in sync with the current zookeeper state.

This fix introduced a refresh cache implementation which is called during starting after taking leadership.

Test Plan

sbt integration:test

Diff Detail

Repository
rMARATHON marathon
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
Changes from before your most recent comment are hidden. Show Older Changes
unterstein added inline comments.Jun 12 2017, 6:14 PM
src/main/scala/mesosphere/marathon/MarathonSchedulerService.scala
199

We do have two. The first one is quite fast which is refreshGroupCache with an infinite time out. The second one could probably mutates to a long calculation during do leadership preparation. You are able to define a dedicated timeout for the latter, therefore I would not combine those two to one async/await.

src/main/scala/mesosphere/marathon/core/group/GroupManager.scala
153

sure =)

src/main/scala/mesosphere/marathon/core/group/impl/GroupManagerImpl.scala
202

Unfortunately yes :(

src/test/scala/mesosphere/marathon/integration/LeaderIntegrationTest.scala
183

ok

216

clone and owned this from the other few tests in this file which are doing this the same way ¯\_(ツ)_/¯

This revision was automatically updated to reflect the committed changes.
jdef added a subscriber: jdef.Jun 14 2017, 3:31 AM
jdef added inline comments.
src/main/scala/mesosphere/marathon/storage/repository/GroupRepositoryImpl.scala
246

I think there's an implicit assumption here that refreshGroupCache will never be invoked at the same time as any other func in this class. If that's true then please clearly document this fact. If that's not the long term intent, then please protect rootFuture here with a lock like everywhere else in this class.

@unterstein can you explain more the motivation for refreshGroupCache? It seems like this method is only called when we are elected leader?

Could this same thing be expressed by making some promise in rootGroup, and then postponing all requests to load state until after this promise is completed? This way the order of operations dependencies are explicit?

I am not as familiar with this code and did not dive in deeply, so, I could be barking up the wrong tree :)