GroupManager and GroupRepository hold in-memory caches of the root group. The cache is loaded lazily, on first access.
Unfortunately the cache ends up being populated at startup anyway, because every Marathon instance logs the number of groups during startup through Kamon.
Therefore the root group state is loaded from ZooKeeper as soon as the Marathon instance starts.
When the instance is later elected leader, this cache still reflects the state from the time Marathon started.
Therefore we need to reload the root group from ZooKeeper when becoming leader.
The same is true for migration. A migration or a restore also changes the state in ZooKeeper but does not
update the internally held caches. Therefore we need to refresh those caches after the migration.
In fact we need to refresh twice: once before the migration, so that the migration operates on the current ZooKeeper state, and once
after the migration, so that Marathon loads the now valid state into its internal caches.
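A minimal sketch of the refresh pattern described above, using hypothetical names (`refreshGroupCache`, `loadRootGroupFromZk`) and simplified types rather than the real Marathon API:

```scala
import scala.concurrent.{ExecutionContext, Future}

// Simplified stand-ins for the real Marathon types; names are illustrative only.
final case class RootGroup(apps: Map[String, String] = Map.empty)

trait GroupRepository {
  // Hypothetical: reads the persisted root group from ZooKeeper.
  def loadRootGroupFromZk(): Future[RootGroup]
}

class GroupManager(repo: GroupRepository)(implicit ec: ExecutionContext) {
  // In-memory cache of the root group, populated lazily on first access.
  @volatile private var cachedRoot: Option[RootGroup] = None

  def rootGroup(): Future[RootGroup] = cachedRoot match {
    case Some(root) => Future.successful(root)
    case None       => refreshGroupCache()
  }

  // Drop the cached state and re-read it from ZooKeeper.
  // Called when this instance becomes leader, and before and after a migration/restore.
  def refreshGroupCache(): Future[RootGroup] =
    repo.loadRootGroupFromZk().map { root =>
      cachedRoot = Some(root)
      root
    }
}
```

With such a helper, the leader-election and migration code paths would call `refreshGroupCache()` before running the migration (so it operates on the real ZooKeeper state) and again afterwards (so the cache reflects the migrated state).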
Details
- Reviewers
  jeschkies timcharper zen-dog jenkins
- Commits
  rMARATHON55ff0437dfff: Fix caching issues after performing migration or restoring a backup
  rMARATHONe7f002431fd5: added delete app
  rMARATHON546cc3dadee9: Fix caching issues after performing migration or restoring a backup
  rMARATHON5a9af8b828eb: Fix caching issues after performing migration or restoring a backup
  rMARATHON7a637868b19f: renamed method
  rMARATHONfcaaf72010f6: Fix caching issues after performing migration or restoring a backup
  rMARATHON7490803049ef: more sleep, just for leksi
  rMARATHON92a39586624c: Fix caching issues after performing migration or restoring a backup
  rMARATHONae809f906f32: Added integration test for backup/restore
  rMARATHON5ed03999d654: added more cases to the integration test
- JIRA Issues
  MARATHON-7565 Marathon restore restores apps not in a backup
- Test Plan
  sbt integration:test-only *BackupRestoreIntegrationTest
Diff Detail
- Repository
  rMARATHON marathon
- Lint
  Automatic diff as part of commit; lint not applicable.
- Unit
  Automatic diff as part of commit; unit tests not applicable.
Ping me after you have test results for at least 100 runs on your loop and I'll accept ;) Otherwise lgtm!
src/test/scala/mesosphere/marathon/integration/LeaderIntegrationTest.scala | |
---|---|
295 | It's in seconds, but I would still make it something like 10000. |
343–344 | I like that you try to cover all the edge cases, but this test has too many "moving parts" and will be flaky. We should see the Jenkins loop statistics for your branch for at least 100 runs before we land it. |
That being said, I miss the old GroupManagerActor. It had a clear life span, didn't need custom locking/synchronization, and didn't need cache invalidation. Why did we move away from the actor implementation? What was the problem, if there was any?
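For context, here is a minimal sketch of the actor pattern that comment refers to, with purely illustrative message and class names (not the real GroupManagerActor protocol): because an actor processes one message at a time, the cached root group can be plain mutable state with no explicit locking or cache-invalidation hooks.

```scala
import akka.actor.{Actor, ActorSystem, Props}

// Illustrative messages and state; names are hypothetical, not Marathon's actual protocol.
final case class RootGroup(apps: Map[String, String] = Map.empty)
case object GetRootGroup
final case class PutRootGroup(root: RootGroup)

class GroupCacheActor extends Actor {
  // Plain mutable state is safe: the actor handles one message at a time,
  // so reads and writes are serialized without locks.
  private var root: RootGroup = RootGroup()

  override def receive: Receive = {
    case GetRootGroup    => sender() ! root
    case PutRootGroup(r) => root = r
  }
}

object GroupCacheExample extends App {
  val system = ActorSystem("example")
  val cache  = system.actorOf(Props(new GroupCacheActor), "group-cache")
  cache ! PutRootGroup(RootGroup(Map("/my-app" -> "v1")))
  // In real code a reply would be requested with `ask`; this sketch only shows the state handling.
}
```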
src/test/scala/mesosphere/marathon/integration/LeaderIntegrationTest.scala | |
---|---|
310 | This was unnecessarily introduced in the last review, because afterwards there were some additional checks which we don't need here. |
REBASED
- Added integration test for backup/restore
- added more cases to the integration test
- added delete app
- renamed method
Updating D895: Fix caching issues after performing migration or restoring a backup
IntegrationTest current in progress
Error message:
Stage Compile and Test failed.
(๑′°︿°๑)
- timeouts
Updating D895: Fix caching issues after performing migration or restoring a backup
IntegrationTest current in progress
Error message:
Stage Compile and Test failed.
(๑′°︿°๑)
- more sleep, just for leksi
Updating D895: Fix caching issues after performing migration or restoring a backup
IntegrationTest current in progress
stats:
1 mesosphere.marathon.core.task.tracker.impl.InstanceOpProcessorImplTest:InstanceOpProcessorImpl should process update with failing taskRepository.store and load also fails
1 mesosphere.marathon.integration.AppDeployIntegrationTest:AppDeploy should backoff delays are reset on configuration changes
1 mesosphere.marathon.integration.AppDeployIntegrationTest:AppDeploy should create a simple app with a Marathon HTTP health check using port instead of portIndex
1 mesosphere.marathon.integration.AppDeployIntegrationTest:AppDeploy should rollback a deployment
1 mesosphere.marathon.integration.BackupRestoreIntegrationTest:Abdicating a leader should keep all running apps alive
1 mesosphere.marathon.integration.ForwardToLeaderIntegrationTest:ForwardingToLeader should forwarding ping
1 mesosphere.marathon.integration.GroupDeployIntegrationTest:GroupDeployment should A group with a running deployment can not be deleted without force
1 mesosphere.marathon.integration.GroupDeployIntegrationTest:GroupDeployment should Groups with dependencies get deployed in the correct order
1 mesosphere.marathon.integration.GroupDeployIntegrationTest:GroupDeployment should update a group with the same application so no restart is triggered
1 mesosphere.marathon.integration.KeepAppsRunningDuringAbdicationIntegrationTest:Abdicating a leader should keep all running apps alive
1 mesosphere.marathon.integration.ReelectionLeaderIntegrationTest:Reelecting a leader should it survives a small reelection test
1 mesosphere.marathon.metrics.MetricsTimerTest:Metrics Timers should measure a failed source
1 mesosphere.marathon.metrics.MetricsTimerTest:Metrics Timers should measure a successful source
2 mesosphere.marathon.integration.GroupDeployIntegrationTest:GroupDeployment should An upgrade in progress cannot be interrupted without force
2 mesosphere.marathon.integration.ResidentTaskIntegrationTest:ResidentTaskIntegrationTest should Restart
2 mesosphere.marathon.integration.RestartIntegrationTest:Restarting Marathon when health checks should deployment with 2 unhealthy instances is continued properly after master abdication
3 mesosphere.marathon.integration.AppDeployIntegrationTest:AppDeploy should create and deploy an app with two tasks
3 mesosphere.marathon.integration.ResidentTaskIntegrationTest:ResidentTaskIntegrationTest should resident task is launched completely on reserved resources
3 mesosphere.marathon.integration.RestartIntegrationTest:Restarting Marathon when not kill a running task currently involved in a deployment
3 mesosphere.marathon.integration.TaskUnreachableIntegrationTest:TaskUnreachable should A task unreachable update will trigger a replacement task
5 mesosphere.marathon.integration.RestartIntegrationTest:Restarting Marathon when readiness should deployment with 1 ready and 1 not ready instance is continued properly after a restart
Error message:
Stage Compile and Test failed.
(๑′°︿°๑)
- Added integration test for backup/restore
- added more cases to the integration test
- added delete app
- renamed method
- timeouts
- more sleep, just for leksi
Updating D895: Fix caching issues after performing migration or restoring a backup
IntegrationTest current in progress
You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:
"url": "https://downloads.mesosphere.io/marathon/snapshots/marathon-1.5.0-SNAPSHOT-629-g1e2b7ef.tgz", "sha1": "0f92ee3c2b6df83309cf58a1bfb9f90e2edf4b78"
\\ ٩( ᐛ )و //
You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:
"url": "https://downloads.mesosphere.io/marathon/snapshots/marathon-1.5.0-SNAPSHOT-632-gd441faa.tgz", "sha1": "f46d42213f27d7f604f9481d1833f15cfae550ca"
Is there a separate fix for this? How can we avoid paying this penalty?