HomeMesosphereNo notifications. 4 unresolved issues.

Fix caching issues after performing migration or restoring a backup IntegrationTest current in progress
ClosedAll Users

Authored by unterstein on Jun 28 2017, 11:00 PM.

Details

Summary

GroupManager and GroupRepository are holding in memory caches of the root group. The cache is loaded when it is accessed the first time.
Actually this is really bad, because each marathon will log the amount of groups during startup through Kamon.
Therefore the root group state is loaded from zk when the marathon instance is started.
When the marathon instance is elected as leader, this cache is still in the same state as the time marathon started.
Therefore we need to re-load the root group from zk again from zookeeper when becoming leader.
The same is true after doing the migration. A migration or a restore also affects the state of zookeeper, but does not
update the internal hold caches. Therefore we need to refresh the internally loaded caches after the migration.
Actually we need to do the fresh twice, before the migration, to perform the migration on the current zk state and after
the migration to have marathon loaded the current valid state to the internal caches.

Test Plan

sbt integration:test-only *BackupRestoreIntegrationTest

Diff Detail

Repository
rMARATHON marathon
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
Changes from before your most recent comment are hidden. Show Older Changes
jeschkies accepted this revision.Jun 30 2017, 1:13 PM

Cool! Could you run it 100 times in the loop and report the stats here?

src/test/scala/mesosphere/marathon/integration/LeaderIntegrationTest.scala
295

Isn't 50ms a little too short?

310

Why is this going?

Ping me after you have test results for at least 100 runs on your loop and I'll accept ;) Otherwise lgtm!

src/test/scala/mesosphere/marathon/integration/LeaderIntegrationTest.scala
295

It's sec but I would still make it smth. like 10000

343–344

I like that you try to cover all the edge cases. But this test has too many "moving parts" and will be flaky. We should see the jenkins loop statistics for you branch for at least 100 runs before we land it.

That being said, I'm missing old GroupManagerActor. It had a clear life span, didn't need custom locking/synchronization and didn't need the cache invalidation. Why did we went away from the actor implementation? What was the problem if there was any?

unterstein added inline comments.Jun 30 2017, 3:35 PM
src/test/scala/mesosphere/marathon/integration/LeaderIntegrationTest.scala
310

this was unnecessarily introduced in the last review, because afterwards there were some additional checks which we don`t need here

unterstein updated this revision to Diff 3691.Jun 30 2017, 3:48 PM

REBASED

  • Added integration test for backup/restore
  • added more cases to the integration test
  • added delete app
  • renamed method

Updating D895: Fix caching issues after performing migration or restoring a backup

IntegrationTest current in progress

jenkins requested changes to this revision.Jun 30 2017, 4:13 PM
This revision now requires changes to proceed.Jun 30 2017, 4:13 PM
✗ Build of 3691 failed jenkins-public-marathon-phabricator-409.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

unterstein updated this revision to Diff 3693.Jun 30 2017, 5:07 PM
  • timeouts

Updating D895: Fix caching issues after performing migration or restoring a backup

IntegrationTest current in progress

jenkins requested changes to this revision.Jun 30 2017, 5:12 PM
This revision now requires changes to proceed.Jun 30 2017, 5:12 PM
✗ Build of 3691 failed jenkins-public-marathon-phabricator-413.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

unterstein updated this revision to Diff 3698.Jul 3 2017, 10:28 AM
  • more sleep, just for leksi

Updating D895: Fix caching issues after performing migration or restoring a backup

IntegrationTest current in progress

jenkins requested changes to this revision.Jul 3 2017, 10:28 AM
This revision now requires changes to proceed.Jul 3 2017, 10:28 AM
zen-dog accepted this revision.Jul 3 2017, 10:28 AM

stats:

1 mesosphere.marathon.core.task.tracker.impl.InstanceOpProcessorImplTest:InstanceOpProcessorImpl should process update with failing taskRepository.store and load also fails
1 mesosphere.marathon.integration.AppDeployIntegrationTest:AppDeploy should backoff delays are reset on configuration changes
1 mesosphere.marathon.integration.AppDeployIntegrationTest:AppDeploy should create a simple app with a Marathon HTTP health check using port instead of portIndex
1 mesosphere.marathon.integration.AppDeployIntegrationTest:AppDeploy should rollback a deployment
1 mesosphere.marathon.integration.BackupRestoreIntegrationTest:Abdicating a leader should keep all running apps alive
1 mesosphere.marathon.integration.ForwardToLeaderIntegrationTest:ForwardingToLeader should forwarding ping
1 mesosphere.marathon.integration.GroupDeployIntegrationTest:GroupDeployment should A group with a running deployment can not be deleted without force
1 mesosphere.marathon.integration.GroupDeployIntegrationTest:GroupDeployment should Groups with dependencies get deployed in the correct order
1 mesosphere.marathon.integration.GroupDeployIntegrationTest:GroupDeployment should update a group with the same application so no restart is triggered
1 mesosphere.marathon.integration.KeepAppsRunningDuringAbdicationIntegrationTest:Abdicating a leader should keep all running apps alive
1 mesosphere.marathon.integration.ReelectionLeaderIntegrationTest:Reelecting a leader should it survives a small reelection test
1 mesosphere.marathon.metrics.MetricsTimerTest:Metrics Timers should measure a failed source
1 mesosphere.marathon.metrics.MetricsTimerTest:Metrics Timers should measure a successful source
2 mesosphere.marathon.integration.GroupDeployIntegrationTest:GroupDeployment should An upgrade in progress cannot be interrupted without force
2 mesosphere.marathon.integration.ResidentTaskIntegrationTest:ResidentTaskIntegrationTest should Restart
2 mesosphere.marathon.integration.RestartIntegrationTest:Restarting Marathon when health checks should deployment with 2 unhealthy instances is continued properly after master abdication
3 mesosphere.marathon.integration.AppDeployIntegrationTest:AppDeploy should create and deploy an app with two tasks
3 mesosphere.marathon.integration.ResidentTaskIntegrationTest:ResidentTaskIntegrationTest should resident task is launched completely on reserved resources
3 mesosphere.marathon.integration.RestartIntegrationTest:Restarting Marathon when not kill a running task currently involved in a deployment
3 mesosphere.marathon.integration.TaskUnreachableIntegrationTest:TaskUnreachable should A task unreachable update will trigger a replacement task
5 mesosphere.marathon.integration.RestartIntegrationTest:Restarting Marathon when readiness should deployment with 1 ready and 1 not ready instance is continued properly after a restart
✗ Build of 3698 failed jenkins-public-marathon-phabricator-416.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

unterstein updated this revision to Diff 3699.Jul 3 2017, 10:51 AM
  • Added integration test for backup/restore
  • added more cases to the integration test
  • added delete app
  • renamed method
  • timeouts
  • more sleep, just for leksi

Updating D895: Fix caching issues after performing migration or restoring a backup

IntegrationTest current in progress

jenkins accepted this revision.Jul 3 2017, 10:57 AM
This revision is now accepted and ready to land.Jul 3 2017, 10:57 AM
✔ Build of 3698 completed jenkins-public-marathon-phabricator-417.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://downloads.mesosphere.io/marathon/snapshots/marathon-1.5.0-SNAPSHOT-629-g1e2b7ef.tgz",
"sha1": "0f92ee3c2b6df83309cf58a1bfb9f90e2edf4b78"

\\ ٩( ᐛ )و //

jenkins requested changes to this revision.Jul 3 2017, 10:57 AM
This revision now requires changes to proceed.Jul 3 2017, 10:57 AM
jenkins accepted this revision.Jul 3 2017, 11:13 AM
This revision is now accepted and ready to land.Jul 3 2017, 11:13 AM
✔ Build of 3699 completed jenkins-public-marathon-phabricator-418.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://downloads.mesosphere.io/marathon/snapshots/marathon-1.5.0-SNAPSHOT-632-gd441faa.tgz",
"sha1": "f46d42213f27d7f604f9481d1833f15cfae550ca"

\\ ٩( ᐛ )و //

jeschkies accepted this revision.Jul 3 2017, 11:38 AM
This revision was automatically updated to reflect the committed changes.