Force delete pod deployment in test.
Abandoned, All Users

Authored by jeschkies on Jun 6 2017, 4:44 PM.

Details

Summary

MesosAppIntegrationTest.MesosApp should deploy a simple pod with health checks
sometimes fails with "409 was not equal to 202" which means the delete
has a conflict. To avoid this error we just force delete.

I did not verify why there is another deployment running. I assume that
there is a slight race condition: We receive the deployment finished
event but the deployment is not completely removed yet.
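The assumed race can be illustrated with a toy model. This is only a sketch, not Marathon's code: `ToyPodService`, `finishDeployment`, and `releaseLock` are hypothetical names. The only things taken from the summary are the status codes 409 (conflict) and 202 (accepted) and the idea that a forced delete sidesteps the conflict.

```scala
// Toy model of the suspected race. All names are hypothetical, not Marathon's API.
final class ToyPodService {
  private var lockHeld = false   // deployment lock for the pod
  private var eventSent = false  // "deployment finished" event was published

  def startDeployment(): Unit = { lockHeld = true; eventSent = false }

  // Step 1: the event is published while the lock is still held.
  def finishDeployment(): Unit = { eventSent = true }

  // Step 2: the lock is released slightly later.
  def releaseLock(): Unit = { lockHeld = false }

  def deploymentFinished: Boolean = eventSent

  // Returns an HTTP-like status: 409 on conflict, 202 when the delete is accepted.
  def deletePod(force: Boolean): Int =
    if (lockHeld && !force) 409
    else { lockHeld = false; 202 }
}

object ForceDeleteDemo {
  def main(args: Array[String]): Unit = {
    val service = new ToyPodService
    service.startDeployment()
    service.finishDeployment()                // the test sees the event here ...
    println(service.deletePod(force = false)) // ... but the lock is still held: 409
    println(service.deletePod(force = true))  // a forced delete goes through: 202
  }
}
```

In this model a plain delete also succeeds once `releaseLock()` has run, which is why the failure would only show up when the delete lands inside the short window between the two steps.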

Test Plan

pipeline

Diff Detail

Repository
rMARATHON marathon
Branch
karsten/409-202-error
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 2863
Build 5433: Marathon (revised), Jenkins
Build 5432: arc lint + arc unit
In D833#32946, @zen-dog wrote:

eventually is also just masking the problem. If the deployment is successful (and it is with waitForDeployment(updateResult)) then you should be able to delete the pod with one call. Everything else is a bug.

It probably is. However, it will be super hard to find, since this happens roughly 2% of the time. There is no quick fix other than this.

I gather the following:

  • There is a state consistency issue in which we release the lock for a pod after the deployment affecting said pod is complete
  • We don't have a good way to fix it right now, and it is unlikely to be affecting customers (deleting a pod right after deployment is probably not a common thing)

Have we already logged a JIRA describing this behavior? It seems like it'd be worth adding a nice loud comment in the test saying "hey there's weird behavior here that we are working around" and referencing that JIRA issue.

Also, @jeschkies - is it possible to make it retry only in the case of a 409? It seems like right now it will retry on any failure.
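A minimal sketch of what "retry only on 409" could look like, assuming the delete call can be wrapped as a function that returns the HTTP status code. `retryOnConflict` and the fake response sequence are hypothetical and not part of the Marathon test suite.

```scala
import scala.annotation.tailrec

object RetryOnConflict {
  /** Calls `request` until it returns a status other than 409 or the attempts
    * run out. Any other status, success or failure, is returned immediately. */
  @tailrec
  def retryOnConflict(maxAttempts: Int)(request: () => Int): Int = {
    val status = request()
    if (status == 409 && maxAttempts > 1) retryOnConflict(maxAttempts - 1)(request)
    else status
  }

  def main(args: Array[String]): Unit = {
    // Fake delete call (assumption): conflicts twice, then is accepted.
    val responses = Iterator(409, 409, 202)
    println(retryOnConflict(maxAttempts = 5)(() => responses.next())) // 202
  }
}
```

The sketch retries immediately. A real test would presumably wait between attempts, and a 500 or other unexpected status would surface on the first call instead of being masked by the retry loop.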

jeschkies abandoned this revision. Jun 28 2017, 3:27 PM

Last 200 runs

 1 "24 was not less than 21"
 1 "24 was not less than 24"
 1 "33 was not less than 31"
 1 "4159 was not less than 4000 the task kill event took longer than the task kill grace period"
 1 "4195 was not less than 4000 the task kill event took longer than the task kill grace period"
 1 "List(ITEnrichedTask(/app-8a693f0c-bcac-429c-ba00-9397df447141,app-8a693f0c-bcac-429c-ba00-9397df447141.3efa83d3-58fb-11e7-9e64-02426214f604,172.16.10.193,Some(List(31514)),Some(2cffc1ac-30fb-4857-961a-e045737b13c9-S0),Some(Sat Jun 24 00:00:00 UTC 2017),Some(Sat Jun 24 00:00:00 UTC 2017),TASK_RUNNING,Some(2017-06-24T16:36:17.385Z))) was not equal to List(ITEnrichedTask(/app-8a693f0c-bcac-429c-ba00-9397df447141,app-8a693f0c-bcac-429c-ba00-9397df447141.406ba004-58fb-11e7-9e64-02426214f604,172.16.10.193,Some(List(31222)),Some(2cffc1ac-30fb-4857-961a-e045737b13c9-S0),Some(Sat Jun 24 00:00:00 UTC 2017),Some(Sat Jun 24 00:00:00 UTC 2017),TASK_RUNNING,Some(2017-06-24T16:36:17.385Z)))"
 1 "List(ITEnrichedTask(/app-c5d6c956-044b-42c3-b014-47649d9acf65,app-c5d6c956-044b-42c3-b014-47649d9acf65.f8b48620-588e-11e7-8371-024285222b1e,172.16.10.183,Some(List(31326)),Some(d72d8f60-8186-4eaa-af24-ed8a5cd82f9d-S0),Some(Sat Jun 24 00:00:00 UTC 2017),Some(Sat Jun 24 00:00:00 UTC 2017),TASK_RUNNING,Some(2017-06-24T03:41:13.814Z))) was not equal to List(ITEnrichedTask(/app-c5d6c956-044b-42c3-b014-47649d9acf65,app-c5d6c956-044b-42c3-b014-47649d9acf65.f9cde561-588e-11e7-8371-024285222b1e,172.16.10.183,Some(List(31963)),Some(d72d8f60-8186-4eaa-af24-ed8a5cd82f9d-S0),Some(Sat Jun 24 00:00:00 UTC 2017),Some(Sat Jun 24 00:00:00 UTC 2017),TASK_RUNNING,Some(2017-06-24T03:41:13.814Z)))"
 1 "List(ITEnrichedTask(/restart-dont-kill,restart-dont-kill.1a0b7415-5847-11e7-8439-02426214f604,172.16.10.193,Some(List(31833)),Some(9f4f66bd-4e57-45cb-aa0c-4048f9a766b6-S0),Some(Fri Jun 23 00:00:00 UTC 2017),Some(Fri Jun 23 00:00:00 UTC 2017),TASK_RUNNING,Some(2017-06-23T19:06:43.071Z))) was not equal to List(ITEnrichedTask(/restart-dont-kill,restart-dont-kill.2fcb149a-5847-11e7-990b-02426214f604,172.16.10.193,Some(List(31295)),Some(9f4f66bd-4e57-45cb-aa0c-4048f9a766b6-S0),Some(Fri Jun 23 00:00:00 UTC 2017),Some(Fri Jun 23 00:00:00 UTC 2017),TASK_RUNNING,Some(2017-06-23T19:06:43.071Z))) Tasks before (List(ITEnrichedTask(/restart-dont-kill,restart-dont-kill.1a0b7415-5847-11e7-8439-02426214f604,172.16.10.193,Some(List(31833)),Some(9f4f66bd-4e57-45cb-aa0c-4048f9a766b6-S0),Some(Fri Jun 23 00:00:00 UTC 2017),Some(Fri Jun 23 00:00:00 UTC 2017),TASK_RUNNING,Some(2017-06-23T19:06:43.071Z)))) and after (List(ITEnrichedTask(/restart-dont-kill,restart-dont-kill.2fcb149a-5847-11e7-990b-02426214f604,172.16.10.193,Some(List(31295)),Some(9f4f66bd-4e57-45cb-aa0c-4048f9a766b6-S0),Some(Fri Jun 23 00:00:00 UTC 2017),Some(Fri Jun 23 00:00:00 UTC 2017),TASK_RUNNING,Some(2017-06-23T19:06:43.071Z)))) abdication are different"
 1 "List(\"app-1.c80b3626-58a7-11e7-9ea9-02426214f604\", \"app-1.d067414e-58a7-11e7-929f-02426214f604\") did not equal List(\"app-1.c80b3626-58a7-11e7-9ea9-02426214f604\", \"app-1.c80fa2f7-58a7-11e7-9ea9-02426214f604\")"
 1 "List(\"app-1.e72b4ade-5996-11e7-9e8c-02426214f604\", \"app-1.efc0dd3c-5996-11e7-b185-02426214f604\") did not equal List(\"app-1.e72a396d-5996-11e7-9e8c-02426214f604\", \"app-1.e72b4ade-5996-11e7-9e8c-02426214f604\")"
 1 "No events matched <Task is declared unreachable>"
 1 "The future returned an exception of type: akka.stream.BindFailedException$, with message: bind failed."
 1 "The test did not complete within the specified 10 seconds time limit."
 1 "\"Pong /[group-8/app-7]\" was not equal to \"Pong /[app-732b33e7-ce58-46ae-a5f9-e68cd5af7f64]\""
 1 "\"Pong /[regression]\" was not equal to \"Pong /[app-ddd39eae-36ee-4868-a0ac-b19f2e2315eb]\""
 1 "\"Pong /[restart-dont-kill]\" was not equal to \"Pong /[app-bb0f2d84-d7cf-4cf3-ac8b-ab2831b73247]\""
 1 "\"Pong /app-[0]\" was not equal to \"Pong /app-[28670201-6b4e-41d5-a0cd-00c0b24bfe1e]\""
 1 "\"Pong /app-[0]\" was not equal to \"Pong /app-[bdf9bd36-bdac-4b44-9c1c-aef06b91d4b4]\""
 1 "\"Pong /app-[1]\" was not equal to \"Pong /app-[b632ba5a-2a3c-4cc2-889d-e38f55ca2960]\""
 1 "deployment visible not valid for 5 seconds. Give up."
 1 "reserved_resources Some({cpus: 0.002, disk: 6.0, gpus: 0.0, mem: 2.0, ports: \"[31735-31735]\" }) did not equal Some({cpus: 0.001, disk: 3.0, gpus: 0.0, mem: 1.0 })"
 1 "used_resources {cpus: 0.002, disk: 6.0, gpus: 0.0, mem: 2.0, ports: \"[31443-31443]\" } did not equal {cpus: 0.001, disk: 3.0, gpus: 0.0, mem: 1.0 }"
 1 "used_resources {cpus: 0.002, disk: 6.0, gpus: 0.0, mem: 2.0, ports: \"[31785-31785]\" } did not equal {cpus: 0.001, disk: 3.0, gpus: 0.0, mem: 1.0 }"
 2 "Tcp command [Connect(localhost:11718,None,List(),Some(10 seconds),true)] failed"
 2 "The test did not complete within the specified 300 seconds time limit."
 3 "The test did not complete within the specified 150 milliseconds time limit."
 4 "No events matched <Replacement task is killed>"
 4 "The future returned an exception of type: mesosphere.marathon.util.TimeoutException, with message: Timed out after 30 seconds."
 9 "Response code was not 201 but 503 with body '{\"message\":\"Futures timed out after [2000 milliseconds]\"}'"
11 "0 was not equal to 1"
21 "No events matched <event deployment_success to arrive>"

The 409 error is gone.