Simplify leader election code
Closed · All Users

Authored by ichernetsky on Apr 26 2017, 6:34 PM.

Details

Reviewers

timcharper
meichstedt
jeschkies
zen-dog
kensipe
unterstein
jdef
jenkins

Commits

rMARATHONc85639f12298: Simplify leader election code
rMARATHON8cd913e203f4: Update a few comments per Aleksey's request
rMARATHON0ebe9270cd0f: Simplify leader election code
rMARATHON6b2c45b12adb: Simplify leader election code
rMARATHONa67a7e44339d: Simplify leader election code

JIRA Issues

JIRA MARATHON-4083 Provide and use crashing strategy interface
JIRA MARATHON-2008 Simplify leadership election code

Summary

"Simplify" here means two things: 1) reduce the number of states and the transitions between them, and 2) fail fast instead of handling failures with complex state-machine logic.

Test Plan

Manual testing of various failure scenarios + running all kinds of automated tests

Diff Detail

Repository

rMARATHON marathon

Lint

Automatic diff as part of commit; lint not applicable.

Unit

Automatic diff as part of commit; unit tests not applicable.

There are a very large number of changes, so older changes are hidden.

@zen-dog I think we might have some further simplification potential here; what would you think if we continued to boil this down to its essence before embarking on an effort to fix it? @ichernetsky and I are going to talk on Tuesday

Also, what do you think if we land the patch as is? There is substantial improvement already and the history for this one is getting quite long.

(oh, except that our test coverage went down significantly :) we should address this)

I support merging this as is, with any additional enhancements coming in another PR.
Long review and lgtm

@zen-dog, I added a few words to ElectionService's Scaladoc comment. And since we don't have an (explicit) FSM anymore, I think there is not much to add at this point.

@timcharper, I've restored CuratorElectionServiceTest. Previously most of the state transitions/method invocations were tested using ElectionServiceBase. Now a real ZK is required to do so, which means we need to have integration tests to achieve that, I believe. Partially, the test coverage went down because of that. The other reason is the total number of state transitions got reduced.

Restore CuratorElectionServiceTest
Expand ElectionService's Scaladoc comment

jenkins requested changes to this revision. Jun 27 2017, 8:20 PM

This revision now requires changes to proceed. Jun 27 2017, 8:20 PM

✗ Build of 3641 failed jenkins-public-marathon-phabricator-362.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

Harbormaster failed to build B2997: Diff 3641! Jun 27 2017, 8:20 PM

Could you please go through https://jira.mesosphere.com/browse/MARATHON_EE-1542 and verify that that condition is not possible after your fix? I know writing a test would probably be very hard so a "thought experiment" would suffice. Thx.

@zen-dog, I will do it, but is it somehow connected to the fact that you kept "reject" on this patch?

Checking for that case is wise. I haven't looked into it too deeply and am not currently sure if there is a way we can make more obvious the case mentioned by Aleksey, or what's been explored in that area. But having an explanation for why it does not affect the post-refactor here is good. Pending that, I approve! Great work!

I have to say that the logic looks much cleaner and I like the fact that it's flat now. I have a couple of questions/nits/better docs request but nothing that would justify a reject.
However before you merge please:

create a jenkins loop (copy of the existing ones, edit and put your branch) and let it run for at least 24h - I'd like to make sure we're not adding to existing flakiness

after the merge we should soak test this:

talk to Armand/Viktor so we can soak-test a corresponding snapshot (will probably require merging it to dcos-ee)
you should also talk to our chaos-engineering team so they run the chaos-tests with this snapshot - last bug in re-election code was found by them

Thx again for the great patch!

src/main/scala/mesosphere/marathon/MarathonSchedulerService.scala
235	I don't think this comment still makes sense since we removed the extra abdication parameters.
src/main/scala/mesosphere/marathon/core/election/ElectionService.scala
21	This is a good start for a class doc :thumbup: However I would be also interested in the complete chain of events, e.g.: -> where do ZK/CuratorService events come in? -> what happens after we're elected? What events are sent through the system? -> what parts of the chain happen synchronously vs. asynchronously and can happen in parallel to other events? I know that "the code is self-explanatory" and "why don't you just read the code" but having the full chain of events, concurrency model and guarantees explained goes a long way. Sry if it feels like I'm a pain in the neck :)
src/main/scala/mesosphere/marathon/core/election/impl/CuratorElectionService.scala
106	So if we were offered leadership twice in a short period of time, neither will succeed and marathon will exit? Can't this lead to a scenario where marathon gets too many leadership offers but is not able to act on them successfully?

zen-dog accepted this revision. Jul 4 2017, 4:14 PM

Regarding a Jenkins loop job, soaking it and probably getting in touch with the Chaos engineering team — it all makes sense and I will do it. Thanks for the review.

src/main/scala/mesosphere/marathon/MarathonSchedulerService.scala
235	I think it still makes sense because it is more about leadership abdication, not the method's parameters. It is more about the fact that `driver.foreach` is used, because it can be set to `None` in `stopDriver()` before this line gets to execute. What do you think?
src/main/scala/mesosphere/marathon/core/election/ElectionService.scala
21	`where do ZK/CuratorService events come in?` — I don't get what you mean here. `what happens after we're elected?` — it is already documented in both `ElectionService` and `ElectionCandidate`. `what events are sent through the system?` — done. `what parts of the chain happens synchronously vs. asynchronously and can happen parallel to other events?` — All methods are synchronous which should be expected by default because no future is returned and no callback is accepted in any method.
src/main/scala/mesosphere/marathon/core/election/impl/CuratorElectionService.scala
106	If leadership is offered more than once, then it is a (I would say, severe) bug in the code. As of now, only `MarathonSchedulerService` invokes this method, and surely it is supposed to do so only once during start-up. Therefore, stopping is a proper action here, I deem.
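
The fail-fast handling of a repeated leadership offer discussed for line 106 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this patch: `Crash`, `RecordingCrash`, and `OneShotOffer` are hypothetical names, and the recording strategy merely stands in for whatever actually stops the process.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical stand-in for a crash strategy; the real interface is not shown in this review.
trait Crash {
  def crash(reason: String): Unit
}

// Records crash requests instead of stopping the JVM, so the sketch can be run and inspected.
final class RecordingCrash extends Crash {
  @volatile var reasons: Vector[String] = Vector.empty
  override def crash(reason: String): Unit = synchronized { reasons = reasons :+ reason }
}

final class OneShotOffer(crashStrategy: Crash) {
  // One-way flag: flips from false to true on the first offer and never back.
  private val leadershipOffered = new AtomicBoolean(false)

  /** Returns true for the first offer; a repeated offer is treated as a bug and reported. */
  def offerLeadership(): Boolean =
    if (leadershipOffered.compareAndSet(false, true)) true
    else {
      crashStrategy.crash("leadership was offered more than once")
      false
    }
}
```

The point of the design is that a second offer is not queued, retried, or turned into a new state; it is reported once and the process is expected to go down.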

Update a few comments per Aleksey's request

jenkins requested changes to this revision. Jul 6 2017, 8:48 PM

This revision now requires changes to proceed. Jul 6 2017, 8:48 PM

✗ Build of 3723 failed jenkins-public-marathon-phabricator-443.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

Harbormaster failed to build B3064: Diff 3723! Jul 6 2017, 9:03 PM

Make tests stable again on Amazon instances by getting rid of intercepting calls to shutdown JVM

jenkins requested changes to this revision. Jul 12 2017, 6:28 PM

This revision now requires changes to proceed. Jul 12 2017, 6:28 PM

✗ Build of 3763 failed jenkins-public-marathon-phabricator-487.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

Harbormaster failed to build B3100: Diff 3763! Jul 12 2017, 6:44 PM

Thanks for introducing crashing strategy; does it make sense to link the patch to the issue? Great work, Ivan!

src/main/scala/mesosphere/marathon/core/election/impl/ElectionServiceMetrics.scala
23	I commented in a ticket that reporting leader metrics for non-ha mode is misleading. But, it matches old behavior and I think this is great for now. Let's move forward.

timcharper accepted this revision. Jul 12 2017, 8:21 PM

Linked it to the Jira issue. Thanks for the quick review, Tim.

✗ Build of 3763 failed jenkins-public-marathon-phabricator-490.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

✗ Build of 3763 failed jenkins-public-marathon-phabricator-491.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

jenkins accepted this revision. Jul 12 2017, 10:17 PM

This revision is now accepted and ready to land. Jul 12 2017, 10:17 PM

✔ Build of 3763 completed jenkins-public-marathon-phabricator-492.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://downloads.mesosphere.io/marathon/snapshots/marathon-1.5.0-SNAPSHOT-643-gf8e47a6.tgz",
"sha1": "32a6184e86fb67c2956629e1508ff3dc8096e0a0"

＼\ ٩( ᐛ )و /／

src/main/scala/mesosphere/marathon/core/election/impl/PseudoElectionService.scala
25–26	Hm, it seems this class shares a lot of logic with `CuratorElectionService`. How are they different? Is there a way to share some logic?
src/test/scala/mesosphere/marathon/core/election/impl/PseudoElectionServiceTest.scala
33	It seems we are not testing `CuratorElectionService`. Is this impression correct?
63–64	Sweet!

src/main/scala/mesosphere/marathon/core/election/impl/PseudoElectionService.scala
25–26	I see a vicious circle right there. I feel like I can write a book about trade-offs when implementing leader election which relies on an external service like ZooKeeper. :) This is exactly what I was asked for from the very beginning: get rid of `ElectionServiceBase`. My first step was to simplify things and reduce the number of states and their transitions in `ElectionServiceBase`, which led to renaming it to `ElectionServiceFSM` because it started to look like a real FSM. It wasn't enough, and it led to this further simplification at the expense of having this duplicated code. It seems we are headed towards Akka Streams. This code duplication won't be there anymore after another round of refactoring of this code. On the other hand, if one imagines a two-level inheritance tree with one parent and multiple children as a horizontal thing as opposed to a vertical thing, where the parent makes calls into children and vice versa, it can be seen very well as message passing with compile-time checks. Long story short, it is what it is. The decision was made, and it is too late to change it drastically at this point. Though I do see that `CuratorElectionService` is harder to test now, because previously most of it was tested by testing `ElectionServiceBase`. There is code duplication, but we should keep this intermediate approach, move on, and have it refactored to leverage Akka Streams later, after 1.10 release.
src/test/scala/mesosphere/marathon/core/election/impl/PseudoElectionServiceTest.scala
33	It is correct. Please refer to my comment to your first one. Most of it was tested using `ElectionServiceBase/FSM` before, but since most of the code is not shared between `CuratorElectionService` and `PseudoElectionService`, it is gone. Now testing of it would require running ZK, which looks more like SI testing at least.

kensipe accepted this revision. Jul 17 2017, 8:14 PM

build.sbt
22	seems unrelated to leader election. why is this change here?
src/main/scala/mesosphere/marathon/MarathonSchedulerActor.scala
442	why is this linter rule needed?
src/main/scala/mesosphere/marathon/core/base/CrashStrategy.scala
10	why not a sealed trait here?
14	case object (if we use sealed trait above)?
src/main/scala/mesosphere/marathon/core/election/impl/CuratorElectionService.scala
52	not that it's related to this change (because this just looks like a rename) but what's the impact of using a separate execution context here, vs. the "global" one that has support for the injected Context stuff Jason wrote? https://github.com/mesosphere/marathon/blob/master/src/main/scala/mesosphere/marathon/core/async/ExecutionContexts.scala
55	`leadershipOffered` and `acquiringLeadership` function more like single-use latches. it might make sense to either (a) document that, or else (b) wrap with a helper type that only allows a single-use one-way transition:
    package mesosphere.marathon
    package core.async
    final class Latch {
      private[this] val counter = new Semaphore(1)
      def acquire: Option[Done] = // use of Option return type forces pattern matches to deal w/ all alternatives, likely highly desirable
        if(counter.tryAcquire()) Some(Done) else None
    }
I like the idea of limiting transitions via a `Latch` type because it enforces the design as intended
145	nit: extract constant
220–231	looks like the ACLs changed here. why is this part of this PR and not a separate changeset? seems significant enough?
233	nit: extract constants
src/main/scala/mesosphere/marathon/core/election/impl/ElectionServiceMetrics.scala
17	is it possible that `startMetrics` and `stopMetrics` could be called concurrently? if so, the atomic boolean guard won't help you avoid a data race. if not, then i'd like to see that assumption documented someplace
src/main/scala/mesosphere/marathon/core/election/impl/PseudoElectionService.scala
45	ditto re: `Latch` suggestion
86	nit: extract constant
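
The single-use latch suggested in the comment on line 55 of `CuratorElectionService.scala` could look roughly like this as a self-contained sketch. `Done` here is a local stand-in for `akka.Done` so the example has no dependencies, and the method is written as `acquire()` with parentheses because it has a side effect; both are assumptions of this sketch rather than part of the patch.

```scala
import java.util.concurrent.Semaphore

// Local stand-in for akka.Done, used only to keep the sketch dependency-free.
sealed trait Done
case object Done extends Done

/** A single-use, one-way latch: only the first caller of `acquire()` gets `Some(Done)`. */
final class Latch {
  // One permit that is never released, so the transition cannot be undone.
  private[this] val permit = new Semaphore(1)

  // Returning Option makes callers handle both the "first" and the "already taken" case.
  def acquire(): Option[Done] =
    if (permit.tryAcquire()) Some(Done) else None
}
```

A caller would pattern match on the result of `acquire()`, proceeding on `Some(Done)` and failing fast on `None`, which keeps the one-way transition explicit at every call site.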

build.sbt
22	This patch has been in review for 3 months, and I address compiler warnings every time I merge in `origin/master`.
src/main/scala/mesosphere/marathon/MarathonSchedulerActor.scala
442	[warn] /Users/ivanchernestsky/dev/marathon/src/main/scala/mesosphere/marathon/MarathonSchedulerActor.scala:442: [UseIfExpression] Assign the result of the if expression to variable ifres$macro$26 directly. [warn] if (knownTaskStatuses.nonEmpty) driver.reconcileTasks(knownTaskStatuses.asJavaCollection)
src/main/scala/mesosphere/marathon/core/election/impl/CuratorElectionService.scala
52	As far as I understand, it was done to let Curator be blocked in a separate thread, not in the one of https://github.com/mesosphere/marathon/blob/master/src/main/scala/mesosphere/marathon/core/async/ExecutionContexts.scala#L50
55	I don't think it'd be better to replace those two variables with one `Latch` (if you meant this) because they are used in two different places. It is correct that both of them are one-way (from `false` to `true`). If you meant replacing them with two `Latch`es, I think it is better not to do it at this point (frankly, it is quite late), because it will have to be fully re-tested.
220–231	I noticed that the ACLs were incorrect a few days after I started working on the leader election simplification, and at that point I was new to the dev workflow here, and it seemed that it would take just a few more days to land this patch.
src/main/scala/mesosphere/marathon/core/election/impl/ElectionServiceMetrics.scala
17	It is done to avoid calling `stopMetrics` if `startMetrics` was not invoked before, or calling `stopMetrics` twice (on both leader abdication and JVM shutdown).
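
The guard described in the reply about `startMetrics` and `stopMetrics` can be illustrated with a minimal sketch. `GuardedMetrics` and its counters are hypothetical and only stand in for real metric registration; the actual `ElectionServiceMetrics` code is not shown in this review. Note that, as raised in the question on line 17, the atomic flag only makes each flip happen once; it does not order the bodies of a concurrent start and stop relative to each other.

```scala
import java.util.concurrent.atomic.{ AtomicBoolean, AtomicInteger }

// Hypothetical metrics holder; the counters stand in for registering and removing real metrics.
final class GuardedMetrics {
  private val started = new AtomicBoolean(false)
  val starts = new AtomicInteger(0)
  val stops = new AtomicInteger(0)

  /** Starts metrics once; repeated calls while started are ignored. */
  def startMetrics(): Unit =
    if (started.compareAndSet(false, true)) { starts.incrementAndGet(); () }

  /** Stops metrics only if they were started and have not been stopped yet. */
  def stopMetrics(): Unit =
    if (started.compareAndSet(true, false)) { stops.incrementAndGet(); () }
}
```

With this shape, a stop on leader abdication followed by a stop on JVM shutdown results in a single effective stop, and a stop without a prior start does nothing.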

Address some of @jdef's comments

This revision is now accepted and ready to land. Jul 17 2017, 11:21 PM

jenkins requested changes to this revision. Jul 17 2017, 11:22 PM

This revision now requires changes to proceed. Jul 17 2017, 11:22 PM

Wrap ExecutionContent with ContextPropagatingExecutionContextWrapper

jenkins requested changes to this revision. Jul 17 2017, 11:38 PM

This revision now requires changes to proceed. Jul 17 2017, 11:38 PM

✗ Build of 3802 failed jenkins-public-marathon-phabricator-522.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

Harbormaster failed to build B3128: Diff 3802! Jul 17 2017, 11:38 PM

jenkins accepted this revision. Jul 17 2017, 11:57 PM

This revision is now accepted and ready to land. Jul 17 2017, 11:57 PM

✔ Build of 3803 completed jenkins-public-marathon-phabricator-523.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://downloads.mesosphere.io/marathon/snapshots/marathon-1.5.0-SNAPSHOT-658-g4cf8c34.tgz",
"sha1": "a7091bc22ae6be5288d66c5954ecf1a085619d32"

＼\ ٩( ᐛ )و /／

Harbormaster completed building B3129: Diff 3803. Jul 17 2017, 11:57 PM

Closed by commit rMARATHON0ebe9270cd0f: Simplify leader election code (authored by ichernetsky). Jul 18 2017, 9:40 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

		Path
M		build.sbt (2 lines)
M		src/main/scala/mesosphere/marathon/MarathonSchedulerActor.scala (3 lines)
M		src/main/scala/mesosphere/marathon/MarathonSchedulerService.scala (9 lines)
M		src/main/scala/mesosphere/marathon/api/v2/LeaderResource.scala (4 lines)
A	M	src/main/scala/mesosphere/marathon/core/base/CrashStrategy.scala (17 lines)
M		src/main/scala/mesosphere/marathon/core/election/ElectionModule.scala (17 lines)
M		src/main/scala/mesosphere/marathon/core/election/ElectionService.scala (36 lines)
M		src/main/scala/mesosphere/marathon/core/election/impl/CuratorElectionService.scala (303 lines)
D	M	src/main/scala/mesosphere/marathon/core/election/impl/ElectionServiceBase.scala (243 lines)
A	M	src/main/scala/mesosphere/marathon/core/election/impl/ElectionServiceEventStream.scala (22 lines)
A	M	src/main/scala/mesosphere/marathon/core/election/impl/ElectionServiceMetrics.scala (28 lines)
D	M	src/main/scala/mesosphere/marathon/core/election/impl/ExponentialBackoff.scala (34 lines)
M		src/main/scala/mesosphere/marathon/core/election/impl/PseudoElectionService.scala (135 lines)
M		src/test/scala/mesosphere/UnitTest.scala (4 lines)
M		src/test/scala/mesosphere/marathon/MarathonSchedulerServiceTest.scala (33 lines)
M		src/test/scala/mesosphere/marathon/api/v2/LeaderResourceTest.scala (5 lines)
D	M	src/test/scala/mesosphere/marathon/core/base/RichRuntimeTest.scala (17 lines)
M		src/test/scala/mesosphere/marathon/core/election/impl/CuratorElectionServiceTest.scala (20 lines)
D	M	src/test/scala/mesosphere/marathon/core/election/impl/ElectionServiceBaseTest.scala (250 lines)
A	M	src/test/scala/mesosphere/marathon/core/election/impl/PseudoElectionServiceTest.scala (127 lines)
M		src/test/scala/mesosphere/marathon/core/health/impl/HealthCheckWorkerActorTest.scala (2 lines)
M		src/test/scala/mesosphere/marathon/integration/LeaderIntegrationTest.scala (2 lines)
M		src/test/scala/mesosphere/marathon/integration/ResidentTaskIntegrationTest.scala (2 lines)
M		src/test/scala/mesosphere/marathon/integration/setup/ForwarderService.scala (2 lines)
M		src/test/scala/mesosphere/marathon/integration/setup/MarathonTest.scala (4 lines)
D	M	src/test/scala/mesosphere/marathon/test/ExitDisabledTest.scala (94 lines)

Diff	ID	Base	Description	Created	Lint	Unit
Base			Base
Diff 1	2914	8f70ede		Apr 26 2017, 6:34 PM	★	★
Diff 2	2936	8f70ede	- Update tests and adress most of the revision comments	Apr 28 2017, 3:01 AM	★	★
Diff 3	2948	8f70ede	- Addressed comments to the diff	Apr 28 2017, 5:55 PM	★	★
Diff 4	2970	8f70ede	- Document some methods of ElectionServiceFSM	Apr 29 2017, 10:06 PM	★	★
Diff 5	3006	8f70ede	- Return "Leadership abdicated" from LeaderResource	May 3 2017, 4:48 PM	★	★
Diff 6	3007	bed2a26	Rebased	May 3 2017, 4:51 PM	★	★
Diff 7	3015	bed2a26	- Comment acquireLeadership method	May 3 2017, 7:24 PM	★	★
Diff 8	3017	bed2a26	- Remove CuratorElectionServiceTest.scala which accidently got ressurected…	May 3 2017, 7:44 PM	★	★
Diff 9	3151	bfaeec1	Rebase	May 11 2017, 4:02 AM	★	★
Diff 10	3449	f9c0d57	- Get rid of FSM	Jun 7 2017, 2:35 PM	★	★
Diff 11	3584	f9c0d57	- Use val, not def (thanks, Tim!)	Jun 21 2017, 11:34 AM	★	★
Diff 12	3615	87188f4	- Use val, not def (thanks, Tim!)	Jun 23 2017, 12:35 PM	★	★
Diff 13	3617	87188f4	- Don't stop the VM from MarathonSchedulerService.stopLeadership method	Jun 23 2017, 1:46 PM	★	★
Diff 14	3641	87188f4	- Restore CuratorElectionServiceTest	Jun 27 2017, 8:00 PM	★	★
Diff 15	3723	52d5f15	- Update a few comments per Aleksey's request	Jul 6 2017, 8:45 PM	★	★
Diff 16	3763	f8e47a6	Make tests stable again on Amazon instances by getting rid of intercepting…	Jul 12 2017, 6:28 PM	★	★
Diff 17	3802	4cf8c34	- Address some of @jdef's comments	Jul 17 2017, 11:21 PM	★	★
Diff 18	3803	4cf8c34	- Wrap ExecutionContent with ContextPropagatingExecutionContextWrapper	Jul 17 2017, 11:34 PM	★	★
Diff 19	3804	4cf8c34	rMARATHON0ebe9270cd0fe0388aaa70eb01b7b17b4573a38a	Jul 18 2017, 9:40 AM	★	★

Commit	Tree	Parents	Author	Summary	Date
818f935aad73	c632c16e8aca	26f689a876ba	Ivan Chernetsky	Wrap ExecutionContent with ContextPropagatingExecutionContextWrapper	Jul 17 2017, 11:33 PM
26f689a876ba	9a0f0161a7cb	cfe8d3db40cc	Ivan Chernetsky	Address some of @jdef's comments	Jul 17 2017, 11:21 PM
cfe8d3db40cc	90df1f3ff3d3	5fe0c61b1283	Ivan Chernetsky	Ta-da	Jul 12 2017, 9:49 AM
5fe0c61b1283	8faf5bb1d558	642319be767a	Ivan Chernetsky	Add CrashStrategy	Jul 12 2017, 8:45 AM
642319be767a	88443ee4d3e0	baf2fcc963e2	Ivan Chernetsky	Use CrashStrategy	Jul 12 2017, 8:38 AM
baf2fcc963e2	0c36856921d8	a3a0fb215c34	Ivan Chernetsky	Revert some dummy changes	Jul 10 2017, 10:22 PM
a3a0fb215c34	c7461a2802c4	9b2548d2647a	Ivan Chernetsky	Disable one PseudoElectionService test	Jul 10 2017, 5:30 PM
9b2548d2647a	7976ea32fffd	2a6d8cccb4af	Ivan Chernetsky	Add ExitDisabledTest to the list of ancestors of PseudoElectionServiceTest	Jul 8 2017, 12:43 AM
2a6d8cccb4af	279b85e6689d	5f396c830a01	Ivan Chernetsky	Add verbose output	Jul 7 2017, 11:51 PM
5f396c830a01	0842c3ec5da9	b11d5d2617c2	Ivan Chernetsky	Rethrow exceptions in ExitDisabledTest	Jul 7 2017, 11:19 PM
b11d5d2617c2	bfb75bee6005	366f10066288	Ivan Chernetsky	Attempt to fix the flaky test	Jul 7 2017, 10:05 PM
366f10066288	778b07efcc6d	f2200554899e	Ivan Chernetsky	Update a few comments per Aleksey's request	Jul 6 2017, 8:44 PM
f2200554899e	650847dabee9	c0c6f78c7636	Ivan Chernetsky	Expand ElectionService's Scaladoc comment	Jun 27 2017, 7:52 PM
c0c6f78c7636	c639ebb8eb97	f32b37b90740	Ivan Chernetsky	Restore CuratorElectionServiceTest	Jun 27 2017, 6:53 PM
f32b37b90740	45ee11295978	8438af232bb7	Ivan Chernetsky	Don't stop the VM from MarathonSchedulerService.stopLeadership method	Jun 23 2017, 1:45 PM
8438af232bb7	0b20bbeb3892	0075cf5185ae	Ivan Chernetsky	Close the client upon stopping too	Jun 23 2017, 12:34 PM
0075cf5185ae	9815ce30aed1	3a8998f77e38	Ivan Chernetsky	Close the leader latch only when its state is STARTED	Jun 23 2017, 12:01 PM
3a8998f77e38	009be5231105	c6734739deeb	Ivan Chernetsky	Use Atomic{Reference, Boolean} instead of RichLock	Jun 22 2017, 3:47 PM
c6734739deeb	50d03b443ca7	294428d7d60b	Ivan Chernetsky	Use val, not def (thanks, Tim!)	Jun 21 2017, 11:33 AM
294428d7d60b	7faa1288ec49	899540f68fb3	Ivan Chernetsky	Gett rid of FSM	Jun 7 2017, 2:34 PM
899540f68fb3	4cabe2675526	700a68d580e1	Ivan Chernetsky	Remove CuratorElectionServiceTest.scala which accidently got ressurected when… (Show More…)	May 3 2017, 7:42 PM
700a68d580e1	eed2b88e2b93	c13984f0c967	Ivan Chernetsky	Comment acquireLeadership method	May 3 2017, 7:23 PM
c13984f0c967	6e25c4f475b7	37875b05445b	Ivan Chernetsky	Return "Leadership abdicated" from LeaderResource	May 3 2017, 4:44 PM
37875b05445b	cd27eb1da2a3	3d452ab6e5bf	Ivan Chernetsky	Document some methods of ElectionServiceFSM	Apr 29 2017, 10:06 PM
3d452ab6e5bf	96f2bff34994	80ce98ba4d28	Ivan Chernetsky	Addressed comments to the diff	Apr 28 2017, 5:55 PM
80ce98ba4d28	4a0970965cad	4b6cd9e2360b	Ivan Chernetsky	Update tests and adress most of the revision comments	Apr 28 2017, 2:58 AM
4b6cd9e2360b	9ddea2aacf82	2636fa26a9b9	Ivan Chernetsky	Simplify leader election code (Show More…)	Apr 26 2017, 6:18 PM
2636fa26a9b9	0101a5caf81b	4cf8c34b4f90	Ivan Chernetsky	Simplify leader election code and reduce the total number of states and… (Show More…)	Apr 26 2017, 12:31 AM

build.sbt
14	14		def formattingTestArg(target: File) = Tests.Argument("-u", target.getAbsolutePath, "-eDFG")
15	15
16	16		credentials ++= loadM2Credentials(streams.value.log)
17	17		resolvers ++= loadM2Resolvers(sLog.value)
18	18
19	19		resolvers += Resolver.sonatypeRepo("snapshots")
20	20		addCompilerPlugin("org.psywerx.hairyfotr" %% "linter" % "0.1.17")
21	21
22			cleanFiles <+= baseDirectory { base => base / "sandboxes" }
	22		cleanFiles += baseDirectory { base => base / "sandboxes" }.value
23	23
24	24		lazy val formatSettings = SbtScalariform.scalariformSettings ++ Seq(
25	25		ScalariformKeys.preferences := FormattingPreferences()
26	26		.setPreference(AlignArguments, false)
27	27		.setPreference(AlignParameters, false)
28	28		.setPreference(AlignSingleLineCaseStatements, false)
29	29		.setPreference(CompactControlReadability, false)
30	30		.setPreference(DoubleIndentClassDeclaration, true)

src/main/scala/mesosphere/marathon/MarathonSchedulerActor.scala
15	15		import mesosphere.marathon.core.instance.Instance.AgentInfo
16	16		import mesosphere.marathon.core.launchqueue.LaunchQueue
17	17		import mesosphere.marathon.core.task.Task
18	18		import mesosphere.marathon.core.task.termination.{ KillReason, KillService }
19	19		import mesosphere.marathon.core.task.tracker.InstanceTracker
20	20		import mesosphere.marathon.state.{ PathId, RunSpec }
21	21		import mesosphere.marathon.storage.repository.{ DeploymentRepository, GroupRepository }
22	22		import mesosphere.marathon.stream.Implicits._
23			import mesosphere.marathon.util._
24	23		import mesosphere.mesos.Constraints
25	24		import org.apache.mesos
26	25		import org.apache.mesos.Protos.{ Status, TaskState }
27	26		import org.apache.mesos.SchedulerDriver
28	27
29	28		import scala.async.Async.{ async, await }
30	29		import scala.concurrent.{ ExecutionContext, Future }
31	30		import scala.util.control.NonFatal
435	434		instances.specInstances(unknownId).foreach { orphanTask =>
436	435		logger.info(s"Killing ${orphanTask.instanceId}")
437	436		killService.killInstance(orphanTask, KillReason.Orphaned)
438	437		}
439	438		}
440	439
441	440		logger.info("Requesting task reconciliation with the Mesos master")
442	441		logger.debug(s"Tasks to reconcile: $knownTaskStatuses")
443			if (knownTaskStatuses.nonEmpty) driver.reconcileTasks(knownTaskStatuses.asJavaCollection)
	442		if (knownTaskStatuses.nonEmpty) driver.reconcileTasks(knownTaskStatuses.asJavaCollection) // linter:ignore UseIfExpression
444	443
445	444		// in addition to the known statuses send an empty list to get the unknown
446	445		driver.reconcileTasks(java.util.Arrays.asList())
447	446		}
448	447
449	448		def reconcileHealthChecks(): Unit = {
450	449		groupRepository.root().flatMap { rootGroup =>
451	450		healthCheckManager.reconcile(rootGroup.transitiveAppsById.valuesIterator.to[Seq])

src/main/scala/mesosphere/marathon/MarathonSchedulerService.scala
1	1	package mesosphere.marathon
2	2
3	3	import java.util.concurrent.CountDownLatch
4	4	import java.util.{ Timer, TimerTask }
5	5	import javax.inject.{ Inject, Named }
6	6
7	7	import akka.Done
8	8	import akka.actor.{ ActorRef, ActorSystem }
9	9	import akka.stream.Materializer
10	10	import akka.util.Timeout
11	11	import com.google.common.util.concurrent.AbstractExecutionThreadService
12	12	import mesosphere.marathon.MarathonSchedulerActor._
13		import mesosphere.marathon.core.base.toRichRuntime
14	13	import mesosphere.marathon.core.deployment.{ DeploymentManager, DeploymentPlan, DeploymentStepInfo }
15	14	import mesosphere.marathon.core.election.{ ElectionCandidate, ElectionService }
16	15	import mesosphere.marathon.core.group.GroupManager
17	16	import mesosphere.marathon.core.heartbeat._
18	17	import mesosphere.marathon.core.instance.Instance
19	18	import mesosphere.marathon.core.leadership.LeadershipCoordinator
20	19	import mesosphere.marathon.state.{ AppDefinition, PathId, Timestamp }
21	20	import mesosphere.marathon.storage.migration.Migration
162	161	}
163	162
164	163	log.info("Completed run")
165	164	}
166	165
167	166	override def triggerShutdown(): Unit = synchronized {
168	167	log.info("Shutdown triggered")
169	168
170		electionService.abdicateLeadership(reoffer = false)
	169	electionService.abdicateLeadership()
171	170	stopDriver()
172	171
173	172	log.info("Cancelling timer")
174	173	timer.cancel()
175	174
176	175	// The countdown latch blocks run() from exiting. Counting down the latch removes the block.
177	176	log.info("Removing the blocking of run()")
178	177	isRunningLatch.countDown()

src/main/scala/mesosphere/marathon/api/v2/LeaderResource.scala
1	1	package mesosphere.marathon
2	2	package api.v2
3	3
4	4	import javax.servlet.http.HttpServletRequest
5	5	import javax.ws.rs.core.{ Context, Response }
6	6	import javax.ws.rs._
7	7
8	8	import com.google.inject.Inject
9	9	import mesosphere.chaos.http.HttpConf
10		import mesosphere.marathon.MarathonConf
11	10	import mesosphere.marathon.api.{ AuthResource, MarathonMediaType, RestResource }
12	11	import mesosphere.marathon.core.election.ElectionService
13	12	import mesosphere.marathon.plugin.auth._
14	13	import mesosphere.marathon.storage.repository.RuntimeConfigurationRepository
15	14	import mesosphere.marathon.raml.RuntimeConfiguration
16	15	import Validation._
	16	import akka.actor.ActorSystem
17	17	import mesosphere.marathon.stream.UriIO
18	18
19	19	@Path("v2/leader")
20	20	class LeaderResource @Inject() (
	21	system: ActorSystem,
21	22	electionService: ElectionService,
22	23	val config: MarathonConf with HttpConf,
23	24	val runtimeConfigRepo: RuntimeConfigurationRepository,
24	25	val authenticator: Authenticator,
25	26	val authorizer: Authorizer)
26	27	extends RestResource with AuthResource {
27	28
28	29	@GET
44	45	@QueryParam("restore") restoreNullable: String,
45	46	@Context req: HttpServletRequest): Response = authenticated(req) { implicit identity =>
46	47	withAuthorization(UpdateResource, AuthorizedResource.Leader) {
47	48	if (electionService.isLeader) {
48	49	assumeValid {
49	50	val backup = validateOrThrow(Option(backupNullable))(optional(UriIO.valid))
50	51	val restore = validateOrThrow(Option(restoreNullable))(optional(UriIO.valid))
51	52	result(runtimeConfigRepo.store(RuntimeConfiguration(backup, restore)))
	53
