17:00:04 <djmitche> #startmeeting weekly
17:00:04 <bb-supy`> Meeting started Tue Jan  2 17:00:04 2018 UTC and is due to finish in 60 minutes.  The chair is djmitche. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:00:04 <bb-supy`> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:00:04 <bb-supy`> The meeting name has been set to 'weekly'
17:00:09 <djmitche> #topic Introduction
17:00:18 <djmitche> welcome to Buildbot in 2018
17:00:25 <djmitche> http://bit.ly/2rup31x
17:00:54 <djmitche> Anyone else around?
17:01:56 <djmitche> heh
17:04:59 * djmitche decides to double his salary
17:05:06 <djmitche> any objections?
17:05:07 <bdbaddog> Greetings! Sorry I'm a few minutes late.
17:05:12 <djmitche> great, motion passes
17:05:16 <djmitche> haha, hi!
17:05:16 <bdbaddog> so 2*$0.00 ?
17:05:20 <djmitche> yeah :/
17:05:34 <bdbaddog> are we low on attendance today?
17:05:37 <djmitche> #topic MOSS Revisit
17:05:39 <djmitche> yeah, it's you and me afaict
17:06:11 <djmitche> Pierre and I replied, but I'm not sure we answered your questions...
17:06:14 <djmitche> (re MOSS)
17:06:31 <bdbaddog> O.k. so lemme find the bounty text.
17:07:30 <bdbaddog> So Pierre added logic to shut down the buildbot worker if the master becomes unavailable for some time, but I think that doesn't really satisfy the meat of the bounty.
17:07:41 <bdbaddog> http://trac.buildbot.net/ticket/3392
17:08:04 <djmitche> yep, that feature is #3393
17:08:12 <bdbaddog> Which for this item is: Providing support for shutting down an instance when it cannot connect to the master for a prolonged period
17:08:28 <bdbaddog> Right now, there are a number of ways for an EC2 instance to become "stranded", where it's not connected to the master, but still running and thus costing the user perfectly good beer money.
17:08:51 <bdbaddog> I'm not super familiar with ec2, but is there not a way to shut down an instance from within the instance?
17:09:03 <djmitche> yeah just /sbin/halt will do it
17:09:22 <bdbaddog> does that stop the $$'s from adding up?
17:09:27 <djmitche> (assuming the terminate-on-halt flag is set, which it generally is for spot instances)
17:09:27 <djmitche> yes
17:09:55 <bdbaddog> o.k. is that common across the cloud providers (azure, google, etc)?
17:10:04 <tardyp> hi. Sorry I missed the meeting start
17:10:05 <djmitche> I believe so, yes
17:10:09 <djmitche> no problem :)
17:10:10 <tardyp> busy with the release..
17:10:11 <bdbaddog> not the flag, but if the instance shuts down then the billing stops?
17:10:29 <djmitche> bdbaddog: I think so, yes -- at least you stop paying for compute time
17:10:53 <bdbaddog> sure. you still pay for the disk image GB's..
17:10:54 <djmitche> you may still pay for storage if the instance is in a state where it could be started again (that's the difference between "terminated" and "shut down" in EC2 parlance)
17:11:07 <djmitche> those GBs are much cheaper than a running instance tho
17:11:32 <bdbaddog> o.k. so if we can set up a way for the worker to run a script when it shuts down and/or when it shuts down due to a timeout, that'd satisfy the runaway costs issue?
17:12:00 <djmitche> yes
17:12:04 <tardyp> right
17:12:15 <djmitche> another more popular option is to run a script which looks for a buildslave process and, if one is not found, halts
17:12:35 <djmitche> well, more reliable, I have no idea what's popular with the kids these days
17:12:46 <bdbaddog> o.k. is that a third party thing? like supervisord
17:12:50 <djmitche> yeah
17:13:00 <tardyp> or systemd
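The watchdog idea discussed above (look for a worker process and halt if none is found) could be sketched roughly as follows. This is a minimal illustration, not anything shipped with Buildbot: the process pattern `buildbot-worker`, the three-miss threshold, and the 60-second interval are all assumptions, and it relies on `pgrep` and `/sbin/halt` being available on the instance.

```python
import subprocess
import time


def worker_is_running(pattern="buildbot-worker"):
    """Return True if any process command line matches `pattern`.

    Uses `pgrep -f`; the pattern is an assumption and must not also
    match the command line of this watchdog script itself.
    """
    result = subprocess.run(["pgrep", "-f", pattern], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def halt_machine():
    """Halt the instance; needs root privileges."""
    subprocess.run(["/sbin/halt"])


def watchdog(check=worker_is_running, halt=halt_machine,
             misses_allowed=3, interval=60, sleep=time.sleep):
    """Halt once `misses_allowed` consecutive checks find no worker."""
    misses = 0
    while True:
        # Reset the counter whenever a worker is seen, so a brief
        # restart of the worker does not halt the instance.
        misses = 0 if check() else misses + 1
        if misses >= misses_allowed:
            halt()
            return
        sleep(interval)
```

Calling `watchdog()` from a boot-time service would start the loop. Requiring several consecutive misses is a design choice meant to tolerate worker restarts; on an instance with the terminate-on-halt behavior mentioned earlier, the halt would also end the compute charges.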
17:14:18 <bdbaddog> o.k. can we (meaning someone who knows how to do this, and likely not me) just document one (or more) such options
17:14:31 <bdbaddog> are the other bullets handled?
17:14:32 <bdbaddog> http://trac.buildbot.net/ticket/3392
17:14:43 <tardyp> I added a bit of doc about it in my commit
17:15:06 <bdbaddog> btw.. did you see Tony Sebro just stepped down from SFC and is going to wikimedia?
17:16:43 <djmitche> yeah, I wonder if it's related to the SFLC thing
17:16:53 <djmitche> anyway, that aside
17:17:34 <djmitche> I'm happy with some documentation bits and minor features to satisfy this bounty.
17:17:54 <djmitche> I don't think what remains is worth $5k, since it's barely used
17:18:38 <djmitche> so I guess the options are:
17:18:40 <djmitche> * pay back
17:18:59 <djmitche> * claim we have captured the spirit in general improvements to latent support + Hyper.sh
17:19:08 <djmitche> * find a new $5k bounty instead
17:19:23 <bdbaddog> Did MOSS already hand us the full amount of the bounty?
17:19:30 <djmitche> I don't think "do a lot of technical work to accomplish these items" is a good choice
17:19:43 <tardyp> I agree with this
17:19:44 <djmitche> to my knowledge, yes, it is in the SFC account
17:20:26 <bdbaddog> o.k. Then once the doc updates are done as above, let me know, I'll review and assuming even I can understand them, I'll let MOSS know we've awarded the bounty?
17:20:41 <tardyp> https://docs.buildbot.net/latest/manual/installation/worker.html#cmdoption-buildbot-worker-create-worker-maxretries
17:21:09 <tardyp> it is rather small. I will add some paragraph in EC2 part
17:22:26 <djmitche> should I close https://github.com/buildbot/buildbot/issues/2962 or keep it open?
17:22:28 <bdbaddog> is the systemd config a file? if so can we stick an example in the source tree somewhere? maybe: https://github.com/buildbot/buildbot-contrib
17:23:23 <tardyp> bdbaddog: right
17:24:06 <djmitche> I updated https://github.com/buildbot/buildbot/issues/2961
17:24:09 <bdbaddog> maybe a buildbot-worker-ec2 (or cloud ) or something file with the shutdown stuff in it?
17:24:33 <tardyp> the thing is I can hardly test it
17:25:38 <bdbaddog> that's fine. it's contrib. it doesn't have to be perfect. Put it there and when I get a chance I'll try it in my ec2 account.
17:25:47 <djmitche> ok
17:26:00 * djmitche will file a gh issue for that
17:26:27 <tardyp> cool
17:26:30 <djmitche> what about the "lost" instances?
17:26:34 <djmitche> should we close that as WONTFIX?
17:27:02 <tardyp> yes
17:27:09 <tardyp> I think it is easier to do externally
17:27:39 <tardyp> or if somebody contributes it.
17:28:46 <djmitche> ok
17:28:59 <djmitche> bdbaddog: and are you comfortable dropping that from the bounty?
17:29:33 <bdbaddog> lemme take a quick look at that bug.
17:31:37 <bdbaddog> so in the current case, assuming we have systemd shutting down if the worker can't talk to the master, which shuts down the worker, would the ec2 spot instance then shut down?
17:31:56 <tardyp> yes
17:32:14 <bdbaddog> so while the request wouldn't get cancelled, the charge should be minimal?
17:33:01 <djmitche> ah, good point
17:33:06 <tardyp> yes, you would have a bunch of dead insteance after a while that you would need to cleanup. Assuming your master is really instable, which is something you should take care of
17:33:10 <djmitche> so it would eventually start up, fail to connect, yadda yadda, shut down
17:33:29 <djmitche> if they are terminate-on-halt you shouldn't need to clean anything up
17:33:43 <bdbaddog> I guess the other question is, how hard to actually fix the issue where non-terminal status codes would lead to canceling the request?
17:33:43 <tardyp> that's beyond my EC2 foo
17:34:18 <djmitche> mine too
17:34:28 <tardyp> https://medium.com/buildbot/buildbot-0-9-15-4614630d9cb
17:34:33 <djmitche> also, EC2 just changed the whole spot pricing thing a few weeks ago
17:34:38 <djmitche> it's not even pretending to be a market anymore
17:34:53 <bdbaddog> assuming there's already logic now to cancel requests?
17:34:54 <djmitche> so price-too-low, capacity-unavailable, and capacity-oversubscribed are *far* more common now
17:35:01 <djmitche> I don't know if there is..
17:35:02 * djmitche greps
17:35:30 <djmitche> oh
17:35:36 <djmitche> it looks like we *do* cancel on price-too-low
17:37:13 <bdbaddog> djmitche: so to fully resolve that issue is it just the matter of canceling the request for a few more status codes?
17:37:19 <djmitche> https://github.com/buildbot/buildbot/commit/58de958616898818f1a14a1736e6ca7bfe550114
17:37:25 <djmitche> maybe?
17:37:49 <djmitche> maybe just cancelling by default
17:37:52 <djmitche> I have no idea how to test that though
17:38:09 <djmitche> but, yeah, looks like that would be easy to add-and-pray
17:38:36 <djmitche> I'll make a note in the relevant issue
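The "cancel by default" idea floated above can be illustrated with a small decision function. This is only a sketch of the policy, not what Buildbot's EC2 latent worker does: the function name and the choice of which status codes to keep waiting on are assumptions, and the actual cancel call against EC2 is deliberately left out.

```python
# Spot request status codes for which the request is still being
# evaluated or has already been fulfilled. Which codes belong here
# is an assumption made for this illustration.
KEEP_WAITING = frozenset({
    "pending-evaluation",
    "pending-fulfillment",
    "fulfilled",
})


def should_cancel(status_code):
    """Return True if the spot request should be cancelled.

    Cancel-by-default: any status code not explicitly listed in
    KEEP_WAITING is treated as a reason to cancel, so newly
    introduced codes do not leave requests lingering.
    """
    return status_code not in KEEP_WAITING
```

With this policy, `should_cancel("price-too-low")` and `should_cancel("capacity-oversubscribed")` are both true without either code being listed explicitly, which is the appeal of cancelling by default over extending a list of cancellable codes one at a time.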
17:39:08 <bdbaddog> o.k. I think if we can add the doc, and sample in contrib, and handle this remaining case then the meat of not allowing runaway ec2 costs would be completed.
17:39:15 <bdbaddog> that sound reasonable?
17:40:15 <djmitche> ok
17:40:22 <djmitche> proposed comment on the bug? https://irccloud.mozilla.com/pastebin/OQFYSOBj/
17:42:18 <tardyp> looks good
17:43:17 <bdbaddog> so the ..failsafe method.. is what you're proposing to drop from the bounty?
17:43:24 <djmitche> basically, yes
17:43:41 <bdbaddog> o.k. the wording for that is too vague to ever be done.
17:43:45 <djmitche> and only making a passing attempt at "correcting spot instance handling"
17:43:50 <djmitche> yeah
17:43:56 <bdbaddog> I'm o.k. with that.
17:44:07 <djmitche> the idea was that whoever proposed would have more specific ideas :)
17:44:10 <djmitche> ok
17:45:10 <djmitche> commented, and one is assigned to me
17:45:11 <djmitche> so I'll get to that
17:45:18 <bdbaddog> fair enough. I think there's $5k of value in the work we've agreed to toward preventing runaway ec2 usage.
17:45:31 <djmitche> if you count the Hyper.sh stuff, definitely
17:46:34 <djmitche> so I think the argument is two-pronged: we addressed runaway EC2 usage, if not comprehensively; but moreover we've created a new, more reliable and effective latent-worker approach in Hyper.sh which we recommend instead of EC2
17:47:53 <tardyp> a nice way to say it..
17:50:29 <djmitche> ok
17:50:36 <djmitche> ready to wrap up?
17:50:47 <djmitche> #info long discussion of MOSS bounty
17:50:47 <djmitche> https://github.com/buildbot/buildbot/issues/2961
17:50:55 <bdbaddog> yup
17:51:09 <djmitche> #info summary: the argument that this is complete is two-pronged: we addressed runaway EC2 usage, if not comprehensively; but moreover we've created a new, more reliable and effective latent-worker approach in Hyper.sh which we recommend instead of EC2
17:51:38 <djmitche> #info proposal is to consider the bounty claimed by the Buildbot community and retain the funds as general operating funds
17:51:41 <djmitche> ok
17:51:44 <djmitche> #endmeeting