16:30:21 <djmitche> #startmeeting weekly
16:30:21 <bb-supy> Meeting started Tue Sep 13 16:30:21 2016 UTC and is due to finish in 60 minutes.  The chair is djmitche. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:30:21 <bb-supy> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:30:21 <bb-supy> The meeting name has been set to 'weekly'
16:30:23 <djmitche> #topic Introduction
16:30:30 <djmitche> https://titanpad.com/buildbot-agenda
16:30:33 <tardyp> hey!
16:30:35 <djmitche> pretty empty agenda today
16:30:37 <djmitche> hey!
16:30:42 <gracinet> hi guys
16:30:42 <djmitche> anyone else around?
16:30:47 <tardyp> well, I did not have time to put things on it
16:30:58 <djmitche> now's your chance :)
16:31:03 <djmitche> gracinet: how are you?
16:31:30 <djmitche> Sep 13 16:28:59 trac sm-mta[28170]: STARTTLS=client, relay=mx.buildbot.net., version=TLSv1/SSLv3, verify=FAIL, cipher=ECDHE-RSA-AES256-GCM-SHA384, bits=256/256
16:31:32 <gracinet> I'm fine dustin, and you ? just added TLS / endpoints to the agenda
16:31:44 <djmitche> awesome, sorry I took it off :)
16:31:53 <djmitche> verify=FAIL has me worried there
16:32:19 <djmitche> #topic week in review
16:32:24 <gracinet> what kind of log is that from?
16:32:27 <djmitche> I'm blind this week -- what's new? tardyp?
16:32:34 <djmitche> gracinet: it's from the mailer on the trac jail
16:32:39 <djmitche> which should have sent the weekly email
16:32:42 <verm__> here, sorry didn't realise i wasn't in the channel
16:33:03 <tardyp> sorry I was in another meeting, that just ended
16:33:35 <djmitche> it's OK -- anything to highlight?
16:34:02 <tardyp> this week has been quiet actually, besides the UsageData
16:34:05 <djmitche> hey amar - we'll get there in a sec
16:34:06 <djmitche> ok
16:34:18 <djmitche> #info quiet week, except for UsageData discussion (coming up shortly)
16:34:19 <verm__> no problem i'll be off and on i'll check back frequently
16:34:21 <tardyp> I have taken a lot of time to build the vagrant setup, and now it is done
16:34:32 <djmitche> yes, I need to have a look at that
16:34:47 <djmitche> https://github.com/buildbot/buildbot-infra/pull/148/files/90c8214ea54c251cbe073af49887f2859dc02e00#r78527113
16:34:55 <djmitche> #undo
16:34:55 <bb-supy> Removing item from minutes: <ircmeeting.items.Link object at 0x806856310>
16:34:57 <djmitche> https://github.com/buildbot/buildbot-infra/pull/148
16:35:10 <tardyp> we had a few support requests on the ML, but nothing huge, just new users \o/
16:35:17 <djmitche> work on supporting buildbot-infra development using vagrant
16:35:23 <djmitche> new users are awesome!
16:35:35 <djmitche> so no changes in 0.9.x release
16:35:50 <tardyp> nope
16:35:51 <djmitche> and no updates on 0.8.x either it seems
16:35:54 <djmitche> ok
16:36:02 <djmitche> #topic Weekly Email missing
16:36:15 <gracinet> on that, I see that's a CaCert certificate
16:36:26 <djmitche> verm__: ok, sorry for the delay
16:36:33 <djmitche> verm__: I was wondering if anything changed with mx recently
16:36:35 <gracinet> and the random machine I'm checking it from cannot verify either
16:36:39 <tardyp> I can look at the issue, now that the vagrant setup is working for me
16:36:40 <djmitche> gracinet: yeah, I wonder if it's expired
16:37:22 <gracinet> yes, it is, since march unless i'm blind
16:37:33 <gracinet> Issuer: O=CAcert Inc., OU=http://www.CAcert.org, CN=CAcert Class 3 Root
16:37:33 <gracinet> Validity
16:37:33 <gracinet> Not Before: Mar 19 18:51:58 2014 GMT
16:37:33 <gracinet> Not After : Mar 18 18:51:58 2016 GMT
16:37:46 <djmitche> oh, hm
16:37:56 <djmitche> so maybe that's not the issue - we've gotten emails up until this week
16:37:58 <tardyp> but this worked last week!
16:38:00 <gracinet> this is trying from a server of mine, on port 25
16:38:06 <gracinet> yes, I noticed
16:38:20 <djmitche> (and just to be clear -- did anyone get the weekly summary email from buildbot?)
16:38:32 <gracinet> $ openssl s_client -connect mx.buildbot.net:25 -starttls smtp
16:38:53 <gracinet> no, didn't get it (noticed 10 min ago)
16:39:28 <djmitche> oh, I didn't know -starttls, nice
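For reference, a minimal sketch of the check gracinet describes: pipe the STARTTLS session into openssl x509 to print the issuer and validity window. This assumes the host answers SMTP on port 25, as in the command above.

    # grab the server certificate via SMTP STARTTLS and print issuer + validity dates
    $ openssl s_client -connect mx.buildbot.net:25 -starttls smtp </dev/null 2>/dev/null \
        | openssl x509 -noout -issuer -dates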
16:39:32 <verm__> no, nothing
16:39:35 <djmitche> https://irccloud.mozilla.com/pastebin/IY2Fz0mA
16:39:36 <djmitche> that's the log on the mailer
16:39:38 <verm__> which emails aren't working?
16:40:03 <verm__> looks like i forgot to update the cert to the new one but that should not matter
16:40:21 <gracinet> maybe a Python update that starts verifying?
16:40:45 <djmitche> verm__: yeah, it seems to connect and send the message just w/o TLS
16:40:51 <djmitche> the message is stuck in mailq
16:41:45 <djmitche> #info the weekly status email did not get sent to the mailing lists
16:42:01 <tardyp> just a random thought... could that be because we unlocked the mq on one of the jails and then our server got banned?
16:42:03 <djmitche> #info there was a TLS error, but that appears not to have been the issue (message was delivered to mx.buildbot.net anyway)
16:42:19 <djmitche> possible :)
16:42:37 <djmitche> except these messages are stuck *before* they are delivered to the lists.bb.n
16:42:45 <tardyp> ok
16:43:14 <tardyp> I guess we won't debug that in the meeting...
16:43:19 <verm__> huh weird my email alias is not working either
16:44:13 <djmitche> verm__: do you mind if we continue with a few other topics and loop back?
16:44:18 <verm__> yes i'll look into this
16:44:32 <djmitche> ok
16:44:49 <djmitche> #topic Usage Data discussion
16:45:02 <djmitche> tardyp: was there a lot of feedback on the users list?
16:45:16 <verm__> whoaa wtf is going on here
16:45:16 <verm__> 4:45PM  up 535 days,  4:02, 0 users, load averages: 32.24, 32.66, 31.94
16:45:20 <tardyp> none except for yours
16:45:24 <djmitche> haha, ok
16:45:32 <verm__> there are 63 bbinfra python processes
16:45:33 <djmitche> so 100% positive feedback!
16:45:38 <djmitche> verm__: which jail?
16:45:41 <verm__> mx
16:45:50 <tardyp> it's the ansible bug
16:45:53 <skelly> I killed a bunch last night
16:46:05 <djmitche> yeah, just kill 'em
16:46:12 <skelly> they were all from >1 week ago
16:46:20 <verm__> there are 239 on service1
16:46:32 <djmitche> they build up
16:46:35 <tardyp> we need to have the ansible-pull.sh kill those automatically
16:46:40 <djmitche> I think it's a git bug actually -- they hang in git
16:47:30 <djmitche> #info no feedback on user-data collection process from the mailing lists
16:47:58 <tardyp> so shall I merge my patch and release rc3?
16:48:05 <djmitche> I can't see why not!
16:48:06 <tardyp> everybody agrees?
16:48:10 <gracinet> yup
16:48:11 <verm__> here comes all the email
16:48:18 <verm__> this is going to suck, sorry
16:48:18 <gracinet> (gasp)
16:48:19 <djmitche> haha, ok
16:48:34 <djmitche> #agreed pierre will merge the user-data PR and release rc3
16:48:54 <bb-github> [13buildbot] 15tardyp closed pull request #2393: buildbotNetUsageData implementation (06master...06events) 02https://git.io/visRm
16:49:08 <tardyp> no sooner said than done
16:49:10 <gracinet> that's done!
16:49:23 <djmitche> awesome!
16:49:24 <gracinet> alright, same french-writing reflex, sorry
16:49:27 <djmitche> haha
16:49:34 <djmitche> #topic Weekly Email missing (reprise)
16:49:41 <djmitche> verm__: what'd you find?
16:49:41 <verm__> mail is working again
16:50:15 <verm__> i think there was some type of resource exhaustion that was not logged anywhere (that i could see), killing all the python processes (236) and restarting postfix fixed it
16:50:22 <djmitche> hm, ok
16:50:27 <djmitche> so we should prioritize fixing that git hang :(
16:50:37 <djmitche> I wonder if it has a common cause with all of the retries
16:51:31 <verm__> hmm yeah... i would be surprised if no one else has hit this
16:51:31 <djmitche> can you tell anything about what the state is of those hung processes?
16:51:36 <djmitche> there are probably more on service2/3 :)
16:51:53 <tardyp> how do you see which task they are stuck on?
16:51:54 <verm__> in the interim we can have a crontab that will kill any ansible process over 30 mins old
16:52:18 <verm__> 10 on service2
16:52:44 <tardyp> verm__: killing them at the time of the ansible-pull.sh cron means that we will kill all ansible that are > 1h
16:52:45 <djmitche> tardyp: it's usually the initial git pull
16:53:11 <tardyp> djmitche: how do you see that?
16:53:13 <djmitche> verm__: yeah that sounds good, if you want to write up the script and send it to the list I can set it up in ansible
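A sketch of the interim cleanup proposed here, assuming FreeBSD ps supports the etimes keyword (elapsed running time in seconds) and that the hung processes all match "ansible" in their command line; the 1800-second cutoff is the 30 minutes mentioned above:

    #!/bin/sh
    # kill any ansible process that has been running longer than 30 minutes
    MAX_AGE=1800
    ps -axo pid=,etimes=,command= | while read pid etimes cmd; do
        case "$cmd" in
        *ansible*)
            [ "$etimes" -gt "$MAX_AGE" ] && kill -9 "$pid"
            ;;
        esac
    done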
16:53:17 <tardyp> I never see any git process
16:53:19 <verm__> they appear to be different processes
16:53:21 * djmitche looks
16:53:22 <verm__> lots of different ones
16:53:28 <verm__> ps auxww|grep -i ansible
16:53:34 <verm__> it's not the same step
16:53:56 <verm__> service3 has none
16:53:59 <verm__> did someone kill them?
16:54:15 <verm__> ah, probably sean; he logged in this morning
16:54:21 <djmitche> tardyp: you're right, sorry -- previous times I've looked, I've found git processes
16:54:44 <verm__> so it's probably a bug within ansible itself
16:54:54 <djmitche> yeah
16:54:59 <djmitche> is there any way to tell what syscall it's stuck on?
16:55:06 <djmitche> like strace on linux
16:55:10 <verm__> ktrace
16:55:19 <djmitche> k
16:55:22 <tardyp> as it is consuming CPU, I would say it's not stuck on a syscall
16:55:28 <djmitche> it's weird that it counts as runnable (so in the load avg)
16:55:30 <verm__> you will need to use kdump to read the output
16:55:43 <verm__> djmitche: on service1 there were over 15 processes using >10%
16:55:51 <verm__> and about 50 using > 5 < 10
16:56:04 <verm__> that's why the load was 30+
16:56:10 <tardyp> is it one CPU per process?
16:57:15 <djmitche> verm__: right, but are they actually using that?
16:57:28 <djmitche> or just runnable?
16:57:37 <djmitche> anyway, `ktrace -p nnn` returns immediately for me
16:58:25 <verm__> you don't have a ktrace.out?
16:58:29 <djmitche> I do, it's empty
16:58:34 <verm__> you need to use kdump to read what it's dumping
16:58:35 <verm__> oh
16:58:42 <djmitche> tardyp: we can loop back to the vagrant stuff after if there's time
16:59:14 <djmitche> ah, truss worked
16:59:24 <djmitche> well, and caused the process to exit
16:59:28 <djmitche> this was a "lineinfile" step
16:59:32 <djmitche> https://irccloud.mozilla.com/pastebin/hMkXaOq4
17:00:15 <verm__> hmm looks like they're all different
17:00:27 <djmitche> this reminds me of the livelock issues we had on .. OpenBSD, I think?
17:00:38 <djmitche> that were eventually traced to unsafe signal handling in Python
17:00:41 <verm__> djmitche: if the process is stuck ktrace.out will be empty...you have to wait for it to do something if it does
17:00:45 <verm__> i did ktrace on a process
17:00:49 <verm__> then did truss which made it quit
17:00:55 <verm__> then ktrace.out had details
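For reference, the FreeBSD tracing sequence described here, with nnn standing in for the stuck PID (ktrace logs kernel activity to ./ktrace.out, kdump decodes it, and truss attaches live and in this case happened to make the process exit):

    $ ktrace -p nnn        # start tracing the process into ./ktrace.out
    $ truss -p nnn         # watch syscalls live (this is what made it quit here)
    $ ktrace -c -p nnn     # stop tracing that process
    $ kdump -f ktrace.out  # decode whatever was captured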
17:01:01 <djmitche> ah, interesting -- what'd you see?
17:01:30 <djmitche> http://trac.buildbot.net/ticket/1992
17:01:51 <verm__> http://pastebin.com/UaK71E05
17:01:51 <infobob> https://paste.pound-python.org/show/6QYL0ecum5rOgmAvXbE5/ (repasted for verm__)
17:02:23 <djmitche> hm, maybe that's not the bug I was thinking of
17:02:57 <verm__> wonder what it's polling; i will see if i can find the time to use dtrace
17:03:10 <verm__> i'm in the process of moving, which is happening at the end of this month; packing up the lab has taken weeks :(
17:04:40 <djmitche> it looks like it was waiting for stdout from pkgng
17:04:48 <djmitche> something's weird here at the OS level..
17:05:15 <djmitche> ok, let's continue with the meeting then
17:05:33 <djmitche> #info mail issues traced down to too many live-locked ansible-related processes causing high load and no mail delivery
17:05:37 <djmitche> #info fixed now
17:05:50 <djmitche> #topic vagrant setup for buildbot-infra
17:06:29 <bb-github> [13buildbot] 15tardyp opened pull request #2398: port to stable : buildbotNetUsageData implementation (06buildbot-0.9.0...06stable) 02https://git.io/vi2uv
17:06:39 <verm__> maybe before we do anything we should update the base OS and the basejail
17:06:46 <verm__> it is kind of old, anyway
17:06:58 <djmitche> I think that's relatively recent? skelly keeps up with that
17:07:02 <skelly> eh
17:07:07 <skelly> it's a PITA to do
17:07:10 <tardyp> if I understand correctly we could update the basejail with ansible?
17:07:20 <skelly> it's on 10.1 which is still supported for ~month
17:07:28 <tardyp> and everything shall work seamlessly, no?
17:07:34 <skelly> it's probably best to not update the basejail via ansible
17:07:42 <skelly> it needs to be in sync with the kernel
17:07:48 <tardyp> ah
17:07:56 <skelly> or, not very far behind at least
17:08:09 <skelly> I would like to manage the service* hosts via freebsd-update
17:08:24 <skelly> staying up to date is much simpler and it can manage the basejail too
17:08:32 <djmitche> ++ to that
17:08:36 * djmitche *heart* freebsd-update
17:08:42 <skelly> one catch is we have to use a GENERIC kernel
17:08:45 <tardyp> looks cool
17:09:30 <skelly> I'm on the freebsd-security(?) ml so I do see when security updates happen and it's easy to then do that on the hosts
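A sketch of the workflow being proposed, assuming a GENERIC kernel on the host; the basejail path is illustrative (ezjail's default layout):

    # fetch and apply binary updates for the host OS
    $ freebsd-update fetch install
    # the basejail can be kept in sync by pointing -b at its root
    $ freebsd-update -b /usr/jails/basejail fetch install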
17:09:58 <skelly> verm__: I know you have been against this previously
17:10:54 <skelly> and my chief concern is if the custom kernel has special changes beyond needing to load some modules
17:11:13 <verm__> yeah i've never had much luck with freebsd-update on infra
17:11:20 <verm__> yes for pf and a couple other items
17:11:30 <verm__> i think we could use a generic kernel on all the systems except vm
17:11:39 <skelly> yeah, vm would be separate
17:11:42 <djmitche> #undo
17:11:42 <bb-supy> Removing item from minutes: <ircmeeting.items.Topic object at 0x80694b910>
17:11:42 <skelly> it's on CURRENT
17:12:07 <skelly> I would like to get it on GENERIC too, but that's not going to happen until 12-RELEASE
17:12:43 <skelly> relatedly: libvirt works on FreeBSD and I have successfully (mostly) used it with bhyve
17:12:49 <verm__> i always surgically add security updates as required; the biggest issue with freebsd-update is it will update everything, which can cause issues if you need to run a service or library that is buggy on newer versions
17:13:00 <skelly> certainly
17:13:08 <skelly> but, no one is doing anything on the hosts
17:13:08 <verm__> if you think you can make it work and not be full of headaches go for it :)
17:13:16 <verm__> we can always revert if it ends up being a nightmare
17:13:21 <skelly> using freebsd-update means more people than you and (heh) me could do it
17:13:39 <skelly> so I'll take a shot at one of them sometime soon
17:13:42 <verm__> i've just never seen a need for it because i only update operating systems once every 2-3 years
17:13:54 <skelly> any preferences on which host to try first?
17:13:58 <verm__> especially on freebsd where you can get weird bugs that can take over a year for the devs to figure out
17:14:10 <verm__> hmm
17:14:17 <verm__> they're all pretty critical :)
17:14:41 <verm__> i guess service1 is the one we could last some time without
17:15:08 <skelly> okay
17:15:49 <tardyp> service3 also
17:16:04 <tardyp> service1 has ns and mx so probably better not to crash that
17:16:27 <djmitche> yeah, I'd say 2 or 3
17:16:43 <skelly> I'll do 3
17:16:46 <tardyp> service3 has my new events jail, and an unused slave
17:17:01 <skelly> 2 runs web stuff so it would be the most publicly visible with long downtime
17:17:23 <skelly> I'm thinking 3, 1, then 2
17:17:26 <verm__> 3 has the database
17:17:42 <verm__> we can live without ns we have backups
17:17:48 <verm__> mail will be held until it's back up
17:17:56 <djmitche> db's just for tests
17:17:56 <verm__> syslog we can do without there are local logs if we need to fill in the gap
17:18:18 <djmitche> 3, 1, 2 seems good
17:18:27 <verm__> mysql runs the database for trac
17:18:35 <verm__> so that will need to go down (devel.buildbot.net)
17:19:17 <djmitche> oh, haha
17:19:19 <djmitche> good point
17:19:24 <djmitche> #endmeeting