17:01:01 <phw> #startmeeting anti-censorship weekly checkin 2019-09-12
17:01:01 <MeetBot> Meeting started Thu Sep 12 17:01:01 2019 UTC.  The chair is phw. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:01:01 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
17:01:05 <phw> hi everyone!
17:01:15 <phw> for the record, here's our meeting pad: https://pad.riseup.net/p/tor-censorship-2019-keep
17:01:28 <cohosh> hi o/
17:02:14 * catalyst is kind of here
17:02:20 <phw> let me start with the first discussion item: are there services left that aren't monitored or (re)started automatically on crash/reboot?
17:03:13 <phw> i'm asking because we ran into an issue with gettor - it looks like it wasn't started automatically on boot? is this correct, hiro?
17:04:10 <phw> fwiw, anarcat once helped me set up a relatively simple systemd script to monitor/restart services: https://help.torproject.org/tsa/doc/services/
17:04:22 * anarcat o/
17:04:32 <anarcat> gettor should have a systemd --user process now, iirc
17:04:57 <phw> yes, thanks anarcat. i set it up for gettor and systemd now restarts the service when it crashes
17:05:02 <hiro> anarcat: before, sysadmins were against doing that... that's probably why there wasn't a script like that
17:05:21 <phw> (we have yet to see if it also does its job when the host reboots)
17:05:24 <anarcat> phw: awesome
17:05:30 <anarcat> hiro: ah
17:05:39 <anarcat> good thing i messed things up then :)
17:05:50 <hiro> the idea was that if a service were restarted automatically whenever it crashed, nobody would fix the reason for the crash
17:05:59 <phw> hiro: there's no possibility to monitor the gettor process from a separate machine, right? because both the http and smtp server are separate processes?
17:06:18 <hiro> the http server is apache serving a static html
17:06:23 <anarcat> hiro: oh right, that's a good point
17:06:32 <anarcat> hiro: i thought we were just starting the service in systemd
17:06:33 <phw> hiro: that's a good reason but that works only if we have sound monitoring that tells us when a service disappeared. we don't have that for gettor aiui
17:06:38 <anarcat> but yeah, systemd can also restart the service on crashes
17:06:47 <anarcat> i think that as long as people get notified on crashes, it's okay
17:06:48 <hiro> the smtp server is twistd
17:06:53 <arma2> phw: to answer your original "are there any other services" question, i am hoping we get to the point where you can show me a web page with a bunch of green lights on it. until we're there, i don't know how to know what got skipped. :)
17:06:54 <hiro> I am happy to have something that restarts the service
17:06:55 <anarcat> but i shouldn't hijack your meeting :)
17:07:13 <anarcat> arma2: working on that dashboard in grafana, actually
17:07:30 <phw> hiro: so if gettor crashes, nothing is listening on port 25 anymore?
17:07:52 <hiro> the system will still get the email but it will not be processed
17:08:08 <hiro> the point is that if gettor only sends email it doesn't need to be a twisted service
17:08:21 <hiro> it can be a python script that is called when the email hits the system
17:08:35 <hiro> it's just a postfix rule
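[Editor's note] A minimal sketch of the kind of script hiro describes: a short-lived Python process that the mail system hands each incoming message to on stdin, instead of a long-running twisted service. The keyword list, the function names, and the reply step are assumptions made for illustration; they are not taken from the gettor code base.

```python
import sys
from email import message_from_bytes, policy
from email.utils import parseaddr

# Hypothetical request keywords; the real gettor command set may differ.
KNOWN_PLATFORMS = ("windows", "osx", "linux")


def parse_request(raw):
    """Extract (sender address, requested platform or None) from a raw email."""
    msg = message_from_bytes(raw, policy=policy.default)
    _, sender = parseaddr(str(msg.get("From", "")))
    # Prefer the plain-text part; a message without one yields no request.
    body = msg.get_body(preferencelist=("plain",))
    words = body.get_content().lower().split() if body is not None else []
    platform = next((p for p in KNOWN_PLATFORMS if p in words), None)
    return sender, platform


def main():
    """Handle one message read from stdin (as piped in by the mail system)."""
    sender, platform = parse_request(sys.stdin.buffer.read())
    if not sender or platform is None:
        return 0  # nothing to act on
    # Sending the reply is left out; a real handler would do it here.
    print(f"would reply to {sender} with links for {platform}")
    return 0
```

With this shape nothing has to stay alive between messages: the mail delivery rule would invoke `main()` once per incoming email, so a crash affects one message rather than the whole service.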
17:08:36 <phw> i understand that but our problem right now is that gettor was dead for ~3 days and nobody noticed. how can we notice?
17:08:39 <anarcat> arma2: this is the current grafana homepage https://paste.anarc.at/publish/2019-09-12-7hOjxgi6jUg/screenshot.png
17:09:02 <hiro> maybe we can have system checks on the twisted process
17:09:05 <anarcat> phw: we should be monitoring the actual service
17:09:12 <anarcat> phw: send an email, check if we get tor back
17:09:13 <hiro> via nagios or something else
17:09:38 <anarcat> checking that all the bits underneath are all in place will not be as useful as a "does the thing actually work" test
17:09:47 <hiro> yeah we can send an email
17:09:58 <anarcat> you need to have an inbox somewhere to check if you get the reply i guess
17:10:01 <anarcat> but that can be arranged
17:10:06 <anarcat> nagios has checks to do this
17:10:21 <anarcat> i don't know about the politics of putting this in our nagios
17:10:33 <anarcat> but that seems like the best solution, technically
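[Editor's note] The end-to-end test anarcat suggests ("send an email, check if we get tor back") can be sketched roughly as follows. This is an illustration under assumptions, not an existing Nagios check: the addresses are placeholders, and `fetch_new` is a hypothetical callable that returns the raw messages currently in the monitoring inbox (for instance a thin wrapper around `imaplib`).

```python
import time
from email import message_from_bytes
from email.message import EmailMessage
from email.utils import parseaddr


def build_probe(sender, service_addr, body="linux"):
    """Build the request email the monitor sends to the service."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = service_addr
    msg["Subject"] = "monitoring probe"
    msg.set_content(body)
    return msg  # could be sent with smtplib.SMTP(...).send_message(msg)


def is_reply_from(raw, service_addr):
    """True if the raw message's From header matches the service address."""
    msg = message_from_bytes(raw)
    _, addr = parseaddr(str(msg.get("From", "")))
    return addr.lower() == service_addr.lower()


def wait_for_reply(fetch_new, service_addr, timeout=600.0, interval=30.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll the inbox until a reply from the service shows up or time runs out."""
    deadline = clock() + timeout
    while True:
        if any(is_reply_from(raw, service_addr) for raw in fetch_new()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A wrapper script could map the result onto the usual Nagios plugin exit codes (0 for OK, 2 for CRITICAL), so that a silent service raises an alert instead of going unnoticed for days.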
17:10:53 <phw> i don't have a strong preference but i think that some sort of service monitoring should be a priority. what do you think, hiro?
17:11:12 <hiro> looks good to me
17:12:37 <phw> ok, let's try to get this done asap. can you take the lead on that hiro?
17:12:42 <phw> (i'm happy to help however i can)
17:12:59 <anarcat> (same here)
17:13:16 <hiro> I can check how to do this in nagios
17:13:23 <phw> thanks!
17:13:38 <phw> fwiw, our main monitoring system is sysmon, run by gman999. the config is here: https://dip.torproject.org/torproject/anti-censorship/sysmon-configuration
17:14:15 <phw> it's limited though because it cannot follow http redirects. still, it does a good job at basic tcp reachability tests and has notified us of default bridges going offline
17:14:38 <phw> i realise that a mix of sysmon, nagios, and others is not optimal but scattered monitoring is better than no monitoring
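[Editor's note] For readers unfamiliar with the "basic tcp reachability tests" phw mentions, the idea is roughly the following. This is a generic Python sketch, not sysmon's implementation or configuration; the function names are made up for illustration.

```python
import socket


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_all(targets, timeout=5.0):
    """Map each (host, port) target to its reachability."""
    return {target: tcp_reachable(*target, timeout=timeout) for target in targets}
```

A check like this only shows that something accepts connections on the port. It says nothing about whether the service behind it works, which is why it complements an end-to-end test rather than replacing one.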
17:15:56 <phw> (i also experimented with monit on my laptop, which i use to monitor a set of private obfs4 bridges that we will hand out to an NGO)
17:17:00 <phw> ok, that's it from my side wrt monitoring and reliability. any more thoughts?
17:17:47 <phw> next is a link to google's reviewing guidelines, which i found interesting and worth a skim: https://google.github.io/eng-practices/review/reviewer/
17:18:36 <phw> i think we can learn from some of their experience
17:18:40 <cohosh> thanks for the thoughts, i can be better about doing reviews earlier in the week for sure
17:19:11 <arma2> the network team also periodically tries to prioritize reviews,
17:19:14 <phw> cohosh: right, i was thinking about gettor. if hiro has only thursday to work on it, we should be done with our reviews by wednesday the week after, so hiro doesn't block on us
17:19:26 <arma2> especially because a new volunteer will get hooked if they get feedback and attention, and wander away if they don't
17:19:40 <arma2> they've found it hard to be consistent with that priority though
17:19:47 <cohosh> yep maes sense
17:19:58 <cohosh> *makes
17:20:16 <hiro> phw I tend to do more things on thu but I have always other stuff so I do gettor when I can ... as I do all the other things
17:20:28 <hiro> so reviews do not really block me
17:20:56 <hiro> if I am blocked I ping you guys and let you know
17:21:02 <phw> hiro: gotcha
17:21:34 <cohosh> hiro: thanks :)
17:21:51 <phw> i think our weekly cycle works reasonably well but we should speed up things a bit if it's helpful
17:22:24 <cohosh> yeah i think i will try to prioritize reviews a bit more, was putting them in a giant bucket of "things to do by the next meeting"
17:24:04 <phw> another thing i liked in google's reviewing guidelines is to prefix suggestions with "nitpick" if they're worth pointing out but not necessary to incorporate
17:24:43 <phw> i can sometimes think of nicer ways to accomplish something but i don't want to drown somebody in minor feedback. that may be discouraging
17:25:00 <phw> the idea of "nitpick" is to say "hey, this is worth noting but feel free to ignore"
17:25:46 <cohosh> cool
17:25:56 <phw> (and as reviewee i would like to learn about all the ways to improve my code, even if i don't end up incorporating everything)
17:26:11 <phw> anyway, it's a useful document!
17:26:39 <phw> shall we move on to our 'needs help with' sections?
17:27:15 <cohosh> sounds good
17:27:47 <phw> hiro has 'probably more reviews' :)
17:27:50 <phw> keep em coming!
17:28:38 <hiro> yes I might have some more reviews as I incorporate your feedback and fix a few more pending things
17:28:58 <phw> cool, i'm happy to take a look
17:30:12 <phw> another thing related to gettor: i didn't mean to step on your toes with the systemd script, hiro. whenever i touch things on getulum, i try to document it and let you know. but please let me know if you can think of a better process
17:30:32 <phw> (generally, i prefer not to touch anything without talking to you first)
17:30:43 <hiro> you need to feel free to touch it actually
17:30:51 <hiro> that's why I am setting up the ansible recipe
17:31:19 <phw> ok, gotcha
17:31:23 <hiro> my idea is that the ansible playbooks can run via cron and restart and/or update the system
17:32:05 <hiro> if everything runs via ansible there are no hidden scripts
17:32:25 <hiro> and everyone working on the service can see and improve the code
17:32:38 <phw> that's great, thanks for working on this
17:33:34 <phw> coming back to reviews: i only have one for now, #31692. it's a minor change to our docker image. can you take a look, cohosh?
17:33:55 <cohosh> phw: yup
17:34:20 <phw> next up is #29206
17:34:30 <cohosh> i think dcf1 has started reviewing it
17:34:41 <dcf1> yes I am
17:34:41 <phw> right, that's good
17:34:54 <cohosh> thanks
17:34:54 <phw> oh, you aren't absent after all, dcf1!
17:35:26 <dcf1> no I amn't
17:36:02 <phw> the last review seems to be #31455. i sent sina an email last week but haven't heard back yet.
17:36:27 <phw> that's a bummer.. he usually responds swiftly to bridge-related emails
17:37:27 <phw> is there anyone else at cymru that we could ask? rabbi rob maybe?
17:38:55 <arma2> who runs it there? is it sina or is it a generic cymru service?
17:39:14 <phw> i believe it's sina, right dcf1?
17:39:19 <dcf1> it's sina
17:40:08 <arma2> ok. then other cymru folks won't be so helpful. maybe follow up and ask how it's going and cc me?
17:40:37 <arma2> last night i started answering a batch of urgent mails from july. and my watch tells me it's no longer july. so, this happens. :/
17:40:58 * phw sent a reminder
17:41:28 <phw> looks like that's it for today. anything else?
17:42:06 <cohosh> not from me
17:42:12 <phw> alrighty, let's wrap it up
17:42:14 <phw> #endmeeting