Chromium Code Reviews
Description: Change daemonize logic in watchdog and add timeout to file system read.
Daemonize via fork made goroutines behave strangely, so this uses
exec instead.
Reading from /proc/uptime can hang indefinitely on some phones. This
adds a timeout.
BUG=632895
Committed: https://chromium.googlesource.com/infra/infra/+/d026d656c86ae12a6e180630fd3f6185c788ca28
Patch Set 1 #
Total comments: 9
Patch Set 2 : Move getuptime to a single goroutine that responds to RPCs #
Total comments: 18
Patch Set 3 : comments #
Patch Set 4 : Use github.com/VividCortex/godaemon instead of home-spun version. #
Patch Set 5 : rebase #
Patch Set 6 : pin gocloud to bdbde4d... #
Patch Set 7 : add todo #
Messages
Total messages: 26 (11 generated)
bpastene@chromium.org changed reviewers: + dnj@chromium.org
ptal whenever you get a chance :) https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... File go/src/infra/tools/device_watchdog/main.go (right): https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... go/src/infra/tools/device_watchdog/main.go:142: go getDeviceUptime(c) Uses a goroutine + select statement to add a timeout to getUptime. Let me know if there's a cleaner/more Go-friendly way of handling timeouts.
https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... File go/src/infra/tools/device_watchdog/main.go (right): https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... go/src/infra/tools/device_watchdog/main.go:65: func daemonize(maxUptime int) (int, error) { This isn't a full daemonize workflow. It doesn't setsid() or (in this case) handle file descriptor duplication. This is the sort of thing you need to do: https://github.com/sevlyar/go-daemon/blob/master/daemon_posix.go#L108 You were already sort of doing it before, but maybe review that for any improvements (the SysProcAttr struct is interesting). https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... go/src/infra/tools/device_watchdog/main.go:86: func getDeviceUptime(c chan uptimeResult) { Make this a directional channel: c chan<- uptimeResult https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... go/src/infra/tools/device_watchdog/main.go:142: go getDeviceUptime(c) Select is correct, but you launch a new goroutine each time and don't make a point of reaping the previous goroutines, which could lead to accumulating goroutines if they are lagging behind. I think you need to close that loop for sure. You should also make sure to close the channels when you know you're done with them if you can. The timeout is more complicated. You want to limit the number of goroutines that you spawn, but turn a synchronous operation into an asynchronous one. Along these lines, I think the thing to do is to have a "get the uptime" function that runs in a goroutine and loops, receiving "get uptime" requests and returning responses. This will limit your goroutine to one, and also let you test if you're blocking. First, create the monitor. Pass it an RPC request channel. The channel will, itself, receive response channels. 
Each time a response channel is received, that's a new RPC request, and it will get the uptime and push the result onto the response channel.

type uptimeRPCChannel chan<- chan<- uptimeResponse

func uptimeGetter(reqC uptimeRPCChannel) {
}

// Now, let's wrap that in a "get uptime or timeout" function. We'll add some
// additional logic to handle blocking gracefully. We'll mark a timeout using a
// special error code.
//
// Wholly encapsulating the evils of the monitor and timeout in a synchronous
// function makes the rest of the program more readable, testable, and
// maintainable.
var errTimeout = errors.New("timeout!")

func getUptime(reqC uptimeRPCChannel, timeout time.Duration) (time.Duration, error) {
	// Create a response channel. Buffer it so that if we time out, the
	// goroutine doesn't choke/block.
	respC := make(chan uptimeResponse, 1)

	// Start our timeout timer.
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	// Try and send the RPC request. This could fail if the RPC goroutine is
	// blocked.
	select {
	case reqC <- respC:
	case <-timer.C:
		return 0, errTimeout
	}

	// Read the RPC response. This could fail if the request causes the RPC
	// goroutine to block.
	select {
	case resp := <-respC:
		return resp.uptime, resp.err
	case <-timer.C:
		return 0, errTimeout
	}
}

// Now, you can actually use this stuff.
func main() {
	// ...

	// Start the uptime RPC goroutine.
	reqC := make(chan chan<- uptimeResponse)
	go func() {
		for req := range reqC {
			uptime, err := getUptimeFromFile()
			req <- uptimeResponse{uptime, err}
		}
	}()
	// Probably not necessary, but cleanup and exit conditions are good behavior.
	defer close(reqC)

	// Main loop.
	for {
		// ...
		uptime, err := getUptime(reqC, 5*time.Second)
		switch err {
		case nil:
			// Valid uptime, yay.
		case errTimeout:
			// Got a timeout.
		default:
			// Some other error.
		}
	}
}

If you want to get fancy, you can start using https://github.com/luci/luci-go/tree/master/common/clock. This will let you actually simulate the clock in a test, which is awesome because you can actually test this logic and assert its correctness.

https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... go/src/infra/tools/device_watchdog/main.go:167: time.Sleep(maxUptime - uptimeRes.Uptime + time.Second) (Before the world forgets, add a comment about why you add a second here).
I'll look into splitting the RPC logic into its own package and then adding tests to that. But this is what I've got so far. Let me know what you think! https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... File go/src/infra/tools/device_watchdog/main.go (right): https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... go/src/infra/tools/device_watchdog/main.go:65: func daemonize(maxUptime int) (int, error) { On 2016/09/02 16:17:32, dnj wrote: > This isn't a full daemonize workflow. It doesn't setsid() or (in this case) > handle file descriptor duplication. > > This is the sort of thing you need to do: > https://github.com/sevlyar/go-daemon/blob/master/daemon_posix.go#L108 > > You were already sort of doing it before, but maybe review that for any > improvements (the SysProcAttr struct is interesting). There were problems with threads not correctly getting scheduled. I kept all the forking logic and just threw an exec at the end. That fixes it. https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... go/src/infra/tools/device_watchdog/main.go:86: func getDeviceUptime(c chan uptimeResult) { On 2016/09/02 16:17:32, dnj wrote: > Make this a directional channel: > c chan<- uptimeResult Done. https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... go/src/infra/tools/device_watchdog/main.go:142: go getDeviceUptime(c) On 2016/09/02 16:17:32, dnj wrote: > Select is correct, but you launch a new goroutine each time and don't make a > point of reaping the previous goroutines, which could lead to accumulating > goroutines if they are lagging behind. I think you need to close that loop for > sure. You should also make sure to close the channels when you know you're done > with them if you can. Thanks for the suggestion Dan. I added the single "uptimeGetter" goroutine like you outlined that just responds to RPCs via channels.
That should prevent multiple goroutines from getting created if they fail/timeout. https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... go/src/infra/tools/device_watchdog/main.go:167: time.Sleep(maxUptime - uptimeRes.Uptime + time.Second) On 2016/09/02 16:17:32, dnj wrote: > (Before the world forgets, add a comment about why you add a second here). Done.
Looking pretty good, a few comments. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... File go/src/infra/tools/device_watchdog/main.go (right): https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:74: func daemonize(maxUptime int) (int, error) { WDYT about having daemonize take an args slice instead of manually recreating the flags? You can just forward that directly to the spawned process. Note that for "Exec"-style execution, argv[0] should be forwarded and will be the executable's name. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:75: ret, _, errno := syscall.Syscall(syscall.SYS_FORK, 0, 0, 0) Forking here may be a little sketchy. In other daemonize discussions, I see the following outline: 1) Setsid, StartProcess(self, args), Exit 2) StartProcess(self, args), Exit 3) We're a daemon! You're doing (1) right now via SYS_FORK, which you point out is potentially risky b/c of goroutines and stuff. You may still have this risk in the current code. Up to you if you want to try and solve it now or count on this being a small enough platform to ignore it. Note that Go runtime has other goroutines/threads running even if you didn't directly spawn them (e.g., garbage collector, signal monitor), so this might be worth your time. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... 
go/src/infra/tools/device_watchdog/main.go:101: err = cmd.Start() You can avoid some junk by using "os.StartProcess" here:

argv = append(argv, "--skip-daemonize")
wd, _ := os.Getwd()
os.StartProcess(binary, argv, &os.ProcAttr{
	Dir:   wd,
	Env:   os.Environ(),
	Files: []*os.File{f, f, f},
})

You can even avoid the "--skip-daemonize" if you want to use environment variables:

var daemonizeEnv = "_INFRA_DAEMONIZE_MARKER"

func clearDaemonizedMarker() bool {
	if _, has := os.LookupEnv(daemonizeEnv); has {
		if err := os.Unsetenv(daemonizeEnv); err != nil {
			panic(err)
		}
		return true
	}
	return false
}

Then, at the beginning of your application, you can check (and clear):

if !clearDaemonizedMarker() {
	// Daemonize / Exit
}
// I'm a daemon, move on with my life.

https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:133: func getUptime(requestQueue chan<- chan uptimeResult, timeoutPeriod time.Duration) (time.Duration, error) { nit: chan<- chan<- uptimeResult In other words, a channel that can be written to, and you can write to it channels that can be written to. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:196: maxTimeouts := 5 nit: "const maxTimeouts = 5" https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:208: if consecutiveTimeouts == maxTimeouts { nit: Just being paranoid, but might as well make this ">=". https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:209: logcatLog(logError, "%d consective timeouts when fetching uptime. Triggering reboot", maxTimeouts) nit: currentTimeouts, not maxTimeouts. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi...
go/src/infra/tools/device_watchdog/main.go:211: } else if consecutiveTimeouts > 0 { nit: No need for "else if", since the previous statement will exit this loop. Just make this another "if".
https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... File go/src/infra/tools/device_watchdog/main.go (right): https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:74: func daemonize(maxUptime int) (int, error) { On 2016/09/03 00:48:07, dnj wrote: > WDYT about having daemonize take an args slice instead of manually recreating > the flags? You can just forward that directly to the spawned process. Note that > for "Exec"-style execution, argv[0] should be forwarded and will be the > executable's name. Done, but I'd rather replace argv[0] with /proc/self/exe in case it's run from somewhere else. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:75: ret, _, errno := syscall.Syscall(syscall.SYS_FORK, 0, 0, 0) On 2016/09/03 00:48:07, dnj wrote: > Forking here may be a little sketchy. In other daemonize discussions, I see the > following outline: > > 1) Setsid, StartProcess(self, args), Exit > 2) StartProcess(self, args), Exit > 3) We're a daemon! > > You're doing (1) right now via SYS_FORK, which you point out is potentially > risky b/c of goroutines and stuff. You may still have this risk in the current > code. Up to you if you want to try and solve it now or count on this being a > small enough platform to ignore it. Note that Go runtime has other > goroutines/threads running even if you didn't directly spawn them (e.g., garbage > collector, signal monitor), so this might be worth your time. I was modeling it after https://chromium.googlesource.com/chromium/tools/build/+/master/scripts/slave... (which is in production/used in a lot of places) but replacing the last fork with a subprocess call instead. I could replace the first fork with another subprocess call, but how would I get the subprocess to resume execution right where its parent left off (to prevent a fork bomb)? Another arg? 
I think it's just simpler/cleaner to keep it a fork, especially since the 20 or so lines afterward are all that it needs to execute, which it seems to do just fine. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:101: err = cmd.Start() On 2016/09/03 00:48:07, dnj wrote: > You can avoid some junk by using "os.StartProcess" here: I'm not sure what junk I'm avoiding, but done. (I feel like there's too many ways to start a subprocess in go >.> ) > You can even avoid the "--skip-daemonize" if you want to use environment > variables: What's the advantage of using an env var? I much prefer making it an explicit argument. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:133: func getUptime(requestQueue chan<- chan uptimeResult, timeoutPeriod time.Duration) (time.Duration, error) { On 2016/09/03 00:48:07, dnj wrote: > nit: chan<- chan<- uptimeResult > > In other words, a channel that can be written to, and you can write to it > channels that can be written to. Done, but aren't I reading from those channels later on at line 153? I guess that restriction only applies to channels that you receive from the channel, not channels you create yourself? https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:196: maxTimeouts := 5 On 2016/09/03 00:48:07, dnj wrote: > nit: "const maxTimeouts = 5" Done. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:208: if consecutiveTimeouts == maxTimeouts { On 2016/09/03 00:48:07, dnj wrote: > nit: Just being paranoid, but might as well make this ">=". Done. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:209: logcatLog(logError, "%d consective timeouts when fetching uptime. 
Triggering reboot", maxTimeouts) On 2016/09/03 00:48:07, dnj wrote: > nit: currentTimeouts, not maxTimeouts. Done. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:211: } else if consecutiveTimeouts > 0 { On 2016/09/03 00:48:07, dnj wrote: > nit: No need for "else if", since the previous statement will exit this loop. > Just make this another "if". Done.
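(The timeout-bookkeeping nits above — const maxTimeouts, the ">=" comparison, resetting the counter on a successful read — boil down to logic along these lines; shouldReboot and its inputs are illustrative, not the watchdog's actual code.)

```go
package main

import (
	"errors"
	"fmt"
)

const maxTimeouts = 5 // per dnj's nit: a const, not a local variable

var errTimeout = errors.New("timeout!")

// shouldReboot scans a sequence of getUptime results and reports whether
// maxTimeouts consecutive timeouts occurred (the reboot condition).
func shouldReboot(results []error) bool {
	consecutiveTimeouts := 0
	for _, err := range results {
		switch err {
		case nil:
			consecutiveTimeouts = 0 // any successful read resets the streak
		case errTimeout:
			consecutiveTimeouts++
			if consecutiveTimeouts >= maxTimeouts { // ">=" per the paranoia nit
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(shouldReboot([]error{errTimeout, errTimeout, nil, errTimeout}))
	fmt.Println(shouldReboot([]error{errTimeout, errTimeout, errTimeout, errTimeout, errTimeout}))
}
```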
https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... File go/src/infra/tools/device_watchdog/main.go (right): https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:75: ret, _, errno := syscall.Syscall(syscall.SYS_FORK, 0, 0, 0) On 2016/09/09 00:51:43, bpastene wrote: > On 2016/09/03 00:48:07, dnj wrote: > > Forking here may be a little sketchy. In other daemonize discussions, I see > the > > following outline: > > > > 1) Setsid, StartProcess(self, args), Exit > > 2) StartProcess(self, args), Exit > > 3) We're a daemon! > > > > You're doing (1) right now via SYS_FORK, which you point out is potentially > > risky b/c of goroutines and stuff. You may still have this risk in the current > > code. Up to you if you want to try and solve it now or count on this being a > > small enough platform to ignore it. Note that Go runtime has other > > goroutines/threads running even if you didn't directly spawn them (e.g., > garbage > > collector, signal monitor), so this might be worth your time. > > I was modeling it after > https://chromium.googlesource.com/chromium/tools/build/+/master/scripts/slave... > (which is in production/used in a lot of places) but replacing the last fork > with a subprocess call instead. > > I could replace the first fork with another subprocess call, but how would I get > the subprocess to resume execution right where its parent left off (to prevent a > fork bomb)? Another arg? I think it's just simpler/cleaner to keep it a fork, > especially since the 20 or so lines afterward are all that it needs to execute, > which it seems to do just fine. Nevermind. This is now all totally obsolete since you found the awesome github.com/VividCortex/godaemon package.
lgtm https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... File go/src/infra/tools/device_watchdog/main.go (right): https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:133: func getUptime(requestQueue chan<- chan uptimeResult, timeoutPeriod time.Duration) (time.Duration, error) { On 2016/09/09 00:51:43, bpastene wrote: > On 2016/09/03 00:48:07, dnj wrote: > > nit: chan<- chan<- uptimeResult > > > > In other words, a channel that can be written to, and you can write to it > > channels that can be written to. > > Done, but aren't I reading from those channels later on at line 153? I guess > that restriction only applies to channels that you receive from the channel, not > channels you create yourself? Yeah a bidirectional channel can be converted into a monodirectional channel, so "chan chan int" can be turned into "chan <- chan <- int"; just not the other way around. Here you're creating a bidirectional, but only granting this method access to one of those directions.
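(The one-way conversion dnj describes can be demonstrated in a few lines; produce and consume are toy functions, not from the CL.)

```go
package main

import "fmt"

// consume sees only the receive end; sending on c here would be a compile error.
func consume(c <-chan int) int { return <-c }

// produce sees only the send end; receiving from c here would be a compile error.
func produce(c chan<- int, v int) { c <- v }

func main() {
	c := make(chan int, 1)  // bidirectional at the creation site
	produce(c, 7)           // implicit conversion to chan<- int
	fmt.Println(consume(c)) // implicit conversion to <-chan int
	// The reverse is not allowed: a chan<- int can never be converted
	// back into a chan int or a <-chan int.
}
```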
The CQ bit was checked by bpastene@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Infra Linux Trusty 64 Tester on luci.infra.try (JOB_FAILED, https://luci-milo.appspot.com/swarming/task/312953eab1048610) Infra Presubmit on luci.infra.try (JOB_FAILED, https://luci-milo.appspot.com/swarming/task/312953eab7b95c10)
The CQ bit was checked by bpastene@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from dnj@chromium.org Link to the patchset: https://codereview.chromium.org/2302193002/#ps80001 (title: "rebase")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Infra Linux Trusty 64 Tester on luci.infra.try (JOB_FAILED, https://luci-milo.appspot.com/swarming/task/31296067c207ac10)
The CQ bit was checked by bpastene@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Infra Win Tester on luci.infra.try (JOB_FAILED, https://luci-milo.appspot.com/swarming/task/31299ccd3a77ca10)
The CQ bit was checked by bpastene@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from dnj@chromium.org Link to the patchset: https://codereview.chromium.org/2302193002/#ps120001 (title: "add todo")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Message was sent while issue was closed.
Description was changed from ========== Change daemonize logic in watchdog and add timeout to file system read. Daemonize via fork made goroutines behave strangely, so this uses exec instead. Reading from /proc/uptime can hang indefinitely on some phones. This adds a timeout. BUG=632895 ========== to ========== Change daemonize logic in watchdog and add timeout to file system read. Daemonize via fork made goroutines behave strangely, so this uses exec instead. Reading from /proc/uptime can hang indefinitely on some phones. This adds a timeout. BUG=632895 Committed: https://chromium.googlesource.com/infra/infra/+/d026d656c86ae12a6e180630fd3f6... ==========
Message was sent while issue was closed.
Committed patchset #7 (id:120001) as https://chromium.googlesource.com/infra/infra/+/d026d656c86ae12a6e180630fd3f6...
