Chromium Code Reviews
Description: Change daemonize logic in watchdog and add timeout to file system read.
Daemonize via fork made goroutines behave strangely, so this uses
exec instead.
Reading from /proc/uptime can hang indefinitely on some phones. This
adds a timeout.
BUG=632895
Committed: https://chromium.googlesource.com/infra/infra/+/d026d656c86ae12a6e180630fd3f6185c788ca28
Patch Set 1 #
Total comments: 9
Patch Set 2 : Move getuptime to a single goroutine that responds to RPCs #
Total comments: 18
Patch Set 3 : comments #
Patch Set 4 : Use github.com/VividCortex/godaemon instead of home-spun version. #
Patch Set 5 : rebase #
Patch Set 6 : pin gocloud to bdbde4d... #
Patch Set 7 : add todo #
Messages
Total messages: 26 (11 generated)
bpastene@chromium.org changed reviewers: + dnj@chromium.org
ptal whenever you get a chance :) https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... File go/src/infra/tools/device_watchdog/main.go (right): https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... go/src/infra/tools/device_watchdog/main.go:142: go getDeviceUptime(c) Uses a goroutine + select statement to add a timeout to getUptime. Let me know if there's a cleaner/more Go-friendly way of handling timeouts.
https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... File go/src/infra/tools/device_watchdog/main.go (right): https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... go/src/infra/tools/device_watchdog/main.go:65: func daemonize(maxUptime int) (int, error) { This isn't a full daemonize workflow. It doesn't setsid() or (in this case) handle file descriptor duplication. This is the sort of thing you need to do: https://github.com/sevlyar/go-daemon/blob/master/daemon_posix.go#L108 You were already sort of doing it before, but maybe review that for any improvements (the SysProcAttr struct is interesting). https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... go/src/infra/tools/device_watchdog/main.go:86: func getDeviceUptime(c chan uptimeResult) { Make this a directional channel: c chan<- uptimeResult https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... go/src/infra/tools/device_watchdog/main.go:142: go getDeviceUptime(c) Select is correct, but you launch a new goroutine each time and don't make a point of reaping the previous goroutines, which could lead to accumulating goroutines if they are lagging behind. I think you need to close that loop for sure. You should also make sure to close the channels when you know you're done with them if you can. The timeout is more complicated. You want to limit the number of goroutines that you spawn, but turn a synchronous operation into an asynchronous one. Along these lines, I think the thing to do is to have a "get the uptime" function that runs in a goroutine and loops, receiving "get uptime" requests and returning responses. This will limit your goroutine to one, and also let you test if you're blocking. First, create the monitor. Pass it an RPC request channel. The channel will, itself, receive response channels. 
Each time a response channel is received, that's a new RPC request, and it will get the uptime and push the result onto the response channel.

type uptimeRPCChannel chan<- chan<- uptimeResponse

func uptimeGetter(reqC uptimeRPCChannel) {
}

// Now, let's wrap that in a "get uptime or timeout" function. We'll add some
// additional logic to handle blocking gracefully. We'll mark a timeout using a
// special error code.
//
// Wholly encapsulating the evils of the monitor and timeout in a synchronous
// function makes the rest of the program more readable, testable, and
// maintainable.
var errTimeout = errors.New("timeout!")

func getUptime(reqC uptimeRPCChannel, timeout time.Duration) (time.Duration, error) {
	// Create a response channel. Buffer it so that if we time out, the
	// goroutine doesn't choke/block.
	respC := make(chan uptimeResponse, 1)

	// Start our timeout timer.
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	// Try and send the RPC request. This could fail if the RPC goroutine is
	// blocked.
	select {
	case reqC <- respC:
	case <-timer.C:
		return 0, errTimeout
	}

	// Read the RPC response. This could fail if the request causes the RPC
	// goroutine to block.
	select {
	case resp := <-respC:
		return resp.uptime, resp.err
	case <-timer.C:
		return 0, errTimeout
	}
}

// Now, you can actually use this stuff.
func main() {
	// ...

	// Start the uptime RPC goroutine.
	reqC := make(chan chan<- uptimeResponse)
	go func() {
		for req := range reqC {
			uptime, err := getUptimeFromFile()
			req <- uptimeResponse{uptime, err}
		}
	}()
	// Probably not necessary, but cleanup and exit conditions are good behavior.
	defer close(reqC)

	// Main loop.
	for {
		// ...
		uptime, err := getUptime(reqC, 5*time.Second)
		switch err {
		case nil:
			// Valid uptime, yay.
		case errTimeout:
			// Got a timeout.
		default:
			// Some other error.
		}
	}
}

If you want to get fancy, you can start using https://github.com/luci/luci-go/tree/master/common/clock. This will let you actually simulate the clock in a test, which is awesome because you can actually test this logic and assert its correctness.

https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... go/src/infra/tools/device_watchdog/main.go:167: time.Sleep(maxUptime - uptimeRes.Uptime + time.Second) (Before the world forgets, add a comment about why you add a second here).
I'll look into splitting the RPC logic into its own package and then adding tests to that. But this is what I've got so far. Let me know what you think! https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... File go/src/infra/tools/device_watchdog/main.go (right): https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... go/src/infra/tools/device_watchdog/main.go:65: func daemonize(maxUptime int) (int, error) { On 2016/09/02 16:17:32, dnj wrote: > This isn't a full daemonize workflow. It doesn't setsid() or (in this case) > handle file descriptor duplication. > > This is the sort of thing you need to do: > https://github.com/sevlyar/go-daemon/blob/master/daemon_posix.go#L108 > > You were already sort of doing it before, but maybe review that for any > improvements (the SysProcAttr struct is interesting). There were problems with threads not correctly getting scheduled. I kept all the forking logic and just threw an exec at the end. That fixes it. https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... go/src/infra/tools/device_watchdog/main.go:86: func getDeviceUptime(c chan uptimeResult) { On 2016/09/02 16:17:32, dnj wrote: > Make this a directional channel: > c chan<- uptimeResult Done. https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... go/src/infra/tools/device_watchdog/main.go:142: go getDeviceUptime(c) On 2016/09/02 16:17:32, dnj wrote: > Select is correct, but you launch a new goroutine each time and don't make a > point of reaping the previous goroutines, which could lead to accumulating > goroutines if they are lagging behind. I think you need to close that loop for > sure. You should also make sure to close the channels when you know you're done > with them if you can. Thanks for the suggestion Dan. I added the single "uptimeGetter" goroutine like you outlined that just responds to RPCs via channels.
That should prevent multiple goroutines from getting created if they fail/timeout. https://codereview.chromium.org/2302193002/diff/1/go/src/infra/tools/device_w... go/src/infra/tools/device_watchdog/main.go:167: time.Sleep(maxUptime - uptimeRes.Uptime + time.Second) On 2016/09/02 16:17:32, dnj wrote: > (Before the world forgets, add a comment about why you add a second here). Done.
Looking pretty good, a few comments. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... File go/src/infra/tools/device_watchdog/main.go (right): https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:74: func daemonize(maxUptime int) (int, error) { WDYT about having daemonize take an args slice instead of manually recreating the flags? You can just forward that directly to the spawned process. Note that for "Exec"-style execution, argv[0] should be forwarded and will be the executable's name. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:75: ret, _, errno := syscall.Syscall(syscall.SYS_FORK, 0, 0, 0) Forking here may be a little sketchy. In other daemonize discussions, I see the following outline: 1) Setsid, StartProcess(self, args), Exit 2) StartProcess(self, args), Exit 3) We're a daemon! You're doing (1) right now via SYS_FORK, which you point out is potentially risky b/c of goroutines and stuff. You may still have this risk in the current code. Up to you if you want to try and solve it now or count on this being a small enough platform to ignore it. Note that Go runtime has other goroutines/threads running even if you didn't directly spawn them (e.g., garbage collector, signal monitor), so this might be worth your time. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... 
go/src/infra/tools/device_watchdog/main.go:101: err = cmd.Start() You can avoid some junk by using "os.StartProcess" here:

argv = append(argv, "--skip-daemonize")
wd, _ := os.Getwd()
os.StartProcess(binary, argv, &os.ProcAttr{
	Dir:   wd,
	Env:   os.Environ(),
	Files: []*os.File{f, f, f},
})

You can even avoid the "--skip-daemonize" if you want to use environment variables:

var daemonizeEnv = "_INFRA_DAEMONIZE_MARKER"

func clearDaemonizedMarker() bool {
	if _, has := os.LookupEnv(daemonizeEnv); has {
		if err := os.Unsetenv(daemonizeEnv); err != nil {
			panic(err)
		}
		return true
	}
	return false
}

Then, at the beginning of your application, you can check (and clear):

if !clearDaemonizedMarker() {
	// Daemonize / Exit
}
// I'm a daemon, move on with my life.

https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:133: func getUptime(requestQueue chan<- chan uptimeResult, timeoutPeriod time.Duration) (time.Duration, error) { nit: chan<- chan<- uptimeResult In other words, a channel that can be written to, and you can write to it channels that can be written to. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:196: maxTimeouts := 5 nit: "const maxTimeouts = 5" https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:208: if consecutiveTimeouts == maxTimeouts { nit: Just being paranoid, but might as well make this ">=". https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:209: logcatLog(logError, "%d consective timeouts when fetching uptime. Triggering reboot", maxTimeouts) nit: currentTimeouts, not maxTimeouts. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi...
go/src/infra/tools/device_watchdog/main.go:211: } else if consecutiveTimeouts > 0 { nit: No need for "else if", since the previous statement will exit this loop. Just make this another "if".
https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... File go/src/infra/tools/device_watchdog/main.go (right): https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:74: func daemonize(maxUptime int) (int, error) { On 2016/09/03 00:48:07, dnj wrote: > WDYT about having daemonize take an args slice instead of manually recreating > the flags? You can just forward that directly to the spawned process. Note that > for "Exec"-style execution, argv[0] should be forwarded and will be the > executable's name. Done, but I'd rather replace argv[0] with /proc/self/exe in case it's run from somewhere else. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:75: ret, _, errno := syscall.Syscall(syscall.SYS_FORK, 0, 0, 0) On 2016/09/03 00:48:07, dnj wrote: > Forking here may be a little sketchy. In other daemonize discussions, I see the > following outline: > > 1) Setsid, StartProcess(self, args), Exit > 2) StartProcess(self, args), Exit > 3) We're a daemon! > > You're doing (1) right now via SYS_FORK, which you point out is potentially > risky b/c of goroutines and stuff. You may still have this risk in the current > code. Up to you if you want to try and solve it now or count on this being a > small enough platform to ignore it. Note that Go runtime has other > goroutines/threads running even if you didn't directly spawn them (e.g., garbage > collector, signal monitor), so this might be worth your time. I was modeling it after https://chromium.googlesource.com/chromium/tools/build/+/master/scripts/slave... (which is in production/used in a lot of places) but replacing the last fork with a subprocess call instead. I could replace the first fork with another subprocess call, but how would I get the subprocess to resume execution right where its parent left off (to prevent a fork bomb)? Another arg? 
I think it's just simpler/cleaner to keep it a fork, especially since the 20 or so lines afterward are all that it needs to execute, which it seems to do just fine. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:101: err = cmd.Start() On 2016/09/03 00:48:07, dnj wrote: > You can avoid some junk by using "os.StartProcess" here: I'm not sure what junk I'm avoiding, but done. (I feel like there's too many ways to start a subprocess in go >.> ) > You can even avoid the "--skip-daemonize" if you want to use environment > variables: What's the advantage of using an env var? I much prefer making it an explicit argument. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:133: func getUptime(requestQueue chan<- chan uptimeResult, timeoutPeriod time.Duration) (time.Duration, error) { On 2016/09/03 00:48:07, dnj wrote: > nit: chan<- chan<- uptimeResult > > In other words, a channel that can be written to, and you can write to it > channels that can be written to. Done, but aren't I reading from those channels later on at line 153? I guess that restriction only applies to channels that you receive from the channel, not channels you create yourself? https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:196: maxTimeouts := 5 On 2016/09/03 00:48:07, dnj wrote: > nit: "const maxTimeouts = 5" Done. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:208: if consecutiveTimeouts == maxTimeouts { On 2016/09/03 00:48:07, dnj wrote: > nit: Just being paranoid, but might as well make this ">=". Done. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:209: logcatLog(logError, "%d consective timeouts when fetching uptime. 
Triggering reboot", maxTimeouts) On 2016/09/03 00:48:07, dnj wrote: > nit: currentTimeouts, not maxTimeouts. Done. https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:211: } else if consecutiveTimeouts > 0 { On 2016/09/03 00:48:07, dnj wrote: > nit: No need for "else if", since the previous statement will exit this loop. > Just make this another "if". Done.
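(The timeout-bookkeeping nits above — const maxTimeouts, the ">=" comparison, resetting the counter on a successful read — boil down to logic along these lines; shouldReboot and its inputs are illustrative, not the watchdog's actual code.)

```go
package main

import (
	"errors"
	"fmt"
)

const maxTimeouts = 5 // per dnj's nit: a const, not a local variable

var errTimeout = errors.New("timeout!")

// shouldReboot scans a sequence of getUptime results and reports whether
// maxTimeouts consecutive timeouts occurred (the reboot condition).
func shouldReboot(results []error) bool {
	consecutiveTimeouts := 0
	for _, err := range results {
		switch err {
		case nil:
			consecutiveTimeouts = 0 // any successful read resets the streak
		case errTimeout:
			consecutiveTimeouts++
			if consecutiveTimeouts >= maxTimeouts { // ">=" per the paranoia nit
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(shouldReboot([]error{errTimeout, errTimeout, nil, errTimeout}))
	fmt.Println(shouldReboot([]error{errTimeout, errTimeout, errTimeout, errTimeout, errTimeout}))
}
```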
https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... File go/src/infra/tools/device_watchdog/main.go (right): https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:75: ret, _, errno := syscall.Syscall(syscall.SYS_FORK, 0, 0, 0) On 2016/09/09 00:51:43, bpastene wrote: > On 2016/09/03 00:48:07, dnj wrote: > > Forking here may be a little sketchy. In other daemonize discussions, I see > the > > following outline: > > > > 1) Setsid, StartProcess(self, args), Exit > > 2) StartProcess(self, args), Exit > > 3) We're a daemon! > > > > You're doing (1) right now via SYS_FORK, which you point out is potentially > > risky b/c of goroutines and stuff. You may still have this risk in the current > > code. Up to you if you want to try and solve it now or count on this being a > > small enough platform to ignore it. Note that Go runtime has other > > goroutines/threads running even if you didn't directly spawn them (e.g., > garbage > > collector, signal monitor), so this might be worth your time. > > I was modeling it after > https://chromium.googlesource.com/chromium/tools/build/+/master/scripts/slave... > (which is in production/used in a lot of places) but replacing the last fork > with a subprocess call instead. > > I could replace the first fork with another subprocess call, but how would I get > the subprocess to resume execution right where its parent left off (to prevent a > fork bomb)? Another arg? I think it's just simpler/cleaner to keep it a fork, > especially since the 20 or so lines afterward are all that it needs to execute, > which it seems to do just fine. Nevermind. This is now all totally obsolete since you found the awesome github.com/VividCortex/godaemon package.
lgtm https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... File go/src/infra/tools/device_watchdog/main.go (right): https://codereview.chromium.org/2302193002/diff/20001/go/src/infra/tools/devi... go/src/infra/tools/device_watchdog/main.go:133: func getUptime(requestQueue chan<- chan uptimeResult, timeoutPeriod time.Duration) (time.Duration, error) { On 2016/09/09 00:51:43, bpastene wrote: > On 2016/09/03 00:48:07, dnj wrote: > > nit: chan<- chan<- uptimeResult > > > > In other words, a channel that can be written to, and you can write to it > > channels that can be written to. > > Done, but aren't I reading from those channels later on at line 153? I guess > that restriction only applies to channels that you receive from the channel, not > channels you create yourself? Yeah a bidirectional channel can be converted into a monodirectional channel, so "chan chan int" can be turned into "chan <- chan <- int"; just not the other way around. Here you're creating a bidirectional, but only granting this method access to one of those directions.
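(The one-way conversion dnj describes can be demonstrated in a few lines; produce and consume are toy functions, not from the CL.)

```go
package main

import "fmt"

// consume sees only the receive end; sending on c here would be a compile error.
func consume(c <-chan int) int { return <-c }

// produce sees only the send end; receiving from c here would be a compile error.
func produce(c chan<- int, v int) { c <- v }

func main() {
	c := make(chan int, 1)  // bidirectional at the creation site
	produce(c, 7)           // implicit conversion to chan<- int
	fmt.Println(consume(c)) // implicit conversion to <-chan int
	// The reverse is not allowed: a chan<- int can never be converted
	// back into a chan int or a <-chan int.
}
```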
The CQ bit was checked by bpastene@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Infra Linux Trusty 64 Tester on luci.infra.try (JOB_FAILED, https://luci-milo.appspot.com/swarming/task/312953eab1048610) Infra Presubmit on luci.infra.try (JOB_FAILED, https://luci-milo.appspot.com/swarming/task/312953eab7b95c10)
The CQ bit was checked by bpastene@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from dnj@chromium.org Link to the patchset: https://codereview.chromium.org/2302193002/#ps80001 (title: "rebase")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Infra Linux Trusty 64 Tester on luci.infra.try (JOB_FAILED, https://luci-milo.appspot.com/swarming/task/31296067c207ac10)
The CQ bit was checked by bpastene@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Infra Win Tester on luci.infra.try (JOB_FAILED, https://luci-milo.appspot.com/swarming/task/31299ccd3a77ca10)
The CQ bit was checked by bpastene@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from dnj@chromium.org Link to the patchset: https://codereview.chromium.org/2302193002/#ps120001 (title: "add todo")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Message was sent while issue was closed.
Description was changed from ========== Change daemonize logic in watchdog and add timeout to file system read. Daemonize via fork made goroutines behave strangely, so this uses exec instead. Reading from /proc/uptime can hang indefinitely on some phones. This adds a timeout. BUG=632895 ========== to ========== Change daemonize logic in watchdog and add timeout to file system read. Daemonize via fork made goroutines behave strangely, so this uses exec instead. Reading from /proc/uptime can hang indefinitely on some phones. This adds a timeout. BUG=632895 Committed: https://chromium.googlesource.com/infra/infra/+/d026d656c86ae12a6e180630fd3f6... ==========
Message was sent while issue was closed.
Committed patchset #7 (id:120001) as https://chromium.googlesource.com/infra/infra/+/d026d656c86ae12a6e180630fd3f6...
