|
|
Chromium Code Reviews
Description[BattOr] Retry battor_agent commands when they fail.
Many battor_agent failures on the waterfall could be solved by just
retrying the command (e.g., StartTracing). Previously, the battor_agent
retried reads and a select few BattOr control messages when they failed
(e.g., INIT). This was incomplete, as there were many errors that did
not result in retried BattOr control messages.
This CL replaces this incomplete per-control-message retry logic with
retry logic for entire battor_agent commands. This will make the
battor_agent robust to intermittent failures of any BattOr control
message.
This required fixing the battor_agent overuse of the ambiguous "Unexpected Message" error. It also required replacing the timeout logic.
BUG=699581
Review-Url: https://codereview.chromium.org/2826913002
Cr-Commit-Position: refs/heads/master@{#472724}
Committed: https://chromium.googlesource.com/chromium/src/+/b8e9b5f2237d0fad703d20fb0a22d45b6ebdb047
Patch Set 1 #Patch Set 2 : Stored samples needed to be cleared before each retry of StopTracing. #Patch Set 3 : Doubled retry attempts (each StopTracing attempt requires two retries) #
Total comments: 31
Patch Set 4 : Addressed Charlie's comments #
Total comments: 6
Patch Set 5 : [BattOr] Retry battor_agent commands when they fail. #Patch Set 6 : Final comments #
Total comments: 1
Patch Set 7 : Swapped return with break #
Messages
Total messages: 36 (15 generated)
Description was changed from ========== [BattOr] Retry battor_agent commands when they fail. Many battor_agent failures on the waterfall could be solved by just retrying the command (e.g., StartTracing). Previously, the battor_agent retried reads and a select few BattOr control messages when they failed (e.g., INIT). This was incomplete, as there were many errors that did not result in retried BattOr control messages. This CL replaces this incomplete per-control-message retry logic with retry logic for entire battor_agent commands. This will make the battor_agent robust to intermittent failures of any BattOr control message. BUG= ========== to ========== [BattOr] Retry battor_agent commands when they fail. Many battor_agent failures on the waterfall could be solved by just retrying the command (e.g., StartTracing). Previously, the battor_agent retried reads and a select few BattOr control messages when they failed (e.g., INIT). This was incomplete, as there were many errors that did not result in retried BattOr control messages. This CL replaces this incomplete per-control-message retry logic with retry logic for entire battor_agent commands. This will make the battor_agent robust to intermittent failures of any BattOr control message. This required fixing the battor_agent overuse of the ambiguous "Unexpected Message" error. It also required replacing the timeout logic. BUG=699581 ==========
rnephew@chromium.org changed reviewers: + nednguyen@google.com, zhenw@chromium.org
rnephew@chromium.org changed reviewers: + rnephew@chromium.org
Can you use this to run third_party/catapult/common/battor/battor/battor_wrapper_devicetest.py and make sure the battor_wrapper still behaves properly with this battor agent binary? https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... File tools/battor_agent/battor_agent.cc (right): https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent.cc:27: const uint16_t kCommandRetryDelaySeconds = 2; These delays seem to be much larger than the previous delays(InitRetryDelayMilliseconds = 100). Is it necessary for it to be so long?
charliea@chromium.org changed reviewers: + charliea@chromium.org
I'm planning to take a look at this tomorrow.
Randy, would you mind prereviewing this and passing it onto me when you think it's ready for final review? I'd like to get to the point where we can have you be a secondary primary owner in this codebase.
charliea@chromium.org changed reviewers: - nednguyen@google.com, zhenw@chromium.org
https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... File tools/battor_agent/battor_agent.cc (left): https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent.cc:530: connection_->Flush(); Do we not need to flush the connections now with these changes? See comment in: tools/battor_agent/battor_connection_impl.cc https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... File tools/battor_agent/battor_agent.cc (right): https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent.cc:20: const uint8_t kMaxCommandAttempts = 10; Was 20 chosen for a reason for kMaxInitAttempts/kMaxReadAttempts before Charlie? I'm guessing it was just a best guess, so moving it down to ten shouldn't be a big deal. https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... File tools/battor_agent/battor_agent_unittest.cc (left): https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent_unittest.cc:182: GetAgent()->StartTracing(); Charlie is there any reason you put these calls here as opposed to where Aaron moved them to? https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent_unittest.cc:224: is_command_complete_ = false; Why is this no longer needed to be set to false? https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... File tools/battor_agent/battor_connection_impl.cc (right): https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_connection_impl.cc:84: Flush(); Are the flushes not needed elsewhere because we moved it here?
https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... File tools/battor_agent/battor_agent.cc (right): https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent.cc:20: const uint8_t kMaxCommandAttempts = 10; On 2017/05/09 15:04:05, rnephew (Reviews Here) wrote: > Was 20 chosen for a reason for kMaxInitAttempts/kMaxReadAttempts before Charlie? > I'm guessing it was just a best guess, so moving it down to ten shouldn't be a > big deal. I haven't seen this code in a long, long time... Looking back at the blame, it looks like Aaron wrote the initial retry logic, so I defer to him here :-) https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... File tools/battor_agent/battor_agent_unittest.cc (left): https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent_unittest.cc:182: GetAgent()->StartTracing(); On 2017/05/09 15:04:06, rnephew (Reviews Here) wrote: > Charlie is there any reason you put these calls here as opposed to where Aaron > moved them to? Yeah, there was: before, there was the assumption that we'd only ever "run through" the command once. A specific step in that command might be retried if something went amiss, but things never really went backwards. Aaron's CL changes this and, instead of retrying individual steps, retries the whole command over from the start when something goes wrong. As such, it makes sense to be able to run through the command multiple times. It seems like it's probably also safe to get rid of is_command_complete = false: it's being set to false in the SetUp() method, where we're already setting command_error. In the unlikely instance that either needs to be reused within a single test case, we could always just manually set them. https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent_unittest.cc:224: is_command_complete_ = false; On 2017/05/09 15:04:05, rnephew (Reviews Here) wrote: > Why is this no longer needed to be set to false? (See above comment)
Thanks for the comments Charlie and Randy. I replied to your comments. It doesn't seem like any changes to the CL are needed. https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... File tools/battor_agent/battor_agent.cc (left): https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent.cc:530: connection_->Flush(); On 2017/05/09 15:04:05, rnephew (Reviews Here) wrote: > Do we not need to flush the connections now with these changes? > > See comment in: tools/battor_agent/battor_connection_impl.cc see reply in: : tools/battor_agent/battor_connection_impl.cc https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... File tools/battor_agent/battor_agent.cc (right): https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent.cc:20: const uint8_t kMaxCommandAttempts = 10; On 2017/05/09 19:58:08, charliea wrote: > On 2017/05/09 15:04:05, rnephew (Reviews Here) wrote: > > Was 20 chosen for a reason for kMaxInitAttempts/kMaxReadAttempts before > Charlie? > > I'm guessing it was just a best guess, so moving it down to ten shouldn't be a > > big deal. > > I haven't seen this code in a long, long time... > > Looking back at the blame, it looks like Aaron wrote the initial retry logic, so > I defer to him here :-) 20 was due to the super short retry interval. With the much longer retry interval we need fewer retries. 10 gave a very low failure rate in my testing, so it seemed like a good choice for now. https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent.cc:27: const uint16_t kCommandRetryDelaySeconds = 2; On 2017/04/21 22:10:45, rnephew (Reviews Here) wrote: > These delays seem to be much larger than the previous > delays(InitRetryDelayMilliseconds = 100). Is it necessary for it to be so long? Yeah the 100 msec retry interval was too aggressive. Many INIT messages were being dropped. https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... File tools/battor_agent/battor_agent_unittest.cc (left): https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent_unittest.cc:182: GetAgent()->StartTracing(); On 2017/05/09 19:58:08, charliea wrote: > On 2017/05/09 15:04:06, rnephew (Reviews Here) wrote: > > Charlie is there any reason you put these calls here as opposed to where Aaron > > moved them to? > > Yeah, there was: before, there was the assumption that we'd only ever "run > through" the command once. A specific step in that command might be retried if > something went amiss, but things never really went backwards. > > Aaron's CL changes this and, instead of retrying individual steps, retries the > whole command over from the start when something goes wrong. As such, it makes > sense to be able to run through the command multiple times. > > It seems like it's probably also safe to get rid of is_command_complete = false: > it's being set to false in the SetUp() method, where we're already setting > command_error. In the unlikely instance that either needs to be reused within a > single test case, we could always just manually set them. Acknowledged. https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent_unittest.cc:224: is_command_complete_ = false; On 2017/05/09 19:58:08, charliea wrote: > On 2017/05/09 15:04:05, rnephew (Reviews Here) wrote: > > Why is this no longer needed to be set to false? > > (See above comment) Acknowledged. https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... File tools/battor_agent/battor_connection_impl.cc (right): https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_connection_impl.cc:84: Flush(); On 2017/05/09 15:04:06, rnephew (Reviews Here) wrote: > Are the flushes not needed elsewhere because we moved it here? Correct. Before we had to flush manually between retries of specific messages (e.g., INIT). Now when a command fails it will reopen the BattOr connection which will then jump into this condition and Flush() the existing connection.
looks good to me, but I am not the most practiced at c++ style guides at google so I will defer to Charlie for final review.
Overall, I think this is pretty close to being ready to go: just have a few relatively minor suggestions https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... File tools/battor_agent/battor_agent.cc (right): https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent.cc:468: base::TimeDelta::FromSeconds(kBattOrControlMessageTimeoutSeconds)); Could we move the timeout_callback_.Reset() and the PostDelayedTask into a separate method, SetActionTimeout(seconds)? That would allow us to get rid of your comment above, which I agree is currently necessary because the code itself isn't self-documenting. https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent.cc:525: // Start control message respone timeout timer. s/respone/response, but see above comment about factoring this out https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent.cc:545: base::ThreadTaskRunnerHandle::Get()->PostDelayedTask( One thing that might make this a little cleaner is if you story the task you want to post in a closure and just post the delayed task after the switch. I think you can do: base::Callback<void(int)> next_command; switch(command_) { case Command::START_TRACING: next_command = base::Bind(&BattOrAgent::StartTracing, AsWeakPtr()); return; case Command::STOP_TRACING: next_command = base::Bind(&BattOrAgent::StopTracing, AsWeakPtr()); return; ... } base::ThreadTaskRunnerHandle::Get()->PostDelayedTask( next_command, base::TimeDelta::FromSeconds(kCommandRetryDelaySeconds); https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... File tools/battor_agent/battor_agent_unittest.cc (right): https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent_unittest.cc:431: TEST_F(BattOrAgentTest, StartTracingSucceedsAfterSetGainSendFails) { o_O is this passing right now? The name of the test is "StartTracingSucceedsAfter...", but the test asserts that StartTracing fails after an init send fails. It seems like EXPECT_TRUE(IsCommandComplete()); should fail because the BattOrAgent is waiting for more data for its retry attempt. https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent_unittest.cc:519: // Send some samples instead of an INIT ACK. This will force an command retry. s/an command/a command https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent_unittest.cc:535: TEST_F(BattOrAgentTest, StartTracingFailsAfterTooManyFails) { I think that "StartTracingFailsAfterTooManyCumulativeFailures" would be a better indicator of what this test is actually testing (both that there's a failure limit, and that the failure limit works between different stages of the same command) https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent_unittest.cc:747: RawBattOrSample frame1[] = {RawBattOrSample{1, 1}}; nit: if there's no frame2, just call it "frame" instead of "frame1"
On 2017/05/12 22:03:43, charliea wrote: > Overall, I think this is pretty close to being ready to go: just have a few > relatively minor suggestions > > https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... > File tools/battor_agent/battor_agent.cc (right): > > https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... > tools/battor_agent/battor_agent.cc:468: > base::TimeDelta::FromSeconds(kBattOrControlMessageTimeoutSeconds)); > Could we move the timeout_callback_.Reset() and the PostDelayedTask into a > separate method, SetActionTimeout(seconds)? That would allow us to get rid of > your comment above, which I agree is currently necessary because the code itself > isn't self-documenting. > > https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... > tools/battor_agent/battor_agent.cc:525: // Start control message respone timeout > timer. > s/respone/response, but see above comment about factoring this out > > https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... > tools/battor_agent/battor_agent.cc:545: > base::ThreadTaskRunnerHandle::Get()->PostDelayedTask( > One thing that might make this a little cleaner is if you story the task you > want to post in a closure and just post the delayed task after the switch. I > think you can do: > > base::Callback<void(int)> next_command; > switch(command_) { > case Command::START_TRACING: > next_command = base::Bind(&BattOrAgent::StartTracing, AsWeakPtr()); > return; > case Command::STOP_TRACING: > next_command = base::Bind(&BattOrAgent::StopTracing, AsWeakPtr()); > return; > ... > } > > base::ThreadTaskRunnerHandle::Get()->PostDelayedTask( > next_command, base::TimeDelta::FromSeconds(kCommandRetryDelaySeconds); > > https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... > File tools/battor_agent/battor_agent_unittest.cc (right): > > https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... > tools/battor_agent/battor_agent_unittest.cc:431: TEST_F(BattOrAgentTest, > StartTracingSucceedsAfterSetGainSendFails) { > o_O is this passing right now? The name of the test is > "StartTracingSucceedsAfter...", but the test asserts that StartTracing fails > after an init send fails. > > It seems like EXPECT_TRUE(IsCommandComplete()); should fail because the > BattOrAgent is waiting for more data for its retry attempt. > > https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... > tools/battor_agent/battor_agent_unittest.cc:519: // Send some samples instead of > an INIT ACK. This will force an command retry. > s/an command/a command > > https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... > tools/battor_agent/battor_agent_unittest.cc:535: TEST_F(BattOrAgentTest, > StartTracingFailsAfterTooManyFails) { > I think that "StartTracingFailsAfterTooManyCumulativeFailures" would be a better > indicator of what this test is actually testing (both that there's a failure > limit, and that the failure limit works between different stages of the same > command) > > https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... > tools/battor_agent/battor_agent_unittest.cc:747: RawBattOrSample frame1[] = > {RawBattOrSample{1, 1}}; > nit: if there's no frame2, just call it "frame" instead of "frame1" Thanks Charlie! I addressed all of your comments in latest patch set.
https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... File tools/battor_agent/battor_agent.cc (right): https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent.cc:468: base::TimeDelta::FromSeconds(kBattOrControlMessageTimeoutSeconds)); On 2017/05/12 22:03:42, charliea wrote: > Could we move the timeout_callback_.Reset() and the PostDelayedTask into a > separate method, SetActionTimeout(seconds)? That would allow us to get rid of > your comment above, which I agree is currently necessary because the code itself > isn't self-documenting. Done. https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent.cc:525: // Start control message respone timeout timer. On 2017/05/12 22:03:42, charliea wrote: > s/respone/response, but see above comment about factoring this out Done. https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent.cc:545: base::ThreadTaskRunnerHandle::Get()->PostDelayedTask( On 2017/05/12 22:03:42, charliea wrote: > One thing that might make this a little cleaner is if you story the task you > want to post in a closure and just post the delayed task after the switch. I > think you can do: > > base::Callback<void(int)> next_command; > switch(command_) { > case Command::START_TRACING: > next_command = base::Bind(&BattOrAgent::StartTracing, AsWeakPtr()); > return; > case Command::STOP_TRACING: > next_command = base::Bind(&BattOrAgent::StopTracing, AsWeakPtr()); > return; > ... > } > > base::ThreadTaskRunnerHandle::Get()->PostDelayedTask( > next_command, base::TimeDelta::FromSeconds(kCommandRetryDelaySeconds); Done. https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... File tools/battor_agent/battor_agent_unittest.cc (right): https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent_unittest.cc:431: TEST_F(BattOrAgentTest, StartTracingSucceedsAfterSetGainSendFails) { On 2017/05/12 22:03:42, charliea wrote: > o_O is this passing right now? The name of the test is > "StartTracingSucceedsAfter...", but the test asserts that StartTracing fails > after an init send fails. > > It seems like EXPECT_TRUE(IsCommandComplete()); should fail because the > BattOrAgent is waiting for more data for its retry attempt. right, this should say "StartTracingFails..." because the send failed. we fail immediately if a send fails because there is no other reason for that than a completely dead serial connection to the BattOr. there is no benefit to retrying. https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent_unittest.cc:519: // Send some samples instead of an INIT ACK. This will force an command retry. On 2017/05/12 22:03:42, charliea wrote: > s/an command/a command Done. https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent_unittest.cc:535: TEST_F(BattOrAgentTest, StartTracingFailsAfterTooManyFails) { On 2017/05/12 22:03:43, charliea wrote: > I think that "StartTracingFailsAfterTooManyCumulativeFailures" would be a better > indicator of what this test is actually testing (both that there's a failure > limit, and that the failure limit works between different stages of the same > command) Done. https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent_unittest.cc:747: RawBattOrSample frame1[] = {RawBattOrSample{1, 1}}; On 2017/05/12 22:03:42, charliea wrote: > nit: if there's no frame2, just call it "frame" instead of "frame1" Done.
lgtm w/ comments https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... File tools/battor_agent/battor_agent_unittest.cc (right): https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent_unittest.cc:431: TEST_F(BattOrAgentTest, StartTracingSucceedsAfterSetGainSendFails) { On 2017/05/15 18:49:17, aschulman wrote: > On 2017/05/12 22:03:42, charliea wrote: > > o_O is this passing right now? The name of the test is > > "StartTracingSucceedsAfter...", but the test asserts that StartTracing fails > > after an init send fails. > > > > It seems like EXPECT_TRUE(IsCommandComplete()); should fail because the > > BattOrAgent is waiting for more data for its retry attempt. > > right, this should say "StartTracingFails..." because the send failed. we fail > immediately if a send fails because there is no other reason for that than a > completely dead serial connection to the BattOr. there is no benefit to > retrying. Ah, okay. So we only retry the command if a receive fails, not if a send fails? https://codereview.chromium.org/2826913002/diff/60001/tools/battor_agent/batt... File tools/battor_agent/battor_agent.cc (right): https://codereview.chromium.org/2826913002/diff/60001/tools/battor_agent/batt... tools/battor_agent/battor_agent.cc:526: base::Callback<void()> next_command; Interesting: I would think that this needs to be void(uint16_t). My understanding was always that the type before the parentheses was the return type, and the types inside of the parentheses were the parameters that needed to be passed in to make the thing a real function call. If this works though, it looks good to me. https://codereview.chromium.org/2826913002/diff/60001/tools/battor_agent/batt... File tools/battor_agent/battor_agent.h (right): https://codereview.chromium.org/2826913002/diff/60001/tools/battor_agent/batt... tools/battor_agent/battor_agent.h:147: void setActionTimeout(const uint16_t timeout_seconds); s/setActionTimeout/SetActionTimeout https://codereview.chromium.org/2826913002/diff/60001/tools/battor_agent/batt... tools/battor_agent/battor_agent.h:147: void setActionTimeout(const uint16_t timeout_seconds); nit: for consistency, maybe just use uint16_t (see SendControlMessage above). I have no idea if there was an original reason that we used the non-const version, but it seems worth it to not people wonder why one method used const and another didn't when neither actually modified the parameter
Final comments are taken care of in the latest patch set. Randy, after this is committed, can you deploy it on the mac bots? https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... File tools/battor_agent/battor_agent_unittest.cc (right): https://codereview.chromium.org/2826913002/diff/40001/tools/battor_agent/batt... tools/battor_agent/battor_agent_unittest.cc:431: TEST_F(BattOrAgentTest, StartTracingSucceedsAfterSetGainSendFails) { On 2017/05/15 19:03:10, charliea wrote: > On 2017/05/15 18:49:17, aschulman wrote: > > On 2017/05/12 22:03:42, charliea wrote: > > > o_O is this passing right now? The name of the test is > > > "StartTracingSucceedsAfter...", but the test asserts that StartTracing fails > > > after an init send fails. > > > > > > It seems like EXPECT_TRUE(IsCommandComplete()); should fail because the > > > BattOrAgent is waiting for more data for its retry attempt. > > > > right, this should say "StartTracingFails..." because the send failed. we fail > > immediately if a send fails because there is no other reason for that than a > > completely dead serial connection to the BattOr. there is no benefit to > > retrying. > > Ah, okay. So we only retry the command if a receive fails, not if a send fails? correct https://codereview.chromium.org/2826913002/diff/60001/tools/battor_agent/batt... File tools/battor_agent/battor_agent.cc (right): https://codereview.chromium.org/2826913002/diff/60001/tools/battor_agent/batt... tools/battor_agent/battor_agent.cc:526: base::Callback<void()> next_command; On 2017/05/15 19:03:10, charliea wrote: > Interesting: I would think that this needs to be void(uint16_t). My > understanding was always that the type before the parentheses was the return > type, and the types inside of the parentheses were the parameters that needed to > be passed in to make the thing a real function call. If this works though, it > looks good to me. Done. https://codereview.chromium.org/2826913002/diff/60001/tools/battor_agent/batt... File tools/battor_agent/battor_agent.h (right): https://codereview.chromium.org/2826913002/diff/60001/tools/battor_agent/batt... tools/battor_agent/battor_agent.h:147: void setActionTimeout(const uint16_t timeout_seconds); On 2017/05/15 19:03:10, charliea wrote: > nit: for consistency, maybe just use uint16_t (see SendControlMessage above). I > have no idea if there was an original reason that we used the non-const version, > but it seems worth it to not people wonder why one method used const and another > didn't when neither actually modified the parameter Done. https://codereview.chromium.org/2826913002/diff/60001/tools/battor_agent/batt... tools/battor_agent/battor_agent.h:147: void setActionTimeout(const uint16_t timeout_seconds); On 2017/05/15 19:03:10, charliea wrote: > s/setActionTimeout/SetActionTimeout Done.
The CQ bit was checked by aschulman@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from charliea@chromium.org Link to the patchset: https://codereview.chromium.org/2826913002/#ps100001 (title: "Final comments")
CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: linux_chromium_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...) win_chromium_compile_dbg_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_comp...)
https://codereview.chromium.org/2826913002/diff/100001/tools/battor_agent/bat... File tools/battor_agent/battor_agent.cc (right): https://codereview.chromium.org/2826913002/diff/100001/tools/battor_agent/bat... tools/battor_agent/battor_agent.cc:542: base::ThreadTaskRunnerHandle::Get()->PostDelayedTask( Ah, looks like the presubmit caught a bug that I didn't: you should remove the "return" statement after each "next_command" assignment above (as well as the NOTREACHED(), as it's not necessary there). Otherwise, the next command will never get posted here I'm confused about why the unit tests didn't also catch this, though?
On 2017/05/16 16:13:08, charliea wrote: > https://codereview.chromium.org/2826913002/diff/100001/tools/battor_agent/bat... > File tools/battor_agent/battor_agent.cc (right): > > https://codereview.chromium.org/2826913002/diff/100001/tools/battor_agent/bat... > tools/battor_agent/battor_agent.cc:542: > base::ThreadTaskRunnerHandle::Get()->PostDelayedTask( > Ah, looks like the presubmit caught a bug that I didn't: you should remove the > "return" statement after each "next_command" assignment above (as well as the > NOTREACHED(), as it's not necessary there). Otherwise, the next command will > never get posted here > > I'm confused about why the unit tests didn't also catch this, though? Me too, I've been investigating it and I think I have identified the reason why it works. Essentially, I think because we simply return out of the retry function, the agent ends up essentially stuck in the state that it was in. It then just is waiting for the callback that it expects in order to move on to the next state. There will be failures that happen, but those failures will result in calls to RetryCommand, which essentially does nothing. So basically it just stays in the same state until it gets what it expects. One way we could test for this is to check if the agent moves back to the initial state after a command failure. Although I'm not sure if we need to test for that (or if we have the state exposed externally so the testing code can check it.
The CQ bit was checked by aschulman@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from charliea@chromium.org Link to the patchset: https://codereview.chromium.org/2826913002/#ps120001 (title: "Swapped return with break")
CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: win_chromium_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_...)
The CQ bit was checked by aschulman@chromium.org
CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
CQ is committing da patch.
Bot data: {"patchset_id": 120001, "attempt_start_ts": 1495093256217880,
"parent_rev": "2a03a5de9275d45001da304069313dc7810a63ec", "commit_rev":
"0406975eadca3fc0ac6435dd18718b8bc6f9fc5e"}
CQ is committing da patch.
Bot data: {"patchset_id": 120001, "attempt_start_ts": 1495093256217880,
"parent_rev": "b6d4c52370c0b1106e1bff3cdba63172950bd563", "commit_rev":
"b8e9b5f2237d0fad703d20fb0a22d45b6ebdb047"}
Message was sent while issue was closed.
Description was changed from ========== [BattOr] Retry battor_agent commands when they fail. Many battor_agent failures on the waterfall could be solved by just retrying the command (e.g., StartTracing). Previously, the battor_agent retried reads and a select few BattOr control messages when they failed (e.g., INIT). This was incomplete, as there were many errors that did not result in retried BattOr control messages. This CL replaces this incomplete per-control-message retry logic with retry logic for entire battor_agent commands. This will make the battor_agent robust to intermittent failures of any BattOr control message. This required fixing the battor_agent overuse of the ambiguous "Unexpected Message" error. It also required replacing the timeout logic. BUG=699581 ========== to ========== [BattOr] Retry battor_agent commands when they fail. Many battor_agent failures on the waterfall could be solved by just retrying the command (e.g., StartTracing). Previously, the battor_agent retried reads and a select few BattOr control messages when they failed (e.g., INIT). This was incomplete, as there were many errors that did not result in retried BattOr control messages. This CL replaces this incomplete per-control-message retry logic with retry logic for entire battor_agent commands. This will make the battor_agent robust to intermittent failures of any BattOr control message. This required fixing the battor_agent overuse of the ambiguous "Unexpected Message" error. It also required replacing the timeout logic. BUG=699581 Review-Url: https://codereview.chromium.org/2826913002 Cr-Commit-Position: refs/heads/master@{#472724} Committed: https://chromium.googlesource.com/chromium/src/+/b8e9b5f2237d0fad703d20fb0a22... ==========
Message was sent while issue was closed.
Committed patchset #7 (id:120001) as https://chromium.googlesource.com/chromium/src/+/b8e9b5f2237d0fad703d20fb0a22... |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
