|
|
Created:
9 years, 9 months ago by Louis Modified:
9 years, 4 months ago CC:
chromium-os-reviews_chromium.org Visibility:
Public. |
Descriptionichspi.c: lower down the ICH7 chipset CDS timeout from 60s to 1ms.
The loooong time is not necessary actually. Change to 30 secs for a full chip erase. Also exit early if an error happens.
The long time would make flashrom look like hung for forever especially when
probing hundreds of flashes chip on ICH7M. Remove this can get failed message
less than 1 sec in total.
We were good on EEEPC and Lenovo S10-3t because we never launched 2 or more flashrom in boot time (one hung for forever is fine as long as not in boot path). But now, activate_date and dump_vpd_log hit this issue.
Change-Id: I5a13f80673b452159c8dcf3fa95ea2d5a4944b9c
R=reinauer@chromium.org,dhendrix@chromium.org
BUG=chromium-os:13092
TEST=Tested all following combinations: {on Lenovo S10-t3, EEEPC, Alex and Mario} for {BIOS and EC} flashes with {-r / -w} commands.
Committed: http://chrome-svn/viewvc/chromeos?view=rev&revision=d2129f1
Patch Set 1 #
Total comments: 2
Patch Set 2 : increase timeout to 30s for full erase. a early exit if error appears. #Messages
Total messages: 16 (0 generated)
Hi David and Stefan, Waterloo folks cannot boot up their machines due to this issue, could you submit this CL if it looks good to you? Thanks.
PTAL. This CL is blocking development and needs to be resolved. On 2011/03/15 13:07:50, Louis wrote: > Hi David and Stefan, > > Waterloo folks cannot boot up their machines due to this issue, could you submit > this CL if it looks good to you? Thanks.
On 2011/03/15 17:27:18, kliegs wrote: > PTAL. This CL is blocking development and needs to be resolved. I'm still looking into this to ensure the calculations are right for the new delay. As the comment mentions, 60sec for the original delay period is pretty excessive. However, I don't think the timeout period alone is what is causing problems. kliegs sent a verbose log and it seems that flashrom is having a really hard time calibrating the delay period for this particular machine: Calibrating delay loop... OS timer resolution is 6 usecs, 305M loops per second, delay more than 10% too short (got 68% of expected delay), recalculating... 234M loops per second, delay more than 10% too short (got 53% of expected delay), recalculating... 310M loops per second, delay more than 10% too short (got 68% of expected delay), recalculating... 286M loops per second, delay more than 10% too short (got 66% of expected delay), recalculating... 272M loops per second, delay loop is unreliable, trying to continue 10 myus = 17 us, 100 myus = 62 us, 1000 myus = 612 us, 10000 myus = 6054 us, 24 myus = 21 us, OK. We see it having to recalibrate itself several times. This appears to be the root of the problem itself, and we should figure out what's causing it. > > On 2011/03/15 13:07:50, Louis wrote: > > Hi David and Stefan, > > > > Waterloo folks cannot boot up their machines due to this issue, could you > submit > > this CL if it looks good to you? Thanks.
As the machines are currently in an unusable state should we submit this anyways as a "good enough fix while we investigate?" Or roll back the original CL while investigating? On Tue, Mar 15, 2011 at 2:51 PM, <dhendrix@chromium.org> wrote: > On 2011/03/15 17:27:18, kliegs wrote: > >> PTAL. This CL is blocking development and needs to be resolved. >> > > I'm still looking into this to ensure the calculations are right for the > new > delay. As the comment mentions, 60sec for the original delay period is > pretty > excessive. > > However, I don't think the timeout period alone is what is causing > problems. > kliegs sent a verbose log and it seems that flashrom is having a really > hard > time calibrating the delay period for this particular machine: > Calibrating delay loop... OS timer resolution is 6 usecs, 305M loops per > second, > delay more than 10% too short (got 68% of expected delay), recalculating... > 234M > loops per second, delay more than 10% too short (got 53% of expected > delay), > recalculating... 310M loops per second, delay more than 10% too short (got > 68% > of expected delay), recalculating... 286M loops per second, delay more than > 10% > too short (got 66% of expected delay), recalculating... 272M loops per > second, > delay loop is unreliable, trying to continue 10 myus = 17 us, 100 myus = 62 > us, > 1000 myus = 612 us, 10000 myus = 6054 us, 24 myus = 21 us, OK. > > We see it having to recalibrate itself several times. This appears to be > the > root of the problem itself, and we should figure out what's causing it. > > > > On 2011/03/15 13:07:50, Louis wrote: >> > Hi David and Stefan, >> > >> > Waterloo folks cannot boot up their machines due to this issue, could >> you >> submit >> > this CL if it looks good to you? Thanks. >> > > > > http://codereview.chromium.org/6698020/ >
While I am in favor of reducing the delay, I'm very reluctant to let this patch go in as-is. http://codereview.chromium.org/6698020/diff/1/ichspi.c File ichspi.c (right): http://codereview.chromium.org/6698020/diff/1/ichspi.c#newcode577 ichspi.c:577: * bit count = (1+3+256) * 8 = 2080 The longest instruction, JEDEC_AAI_WORD_PROGRAM, is 6 bytes long (1B opcode + 3B initial address + 2B data for the initial address and initial address + 1). I'm also not sure if it's good to assume 256 for the max data length. While this is true for some chips, those which support auto address increment (AAI) such as the SST25VF080 allow all bytes on the chip to be written without re-issuing an instruction using a new address. http://codereview.chromium.org/6698020/diff/1/ichspi.c#newcode578 ichspi.c:578: * clock freq = 33MHz SPI_CLK on the ICH7/NM10 is 17.86MHz according to the NM10 spec.
On 2011/03/15 19:02:12, kliegs wrote: > As the machines are currently in an unusable state should we submit this > anyways as a "good enough fix while we investigate?" > > Or roll back the original CL while investigating? All this patch does is make flashrom fail faster and does not make it more reliable AFAICT. I am rather uneasy about using it on our other platforms. I'd recommend disabling activate_date and dump_vpd_log to avoid calling flashrom.
On 2011/03/15 19:33:14, dhendrix wrote: > On 2011/03/15 19:02:12, kliegs wrote: > > As the machines are currently in an unusable state should we submit this > > anyways as a "good enough fix while we investigate?" > > > > Or roll back the original CL while investigating? > > All this patch does is make flashrom fail faster and does not make it more > reliable AFAICT. I am rather uneasy about using it on our other platforms. > > I'd recommend disabling activate_date and dump_vpd_log to avoid calling > flashrom. Another option is to add dmidecode to your image. Using DMI info, Flashrom will detect that this is an unknown laptop/netbook/tablet and give up.
Some erase commands also take a long time. On Alex, I am seeing timeouts when it attempts sector erase operations (the problem gets worse with block erase operations of larger size). I added some print statements to help instrument the time for a few instructions on Alex. Here is what I have found so far: 0x9f (Read Device ID): ~10us 0x90 (Read Manufacturer and Device ID): ~10us 0x03 (Byte read): <50us 0x02 (Write byte): ~200us 0x05 (Read status register): ~200us 0x20 ((4KB) sector erase): ~30ms 0xd8 (64KB block erase): ~350ms 0x60 (chip erase): ~10s So I think we definitely need a long timeout here to account for the various erase instructions. The instruction that we care about for the purposes of failing quickly is RDID. If we want to get hacky, we can set the timeout to something very short specifically for RDID (and maybe other instructions, like REMS).
Most look reasonable to me. But erase commands doesn't make sense to me because the behavior should be: 1. flashrom issues erase command (thru ICH) and ICH returns immediately. 2. flashrom issues read status command (thru ICH) and loop until BUSY bit is cleared. That means the ich7_run_opcode() should not timeout. Also, the flashrom -w works fine on Mario, but -E gets timeout. That looks suspicious to me because they should use the same mechanism I assume. Looking at this now. On 2011/03/16 01:07:32, dhendrix wrote: > Some erase commands also take a long time. On Alex, I am seeing timeouts when it > attempts sector erase operations (the problem gets worse with block erase > operations of larger size). > > I added some print statements to help instrument the time for a few instructions > on Alex. Here is what I have found so far: > 0x9f (Read Device ID): ~10us > 0x90 (Read Manufacturer and Device ID): ~10us > 0x03 (Byte read): <50us > 0x02 (Write byte): ~200us > 0x05 (Read status register): ~200us > 0x20 ((4KB) sector erase): ~30ms > 0xd8 (64KB block erase): ~350ms > 0x60 (chip erase): ~10s > > > So I think we definitely need a long timeout here to account for the various > erase instructions. > > The instruction that we care about for the purposes of failing quickly is RDID. > If we want to get hacky, we can set the timeout to something very short > specifically for RDID (and maybe other instructions, like REMS).
LGTM with patch set 2 A short explanation -- Louis and I determined that the "Blocked Access Status" bit is also useful for determining if access was successful at all. In kliegs' case, access to the ROM attached to the SPI controller will fail because it doesn't exist, and we are assuming that the blocked access status bit will reflect that. We tested this out and our assumption seems to be true. So to un-block people, we're going to go ahead and commit this. We may have to come up with another different approach if the Intel FAE tells us we're wrong... On 2011/03/16 01:56:31, Louis wrote: > Most look reasonable to me. But erase commands doesn't make sense to me because > the behavior should be: > 1. flashrom issues erase command (thru ICH) and ICH returns immediately. > 2. flashrom issues read status command (thru ICH) and loop until BUSY bit is > cleared. > > That means the ich7_run_opcode() should not timeout. Also, the flashrom -w works > fine on Mario, but -E gets timeout. That looks suspicious to me because they > should use the same mechanism I assume. > > Looking at this now. > > On 2011/03/16 01:07:32, dhendrix wrote: > > Some erase commands also take a long time. On Alex, I am seeing timeouts when > it > > attempts sector erase operations (the problem gets worse with block erase > > operations of larger size). > > > > I added some print statements to help instrument the time for a few > instructions > > on Alex. Here is what I have found so far: > > 0x9f (Read Device ID): ~10us > > 0x90 (Read Manufacturer and Device ID): ~10us > > 0x03 (Byte read): <50us > > 0x02 (Write byte): ~200us > > 0x05 (Read status register): ~200us > > 0x20 ((4KB) sector erase): ~30ms > > 0xd8 (64KB block erase): ~350ms > > 0x60 (chip erase): ~10s > > > > > > So I think we definitely need a long timeout here to account for the various > > erase instructions. > > > > The instruction that we care about for the purposes of failing quickly is > RDID. > > If we want to get hacky, we can set the timeout to something very short > > specifically for RDID (and maybe other instructions, like REMS).
Thanks David. We come out a good idea for this. Please see the new uploaded code: 1. timeout lows down to 30 secs for full chip erase. 2. if an error shows up in loop, exit early. So this can fit all our cases as long as FCERR can indicate error on Lenovo machines. Tested on Lenovo, EEEPC, and Mario now.
On 2011/03/16 04:26:04, Louis wrote: > Thanks David. We come out a good idea for this. Please see the new uploaded > code: > > 1. timeout lows down to 30 secs for full chip erase. > 2. if an error shows up in loop, exit early. > > So this can fit all our cases as long as FCERR can indicate error on Lenovo > machines. > > Tested on Lenovo, EEEPC, and Mario now. Hi Louis, Please also update the CL description before submitting.
Two comments: The changelog and the comment in the code are incorrect: The patch actually reduces the delay from 60 s to 30 s. Were you aware of the following pending (since Nov. 2010) patch? http://patchwork.coreboot.org/patch/2356/ I will repost an updated version against current flashrom svn trunk and add you to CC for review.
Thanks for pointing out. I will find a chance to fix it. The pending CL looks good to me. We are asking Intel FAE if the FCERR behavior consists on all platform, for example, NM10 and ICH7-M. On Thu, Mar 17, 2011 at 6:27 AM, <hailfinger@gmail.com> wrote: > Two comments: > > The changelog and the comment in the code are incorrect: The patch actually > reduces the delay from 60 s to 30 s. > > Were you aware of the following pending (since Nov. 2010) patch? > http://patchwork.coreboot.org/patch/2356/ > I will repost an updated version against current flashrom svn trunk and add > you > to CC for review. > > > http://codereview.chromium.org/6698020/ >
Sorry, I send the mail out too fast. 60ms is not enough for some actions, like full chip erase. In theory, the ICH can complete the erase command in very short time, like hundred of us, but for erase commands, they take more longer than we expect. David experimented a chip full erase and took about 10s to get CDS from ICH. This should be tested before this is committed. On Thu, Mar 17, 2011 at 5:50 PM, Louis Yung-Chieh Lo <yjlou@chromium.org>wrote: > Thanks for pointing out. I will find a chance to fix it. > > The pending CL looks good to me. We are asking Intel FAE if the FCERR > behavior consists on all platform, for example, NM10 and ICH7-M. > > > On Thu, Mar 17, 2011 at 6:27 AM, <hailfinger@gmail.com> wrote: > >> Two comments: >> >> The changelog and the comment in the code are incorrect: The patch >> actually >> reduces the delay from 60 s to 30 s. >> >> Were you aware of the following pending (since Nov. 2010) patch? >> http://patchwork.coreboot.org/patch/2356/ >> I will repost an updated version against current flashrom svn trunk and >> add you >> to CC for review. >> >> >> http://codereview.chromium.org/6698020/ >> > > |