|
|
Created:
6 years, 4 months ago by Justin Chuang Modified:
6 years, 4 months ago CC:
chromium-reviews, chromium-apps-reviews_chromium.org, extensions-reviews_chromium.org, vpagar Base URL:
https://chromium.googlesource.com/chromium/src.git@master Project:
chromium Visibility:
Public. |
DescriptionFix race condition in CpuInfoProvider::QueryCpuTimePerProcessor
BUG=chrome-os-partner:30607
TESTED=Manual tested on the platform of crosbug.com/p/30607
try with
$ echo -n performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
$ echo -n interactive > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=288304
Patch Set 1 #Patch Set 2 : #
Total comments: 6
Patch Set 3 : Add comments and logging #
Total comments: 8
Patch Set 4 : Modify comments #Messages
Total messages: 24 (0 generated)
There is already a chrome-os-partner issue, do I need to create another chromium issue for BUG field?
On 2014/07/30 16:46:49, Justin Chuang wrote: > There is already a chrome-os-partner issue, do I need to create another chromium > issue for BUG field? linked to chrome-os-partner issue directly
Would it we safer to use http://www.gnu.org/software/libc/manual/html_node/Processor-Resources.html instead of /proc/cpuinfo?
On 2014/07/31 08:36:05, Pawel Osciak wrote: > Would it we safer to use > http://www.gnu.org/software/libc/manual/html_node/Processor-Resources.html > instead of /proc/cpuinfo? The current code uses sysconf (_SC_NPROCESSORS_ONLN). I suspect the race condition comes from /proc/stat.
Adding a random sample of OWNERS. PTAL, thank you.
https://chromiumcodereview.appspot.com/426303002/diff/20001/chrome/browser/ex... File chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc (right): https://chromiumcodereview.appspot.com/426303002/diff/20001/chrome/browser/ex... chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc:40: // Cope with kernel race condition when /proc/stat does not reflect to the s/to// https://chromiumcodereview.appspot.com/426303002/diff/20001/chrome/browser/ex... chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc:41: // current number of online processors in time. Also handle a less-likely s/in time// https://chromiumcodereview.appspot.com/426303002/diff/20001/chrome/browser/ex... chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc:44: if (i == infos->size()) Please explain how this actually handles the issue? It looks like this just prevents us from crashing... Or will the caller understand that this CPU went offline? Removing the DCHECK() below makes me think it doesn't expect it...
https://codereview.chromium.org/426303002/diff/20001/chrome/browser/extension... File chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc (right): https://codereview.chromium.org/426303002/diff/20001/chrome/browser/extension... chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc:40: // Cope with kernel race condition when /proc/stat does not reflect to the On 2014/08/04 00:55:21, Pawel Osciak wrote: > s/to// Acknowledged. https://codereview.chromium.org/426303002/diff/20001/chrome/browser/extension... chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc:41: // current number of online processors in time. Also handle a less-likely On 2014/08/04 00:55:21, Pawel Osciak wrote: > s/in time// Acknowledged. https://codereview.chromium.org/426303002/diff/20001/chrome/browser/extension... chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc:44: if (i == infos->size()) On 2014/08/04 00:55:22, Pawel Osciak wrote: > Please explain how this actually handles the issue? It looks like this just > prevents us from crashing... Or will the caller understand that this CPU went > offline? Removing the DCHECK() below makes me think it doesn't expect it... Thanks for the comments. It's actually a race condition in Linux kernel. The caller does know the current online CPU count, it's just that output of /proc/stat doesn't match it. A different method to solve this is to add a new API to return sysconf(_SC_NPROCESSORS_CONF) in addition to _SC_NPROCESSORS_ONLN, then add this new API all Win32, Mac, Linux, BSD(?) platforms. And we also need change the original behavior of system_cpu extension API. Probably a overkill or can be done in later CL. For now, I think the current fix is sufficient for the R37/R38 bug. I'll tweak the comments and logging.
With this script: set -e while true; do STAT_COUNT=$(cat /proc/stat | egrep -e '^cpu[0-9]' | wc -l) ONLINE_COUNT=$(cat /sys/devices/system/cpu/cpu[0-9]/online | grep 1 | wc -l) if [ "$STAT_COUNT" != "$ONLINE_COUNT" ]; then echo $(date) "Mismatch detected ${STAT_COUNT} <=> ${ONLINE_COUNT}" fi done We can see lots of logs that online cpu count mismatches the number of entries in /proc/stat like: Mon Aug 4 16:00:01 CST 2014 Mismatch detected 4 <=> 3 Mon Aug 4 16:00:02 CST 2014 Mismatch detected 3 <=> 4 Mon Aug 4 16:00:03 CST 2014 Mismatch detected 4 <=> 3 Mon Aug 4 16:00:42 CST 2014 Mismatch detected 4 <=> 3 Mon Aug 4 16:00:42 CST 2014 Mismatch detected 3 <=> 4
https://chromiumcodereview.appspot.com/426303002/diff/40001/chrome/browser/ex... File chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc (right): https://chromiumcodereview.appspot.com/426303002/diff/40001/chrome/browser/ex... chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc:45: // processors, it is treated as an intermittent problem in the kernel, and This is not really an intermittent problem on some platforms. It may be a normal situation to have less online than present, which means this may never pass, it will always return false. Is that still ok? https://chromiumcodereview.appspot.com/426303002/diff/40001/chrome/browser/ex... chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc:51: if (i == infos->size()) { s/==/>=/ https://chromiumcodereview.appspot.com/426303002/diff/40001/chrome/browser/ex... chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc:67: if (i < infos->size()) { Is this also an actual issue? Having less CPUs present than online sounds much worse than having less online than present...
https://codereview.chromium.org/426303002/diff/40001/chrome/browser/extension... File chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc (right): https://codereview.chromium.org/426303002/diff/40001/chrome/browser/extension... chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc:51: if (i == infos->size()) { On 2014/08/04 09:22:54, Pawel Osciak wrote: > s/==/>=/ 'i' won't be greater infos->size() unless std::vector.size() returns negative value. Shall I still change this? https://codereview.chromium.org/426303002/diff/40001/chrome/browser/extension... chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc:67: if (i < infos->size()) { On 2014/08/04 09:22:54, Pawel Osciak wrote: > Is this also an actual issue? Having less CPUs present than online sounds much > worse than having less online than present... Yes, I have seen both cases on actual nyan device. Not easy to reproduce though. I haven't checked kernel code, but I think it's caused by delay in statistics routine.
https://codereview.chromium.org/426303002/diff/40001/chrome/browser/extension... File chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc (right): https://codereview.chromium.org/426303002/diff/40001/chrome/browser/extension... chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc:45: // processors, it is treated as an intermittent problem in the kernel, and On 2014/08/04 09:22:54, Pawel Osciak wrote: > This is not really an intermittent problem on some platforms. It may be a normal > situation to have less online than present, which means this may never pass, it > will always return false. Is that still ok? Interesting. Luckily it's not the case. On Nyan, this method returns *true* most of the time. Only at the moment when that the number of online cpu changes, this method may return false due to delay in the kernel of updating /proc/stat. /proc/stat is provided by fs/proc/stat.c and kernel/sched/cpuacct.c. The code is mostly platform-independent. So I expect the behavior of /proc/stat is platform independent, too. It should not introduce the scenario you're worrying ("always return false") on other Linux platforms. If we want this method to always return true in normal case, I described another method in Review #7, which will require more efforts and change the API behavior. (As a side note) actually the kernel keeps account of both online and offline cpu. Pity that current design of /proc/stat uses 'for_each_online_cpu()' to iterate online cpu cores only. We need to modify kernel code to include online+offline cpus.
lgtm as a patch, but deferring to owners if it's ok to do this wrt. the logic in the callers of this. https://codereview.chromium.org/426303002/diff/40001/chrome/browser/extension... File chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc (right): https://codereview.chromium.org/426303002/diff/40001/chrome/browser/extension... chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc:51: if (i == infos->size()) { On 2014/08/04 09:37:23, Justin Chuang wrote: > On 2014/08/04 09:22:54, Pawel Osciak wrote: > > s/==/>=/ > > 'i' won't be greater infos->size() unless std::vector.size() returns negative > value. Shall I still change this? Of course, sorry, I misread the code. No need to change.
Any update? We have an Chrome crash issue on nyan platform. Want to fix it ASAP. Thanks In addition, I don't think we can output good data in Chrome api when kernel already returns bad data.
Please just pick an OWNER or two at most, rather than adding a bunch of reviewers.
Please just pick an OWNER or two at most, rather than adding a bunch of reviewers.
lgtm https://codereview.chromium.org/426303002/diff/40001/chrome/browser/extension... File chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc (right): https://codereview.chromium.org/426303002/diff/40001/chrome/browser/extension... chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc:40: // Due to a race condition in the kernel, we cannot trust the kernel to Can you mention the number of CPUs might have changed between the call to base::SysInfo::NumberOfProcessors() and here? It's not really a race condition in the kernel.
On 2014/08/07 06:19:37, Lei Zhang wrote: > lgtm > > https://codereview.chromium.org/426303002/diff/40001/chrome/browser/extension... > File chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc > (right): > > https://codereview.chromium.org/426303002/diff/40001/chrome/browser/extension... > chrome/browser/extensions/api/system_cpu/cpu_info_provider_linux.cc:40: // Due > to a race condition in the kernel, we cannot trust the kernel to > Can you mention the number of CPUs might have changed between the call to > base::SysInfo::NumberOfProcessors() and here? It's not really a race condition > in the kernel. Thanks. You are right. I was wrong. After trying again, it seems to be race condition in chrome, not in kernel. I'll fix this issue in different way.
Ok. I see why it happens so frequently. CpuInfoProvider uses SysInfo::NumberOfProcessors() int SysInfo::NumberOfProcessors() { return g_lazy_number_of_processors.Get().value(); } So it will only records value of sysconf(_SC_NPROCESSORS_ONLN) on lazy initialization. After CPU count changes, return value of SysInfo::NumberOfProcessors() won't change.
This implies another problem: The code passes cached value of online cpu count to Blink sandbox. This is wrong because online cpu count may change every a few seconds on embedded platform. I think the correct fix is to change SysInfo::NumberOfProcessors() to use _SC_NPROCESSORS_CONF instead of _SC_NPROCESSORS_ONLN. On Windows, the value is obtained by ::GetNativeSystemInfo -> dwNumberOfProcessors, which also looks similar to _SC_NPROCESSORS_CONF. There are ~20 places calling NumberOfProcessors(). Looks like most of them actually want _SC_NPROCESSORS_CONF rather than _SC_NPROCESSORS_ONLN. Lei, jochen, Do you agree?
On 2014/08/07 15:54:04, Justin Chuang wrote: > This implies another problem: > > The code passes cached value of online cpu count to Blink sandbox. This is wrong > because online cpu count may change every a few seconds on embedded platform. > > I think the correct fix is to change SysInfo::NumberOfProcessors() to use > _SC_NPROCESSORS_CONF instead of _SC_NPROCESSORS_ONLN. > > On Windows, the value is obtained by ::GetNativeSystemInfo -> > dwNumberOfProcessors, which also looks similar to _SC_NPROCESSORS_CONF. > > There are ~20 places calling NumberOfProcessors(). Looks like most of them > actually want _SC_NPROCESSORS_CONF rather than _SC_NPROCESSORS_ONLN. > > Lei, jochen, Do you agree? There's always a potential race between when you call NumberOfProcessors() and when you read /proc/stat. So this CL probably should go in once you update the comment. Fixing NumberOfProcessors() is a related, but separate issue.
The CQ bit was checked by jchuang@chromium.org
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/jchuang@chromium.org/426303002/60001
FYI, CQ is re-trying this CL (attempt #1). The failing builders are: android_dbg_triggered_tests on tryserver.chromium.linux (http://build.chromium.org/p/tryserver.chromium.linux/builders/android_dbg_tri...)
Message was sent while issue was closed.
Change committed as 288304 |