|
|
Created:
4 years, 11 months ago by Tom Bergan Modified:
4 years, 11 months ago CC:
pdfium-reviews_googlegroups.com Base URL:
https://pdfium.googlesource.com/pdfium.git@master Target Ref:
refs/heads/master Visibility:
Public. |
DescriptionCleanup CJS_PublicMethods::ParseNumber
Original patch by tombergan.
The old version of this function was basically strtod with a few quirks:
1. It always interpreted ',' as '.' independent of locale. I kept this
behavior, to be conservative.
2. It interpreted the first non-number character as a decimal point,
unless there was a prior decimal point, in which case all characters
up to that point are ignored. This would parse "123z4" as "123.4"
and "123xy6" as "6". I did not keep this behavior -- in the new code,
these examples all fail to parse.
The new ParseNumber was inlined into ConvertStringToNumber, which
returns true on success and (false, 0) on failure.
BUG=pdfium:361
R=tsepez@chromium.org
Committed: https://pdfium.googlesource.com/pdfium/+/4cd5b80e70e5fc50d8bd805cfa3c7b54878a0a35
Patch Set 1 #Patch Set 2 : Update bug_361_expected.txt #
Messages
Total messages: 19 (5 generated)
Description was changed from ========== Cleanup CJS_PublicMethods::ParseNumber The old version of this function was basically strtod with a few quirks: 1. It always interpreted ',' as '.' independent of locale. I kept this behavior, to be conservative. 2. It interpreted the first non-number character as a decimal point, unless there was a prior decimal point, in which case all characters up to that point are ignored. This would parse "123z4" as "123.4" and "123xy6" as "6". I did not keep this behavior -- in the new code, these examples all fail to parse. The new ParseNumber was inlined into ConvertStringToNumber, which returns true on success and (false, 0) on failure. BUG=361 ========== to ========== Cleanup CJS_PublicMethods::ParseNumber The old version of this function was basically strtod with a few quirks: 1. It always interpreted ',' as '.' independent of locale. I kept this behavior, to be conservative. 2. It interpreted the first non-number character as a decimal point, unless there was a prior decimal point, in which case all characters up to that point are ignored. This would parse "123z4" as "123.4" and "123xy6" as "6". I did not keep this behavior -- in the new code, these examples all fail to parse. The new ParseNumber was inlined into ConvertStringToNumber, which returns true on success and (false, 0) on failure. BUG=361 ==========
tombergan@chromium.org changed reviewers: + thestig@chromium.org, tsepez@chromium.org
Not sure who the right reviewer is for this CL. tsepez@ and thestig@ look like the most active contributors in OWNERS.
On 2016/01/13 17:19:38, Tom Bergan wrote: > Not sure who the right reviewer is for this CL. tsepez@ and thestig@ look like > the most active contributors in OWNERS. Hi, thanks for your patch. In general, the more custom code we can remove, the better. I thought there might have been some issue with using strtod(), such as we can't be sure what the current locale may have set "decimal point" to, so that changing the commas to dots might not work if locale (and hence strtod) thinks comma is a decimal point separator. Are we sure this works? Note also that if you set BUG=pdfium:361 in your description above, reitveld will link to the right bug tracker.
Description was changed from ========== Cleanup CJS_PublicMethods::ParseNumber The old version of this function was basically strtod with a few quirks: 1. It always interpreted ',' as '.' independent of locale. I kept this behavior, to be conservative. 2. It interpreted the first non-number character as a decimal point, unless there was a prior decimal point, in which case all characters up to that point are ignored. This would parse "123z4" as "123.4" and "123xy6" as "6". I did not keep this behavior -- in the new code, these examples all fail to parse. The new ParseNumber was inlined into ConvertStringToNumber, which returns true on success and (false, 0) on failure. BUG=361 ========== to ========== Cleanup CJS_PublicMethods::ParseNumber The old version of this function was basically strtod with a few quirks: 1. It always interpreted ',' as '.' independent of locale. I kept this behavior, to be conservative. 2. It interpreted the first non-number character as a decimal point, unless there was a prior decimal point, in which case all characters up to that point are ignored. This would parse "123z4" as "123.4" and "123xy6" as "6". I did not keep this behavior -- in the new code, these examples all fail to parse. The new ParseNumber was inlined into ConvertStringToNumber, which returns true on success and (false, 0) on failure. BUG=pdfium:361 ==========
On 2016/01/13 18:38:56, Tom Sepez wrote: > On 2016/01/13 17:19:38, Tom Bergan wrote: > > Not sure who the right reviewer is for this CL. tsepez@ and thestig@ look like > > the most active contributors in OWNERS. > > Hi, thanks for your patch. > > In general, the more custom code we can remove, the better. > > I thought there might have been some issue with using strtod(), such as we can't > be sure what the current locale may have set "decimal point" to, so that > changing the commas to dots might not work if locale (and hence strtod) thinks > comma is a decimal point separator. Are we sure this works? No, you're right, this is wrong. "man 3 strtod" claims that strtod uses locale-dependent decimal points, so probably the best solution is to remove the s/,/./ substitution and call strtod directly. Does that sound right to you? > Note also that if you set BUG=pdfium:361 in your description above, reitveld > will link to the right bug tracker. Done.
On 2016/01/13 18:51:25, Tom Bergan wrote: > On 2016/01/13 18:38:56, Tom Sepez wrote: > > On 2016/01/13 17:19:38, Tom Bergan wrote: > > > Not sure who the right reviewer is for this CL. tsepez@ and thestig@ look > like > > > the most active contributors in OWNERS. > > > > Hi, thanks for your patch. > > > > In general, the more custom code we can remove, the better. > > > > I thought there might have been some issue with using strtod(), such as we > can't > > be sure what the current locale may have set "decimal point" to, so that > > changing the commas to dots might not work if locale (and hence strtod) thinks > > comma is a decimal point separator. Are we sure this works? > > No, you're right, this is wrong. "man 3 strtod" claims that strtod uses > locale-dependent > decimal points, so probably the best solution is to remove the s/,/./ > substitution and > call strtod directly. Does that sound right to you? > Probably not. I think the arithmetic should produce the same result regardless of locale. Compounding this is that we probably don't want to call setlocale() to make sure it is a ".", since that action would affect our embedders. I'm not clear on what the right thing is.
Given that there isn't an obvious right answer, and that the existing code eventually calls into strtod() in whatever locale happens to be present, I don't think there's a lot of harm in continuing to do so.
On 2016/01/13 21:00:55, Tom Sepez wrote: > Given that there isn't an obvious right answer, and that the existing code > eventually calls into strtod() in whatever locale happens to be present, I don't > think there's a lot of harm in continuing to do so. The remaining question is whether we want to run some sort of lexer first on the string to throw out the infinity, nan, hex number, etc. cases that strtod() handles. In general, I'm opposed to "be generous in what you accept" preferring to be strict.
And just for laughs, git grep 'FX.*to[fd]' lists a bunch of buggy, mostly unused, variants that we should try to eliminate: Among them: core/include/fxcrt/fx_ext.h:FX_FLOAT FXSYS_strtof(const FX_CHAR* pcsStr, core/include/fxcrt/fx_ext.h:FX_FLOAT FXSYS_wcstof(const FX_WCHAR* pwsStr, core/include/fxcrt/fx_string.h:FX_FLOAT FX_atof(const CFX_ByteStringC& str); core/src/fxcrt/fx_basic_wstring.cpp:FX_FLOAT FX_wtof(const FX_WCHAR* str, int len) { core/src/fxcodec/codec/fx_codec.cpp:extern "C" double FXstrtod(const char* nptr, char** endptr) {
On 2016/01/13 21:04:16, Tom Sepez wrote: > On 2016/01/13 21:00:55, Tom Sepez wrote: > > Given that there isn't an obvious right answer, and that the existing code > > eventually calls into strtod() in whatever locale happens to be present, I > don't > > think there's a lot of harm in continuing to do so. > The remaining question is whether we want to run some sort of lexer first on the > string to throw out the infinity, nan, hex number, etc. cases that strtod() > handles. In general, I'm opposed to "be generous in what you accept" preferring > to be strict. I don't think we need to handle hex numbers, but we do need to handle infinity and NaN. If I add "testField(1/0);" to your test script (with my code change), the output is: Alert: Answer for "Infinity" is: number Infinity That is, the input to this parsing function comes not only from user-entered text, but also from Javascript computations. (Note that the PDF I posted in the original bug includes fields that store intermediate values.) If we do not parse inf/nan, we lose these intermediate values and the resulting computation can be incorrect. These values should (hopefully) parse directly with strtod since v8 should be using the same C library to format the number. (Probably pdfium should be storing these intermediate values as raw javascript objects, rather than converting to/from strings, but that sounds like a larger change.) I'm still not sure what to do about user-entered text. Perhaps there's a clever way to associate PDF fields with AFNumber_Format and AFNumber_Keystroke calls -- if these calls exist, they say exactly what the format should be, and otherwise, the current locale is probably a reasonable default.
On 2016/01/14 07:52:28, Tom Bergan wrote: > On 2016/01/13 21:04:16, Tom Sepez wrote: > > On 2016/01/13 21:00:55, Tom Sepez wrote: > > > Given that there isn't an obvious right answer, and that the existing code > > > eventually calls into strtod() in whatever locale happens to be present, I > > don't > > > think there's a lot of harm in continuing to do so. > > The remaining question is whether we want to run some sort of lexer first on > the > > string to throw out the infinity, nan, hex number, etc. cases that strtod() > > handles. In general, I'm opposed to "be generous in what you accept" > preferring > > to be strict. > > I don't think we need to handle hex numbers, but we do need to handle infinity > and NaN. If I add "testField(1/0);" to your test script (with my code change), > the output is: > Alert: Answer for "Infinity" is: number Infinity > > That is, the input to this parsing function comes not only from user-entered > text, but also from Javascript computations. (Note that the PDF I posted in the > original bug includes fields that store intermediate values.) If we do not parse > inf/nan, we lose these intermediate values and the resulting computation can be > incorrect. These values should (hopefully) parse directly with strtod since v8 > should be using the same C library to format the number. (Probably pdfium should > be storing these intermediate values as raw javascript objects, rather than > converting to/from strings, but that sounds like a larger change.) > You bring up a good point. Note that the value stored (in the C++ code) for the field is always a string, as produced by v8. We get there, for example at https://code.google.com/p/chromium/codesearch#chromium/src/third_party/pdfium... where the operator>> eventually calls intohttps://code.google.com/p/chromium/codesearch#chromium/src/third_party/pdfium/fpdfsdk/src/jsapi/fxjs_v8.cpp&rcl=1452774553&l=762 which eventually calls v8::Value:ToString(). I'd expect that the behaviour of ToString() is spec'd, and not necessarily the same as any library impl. So ... this means that testField("40000000000000000000000000") stores "40000000000000000000000000" but testField(40000000000000000000000000) stores "4e+25" and testField("Infinity") stores "Infinity" testField(Infintiy) also stores "Infinity" and we should try to reconstitute both of these into numbers. As far as locale, some tests use https://code.google.com/p/chromium/codesearch#chromium/src/base/test/scoped_l... I'm not sure if we can portably bring this into PDFium. > I'm still not sure what to do about user-entered text. Perhaps there's a clever > way to associate PDF fields with AFNumber_Format and AFNumber_Keystroke calls -- > if these calls exist, they say exactly what the format should be, and otherwise, > the current locale is probably a reasonable default.
On 2016/01/14 16:54:43, Tom Sepez wrote: > On 2016/01/14 07:52:28, Tom Bergan wrote: > > On 2016/01/13 21:04:16, Tom Sepez wrote: > > > On 2016/01/13 21:00:55, Tom Sepez wrote: > > > > Given that there isn't an obvious right answer, and that the existing code > > > > eventually calls into strtod() in whatever locale happens to be present, I > > > don't > > > > think there's a lot of harm in continuing to do so. > > > The remaining question is whether we want to run some sort of lexer first on > > the > > > string to throw out the infinity, nan, hex number, etc. cases that strtod() > > > handles. In general, I'm opposed to "be generous in what you accept" > > preferring > > > to be strict. > > > > I don't think we need to handle hex numbers, but we do need to handle infinity > > and NaN. If I add "testField(1/0);" to your test script (with my code change), > > the output is: > > Alert: Answer for "Infinity" is: number Infinity > > > > That is, the input to this parsing function comes not only from user-entered > > text, but also from Javascript computations. (Note that the PDF I posted in > the > > original bug includes fields that store intermediate values.) If we do not > parse > > inf/nan, we lose these intermediate values and the resulting computation can > be > > incorrect. These values should (hopefully) parse directly with strtod since v8 > > should be using the same C library to format the number. (Probably pdfium > should > > be storing these intermediate values as raw javascript objects, rather than > > converting to/from strings, but that sounds like a larger change.) > > > You bring up a good point. Note that the value stored (in the C++ code) for the > field > is always a string, as produced by v8. We get there, for example at > https://code.google.com/p/chromium/codesearch#chromium/src/third_party/pdfium... > where the operator>> eventually calls > intohttps://code.google.com/p/chromium/codesearch#chromium/src/third_party/pdfium/fpdfsdk/src/jsapi/fxjs_v8.cpp&rcl=1452774553&l=762 > which eventually calls v8::Value:ToString(). I'd expect that the behaviour of > ToString() is spec'd, and not necessarily the same as any library impl. > > So ... this means that > testField("40000000000000000000000000") stores "40000000000000000000000000" but > testField(40000000000000000000000000) stores "4e+25" > and > testField("Infinity") stores "Infinity" > testField(Infintiy) also stores "Infinity" > and > we should try to reconstitute both of these into numbers. You're right about ToString -- the spec says the decimal separator must be a dot: http://www.ecma-international.org/ecma-262/5.1/#sec-9.8 I propose submitting this CL as-is. It correctly parses numbers generated by v8::Value::ToString. For other cases (user-generated input), this CL behaves exactly as the prior code, except it handles error cases better, it supports "1e5" (not just "1e+5"), and it does not have the 17 character limitation. Longer-term, the better solution for user-generated input is probably something like: 1. If the field has AFNumber_Keystroke, then just before the value is committed (see link below), parse the number using the format specified by AFNumber_Keystroke params. Then store the value as an actual Javascript number (not a string). https://code.google.com/p/chromium/codesearch#chromium/src/third_party/pdfium... 2. If the field has no AFNumber_Keystroke, check what Acroread does under various locales and do the same thing. > As far as locale, some tests use > https://code.google.com/p/chromium/codesearch#chromium/src/base/test/scoped_l... > I'm not sure if we can portably bring this into PDFium. If pdfium_test calls setlocale(LC_ALL, "") at startup, it will load the locale from environment variables and you can run pdfium_test with LC_ALL=<locale>. However, pdfium_test does not currently call setlocale(LC_ALL, ""), so the locale defaults to "C" which uses "." as a decimal separator.
Thanks for your help. I agree that this is a good interim solution. LGTM. Do you have commit access to PDFium (most of chromium doesn't)? If not I'll land this on your behalf.
On 2016/01/15 01:10:53, Tom Sepez wrote: > Thanks for your help. I agree that this is a good interim solution. LGTM. > Do you have commit access to PDFium (most of chromium doesn't)? If not I'll > land this on your behalf. And you're not in the list. If you ever think that you might land another PDFium patch, ping me and I'll add you.
Description was changed from ========== Cleanup CJS_PublicMethods::ParseNumber The old version of this function was basically strtod with a few quirks: 1. It always interpreted ',' as '.' independent of locale. I kept this behavior, to be conservative. 2. It interpreted the first non-number character as a decimal point, unless there was a prior decimal point, in which case all characters up to that point are ignored. This would parse "123z4" as "123.4" and "123xy6" as "6". I did not keep this behavior -- in the new code, these examples all fail to parse. The new ParseNumber was inlined into ConvertStringToNumber, which returns true on success and (false, 0) on failure. BUG=pdfium:361 ========== to ========== Cleanup CJS_PublicMethods::ParseNumber Original patch by tombergan. The old version of this function was basically strtod with a few quirks: 1. It always interpreted ',' as '.' independent of locale. I kept this behavior, to be conservative. 2. It interpreted the first non-number character as a decimal point, unless there was a prior decimal point, in which case all characters up to that point are ignored. This would parse "123z4" as "123.4" and "123xy6" as "6". I did not keep this behavior -- in the new code, these examples all fail to parse. The new ParseNumber was inlined into ConvertStringToNumber, which returns true on success and (false, 0) on failure. BUG=pdfium:361 ==========
Description was changed from ========== Cleanup CJS_PublicMethods::ParseNumber Original patch by tombergan. The old version of this function was basically strtod with a few quirks: 1. It always interpreted ',' as '.' independent of locale. I kept this behavior, to be conservative. 2. It interpreted the first non-number character as a decimal point, unless there was a prior decimal point, in which case all characters up to that point are ignored. This would parse "123z4" as "123.4" and "123xy6" as "6". I did not keep this behavior -- in the new code, these examples all fail to parse. The new ParseNumber was inlined into ConvertStringToNumber, which returns true on success and (false, 0) on failure. BUG=pdfium:361 ========== to ========== Cleanup CJS_PublicMethods::ParseNumber Original patch by tombergan. The old version of this function was basically strtod with a few quirks: 1. It always interpreted ',' as '.' independent of locale. I kept this behavior, to be conservative. 2. It interpreted the first non-number character as a decimal point, unless there was a prior decimal point, in which case all characters up to that point are ignored. This would parse "123z4" as "123.4" and "123xy6" as "6". I did not keep this behavior -- in the new code, these examples all fail to parse. The new ParseNumber was inlined into ConvertStringToNumber, which returns true on success and (false, 0) on failure. BUG=pdfium:361 R=tsepez@chromium.org Committed: https://pdfium.googlesource.com/pdfium/+/4cd5b80e70e5fc50d8bd805cfa3c7b54878a... ==========
Message was sent while issue was closed.
Committed patchset #2 (id:20001) manually as 4cd5b80e70e5fc50d8bd805cfa3c7b54878a0a35 (presubmit successful).
Message was sent while issue was closed.
On 2016/01/15 01:25:13, Tom Sepez wrote: > Committed patchset #2 (id:20001) manually as > 4cd5b80e70e5fc50d8bd805cfa3c7b54878a0a35 (presubmit successful). And its reverted due to test breakage. Tom (B) I'll take it from here. |