Issue 1582013002: Cleanup CJS_PublicMethods::ParseNumber

Issue 1582013002: Cleanup CJS_PublicMethods::ParseNumber (Closed)

Created:
4 years, 11 months ago by Tom Bergan

Modified:
4 years, 11 months ago

Reviewers:
Tom Sepez, Lei Zhang

CC:
pdfium-reviews_googlegroups.com

Base URL:
https://pdfium.googlesource.com/pdfium.git@master

Target Ref:
refs/heads/master

Visibility:
Public.

More Reviews

Description

Cleanup CJS_PublicMethods::ParseNumber Original patch by tombergan. The old version of this function was basically strtod with a few quirks: 1. It always interpreted ',' as '.' independent of locale. I kept this behavior, to be conservative. 2. It interpreted the first non-number character as a decimal point, unless there was a prior decimal point, in which case all characters up to that point are ignored. This would parse "123z4" as "123.4" and "123xy6" as "6". I did not keep this behavior -- in the new code, these examples all fail to parse. The new ParseNumber was inlined into ConvertStringToNumber, which returns true on success and (false, 0) on failure. BUG=pdfium:361 R=tsepez@chromium.org Committed: https://pdfium.googlesource.com/pdfium/+/4cd5b80e70e5fc50d8bd805cfa3c7b54878a0a35

Patch Set 1 #

Patch Set 2 : Update bug_361_expected.txt #

Created: 4 years, 11 months ago

Download [raw] [tar.bz2]

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+44 lines, -172 lines)			Patch
M	fpdfsdk/src/javascript/Field.cpp	View		3 chunks	+5 lines, -17 lines	0 comments	Download
M	fpdfsdk/src/javascript/PublicMethods.h	View		1 chunk	+1 line, -8 lines	0 comments	Download
M	fpdfsdk/src/javascript/PublicMethods.cpp	View		1 chunk	+22 lines, -131 lines	0 comments	Download
M	testing/resources/javascript/bug_361_expected.txt	View	1	1 chunk	+16 lines, -16 lines	0 comments	Download

Messages

Total messages: 19 (5 generated)

Expand Messages | Collapse Messages | Show Generated Messages | Hide Generated Messages

Tom Bergan

Description was changed from ========== Cleanup CJS_PublicMethods::ParseNumber The old version of this function was basically ...

4 years, 11 months ago (2016-01-13 17:18:19 UTC) #1

Tom Bergan

Not sure who the right reviewer is for this CL. tsepez@ and thestig@ look like ...

4 years, 11 months ago (2016-01-13 17:19:38 UTC) #3

Tom Sepez

On 2016/01/13 17:19:38, Tom Bergan wrote: > Not sure who the right reviewer is for ...

4 years, 11 months ago (2016-01-13 18:38:56 UTC) #4

Tom Bergan

Description was changed from ========== Cleanup CJS_PublicMethods::ParseNumber The old version of this function was basically ...

4 years, 11 months ago (2016-01-13 18:44:29 UTC) #5

Tom Bergan

On 2016/01/13 18:38:56, Tom Sepez wrote: > On 2016/01/13 17:19:38, Tom Bergan wrote: > > ...

4 years, 11 months ago (2016-01-13 18:51:25 UTC) #6

Tom Sepez

On 2016/01/13 18:51:25, Tom Bergan wrote: > On 2016/01/13 18:38:56, Tom Sepez wrote: > > ...

4 years, 11 months ago (2016-01-13 20:35:34 UTC) #7

Tom Sepez

Given that there isn't an obvious right answer, and that the existing code eventually calls ...

4 years, 11 months ago (2016-01-13 21:00:55 UTC) #8

Tom Sepez

On 2016/01/13 21:00:55, Tom Sepez wrote: > Given that there isn't an obvious right answer, ...

4 years, 11 months ago (2016-01-13 21:04:16 UTC) #9

Tom Sepez

And just for laughs, git grep 'FX.*to[fd]' lists a bunch of buggy, mostly unused, variants ...

4 years, 11 months ago (2016-01-14 00:59:48 UTC) #10

Tom Bergan

On 2016/01/13 21:04:16, Tom Sepez wrote: > On 2016/01/13 21:00:55, Tom Sepez wrote: > > ...

4 years, 11 months ago (2016-01-14 07:52:28 UTC) #11

Tom Sepez

On 2016/01/14 07:52:28, Tom Bergan wrote: > On 2016/01/13 21:04:16, Tom Sepez wrote: > > ...

4 years, 11 months ago (2016-01-14 16:54:43 UTC) #12

On 2016/01/14 07:52:28, Tom Bergan wrote:
> On 2016/01/13 21:04:16, Tom Sepez wrote:
> > On 2016/01/13 21:00:55, Tom Sepez wrote:
> > > Given that there isn't an obvious right answer, and that the existing code
> > > eventually calls into strtod() in whatever locale happens to be present, I
> > don't
> > > think there's a lot of harm in continuing to do so.
> > The remaining question is whether we want to run some sort of lexer first on
> the
> > string to throw out the infinity, nan, hex number, etc. cases that strtod()
> > handles.  In general, I'm opposed to "be generous in what you accept"
> preferring
> > to be strict.
> 
> I don't think we need to handle hex numbers, but we do need to handle infinity
> and NaN. If I add "testField(1/0);" to your test script (with my code change),
> the output is:
> Alert: Answer for "Infinity" is: number Infinity
> 
> That is, the input to this parsing function comes not only from user-entered
> text, but also from Javascript computations. (Note that the PDF I posted in
the
> original bug includes fields that store intermediate values.) If we do not
parse
> inf/nan, we lose these intermediate values and the resulting computation can
be
> incorrect. These values should (hopefully) parse directly with strtod since v8
> should be using the same C library to format the number. (Probably pdfium
should
> be storing these intermediate values as raw javascript objects, rather than
> converting to/from strings, but that sounds like a larger change.)
>
You bring up a good point.  Note that the value stored (in the C++ code) for the
field
is always a string, as produced by v8.  We get there, for example at
https://code.google.com/p/chromium/codesearch#chromium/src/third_party/pdfium...
where the operator>> eventually calls
intohttps://code.google.com/p/chromium/codesearch#chromium/src/third_party/pdfium/fpdfsdk/src/jsapi/fxjs_v8.cpp&rcl=1452774553&l=762
which eventually calls v8::Value:ToString(). I'd expect that the behaviour of
ToString() is spec'd, and not necessarily the same as any library impl.

So ... this means that 
testField("40000000000000000000000000") stores "40000000000000000000000000" but
testField(40000000000000000000000000) stores "4e+25"
and 
testField("Infinity") stores "Infinity"
testField(Infintiy) also stores "Infinity"
and
we should try to reconstitute both of these into numbers.

As far as locale, some tests use 
https://code.google.com/p/chromium/codesearch#chromium/src/base/test/scoped_l...
I'm not sure if we can portably bring this into PDFium.







 
> I'm still not sure what to do about user-entered text. Perhaps there's a
clever
> way to associate PDF fields with AFNumber_Format and AFNumber_Keystroke calls
--
> if these calls exist, they say exactly what the format should be, and
otherwise,
> the current locale is probably a reasonable default.

Tom Bergan

On 2016/01/14 16:54:43, Tom Sepez wrote: > On 2016/01/14 07:52:28, Tom Bergan wrote: > > ...

4 years, 11 months ago (2016-01-15 00:54:40 UTC) #13

On 2016/01/14 16:54:43, Tom Sepez wrote:
> On 2016/01/14 07:52:28, Tom Bergan wrote:
> > On 2016/01/13 21:04:16, Tom Sepez wrote:
> > > On 2016/01/13 21:00:55, Tom Sepez wrote:
> > > > Given that there isn't an obvious right answer, and that the existing
code
> > > > eventually calls into strtod() in whatever locale happens to be present,
I
> > > don't
> > > > think there's a lot of harm in continuing to do so.
> > > The remaining question is whether we want to run some sort of lexer first
on
> > the
> > > string to throw out the infinity, nan, hex number, etc. cases that
strtod()
> > > handles.  In general, I'm opposed to "be generous in what you accept"
> > preferring
> > > to be strict.
> > 
> > I don't think we need to handle hex numbers, but we do need to handle
infinity
> > and NaN. If I add "testField(1/0);" to your test script (with my code
change),
> > the output is:
> > Alert: Answer for "Infinity" is: number Infinity
> > 
> > That is, the input to this parsing function comes not only from user-entered
> > text, but also from Javascript computations. (Note that the PDF I posted in
> the
> > original bug includes fields that store intermediate values.) If we do not
> parse
> > inf/nan, we lose these intermediate values and the resulting computation can
> be
> > incorrect. These values should (hopefully) parse directly with strtod since
v8
> > should be using the same C library to format the number. (Probably pdfium
> should
> > be storing these intermediate values as raw javascript objects, rather than
> > converting to/from strings, but that sounds like a larger change.)
> >
> You bring up a good point.  Note that the value stored (in the C++ code) for
the
> field
> is always a string, as produced by v8.  We get there, for example at
>
https://code.google.com/p/chromium/codesearch#chromium/src/third_party/pdfium...
> where the operator>> eventually calls
>
intohttps://code.google.com/p/chromium/codesearch#chromium/src/third_party/pdfium/fpdfsdk/src/jsapi/fxjs_v8.cpp&rcl=1452774553&l=762
> which eventually calls v8::Value:ToString(). I'd expect that the behaviour of
> ToString() is spec'd, and not necessarily the same as any library impl.
> 
> So ... this means that 
> testField("40000000000000000000000000") stores "40000000000000000000000000"
but
> testField(40000000000000000000000000) stores "4e+25"
> and 
> testField("Infinity") stores "Infinity"
> testField(Infintiy) also stores "Infinity"
> and
> we should try to reconstitute both of these into numbers.

You're right about ToString -- the spec says the decimal separator must be a
dot:
http://www.ecma-international.org/ecma-262/5.1/#sec-9.8

I propose submitting this CL as-is. It correctly parses numbers generated by
v8::Value::ToString. For other cases (user-generated input), this CL behaves
exactly as the prior code, except it handles error cases better, it supports
"1e5" (not just "1e+5"), and it does not have the 17 character limitation.

Longer-term, the better solution for user-generated input is probably something
like:

1. If the field has AFNumber_Keystroke, then just before the value is committed
(see link below), parse the number using the format specified by
AFNumber_Keystroke params. Then store the value as an actual Javascript number
(not a string).
https://code.google.com/p/chromium/codesearch#chromium/src/third_party/pdfium...

2. If the field has no AFNumber_Keystroke, check what Acroread does under
various locales and do the same thing.

> As far as locale, some tests use 
>
https://code.google.com/p/chromium/codesearch#chromium/src/base/test/scoped_l...
> I'm not sure if we can portably bring this into PDFium.

If pdfium_test calls setlocale(LC_ALL, "") at startup, it will load the locale
from environment variables and you can run pdfium_test with LC_ALL=<locale>.
However, pdfium_test does not currently call setlocale(LC_ALL, ""), so the
locale defaults to "C" which uses "." as a decimal separator.

Tom Sepez

Thanks for your help. I agree that this is a good interim solution. LGTM. Do ...

4 years, 11 months ago (2016-01-15 01:10:53 UTC) #14

Tom Sepez

On 2016/01/15 01:10:53, Tom Sepez wrote: > Thanks for your help. I agree that this ...

4 years, 11 months ago (2016-01-15 01:22:00 UTC) #15

Tom Sepez

Description was changed from ========== Cleanup CJS_PublicMethods::ParseNumber The old version of this function was basically ...

4 years, 11 months ago (2016-01-15 01:22:21 UTC) #16

Tom Sepez

Description was changed from ========== Cleanup CJS_PublicMethods::ParseNumber Original patch by tombergan. The old version of ...

4 years, 11 months ago (2016-01-15 01:25:11 UTC) #17

Tom Sepez

Committed patchset #2 (id:20001) manually as 4cd5b80e70e5fc50d8bd805cfa3c7b54878a0a35 (presubmit successful).

4 years, 11 months ago (2016-01-15 01:25:13 UTC) #18

Tom Sepez

4 years, 11 months ago (2016-01-15 01:35:49 UTC) #19

Message was sent while issue was closed.

On 2016/01/15 01:25:13, Tom Sepez wrote:
> Committed patchset #2 (id:20001) manually as
> 4cd5b80e70e5fc50d8bd805cfa3c7b54878a0a35 (presubmit successful).

And its reverted due to test breakage.  Tom (B) I'll take it from here.

Expand Messages | Collapse Messages | Show Generated Messages | Hide Generated Messages