Issue 2340583003: Add MSA (MIPS SIMD Arch) optimized VectorMath functions

Prashant.Patil

4 years, 3 months ago (2016-09-14 12:40:07 UTC) #1

Prashant.Patil

Description was changed from ========== Add MSA (MIPS SIMD Arch) optimized VectorMath functions We add ...

4 years, 3 months ago (2016-09-14 12:43:48 UTC) #2

Prashant.Patil

prashant.patil@imgtec.com changed reviewers: + dpranke@chromium.org, schenney@chromium.org, zmo@chromium.org

4 years, 3 months ago (2016-09-14 12:43:48 UTC) #3

Dirk Pranke

dpranke@chromium.org changed reviewers: - dpranke@chromium.org

4 years, 3 months ago (2016-09-14 17:12:29 UTC) #5

Raymond Toy

rtoy@chromium.org changed reviewers: + dpranke@chromium.org, rtoy@chromium.org

4 years, 3 months ago (2016-09-14 17:14:11 UTC) #6

Raymond Toy

It will take me some time to review this. Can you please run the webaudio ...

4 years, 3 months ago (2016-09-14 17:14:11 UTC) #7

Prashant.Patil

On 2016/09/14 17:14:11, Raymond Toy wrote: > It will take me some time to review ...

4 years, 3 months ago (2016-09-16 07:56:01 UTC) #8

Raymond Toy

On 2016/09/16 07:56:01, Prashant.Patil wrote: > On 2016/09/14 17:14:11, Raymond Toy wrote: > > It ...

4 years, 3 months ago (2016-09-16 15:17:35 UTC) #9

Prashant.Patil

On 2016/09/16 15:17:35, Raymond Toy wrote: > On 2016/09/16 07:56:01, Prashant.Patil wrote: > > On ...

4 years, 3 months ago (2016-09-19 06:55:11 UTC) #10

Raymond Toy

On 2016/09/19 06:55:11, Prashant.Patil wrote: > On 2016/09/16 15:17:35, Raymond Toy wrote: > > On ...

4 years, 3 months ago (2016-09-19 15:37:31 UTC) #11

Prashant.Patil

On 2016/09/19 15:37:31, Raymond Toy-OOO TPAC Sep20-24 wrote: > On 2016/09/19 06:55:11, Prashant.Patil wrote: > ...

4 years, 3 months ago (2016-09-21 12:14:20 UTC) #12

Ken Russell (switch to Gerrit)

On 2016/09/21 12:14:20, Prashant.Patil wrote: > On 2016/09/19 15:37:31, Raymond Toy-OOO TPAC Sep20-24 wrote: > ...

4 years, 3 months ago (2016-09-22 00:46:04 UTC) #13

On 2016/09/21 12:14:20, Prashant.Patil wrote:
> On 2016/09/19 15:37:31, Raymond Toy-OOO TPAC Sep20-24 wrote:
> > On 2016/09/19 06:55:11, Prashant.Patil wrote:
> > > On 2016/09/16 15:17:35, Raymond Toy wrote:
> > > > On 2016/09/16 07:56:01, Prashant.Patil wrote:
> > > > > On 2016/09/14 17:14:11, Raymond Toy wrote:
> > > > > > It will take me some time to review this.
> > > > > > 
> > > > > > Can you please run the webaudio layout tests to make sure things
still
> > > pass
> > > > > with
> > > > > > these optimizations?
> > > > > > 
> > > > > > Can you also provide performance numbers for each of these routines
> > > compared
> > > > > to
> > > > > > the non-optimize versions?  If the gain is too small, we'd rather
not
> > have
> > > > to
> > > > > > have these hard-to-read and maintain code.
> > > > > 
> > > > > 
> > > > > Tested compliance tests with these MSA optimizations.
> > > > > 
> > > > > The performance numbers compared to non-optimized version are:
> > > > > vsma - 10x
> > > > > vsmul, vadd, vmul, zvmul - 5x
> > > > > vsvesq - 4x
> > > > > vmaxmgv - 10x
> > > > > vclip - 7x
> > > > 
> > > > Very nice speed improvements!  Did you just write a small test to
compare
> > the
> > > > performance
> > > > of these routines with and without MSA?
> > > > 
> > > > What compliance tests did you run?
> > > 
> > > 
> > > Yes. We had written standalone tests to compare the performance with and
> > without
> > > MSA.
> > > 
> > > Also tested for compliance with c version using standalone tests.
> > 
> > Thanks for the info.
> > 
> > Can you also run the webaudio layout tests?  They've been pretty good at
> finding
> > small changes, and caught a few issues when SSE2 support was added.
> 
> As i mentioned earlier, we have tested these MSA optimized kernels for all
> possible combinations/corner cases locally. As of today we don't have android
> device with MSA to run webaudio layout tests.

Raymond, if you give a thumbs-up I'll approve this.

Prashant.Patil

On 2016/09/22 00:46:04, Ken Russell wrote: > On 2016/09/21 12:14:20, Prashant.Patil wrote: > > On ...

4 years, 2 months ago (2016-09-28 09:18:07 UTC) #14

On 2016/09/22 00:46:04, Ken Russell wrote:
> On 2016/09/21 12:14:20, Prashant.Patil wrote:
> > On 2016/09/19 15:37:31, Raymond Toy-OOO TPAC Sep20-24 wrote:
> > > On 2016/09/19 06:55:11, Prashant.Patil wrote:
> > > > On 2016/09/16 15:17:35, Raymond Toy wrote:
> > > > > On 2016/09/16 07:56:01, Prashant.Patil wrote:
> > > > > > On 2016/09/14 17:14:11, Raymond Toy wrote:
> > > > > > > It will take me some time to review this.
> > > > > > > 
> > > > > > > Can you please run the webaudio layout tests to make sure things
> still
> > > > pass
> > > > > > with
> > > > > > > these optimizations?
> > > > > > > 
> > > > > > > Can you also provide performance numbers for each of these
routines
> > > > compared
> > > > > > to
> > > > > > > the non-optimize versions?  If the gain is too small, we'd rather
> not
> > > have
> > > > > to
> > > > > > > have these hard-to-read and maintain code.
> > > > > > 
> > > > > > 
> > > > > > Tested compliance tests with these MSA optimizations.
> > > > > > 
> > > > > > The performance numbers compared to non-optimized version are:
> > > > > > vsma - 10x
> > > > > > vsmul, vadd, vmul, zvmul - 5x
> > > > > > vsvesq - 4x
> > > > > > vmaxmgv - 10x
> > > > > > vclip - 7x
> > > > > 
> > > > > Very nice speed improvements!  Did you just write a small test to
> compare
> > > the
> > > > > performance
> > > > > of these routines with and without MSA?
> > > > > 
> > > > > What compliance tests did you run?
> > > > 
> > > > 
> > > > Yes. We had written standalone tests to compare the performance with and
> > > without
> > > > MSA.
> > > > 
> > > > Also tested for compliance with c version using standalone tests.
> > > 
> > > Thanks for the info.
> > > 
> > > Can you also run the webaudio layout tests?  They've been pretty good at
> > finding
> > > small changes, and caught a few issues when SSE2 support was added.
> > 
> > As i mentioned earlier, we have tested these MSA optimized kernels for all
> > possible combinations/corner cases locally. As of today we don't have
android
> > device with MSA to run webaudio layout tests.
> 
> Raymond, if you give a thumbs-up I'll approve this.

Raymond, your view on this? should we go for commit?

Raymond Toy

On 2016/09/28 09:18:07, Prashant.Patil wrote: > On 2016/09/22 00:46:04, Ken Russell wrote: > > On ...

4 years, 2 months ago (2016-09-28 16:01:23 UTC) #15

On 2016/09/28 09:18:07, Prashant.Patil wrote:
> On 2016/09/22 00:46:04, Ken Russell wrote:
> > On 2016/09/21 12:14:20, Prashant.Patil wrote:
> > > On 2016/09/19 15:37:31, Raymond Toy-OOO TPAC Sep20-24 wrote:
> > > > On 2016/09/19 06:55:11, Prashant.Patil wrote:
> > > > > On 2016/09/16 15:17:35, Raymond Toy wrote:
> > > > > > On 2016/09/16 07:56:01, Prashant.Patil wrote:
> > > > > > > On 2016/09/14 17:14:11, Raymond Toy wrote:
> > > > > > > > It will take me some time to review this.
> > > > > > > > 
> > > > > > > > Can you please run the webaudio layout tests to make sure things
> > still
> > > > > pass
> > > > > > > with
> > > > > > > > these optimizations?
> > > > > > > > 
> > > > > > > > Can you also provide performance numbers for each of these
> routines
> > > > > compared
> > > > > > > to
> > > > > > > > the non-optimize versions?  If the gain is too small, we'd
rather
> > not
> > > > have
> > > > > > to
> > > > > > > > have these hard-to-read and maintain code.
> > > > > > > 
> > > > > > > 
> > > > > > > Tested compliance tests with these MSA optimizations.
> > > > > > > 
> > > > > > > The performance numbers compared to non-optimized version are:
> > > > > > > vsma - 10x
> > > > > > > vsmul, vadd, vmul, zvmul - 5x
> > > > > > > vsvesq - 4x
> > > > > > > vmaxmgv - 10x
> > > > > > > vclip - 7x
> > > > > > 
> > > > > > Very nice speed improvements!  Did you just write a small test to
> > compare
> > > > the
> > > > > > performance
> > > > > > of these routines with and without MSA?
> > > > > > 
> > > > > > What compliance tests did you run?
> > > > > 
> > > > > 
> > > > > Yes. We had written standalone tests to compare the performance with
and
> > > > without
> > > > > MSA.
> > > > > 
> > > > > Also tested for compliance with c version using standalone tests.
> > > > 
> > > > Thanks for the info.
> > > > 
> > > > Can you also run the webaudio layout tests?  They've been pretty good at
> > > finding
> > > > small changes, and caught a few issues when SSE2 support was added.
> > > 
> > > As i mentioned earlier, we have tested these MSA optimized kernels for all
> > > possible combinations/corner cases locally. As of today we don't have
> android
> > > device with MSA to run webaudio layout tests.
> > 
> > Raymond, if you give a thumbs-up I'll approve this.
> 
> Raymond, your view on this? should we go for commit?

The general idea is great and we should do this.  However, I've been away at
TPAC so I haven't reviewed the code yet.

Could you tell me how much difference in numerical results there is between the
scalar and simd code?  Perhaps as an SNR measure between the scalar and simd
results?  Or were the results exactly identical?

Prashant.Patil

On 2016/09/28 16:01:23, Raymond Toy wrote: > On 2016/09/28 09:18:07, Prashant.Patil wrote: > > On ...

4 years, 2 months ago (2016-09-29 13:25:00 UTC) #16

On 2016/09/28 16:01:23, Raymond Toy wrote:
> On 2016/09/28 09:18:07, Prashant.Patil wrote:
> > On 2016/09/22 00:46:04, Ken Russell wrote:
> > > On 2016/09/21 12:14:20, Prashant.Patil wrote:
> > > > On 2016/09/19 15:37:31, Raymond Toy-OOO TPAC Sep20-24 wrote:
> > > > > On 2016/09/19 06:55:11, Prashant.Patil wrote:
> > > > > > On 2016/09/16 15:17:35, Raymond Toy wrote:
> > > > > > > On 2016/09/16 07:56:01, Prashant.Patil wrote:
> > > > > > > > On 2016/09/14 17:14:11, Raymond Toy wrote:
> > > > > > > > > It will take me some time to review this.
> > > > > > > > > 
> > > > > > > > > Can you please run the webaudio layout tests to make sure
things
> > > still
> > > > > > pass
> > > > > > > > with
> > > > > > > > > these optimizations?
> > > > > > > > > 
> > > > > > > > > Can you also provide performance numbers for each of these
> > routines
> > > > > > compared
> > > > > > > > to
> > > > > > > > > the non-optimize versions?  If the gain is too small, we'd
> rather
> > > not
> > > > > have
> > > > > > > to
> > > > > > > > > have these hard-to-read and maintain code.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Tested compliance tests with these MSA optimizations.
> > > > > > > > 
> > > > > > > > The performance numbers compared to non-optimized version are:
> > > > > > > > vsma - 10x
> > > > > > > > vsmul, vadd, vmul, zvmul - 5x
> > > > > > > > vsvesq - 4x
> > > > > > > > vmaxmgv - 10x
> > > > > > > > vclip - 7x
> > > > > > > 
> > > > > > > Very nice speed improvements!  Did you just write a small test to
> > > compare
> > > > > the
> > > > > > > performance
> > > > > > > of these routines with and without MSA?
> > > > > > > 
> > > > > > > What compliance tests did you run?
> > > > > > 
> > > > > > 
> > > > > > Yes. We had written standalone tests to compare the performance with
> and
> > > > > without
> > > > > > MSA.
> > > > > > 
> > > > > > Also tested for compliance with c version using standalone tests.
> > > > > 
> > > > > Thanks for the info.
> > > > > 
> > > > > Can you also run the webaudio layout tests?  They've been pretty good
at
> > > > finding
> > > > > small changes, and caught a few issues when SSE2 support was added.
> > > > 
> > > > As i mentioned earlier, we have tested these MSA optimized kernels for
all
> > > > possible combinations/corner cases locally. As of today we don't have
> > android
> > > > device with MSA to run webaudio layout tests.
> > > 
> > > Raymond, if you give a thumbs-up I'll approve this.
> > 
> > Raymond, your view on this? should we go for commit?
> 
> The general idea is great and we should do this.  However, I've been away at
> TPAC so I haven't reviewed the code yet.
> 
> Could you tell me how much difference in numerical results there is between
the
> scalar and simd code?  Perhaps as an SNR measure between the scalar and simd
> results?  Or were the results exactly identical?

1. The scalar and simd code for following functions generate exactly identical
results -
   vsma , vsmul, vadd, vmul, vmaxmgv, vclip

2. zvmul -
  In scalar code :  The operation is (a*b) +/- (c*d). Multiplications are done
first and then addition/subtraction. So 3 rounding errors (2 for multiply, 1 for
add/sub)
  In simd code   :  The first multiplication (a*b) is done, then fused
multiply-add instruciton is used. Here only 2 rounding errors.
  Hence the simd results are more precise than scalar counterpart. The absolute
max error observed between scalar and simd code is 0.005

3. vsvesq -
  Due to difference in accumulation order in simd and scalar code(Please refer
the msa simd code for vsvesq), the standard rounding error is accumulated
differently hence the result diverts from that of scalar counterpart.

Raymond Toy

On 2016/09/29 13:25:00, Prashant.Patil wrote: > On 2016/09/28 16:01:23, Raymond Toy wrote: > > On ...

4 years, 2 months ago (2016-09-29 18:12:24 UTC) #17

On 2016/09/29 13:25:00, Prashant.Patil wrote:
> On 2016/09/28 16:01:23, Raymond Toy wrote:
> > On 2016/09/28 09:18:07, Prashant.Patil wrote:
> > > On 2016/09/22 00:46:04, Ken Russell wrote:
> > > > On 2016/09/21 12:14:20, Prashant.Patil wrote:
> > > > > On 2016/09/19 15:37:31, Raymond Toy-OOO TPAC Sep20-24 wrote:
> > > > > > On 2016/09/19 06:55:11, Prashant.Patil wrote:
> > > > > > > On 2016/09/16 15:17:35, Raymond Toy wrote:
> > > > > > > > On 2016/09/16 07:56:01, Prashant.Patil wrote:
> > > > > > > > > On 2016/09/14 17:14:11, Raymond Toy wrote:
> > > > > > > > > > It will take me some time to review this.
> > > > > > > > > > 
> > > > > > > > > > Can you please run the webaudio layout tests to make sure
> things
> > > > still
> > > > > > > pass
> > > > > > > > > with
> > > > > > > > > > these optimizations?
> > > > > > > > > > 
> > > > > > > > > > Can you also provide performance numbers for each of these
> > > routines
> > > > > > > compared
> > > > > > > > > to
> > > > > > > > > > the non-optimize versions?  If the gain is too small, we'd
> > rather
> > > > not
> > > > > > have
> > > > > > > > to
> > > > > > > > > > have these hard-to-read and maintain code.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Tested compliance tests with these MSA optimizations.
> > > > > > > > > 
> > > > > > > > > The performance numbers compared to non-optimized version are:
> > > > > > > > > vsma - 10x
> > > > > > > > > vsmul, vadd, vmul, zvmul - 5x
> > > > > > > > > vsvesq - 4x
> > > > > > > > > vmaxmgv - 10x
> > > > > > > > > vclip - 7x
> > > > > > > > 
> > > > > > > > Very nice speed improvements!  Did you just write a small test
to
> > > > compare
> > > > > > the
> > > > > > > > performance
> > > > > > > > of these routines with and without MSA?
> > > > > > > > 
> > > > > > > > What compliance tests did you run?
> > > > > > > 
> > > > > > > 
> > > > > > > Yes. We had written standalone tests to compare the performance
with
> > and
> > > > > > without
> > > > > > > MSA.
> > > > > > > 
> > > > > > > Also tested for compliance with c version using standalone tests.
> > > > > > 
> > > > > > Thanks for the info.
> > > > > > 
> > > > > > Can you also run the webaudio layout tests?  They've been pretty
good
> at
> > > > > finding
> > > > > > small changes, and caught a few issues when SSE2 support was added.
> > > > > 
> > > > > As i mentioned earlier, we have tested these MSA optimized kernels for
> all
> > > > > possible combinations/corner cases locally. As of today we don't have
> > > android
> > > > > device with MSA to run webaudio layout tests.
> > > > 
> > > > Raymond, if you give a thumbs-up I'll approve this.
> > > 
> > > Raymond, your view on this? should we go for commit?
> > 
> > The general idea is great and we should do this.  However, I've been away at
> > TPAC so I haven't reviewed the code yet.
> > 
> > Could you tell me how much difference in numerical results there is between
> the
> > scalar and simd code?  Perhaps as an SNR measure between the scalar and simd
> > results?  Or were the results exactly identical?
> 
> 1. The scalar and simd code for following functions generate exactly identical
> results -
>    vsma , vsmul, vadd, vmul, vmaxmgv, vclip

Fantastic! I have no problems including these simd optimizations for these
operations.

> 
> 2. zvmul -
>   In scalar code :  The operation is (a*b) +/- (c*d). Multiplications are done
> first and then addition/subtraction. So 3 rounding errors (2 for multiply, 1
for
> add/sub)
>   In simd code   :  The first multiplication (a*b) is done, then fused
> multiply-add instruciton is used. Here only 2 rounding errors.
>   Hence the simd results are more precise than scalar counterpart. The
absolute
> max error observed between scalar and simd code is 0.005

Without knowing the magnitude of the input numbers, I can't really conclude
anything meaningful, but on the face of it an error of 0.005 is huge and
unacceptable.

Also, I recall some paper by Prof. Kahan about fused multiplies for these kinds
of computations.  Depending on which product is done in the FMA (a*b or c*d),
the result could be erroneously negative for something that should have been
positive. Unfortunately, I can't seem to find that paper.  Thus, I'm a bit
concerned about using an FMA here.

> 
> 3. vsvesq -
>   Due to difference in accumulation order in simd and scalar code(Please refer
> the msa simd code for vsvesq), the standard rounding error is accumulated
> differently hence the result diverts from that of scalar counterpart.

Sure, but how much?

It's really important to make sure the simd optimizations don't change the
computed results by too much.  Such differences might be audible if combined in
certain ways.

Prashant.Patil

On 2016/09/29 18:12:24, Raymond Toy wrote: > On 2016/09/29 13:25:00, Prashant.Patil wrote: > > On ...

4 years, 2 months ago (2016-09-30 09:01:42 UTC) #18

On 2016/09/29 18:12:24, Raymond Toy wrote:
> On 2016/09/29 13:25:00, Prashant.Patil wrote:
> > On 2016/09/28 16:01:23, Raymond Toy wrote:
> > > On 2016/09/28 09:18:07, Prashant.Patil wrote:
> > > > On 2016/09/22 00:46:04, Ken Russell wrote:
> > > > > On 2016/09/21 12:14:20, Prashant.Patil wrote:
> > > > > > On 2016/09/19 15:37:31, Raymond Toy-OOO TPAC Sep20-24 wrote:
> > > > > > > On 2016/09/19 06:55:11, Prashant.Patil wrote:
> > > > > > > > On 2016/09/16 15:17:35, Raymond Toy wrote:
> > > > > > > > > On 2016/09/16 07:56:01, Prashant.Patil wrote:
> > > > > > > > > > On 2016/09/14 17:14:11, Raymond Toy wrote:
> > > > > > > > > > > It will take me some time to review this.
> > > > > > > > > > > 
> > > > > > > > > > > Can you please run the webaudio layout tests to make sure
> > things
> > > > > still
> > > > > > > > pass
> > > > > > > > > > with
> > > > > > > > > > > these optimizations?
> > > > > > > > > > > 
> > > > > > > > > > > Can you also provide performance numbers for each of these
> > > > routines
> > > > > > > > compared
> > > > > > > > > > to
> > > > > > > > > > > the non-optimize versions?  If the gain is too small, we'd
> > > rather
> > > > > not
> > > > > > > have
> > > > > > > > > to
> > > > > > > > > > > have these hard-to-read and maintain code.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Tested compliance tests with these MSA optimizations.
> > > > > > > > > > 
> > > > > > > > > > The performance numbers compared to non-optimized version
are:
> > > > > > > > > > vsma - 10x
> > > > > > > > > > vsmul, vadd, vmul, zvmul - 5x
> > > > > > > > > > vsvesq - 4x
> > > > > > > > > > vmaxmgv - 10x
> > > > > > > > > > vclip - 7x
> > > > > > > > > 
> > > > > > > > > Very nice speed improvements!  Did you just write a small test
> to
> > > > > compare
> > > > > > > the
> > > > > > > > > performance
> > > > > > > > > of these routines with and without MSA?
> > > > > > > > > 
> > > > > > > > > What compliance tests did you run?
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Yes. We had written standalone tests to compare the performance
> with
> > > and
> > > > > > > without
> > > > > > > > MSA.
> > > > > > > > 
> > > > > > > > Also tested for compliance with c version using standalone
tests.
> > > > > > > 
> > > > > > > Thanks for the info.
> > > > > > > 
> > > > > > > Can you also run the webaudio layout tests?  They've been pretty
> good
> > at
> > > > > > finding
> > > > > > > small changes, and caught a few issues when SSE2 support was
added.
> > > > > > 
> > > > > > As i mentioned earlier, we have tested these MSA optimized kernels
for
> > all
> > > > > > possible combinations/corner cases locally. As of today we don't
have
> > > > android
> > > > > > device with MSA to run webaudio layout tests.
> > > > > 
> > > > > Raymond, if you give a thumbs-up I'll approve this.
> > > > 
> > > > Raymond, your view on this? should we go for commit?
> > > 
> > > The general idea is great and we should do this.  However, I've been away
at
> > > TPAC so I haven't reviewed the code yet.
> > > 
> > > Could you tell me how much difference in numerical results there is
between
> > the
> > > scalar and simd code?  Perhaps as an SNR measure between the scalar and
simd
> > > results?  Or were the results exactly identical?
> > 
> > 1. The scalar and simd code for following functions generate exactly
identical
> > results -
> >    vsma , vsmul, vadd, vmul, vmaxmgv, vclip
> 
> Fantastic! I have no problems including these simd optimizations for these
> operations.
> 
> > 
> > 2. zvmul -
> >   In scalar code :  The operation is (a*b) +/- (c*d). Multiplications are
done
> > first and then addition/subtraction. So 3 rounding errors (2 for multiply, 1
> for
> > add/sub)
> >   In simd code   :  The first multiplication (a*b) is done, then fused
> > multiply-add instruciton is used. Here only 2 rounding errors.
> >   Hence the simd results are more precise than scalar counterpart. The
> absolute
> > max error observed between scalar and simd code is 0.005
> 
> Without knowing the magnitude of the input numbers, I can't really conclude
> anything meaningful, but on the face of it an error of 0.005 is huge and
> unacceptable.
> 
> Also, I recall some paper by Prof. Kahan about fused multiplies for these
kinds
> of computations.  Depending on which product is done in the FMA (a*b or c*d),
> the result could be erroneously negative for something that should have been
> positive. Unfortunately, I can't seem to find that paper.  Thus, I'm a bit
> concerned about using an FMA here.
> 
> > 
> > 3. vsvesq -
> >   Due to difference in accumulation order in simd and scalar code(Please
refer
> > the msa simd code for vsvesq), the standard rounding error is accumulated
> > differently hence the result diverts from that of scalar counterpart.
> 
> Sure, but how much?
> 
> It's really important to make sure the simd optimizations don't change the
> computed results by too much.  Such differences might be audible if combined
in
> certain ways.

Thanks for the inputs.
I will create the new patch set for 6 matching functions.
Meanwhile i will do some more analysis and come up with data later for remaining
2 functions i.e. zvmul and vsvesq.

Prashant.Patil

On 2016/09/30 09:01:42, Prashant.Patil wrote: > On 2016/09/29 18:12:24, Raymond Toy wrote: > > On ...

4 years, 2 months ago (2016-09-30 09:09:28 UTC) #19

On 2016/09/30 09:01:42, Prashant.Patil wrote:
> On 2016/09/29 18:12:24, Raymond Toy wrote:
> > On 2016/09/29 13:25:00, Prashant.Patil wrote:
> > > On 2016/09/28 16:01:23, Raymond Toy wrote:
> > > > On 2016/09/28 09:18:07, Prashant.Patil wrote:
> > > > > On 2016/09/22 00:46:04, Ken Russell wrote:
> > > > > > On 2016/09/21 12:14:20, Prashant.Patil wrote:
> > > > > > > On 2016/09/19 15:37:31, Raymond Toy-OOO TPAC Sep20-24 wrote:
> > > > > > > > On 2016/09/19 06:55:11, Prashant.Patil wrote:
> > > > > > > > > On 2016/09/16 15:17:35, Raymond Toy wrote:
> > > > > > > > > > On 2016/09/16 07:56:01, Prashant.Patil wrote:
> > > > > > > > > > > On 2016/09/14 17:14:11, Raymond Toy wrote:
> > > > > > > > > > > > It will take me some time to review this.
> > > > > > > > > > > > 
> > > > > > > > > > > > Can you please run the webaudio layout tests to make
sure
> > > things
> > > > > > still
> > > > > > > > > pass
> > > > > > > > > > > with
> > > > > > > > > > > > these optimizations?
> > > > > > > > > > > > 
> > > > > > > > > > > > Can you also provide performance numbers for each of
these
> > > > > routines
> > > > > > > > > compared
> > > > > > > > > > > to
> > > > > > > > > > > > the non-optimize versions?  If the gain is too small,
we'd
> > > > rather
> > > > > > not
> > > > > > > > have
> > > > > > > > > > to
> > > > > > > > > > > > have these hard-to-read and maintain code.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Tested compliance tests with these MSA optimizations.
> > > > > > > > > > > 
> > > > > > > > > > > The performance numbers compared to non-optimized version
> are:
> > > > > > > > > > > vsma - 10x
> > > > > > > > > > > vsmul, vadd, vmul, zvmul - 5x
> > > > > > > > > > > vsvesq - 4x
> > > > > > > > > > > vmaxmgv - 10x
> > > > > > > > > > > vclip - 7x
> > > > > > > > > > 
> > > > > > > > > > Very nice speed improvements!  Did you just write a small
test
> > to
> > > > > > compare
> > > > > > > > the
> > > > > > > > > > performance
> > > > > > > > > > of these routines with and without MSA?
> > > > > > > > > > 
> > > > > > > > > > What compliance tests did you run?
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Yes. We had written standalone tests to compare the
performance
> > with
> > > > and
> > > > > > > > without
> > > > > > > > > MSA.
> > > > > > > > > 
> > > > > > > > > Also tested for compliance with c version using standalone
> tests.
> > > > > > > > 
> > > > > > > > Thanks for the info.
> > > > > > > > 
> > > > > > > > Can you also run the webaudio layout tests?  They've been pretty
> > good
> > > at
> > > > > > > finding
> > > > > > > > small changes, and caught a few issues when SSE2 support was
> added.
> > > > > > > 
> > > > > > > As i mentioned earlier, we have tested these MSA optimized kernels
> for
> > > all
> > > > > > > possible combinations/corner cases locally. As of today we don't
> have
> > > > > android
> > > > > > > device with MSA to run webaudio layout tests.
> > > > > > 
> > > > > > Raymond, if you give a thumbs-up I'll approve this.
> > > > > 
> > > > > Raymond, your view on this? should we go for commit?
> > > > 
> > > > The general idea is great and we should do this.  However, I've been
away
> at
> > > > TPAC so I haven't reviewed the code yet.
> > > > 
> > > > Could you tell me how much difference in numerical results there is
> between
> > > the
> > > > scalar and simd code?  Perhaps as an SNR measure between the scalar and
> simd
> > > > results?  Or were the results exactly identical?
> > > 
> > > 1. The scalar and simd code for following functions generate exactly
> identical
> > > results -
> > >    vsma , vsmul, vadd, vmul, vmaxmgv, vclip
> > 
> > Fantastic! I have no problems including these simd optimizations for these
> > operations.
> > 
> > > 
> > > 2. zvmul -
> > >   In scalar code :  The operation is (a*b) +/- (c*d). Multiplications are
> done
> > > first and then addition/subtraction. So 3 rounding errors (2 for multiply,
1
> > for
> > > add/sub)
> > >   In simd code   :  The first multiplication (a*b) is done, then fused
> > > multiply-add instruciton is used. Here only 2 rounding errors.
> > >   Hence the simd results are more precise than scalar counterpart. The
> > absolute
> > > max error observed between scalar and simd code is 0.005
> > 
> > Without knowing the magnitude of the input numbers, I can't really conclude
> > anything meaningful, but on the face of it an error of 0.005 is huge and
> > unacceptable.
> > 
> > Also, I recall some paper by Prof. Kahan about fused multiplies for these
> kinds
> > of computations.  Depending on which product is done in the FMA (a*b or
c*d),
> > the result could be erroneously negative for something that should have been
> > positive. Unfortunately, I can't seem to find that paper.  Thus, I'm a bit
> > concerned about using an FMA here.
> > 
> > > 
> > > 3. vsvesq -
> > >   Due to difference in accumulation order in simd and scalar code(Please
> refer
> > > the msa simd code for vsvesq), the standard rounding error is accumulated
> > > differently hence the result diverts from that of scalar counterpart.
> > 
> > Sure, but how much?
> > 
> > It's really important to make sure the simd optimizations don't change the
> > computed results by too much.  Such differences might be audible if combined
> in
> > certain ways.
> 
> Thanks for the inputs.
> I will create the new patch set for 6 matching functions.
> Meanwhile i will do some more analysis and come up with data later for
remaining
> 2 functions i.e. zvmul and vsvesq.

Created new patch set (#2) with 6 matching functions only (vsma , vsmul, vadd,
vmul, vmaxmgv, vclip) so that we can go with commit.

Raymond Toy

https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Source/platform/audio/VectorMath.cpp File third_party/WebKit/Source/platform/audio/VectorMath.cpp (right): https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Source/platform/audio/VectorMath.cpp#newcode197 third_party/WebKit/Source/platform/audio/VectorMath.cpp:197: LD_SP8(sourceP, 4, vSrc0, vSrc1, vSrc2, vSrc3, vSrc4, vSrc5, vSrc6, ...

4 years, 2 months ago (2016-10-03 16:47:07 UTC) #20

Prashant.Patil

https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Source/platform/audio/VectorMath.cpp File third_party/WebKit/Source/platform/audio/VectorMath.cpp (right): https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Source/platform/audio/VectorMath.cpp#newcode197 third_party/WebKit/Source/platform/audio/VectorMath.cpp:197: LD_SP8(sourceP, 4, vSrc0, vSrc1, vSrc2, vSrc3, vSrc4, vSrc5, vSrc6, ...

4 years, 2 months ago (2016-10-04 11:47:28 UTC) #21

https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Sour...
File third_party/WebKit/Source/platform/audio/VectorMath.cpp (right):

https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Sour...
third_party/WebKit/Source/platform/audio/VectorMath.cpp:197: LD_SP8(sourceP, 4,
vSrc0, vSrc1, vSrc2, vSrc3, vSrc4, vSrc5, vSrc6, vSrc7);
On 2016/10/03 16:47:06, Raymond Toy wrote:
> Are there alignment constraints for sourceP and destPCopy?  In general
> AudioArrays are on 16-byte boundaries (I think), but I'm not sure this hold
> everywhere for all callers.  Same issue applies to all of the routines below.

There are no alignment constraints

https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Sour...
third_party/WebKit/Source/platform/audio/VectorMath.cpp:205: if (n >= 28) {
On 2016/10/03 16:47:06, Raymond Toy wrote:
> Is there really much to be gained in having this complicated unrolling for all
> of the possible remainders that aren't a multiple of 32 but are a multiple of
4?
> 
> The primary use case is for blocks of 128, so this code is never executed. 
for
> other uses, I think they're always a power of two (greater than 128), so this
> code is never executed there either.

OK. I shall remove all cases below 32.

https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Sour...
third_party/WebKit/Source/platform/audio/VectorMath.cpp:333: for (; n >= 40; n
-= 40) {
On 2016/10/03 16:47:06, Raymond Toy wrote:
> Is it really worth doing blocks of 40 instead of 32?  Since the typical use
case
> is powers of two, this guarantees the messy code below is run.
> 
> Using blocks of 32 keeps a nice symmetry with the other code that uses 32.

OK. This was done considering 10 cycle vector load latency in one of the mips
variants. However most of the variants have ~8 cycle latency now, hence will
change this to 32.

https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Sour...
third_party/WebKit/Source/platform/audio/VectorMath.cpp:512: LD_SP8(source2P, 4,
vSrc4, vSrc5, vSrc6, vSrc7, vSrc12, vSrc13, vSrc14, vSrc15);
On 2016/10/03 16:47:06, Raymond Toy wrote:
> Can we pick better names for vSrc[n]?  It's really odd to see source1P loaded
> into vSrc0-3 and vSrc8-11.

Done.

https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Sour...
third_party/WebKit/Source/platform/audio/VectorMath.cpp:519: if (n >= 20) {
On 2016/10/03 16:47:06, Raymond Toy wrote:
> Is this really worth doing?  In the typical use case n is always a multiple of
> 32.

Done.

https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Sour...
third_party/WebKit/Source/platform/audio/VectorMath.cpp:679: LD_SP8(source1P, 4,
vSrc0, vSrc1, vSrc2, vSrc3, vSrc8, vSrc9, vSrc10, vSrc11);
On 2016/10/03 16:47:06, Raymond Toy wrote:
> Same comment as in line 512.

Done.

https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Sour...
third_party/WebKit/Source/platform/audio/VectorMath.cpp:959: LD_SP6(sourceP, 4,
vSrc0, vSrc1, vSrc2, vSrc3, vSrc4, vSrc5);
On 2016/10/03 16:47:06, Raymond Toy wrote:
> Same comment as in line 333.

Done.

https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Sour...
File third_party/WebKit/Source/platform/cpu/mips/CommonMacrosMSA.h (right):

https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Sour...
third_party/WebKit/Source/platform/cpu/mips/CommonMacrosMSA.h:526: */
On 2016/10/03 16:47:06, Raymond Toy wrote:
> Comment style here seems inconsistent with the nearby commenting style.

Done.

https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Sour...
third_party/WebKit/Source/platform/cpu/mips/CommonMacrosMSA.h:532: #define
ADD4(in0, in1, in2, in3, in4, in5, in6, in7, out0, out1, out2, out3)  \
On 2016/10/03 16:47:06, Raymond Toy wrote:
> I know the naming here is consistent with the rest of the file, but it's
almost
> impossible to know which if the in's get added to produce which out.

I will add macro description for better understanding.

https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Sour...
third_party/WebKit/Source/platform/cpu/mips/CommonMacrosMSA.h:577: Arguments   :
Inputs  - in0, in1, scale
On 2016/10/03 16:47:06, Raymond Toy wrote:
> Based on the code, out0, and out1 are also inputs.

Done.

https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Sour...
third_party/WebKit/Source/platform/cpu/mips/CommonMacrosMSA.h:612: Outputs -
out0, out1
On 2016/10/03 16:47:06, Raymond Toy wrote:
> There are not outputs out0, out1, just the one output, max.  And there's also
> the input mask.
> 
> Impossible to tell what mask is for other than looking at the code.

I will update the macro usage for better code readability.

https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Sour...
third_party/WebKit/Source/platform/cpu/mips/CommonMacrosMSA.h:628: Arguments   :
Inputs  - in0, in1
On 2016/10/03 16:47:06, Raymond Toy wrote:
> You forgot min, max as inputs.

Done.

Prashant.Patil

Description was changed from ========== Add MSA (MIPS SIMD Arch) optimized VectorMath functions We add ...

4 years, 2 months ago (2016-10-04 14:35:46 UTC) #22

Raymond Toy

https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Source/platform/audio/VectorMath.cpp File third_party/WebKit/Source/platform/audio/VectorMath.cpp (right): https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Source/platform/audio/VectorMath.cpp#newcode333 third_party/WebKit/Source/platform/audio/VectorMath.cpp:333: for (; n >= 40; n -= 40) { ...

4 years, 2 months ago (2016-10-04 15:37:44 UTC) #23

Raymond Toy

https://codereview.chromium.org/2340583003/diff/40001/third_party/WebKit/Source/platform/audio/VectorMath.cpp File third_party/WebKit/Source/platform/audio/VectorMath.cpp (right): https://codereview.chromium.org/2340583003/diff/40001/third_party/WebKit/Source/platform/audio/VectorMath.cpp#newcode859 third_party/WebKit/Source/platform/audio/VectorMath.cpp:859: const v4f32 vHighThr = (v4f32)__msa_fill_w(*((int32_t*)highThresholdP)); Use C++ casting here ...

4 years, 2 months ago (2016-10-04 15:48:07 UTC) #24

Prashant.Patil

On 2016/10/04 15:37:44, Raymond Toy wrote: > https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Source/platform/audio/VectorMath.cpp > File third_party/WebKit/Source/platform/audio/VectorMath.cpp (right): > > https://codereview.chromium.org/2340583003/diff/20001/third_party/WebKit/Source/platform/audio/VectorMath.cpp#newcode333 ...

4 years, 2 months ago (2016-10-05 12:45:27 UTC) #25

Prashant.Patil

On 2016/10/04 15:48:07, Raymond Toy wrote: > https://codereview.chromium.org/2340583003/diff/40001/third_party/WebKit/Source/platform/audio/VectorMath.cpp > File third_party/WebKit/Source/platform/audio/VectorMath.cpp (right): > > https://codereview.chromium.org/2340583003/diff/40001/third_party/WebKit/Source/platform/audio/VectorMath.cpp#newcode859 ...

4 years, 2 months ago (2016-10-05 12:46:48 UTC) #26

Raymond Toy

https://codereview.chromium.org/2340583003/diff/80001/third_party/WebKit/Source/platform/audio/VectorMath.cpp File third_party/WebKit/Source/platform/audio/VectorMath.cpp (right): https://codereview.chromium.org/2340583003/diff/80001/third_party/WebKit/Source/platform/audio/VectorMath.cpp#newcode858 third_party/WebKit/Source/platform/audio/VectorMath.cpp:858: const v4f32 vLowThr = (v4f32)__msa_fill_w(FLOAT2INT(lowThreshold)); Isn't this some kind ...

4 years, 2 months ago (2016-10-05 17:26:53 UTC) #28

Prashant.Patil

https://codereview.chromium.org/2340583003/diff/80001/third_party/WebKit/Source/platform/audio/VectorMath.cpp File third_party/WebKit/Source/platform/audio/VectorMath.cpp (right): https://codereview.chromium.org/2340583003/diff/80001/third_party/WebKit/Source/platform/audio/VectorMath.cpp#newcode858 third_party/WebKit/Source/platform/audio/VectorMath.cpp:858: const v4f32 vLowThr = (v4f32)__msa_fill_w(FLOAT2INT(lowThreshold)); On 2016/10/05 17:26:53, Raymond ...

4 years, 2 months ago (2016-10-06 08:27:35 UTC) #29

Raymond Toy

Thanks for your patience in this! It's great to have these nice speedups for the ...

4 years, 2 months ago (2016-10-06 15:42:48 UTC) #31

Prashant.Patil

https://codereview.chromium.org/2340583003/diff/140001/third_party/WebKit/Source/platform/cpu/mips/CommonMacrosMSA.h File third_party/WebKit/Source/platform/cpu/mips/CommonMacrosMSA.h (right): https://codereview.chromium.org/2340583003/diff/140001/third_party/WebKit/Source/platform/cpu/mips/CommonMacrosMSA.h#newcode19 third_party/WebKit/Source/platform/cpu/mips/CommonMacrosMSA.h:19: }FloatInt; On 2016/10/06 15:42:48, Raymond Toy wrote: > Nit: ...

4 years, 2 months ago (2016-10-07 08:08:06 UTC) #32

Prashant.Patil

The CQ bit was checked by prashant.patil@imgtec.com

4 years, 2 months ago (2016-10-07 08:09:59 UTC) #34

Prashant.Patil

The patchset sent to the CQ was uploaded after l-g-t-m from rtoy@chromium.org Link to the ...

4 years, 2 months ago (2016-10-07 08:09:59 UTC) #35

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2340583003/160001

4 years, 2 months ago (2016-10-07 08:10:15 UTC) #36

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 2 months ago (2016-10-07 08:20:43 UTC) #37

commit-bot: I haz the power

Try jobs failed on following builders: chromium_presubmit on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/chromium_presubmit/builds/276217)

4 years, 2 months ago (2016-10-07 08:20:45 UTC) #38

Prashant.Patil

On 2016/10/07 08:20:45, commit-bot: I haz the power wrote: > Try jobs failed on following ...

4 years, 2 months ago (2016-10-07 13:00:37 UTC) #39

Raymond Toy

On 2016/10/07 13:00:37, Prashant.Patil wrote: > On 2016/10/07 08:20:45, commit-bot: I haz the power wrote: ...

4 years, 2 months ago (2016-10-07 15:07:52 UTC) #40

Ken Russell (switch to Gerrit)

kbr@chromium.org changed reviewers: + prashant.patil@imgtec.com - Prashant.Patil@imgtec.com

4 years, 2 months ago (2016-10-07 20:02:42 UTC) #41

Raymond Toy

Prashant: This CL should be ready to land; all approvals have been granted, I think.

4 years, 2 months ago (2016-10-10 16:37:11 UTC) #43

Prashant.Patil

The CQ bit was checked by prashant.patil@imgtec.com

4 years, 2 months ago (2016-10-12 07:13:20 UTC) #44

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2340583003/160001

4 years, 2 months ago (2016-10-12 07:13:34 UTC) #45

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 2 months ago (2016-10-12 07:22:58 UTC) #46

commit-bot: I haz the power

Try jobs failed on following builders: chromium_presubmit on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/chromium_presubmit/builds/279100)

4 years, 2 months ago (2016-10-12 07:23:00 UTC) #47

Prashant.Patil

The CQ bit was checked by prashant.patil@imgtec.com

4 years, 2 months ago (2016-10-12 08:42:00 UTC) #48

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2340583003/160001

4 years, 2 months ago (2016-10-12 08:42:13 UTC) #49

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 2 months ago (2016-10-12 08:47:17 UTC) #50

commit-bot: I haz the power

Try jobs failed on following builders: chromium_presubmit on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/chromium_presubmit/builds/279142)

4 years, 2 months ago (2016-10-12 08:47:18 UTC) #51

Prashant.Patil

On 2016/10/12 08:47:18, commit-bot: I haz the power wrote: > Try jobs failed on following ...

4 years, 2 months ago (2016-10-12 15:05:21 UTC) #52

Raymond Toy

On 2016/10/12 15:05:21, Prashant.Patil wrote: > On 2016/10/12 08:47:18, commit-bot: I haz the power wrote: ...

4 years, 2 months ago (2016-10-12 15:09:24 UTC) #53

Prashant.Patil

The CQ bit was checked by prashant.patil@imgtec.com

4 years, 2 months ago (2016-10-13 06:26:21 UTC) #55

Prashant.Patil

The patchset sent to the CQ was uploaded after l-g-t-m from kbr@chromium.org, rtoy@chromium.org Link to ...

4 years, 2 months ago (2016-10-13 06:26:21 UTC) #56

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2340583003/180001

4 years, 2 months ago (2016-10-13 06:26:38 UTC) #57

commit-bot: I haz the power

Description was changed from ========== Add MSA (MIPS SIMD Arch) optimized VectorMath functions We add ...

4 years, 2 months ago (2016-10-13 07:43:24 UTC) #58

commit-bot: I haz the power

Description was changed from ========== Add MSA (MIPS SIMD Arch) optimized VectorMath functions We add ...

4 years, 2 months ago (2016-10-13 07:45:21 UTC) #60

commit-bot: I haz the power

4 years, 2 months ago (2016-10-13 07:45:23 UTC) #61

Message was sent while issue was closed.

Patchset 10 (id:??) landed as
https://crrev.com/95d7e1276aed8f2d2687de388d83919b813c78dc
Cr-Commit-Position: refs/heads/master@{#424977}

Issue 2340583003: Add MSA (MIPS SIMD Arch) optimized VectorMath functions (Closed)

Description

Patch Set 1 #

Patch Set 2 : Removing zvmul and vsvesq #

Patch Set 3 : Incorporate review comments #

Patch Set 4 : Rebase #

Patch Set 5 : Incorporate review comments #

Patch Set 6 : Incorporate review comments #

Patch Set 7 : Add MSA (MIPS SIMD Arch) optimized VectorMath functions #

Patch Set 8 : coding consistency changes #

Patch Set 9 : Incorporated review comments #

Patch Set 10 : formatting changes #

Messages