|
|
Created:
4 years, 8 months ago by zhengxiong Modified:
4 years, 8 months ago CC:
chromium-reviews, feature-media-reviews_chromium.org, halliwell Base URL:
https://chromium.googlesource.com/chromium/src.git@master Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionClamp AudioBuffer float to int{16,32} conversion
FFmpeg does not compensate for rounding errors during decode so the
resulting values can be outside the [-1, 1] range expected for
floating-point audio.
BUG=internal b/22976943
Patch Set 1 #Patch Set 2 : Fix handling of 32-bit, fix asymmetric scale #
Total comments: 4
Patch Set 3 : Remove doubles #Patch Set 4 : Reuse original AudioBus code #
Total comments: 3
Patch Set 5 : Add (bad) rounding #Patch Set 6 : Fix unit tests #Patch Set 7 : Have separate functions for rounding and non-rounding #
Messages
Total messages: 52 (16 generated)
Description was changed from ========== Clamp AudioBuffer float to int{16,32} conversion FFmpeg does not compensate for rounding errors during decode so the resulting values can be outside the [-1, 1] range expected for floating-point audio. BUG=internal b/22976943 ========== to ========== Clamp AudioBuffer float to int{16,32} conversion FFmpeg does not compensate for rounding errors during decode so the resulting values can be outside the [-1, 1] range expected for floating-point audio. BUG=internal b/22976943 ==========
zhengxiong@chromium.org changed reviewers: + dalecurtis@chromium.org
zhengxiong@chromium.org changed reviewers: + zhengxiong@chromium.org
Have you looked into the AudioBusPerfTest in media_perftests? I'm curious what the impact of these changes are. IIRC this was quite slow.
On 2016/03/31 21:26:47, DaleCurtis wrote: > Have you looked into the AudioBusPerfTest in media_perftests? I'm curious what > the impact of these changes are. IIRC this was quite slow. I found a bug in the code, where it failed to honor the different scaling factor for positive and negative numbers. After correcting for that, I obtain on Haswell/AMD64: *RESULT audio_bus_to_interleaved: int8_t= 45.049549999999996 ms *RESULT audio_bus_from_interleaved: int8_t= 13.09065 ms *RESULT audio_bus_to_interleaved: int16_t= 39.6372 ms *RESULT audio_bus_from_interleaved: int16_t= 11.3672 ms *RESULT audio_bus_to_interleaved: int32_t= 38.9896 ms *RESULT audio_bus_from_interleaved: int32_t= 20.9314 ms This result is not especially surprising, since at least two extra comparisons have been generated. In reality, getting good performance from this kind of code is not possible in the C language. The correct way to implement this, if performance is important, is through platform-specific code. For example, the throughput of float to int16, the most important case, can be done in 4 cycles for 4 floats. movps xmm0, [address] ; Test sign xorps xmm1, xmm1 cmpgt xmm1, xmm0 ; Select scale factor movps xmm2, xmm1 andps xmm2, POS_SCALE_REG movps xmm3, xmm1 andnps xmm3, NEG_SCALE_REG orps xmm2, xmm3 ; Multiply, convert, saturated reduction mulps xmm0, xmm2 cvtps2dq xmm0, xmm0 packssdw xmm0, xmm0 movq [address], xmm0 Also, the use of a different scale for positive and negative values in Chromium incurs significant cost, but it affects conversions in both directions.
On 2016/03/31 at 23:24:17, zhengxiong wrote: > On 2016/03/31 21:26:47, DaleCurtis wrote: > > Have you looked into the AudioBusPerfTest in media_perftests? I'm curious what > > the impact of these changes are. IIRC this was quite slow. > > > I found a bug in the code, where it failed to honor the different scaling factor for positive and negative numbers. After correcting for that, I obtain on Haswell/AMD64: > > *RESULT audio_bus_to_interleaved: int8_t= 45.049549999999996 ms > *RESULT audio_bus_from_interleaved: int8_t= 13.09065 ms > *RESULT audio_bus_to_interleaved: int16_t= 39.6372 ms > *RESULT audio_bus_from_interleaved: int16_t= 11.3672 ms > *RESULT audio_bus_to_interleaved: int32_t= 38.9896 ms > *RESULT audio_bus_from_interleaved: int32_t= 20.9314 ms > > This result is not especially surprising, since at least two extra comparisons have been generated. In reality, getting good performance from this kind of code is not possible in the C language. The correct way to implement this, if performance is important, is through platform-specific code. For example, the throughput of float to int16, the most important case, can be done in 4 cycles for 4 floats. > > movps xmm0, [address] > ; Test sign > xorps xmm1, xmm1 > cmpgt xmm1, xmm0 > ; Select scale factor > movps xmm2, xmm1 > andps xmm2, POS_SCALE_REG > movps xmm3, xmm1 > andnps xmm3, NEG_SCALE_REG > orps xmm2, xmm3 > ; Multiply, convert, saturated reduction > mulps xmm0, xmm2 > cvtps2dq xmm0, xmm0 > packssdw xmm0, xmm0 > movq [address], xmm0 > > Also, the use of a different scale for positive and negative values in Chromium incurs significant cost, but it affects conversions in both directions. Can you include the before values? Performance is kind of important in this code. It regularly shows up in profiling since it happens so frequently.
On 2016/03/31 23:27:57, DaleCurtis wrote: > Can you include the before values? Performance is kind of important in this > code. It regularly shows up in profiling since it happens so frequently. On 2016/03/31 23:27:57, DaleCurtis wrote: > Can you include the before values? Performance is kind of important in this > code. It regularly shows up in profiling since it happens so frequently. Before: *RESULT audio_bus_to_interleaved: int8_t= 16.732300000000002 ms *RESULT audio_bus_from_interleaved: int8_t= 13.733699999999999 ms *RESULT audio_bus_to_interleaved: int16_t= 17.10215 ms *RESULT audio_bus_from_interleaved: int16_t= 13.07505 ms *RESULT audio_bus_to_interleaved: int32_t= 14.659299999999998 ms *RESULT audio_bus_from_interleaved: int32_t= 11.9587 ms After: *RESULT audio_bus_to_interleaved: int8_t= 40.32045 ms *RESULT audio_bus_from_interleaved: int8_t= 11.84935 ms *RESULT audio_bus_to_interleaved: int16_t= 36.52435 ms *RESULT audio_bus_from_interleaved: int16_t= 11.0126 ms *RESULT audio_bus_to_interleaved: int32_t= 38.619150000000005 ms *RESULT audio_bus_from_interleaved: int32_t= 12.03115 ms I re-ran the unit test binary 3 times to avoid cache-related issues. As I noted in my previous reply, if performance is important, the generic code should be checked-in as a placeholder and platform-specific code written for actual use.
https://codereview.chromium.org/1854433002/diff/20001/media/base/audio_sample... File media/base/audio_sample_conversion.h (right): https://codereview.chromium.org/1854433002/diff/20001/media/base/audio_sample... media/base/audio_sample_conversion.h:26: // TODO(cleichner): Remove and use std::lrint(f) when C++11 library features This function is now available. https://codereview.chromium.org/1854433002/diff/20001/media/base/audio_sample... media/base/audio_sample_conversion.h:68: typedef double type; Why do we need a double? This will hurt performance a lot on ARM. The previous code is careful to do the calculations in such a way that an int32_t will never overflow. https://codereview.chromium.org/1854433002/diff/20001/media/base/audio_sample... media/base/audio_sample_conversion.h:72: struct SafeFloat : SafeFloatHelper<sizeof(Integral) >= sizeof(int32_t)> {}; We don't support anything higher than int32_t. https://codereview.chromium.org/1854433002/diff/20001/media/base/audio_sample... media/base/audio_sample_conversion.h:113: sample = std::min(std::max(sample, -1.0f), 1.0f); Does the benchmark improve if you write: if (sample < -1.0f) return IntegralMin; else if (sample > 1.0f) return IntegralMax; This is really the only fix this code adds right? What does the performance of the old path look like if you add this?
On 2016/04/01 00:24:01, DaleCurtis wrote: > https://codereview.chromium.org/1854433002/diff/20001/media/base/audio_sample... > File media/base/audio_sample_conversion.h (right): > > https://codereview.chromium.org/1854433002/diff/20001/media/base/audio_sample... > media/base/audio_sample_conversion.h:26: // TODO(cleichner): Remove and use > std::lrint(f) when C++11 library features > This function is now available. I remembered that in MSVCRT, lrint/round yield a library call, which executes a bit manipulation implementation, so perhaps we should wait on this. > > https://codereview.chromium.org/1854433002/diff/20001/media/base/audio_sample... > media/base/audio_sample_conversion.h:68: typedef double type; > Why do we need a double? This will hurt performance a lot on ARM. The previous > code is careful to do the calculations in such a way that an int32_t will never > overflow. I have modified the patch to use early termination for the int32_t case. > > https://codereview.chromium.org/1854433002/diff/20001/media/base/audio_sample... > media/base/audio_sample_conversion.h:72: struct SafeFloat : > SafeFloatHelper<sizeof(Integral) >= sizeof(int32_t)> {}; > We don't support anything higher than int32_t. > > https://codereview.chromium.org/1854433002/diff/20001/media/base/audio_sample... > media/base/audio_sample_conversion.h:113: sample = std::min(std::max(sample, > -1.0f), 1.0f); > Does the benchmark improve if you write: > > if (sample < -1.0f) > return IntegralMin; > else if (sample > 1.0f) > return IntegralMax; > > This is really the only fix this code adds right? What does the performance of > the old path look like if you add this? The main fix added over the old code is the selection of the correct scale factor between positive and negative inputs. This adds some cost. Regarding the use of early termination, this actually reduces performance compared to using min/max, which is not surprising. *RESULT audio_bus_to_interleaved: int8_t= 43.05685 ms *RESULT audio_bus_from_interleaved: int8_t= 14.7888 ms *RESULT audio_bus_to_interleaved: int16_t= 40.58085 ms *RESULT audio_bus_from_interleaved: int16_t= 14.3022 ms *RESULT audio_bus_to_interleaved: int32_t= 42.5664 ms *RESULT audio_bus_from_interleaved: int32_t= 14.0123 ms However, recognizing that the positive and negative scale factors are equal for the int32_t case can yield a considerable improvement from specialization: *RESULT audio_bus_to_interleaved: int8_t= 40.1085 ms *RESULT audio_bus_from_interleaved: int8_t= 11.8646 ms *RESULT audio_bus_to_interleaved: int16_t= 36.43725 ms *RESULT audio_bus_from_interleaved: int16_t= 10.91485 ms *RESULT audio_bus_to_interleaved: int32_t= 28.110000000000003 ms *RESULT audio_bus_from_interleaved: int32_t= 21.159100000000002 ms
On 2016/04/01 00:24:01, DaleCurtis wrote: > This is really the only fix this code adds right? What does the performance of > the old path look like if you add this? I just realized that you meant over the existing Chromium code. Yes, the main fix is the saturated handling of out-of-range numbers. There is really no efficient way to do this in C.
On 2016/04/01 01:29:15, zhengxiong1 wrote: > On 2016/04/01 00:24:01, DaleCurtis wrote: > > This is really the only fix this code adds right? What does the performance of > > the old path look like if you add this? > > I just realized that you meant over the existing Chromium code. Yes, the main > fix is the saturated handling of out-of-range numbers. There is really no > efficient way to do this in C. Actually, now that I look at it closer, audio_bus.cc already contains clamping at line 82 and 82 (ToInterleavedInternal), so I can't even tell why we created this patch. This review can be closed.
Message was sent while issue was closed.
Might be that the bug only affects the code in audio_buffer and that audio bus was always correct. So maybe just extracting the existing code into audio_sample_conversion and letting AudioBuffer reuse it would fix your internal issue?
On 2016/04/01 17:18:49, DaleCurtis wrote: > Might be that the bug only affects the code in audio_buffer and that audio bus > was always correct. So maybe just extracting the existing code into > audio_sample_conversion and letting AudioBuffer reuse it would fix your internal > issue? I looked into this again, and you are right. The issue with the failure to clamp is in audio_buffer, not audio_bus (line 189, 207). I have uploaded yet another revision of the patch that refactors the AudioBus conversions into a separate file and reuses them. However, this still introduces a behavior difference, because audio_buffer rounds conversions to int16_t, but audio_bus did not. I believe rounding accounted for much of the original performance issue: Refactored perftest: *RESULT audio_bus_to_interleaved: int8_t= 15.75325 ms *RESULT audio_bus_from_interleaved: int8_t= 16.50865 ms *RESULT audio_bus_to_interleaved: int16_t= 18.64865 ms *RESULT audio_bus_from_interleaved: int16_t= 13.833850000000002 ms *RESULT audio_bus_to_interleaved: int32_t= 18.47065 ms *RESULT audio_bus_from_interleaved: int32_t= 12.89145 ms
New numbers are fine with me. Your comment is a bit confusing though. AudioBuffer previously used nearbyint and you've now removed it. That's okay? https://codereview.chromium.org/1854433002/diff/60001/media/base/audio_buffer.cc File media/base/audio_buffer.cc (right): https://codereview.chromium.org/1854433002/diff/60001/media/base/audio_buffer... media/base/audio_buffer.cc:171: inline int32_t ConvertSample<int16_t, int32_t>(int16_t value) { Do we still need the rest of these? https://codereview.chromium.org/1854433002/diff/60001/media/base/audio_sample... File media/base/audio_sample_conversion.h (right): https://codereview.chromium.org/1854433002/diff/60001/media/base/audio_sample... media/base/audio_sample_conversion.h:1: // Copyright 2015 The Chromium Authors. All rights reserved. 2016 https://codereview.chromium.org/1854433002/diff/60001/media/base/audio_sample... media/base/audio_sample_conversion.h:48: channel_data[i] = v * (v < 0 ? -min : max); Out of curiosity if you make this "v * (v ? max : -min)" Does that change anything?
On 2016/04/01 19:31:20, DaleCurtis wrote: > New numbers are fine with me. Your comment is a bit confusing though. > AudioBuffer previously used nearbyint and you've now removed it. That's okay? Well, if we unify the code paths, one of them has to change. Before, audio_bus truncated, but audio_buffer rounded. I found after some more experimentation that I can avoid the performance impact by writing the "+ 0.5f" factor inline with the cast, which I have done in the new patch. However, this will result in a behavior change for audio_bus... > > https://codereview.chromium.org/1854433002/diff/60001/media/base/audio_buffer.cc > File media/base/audio_buffer.cc (right): > > https://codereview.chromium.org/1854433002/diff/60001/media/base/audio_buffer... > media/base/audio_buffer.cc:171: inline int32_t ConvertSample<int16_t, > int32_t>(int16_t value) { > Do we still need the rest of these? I haven't refactored Interleave/Deinterleave to handle int/int conversions, so they are still needed. > > https://codereview.chromium.org/1854433002/diff/60001/media/base/audio_sample... > File media/base/audio_sample_conversion.h (right): > > https://codereview.chromium.org/1854433002/diff/60001/media/base/audio_sample... > media/base/audio_sample_conversion.h:1: // Copyright 2015 The Chromium Authors. > All rights reserved. > 2016 Done > > https://codereview.chromium.org/1854433002/diff/60001/media/base/audio_sample... > media/base/audio_sample_conversion.h:48: channel_data[i] = v * (v < 0 ? -min : > max); > Out of curiosity if you make this "v * (v ? max : -min)" Does that change > anything? "v" only evaluates to false if it is zero, so that expression would be DCE'd to "v * -min". If the scale factor is always "-min" (2^(n-1)), the performance of the int8_t/int16_t to float case increase significantly. The int32_t case is unaffected, because -min and max are the same quantity, so it was already DCE'd. This change also simplifies the float to int case, but the overflow branch is still required in C-based code, so it does not affect performance much. *RESULT audio_bus_to_interleaved: int8_t= 17.432629000000002 ms *RESULT audio_bus_from_interleaved: int8_t= 9.5719765 ms *RESULT audio_bus_to_interleaved: int16_t= 15.0226015 ms *RESULT audio_bus_from_interleaved: int16_t= 9.3553915 ms *RESULT audio_bus_to_interleaved: int32_t= 15.088564 ms *RESULT audio_bus_from_interleaved: int32_t= 13.316855 ms
Do we need the rounding at all though? I don't recall why that was added.
On 2016/04/01 20:54:54, DaleCurtis wrote: > Do we need the rounding at all though? I don't recall why that was added. I'm inclined to answer "no". I have added it back, because it was in the original audio_buffer code. It may make a slight difference in the int8_t case, since 1 ULP is about 0.5% in 8-bit. BTW: Rounding should be free with the lrintf function, but clang still generates a function call for it, so it turns out not to be the case.
On 2016/04/01 21:06:34, zhengxiong1 wrote: > On 2016/04/01 20:54:54, DaleCurtis wrote: > > Do we need the rounding at all though? I don't recall why that was added. > > I'm inclined to answer "no". I have added it back, because it was in the > original audio_buffer code. It may make a slight difference in the int8_t case, > since 1 ULP is about 0.5% in 8-bit. > > BTW: Rounding should be free with the lrintf function, but clang still generates > a function call for it, so it turns out not to be the case. FYI: On Haswell, the fastest ToInterleaved<int16_t> can go without vectorization is with the below code. It uses only one scale factor and performs no explicit overflow checks. It has a timing of 12 ms, which is better than the original. *RESULT audio_bus_to_interleaved: int16_t= 11.905977 ms 0x00000000004504e0 <+576>: movss xmm1,DWORD PTR [rbp+rsi*4+0x0] 0x00000000004504e6 <+582>: mulss xmm1,xmm0 0x00000000004504ea <+586>: xorps xmm2,xmm2 0x00000000004504ed <+589>: movss xmm2,xmm1 0x00000000004504f1 <+593>: cvtps2dq xmm1,xmm2 0x00000000004504f5 <+597>: packssdw xmm1,xmm1 0x00000000004504f9 <+601>: movd ecx,xmm1 0x00000000004504fd <+605>: mov WORD PTR [rbx],cx 0x0000000000450500 <+608>: inc rsi 0x0000000000450503 <+611>: add rbx,rax 0x0000000000450506 <+614>: cmp esi,edx 0x0000000000450508 <+616>: jl 0x4504e0 <_ZNK5media8AudioBus20ToInterleavedPartialEiiiPv+576>
lgtm
Thanks for your investigation! Performance seems fine w/ or w/o rounding, so lg as is.
The CQ bit was checked by zhengxiong@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1854433002/80001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1854433002/80001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
The CQ bit was checked by zhengxiong@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from dalecurtis@chromium.org Link to the patchset: https://codereview.chromium.org/1854433002/#ps100001 (title: "Fix unit tests")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1854433002/100001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1854433002/100001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: mac_chromium_rel_ng on tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
The CQ bit was checked by zhengxiong@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1854433002/100001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1854433002/100001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: mac_chromium_rel_ng on tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
The CQ bit was checked by zhengxiong@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1854433002/100001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1854433002/100001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: mac_chromium_rel_ng on tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
Your failure is legit -- that code is trying to verify a waveform which is now different. You'll have to look into it.
On 2016/04/04 18:18:49, DaleCurtis wrote: > Your failure is legit -- that code is trying to verify a waveform which is now > different. You'll have to look into it. Rounding totally breaks the MediaStreamAudioTrack test, but I don't understand the test: [0404/122429:INFO:test_media_stream_audio_track.cc(463)] right: 23293 expected: -18167
The CQ bit was checked by zhengxiong@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from dalecurtis@chromium.org Link to the patchset: https://codereview.chromium.org/1854433002/#ps120001 (title: "Have separate functions for rounding and non-rounding")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1854433002/120001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1854433002/120001
Hmm, why did you separate out the rounding? I don't think it's worth having both of these. Lets just removing rounding entirely.
The CQ bit was unchecked by dalecurtis@chromium.org
On 2016/04/11 23:21:03, DaleCurtis wrote: > Hmm, why did you separate out the rounding? I don't think it's worth having both > of these. Lets just removing rounding entirely. I wanted to do this, but I found that changing it in either direction breaks a bunch of tests.
Which tests are expecting it to be rounded? Those should only be chromecast tests I'd think?
On 2016/04/11 23:46:25, DaleCurtis wrote: > Which tests are expecting it to be rounded? Those should only be chromecast > tests I'd think? I did a code audit and found that chromecast no longer uses AudioBuffer. In fact, there are no callers of the float-int code in AudioBuffer, so it can probably be removed. I will be closing this review again.
Message was sent while issue was closed.
Did PS#4 have any test failures? If not I think it'd be nice to land that so we can unify this code.
Message was sent while issue was closed.
On 2016/04/12 20:50:47, DaleCurtis wrote: > Did PS#4 have any test failures? If not I think it'd be nice to land that so we > can unify this code. PS4 broke the audio_buffer_unittest because of some rounding error. It tries to verify a round-trip conversion from int->float->int, but it is off by one due to truncation. HOWEVER, since that code is never used anywhere, I would rather suggest removing the float->int code from audio_buffer. Once that is done, there is nothing left to unify, since the conversion code is all in audio_bus.
Message was sent while issue was closed.
dalecurtis@chromium.org changed reviewers: + tguilbert@chromium.org
Message was sent while issue was closed.
Sounds good, +tguilbert to take over PS#4 and prime it for landing :) |