Created: 4 years, 8 months ago by dshwang
Modified: 4 years, 7 months ago
CC: chromium-reviews, darin-cc_chromium.org, feature-media-reviews_chromium.org, jam, mcasas+watch_chromium.org, mkwst+moarreviews-renderer_chromium.org, mlamouri+watch-content_chromium.org, posciak+watch_chromium.org
Base URL: https://chromium.googlesource.com/chromium/src.git@master
Target Ref: refs/pending/heads/master
Project: chromium
Visibility: Public.
Description
media: Implement zero-copy video playback for VP8.
The current zero-copy video playback implementation is actually "one-copy video playback".
The final VideoFrame is produced by the following pipeline:
1. The VP8 (i.e. vpx) decoder produces a VideoFrame.
2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to a hardware VideoFrame backed by a GpuMemoryBuffer.
3. CC composites the mailbox belonging to the hardware VideoFrame.
This CL gets rid of step 2. The VP8 decoder decodes each video frame directly into a
hardware VideoFrame backed by a GpuMemoryBuffer.
TODO: apply it to VP9 decoding.
Dependency: https://codereview.chromium.org/1874733002/
BUG=601788, 590358
CQ_INCLUDE_TRYBOTS=tryserver.blink:linux_blink_rel
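To give a rough sense of what the copy in step 2 costs, the sketch below estimates how many bytes that copy moves for 8-bit I420 frames. The function names are hypothetical (not Chromium APIs), the 1080p60 input is only an illustrative assumption, and stride padding is ignored.

```python
def i420_frame_bytes(width: int, height: int) -> int:
    """Bytes in one 8-bit I420 frame, ignoring any stride padding.

    I420 stores a full-resolution Y plane followed by U and V planes that
    are subsampled by two in both dimensions (rounded up for odd sizes).
    """
    chroma_width = (width + 1) // 2
    chroma_height = (height + 1) // 2
    return width * height + 2 * chroma_width * chroma_height


def copied_bytes_per_second(width: int, height: int, fps: int) -> int:
    """Payload bytes moved per second if every frame is copied once."""
    return i420_frame_bytes(width, height) * fps


if __name__ == "__main__":
    # Illustrative assumption: a 1920x1080 stream at 60 frames per second.
    per_frame = i420_frame_bytes(1920, 1080)
    per_second = copied_bytes_per_second(1920, 1080, 60)
    print(f"{per_frame} bytes per frame, {per_second} bytes per second")
```

A copy both reads and writes each byte, so the actual memory traffic would be roughly twice the payload figure computed here.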
Patch Set 1
Patch Set 2 : build fix
Patch Set 3 : optimize only VP8
Messages
Total messages: 37 (14 generated)
The CQ bit was checked by dongseong.hwang@intel.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1869303004/1 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1869303004/1
dongseong.hwang@intel.com changed reviewers: + ccameron@chromium.org, dcastagna@chromium.org, tiago.vignatti@intel.com
dcastagna, could you review the overall idea and approach? This CL makes FFmpegVideoDecoder and VpxVideoDecoder (for VP8) use GpuMemoryBufferVideoFramePool directly. To reuse it, I split the old GpuMemoryBufferVideoFramePool into GpuMemoryBufferVideoFrameCopier and GpuMemoryBufferVideoFramePool. Is the name GpuMemoryBufferVideoFrameCopier fine? I'll add more unit tests.
dongseong.hwang@intel.com changed reviewers: + dalecurtis@chromium.org - ccameron@chromium.org
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: mac_chromium_gn_rel on tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_gn_r...)
Description was changed from

==========
media: Implement zero-copy video playback for ffmpeg and vpx.

Current zero-copy video playback implementation is actually "one-copy video playback".
The final VideoFrame is produced by following pipeline.
1. Software decoder produces a VideoFrame. e.g. VpxVideoDecoder, FFmpegVideoDecoder.
2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to hardware VideoFrame backed by GpuMemoryBuffer.
3. CC composites the mailbox belonging to hardware VideoFrame.

This CL gets rid of 2nd step. Software decoder (e.g. VpxVideoDecoder, FFmpegVideoDecoder) will decode video frame directly on hardware VideoFrame backed by GpuMemoryBuffer.
This CL supports only I420 and YV12.

TODO1: apply it to VP9 decoding.
TODO2: support more YUV planes.

BUG=601788
==========

to

==========
media: Implement zero-copy video playback for ffmpeg and vpx.

Current zero-copy video playback implementation is actually "one-copy video playback".
The final VideoFrame is produced by following pipeline.
1. Software decoder produces a VideoFrame. e.g. VpxVideoDecoder, FFmpegVideoDecoder.
2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to hardware VideoFrame backed by GpuMemoryBuffer.
3. CC composites the mailbox belonging to hardware VideoFrame.

This CL gets rid of 2nd step. Software decoder (e.g. VpxVideoDecoder, FFmpegVideoDecoder) will decode video frame directly on hardware VideoFrame backed by GpuMemoryBuffer.
This CL supports only I420 and YV12.

Comparison of power consumption
- Use Pixel-2.
- Use 1080p60 HD H.264* video; https://www.youtube.com/embed/UceRgEyfSsc
- Use power_supply_info tool, which is a software tool, so not very accurate.
- Measure for 1 min, and take a sample every 1 sec.
* Use the h264ify extension to decode H.264 video.

1) software zero-copy video playback
   energy rate (W): 12.05
   stdev: 0.68
2) native zero-copy video playback
   energy rate (W): 11.19
   stdev: 0.65

It shows 7% power saving, but we need to measure it again using a power meter, because the power_supply_info software tool might not be accurate.

TODO1: apply it to VP9 decoding.
TODO2: support more YUV planes.

BUG=601788
==========
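The 7% figure follows directly from the two energy-rate readings in the description. The sketch below reproduces that arithmetic and adds a rough noise check. It assumes 60 samples per run (one per second for one minute) and treats the samples as independent, which a software power gauge may well violate, so the second number is only a sanity check and not a real significance test. All function names are hypothetical.

```python
import math


def relative_saving(baseline_watts: float, optimized_watts: float) -> float:
    """Fraction of the baseline power that the optimized run saves."""
    return (baseline_watts - optimized_watts) / baseline_watts


def difference_in_standard_errors(
    mean_a: float, stdev_a: float, mean_b: float, stdev_b: float, samples: int
) -> float:
    """How many standard errors separate two sample means.

    Assumes both runs have `samples` independent readings.
    """
    standard_error_a = stdev_a / math.sqrt(samples)
    standard_error_b = stdev_b / math.sqrt(samples)
    combined = math.sqrt(standard_error_a**2 + standard_error_b**2)
    return (mean_a - mean_b) / combined


if __name__ == "__main__":
    saving = relative_saving(12.05, 11.19)
    separation = difference_in_standard_errors(12.05, 0.68, 11.19, 0.65, 60)
    print(f"relative saving: {saving:.1%}")
    print(f"separation: {separation:.1f} standard errors (independence assumed)")
```

Under these assumptions the gap between the two means is large compared with the sampling noise, but that says nothing about systematic error in power_supply_info, which is why the description asks for a re-measurement with a power meter.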
The CQ bit was checked by dongseong.hwang@intel.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1869303004/20001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1869303004/20001
Description was changed from ========== media: Implement zero-copy video playback for ffmpeg and vpx. Current zero-copy video playback implementation is actually "one-copy video playback". The final VideoFrame is produced by following pipeline. 1. Software decoder produces a VideoFrame. e.g. VpxVideoDecoder, FFmpegVideoDecoder. 2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to hardware VideoFrame backed by GpuMemoryBuffer. 3. CC composites the mailbox belonging to hardware VideoFrame. This CL gets rid of 2nd step. Software decoder (e.g. VpxVideoDecoder, FFmpegVideoDecoder) will decode video frame directly on hardware VideoFrame backed by GpuMemoryBuffer. This CL supports only I420 and YV12. Comparison of power consumption - Use Pixel-2. - Use 1080p60 HD H.264* video; https://www.youtube.com/embed/UceRgEyfSsc - Use power_supply_info tool, which is software tool, so not very accurate. - Measure for 1 min, and take sample every 1 sec. * Use h264ify extentions to decode H.264 video. 1) software zero-copy video playback energy rate (W): 12.05 stdev: 0.68 2) native zero-copy video playback energy rate (W): 11.19 stdev: 0.65 It shows 7% power saving, but we need to measure it again using power meter, because power_supply_info software tool might be not accurate. TODO1: apply it to VP9 decoding. TODO2: support more YUV planes. BUG=601788 ========== to ========== media: Implement zero-copy video playback for ffmpeg and vpx. Current zero-copy video playback implementation is actually "one-copy video playback". The final VideoFrame is produced by following pipeline. 1. Software decoder produces a VideoFrame. e.g. VpxVideoDecoder, FFmpegVideoDecoder. 2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to hardware VideoFrame backed by GpuMemoryBuffer. 3. CC composites the mailbox belonging to hardware VideoFrame. This CL gets rid of 2nd step. Software decoder (e.g. 
VpxVideoDecoder, FFmpegVideoDecoder) will decode video frame directly on hardware VideoFrame backed by GpuMemoryBuffer. This CL supports only I420 and YV12. Comparison of power consumption - Use Pixel-2. - Use 1080p60 HD H.264* video; https://www.youtube.com/embed/UceRgEyfSsc - Use power_supply_info tool, which is software tool, so not very accurate. - Measure for 1 min, and take sample every 1 sec. * Use h264ify extentions to decode H.264 video. 1) software one-copy video playback energy rate (W): 12.05 stdev: 0.68 2) native zero-copy video playback energy rate (W): 11.19 stdev: 0.65 It shows 7% power saving, but we need to measure it again using power meter, because power_supply_info software tool might be not accurate. TODO1: apply it to VP9 decoding. TODO2: support more YUV planes. BUG=601788 ==========
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
This is really cool, but previously we decided not to do this since the GPU memory buffer might be locked for reading at bad times. It might also be slower to read back from than native memory. Which platforms did you test your numbers on?
On 2016/04/08 at 17:42:53, dalecurtis wrote:
> This is really cool but previously we decided not to do this since the gpu memory buffer might be locked for reading at bad times. It might also be slower to read back from than native memory.
>

Agree, it's really cool, but we explicitly decided to avoid going this way for different reasons:
- We want to avoid reading from GMBs after we send them for compositing. This might work on specific platforms, but it is not an assumption we want to make in general.
- The best format to use in CC is, in general, not the format that decoders decode to. libvpx decodes to I420; a packed format might be better for compositing/scanout. So, most of the time, we'll have a copy/conversion. A nice thing to do would be to move that to the GPU. We can discuss more about this in another thread, and the next point should be addressed before we could do that.
- At the moment the libvpx allocator allocates one big chunk of memory and then decides where the three planes start and what the strides are, and they can be different from the GMB's strides. That has to be changed (I have something that kinda works already) so that we can give back to the libvpx allocator a pointer and a stride for each plane.

crbug.com/590358 to keep track of the last point.

> Which platforms did you test your numbers on?
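The stride mismatch in the last bullet can be illustrated with a simplified model. The sketch below is not libvpx's actual allocator (which also adds borders and its own alignment rules). It only shows how one contiguous allocation fixes each plane's offset and stride, and why a destination with different strides forces a row-by-row copy. All names and the alignment values are assumptions made for illustration.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def contiguous_i420_layout(width: int, height: int, stride_alignment: int):
    """Return (offset, stride, rows) for Y, U, V inside one allocation."""
    y_stride = align_up(width, stride_alignment)
    uv_stride = align_up((width + 1) // 2, stride_alignment)
    uv_rows = (height + 1) // 2
    y_size = y_stride * height
    uv_size = uv_stride * uv_rows
    return [
        (0, y_stride, height),
        (y_size, uv_stride, uv_rows),
        (y_size + uv_size, uv_stride, uv_rows),
    ]


def copy_plane(src, src_offset, src_stride, dst, dst_offset, dst_stride,
               row_bytes, rows):
    """Copy one plane row by row, since the two strides may differ."""
    for row in range(rows):
        s = src_offset + row * src_stride
        d = dst_offset + row * dst_stride
        dst[d:d + row_bytes] = src[s:s + row_bytes]


if __name__ == "__main__":
    # A 6x4 frame whose decoder-side strides are padded to 8 bytes.
    layout = contiguous_i420_layout(6, 4, 8)
    print(layout)
    y_offset, y_stride, y_rows = layout[0]
    src = bytearray(range(y_stride * y_rows))
    dst = bytearray(6 * y_rows)  # destination with a tight stride of 6
    copy_plane(src, y_offset, y_stride, dst, 0, 6, 6, y_rows)
    print(list(dst[:12]))
```

If the decoder could instead be handed a pointer and stride per plane, as the bullet proposes, it could write rows at the destination's stride and this copy would not be needed.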
Description was changed from ========== media: Implement zero-copy video playback for ffmpeg and vpx. Current zero-copy video playback implementation is actually "one-copy video playback". The final VideoFrame is produced by following pipeline. 1. Software decoder produces a VideoFrame. e.g. VpxVideoDecoder, FFmpegVideoDecoder. 2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to hardware VideoFrame backed by GpuMemoryBuffer. 3. CC composites the mailbox belonging to hardware VideoFrame. This CL gets rid of 2nd step. Software decoder (e.g. VpxVideoDecoder, FFmpegVideoDecoder) will decode video frame directly on hardware VideoFrame backed by GpuMemoryBuffer. This CL supports only I420 and YV12. Comparison of power consumption - Use Pixel-2. - Use 1080p60 HD H.264* video; https://www.youtube.com/embed/UceRgEyfSsc - Use power_supply_info tool, which is software tool, so not very accurate. - Measure for 1 min, and take sample every 1 sec. * Use h264ify extentions to decode H.264 video. 1) software one-copy video playback energy rate (W): 12.05 stdev: 0.68 2) native zero-copy video playback energy rate (W): 11.19 stdev: 0.65 It shows 7% power saving, but we need to measure it again using power meter, because power_supply_info software tool might be not accurate. TODO1: apply it to VP9 decoding. TODO2: support more YUV planes. BUG=601788 ========== to ========== media: Implement zero-copy video playback for ffmpeg and vpx. Current zero-copy video playback implementation is actually "one-copy video playback". The final VideoFrame is produced by following pipeline. 1. Software decoder produces a VideoFrame. e.g. VpxVideoDecoder, FFmpegVideoDecoder. 2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to hardware VideoFrame backed by GpuMemoryBuffer. 3. CC composites the mailbox belonging to hardware VideoFrame. This CL gets rid of 2nd step. Software decoder (e.g. 
VpxVideoDecoder, FFmpegVideoDecoder) will decode video frame directly on hardware VideoFrame backed by GpuMemoryBuffer. This CL supports only I420 and YV12. Comparison of power consumption - Use Pixel-2. - Use 1080p60 HD H.264* video; https://www.youtube.com/embed/UceRgEyfSsc - Use power_supply_info tool, which is software tool, so not very accurate. - Measure for 1 min, and take sample every 1 sec. * Use h264ify extentions to decode H.264 video. 1) software one-copy video playback energy rate (W): 12.05 stdev: 0.68 2) native zero-copy video playback energy rate (W): 11.19 stdev: 0.65 It shows 7% power saving, but we need to measure it again using power meter, because power_supply_info software tool might be not accurate. TODO1: apply it to VP9 decoding. TODO2: support more YUV planes. BUG=601788, 590358 ==========
Thank you for the nice feedback!

We need more investigation for FFmpegVideoDecoder, given your concerns. However, VP8 is just a free lunch. The VP8 path already copies the vpx image to a software VideoFrame. It's better to copy the vpx image to a GMB-backed VideoFrame instead. So IMO, we can proceed without the FFmpegVideoDecoder change. Let me extract the FFmpegVideoDecoder change and submit it in another CL.

> > Which platforms did you test your numbers on?

ChromeOS. I use Pixel-2.

> > This is really cool but previously we decided not to do this since the gpu memory buffer might be locked for reading at bad times. It might also be slower to read back from than native memory.
>
> Agree, it's really cool, but we explicitly decided to avoid going this way for different reasons:
> - We want to avoid reading from GMBs after we send them for compositing. This might work on specific platforms, but it is not an assumption we want to make in general.

Who reads from GMBs after we send them for compositing, and when? I could find the logic at least in VpxVideoDecoder and FFmpegVideoDecoder.

> - The best format to use in CC is, in general, not the format that decoders decode to. libvpx decodes to I420, a packed format might be better for compositing/scanout. So, most of the times, we'll have a copy/conversion. A nice thing to do would be to move that on the GPU, we can discuss more about this in another thread and the next point should be addressed before we could do that.

You say conversion is needed for hardware overlay (i.e. scanout). Hardware overlay has only one benefit: power saving. IMO copy/conversion consumes more power than hardware overlay can save.

> - At the moment libvpx allocator allocates one big chunk of memory and then decides where the three planes start and what the strides are and they can be different from GMBs strides. That has to be changed (I have something that kinda works already) so that we can give back to libvpx allocator a pointer and a stride for each plane.

Agree.

> crbug.com/590358 to keep track of the last point.

Thanks for pointing that out. Our work might overlap a bit. Do you have a similar CL or prototype?
Description was changed from ========== media: Implement zero-copy video playback for ffmpeg and vpx. Current zero-copy video playback implementation is actually "one-copy video playback". The final VideoFrame is produced by following pipeline. 1. Software decoder produces a VideoFrame. e.g. VpxVideoDecoder, FFmpegVideoDecoder. 2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to hardware VideoFrame backed by GpuMemoryBuffer. 3. CC composites the mailbox belonging to hardware VideoFrame. This CL gets rid of 2nd step. Software decoder (e.g. VpxVideoDecoder, FFmpegVideoDecoder) will decode video frame directly on hardware VideoFrame backed by GpuMemoryBuffer. This CL supports only I420 and YV12. Comparison of power consumption - Use Pixel-2. - Use 1080p60 HD H.264* video; https://www.youtube.com/embed/UceRgEyfSsc - Use power_supply_info tool, which is software tool, so not very accurate. - Measure for 1 min, and take sample every 1 sec. * Use h264ify extentions to decode H.264 video. 1) software one-copy video playback energy rate (W): 12.05 stdev: 0.68 2) native zero-copy video playback energy rate (W): 11.19 stdev: 0.65 It shows 7% power saving, but we need to measure it again using power meter, because power_supply_info software tool might be not accurate. TODO1: apply it to VP9 decoding. TODO2: support more YUV planes. BUG=601788, 590358 ========== to ========== media: Implement zero-copy video playback for ffmpeg and vpx. Current zero-copy video playback implementation is actually "one-copy video playback". The final VideoFrame is produced by following pipeline. 1. Software decoder produces a VideoFrame. e.g. VpxVideoDecoder, FFmpegVideoDecoder. 2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to hardware VideoFrame backed by GpuMemoryBuffer. 3. CC composites the mailbox belonging to hardware VideoFrame. This CL gets rid of 2nd step. Software decoder (e.g. 
VpxVideoDecoder, FFmpegVideoDecoder) will decode video frame directly on hardware VideoFrame backed by GpuMemoryBuffer. This CL supports only I420 and YV12. Comparison of power consumption - Use Pixel-2. - Use 1080p60 HD H.264* video; https://www.youtube.com/embed/UceRgEyfSsc - Use power_supply_info tool, which is software tool, so not very accurate. - Measure for 1 min, and take sample every 1 sec. * Use h264ify extentions to decode H.264 video. 1) software one-copy video playback energy rate (W): 12.05 stdev: 0.68 2) native zero-copy video playback energy rate (W): 11.19 stdev: 0.65 It shows 7% power saving, but we need to measure it again using power meter, because power_supply_info software tool might be not accurate. TODO1: apply it to VP9 decoding. TODO2: support more YUV planes. BUG=601788, 590358 CQ_INCLUDE_TRYBOTS=tryserver.blink:linux_blink_rel ==========
Description was changed from ========== media: Implement zero-copy video playback for ffmpeg and vpx. Current zero-copy video playback implementation is actually "one-copy video playback". The final VideoFrame is produced by following pipeline. 1. Software decoder produces a VideoFrame. e.g. VpxVideoDecoder, FFmpegVideoDecoder. 2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to hardware VideoFrame backed by GpuMemoryBuffer. 3. CC composites the mailbox belonging to hardware VideoFrame. This CL gets rid of 2nd step. Software decoder (e.g. VpxVideoDecoder, FFmpegVideoDecoder) will decode video frame directly on hardware VideoFrame backed by GpuMemoryBuffer. This CL supports only I420 and YV12. Comparison of power consumption - Use Pixel-2. - Use 1080p60 HD H.264* video; https://www.youtube.com/embed/UceRgEyfSsc - Use power_supply_info tool, which is software tool, so not very accurate. - Measure for 1 min, and take sample every 1 sec. * Use h264ify extentions to decode H.264 video. 1) software one-copy video playback energy rate (W): 12.05 stdev: 0.68 2) native zero-copy video playback energy rate (W): 11.19 stdev: 0.65 It shows 7% power saving, but we need to measure it again using power meter, because power_supply_info software tool might be not accurate. TODO1: apply it to VP9 decoding. TODO2: support more YUV planes. BUG=601788, 590358 CQ_INCLUDE_TRYBOTS=tryserver.blink:linux_blink_rel ========== to ========== media: Implement zero-copy video playback for VP8. Current zero-copy video playback implementation is actually "one-copy video playback". The final VideoFrame is produced by following pipeline. 1. VP8 (i.e. vpx) decoder produces a VideoFrame. 2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to hardware VideoFrame backed by GpuMemoryBuffer. 3. CC composites the mailbox belonging to hardware VideoFrame. This CL gets rid of #2 step. VP8 decoder decodes video frame directly on hardware VideoFrame backed by GpuMemoryBuffer. 
TODO: apply it to VP9 decoding. BUG=601788, 590358 CQ_INCLUDE_TRYBOTS=tryserver.blink:linux_blink_rel ==========
The new patch set optimizes only VP8 in VpxVideoDecoder. It's just a free lunch. What do you think?
Description was changed from ========== media: Implement zero-copy video playback for VP8. Current zero-copy video playback implementation is actually "one-copy video playback". The final VideoFrame is produced by following pipeline. 1. VP8 (i.e. vpx) decoder produces a VideoFrame. 2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to hardware VideoFrame backed by GpuMemoryBuffer. 3. CC composites the mailbox belonging to hardware VideoFrame. This CL gets rid of #2 step. VP8 decoder decodes video frame directly on hardware VideoFrame backed by GpuMemoryBuffer. TODO: apply it to VP9 decoding. BUG=601788, 590358 CQ_INCLUDE_TRYBOTS=tryserver.blink:linux_blink_rel ========== to ========== media: Implement zero-copy video playback for VP8. Current zero-copy video playback implementation is actually "one-copy video playback". The final VideoFrame is produced by following pipeline. 1. VP8 (i.e. vpx) decoder produces a VideoFrame. 2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to hardware VideoFrame backed by GpuMemoryBuffer. 3. CC composites the mailbox belonging to hardware VideoFrame. This CL gets rid of #2 step. VP8 decoder decodes video frame directly on hardware VideoFrame backed by GpuMemoryBuffer. TODO: apply it to VP9 decoding. Dependency: https://codereview.chromium.org/1874733002/ BUG=601788, 590358 CQ_INCLUDE_TRYBOTS=tryserver.blink:linux_blink_rel ==========
Currently we only use VpxVideoDecoder for VP8+Alpha on every platform except Android. On Android we use it for all software fallback, but last I checked we don't have GpuMemoryBuffers available on Android. As such I wonder if it's worth adding all this for a rarely used path. On Android VP8 usage is ~5.8%. Assuming there's resolution to some of Daniele's systemic concerns, I could be convinced otherwise if we could make this work for Android; which currently doesn't have any GpuMemoryBuffer support. Though perhaps all we need for now is a basic 2D texture based GMB.
On 2016/04/11 17:33:03, DaleCurtis wrote:
> Currently we only use VpxVideoDecoder for VP8+Alpha on every platform except Android. On Android we use it for all software fallback, but last I checked we don't have GpuMemoryBuffers available on Android. As such I wonder if it's worth adding all this for a rarely used path. On Android VP8 usage is ~5.8%.

I mainly work on ChromeOS, which supports native GMBs. Currently only Broadwell (e.g. Pixel 2015) is enabled, but I will enable it on Skylake and Haswell soon. This CL doesn't add a big logic change. I think ChromeOS deserves this optimization.

> Assuming there's resolution to some of Daniele's systemic concerns, I could be convinced otherwise if we could make this work for Android; which currently doesn't have any GpuMemoryBuffer support. Though perhaps all we need for now is a basic 2D texture based GMB.

In addition, ChromeOS on Intel Core doesn't have an issue with ffmpeg reading GMBs even after the VideoFrame is issued to the compositor. That's because the SoC's last-level cache ensures coherency between the GPU and CPU without any lock.
On 2016/04/11 at 18:11:28, dongseong.hwang wrote:
> On 2016/04/11 17:33:03, DaleCurtis wrote:
> > Currently we only use VpxVideoDecoder for VP8+Alpha on every platform except Android. On Android we use it for all software fallback, but last I checked we don't have GpuMemoryBuffers available on Android. As such I wonder if it's worth adding all this for a rarely used path. On Android VP8 usage is ~5.8%.
>
> I mainly work on ChromeOS, which supports native GMBs. Now only Broadwell (e.g. Pixel 2015) is enabled, but I will enable it soon on Skylake and Haswell.
> This CL doesn't add big logic change. I think ChromeOS deserves this optimization.

I agree ChromeOS deserves optimizations, I was just pointing out that this optimization will affect ~0% of Chromebooks since FFmpegVideoDecoder is used for VP8 instead.

> > Assuming there's resolution to some of Daniele's systemic concerns, I could be convinced otherwise if we could make this work for Android; which currently doesn't have any GpuMemoryBuffer support. Though perhaps all we need for now is a basic 2D texture based GMB.
>
> In addition, ChromeOS on Intel Core doesn't have issue about ffmpeg reading GMBs even after VideoFrame is issued to compositor.
> It's because SoC last level cache makes sure coherency between GPU and CPU without any lock.

I defer to Daniele here, but if we can find a clean way to enable true zero copy for platforms w/o read-back issues while addressing Daniele's systemic concerns this approach is fine with me.
On 2016/04/11 at 18:15:33, dalecurtis wrote:
> On 2016/04/11 at 18:11:28, dongseong.hwang wrote:
> > On 2016/04/11 17:33:03, DaleCurtis wrote:
> > > Currently we only use VpxVideoDecoder for VP8+Alpha on every platform except Android. On Android we use it for all software fallback, but last I checked we don't have GpuMemoryBuffers available on Android. As such I wonder if it's worth adding all this for a rarely used path. On Android VP8 usage is ~5.8%.
> >
> > I mainly work on ChromeOS, which supports native GMBs. Now only Broadwell (e.g. Pixel 2015) is enabled, but I will enable it soon on Skylake and Haswell.
> > This CL doesn't add big logic change. I think ChromeOS deserves this optimization.
>
> I agree ChromeOS deserves optimizations, I was just pointing out that this optimization will affect ~0% of Chromebooks since FFmpegVideoDecoder is used for VP8 instead.
>
> > > Assuming there's resolution to some of Daniele's systemic concerns, I could be convinced otherwise if we could make this work for Android; which currently doesn't have any GpuMemoryBuffer support. Though perhaps all we need for now is a basic 2D texture based GMB.
> >
> > In addition, ChromeOS on Intel Core doesn't have issue about ffmpeg reading GMBs even after VideoFrame is issued to compositor.
> > It's because SoC last level cache makes sure coherency between GPU and CPU without any lock.
>
> I defer to Daniele here, but if we can find a clean way to enable true zero copy for platforms w/o read-back issues while addressing Daniele's systemic concerns this approach is fine with me.

The GpuMemoryBuffer abstraction adds the constraint of not accessing buffers from the CPU and GPU simultaneously.
I understand we could do this on Intel, but on other platforms (Mac and potentially Android with SurfaceTextures) this wouldn't work. IOSurfaces might be locked, and SurfaceTextures can't always be read back.
So, I'd discourage going down that path.
On 2016/04/11 19:33:40, Daniele Castagna wrote:
> > I agree ChromeOS deserves optimizations, I was just pointing out that this optimization will affect ~0% of Chromebooks since FFmpegVideoDecoder is used for VP8 instead.

YV12A is still decoded by VpxVideoDecoder on Linux and ChromeOS, e.g. media/test/data/bear-vp8a.webm. For convenience: http://localhost/browsertests/public/media/webm_vp8a.html

Although Android doesn't enable native GMBs, the CL still removes one redundant copy: software VideoFrame to software GMB VideoFrame.

> > I defer to Daniele here, but if we can find a clean way to enable true zero copy for platforms w/o read-back issues while addressing Daniele's systemic concerns this approach is fine with me.
>
> GpuMemoryBuffer abstraction adds the constraint of not accessing them from CPU and GPU simultaneously.
> I understand we could do this on Intel, but on other platforms (Mac and potentially Android with SurfaceTextures) this wouldn't work. IOSurfaces might be locked, and SurfaceTextures can't be always read back.
> So, I'd discourage to go down that path.

I understand your point. It's kind of abusing GMBs. However, the practical benefit is large.
Most Chromebooks on the market cannot decode VP9 on the GPU. Intel supports it after Skylake.
This tweak can let lots of Chromebooks extend YouTube watching hours.
How about using the tweak only on ChromeOS?
On 2016/04/12 at 11:56:14, dongseong.hwang wrote:
> On 2016/04/11 19:33:40, Daniele Castagna wrote:
> > > I agree ChromeOS deserves optimizations, I was just pointing out that this optimization will affect ~0% of Chromebooks since FFmpegVideoDecoder is used for VP8 instead.
>
> YV12A is still decoded by VpxVideoDecoder in linux and ChromeOS.
> e.g. media/test/data/bear-vp8a.webm for convenience, http://localhost/browsertests/public/media/webm_vp8a.html

Correct, but my point is that it has almost zero usage; so unless we can expand this to work on a larger set of issues it doesn't seem worth landing.

> Although Android doesn't enable native GMBs, the CL still reduce one redundant copy; software VideoFrame to software GMB VideoFrame.

I think this could be very valuable in this case now that we're shipping vp8, vp9 software decode on Android. Even if it's not a surface texture. Do you have the ability to run power benchmarks on Android? If so it will help your case a lot here.

> > > I defer to Daniele here, but if we can find a clean way to enable true zero copy for platforms w/o read-back issues while addressing Daniele's systemic concerns this approach is fine with me.
> >
> > GpuMemoryBuffer abstraction adds the constraint of not accessing them from CPU and GPU simultaneously.
> > I understand we could do this on Intel, but on other platforms (Mac and potentially Android with SurfaceTextures) this wouldn't work. IOSurfaces might be locked, and SurfaceTextures can't be always read back.
> > So, I'd discourage to go down that path.

What about Windows? Is there any benefit to always writing to a shared memory GMB w/ concurrent read/write access and handling the conversion to a native GMB elsewhere?

> I understand your point. It's kind of abusing GMBs. However, practical benefit is so large.
> Most of chromebooks in the market cannot decode VP9 by GPU. Intel supports it after Skylake.
> This tweak can enable lots of Chromebooks to extend Youtube watching hours.
> How about using the tweak on only ChromeOS?

We shouldn't abuse GMBs. Instead we/you should be thinking of ways to refine the API such that it supports your use case w/o abuse. What about exposing an attribute on the GpuMemoryBuffer interface that indicates if the GMB has concurrent cpu and gpu access? In cases where this is not true we can use the conversion step, while in other cases we can use the copy+transform step.

Daniele, do you see any paths forward with this work? Or do you and David think this is the wrong way to go no matter what?
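The attribute idea above could look roughly like the sketch below. It is one possible reading of the suggestion, not an existing GpuMemoryBuffer API: the `supports_concurrent_cpu_gpu_access` flag, the `BufferCaps` type, and the `DecodePath` names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class DecodePath(Enum):
    # The decoder writes its output straight into the GMB-backed frame.
    DECODE_INTO_GMB = "decode directly into the GMB"
    # The decoder writes to system memory and a pool copies/converts later.
    DECODE_THEN_COPY = "decode to system memory, then copy/convert"


@dataclass(frozen=True)
class BufferCaps:
    """Hypothetical capability record for a buffer implementation."""
    supports_concurrent_cpu_gpu_access: bool


def choose_decode_path(caps: BufferCaps) -> DecodePath:
    """Pick the zero-copy path only when the buffer advertises that the
    CPU may keep accessing it while the GPU is using it."""
    if caps.supports_concurrent_cpu_gpu_access:
        return DecodePath.DECODE_INTO_GMB
    return DecodePath.DECODE_THEN_COPY


if __name__ == "__main__":
    for name, caps in [
        ("coherent buffer", BufferCaps(True)),
        ("lock-based buffer", BufferCaps(False)),
    ]:
        print(name, "->", choose_decode_path(caps).value)
```

Keeping the decision behind a capability query means the decoder never has to know which platform it runs on, which seems to be the point of refining the API instead of special-casing ChromeOS.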
On 2016/04/12 at 11:56:14, dongseong.hwang wrote:
> On 2016/04/11 19:33:40, Daniele Castagna wrote:
> > > I agree ChromeOS deserves optimizations, I was just pointing out that this optimization will affect ~0% of Chromebooks since FFmpegVideoDecoder is used for VP8 instead.
>
> YV12A is still decoded by VpxVideoDecoder in linux and ChromeOS.
> e.g. media/test/data/bear-vp8a.webm for convenience, http://localhost/browsertests/public/media/webm_vp8a.html
>
> Although Android doesn't enable native GMBs, the CL still reduce one redundant copy; software VideoFrame to software GMB VideoFrame.
>
> > > I defer to Daniele here, but if we can find a clean way to enable true zero copy for platforms w/o read-back issues while addressing Daniele's systemic concerns this approach is fine with me.
> >
> > GpuMemoryBuffer abstraction adds the constraint of not accessing them from CPU and GPU simultaneously.
> > I understand we could do this on Intel, but on other platforms (Mac and potentially Android with SurfaceTextures) this wouldn't work. IOSurfaces might be locked, and SurfaceTextures can't be always read back.
> > So, I'd discourage to go down that path.
>
> I understand your point. It's kind of abusing GMBs. However, practical benefit is so large.
> Most of chromebooks in the market cannot decode VP9 by GPU. Intel supports it after Skylake.
> This tweak can enable lots of Chromebooks to extend Youtube watching hours.
> How about using the tweak on only ChromeOS?

Your patch doesn't affect VP9 decoding, though, which still uses a MemoryPool.
We're also discussing with Stéphane about using UYVY on CrOS. That will require a copy/conversion.

Let's land the R8 stuff first since that will benefit all the CrOS devices out there with a minimal change. We can come back and discuss more about this after we land those patches.
dongseong.hwang@intel.com changed reviewers: + reveman@chromium.org
On 2016/04/12 19:03:08, Daniele Castagna wrote:
> > I understand your point. It's kind of abusing GMBs. However, practical benefit is so large.
> > Most of chromebooks in the market cannot decode VP9 by GPU. Intel supports it after Skylake.
> > This tweak can enable lots of Chromebooks to extend Youtube watching hours.
> > How about using the tweak on only ChromeOS?
>
> Your patch doesn't affect Vp9 decoding that is still using a MemoryPool though.
> We're also discussing with Stéphane about using UYVY on CrOS. That will require a copy/conversion.

Correct. I was speculating under the assumption that this issue will be resolved: https://bugs.chromium.org/p/chromium/issues/detail?id=590358

> Let's land the R8 stuff first since that will benefit all the CrOS devices out there with a minimal change. We can come back and discuss more about this after we land those patches.

Got it.

> We shouldn't abuse GMBs. Instead we/you should be thinking of ways to refine the API such that it supports your use case w/o abuse. What about exposing an attribute on the GpuMemoryBuffer interface that indicates if the GMB has concurrent cpu and gpu access? In cases where this is not true we can use the conversion step, while in other cases we can use the copy+transform step.
> Daniele, do you see any paths forward with this work? Or do you and David think this is the wrong way to go no matter what?

That's a good idea. For example, we could add GPU_CPU_READ_CPU_READ_WRITE.
Some GMBs allow the CPU to read even after unmap. As such GMBs are not actually unmapped, the CPU can still read.
The usage would be as follows:

- GMBs map
- CPU read/write
- GMBs unmap
- GPU and CPU read

Currently GpuMemoryBufferImplOzoneNativePixmap and GpuMemoryBufferImplSharedMemory can do it.
That means Android and Windows (which use GpuMemoryBufferImplSharedMemory) and ChromeOS (which uses GpuMemoryBufferImplOzoneNativePixmap) can eliminate one redundant copy on the h.264 decoding path.

reveman, what do you think about introducing GPU_CPU_READ_CPU_READ_WRITE?

> I think this could be very valuable in this case now that we're shipping vp8, vp9 software decode on Android. Even if it's not a surface texture. Do you have the ability to run power benchmarks on Android? If so it will help your case a lot here.

I can measure it on my Nexus 5. Can you give me some tips on how to measure power consumption on Android?
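For readers following the proposal, the map / CPU write / unmap / GPU and CPU read sequence above can be sketched roughly as follows. This is only an illustrative model, not Chromium code: `SketchUsage`, `SketchBuffer`, `CanCpuRead` and `CanGpuRead` are hypothetical names, and GPU_CPU_READ_CPU_READ_WRITE exists only as a proposal in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical usage values. kGpuCpuReadCpuReadWrite models the
// GPU_CPU_READ_CPU_READ_WRITE usage proposed in this thread.
enum class SketchUsage {
  kGpuReadCpuReadWrite,
  kGpuCpuReadCpuReadWrite,  // proposed, not an existing usage
};

// Hypothetical stand-in for a GpuMemoryBuffer backed by CPU-visible memory.
class SketchBuffer {
 public:
  SketchBuffer(SketchUsage usage, std::size_t size)
      : usage_(usage), data_(size, 0) {}

  // Step 1: map for CPU read/write. The buffer must not already be mapped.
  std::uint8_t* Map() {
    assert(!mapped_);
    mapped_ = true;
    return data_.data();
  }

  // Step 3: unmap, handing the contents over to the GPU.
  void Unmap() {
    assert(mapped_);
    mapped_ = false;
  }

  // The CPU may read while mapped. After unmap it may read only when the
  // buffer was created with the proposed usage.
  bool CanCpuRead() const {
    return mapped_ || usage_ == SketchUsage::kGpuCpuReadCpuReadWrite;
  }

  // In this model the GPU reads only once CPU writes have finished.
  bool CanGpuRead() const { return !mapped_; }

 private:
  SketchUsage usage_;
  bool mapped_ = false;
  std::vector<std::uint8_t> data_;
};
```

With the proposed usage, a decoder could keep reading a reference frame after `Unmap()` while the compositor samples it; with the existing usage the sketch reports that CPU reads must stop at unmap.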
On 2016/04/13 at 13:38:21, dongseong.hwang wrote: > On 2016/04/12 19:03:08, Daniele Castagna wrote: > > > I understand your point. It's kind of abusing GMBs. However, practical benefit > > is so large. > > > Most of chromebooks in the market cannot decode VP9 by GPU. Intel supports it > > after Skylake. > > > This tweak can enable lots of Chromebooks to extend Youtube watching hours. > > > How about using the tweak on only ChromeOS? > > > > Your patch doesn't affect Vp9 decoding that is still using a MemoryPool though. > > We're also discussing with Stéphane about using UYVY on CrOS. That will require > > a copy/conversion. > > Correct. I speculated with assumption in which the issue will be resolved. https://bugs.chromium.org/p/chromium/issues/detail?id=590358 > > > Let's land the R8 stuff first since that will benefit all the CrOS devices out > > there with a minimal change. We can come back and discuss more about this after > > we land those patches. > > Got it. > > > We shouldn't abuse GMBs. Instead we/you should be thinking of ways to refine the > > API such that it supports your use case w/o abuse. What about exposing an > > attribute on the GpuMemoryBuffer interface that indicates if the GMB has > > concurrent cpu and gpu access? In cases where this is not true we can use the > > conversion step, while in other cases we can use the copy+transform step. > > Daniele, do you see any paths forward with this work? Or do you and David think > > this is the wrong way to go no matter what? > > That's good idea. For example, we can add GPU_CPU_READ_CPU_READ_WRITE > Some GMBs allow CPU to read even after unmap. As GMBs are not unmapped, CPU can read still. > The usage will be as follows > > - GBMs map > - CPU read/write > - GMBs unmap > - GPU and CPU read > > Currently GpuMemoryBufferImplOzoneNativePixmap and GpuMemoryBufferImplSharedMemory can do it. 
> It means Android, Windows (which use GpuMemoryBufferImplSharedMemory) and ChromeOS (which uses GpuMemoryBufferImplOzoneNativePixmap) can reduce one redundant copy on h.264 decoding path.
>
> reveman, what do you think we introduce GPU_CPU_READ_CPU_READ_WRITE ?

From the client's point of view, GpuMemoryBufferImplSharedMemory reading the memory on the service side is GPU_READ, so there should not be such a thing as GPU_CPU_READ_CPU_READ_WRITE, as that's the same as GPU_READ_CPU_READ_WRITE.
On 2016/04/13 14:32:50, reveman wrote:
> > That's good idea. For example, we can add GPU_CPU_READ_CPU_READ_WRITE
> > Some GMBs allow CPU to read even after unmap. As GMBs are not unmapped, CPU can read still.
> > The usage will be as follows
> >
> > - GBMs map
> > - CPU read/write
> > - GMBs unmap
> > - GPU and CPU read
> >
> > Currently GpuMemoryBufferImplOzoneNativePixmap and GpuMemoryBufferImplSharedMemory can do it.
> > It means Android, Windows (which use GpuMemoryBufferImplSharedMemory) and ChromeOS (which uses GpuMemoryBufferImplOzoneNativePixmap) can reduce one redundant copy on h.264 decoding path.
> >
> > reveman, what do you think we introduce GPU_CPU_READ_CPU_READ_WRITE ?
>
> From the clients point of view GpuMemoryBufferImplSharedMemory reading the memory on the service side is GPU_READ so there should not be such a thing as GPU_CPU_READ_CPU_READ_WRITE as that's the same as GPU_READ_CPU_READ_WRITE.

I mean that GpuMemoryBufferImplSharedMemory can support GPU_CPU_READ_CPU_READ_WRITE. While the service side does GPU_READ, the renderer can do CPU_READ, because |shared_memory_| contains the same data as the updated texture.
On 2016/04/12 at 19:03:08, dalecurtis wrote: > On 2016/04/12 at 11:56:14, dongseong.hwang wrote: > > On 2016/04/11 19:33:40, Daniele Castagna wrote: > > > > I agree ChromeOS deserves optimizations, I was just pointing out that this > > > optimization will affect ~0% of Chromebooks since FFmpegVideoDecoder is used for > > > VP8 instead. > > > > YV12A is still decoded by VpxVideoDecoder in linux and ChromeOS. > > e.g. media/test/data/bear-vp8a.webm for convenience, http://localhost/browsertests/public/media/webm_vp8a.html > > Correct, but my point is has almost zero usage; so unless we can expand this to work on a larger set of issues it doesn't seem worth landing. > > > > > Although Android doesn't enable native GMBs, the CL still reduce one redundant copy; software VideoFrame to software GMB VideoFrame. > > I think this could be very valuable in this case now that we're shipping vp8, vp9 software decode on Android. Even if it's not a surface texture. Do you have the ability to run power benchmarks on Android? If so it will help your case a lot here. > > > > > > > I defer to Daniele here, but if we can find a clean way to enable true zero > > > copy for platforms w/o read-back issues while addressing Daniele's systemic > > > concerns this approach is fine with me. > > > > > > GpuMemoryBuffer abstraction adds the constraint of not accessing them from CPU > > > and GPU simultaneously. > > > I understand we could do this on Intel, but on other platforms (Mac and > > > potentially Android with SurfaceTextures) this wouldn't work. IOSurfaces might > > > be locked, and SurfaceTextures can't be always read back. > > > So, I'd discourage to go down that path. > > What about Windows? Is there any benefit to always writing to a shared memory GMB w/ concurrent read/write access and handling the conversion to a native GMB elsewhere? > > > > > I understand your point. It's kind of abusing GMBs. However, practical benefit is so large. 
> > Most of chromebooks in the market cannot decode VP9 by GPU. Intel supports it after Skylake.
> > This tweak can enable lots of Chromebooks to extend Youtube watching hours.
> > How about using the tweak on only ChromeOS?
>
> We shouldn't abuse GMBs. Instead we/you should be thinking of ways to refine the API such that it supports your use case w/o abuse. What about exposing an attribute on the GpuMemoryBuffer interface that indicates if the GMB has concurrent cpu and gpu access? In cases where this is not true we can use the conversion step, while in other cases we can use the copy+transform step. Daniele, do you see any paths forward with this work? Or do you and David think this is the wrong way to go no matter what?

David already expressed his opinion about doing something like this. I'll add mine.

I'd rather have something that can be used on every platform without looking at the specific GMB implementation details, even if it's through a well-defined API. That way the code we land would have a much bigger impact (every platform) and hopefully would add less complexity.

In particular, after we address https://bugs.chromium.org/p/webm/issues/detail?id=1181 we can try to implement what we already talked about: VpxVideoDecoder could decode directly into GMBs, keep them locked until the decoder stops referencing them, and output the VideoFrame at that point. In a pathological situation where the decoder holds on to the VideoFrame longer than a certain threshold, we might need to copy it to another GMB. We have a UMA stat for how long the decoder references a buffer, and last time I checked it was always less than 7 frames.

This approach would work on every platform, and could also drastically improve the case where the optimal cc format is not the decoder format (and I expect this to happen almost always if we want to use overlays with vp9), since the copy/conversion from I420 to the cc format could happen on the GPU.
I'm happy to write a small doc with more details about this idea if anyone is interested.
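As a rough illustration of the idea above (hold each decoded buffer until the decoder drops its reference, and fall back to a copy when it is held past a threshold), here is a minimal sketch. It is not taken from any Chromium or libvpx API: `DeferredOutputQueue`, `OnFrameDecoded`, `OnDecoderReleased` and the frame-count threshold are all hypothetical, and real code would track buffers rather than integer ids.

```cpp
#include <cassert>
#include <deque>
#include <vector>

enum class OutputKind { kZeroCopy, kCopied };

struct Output {
  int frame_id;
  OutputKind kind;
};

// Hypothetical queue that defers frame output until the decoder releases
// its reference, or copies the frame out once it has been held too long.
class DeferredOutputQueue {
 public:
  explicit DeferredOutputQueue(int max_hold_frames)
      : max_hold_frames_(max_hold_frames) {}

  // The decoder finished writing |frame_id| but may still reference it.
  std::vector<Output> OnFrameDecoded(int frame_id) {
    for (Pending& p : pending_) ++p.age;  // one more frame decoded since
    pending_.push_back({frame_id, 0, false});
    return Flush();
  }

  // The decoder no longer references |frame_id|.
  std::vector<Output> OnDecoderReleased(int frame_id) {
    for (Pending& p : pending_) {
      if (p.frame_id == frame_id) p.released = true;
    }
    return Flush();
  }

 private:
  struct Pending {
    int frame_id;
    int age;        // frames decoded since this one
    bool released;  // decoder dropped its reference
  };

  // Emits frames in decode order: released frames go out zero-copy,
  // frames held past the threshold go out as copies.
  std::vector<Output> Flush() {
    std::vector<Output> out;
    while (!pending_.empty()) {
      const Pending& front = pending_.front();
      if (front.released) {
        out.push_back({front.frame_id, OutputKind::kZeroCopy});
      } else if (front.age > max_hold_frames_) {
        out.push_back({front.frame_id, OutputKind::kCopied});
      } else {
        break;
      }
      pending_.pop_front();
    }
    return out;
  }

  int max_hold_frames_;
  std::deque<Pending> pending_;
};
```

Given the UMA observation quoted above, a threshold somewhere around 7 frames might rarely trigger the copy path for vpx, though that value is only a guess here.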
On 2016/04/13 at 13:38:21, dongseong.hwang wrote:
> > I think this could be very valuable in this case now that we're shipping vp8, vp9 software decode on Android. Even if it's not a surface texture. Do you have the ability to run power benchmarks on Android? If so it will help your case a lot here.
>
> I can measure it on my Nexus 5. Can you give some tip how to measure power consumption in Android?

Unfortunately it's pretty complex; you generally need an external device like a Monsoon power monitor. I suspect someone at Intel has equivalent or better hardware for doing this :) jiajia.qin // intel.com has mentioned working on Android related pieces before, if you're in touch with them.

Really, all we want to see is that, when you build a ToT version of Chromium w/ your change and play a vp8 or vp9 video on a device without hardware vp8 or vp9 support, it provides a solid power improvement for playback of something like a 720p30 test clip. You can use this one if you want: http://storage.googleapis.com/dalecurtis-shared/buck720.mp4
On 2016/04/13 at 21:00:09, dcastagna wrote: > On 2016/04/12 at 19:03:08, dalecurtis wrote: > > On 2016/04/12 at 11:56:14, dongseong.hwang wrote: > > > On 2016/04/11 19:33:40, Daniele Castagna wrote: > > > > > I agree ChromeOS deserves optimizations, I was just pointing out that this > > > > optimization will affect ~0% of Chromebooks since FFmpegVideoDecoder is used for > > > > VP8 instead. > > > > > > YV12A is still decoded by VpxVideoDecoder in linux and ChromeOS. > > > e.g. media/test/data/bear-vp8a.webm for convenience, http://localhost/browsertests/public/media/webm_vp8a.html > > > > Correct, but my point is has almost zero usage; so unless we can expand this to work on a larger set of issues it doesn't seem worth landing. > > > > > > > > Although Android doesn't enable native GMBs, the CL still reduce one redundant copy; software VideoFrame to software GMB VideoFrame. > > > > I think this could be very valuable in this case now that we're shipping vp8, vp9 software decode on Android. Even if it's not a surface texture. Do you have the ability to run power benchmarks on Android? If so it will help your case a lot here. > > > > > > > > > > I defer to Daniele here, but if we can find a clean way to enable true zero > > > > copy for platforms w/o read-back issues while addressing Daniele's systemic > > > > concerns this approach is fine with me. > > > > > > > > GpuMemoryBuffer abstraction adds the constraint of not accessing them from CPU > > > > and GPU simultaneously. > > > > I understand we could do this on Intel, but on other platforms (Mac and > > > > potentially Android with SurfaceTextures) this wouldn't work. IOSurfaces might > > > > be locked, and SurfaceTextures can't be always read back. > > > > So, I'd discourage to go down that path. > > > > What about Windows? Is there any benefit to always writing to a shared memory GMB w/ concurrent read/write access and handling the conversion to a native GMB elsewhere? > > > > > > > > I understand your point. 
It's kind of abusing GMBs. However, practical benefit is so large. > > > Most of chromebooks in the market cannot decode VP9 by GPU. Intel supports it after Skylake. > > > This tweak can enable lots of Chromebooks to extend Youtube watching hours. > > > How about using the tweak on only ChromeOS? > > > > We shouldn't abuse GMBs. Instead we/you should be thinking of ways to refine the API such that it supports your use case w/o abuse. What about exposing an attribute on the GpuMemoryBuffer interface that indicates if the GMB has concurrent cpu and gpu access? In cases where this is not true we can use the conversion step, while in other cases we can use the copy+transform step. Daniele, do you see any paths forward with this work? Or do you and David think this is the wrong way to go no matter what? > > David already expressed his opinion about doing something like this. I'll add mine. > > I'd rather have something that can be used on every platform without looking at the specific GMB implementation details, even if it's trough a well defined API. In this way the code we land would have a much bigger impact (every platform) and hopefully would add less complexity. > > In particular, after we address https://bugs.chromium.org/p/webm/issues/detail?id=1181 we can try to implement what we already talked about: VpxVideoDecoder could decode directly into GMBs, keep them locked until the decoder stops referencing them, and output the videoframe at that point. In pathological situation where the decoder holds to the VideoFrame longer than a certain threshold we might need to copy it to another GMB. We have a UMA stat about how long the decoder references to a buffer and last time I checked it was always less than 7 frames. 
> > This approach would work on every platform, and could also drastically improve the case where the optimal cc format is not the decoder format (and I expect this to be happen almost always if we want to use overlays with vp9) since the copy/conversion from I420 to the cc format could happen on the GPU.
>
> I'm happy to write a small doc with more details about this idea if anyone is interested.

I think a doc on this sounds great. Solving this would help everyone.
Hi, how about resuming the review? As Daniele is extending the vpx API, this change will be useful in several places. In addition, ChromeOS will also support zero-copy video playback very soon: https://codereview.chromium.org/1869793002/

We can start by reviewing the preparation patch first: https://codereview.chromium.org/1874733002/

In parallel, I'll propose a new BufferUsage, GPU_CPU_READ_CPU_READ_WRITE, to reuse it with FFmpeg.

> We have a UMA stat about how long the decoder references to a buffer and last time I checked it was always less than 7 frames.

vpx zero-copy seems feasible on all platforms. For FFmpeg, I tried postponing the VideoFrame creation until FFmpeg releases its reference, but the playback smoothness was not acceptable. FFmpeg might hold decoder references for tens of frames. |