Issue 1869303004: media: Implement zero-copy video playback for VP8.

dshwang

The CQ bit was checked by dongseong.hwang@intel.com to run a CQ dry run

4 years, 8 months ago (2016-04-08 15:00:38 UTC) #1

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1869303004/1 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1869303004/1

4 years, 8 months ago (2016-04-08 15:00:55 UTC) #2

dshwang

dongseong.hwang@intel.com changed reviewers: + ccameron@chromium.org, dcastagna@chromium.org, tiago.vignatti@intel.com

4 years, 8 months ago (2016-04-08 15:04:45 UTC) #3

dshwang

dcastagna, could you review overall idea and approach? This CL makes FFmpegVideoDecoder and VpxVideoDecoder (for ...

4 years, 8 months ago (2016-04-08 15:04:46 UTC) #4

dshwang

Description was changed from ========== media: Implement zero-copy video playback for ffmpeg and vpx. Current ...

4 years, 8 months ago (2016-04-08 15:09:21 UTC) #5

dshwang

dongseong.hwang@intel.com changed reviewers: + dalecurtis@chromium.org - ccameron@chromium.org

4 years, 8 months ago (2016-04-08 15:09:21 UTC) #6

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 8 months ago (2016-04-08 15:14:32 UTC) #7

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: mac_chromium_gn_rel on tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_gn_rel/builds/92586)

4 years, 8 months ago (2016-04-08 15:14:33 UTC) #8

dshwang

4 years, 8 months ago (2016-04-08 15:19:59 UTC) #9

Description was changed from

==========
media: Implement zero-copy video playback for ffmpeg and vpx.

Current zero-copy video playback implementation is actually "one-copy video
playback".
The final VideoFrame is produced by following pipeline.
1. Software decoder produces a VideoFrame. e.g. VpxVideoDecoder,
FFmpegVideoDecoder.
2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to hardware
VideoFrame backed by GpuMemoryBuffer.
3. CC composites the mailbox belonging to hardware VideoFrame.

This CL gets rid of 2nd step. Software decoder (e.g. VpxVideoDecoder,
FFmpegVideoDecoder) will decode video frame directly on hardware VideoFrame
backed by GpuMemoryBuffer.

This CL supports only I420 and YV12.

TODO1: apply it to VP9 decoding.
TODO2: support more YUV planes.

BUG=601788
==========

to

==========
media: Implement zero-copy video playback for ffmpeg and vpx.

Current zero-copy video playback implementation is actually "one-copy video
playback".
The final VideoFrame is produced by following pipeline.
1. Software decoder produces a VideoFrame. e.g. VpxVideoDecoder,
FFmpegVideoDecoder.
2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to hardware
VideoFrame backed by GpuMemoryBuffer.
3. CC composites the mailbox belonging to hardware VideoFrame.

This CL gets rid of 2nd step. Software decoder (e.g. VpxVideoDecoder,
FFmpegVideoDecoder) will decode video frame directly on hardware VideoFrame
backed by GpuMemoryBuffer.

This CL supports only I420 and YV12.

Comparison of power consumption
- Use Pixel-2.
- Use 1080p60 HD H.264* video; https://www.youtube.com/embed/UceRgEyfSsc
- Use the power_supply_info tool, which is a software tool, so not very accurate.
- Measure for 1 min, taking a sample every 1 sec.
* Use the h264ify extension to decode H.264 video.

1) software zero-copy video playback
 energy rate (W): 12.05
 stdev: 0.68

2) native zero-copy video playback
 energy rate (W): 11.19
 stdev: 0.65

It shows a 7% power saving, but we need to measure it again using a power meter,
because the power_supply_info software tool might not be accurate.

TODO1: apply it to VP9 decoding.
TODO2: support more YUV planes.

BUG=601788
==========

dshwang

The CQ bit was checked by dongseong.hwang@intel.com to run a CQ dry run

4 years, 8 months ago (2016-04-08 15:27:29 UTC) #10

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1869303004/20001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1869303004/20001

4 years, 8 months ago (2016-04-08 15:28:03 UTC) #11

dshwang

4 years, 8 months ago (2016-04-08 15:34:54 UTC) #12

Description was changed from the version quoted in full above to

==========
media: Implement zero-copy video playback for ffmpeg and vpx.

Current zero-copy video playback implementation is actually "one-copy video
playback".
The final VideoFrame is produced by following pipeline.
1. Software decoder produces a VideoFrame. e.g. VpxVideoDecoder,
FFmpegVideoDecoder.
2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to hardware
VideoFrame backed by GpuMemoryBuffer.
3. CC composites the mailbox belonging to hardware VideoFrame.

This CL gets rid of 2nd step. Software decoder (e.g. VpxVideoDecoder,
FFmpegVideoDecoder) will decode video frame directly on hardware VideoFrame
backed by GpuMemoryBuffer.

This CL supports only I420 and YV12.

Comparison of power consumption
- Use Pixel-2.
- Use 1080p60 HD H.264* video; https://www.youtube.com/embed/UceRgEyfSsc
- Use the power_supply_info tool, which is a software tool, so not very accurate.
- Measure for 1 min, taking a sample every 1 sec.
* Use the h264ify extension to decode H.264 video.

1) software one-copy video playback
 energy rate (W): 12.05
 stdev: 0.68

2) native zero-copy video playback
 energy rate (W): 11.19
 stdev: 0.65

It shows a 7% power saving, but we need to measure it again using a power meter,
because the power_supply_info software tool might not be accurate.

TODO1: apply it to VP9 decoding.
TODO2: support more YUV planes.

BUG=601788
==========
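
As a quick check of the 7% figure in the description above, the relative saving can be recomputed from the two quoted means. The sketch below is illustrative only: the struct and function names are made up for this note, and the standard-error estimate assumes 60 independent samples per run (1 min at one sample per second), which the description does not state explicitly.

```cpp
#include <cassert>
#include <cmath>

// Mean and standard deviation of one measurement run, in watts.
struct PowerSample {
  double mean_watts;
  double stdev_watts;
};

// Saving of `candidate` relative to `baseline`, as a fraction of baseline.
double RelativeSaving(const PowerSample& baseline,
                      const PowerSample& candidate) {
  assert(baseline.mean_watts > 0.0);
  return (baseline.mean_watts - candidate.mean_watts) / baseline.mean_watts;
}

// Standard error of the difference of the two means.
// ASSUMPTION: each run has `n` independent samples.
double StdErrorOfDifference(const PowerSample& a, const PowerSample& b,
                            int n) {
  assert(n > 0);
  return std::sqrt((a.stdev_watts * a.stdev_watts +
                    b.stdev_watts * b.stdev_watts) / n);
}
```

With the quoted figures (12.05 W and 11.19 W) the relative saving comes out at about 0.071, which matches the 7% in the description. Under the independence assumption the standard error of the difference is roughly 0.12 W against a difference of 0.86 W, but that does not address the accuracy concern about power_supply_info itself.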

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 8 months ago (2016-04-08 16:36:31 UTC) #13

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_ozone_rel_ng/builds/151763)

4 years, 8 months ago (2016-04-08 16:36:32 UTC) #14

DaleCurtis

This is really cool but previously we decided not to do this since the gpu ...

4 years, 8 months ago (2016-04-08 17:42:53 UTC) #15

Daniele Castagna

On 2016/04/08 at 17:42:53, dalecurtis wrote: > This is really cool but previously we decided ...

4 years, 8 months ago (2016-04-08 19:13:59 UTC) #16

dshwang

4 years, 8 months ago (2016-04-11 08:58:45 UTC) #17

Description was changed from the version quoted in full above to

==========
media: Implement zero-copy video playback for ffmpeg and vpx.

Current zero-copy video playback implementation is actually "one-copy video
playback".
The final VideoFrame is produced by following pipeline.
1. Software decoder produces a VideoFrame. e.g. VpxVideoDecoder,
FFmpegVideoDecoder.
2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to hardware
VideoFrame backed by GpuMemoryBuffer.
3. CC composites the mailbox belonging to hardware VideoFrame.

This CL gets rid of 2nd step. Software decoder (e.g. VpxVideoDecoder,
FFmpegVideoDecoder) will decode video frame directly on hardware VideoFrame
backed by GpuMemoryBuffer.

This CL supports only I420 and YV12.

Comparison of power consumption
- Use Pixel-2.
- Use 1080p60 HD H.264* video; https://www.youtube.com/embed/UceRgEyfSsc
- Use the power_supply_info tool, which is a software tool, so not very accurate.
- Measure for 1 min, taking a sample every 1 sec.
* Use the h264ify extension to decode H.264 video.

1) software one-copy video playback
 energy rate (W): 12.05
 stdev: 0.68

2) native zero-copy video playback
 energy rate (W): 11.19
 stdev: 0.65

It shows a 7% power saving, but we need to measure it again using a power meter,
because the power_supply_info software tool might not be accurate.

TODO1: apply it to VP9 decoding.
TODO2: support more YUV planes.

BUG=601788, 590358
==========

dshwang

4 years, 8 months ago (2016-04-11 09:12:12 UTC) #18

Thank you for the nice feedback!

FFmpegVideoDecoder needs more investigation, as you point out.
However, VP8 is just a free lunch. The VP8 path already copies the vpx image to a
software VideoFrame, so it's better to copy the vpx image to a GMB-backed
VideoFrame instead.

So IMO we can proceed without the FFmpegVideoDecoder change. Let me extract the
FFmpegVideoDecoder change and submit it as another CL.

> > Which platforms did you test your numbers on?

ChromeOS. I use Pixel-2.

> > This is really cool but previously we decided not to do this since the gpu
> > memory buffer might be locked for reading at bad times. It might also be slower
> > to read back from than native memory.

> Agree, it's really cool, but we explicitly decided to avoid going this way for
> different reasons:
> - We want to avoid reading from GMBs after we send them for compositing. This
> might work on specific platforms, but it is not an assumption we want to make in
> general.

Who reads from GMBs after we send them for compositing, and when?
I could find the logic at least in VpxVideoDecoder and FFmpegVideoDecoder.

> - The best format to use in CC is, in general, not the format that decoders
> decode to. libvpx decodes to I420, a packed format might be better for
> compositing/scanout. So, most of the times, we'll have a copy/conversion. A nice
> thing to do would be to move that on the GPU, we can discuss more about this in
> another thread and the next point should be addressed before we could do that.

You say conversion is needed for hardware overlay (i.e. scanout). Hardware
overlay has only one benefit: power saving.
IMO copy/conversion consumes more power than what hardware overlay can save.

> - At the moment libvpx allocator allocates one big chunk of memory and then
> decides where the three planes start and what the strides are and they can be
> different from GMBs strides. That has to be changed (I have something that kinda
> works already) so that we can give back to libvpx allocator a pointer and a
> stride for each plane.

Agree

> crbug.com/590358 to keep track of the last point.

Thanks for pointing that out. Our work might overlap a bit. Do you have a similar
CL or prototype?
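
To illustrate the stride point quoted above: when the decoder's plane layout and the destination buffer's layout use different strides, a plane has to be copied row by row rather than as one block. This is a minimal sketch assuming 8-bit planes; `CopyPlaneRows` is a hypothetical helper written for this note, not a Chromium or libvpx function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies one image plane row by row. Source and destination may use
// different strides (bytes between the starts of consecutive rows), so
// only the visible `row_bytes` of each row are copied.
void CopyPlaneRows(const uint8_t* src, size_t src_stride,
                   uint8_t* dst, size_t dst_stride,
                   size_t row_bytes, size_t rows) {
  assert(row_bytes <= src_stride && row_bytes <= dst_stride);
  for (size_t y = 0; y < rows; ++y) {
    std::memcpy(dst + y * dst_stride, src + y * src_stride, row_bytes);
  }
}
```

For an I420 frame this would be called three times (Y, U, V), each with its own pointer and stride pair. If the decoder could instead be handed a pointer and stride per plane, as the quoted point proposes, this copy would not be needed.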

dshwang

4 years, 8 months ago (2016-04-11 11:30:05 UTC) #19

Description was changed from the version quoted in full above to

==========
media: Implement zero-copy video playback for ffmpeg and vpx.

Current zero-copy video playback implementation is actually "one-copy video
playback".
The final VideoFrame is produced by following pipeline.
1. Software decoder produces a VideoFrame. e.g. VpxVideoDecoder,
FFmpegVideoDecoder.
2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to hardware
VideoFrame backed by GpuMemoryBuffer.
3. CC composites the mailbox belonging to hardware VideoFrame.

This CL gets rid of 2nd step. Software decoder (e.g. VpxVideoDecoder,
FFmpegVideoDecoder) will decode video frame directly on hardware VideoFrame
backed by GpuMemoryBuffer.

This CL supports only I420 and YV12.

Comparison of power consumption
- Use Pixel-2.
- Use 1080p60 HD H.264* video; https://www.youtube.com/embed/UceRgEyfSsc
- Use the power_supply_info tool, which is a software tool, so not very accurate.
- Measure for 1 min, taking a sample every 1 sec.
* Use the h264ify extension to decode H.264 video.

1) software one-copy video playback
 energy rate (W): 12.05
 stdev: 0.68

2) native zero-copy video playback
 energy rate (W): 11.19
 stdev: 0.65

It shows a 7% power saving, but we need to measure it again using a power meter,
because the power_supply_info software tool might not be accurate.

TODO1: apply it to VP9 decoding.
TODO2: support more YUV planes.

BUG=601788, 590358
CQ_INCLUDE_TRYBOTS=tryserver.blink:linux_blink_rel
==========

dshwang

4 years, 8 months ago (2016-04-11 11:31:04 UTC) #20

Description was changed from the version quoted in full above to

==========
media: Implement zero-copy video playback for VP8.

Current zero-copy video playback implementation is actually "one-copy video
playback".
The final VideoFrame is produced by following pipeline.
1. VP8 (i.e. vpx) decoder produces a VideoFrame.
2. GpuMemoryBufferVideoFramePool copies the software VideoFrame to hardware
VideoFrame backed by GpuMemoryBuffer.
3. CC composites the mailbox belonging to hardware VideoFrame.

This CL gets rid of #2 step. VP8 decoder decodes video frame directly on
hardware VideoFrame backed by GpuMemoryBuffer.

TODO: apply it to VP9 decoding.

BUG=601788, 590358
CQ_INCLUDE_TRYBOTS=tryserver.blink:linux_blink_rel
==========

dshwang

New patch set optimizes only VP8 in VpxVideoDecoder. It's just free lunch. What do you ...

4 years, 8 months ago (2016-04-11 11:32:02 UTC) #21

dshwang

Description was changed from ========== media: Implement zero-copy video playback for VP8. Current zero-copy video ...

4 years, 8 months ago (2016-04-11 11:38:25 UTC) #22

DaleCurtis

Currently we only use VpxVideoDecoder for VP8+Alpha on every platform except Android. On Android we ...

4 years, 8 months ago (2016-04-11 17:33:03 UTC) #23

dshwang

On 2016/04/11 17:33:03, DaleCurtis wrote: > Currently we only use VpxVideoDecoder for VP8+Alpha on every ...

4 years, 8 months ago (2016-04-11 18:11:28 UTC) #24

DaleCurtis

On 2016/04/11 at 18:11:28, dongseong.hwang wrote: > On 2016/04/11 17:33:03, DaleCurtis wrote: > > Currently ...

4 years, 8 months ago (2016-04-11 18:15:33 UTC) #25

Daniele Castagna

4 years, 8 months ago (2016-04-11 19:33:40 UTC) #26

On 2016/04/11 at 18:15:33, dalecurtis wrote:
> On 2016/04/11 at 18:11:28, dongseong.hwang wrote:
> > On 2016/04/11 17:33:03, DaleCurtis wrote:
> > > Currently we only use VpxVideoDecoder for VP8+Alpha on every platform except
> > > Android. On Android we use it for all software fallback, but last I checked we
> > > don't have GpuMemoryBuffers available on Android. As such I wonder if it's worth
> > > adding all this for a rarely used path. On Android VP8 usage is ~5.8%.
> >
> > I mainly work on ChromeOS, which supports native GMBs. Now only Broadwell
> > (e.g. Pixel 2015) is enabled, but I will enable it soon on Skylake and Haswell.
> > This CL doesn't add big logic change. I think ChromeOS deserves this
> > optimization.
>
> I agree ChromeOS deserves optimizations, I was just pointing out that this
> optimization will affect ~0% of Chromebooks since FFmpegVideoDecoder is used for
> VP8 instead.
>
> >
> > > Assuming there's resolution to some of Daniele's systemic concerns, I could be
> > > convinced otherwise if we could make this work for Android; which currently
> > > doesn't have any GpuMemoryBuffer support. Though perhaps all we need for now is
> > > a basic 2D texture based GMB.
> >
> > In addition, ChromeOS on Intel Core doesn't have issue about ffmpeg reading
> > GMBs even after VideoFrame is issued to compositor.
> > It's because SoC last level cache makes sure coherency between GPU and CPU
> > without any lock.
>
> I defer to Daniele here, but if we can find a clean way to enable true zero
> copy for platforms w/o read-back issues while addressing Daniele's systemic
> concerns this approach is fine with me.

The GpuMemoryBuffer abstraction adds the constraint that buffers must not be
accessed from the CPU and the GPU simultaneously.
I understand we could do this on Intel, but on other platforms (Mac and
potentially Android with SurfaceTextures) this wouldn't work. IOSurfaces might
be locked, and SurfaceTextures can't always be read back.
So I'd discourage going down that path.

dshwang

On 2016/04/11 19:33:40, Daniele Castagna wrote: > > I agree ChromeOS deserves optimizations, I was ...

4 years, 8 months ago (2016-04-12 11:56:14 UTC) #27

DaleCurtis

4 years, 8 months ago (2016-04-12 19:03:08 UTC) #28

On 2016/04/12 at 11:56:14, dongseong.hwang wrote:
> On 2016/04/11 19:33:40, Daniele Castagna wrote:
> > > I agree ChromeOS deserves optimizations, I was just pointing out that this
> > > optimization will affect ~0% of Chromebooks since FFmpegVideoDecoder is used
> > > for VP8 instead.
>
> YV12A is still decoded by VpxVideoDecoder in linux and ChromeOS.
> e.g. media/test/data/bear-vp8a.webm for convenience,
> http://localhost/browsertests/public/media/webm_vp8a.html

Correct, but my point is that it has almost zero usage; so unless we can expand
this to work on a larger set of issues it doesn't seem worth landing.

>
> Although Android doesn't enable native GMBs, the CL still reduce one redundant
> copy; software VideoFrame to software GMB VideoFrame.

I think this could be very valuable in this case now that we're shipping vp8,
vp9 software decode on Android. Even if it's not a surface texture. Do you have
the ability to run power benchmarks on Android? If so it will help your case a
lot here.

>
> > > I defer to Daniele here, but if we can find a clean way to enable true zero
> > > copy for platforms w/o read-back issues while addressing Daniele's systemic
> > > concerns this approach is fine with me.
> >
> > GpuMemoryBuffer abstraction adds the constraint of not accessing them from CPU
> > and GPU simultaneously.
> > I understand we could do this on Intel, but on other platforms (Mac and
> > potentially Android with SurfaceTextures) this wouldn't work. IOSurfaces might
> > be locked, and SurfaceTextures can't be always read back.
> > So, I'd discourage to go down that path.

What about Windows? Is there any benefit to always writing to a shared memory
GMB w/ concurrent read/write access and handling the conversion to a native GMB
elsewhere?

>
> I understand your point. It's kind of abusing GMBs. However, practical benefit
> is so large.
> Most of chromebooks in the market cannot decode VP9 by GPU. Intel supports it
> after Skylake.
> This tweak can enable lots of Chromebooks to extend Youtube watching hours.
> How about using the tweak on only ChromeOS?

We shouldn't abuse GMBs. Instead we/you should be thinking of ways to refine the
API such that it supports your use case w/o abuse. What about exposing an
attribute on the GpuMemoryBuffer interface that indicates if the GMB has
concurrent cpu and gpu access? In cases where this is not true we can use the
conversion step, while in other cases we can use the copy+transform step.
Daniele, do you see any paths forward with this work? Or do you and David think
this is the wrong way to go no matter what?
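
One possible reading of the attribute suggested above, as a sketch: the buffer reports whether concurrent CPU and GPU access is safe, and the caller picks a path from that. `BufferCapabilities`, `SupportsConcurrentCpuGpuAccess` and `DecodePath` are hypothetical names and not part of the real GpuMemoryBuffer interface. Mapping the two branches onto "decode directly" and "decode then copy" is also an assumption made for illustration.

```cpp
#include <cassert>

// Hypothetical capability query; the real GpuMemoryBuffer interface has no
// such method in this thread.
class BufferCapabilities {
 public:
  explicit BufferCapabilities(bool concurrent_cpu_gpu_access)
      : concurrent_cpu_gpu_access_(concurrent_cpu_gpu_access) {}

  bool SupportsConcurrentCpuGpuAccess() const {
    return concurrent_cpu_gpu_access_;
  }

 private:
  bool concurrent_cpu_gpu_access_;
};

// The two paths a software decoder could take (names are illustrative).
enum class DecodePath {
  kDecodeDirectlyIntoBuffer,  // true zero-copy
  kDecodeThenCopy,            // decode to system memory, then copy/convert
};

// Picks a path from the buffer's advertised capability, so callers never
// inspect the concrete buffer implementation.
DecodePath ChooseDecodePath(const BufferCapabilities& caps) {
  return caps.SupportsConcurrentCpuGpuAccess()
             ? DecodePath::kDecodeDirectlyIntoBuffer
             : DecodePath::kDecodeThenCopy;
}
```

The point of the design is that the platform-specific knowledge stays behind one query instead of leaking into the decoder.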

Daniele Castagna

On 2016/04/12 at 11:56:14, dongseong.hwang wrote: > On 2016/04/11 19:33:40, Daniele Castagna wrote: > > ...

4 years, 8 months ago (2016-04-12 19:03:08 UTC) #29

dshwang

dongseong.hwang@intel.com changed reviewers: + reveman@chromium.org

4 years, 8 months ago (2016-04-13 13:36:26 UTC) #30

dshwang

4 years, 8 months ago (2016-04-13 13:38:21 UTC) #31

On 2016/04/12 19:03:08, Daniele Castagna wrote:
> > I understand your point. It's kind of abusing GMBs. However, practical benefit
> > is so large.
> > Most of chromebooks in the market cannot decode VP9 by GPU. Intel supports it
> > after Skylake.
> > This tweak can enable lots of Chromebooks to extend Youtube watching hours.
> > How about using the tweak on only ChromeOS?
>
> Your patch doesn't affect Vp9 decoding that is still using a MemoryPool though.
> We're also discussing with Stéphane about using UYVY on CrOS. That will require
> a copy/conversion.

Correct. I was speculating on the assumption that this issue will be resolved:
https://bugs.chromium.org/p/chromium/issues/detail?id=590358

> Let's land the R8 stuff first since that will benefit all the CrOS devices out
> there with a minimal change. We can come back and discuss more about this after
> we land those patches.

Got it.

> We shouldn't abuse GMBs. Instead we/you should be thinking of ways to refine the
> API such that it supports your use case w/o abuse. What about exposing an
> attribute on the GpuMemoryBuffer interface that indicates if the GMB has
> concurrent cpu and gpu access? In cases where this is not true we can use the
> conversion step, while in other cases we can use the copy+transform step.
> Daniele, do you see any paths forward with this work? Or do you and David think
> this is the wrong way to go no matter what?

That's a good idea. For example, we could add GPU_CPU_READ_CPU_READ_WRITE.
Some GMBs allow the CPU to read even after unmap. As those GMBs are not really
unmapped, the CPU can still read.
The usage would be as follows:

- GMBs map
- CPU read/write
- GMBs unmap
- GPU and CPU read

Currently GpuMemoryBufferImplOzoneNativePixmap and
GpuMemoryBufferImplSharedMemory can do it.
It means Android, Windows (which use GpuMemoryBufferImplSharedMemory) and
ChromeOS (which uses GpuMemoryBufferImplOzoneNativePixmap) can eliminate one
redundant copy on the H.264 decoding path.

reveman, what do you think about introducing GPU_CPU_READ_CPU_READ_WRITE?

> I think this could be very valuable in this case now that we're shipping vp8,
> vp9 software decode on Android. Even if it's not a surface texture. Do you have
> the ability to run power benchmarks on Android? If so it will help your case a
> lot here.

I can measure it on my Nexus 5. Can you give me some tips on how to measure power
consumption on Android?
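
The map / CPU read-write / unmap / GPU and CPU read sequence proposed in this message can be modeled as a small set of access checks. This is only a sketch of the proposal as worded here: `SketchBuffer` and the `Usage` enum are hypothetical. Only the names GPU_READ_CPU_READ_WRITE and GPU_CPU_READ_CPU_READ_WRITE come from the thread, and the latter is a suggested addition rather than an existing mode.

```cpp
#include <cassert>

// Hypothetical usage modes mirroring the two names used in the thread.
enum class Usage {
  kGpuReadCpuReadWrite,     // existing GPU_READ_CPU_READ_WRITE
  kGpuCpuReadCpuReadWrite,  // proposed GPU_CPU_READ_CPU_READ_WRITE
};

// Models which accesses are legal at each point of the map/unmap cycle.
class SketchBuffer {
 public:
  explicit SketchBuffer(Usage usage) : usage_(usage) {}

  void Map() {
    assert(!mapped_);
    mapped_ = true;
  }
  void Unmap() {
    assert(mapped_);
    mapped_ = false;
  }

  // CPU writes are only legal while mapped, under either usage.
  bool CanCpuWrite() const { return mapped_; }

  // CPU reads are legal while mapped. After Unmap() only the proposed
  // usage keeps them legal.
  bool CanCpuRead() const {
    return mapped_ || usage_ == Usage::kGpuCpuReadCpuReadWrite;
  }

  // The GPU may read once the CPU has finished writing.
  bool CanGpuRead() const { return !mapped_; }

 private:
  Usage usage_;
  bool mapped_ = false;
};
```

Under the proposed usage, a buffer that has been mapped, written and unmapped reports both `CanGpuRead()` and `CanCpuRead()` as true, which is the last step of the sequence above.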

reveman

4 years, 8 months ago (2016-04-13 14:32:50 UTC) #32

On 2016/04/13 at 13:38:21, dongseong.hwang wrote:
> On 2016/04/12 19:03:08, Daniele Castagna wrote:
> > > I understand your point. It's kind of abusing GMBs. However, practical benefit
> > > is so large.
> > > Most of chromebooks in the market cannot decode VP9 by GPU. Intel supports it
> > > after Skylake.
> > > This tweak can enable lots of Chromebooks to extend Youtube watching hours.
> > > How about using the tweak on only ChromeOS?
> >
> > Your patch doesn't affect Vp9 decoding that is still using a MemoryPool though.
> > We're also discussing with Stéphane about using UYVY on CrOS. That will require
> > a copy/conversion.
>
> Correct. I speculated with assumption in which the issue will be resolved.
> https://bugs.chromium.org/p/chromium/issues/detail?id=590358
>
> > Let's land the R8 stuff first since that will benefit all the CrOS devices out
> > there with a minimal change. We can come back and discuss more about this after
> > we land those patches.
>
> Got it.
>
> > We shouldn't abuse GMBs. Instead we/you should be thinking of ways to refine the
> > API such that it supports your use case w/o abuse. What about exposing an
> > attribute on the GpuMemoryBuffer interface that indicates if the GMB has
> > concurrent cpu and gpu access? In cases where this is not true we can use the
> > conversion step, while in other cases we can use the copy+transform step.
> > Daniele, do you see any paths forward with this work? Or do you and David think
> > this is the wrong way to go no matter what?
>
> That's good idea. For example, we can add GPU_CPU_READ_CPU_READ_WRITE
> Some GMBs allow CPU to read even after unmap. As GMBs are not unmapped, CPU
> can read still.
> The usage will be as follows
>
> - GBMs map
> - CPU read/write
> - GMBs unmap
> - GPU and CPU read
>
> Currently GpuMemoryBufferImplOzoneNativePixmap and
> GpuMemoryBufferImplSharedMemory can do it.
> It means Android, Windows (which use GpuMemoryBufferImplSharedMemory) and
> ChromeOS (which uses GpuMemoryBufferImplOzoneNativePixmap) can reduce one
> redundant copy on h.264 decoding path.
>
> reveman, what do you think we introduce GPU_CPU_READ_CPU_READ_WRITE ?

From the client's point of view, GpuMemoryBufferImplSharedMemory reading the
memory on the service side is GPU_READ, so there should not be such a thing as
GPU_CPU_READ_CPU_READ_WRITE: that's the same as GPU_READ_CPU_READ_WRITE.

dshwang

On 2016/04/13 14:32:50, reveman wrote: > > That's good idea. For example, we can add ...

4 years, 8 months ago (2016-04-13 15:29:01 UTC) #33

Daniele Castagna

4 years, 8 months ago (2016-04-13 21:00:09 UTC) #34

On 2016/04/12 at 19:03:08, dalecurtis wrote:
> On 2016/04/12 at 11:56:14, dongseong.hwang wrote:
> > On 2016/04/11 19:33:40, Daniele Castagna wrote:
> > > > I agree ChromeOS deserves optimizations, I was just pointing out that this
> > > > optimization will affect ~0% of Chromebooks since FFmpegVideoDecoder is
> > > > used for VP8 instead.
> >
> > YV12A is still decoded by VpxVideoDecoder in linux and ChromeOS.
> > e.g. media/test/data/bear-vp8a.webm for convenience,
> > http://localhost/browsertests/public/media/webm_vp8a.html
>
> Correct, but my point is has almost zero usage; so unless we can expand this
> to work on a larger set of issues it doesn't seem worth landing.
>
> >
> > Although Android doesn't enable native GMBs, the CL still reduce one
> > redundant copy; software VideoFrame to software GMB VideoFrame.
>
> I think this could be very valuable in this case now that we're shipping vp8,
> vp9 software decode on Android. Even if it's not a surface texture. Do you have
> the ability to run power benchmarks on Android? If so it will help your case a
> lot here.
>
> >
> > > > I defer to Daniele here, but if we can find a clean way to enable true zero
> > > > copy for platforms w/o read-back issues while addressing Daniele's systemic
> > > > concerns this approach is fine with me.
> > >
> > > GpuMemoryBuffer abstraction adds the constraint of not accessing them from CPU
> > > and GPU simultaneously.
> > > I understand we could do this on Intel, but on other platforms (Mac and
> > > potentially Android with SurfaceTextures) this wouldn't work. IOSurfaces might
> > > be locked, and SurfaceTextures can't be always read back.
> > > So, I'd discourage to go down that path.
>
> What about Windows? Is there any benefit to always writing to a shared memory
> GMB w/ concurrent read/write access and handling the conversion to a native GMB
> elsewhere?
>
> >
> > I understand your point. It's kind of abusing GMBs. However, practical
> > benefit is so large.
> > Most of chromebooks in the market cannot decode VP9 by GPU. Intel supports
> > it after Skylake.
> > This tweak can enable lots of Chromebooks to extend Youtube watching hours.
> > How about using the tweak on only ChromeOS?
>
> We shouldn't abuse GMBs. Instead we/you should be thinking of ways to refine
> the API such that it supports your use case w/o abuse. What about exposing an
> attribute on the GpuMemoryBuffer interface that indicates if the GMB has
> concurrent cpu and gpu access? In cases where this is not true we can use the
> conversion step, while in other cases we can use the copy+transform step.
> Daniele, do you see any paths forward with this work? Or do you and David think
> this is the wrong way to go no matter what?

David already expressed his opinion about doing something like this. I'll add
mine.

I'd rather have something that can be used on every platform without looking at
the specific GMB implementation details, even if it's through a well-defined
API. That way the code we land would have a much bigger impact (every platform)
and hopefully would add less complexity.

In particular, after we address
https://bugs.chromium.org/p/webm/issues/detail?id=1181 we can try to implement
what we already talked about: VpxVideoDecoder could decode directly into GMBs,
keep them locked until the decoder stops referencing them, and output the
VideoFrame at that point. In a pathological situation where the decoder holds on
to the VideoFrame longer than a certain threshold, we might need to copy it to
another GMB. We have a UMA stat about how long the decoder references a buffer,
and last time I checked it was always less than 7 frames.
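
A minimal sketch of the deferred-output idea above, to make the bookkeeping
concrete. Everything in it is an assumption for illustration: FakeBuffer,
PendingFrame, DeferredOutputQueue and max_age are hypothetical names, not
Chromium or libvpx APIs, and a plain byte vector stands in for a mapped GMB.
It only shows the ordering logic: hold a frame while the decoder references
its buffer, and fall back to a copy once the frame is older than a threshold.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical stand-in for a CPU-mapped GpuMemoryBuffer.
struct FakeBuffer {
  std::vector<uint8_t> pixels;
  bool locked_for_cpu = false;  // true while the decoder may still touch it
};

// A decoded frame that has not been handed to the compositor yet.
struct PendingFrame {
  int id;
  FakeBuffer* buffer;
  int age_in_frames;       // number of frames decoded after this one
  bool decoder_holds_ref;  // decoder still uses the buffer as a reference
};

struct OutputFrame {
  int id;
  bool was_copied;  // true if the fallback copy path was taken
};

class DeferredOutputQueue {
 public:
  explicit DeferredOutputQueue(int max_age) : max_age_(max_age) {}

  // The decoder finished writing frame `id` into `buffer`.
  void OnFrameDecoded(int id, FakeBuffer* buffer) {
    assert(buffer != nullptr);
    for (PendingFrame& f : pending_) ++f.age_in_frames;
    buffer->locked_for_cpu = true;
    pending_.push_back(PendingFrame{id, buffer, 0, true});
  }

  // The decoder no longer references the buffer of frame `id`.
  void OnDecoderReleased(int id) {
    for (PendingFrame& f : pending_) {
      if (f.id == id) f.decoder_holds_ref = false;
    }
  }

  // Returns frames that can be output now, in decode order.
  std::vector<OutputFrame> TakeReadyFrames() {
    std::vector<OutputFrame> out;
    while (!pending_.empty()) {
      PendingFrame& f = pending_.front();
      if (!f.decoder_holds_ref) {
        // Zero-copy path: unlock the buffer and output it directly.
        f.buffer->locked_for_cpu = false;
        out.push_back(OutputFrame{f.id, false});
      } else if (f.age_in_frames > max_age_) {
        // Pathological path: the decoder held on too long, so output a copy
        // and leave the original buffer locked for the decoder.
        copies_.push_back(*f.buffer);
        copies_.back().locked_for_cpu = false;
        out.push_back(OutputFrame{f.id, true});
      } else {
        break;  // keep decode order: wait for the oldest frame
      }
      pending_.pop_front();
    }
    return out;
  }

 private:
  int max_age_;
  std::deque<PendingFrame> pending_;
  std::deque<FakeBuffer> copies_;  // owns the fallback copies
};
```

With the UMA number mentioned above, max_age would presumably be set somewhere
around 7 so that the copy path is rarely hit, but that value is a guess and
would need to be tuned against real data.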

This approach would work on every platform, and could also drastically improve
the case where the optimal cc format is not the decoder format (and I expect
this to happen almost always if we want to use overlays with vp9) since the
copy/conversion from I420 to the cc format could happen on the GPU.

I'm happy to write a small doc with more details about this idea if anyone is
interested.

DaleCurtis

On 2016/04/13 at 13:38:21, dongseong.hwang wrote: > > I think this could be very valuable ...

4 years, 8 months ago (2016-04-13 21:05:16 UTC) #35

DaleCurtis

4 years, 8 months ago (2016-04-13 22:41:12 UTC) #36

On 2016/04/13 at 21:00:09, dcastagna wrote:
> On 2016/04/12 at 19:03:08, dalecurtis wrote:
> > On 2016/04/12 at 11:56:14, dongseong.hwang wrote:
> > > On 2016/04/11 19:33:40, Daniele Castagna wrote:
> > > > > I agree ChromeOS deserves optimizations, I was just pointing out that
> > > > > this optimization will affect ~0% of Chromebooks since
> > > > > FFmpegVideoDecoder is used for VP8 instead.
> > > 
> > > YV12A is still decoded by VpxVideoDecoder in linux and ChromeOS.
> > > e.g. media/test/data/bear-vp8a.webm for convenience,
> > > http://localhost/browsertests/public/media/webm_vp8a.html
> > 
> > Correct, but my point is has almost zero usage; so unless we can expand this
> > to work on a larger set of issues it doesn't seem worth landing.
> > 
> > > 
> > > Although Android doesn't enable native GMBs, the CL still reduce one
> > > redundant copy; software VideoFrame to software GMB VideoFrame.
> > 
> > I think this could be very valuable in this case now that we're shipping
> > vp8, vp9 software decode on Android. Even if it's not a surface texture. Do
> > you have the ability to run power benchmarks on Android? If so it will help
> > your case a lot here.
> > 
> > > 
> > > > > I defer to Daniele here, but if we can find a clean way to enable true
> > > > > zero copy for platforms w/o read-back issues while addressing
> > > > > Daniele's systemic concerns this approach is fine with me.
> > > > 
> > > > GpuMemoryBuffer abstraction adds the constraint of not accessing them
> > > > from CPU and GPU simultaneously.
> > > > I understand we could do this on Intel, but on other platforms (Mac and
> > > > potentially Android with SurfaceTextures) this wouldn't work. IOSurfaces
> > > > might be locked, and SurfaceTextures can't be always read back.
> > > > So, I'd discourage to go down that path.
> > 
> > What about Windows? Is there any benefit to always writing to a shared
> > memory GMB w/ concurrent read/write access and handling the conversion to a
> > native GMB elsewhere?
> > 
> > > 
> > > I understand your point. It's kind of abusing GMBs. However, practical
> > > benefit is so large.
> > > Most of chromebooks in the market cannot decode VP9 by GPU. Intel supports
> > > it after Skylake.
> > > This tweak can enable lots of Chromebooks to extend Youtube watching
> > > hours.
> > > How about using the tweak on only ChromeOS?
> > 
> > We shouldn't abuse GMBs. Instead we/you should be thinking of ways to refine
> > the API such that it supports your use case w/o abuse. What about exposing an
> > attribute on the GpuMemoryBuffer interface that indicates if the GMB has
> > concurrent cpu and gpu access? In cases where this is not true we can use the
> > conversion step, while in other cases we can use the copy+transform step.
> > Daniele, do you see any paths forward with this work? Or do you and David
> > think this is the wrong way to go no matter what?
> 
> David already expressed his opinion about doing something like this. I'll add
> mine.
> 
> I'd rather have something that can be used on every platform without looking
> at the specific GMB implementation details, even if it's through a
> well-defined API. That way the code we land would have a much bigger impact
> (every platform) and hopefully would add less complexity.
> 
> In particular, after we address
> https://bugs.chromium.org/p/webm/issues/detail?id=1181 we can try to implement
> what we already talked about: VpxVideoDecoder could decode directly into GMBs,
> keep them locked until the decoder stops referencing them, and output the
> VideoFrame at that point. In a pathological situation where the decoder holds
> on to the VideoFrame longer than a certain threshold, we might need to copy it
> to another GMB. We have a UMA stat about how long the decoder references a
> buffer, and last time I checked it was always less than 7 frames.
> 
> This approach would work on every platform, and could also drastically improve
> the case where the optimal cc format is not the decoder format (and I expect
> this to happen almost always if we want to use overlays with vp9) since the
> copy/conversion from I420 to the cc format could happen on the GPU.
> 
> I'm happy to write a small doc with more details about this idea if anyone is
> interested.

I think a doc on this sounds great. Solving this would help everyone.

dshwang

4 years, 7 months ago (2016-05-04 10:01:48 UTC) #37

Hi, how about resuming the review?
As Daniele is extending the vpx API, this change will be useful in several
places.
In addition, ChromeOS will also support zero-copy video playback very soon:
https://codereview.chromium.org/1869793002/
We can start by reviewing the preparation patch first:
https://codereview.chromium.org/1874733002/

In parallel, I'll propose a new BufferUsage, GPU_CPU_READ_CPU_READ_WRITE, so
that the same approach can be reused for FFmpeg.

> We have a UMA stat about how long the decoder references a buffer, and last
> time I checked it was always less than 7 frames.

Zero-copy for vpx seems feasible on all platforms. For FFmpeg, I tried
postponing the VideoFrame creation until FFmpeg releases its reference, but the
playback smoothness was not acceptable.
FFmpeg might hold on to its decoder references for tens of frames.
