|
|
Chromium Code Reviews|
Created:
6 years, 2 months ago by dshwang Modified:
6 years, 2 months ago Reviewers:
Ken Russell (switch to Gerrit), aelias_OOO_until_Jul13, Hongbo Min, Zhenyao Mo, vmiura, no sievers CC:
chromium-reviews, piman+watch_chromium.org, Shouqun Liu Base URL:
https://chromium.googlesource.com/chromium/src.git@master Project:
chromium Visibility:
Public. |
DescriptionUse glDiscardFramebuffer to save memory bandwidth in PNaCl
Leverage glDiscardFramebuffer for tiled-based GPU to improve performance of
offscreen GL contexnt with preserved target buffer and also save memory
bandwidth.
BUG=424176
Patch Set 1 #
Total comments: 2
Messages
Total messages: 19 (2 generated)
dongseong.hwang@intel.com changed reviewers: + hongbo.min@intel.com, kbr@chromium.org, zmo@chromium.org
Could you review this optimization? It is the same optimization of WebGL for gles2 decoder: https://codereview.chromium.org/614883002
https://codereview.chromium.org/651863002/diff/1/gpu/command_buffer/service/g... File gpu/command_buffer/service/gles2_cmd_decoder.cc (right): https://codereview.chromium.org/651863002/diff/1/gpu/command_buffer/service/g... gpu/command_buffer/service/gles2_cmd_decoder.cc:9375: if (feature_info_->feature_flags().ext_discard_framebuffer) { According to my previous work, an attempt to clear a framebuffer on tiled GPU is time-consuming, if glDiscardFramebuffer is used, it gives a performance hint and allow GPU driver to set the pixel vale to whatever it wishes, and the glClear could be skipped (I have verified it on ASUS Nexus series, it works well without glClear if glDiscardFramebuffer is used). So my question is, where is glClear called to clear |offscreen_target_frame_buffer_| and prepare a new frame? and can we skip that glClear if have?
Hongbo, Thank you for review. https://codereview.chromium.org/651863002/diff/1/gpu/command_buffer/service/g... File gpu/command_buffer/service/gles2_cmd_decoder.cc (right): https://codereview.chromium.org/651863002/diff/1/gpu/command_buffer/service/g... gpu/command_buffer/service/gles2_cmd_decoder.cc:9375: if (feature_info_->feature_flags().ext_discard_framebuffer) { On 2014/10/14 02:27:48, Hongbo Min wrote: > According to my previous work, an attempt to clear a framebuffer on tiled GPU > is time-consuming, if glDiscardFramebuffer is used, it gives a performance hint > and allow GPU driver to set the pixel vale to whatever it wishes, and the > glClear could be skipped (I have verified it on ASUS Nexus series, it works well > without glClear if glDiscardFramebuffer is used). As far as I understand, your CL didn't affect whether glClear is skipped or not. WebGLRenderingContextBase::clearIfComposited still do glClear every frame. Let me explain why your CL speed up in tiled-based GPU as I understand. When binding another fbo, tiled-based GPU must stores tiled cache to gpu memory (texture, stencil buffer and depth buffer). glDiscardFramebuffer hints GPU to not read back tiled cache to gpu memory because the software don't need anymore the changed fbo information. It saves huge gpu bandwidth. Let's go back to your CL. The fbo lifecycle is as follows: 1. WebGL replaces color texture to front fbo. At that time, you hint glDiscardFramebuffer because DrawingBuffer doesn't update fbo so you don't want to copy tiled cache to color texture redundantly. It wastes bandwidth. 2. Something later, WebGLRenderingContextBase::clearIfComposited performs glClear on the fbo. My CL is the same to your WebGL CL. This code just bind fbo and replace color texture, so we don't want to copy back tiled cache to color texture redundantly. glDiscardFramebuffer can make it. > So my question is, where is glClear called to clear > |offscreen_target_frame_buffer_| and prepare a new frame? and can we skip that > glClear if have? As I explain above, glClear is not related to glDiscardFramebuffer.
Very elaborate description. Thanks. On 2014/10/14 12:09:30, dshwang wrote: > Hongbo, Thank you for review. > > https://codereview.chromium.org/651863002/diff/1/gpu/command_buffer/service/g... > File gpu/command_buffer/service/gles2_cmd_decoder.cc (right): > > https://codereview.chromium.org/651863002/diff/1/gpu/command_buffer/service/g... > gpu/command_buffer/service/gles2_cmd_decoder.cc:9375: if > (feature_info_->feature_flags().ext_discard_framebuffer) { > On 2014/10/14 02:27:48, Hongbo Min wrote: > > According to my previous work, an attempt to clear a framebuffer on tiled GPU > > is time-consuming, if glDiscardFramebuffer is used, it gives a performance > hint > > and allow GPU driver to set the pixel vale to whatever it wishes, and the > > glClear could be skipped (I have verified it on ASUS Nexus series, it works > well > > without glClear if glDiscardFramebuffer is used). > > As far as I understand, your CL didn't affect whether glClear is skipped or not. > WebGLRenderingContextBase::clearIfComposited still do glClear every frame. Indeed, my CL for webgl explicitly skips glClear in WebGLRenderingContextBase::clearIfComposited which would return false early if the DrawingBuffer supports discarding framebuffer by checking DrawingBuffer::discardFramebufferSupported. See https://codereview.chromium.org/614883002/patch/40001/50001. > > Let me explain why your CL speed up in tiled-based GPU as I understand. When > binding another fbo, tiled-based GPU must stores tiled cache to gpu memory > (texture, stencil buffer and depth buffer). glDiscardFramebuffer hints GPU to > not read back tiled cache to gpu memory because the software don't need anymore > the changed fbo information. It saves huge gpu bandwidth. > > Let's go back to your CL. The fbo lifecycle is as follows: > 1. WebGL replaces color texture to front fbo. At that time, you hint > glDiscardFramebuffer because DrawingBuffer doesn't update fbo so you don't want > to copy tiled cache to color texture redundantly. It wastes bandwidth. > 2. Something later, WebGLRenderingContextBase::clearIfComposited performs > glClear on the fbo. > > My CL is the same to your WebGL CL. This code just bind fbo and replace color > texture, so we don't want to copy back tiled cache to color texture redundantly. > glDiscardFramebuffer can make it. > > > > So my question is, where is glClear called to clear > > |offscreen_target_frame_buffer_| and prepare a new frame? and can we skip that > > glClear if have? > > As I explain above, glClear is not related to glDiscardFramebuffer. According to my experiments, on ASUS Nexus 7 device, only if glClear is skipped, the performance could get improved ~30% when using glDiscardFramebuffer. Otherwise, glDiscardFramebuffer won't have an impact on fps improvement. I have no much idea about why glClear calling on framebuffer is so time-consuming. Maybe a driver bug?
cc shouqun
On 2014/10/15 02:03:59, Hongbo Min wrote: > Very elaborate description. Thanks. > > On 2014/10/14 12:09:30, dshwang wrote: > > As far as I understand, your CL didn't affect whether glClear is skipped or > not. > > WebGLRenderingContextBase::clearIfComposited still do glClear every frame. > > Indeed, my CL for webgl explicitly skips glClear in > WebGLRenderingContextBase::clearIfComposited which would return false early if > the DrawingBuffer supports discarding framebuffer by checking > DrawingBuffer::discardFramebufferSupported. See > https://codereview.chromium.org/614883002/patch/40001/50001. Ah, I didn't notice this change. Thank you for correcting it. However, I found a bug thanks to your explanation. glDiscardFramebuffer() doesn't guarantee clearing FBO, although ASUS Nexus series clear FBO. I'll fix this bug in https://codereview.chromium.org/656053002/ > > As I explain above, glClear is not related to glDiscardFramebuffer. > > According to my experiments, on ASUS Nexus 7 device, only if glClear is skipped, > the performance could get improved ~30% when using glDiscardFramebuffer. > Otherwise, glDiscardFramebuffer won't have an impact on fps improvement. I have > no much idea about why glClear calling on framebuffer is so time-consuming. > Maybe a driver bug? You explain that skipping glClear mostly contribute on speed-up. Which test cases do speed-up ~30%? If WebGL draws few triangles, it might be possible, but huge tests like Aquarium would speed-up little. Am I right? Ah.. After the above fix, the speed-up will be rollback, unfortunately. GLES driver may already have optimization like glDiscardFramebuffer. The code in which you add glDiscardFramebuffer doesn't update buffer of FBO, so driver may don't read back tiled cache to gpu memory.
On 2014/10/15 10:06:31, dshwang wrote: > On 2014/10/15 02:03:59, Hongbo Min wrote: > > Very elaborate description. Thanks. > > > > On 2014/10/14 12:09:30, dshwang wrote: > > > As far as I understand, your CL didn't affect whether glClear is skipped or > > not. > > > WebGLRenderingContextBase::clearIfComposited still do glClear every frame. > > > > Indeed, my CL for webgl explicitly skips glClear in > > WebGLRenderingContextBase::clearIfComposited which would return false early if > > the DrawingBuffer supports discarding framebuffer by checking > > DrawingBuffer::discardFramebufferSupported. See > > https://codereview.chromium.org/614883002/patch/40001/50001. > > Ah, I didn't notice this change. Thank you for correcting it. > However, I found a bug thanks to your explanation. glDiscardFramebuffer() > doesn't guarantee clearing FBO, although ASUS Nexus series clear FBO. > I'll fix this bug in https://codereview.chromium.org/656053002/ > > > > As I explain above, glClear is not related to glDiscardFramebuffer. > > > > According to my experiments, on ASUS Nexus 7 device, only if glClear is > skipped, > > the performance could get improved ~30% when using glDiscardFramebuffer. > > Otherwise, glDiscardFramebuffer won't have an impact on fps improvement. I > have > > no much idea about why glClear calling on framebuffer is so time-consuming. > > Maybe a driver bug? > > You explain that skipping glClear mostly contribute on speed-up. Which test > cases do speed-up ~30%? If WebGL draws few triangles, it might be possible, but > huge tests like Aquarium would speed-up little. Am I right? > Ah.. After the above fix, the speed-up will be rollback, unfortunately. > GLES driver may already have optimization like glDiscardFramebuffer. The code in > which you add glDiscardFramebuffer doesn't update buffer of FBO, so driver may > don't read back tiled cache to gpu memory. The benchmarks I am working on is HexGL[1], and PlayCanvas[2] in Ludei demo list. [1] https://github.com/hmin/HexGL/tree/ffos-local [2] https://cocoonjsservice.ludei.com/cocoonjslaunchersvr/demo-list/ (grep PlayCanvas) I strongly suggest you to try PlayCanvas, it gets a very big performance boost.
kbr@chromium.org changed reviewers: + aelias@chromium.org, sievers@chromium.org, vmiura@chromium.org
I'm deferring to sievers@, aelias@ and vmiura@ since this would principally affect Chrome on Android. I'm not sure any more how this part of the command buffer relates to the compositor's on-screen display path. What sorts of measurements have been done of this change? Does it measurably improve performance?
Sorry. I didn't read the CL description carefully. Please file a new bug about applying this optimization to PNaCl instead of referring to the already-closed bug about WebGL. Thanks.
On 2014/10/16 02:09:58, Ken Russell wrote: > I'm deferring to sievers@, aelias@ and vmiura@ since this would principally > affect Chrome on Android. I'm not sure any more how this part of the command > buffer relates to the compositor's on-screen display path. What sorts of > measurements have been done of this change? Does it measurably improve > performance? According to my performance measurement, glDiscardFramebufferEXT may save memory bandwidth, but have no performance improvement if glClear is not skipped.
On 2014/10/16 02:12:28, Ken Russell wrote: > Sorry. I didn't read the CL description carefully. Please file a new bug about > applying this optimization to PNaCl instead of referring to the already-closed > bug about WebGL. Thanks. Thanks. I filed new bug and changed the description. sievers@, could you review? On 2014/10/16 06:27:15, Hongbo Min wrote: > On 2014/10/16 02:09:58, Ken Russell wrote: > > I'm deferring to sievers@, aelias@ and vmiura@ since this would principally > > affect Chrome on Android. I'm not sure any more how this part of the command > > buffer relates to the compositor's on-screen display path. What sorts of > > measurements have been done of this change? Does it measurably improve > > performance? > > According to my performance measurement, glDiscardFramebufferEXT may save memory > bandwidth, but have no performance improvement if glClear is not skipped. Unfortunately, I also didn't see meaningful speed-up. nevertheless, I submit this CL because of recommended in theory.
Due to Chromium security needs, glDiscardFramebufferEXT imposes a forced glClear in GPU command buffer. glDiscardFramebufferEXT marks framebuffer attachments as invalid (needs clear), which forces a glClear the next time the buffer is used. Therefore replacing a glClear with glDiscardFramebufferEXT generally leads to no improvement, and actually adds some overhead. There should be a win when glDiscardFramebufferEXT is used right at the |end| of a rendering pass when there are unneeded attachments, e.g. usually DEPTH & STENCIL buffers can be discarded to reduce bandwidth on tiling GPUs. In this case I think you could try using glDiscardFramebufferEXT with GL_DEPTH_ATTACHMENT & GL_STENCIL_ATTACHMENT, |before| doing the swap.
On 2014/10/16 17:10:12, vmiura wrote: Thank you for good review. > Due to Chromium security needs, glDiscardFramebufferEXT imposes a forced glClear > in GPU command buffer. > > glDiscardFramebufferEXT marks framebuffer attachments as invalid (needs clear), > which forces a glClear > the next time the buffer is used. Therefore replacing a glClear with > glDiscardFramebufferEXT generally > leads to no improvement, and actually adds some overhead. PNaCl spec said that "The contents of ancillary buffers are always undefined after calling SwapBuffers. The contents of the color buffer are undefined if the value of the PP_GRAPHICS3DATTRIB_SWAP_BEHAVIOR attribute of context is not PP_GRAPHICS3DATTRIB_BUFFER_PRESERVED." https://code.google.com/p/chromium/codesearch#chromium/src/ppapi/cpp/graphics... https://developer.chrome.com/native-client/pepper_dev/c/struct_p_p_b___graphi... So it's intended to make the front buffer (i.e. |offscreen_target_frame_buffer_|) undefined. Currently, the front buffer on PNaCl keeps 1 frame ago back buffer (i.e. |offscreen_saved_color_texture_|). After this CL, it's really undefined. Is it ACCEPTABLE in terms of Chromium security??? FYI, in my experiences about glDiscardFramebufferEXT, Adreno GPU: clear fbo Intel GPU with mesa driver: do nothing > There should be a win when glDiscardFramebufferEXT is used right at the |end| of > a rendering pass when > there are unneeded attachments, e.g. usually DEPTH & STENCIL buffers can be > discarded to reduce > bandwidth on tiling GPUs. > > In this case I think you could try using glDiscardFramebufferEXT with > GL_DEPTH_ATTACHMENT & GL_STENCIL_ATTACHMENT, > |before| doing the swap. After re-thinking, we can use glDiscardFramebufferEXT(GL_DEPTH_ATTACHMENT & GL_STENCIL_ATTACHMENT) with preserve-fbo because the spec said "The contents of ancillary buffers are always undefined" In more detail, let me explain how to composite the back buffer on browser. After SwapBuffer(), browser generates mailbox by the new back buffer and draws the mailbox on the browser.
On 2014/10/16 20:17:36, dshwang wrote: > On 2014/10/16 17:10:12, vmiura wrote: > > PNaCl spec said that > "The contents of ancillary buffers are always undefined after calling > SwapBuffers. The contents of the color buffer are undefined if the value of the > PP_GRAPHICS3DATTRIB_SWAP_BEHAVIOR attribute of context is not > PP_GRAPHICS3DATTRIB_BUFFER_PRESERVED." > https://code.google.com/p/chromium/codesearch#chromium/src/ppapi/cpp/graphics... > https://developer.chrome.com/native-client/pepper_dev/c/struct_p_p_b___graphi... > > So it's intended to make the front buffer (i.e. > |offscreen_target_frame_buffer_|) undefined. Currently, the front buffer on > PNaCl keeps 1 frame ago back buffer (i.e. |offscreen_saved_color_texture_|). > After this CL, it's really undefined. Is it ACCEPTABLE in terms of Chromium > security??? > That's a functional argument about not having to preserve the contents. The other argument is a security argument about not exposing some video memory region the client shouldn't have access to, if the client copies out or reads back the discarded buffer. We usually flag the buffer with needs_clear, see HandleDiscardBackbufferCHROMIUM(). The clear is done implicitly by the decoder every time you try to access the framebuffer (including draw since we don't know if the draw will cover the whole buffer), and Victor is saying that such an implied clear might destroy the perf gains you otherwise get from the discard. @Victor: I thought that tilers are usually able to optimize Clear + full Draw, so that the Clear does not generate actual pixels being written.
> @Victor: I thought that tilers are usually able to optimize Clear + full Draw, > so that the Clear does not generate actual pixels being written. For tilers usually it's free or cheap, but on hybrid tilers like Adreno it's more like drawing a full frame quad. Though, drawing a full frame quad is still cheaper than loading previous frame contents. Where it's a loss is if we do glDiscardFramebufferEXT instead of glClear because GLES2Decoder will just force the glClear anyway, just in a more long winded way. i.e: [Good case] Client 1: glClear(...); draw stuff... Service 1: glClear(...); draw stuff [Bad case] Client 2: glDiscardFramebuffer(...) draw stuff Service 2: glDiscardFramebuffer(...) Invalidate all attachments. Mark framebuffers as potentially invalid. // Force clear to black before "draw stuff" glClearColor(0,0,0,0) glColorMask(1,1,1,1) glClearDepth(0) etc. glClear(...) glClearColor(client state) glClearMask(client state) glClearDepth(client state) etc. draw stuff
@vmlura, @sievers, thank you for good explanation. As a result, this CL regresses perf also. I close it as Invalid. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
