From da1073c247523d07d0485348447fcc02000afee8 Mon Sep 17 00:00:00 2001
From: Philip Langdale
Date: Sat, 29 Sep 2018 18:00:19 -0700
Subject: vo_gpu: vulkan: hwdec_cuda: Add support for Vulkan interop

Despite their place in the tree, hwdecs can be loaded and used just
fine by the vulkan GPU backend. In this change we add Vulkan interop
support to the cuda/nvdec hwdec.

The overall process is mostly straightforward, so the main observation
here is that I had to implement it using an intermediate Vulkan buffer
because direct VkImage usage is blocked by a bug in the nvidia driver.
When that gets fixed, I will revisit this.

Nevertheless, the intermediate buffer copy is very cheap as it's all
device memory from start to finish. Overall CPU utilisation is pretty
much the same as with the OpenGL GPU backend.

Note that we cannot use a single intermediate buffer - rather, there is
a pool of them. This is done because the cuda memcpys are not
explicitly synchronised with the texture uploads.

In the basic case, this doesn't matter because the hwdec is not asked
to map and copy the next frame until after the previous one is
rendered. In the interpolation case, we need extra future frames
available immediately, so we'll be asked to map/copy those frames and
vulkan will be asked to render them. So far, harmless right? No.

All the vulkan rendering, including the upload steps, is batched
together and ends up running very asynchronously from the CUDA copies.

The end result is that all the copies happen one after another, and
only then do the uploads happen, which means all the textures are
uploaded with the same, final, frame data. Whoops. Unsurprisingly,
this results in jerky motion because every 3/4 frames are identical.

The buffer pool ensures that we do not overwrite a buffer that is
still waiting to be uploaded. The ra_buf_pool implementation
automatically checks if existing buffers are available for use and
only creates a new one if it really has to. It's hard to say for sure
what the maximum number of buffers might be, but we believe it won't
be so large as to make this strategy unusable. The highest I've seen
is 12 when using interpolation with tscale=bicubic.

A future optimisation here is to synchronise the CUDA copies with
respect to the vulkan uploads. This can be done with shared semaphores
that would ensure the copy of the second frame only happens after the
upload of the first frame, and so on. This isn't trivial to implement
as I'd have to first adjust the hwdec code to use asynchronous cuda;
without that, there's no way to use the semaphore for synchronisation.
This should result in fewer intermediate buffers being required.
---
 wscript | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'wscript')

diff --git a/wscript b/wscript
index b299a27a1a..7bc1b1bfd8 100644
--- a/wscript
+++ b/wscript
@@ -846,11 +846,11 @@ hwaccel_features = [
     }, {
         'name': 'ffnvcodec',
         'desc': 'CUDA Headers and dynamic loader',
-        'func': check_pkg_config('ffnvcodec >= 8.1.24.1'),
+        'func': check_pkg_config('ffnvcodec >= 8.2.15.3'),
     }, {
         'name': '--cuda-hwaccel',
         'desc': 'CUDA hwaccel',
-        'deps': 'gl && ffnvcodec',
+        'deps': '(gl || vulkan) && ffnvcodec',
         'func': check_true,
     }
 ]
--
cgit v1.2.3
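
The shared-semaphore optimisation described in the commit message is not part
of this patch, but a rough illustration of how it could look is sketched below,
assuming the VK_KHR_external_semaphore_fd export path and the CUDA driver API's
external semaphore support (available with the ffnvcodec headers required
above). Helper names such as import_vk_semaphore() and copy_frame(), and the
omitted error handling, are assumptions for illustration, not code from mpv.

/*
 * Sketch: import Vulkan semaphores into CUDA and use them to order the
 * CUDA copy of frame N against the Vulkan upload of frame N-1.
 */
#include <cuda.h>
#include <vulkan/vulkan.h>

/* Import a Vulkan semaphore (exported as an opaque fd) into CUDA. */
static CUexternalSemaphore import_vk_semaphore(VkDevice dev, VkSemaphore sem,
                                               PFN_vkGetSemaphoreFdKHR get_fd)
{
    VkSemaphoreGetFdInfoKHR fd_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
        .semaphore = sem,
        .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT,
    };
    int fd = -1;
    get_fd(dev, &fd_info, &fd);

    CUDA_EXTERNAL_SEMAPHORE_HANDLE_DESC desc = {
        .type = CU_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD,
        .handle.fd = fd,
    };
    CUexternalSemaphore ext = NULL;
    cuImportExternalSemaphore(&ext, &desc);  /* CUDA takes ownership of fd */
    return ext;
}

/* Copy one decoded frame into the intermediate buffer, ordered against the
 * Vulkan upload of the previous frame via the two imported semaphores. */
static void copy_frame(CUstream stream,
                       CUexternalSemaphore upload_done, /* signalled by Vulkan */
                       CUexternalSemaphore copy_done,   /* waited on by Vulkan */
                       const CUDA_MEMCPY2D *cpy)
{
    CUDA_EXTERNAL_SEMAPHORE_WAIT_PARAMS wait = {0};
    CUDA_EXTERNAL_SEMAPHORE_SIGNAL_PARAMS sig = {0};

    /* Don't start overwriting the buffer until Vulkan has consumed it. */
    cuWaitExternalSemaphoresAsync(&upload_done, &wait, 1, stream);

    /* Asynchronous copy from the decoded CUDA frame into the shared buffer. */
    cuMemcpy2DAsync(cpy, stream);

    /* Tell Vulkan the new contents are ready to be uploaded. */
    cuSignalExternalSemaphoresAsync(&copy_done, &sig, 1, stream);
}

With this kind of ordering in place, a buffer can be reused as soon as its
upload has signalled, which is why fewer intermediate buffers would be needed
than with the unsynchronised pool.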