Feature Request: HW Accelerated Filters

  • It would be nice if we could get the full suite of nvdec/cuda filters. I am most interested in scale_npp myself, but there's a whole host of these: transpose_npp, scale_cuda, yadif_cuda, thumbnail_cuda, overlay_cuda...

    • Official post

    Yes, I see your point on this. I actually plan to add this to the voukoder successor, and I would like to implement it in a smart way.

    From what I understand, this is how the filters would work in the current voukoder:

    1. Copy frame from CPU to GPU
    2. Apply CUDA filter
    3. Copy frame from GPU to CPU
    4. Copy frame from CPU to GPU
    5. Do GPU encoding
    6. Copy frame from GPU to CPU
    7. Write to disk
    8. Go to step 1

    I'd love to do it in this way:

    1. Copy multiple frames from CPU to GPU (as many as fit in the GPU's memory)
    2. Run CUDA filters over all frames
    3. Do GPU encoding on all frames
    4. Copy the encoded frames from GPU to CPU
    5. Write to disk
    6. Go to step 1

    I'm trying to find out whether the FFmpeg accelerated filters work like the first way or the second way. Do you have any info on this?
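
To make the difference between the two ways concrete, here is a toy Python model that only counts copies across the CPU/GPU boundary. It is a sketch, not a measurement: the names `TransferCount`, `first_way`, `second_way` and `frames_per_batch` are made up for illustration, and it assumes that the copy after GPU encoding moves the encoded bitstream while every other copy moves an uncompressed frame.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferCount:
    raw_frame_copies: int   # uncompressed frames crossing the CPU/GPU boundary
    bitstream_copies: int   # encoded frames copied back from the GPU


def first_way(num_frames: int) -> TransferCount:
    # Steps 1, 3 and 4 each move one uncompressed frame per input frame;
    # step 6 moves the encoded result (assumption stated in the lead-in).
    return TransferCount(raw_frame_copies=3 * num_frames,
                         bitstream_copies=num_frames)


def second_way(num_frames: int) -> TransferCount:
    # Only step 1 moves uncompressed frames; the filter output stays on the
    # GPU and is handed directly to the encoder.
    return TransferCount(raw_frame_copies=num_frames,
                         bitstream_copies=num_frames)


def frames_per_batch(free_gpu_bytes: int, bytes_per_frame: int) -> int:
    # Upper bound on frames uploaded in one batch. It ignores the working
    # memory that the filters and the encoder need themselves.
    if bytes_per_frame <= 0:
        raise ValueError("bytes_per_frame must be positive")
    return free_gpu_bytes // bytes_per_frame


if __name__ == "__main__":
    n = 1000
    print("first way: ", first_way(n))
    print("second way:", second_way(n))
```

Under these assumptions the second way moves one third as many uncompressed frames as the first. Batching does not change the number of frames copied, only how many are in flight at once.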

  • I am certain that this is possible. Indeed, such behavior is currently fairly easy to achieve in ffmpeg. I believe that the difficulty with Premiere/Vegas/etc. will be getting the frame to decode at all on the GPU more than anything else, but I could be wrong. My current frameserver method has to decode frames on the CPU sadly, but the rest is done entirely on the GPU - but you might be able to do everything on the GPU with your method.



    These seem to be the most immediately relevant.

    Currently, the way I do it, using avisynth+, 64-bit ffmpeg, and 64-bit debugmode frameserver, is as follows:


    ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i frameserver.avs -filter:v "hwupload_cuda,scale_npp=w=2560:h=1440:interp_algo=lanczos" -c:v hevc_nvenc -rc vbr -cq 24 -qmin 24 -qmax 24 -level 6.2 -profile:v rext -y output.mkv
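
For scripting, the same invocation can be assembled as an argument list. The sketch below only builds and prints the command and does not run ffmpeg. The helper name `build_scale_npp_command` and its parameters are hypothetical; the option values are copied from the command above.

```python
import shlex
from typing import List


def build_scale_npp_command(src: str, dst: str, width: int, height: int,
                            interp_algo: str = "lanczos",
                            cq: int = 24) -> List[str]:
    # Filters run in order: upload the CPU frame to the GPU, then scale it
    # there, so the encoder receives a frame that is already in GPU memory.
    filters = ",".join([
        "hwupload_cuda",
        f"scale_npp=w={width}:h={height}:interp_algo={interp_algo}",
    ])
    return [
        "ffmpeg",
        "-hwaccel", "cuda",
        "-hwaccel_output_format", "cuda",
        "-i", src,
        "-filter:v", filters,
        "-c:v", "hevc_nvenc",
        "-rc", "vbr",
        "-cq", str(cq), "-qmin", str(cq), "-qmax", str(cq),
        "-level", "6.2",
        "-profile:v", "rext",
        "-y", dst,
    ]


if __name__ == "__main__":
    cmd = build_scale_npp_command("frameserver.avs", "output.mkv", 2560, 1440)
    # shlex.join quotes arguments where needed for a POSIX shell.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, assuming an ffmpeg build with the CUDA/NPP filters and NVENC enabled is on the PATH.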

    Edited once, last by Samwise Gamgee (20 March 2021 at 11:59)

    • Official post

    I believe that the difficulty with Premiere/Vegas/etc. will be getting the frame to decode at all on the GPU more than anything else, but I could be wrong

    This is out of my control. At the beginning of it all I get a pointer to a frame buffer in CPU-addressable memory, so the NLE is doing all the decoding, rendering, etc.

    Some months ago I was investigating whether CUDA could improve the speed of pixel format conversion in the high bit depth mode. I noticed that copying frames back and forth between CPU and GPU memory is quite expensive and that I should do it as little as possible.

    I guess the "cuda" pixel format in FFmpeg contains a pointer to the GPU memory. If so, with the FFmpeg command line above you're doing exactly what I was talking about as "way 2".

    As I am using FFmpeg/libav at the C API level, I still have to figure it all out. But this will be included in voukoder sooner or later.