I was recently alerted to an article that described a magical video decoding engine. The site has a history of drawing odd conclusions from its observations, so naturally, I was a bit skeptical about the claims that were relayed to me by a colleague: basically, that the CPU load dropped dramatically while the GPU load stayed the same. This sounded almost too good to be true, so I did some casual tests here (again).
I am not thrilled about downloading a 2 GB installer that messes up my PC when I uninstall it, and running things in a VM would not be an honest test. Nor am I about to buy a new Intel PC to test this out (my next PC will be a Ryzen-based system), so all tests are done with readily available tools: FFmpeg and GPU-Z. I believe that Intel wrote the QSV version of the H.264 decoder in FFmpeg, so I guess it’s as good as it gets.
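As a quick sanity check, you can ask FFmpeg to list its decoders and filter for the QSV entries; a build with Quick Sync support will include h264_qsv. (This uses Windows’ findstr, since that’s the platform under test; use grep on Linux. The exact list depends on how your FFmpeg build was configured.)

    ffmpeg -decoders | findstr qsv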
Tests were done on an old 3770K, 32 GB RAM, Windows 7 with a GeForce 670 dedicated GPU. The 3770K comes with the Intel HD Graphics 4000 integrated graphics solution that supports Quick Sync.
In the nerd-world, a GPU usually means a discrete GPU: an NVidia GeForce or AMD Radeon dedicated graphics card. Using the term “GPU support” is too vague, because different vendors have different support for different things. E.g. NVidia has CUDA and its NVENC/NVDEC codecs, and some things can be done with pixel shaders that work on all GPUs (our decoding pipeline uses this approach and works on integrated as well as discrete GPUs, which is why I use the term GPU-accelerated decoding without embarrassment).
However, when you rely on (or are testing) something very specific, like Intel Quick Sync, then that’s the term you should use. If you say “GPU support”, the reader might be led to believe that a faster NVidia card will get a performance boost (since the NVidia card is much, much faster than the integrated GPU that hosts Quick Sync). This would not be the case. A newer generation of Intel CPU would offer better performance, and it would not work at all on AMD chips with a dedicated GPU (or AMD’s APU solution). The same goes for CUDA in OpenCV: say “CUDA support” to avoid confusion.
Usually, when I benchmark stuff, I run the item under test at full capacity. E.g. if I want to test, say, the CPU-based H.264 decoder in FFmpeg against the Intel Quick Sync based decoder, I will ask the system to decode the exact same clip as fast as possible.
So, let’s decode a 720p clip using the CPU only, and see what we get.
The clip only takes a few seconds to decode, but if you look at the Task Manager, you can see that the CPU went to 100%. That means we are pushing the 3770K to its capacity.
Now, let’s test Quick Sync
Not as fast as the CPU-only run; we got ~580 fps (though we could run CPU decoding at the same time and get more in aggregate).
So we are getting ~200 fps less than the CPU-only method. Fortunately, the CPU is not being taxed to 100% anymore. We’re only at 10% CPU use when the QSV decoder is doing its thing:
But surprisingly, neither is the GPU. In fact, the GPU load is at 0%.
However, if you look at the GPU power readout, you can see an increased power draw on the GPU in a few places (it’s drawing 2.6 W at those spikes). Those are the places where the test is being run. You can also see that the GPU clock increases to meet the demand for processing power.
If there is no load on the GPU, why does it “only” deliver ~600 fps? Why is the load not at 100%? I think the reason is that the GPU load reading in GPU-Z does not show the stress on the dedicated Quick Sync circuitry, which is running at full capacity. I can make the GPU graph increase by moving a window onto the screen that is driven by the Intel HD Graphics 4000 “GPU”, so the GPU-Z tool is working as intended.
I should say that I was able to increase performance by running 2 concurrent decoding sessions, getting to ~800 fps, but from then on, more sessions just lower the frame rate, and eventually, the CPU is saturated as well.
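If you want to reproduce the concurrent-session test on Windows, one simple way is to kick off two decodes from a single prompt; start launches each one in its own console window so they run in parallel:

    start ffmpeg -benchmark -c:v h264_qsv -i clip_720p.mp4 -f null -
    start ffmpeg -benchmark -c:v h264_qsv -i clip_720p.mp4 -f null -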
To enable Quick Sync on my workstation, which has a dedicated NVidia GeForce 670 card and runs Windows 7, I had to enable a “virtual” screen and allow Windows to extend the display to this screen (which I can’t see, because I only have one 4K monitor). I also had to enable the integrated GPU in the BIOS, so it was not exactly plug and play.
I stand by my position: yes, add GPU decoding to the mix, but the user should rely on edge-based detection combined with dedicated sensors (any integrator worth their salt will be able to install a PIR detector and hook it up in just a few minutes). This allows you to run your VMS on extremely low-end hardware, and the scalability is much better than moving a bottleneck to a place where it’s harder to see.