Tag Archives: h.264

Magical “GPU” Based Video Decoder

I was recently alerted to an article that described a magical video decoding engine. The site has a history of drawing odd conclusions from its observations, so naturally I was a bit skeptical about the claims relayed to me by a colleague: basically, that the CPU load dropped dramatically while the GPU load stayed the same. This sounded almost too good to be true, so I did some casual tests here (again).


Test setup

I am not thrilled about downloading a 2 GB installer that messes up my PC when I uninstall it, and running things in a VM would not be an honest test. Nor am I about to buy a new Intel PC to test this out (my next PC will be a Ryzen-based system), so all tests were done with readily available tools: FFmpeg and GPU-Z. I believe that Intel wrote the QSV version of the H.264 decoder, so I guess it’s as good as it gets.

Tests were done on an old Core i7-3770K with 32 GB RAM, running Windows 7, with a dedicated GeForce 670 GPU. The 3770K comes with the Intel HD Graphics 4000 integrated graphics solution, which supports Quick Sync.

Terminology

In the nerd world, a GPU usually means a discrete GPU: an NVIDIA GeForce or AMD Radeon dedicated graphics card. Using the term “GPU support” is too vague, because different vendors have different support for different things. E.g. NVIDIA has CUDA and the NVENC/NVDEC codecs, and some things can be done with pixel shaders, which work on all GPUs (our decoding pipeline uses this approach and works on integrated as well as discrete GPUs, which is why I use the term “GPU-accelerated decoding” without embarrassment).

However, when you rely on (or are testing) something very specific, like Intel Quick Sync, then that’s the term you should use. If you say “GPU support”, the reader might be led to believe that a faster NVIDIA card will give a performance boost (since the NVIDIA card is much, much faster than the integrated GPU that hosts Quick Sync). This would not be the case. A newer generation of Intel CPU would offer better performance, but it would not work at all on AMD chips with a dedicated GPU (or on AMD’s APU solution). The same goes for CUDA in OpenCV: say “CUDA support” to avoid confusion.

Results

Usually, when I benchmark stuff, I run the item under test at full capacity. E.g. if I want to test, say, the CPU-based H.264 decoder in FFmpeg against the Intel Quick Sync based decoder, I ask the system to decode the exact same clip as fast as possible.

So, let’s decode a 720p clip using the CPU only, and see what we get.

CPU
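For reference, the CPU-only run was done with something along these lines. This is a sketch, not an exact transcript: clip_720p.mp4 stands in for my test clip, and the null output simply discards the decoded frames so that only decoding is measured.

    # CPU-only decode: FFmpeg uses its software H.264 decoder by default.
    # -benchmark prints CPU time and peak memory use when the run finishes.
    ffmpeg -benchmark -i clip_720p.mp4 -f null -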

The clip only takes a few seconds to decode, but if you look at the task manager, you can see that the CPU went to 100%. That means we are pushing the 3770K to its capacity.

[Screenshot: FFmpeg frame rate, CPU-only decode]

Now, let’s test Quick Sync

QSV
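The Quick Sync run forces the h264_qsv decoder instead, assuming an FFmpeg build with Quick Sync (libmfx) support; again, the file name is just a stand-in:

    # Same clip, but decoded by the dedicated Quick Sync hardware.
    # The decoder is specified before -i so it applies to the input stream.
    ffmpeg -benchmark -c:v h264_qsv -i clip_720p.mp4 -f null -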

Not as fast as the CPU-only run (though we could run CPU decoding at the same time and get more in aggregate), but we got ~580 fps.

[Screenshot: FFmpeg frame rate, QSV decode]

So we are getting ~200 fps less than the CPU-only method. Fortunately, the CPU is not being taxed to 100% anymore. We’re only at 10% CPU use when the QSV decoder is doing its thing:

[Screenshot: CPU load while the QSV decoder runs]

Magic!!!

But surprisingly, neither is the GPU. In fact, the GPU load sits at 0%.

[Screenshot: GPU-Z sensors during QSV decode]

However, if you look at the GPU power, you can see an increased power draw on the GPU in a few places (it’s drawing 2.6 W at those spikes). Those are the places where the test is being run. You can also see that the GPU clock increases to meet the demand for processing power.

If there is no load on the GPU, why does it “only” deliver ~600 fps? Why is the load not at 100%? I think the reason is that the GPU load in GPU-Z does not show the stress on the dedicated Quick Sync circuitry, which is running at full capacity. I can make the GPU graph increase by moving a window onto the screen driven by the Intel HD Graphics 4000 “GPU”, so GPU-Z itself is working as intended.

I should say that I was able to increase performance by running 2 concurrent decoding sessions, getting to ~800 fps, but from then on, more sessions just lower the frame rate, and eventually the CPU is saturated as well.
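The concurrency test was nothing fancier than launching two of the QSV sessions above in parallel, roughly like this:

    # Two concurrent Quick Sync decode sessions; "wait" blocks until both finish.
    ffmpeg -benchmark -c:v h264_qsv -i clip_720p.mp4 -f null - &
    ffmpeg -benchmark -c:v h264_qsv -i clip_720p.mp4 -f null - &
    wait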

Grief

To enable Quick Sync on my workstation, which has a dedicated NVIDIA GeForce 670 card and runs Windows 7, I had to enable a “virtual” screen and allow Windows to extend the desktop to this screen (which I can’t see, because I only have one 4K monitor). I also had to enable the integrated GPU in the BIOS, so it was not exactly plug and play.

Conclusion

I stand by my position: yes, add GPU decoding to the mix, but the user should rely on edge-based detection combined with dedicated sensors (any integrator worth their salt will be able to install a PIR detector and hook it up in just a few minutes). This allows you to run your VMS on extremely low-end hardware, and the scalability is much better than just moving a bottleneck to a place where it’s harder to see.


Codecs and Client Applications

4K and H.265 are coming along nicely, but they are also part of the problems we as developers are facing these days. Users want low latency, fluid video, low bandwidth and high resolution, and they want these things on 3 types of platforms – traditional PC applications, apps for mobile devices and tablets, and a web interface. In this article, I’d like to provide some insights from a developer’s perspective.

Fluid or Low Latency

Fluid video and low latency at the same time is highly problematic. To the HLS guys, 1 second is “low latency”, but we, and the WebRTC hackers, are looking for latencies in the sub-100 ms area. Video surveillance doesn’t always need super low latency – if a fixed camera has 2 seconds of latency, that is rarely a problem (in the real world). But as soon as any kind of interaction takes place (PTZ or 2-way audio), you need low latency. Optical PTZ can “survive” high latency if you only support PTZ presets or simple movements, but tracking objects will be virtually impossible on a high-latency feed.

Why high latency?

A lot of the tech developed for video distribution is intended for recorded video, not for low-latency live video. The intent is to download a chunk of video and, while that plays, download the next one in the background; this happens over and over, but playback does not commence until at least the entire first chunk has been read off the wire. The chunks are usually 5-10 seconds in duration, which is not a problem when you’re watching Bob’s Burgers on Netflix.
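To make the chunking concrete, here is a sketch of how a live feed could be packaged as HLS with FFmpeg; the camera URL and the 6-second segment duration are illustrative, not taken from any particular product:

    # Repackage a live RTSP feed into HLS segments without re-encoding.
    # Each .ts segment is ~6 seconds long, and players typically buffer one or
    # more full segments before starting, so latency begins at several seconds.
    ffmpeg -rtsp_transport tcp -i rtsp://camera.example/stream1 \
           -c copy -f hls -hls_time 6 -hls_list_size 5 stream.m3u8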

The lowest latency you can get is to simply draw the image when you receive it, but due to packetization and network latency, you’re not going to get the frames at a steady pace, which leads to stuttering, and that is NOT acceptable when Bob’s Burgers is being streamed.

How about WebRTC?

If you’ve ever used Google Hangouts, then you’ve probably used WebRTC. It works great when you have a peer-to-peer connection with a good, low-latency link. The peer-to-peer part is important, because part of the design is that the recipient can tell the sender to adjust its quality on demand. This is rarely feasible on a traditional IP camera, but it could eventually be implemented. WebRTC is built into some web browsers, and it supports H.264 by default, but not H.265 (AFAIK) or any of the legacy formats.

Transcoding

Yes, and no. Transcoding comes at a rather steep price if you expect to use your system as if it ran without transcoding. The server has to decode every frame and then re-encode it in the proper format. Some vendors transcode to JPEG, which makes it easier for the client to handle, but puts a tremendous amount of stress on the server. Not so much on the encoding side; it’s the decoding of all those streams that is pretty costly. To limit the impact on the transcoding server, you may have to alter the UI to reflect the limitation in server-side resources.
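As a rough sketch of what that costs, a transcoding server ends up running something like the following for every connected stream (the camera URL and quality settings are made up for illustration). The full H.264/H.265 decode happens regardless of what the output format is:

    # Decode one camera stream and re-encode it as multipart JPEG for a thin client.
    # The decode on the input side is the expensive part, and it runs once per
    # transcoded stream.
    ffmpeg -rtsp_transport tcp -i rtsp://camera.example/stream1 \
           -c:v mjpeg -q:v 5 -f mpjpeg -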

Installed Applications

The trivial case is an installed application on a PC or a mobile device. Large install files are pretty annoying (and often unnecessary), but you can package all the application dependencies, and the developer can do pretty much anything they want. There’s usually a lot of disk space available and a fairly large amount of RAM.

On a mobile device you struggle with OS fragmentation (in the case of Android), but since you are writing an installed application, you are able to include all dependencies. The limitations are in computing power, storage, RAM and physical dimensions. The UI that works for a PC with a mouse is useless on a 5″ screen with a touch interface. The CPUs/GPUs are pretty powerful (for their size), but they are nowhere near the processing power of a halfway decent PC. The UI has to take this into consideration as well.

“Pure” Web Clients

One issue that I have come across a few times is that some people think the native app “uses a lot of resources”, while a web-based client would somehow, magically, use fewer resources to do the same job. The native app may use 90% of the CPU to decode video, but it does so a LOT more efficiently than a web client would ever be able to. So if you have a low-end PC, the solution is not to demand a web client, but to lower the number of cameras on-screen.

Let me make it clear: any web client that relies on an ActiveX component being downloaded and installed might as well have been an installed application. ActiveX controls are compiled binaries that only run on the intended platform (IE, Windows, x86 or x64). They are usually implicitly (sometimes explicitly) left behind on the machine, and can be instantiated and used as an attack vector if you can trick a user into visiting a malicious site (which is quite easy to accomplish).

The purpose of a web client is to allow someone to access the system from a random machine on the network, without having to install anything. Another advantage is that, since there is no installer, there’s no need to constantly download and install upgrades every time you access your system: when you log in, you get the latest version of the “client”. Forget all about “better for low end” and “better performance”.

Technology

Java applets can be installed, but often setting up Java for a browser is a pain in the ass (security issues), and performance can be pretty bad.

Flash apps are problematic too, and suffer the same issues as Java applets. Flash has a decent H.264 decoder for .flv-formatted streams, but no support for H.265 or legacy formats (unless you write them from scratch… and good luck with that 🙂 ). Furthermore, everyone with a web browser in their portfolio is trying to kill Flash due to its many problems.

NPAPI and other native plugin frameworks (NaCl, Pepper) did offer decent performance, but usually only worked in one or two browsers (Chrome or Firefox), and Chrome later removed support for NPAPI.

HTML5 offers the <video> tag, which can be used for “live” video, just not low-latency video, and codec support is limited.

JavaScript performance is now at a point (in the leading browsers) where you can write a full decoder for just about any format you can think of and get reasonable performance for 1 or 2 720p streams on a modern PC.

Conclusion

To get broad client-side support (that truly works), you have to make compromises on the supported-device side. You cannot support every legacy codec and device and expect to get a decent client-side experience on every platform.

As a general rule, I disregard arguments of the form “if it doesn’t work with X, then it is useless”. Too often, this argument gains traction, and to satisfy X, we sacrifice Y. I would rather support Y 100% if Y makes more sense. I’d rather serve 3 good dishes than 10 bad ones. But in this industry, it seems that 6,000 shitty dishes at an expensive “restaurant” is what people want. I don’t.

 

 
