Codecs and Client Applications

4K and H.265 are coming along nicely, but they are also part of the problem we as developers are facing these days. Users want low latency, fluid video, low bandwidth and high resolution, and they want these things on three types of platforms – traditional PC applications, apps for mobile devices and tablets, and a web interface. In this article, I’d like to provide some insights from a developer’s perspective.

Fluid or Low Latency

Fluid and low latency at the same time is highly problematic. To the HLS guys, 1 second is “low latency”, but we, like the WebRTC hackers, are looking for latencies in the sub-100 ms range. Video surveillance doesn’t always need super low latency – if a fixed camera has 2 seconds of latency, that is rarely a problem in the real world. But as soon as any kind of interaction takes place (PTZ or 2-way audio), you need low latency. Optical PTZ can “survive” high latency if you only support PTZ presets or simple movements, but tracking objects will be virtually impossible on a high-latency feed.

Why high latency?

A lot of the tech developed for video distribution is intended for recorded video, not for low-latency live video. The idea is to download a chunk of video and, while it plays, download the next chunk in the background, over and over – but playback does not commence until at least the entire first chunk has been read off the wire. The chunks are usually 5-10 seconds in duration, which is not a problem when you’re watching Bob’s Burgers on Netflix.
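
To make the trade-off concrete, here is a rough sketch of what chunked playback boils down to. The segment URL scheme and the enqueue callback are made up for illustration, but the startup cost is inherent: you cannot start before the first full chunk has arrived.

```typescript
// Minimal sketch of chunked ("HLS-style") live playback. The /live/seg-<n>.ts
// URL scheme and the enqueue callback are hypothetical, not a real player API.
const SEGMENT_SECONDS = 6;

async function fetchSegment(n: number): Promise<ArrayBuffer> {
  const res = await fetch(`/live/seg-${n}.ts`); // hypothetical segment URL
  return res.arrayBuffer();
}

async function playLive(startAt: number, enqueue: (buf: ArrayBuffer) => void) {
  // Playback cannot start before the first whole segment has been downloaded,
  // so the latency floor is roughly one segment duration (~6 s here), before
  // network and decode time are even counted.
  let next = startAt;
  let pending = fetchSegment(next++);
  while (true) {
    const segment = await pending;  // wait for the current segment
    pending = fetchSegment(next++); // prefetch the next one in the background
    enqueue(segment);               // hand ~6 s of video to the player
  }
}
```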

The lowest latency you can get is to simply draw each image as you receive it, but due to packetization and variable network latency, you’re not going to get the frames at a steady pace, which leads to stuttering – and that is NOT acceptable when Bob’s Burgers is being streamed.
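
The usual compromise is a small jitter buffer: hold every frame for a fixed amount of added delay so the uneven arrival is smoothed out. A minimal sketch follows – the Frame shape and the render callback are my own assumptions, not any particular player’s API.

```typescript
// Minimal jitter-buffer sketch: hold each frame until captureTime + delay,
// trading a fixed amount of latency for smooth playback. Assumes the sender's
// timestamps are comparable to Date.now(); real code would use relative clocks.
interface Frame {
  captureTimeMs: number; // timestamp set by the sender
  data: Uint8Array;      // encoded or decoded payload, not specified here
}

class JitterBuffer {
  constructor(
    private delayMs: number,            // e.g. 150 ms of deliberately added latency
    private render: (f: Frame) => void, // caller-supplied draw function
  ) {}

  push(frame: Frame): void {
    const dueIn = frame.captureTimeMs + this.delayMs - Date.now();
    // Late frames are drawn immediately (they could also be dropped);
    // early frames are held back so playback stays fluid.
    setTimeout(() => this.render(frame), Math.max(0, dueIn));
  }
}
```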

How about WebRTC?

If you’ve ever used Google Hangouts, then you’ve probably used WebRTC. It works great when you have a good, low-latency peer-to-peer connection. The peer-to-peer part is important, because part of the design is that the recipient can tell the sender to adjust its quality on demand. This is rarely feasible on a traditional IP camera, but it could eventually be implemented. WebRTC is built into some web browsers, and it supports H.264 by default, but not H.265 (AFAIK) or any of the legacy formats.
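
For the curious, the browser side of WebRTC is pleasantly small. The sketch below only shows the receiving end and leaves signaling (how the offer/answer and ICE candidates travel between the camera or gateway and the browser) out, since that part is not standardized.

```typescript
// Minimal sketch of the receiving side of a WebRTC session in the browser.
// Signaling is out of scope here and only hinted at.
const pc = new RTCPeerConnection();

pc.ontrack = (event) => {
  // Attach the incoming stream to a <video> element for sub-second playout.
  const video = document.querySelector('video')!;
  video.srcObject = event.streams[0];
  void video.play();
};

// The rest is signaling: answer the sender's offer and return the answer over
// whatever channel you have (WebSocket, REST, ...).
async function answerOffer(offer: RTCSessionDescriptionInit) {
  await pc.setRemoteDescription(offer);
  const answer = await pc.createAnswer();
  await pc.setLocalDescription(answer);
  return answer; // send back to the sender via your signaling channel
}
```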

Transcoding

Can transcoding solve this? Yes, and no. Transcoding comes at a rather steep price if you expect to use your system as if it ran without transcoding. The server has to decode every frame and then re-encode it in the proper format. Some vendors transcode to JPEG, which makes it easier for the client to handle but puts a tremendous amount of stress on the server – not so much on the encoding side, but the decoding of all those streams is pretty costly. To limit the impact on the transcoding server, you may have to alter the UI to reflect the limitation in server-side resources.
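
To make the cost concrete, here is roughly what a transcoding gateway ends up doing per camera. This sketch assumes ffmpeg is installed and simply shells out to it; a real product would use its own decoder, but the decode-plus-re-encode per stream is the same.

```typescript
import { spawn } from 'node:child_process';

// Rough sketch of a per-camera transcode (assumes ffmpeg is on the PATH and
// the RTSP URL is reachable). Every viewer-friendly JPEG stream costs the
// server a full H.264/H.265 decode plus a re-encode – that is where the CPU goes.
function transcodeToMjpeg(rtspUrl: string, onJpegData: (chunk: Buffer) => void) {
  const ffmpeg = spawn('ffmpeg', [
    '-rtsp_transport', 'tcp',
    '-i', rtspUrl,   // decode the camera stream (the expensive part)
    '-q:v', '5',     // JPEG quality
    '-f', 'mjpeg',   // re-encode as motion JPEG
    'pipe:1',        // write to stdout
  ]);
  ffmpeg.stdout.on('data', onJpegData); // relay JPEG data to the client
  ffmpeg.stderr.on('data', () => {});   // ignore ffmpeg's log chatter
  return () => ffmpeg.kill('SIGTERM');  // stop function for the caller
}
```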

Installed Applications

The trivial case is an installed application on a PC or a mobile device. Large install files are pretty annoying (and often unnecessary), but you can package all the application’s dependencies, and the developer can do pretty much anything they want. There’s usually a lot of disk space available and a fairly large amount of RAM.

On a mobile device you struggle with OS fragmentation (in the case of Android), but since you are writing an installed application, you are able to include all dependencies. The limitations are in computing power, storage, RAM and physical dimensions. A UI that works for a PC with a mouse is useless on a 5″ screen with a touch interface. The CPUs/GPUs are pretty powerful (for their size), but they are nowhere near the processing power of a halfway decent PC. The UI has to take this into consideration as well.

“Pure” Web Clients

One issue that I have come across a few times is that some people think the native app “uses a lot of resources”, while a web-based client would somehow, magically, use fewer resources to do the same job. The native app uses 90.0% of the CPU to decode video, and it does so a LOT more efficiently than a web client would ever be able to. So if you have a low-end PC, the solution is not to demand a web client, but to lower the number of cameras on-screen.

Let me make it clear: any web client that relies on an ActiveX component being downloaded and installed might as well have been an installed application. ActiveX controls are compiled binaries that only run on the intended platform (IE, Windows, x86 or x64). They are usually implicitly (sometimes explicitly) left behind on the machine, and can be instantiated and used as an attack vector if you can trick a user into visiting a malicious site (which is quite easy to accomplish).

The purpose of a web client is to allow someone to access the system from a random machine on the network without having to install anything. Another advantage is that since there is no installer, there’s no need to constantly download and install upgrades every time you access your system – when you log in, you get the latest version of the “client”. Forget all about “better for low-end machines” and “better performance”.

Technology

Java applets can be installed, but setting up Java for a browser is often a pain in the ass (security issues), and performance can be pretty bad.

Flash apps are problematic too, and suffer from the same issues as Java applets. Flash has a decent H.264 decoder for .flv formatted streams, but no support for H.265 or legacy formats (unless you write the decoders from scratch… and good luck with that 🙂 ). Furthermore, everyone with a web browser in their portfolio is trying to kill Flash due to its many problems.

NPAPI and other native plugin frameworks (NaCl, Pepper) did offer decent performance, but usually only worked in one or two browsers (Chrome or Firefox), and Chrome later removed support for NPAPI.

HTML5 offers the <video> tag, which can be used for “live” video – just not low-latency video, and codec support is limited.
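
For completeness, “live” in this context usually means Media Source Extensions: you push fragmented MP4 segments into the <video> element yourself. A minimal sketch, where the WebSocket feed and the codec string are assumptions:

```typescript
// Sketch of feeding fragmented MP4 segments into a <video> tag via Media
// Source Extensions. Latency is still bounded by segmenting and the browser's
// own buffering, so this is "live-ish", not low latency.
const video = document.querySelector('video')!;
const mediaSource = new MediaSource();
video.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener('sourceopen', () => {
  const sb = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.42E01E"');
  const ws = new WebSocket('wss://example.com/camera1'); // hypothetical feed
  ws.binaryType = 'arraybuffer';
  ws.onmessage = (msg) => {
    // Simplification: a real client queues segments while the buffer is busy.
    if (!sb.updating) {
      sb.appendBuffer(msg.data as ArrayBuffer); // append the next fMP4 fragment
    }
  };
});
```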

JavaScript performance is now at a point (in the leading browsers) where you can write a full decoder for just about any format you can think of and get reasonable performance for one or two 720p streams on a modern PC.
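
The decoder itself is beyond the scope of this post, but the output path is always the same – raw pixels painted onto a canvas. A tiny sketch (the decoder producing the RGBA buffer is hypothetical):

```typescript
// Only the common output path is shown here: painting raw RGBA frames onto a
// canvas as they come out of a (hypothetical) JavaScript or WASM decoder.
function paintFrame(canvas: HTMLCanvasElement, rgba: Uint8ClampedArray,
                    width: number, height: number): void {
  const ctx = canvas.getContext('2d')!;
  const image = new ImageData(rgba, width, height);
  ctx.putImageData(image, 0, 0); // one 720p RGBA frame is ~3.7 MB of pixels to move
}
```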

Conclusion

To get broad client-side support (that truly works), you have to make compromises on the device side. You cannot support every legacy codec and device and expect a decent client-side experience on every platform.

As a general rule, I disregard arguments that “if it doesn’t work with X, then it is useless”. Too often, this argument gains traction, and to satisfy X, we sacrifice Y. I would rather support Y 100% if Y makes more sense. I’d rather serve 3 good dishes than 10 bad ones. But in this industry, it seems that 6000 shitty dishes at an expensive “restaurant” is what people want. I don’t.

CPU vs GPU

I think some of the incumbents are going in the wrong direction, while I am a little envious of some that I think got it 100% right.

In the old days, things were simple and cameras were really dumb. Today cameras are often quite clever, but hordes of VMS salespeople are now trying to make them dumb again, thereby driving the whole industry backward to the detriment of the end-users. Eventually, though, I think people will wake up and realize what’s going on.

The truth is that you can run a VMS on a $100 hardware platform (excluding storage). Yet, if you are keeping up with the latest news, it seems that you need a $3000 monster PC with a high-end GPU to drive it. In the grand scheme of things (cost of cameras, cabling and VMS licenses) the $2900 difference is peanuts, but it bothers me nonetheless. It bothers me because it suggests a piss-poor use of the available resources.

[Image: a $40 VMS-capable PC]

As I have stated quite a few times, the best way to detect motion is to use a PIR sensor, but if you insist on doing any sort of image analysis, the best place to do it is on the camera. The camera has access to the uncompressed frame in its optimal format, and it brings its own CPU resources to the party. If you move the motion detection to the camera, then your $100 platform will never have to decode a single video frame, and can focus on what the VMS should be doing: reading, storing and relaying data.

In contrast, you can let the camera throw away a whole bunch of information as it compresses the frame, then send the frame across the network (dropping a few packets for good measure) to a PC that is sweating bullets because it must now decompress each and every frame – MPEG formats are all-or-(almost)-nothing, so there is no “decode every 4th frame” option here. The decompressed frame now contains compression artifacts, which make accurate analysis difficult. The transmission across the network can also cause frames to arrive at an uneven pace, which causes other problems for video analytics engines.
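
For reference, server-side motion detection is conceptually little more than frame differencing on those fully decoded frames. A naive sketch – the thresholds are invented for illustration – which also shows why artifacts and lost packets translate directly into false alarms:

```typescript
// Naive frame differencing of the kind a server-side VMS motion detector ends
// up running on every decoded frame (here assumed to be grayscale buffers).
// The thresholds are made-up examples.
function motionRatio(prev: Uint8Array, curr: Uint8Array, pixelThreshold = 25): number {
  let changed = 0;
  for (let i = 0; i < curr.length; i++) {
    if (Math.abs(curr[i] - prev[i]) > pixelThreshold) changed++;
  }
  return changed / curr.length; // fraction of pixels that "moved"
}

// A burst of packet loss or compression artifacts can flip a large fraction of
// pixels at once, which is exactly how false alarms like the one pictured
// below happen.
const ALARM_RATIO = 0.02; // hypothetical: alarm if more than 2% of pixels changed
function isMotion(prev: Uint8Array, curr: Uint8Array): boolean {
  return motionRatio(prev, curr) > ALARM_RATIO;
}
```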

[Image: a frame garbled by missing packets. Look at all that motion! Let’s sound the alarm.]

VMS vendors now say they have a “solution” to the PC getting crushed under the insane workload required to do any sort of meaningful video analysis: move everything to the GPU, they say – and it’s kinda true. If you bring up the Task Manager in Windows, your CPU utilization will now be lower, but crank up GPU-Z and you (should) see the GPU buckling under the load. One might ask if it would not have been cheaper to get a $350 octa-core Ryzen CPU instead of a $500 GPU.

[Image: GPU-Z screenshot]

Some will say that if the integrator has to spend two days setting up the cameras to use edge-based detection, it might be cheaper to just spring for the super PC and do everything on that. This assumes that the server-side setup can actually be done quicker than setting it up on the camera. I’d wager that a lot of motion detection systems are not really necessary, and in other cases the VMS motion detection is simply not as good as the edge-based detection, which in some tragic instances completely invalidates the system and renders it worthless, as people and objects magically teleport from one frame to the next.