When we design a surveillance system, we need to carefully consider how we allocate resources and distribute workloads. When you add a camera to an NVR, the most common approach is to reduce the camera to a fairly dumb “video transmitter” and let the server do the heavy lifting.
But even if the server is much, much more powerful than your humble IP camera, it is usually taxed with a lot of work. One of the tasks the server routinely carries out is what some folks call “motion detection”. The term is misleading, as the NVR is not really detecting motion at all. It is detecting “changes in the frame”, which could be noise, a shift in lighting, a transition from color to B/W, and so on, none of which is what we understand as “motion” at all. Analytics engines look at differences too, but they are truly looking for “motion” and not JUST changes.
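To make the distinction concrete, here is a minimal sketch of what NVR-side “change detection” typically amounts to. It assumes decoded grayscale frames as numpy arrays, and the threshold names and values are mine, picked for illustration:

```python
import numpy as np

def frame_changed(prev, curr, pixel_thresh=25, area_thresh=0.01):
    """Naive NVR-style 'motion detection': flag the frame if enough
    pixels differ from the previous frame. Note that sensor noise,
    a lighting shift, or a color-to-B/W transition trips this just
    as readily as real motion."""
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    changed = np.count_nonzero(diff > pixel_thresh)
    return changed / diff.size > area_thresh
```

Anything that pushes enough pixels past the thresholds fires, whether it is an intruder or a cloud passing over the sun.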
Looking for changes is usually “good enough”, and it does not need to be anything more than that. And if looking for “change” is what you need, then you really should let your camera do the work and free up the NVR to do more important things.
The decision to analyze frames for changes was originally motivated by storage problems. A common HDD in those days held 200–300 MB, a 640×480 frame was considered “high resolution”, and the format was always MJPEG. Naturally, the Axis 200+ could not deliver these crisp “HD” feeds at anywhere near 30 FPS; 3–5 FPS was usually all you could get. But storing this massive amount of data became a problem, so we decided to discard frames that were almost identical.
Naturally, as time passed we got higher resolutions and higher framerates, and suddenly we were able to do MPEG-4 encoding on a consumer device – in real time!!! MPEG-4 and H.264 actually look at two successive frames in much the same way we do on the NVR. The codec simply “throws away” the redundant information just as we do, except the codec throws away just the parts of the frame that are similar to the previous one, preserving only the changes – a much, much better way of doing things.
For the codec to figure out what to throw away, it must look at two successive frames. If they are very similar, it can throw away a lot; if they are very different, it needs to send almost all the pixels. On top of that, H.264 does a lot of other things before the video is sent across the network, among them discrete cosine transformation, quantization, and Huffman-style entropy encoding.
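As a rough illustration, here is the transform-and-quantize step on a single block. I am using the classic 8×8 DCT for clarity; H.264 actually uses a small integer transform, but the principle is the same, and the quantization step below is a made-up value for the example:

```python
import numpy as np
from scipy.fftpack import dct

def transform_and_quantize(block, q_step=16):
    """Take one 8x8 pixel block from the spatial domain to the
    frequency domain, then quantize. Quantization is where the
    'throwing away' happens: small coefficients collapse to zero
    and become cheap to entropy-code."""
    coeffs = dct(dct(block.astype(float), norm='ortho', axis=0),
                 norm='ortho', axis=1)
    return np.round(coeffs / q_step).astype(int)

# A flat (redundant) block quantizes down to almost nothing.
flat = np.full((8, 8), 128)
print(np.count_nonzero(transform_and_quantize(flat)))  # prints 1
```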
It does not seem like a stretch that the codec implementation could provide a number telling the camera how alike two frames are. In a primitive way it actually does: if the encoded frame is large in terms of bytes, we can deduce that the frames are very different; if it is small, they are very similar. Naturally, this is too crude: it would not work on CBR feeds, there is no windowing, and so on.
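Here is what that crude proxy could look like, assuming we can observe the byte size of each encoded frame as it arrives. The function name and parameters are mine, and the windowing here only papers over one of the objections above, so treat it as a sketch:

```python
from collections import deque

def size_spikes(frame_sizes, window=30, ratio=2.0):
    """Yield indices of frames whose encoded size jumps well above
    the recent average: a crude hint that the frame differs a lot
    from its predecessor. Useless on a CBR feed, where the encoder
    pads or starves frames to hold the bitrate constant."""
    recent = deque(maxlen=window)
    for i, size in enumerate(frame_sizes):
        if len(recent) == window and size > ratio * (sum(recent) / window):
            yield i
        recent.append(size)

# Example: 40 quiet frames, then one big one.
print(list(size_spikes([1200] * 40 + [9000])))  # prints [40]
```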
Nor does it seem totally unreasonable that the codec implementation could give the “difference parameter” for each macroblock (a small 16×16 pixel block). It is important to understand that the codec is already doing the computation; we are just asking to peek at the result. Furthermore, the codec is working on the crisp, uncompressed frames that have the highest level of fidelity, before any information has been thrown away.
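To see how little extra work the camera would be asked to do, here is roughly the per-macroblock computation an encoder's motion search already performs: a sum of absolute differences (SAD) against the previous frame. This is a sketch of the idea, not any particular chip's implementation:

```python
import numpy as np

def macroblock_sad(prev, curr, mb=16):
    """Sum of absolute differences per 16x16 macroblock, computed on
    the uncompressed frames. An encoder evaluates this (or something
    very close) anyway while deciding how to predict each block; the
    camera would merely expose the numbers instead of discarding them."""
    h = curr.shape[0] // mb * mb
    w = curr.shape[1] // mb * mb
    d = np.abs(curr[:h, :w].astype(np.int32) - prev[:h, :w].astype(np.int32))
    # Collapse each 16x16 tile to a single SAD value.
    return d.reshape(h // mb, mb, w // mb, mb).sum(axis=(1, 3))
```

The result is a small grid of numbers, one per macroblock, which is exactly the kind of “difference parameter” map a camera could hand over nearly for free.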
In naive implementations like the one I describe here, there is not a lot to be gained from working on the raw frames in the camera, but ask any analytics vendor whether they would prefer to work on the video BEFORE or AFTER compression and the answer will uniformly be the same: BEFORE compression. So while the benefit is not huge, it is not completely without merit.
To do the detection on the NVR, the NVR has to completely reverse the process: take the Huffman symbols and expand them back into quantized coefficients, dequantize, go from the frequency domain back to the spatial domain, and only then can it start examining the frames. You can then play all sorts of tricks – perhaps you only look at every Nth pixel, perhaps you don't look at every frame; perhaps you get a lot of noise from too-heavy compression, perhaps you don't. Every single trick lowers the “quality” of the detection. Perhaps the client doesn't care, even with severe degradation of the quality, and that's fine by me. I am focused on providing better, more efficient solutions and offering them to those who appreciate such things.
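For contrast, here is the NVR-side approach sketched with OpenCV (the RTSP URL is a placeholder, and the constants are mine). Notice that the tricks, skipping frames and sampling every Nth pixel, only kick in after the library has already paid the full cost of entropy-decoding, dequantizing, and inverse-transforming every frame it returns:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("rtsp://camera.example/stream")  # placeholder URL
SKIP, STRIDE, PIXEL_T, AREA_T = 5, 4, 25, 0.01
prev = None
frame_idx = 0

while True:
    ok, frame = cap.read()      # full H.264 decode happens here, every frame
    if not ok:
        break
    frame_idx += 1
    if frame_idx % SKIP:        # trick 1: only examine every 5th frame
        continue
    # trick 2: only examine every 4th pixel in each direction
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)[::STRIDE, ::STRIDE]
    if prev is not None:
        diff = np.abs(gray.astype(np.int16) - prev.astype(np.int16))
        if np.count_nonzero(diff > PIXEL_T) / diff.size > AREA_T:
            print(f"change at frame {frame_idx}")
    prev = gray
```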
The point is this: spending a lot of resources decoding an H.264 stream to get information that could have been gathered almost for free in the camera is not my idea of efficient resource allocation. It is like rejecting a free apple, only to ride 30 miles to the store to buy the exact same apple, except now it is slightly bruised from the trip to the store – AND it takes a lot of effort to unwrap it.
In time, an NVR will not need to do much; in fact, I expect an NVR to become very similar to a NAS: cheap, easy to replace, and very scalable. This will require that cameras become a little more advanced, but my experience tells me that progress doesn't just stop. We were amazed by 640×480 at 4 FPS when I started, and just as we laugh at that today, we will laugh at NVR-side change detection 10 years from now.
I suspect that a lot of cameras do not have the fine-grained control over the encoding process that is needed here. I assume they are using off-the-shelf H.264 encoders or reference designs offered by the chip manufacturers. For such cameras, there might not be a simple way to do on-board processing, and attempting it may jeopardize the performance of the camera – for those, you will have to spring for the expensive PCs.
Start preparing for the change 🙂