Lies, Damn Lies and Video Analytics

Today, you can get object tracking running with OpenCV in just a few hours. The same applies to face detection and YOLO. Object tracking and recognition are no longer “magic”, nor do they require custom hardware. Most coders can whip something together in a day or two that will run on a laptop. Naturally, the research behind these algorithms is the work of some extremely clever guys who, commendably, are sharing their knowledge with the world (the YOLO license is legendary).

But there’s a catch.

During a test of YOLO, it would show me a couple of boxes: one around a face, where YOLO was about 51% certain it was looking at a person, and another around my sock, where it was 54% sure that was a person too. Meanwhile, another face in the frame wasn’t identified at all.

It’s surprising and very cool that an algorithm can recognize a spoon on the table. But when the algorithm thinks a sock is a person and misses an actual face, are you really going to make tactical decisions in a security system based on it?

Charlatans will always make egregious claims about what the technology can do, and gullible consumers and government agencies are being sold on a dream that eventually turns out to be a nightmare.

Recently I saw a commercial where a “journalist” was interviewing a vendor about their analytics software (it wasn’t JH). Example footage was shown of a terrorist unpacking a gun and opening fire down the street. This took place in your typical corner store in a Middle Eastern country. The video systems in these stores are almost always pretty awful: bad cameras, heavy compression.

[Image: typical low-quality, heavily compressed store surveillance footage]

The claim being made in the advert is that their technology would be able to identify the terrorist and determine his path through the city in a few hours. A canned demo of the photographer walking through the vendor’s offices was offered as a demonstration of how easily and quickly this could be done.

I call bullshit!

-village fool

First of all, most of the cameras on the path are going to be recording footage at a quality similar to what you see above. This makes recognition a lot harder (useless/impossible?).

Second, if you’re not running object tracking while you are recording, you’ll need to process all the recorded video. Considering that there might be thousands of cameras, recorded on different equipment and in different formats, the task of doing the tracking on the recorded video is going to take some time.

Tracking a single person walking down a well-lit hallway, with properly calibrated, high-quality cameras, is one thing. Doing it on a street camera with low resolution, heavily compressed video and a bad sensor, with lots of movement, overlaps, etc., is a totally different ballgame.

You don’t know anything about marketing!

-arbitrary marketing person, yelling at Morten

Sure, I understand that this sort of hyperbole is just how things are done in this business. You come up with things that sound fantastic and plausible to the uneducated user, and hope that it makes someone buy your stuff. And if your magical tool doesn’t work, then it’s probably too late, and who defines “works” anyway? If it can do it 20% of the time, then it “works”, doesn’t it? Like a car that can’t drive in the rain also “works”.

If you want to test this stuff, show up with real footage from your environment, and demand a demo on that content (if the vendor/integrator can’t do it, they need to educate themselves!). Keep an eye on the CPU and GPU load, and ask if this will run on 300 cameras in your mall/airport without having to buy 100 new PCs with 3 top-of-the-line GPUs in them.

I’m not saying that it never works. I’m saying that my definition of “works” is probably more dogmatic than that of a lot of people in this industry.

 

HDR and Low Light

In the early days, the only thing that mattered was pixel count. The more pixels you could cram onto a sensor, the better. Some people noticed that higher pixel counts would decrease performance in low-light scenes (for the same sensor size), and we got to a point where you’d have ridiculously high resolutions on extremely small sensors, but you’d need to be a few miles from the sun in order to get good, sharp footage.

There are all sorts of software tricks you can employ to improve the appearance of the image to the casual observer, but you can’t conjure up data that just isn’t there. One trick is to use a very long exposure, but that causes moving objects to blur; another is noise reduction, but that removes actual detail along with the noise.

So what happened that caused the cameras to suddenly improve quite dramatically over the last couple of years? The sensor guys came up with back-illuminated sensors. Basically, the sensors got a hell of a lot better, and now we’re reaping the benefits.

[Image: back-illuminated sensor]

XDA-Developers has a great article on the Sony IMX378 explaining back-illuminated sensors and how HDR is achieved. And of course, there’s always Wikipedia.

Competition is a wonderful thing.

 

 

P2P

As with IP cameras, one of the IoT challenges is how to get your controlling device (typically a phone) to talk to the IoT device in a way that does not require opening up inbound ports on your firewall.

All communication is peer to peer, so the term, when used in the context of IoT devices, is perhaps a little misleading; after all, an exposed camera sending a video stream to a phone somewhere is also “peer to peer”. Instead, P2P might be translated to “send data from A to B, even if both A and B are behind firewalls, using a middleman C” (what the hell is up with all the A, B, C these days?).

On a technical level, the P2P cameras use something called UDP hole punching, which sounds a bit ominous, but there’s really nothing sneaky about it. What happens is that A connects to C, so that C now knows the external IP address of A. Likewise, B also connects to C, and now C knows the external IP addresses of both A and B.

This middleman now passes A’s address to B, and B’s address to A. The next step is for A to fire a volley of UDP packets towards B, while B does the same towards A.

The firewall on A’s side sees a bunch of packets travel to B’s address, and when B’s packets arrive, the firewall thinks that the UDP packets are replies to the packets that were sent from A and lets them through.

You could accomplish the same thing by having A go to “whatsmyip.com” and email the result to B, while B does the same, and then have both sides run scripts that fire UDP packets at each other; a STUN server simply automates this process.
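To make the mechanics concrete, here is a minimal sketch of the hole-punching step in Node.js. It assumes the middleman C has already done its job, i.e. each side has somehow learned the other’s public IP and port (here they are simply passed on the command line); everything else about the example is illustrative, not any vendor’s actual implementation.

```javascript
// Minimal UDP hole-punching sketch (Node.js). Assumes the middleman "C" has
// already told each peer the other's public IP and port - here they are just
// passed on the command line: node punch.js <localPort> <peerIp> <peerPort>
const dgram = require('dgram');

const [localPort, peerIp, peerPort] = process.argv.slice(2);
const socket = dgram.createSocket('udp4');

socket.on('message', (msg, rinfo) => {
  // Once the peer's packets start arriving, the "hole" is open in both directions.
  console.log(`got "${msg}" from ${rinfo.address}:${rinfo.port}`);
});

socket.bind(Number(localPort), () => {
  // Fire a volley of UDP packets at the peer. The first few may be dropped by
  // the peer's firewall, but they teach our own firewall to treat the peer's
  // incoming packets as replies and let them through.
  const volley = setInterval(() => {
    socket.send('knock knock', Number(peerPort), peerIp);
  }, 500);
  setTimeout(() => clearInterval(volley), 10000); // stop knocking after 10 s
});
```

In a real P2P camera setup, the rendezvous server handles the address exchange automatically, and both sides typically keep sending periodic keep-alives so the firewall’s UDP mapping doesn’t expire.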

But who controls this “middleman”? Ideally, you’d be in charge of it; you’d be able to specify your own STUN-type server in the camera interface, so that you have full control of all links in the chain. In time, perhaps the camera vendors will release a protocol description and open source modules so that you can host your own middleman.

The problem might be that you bought a nice cheap camera on the grey market. The camera is intended for the Chinese market, but comes with a “modded” firmware that enables English menus and so on. This is obviously risky. Updating a modded firmware may be impossible, or may brick the camera, and the manufacturer may be less inclined to support devices that have been modded. You get what you pay for, so to speak (and this blog is free!).

The modder is selling the cameras in western markets, but the STUN server is still pointing to a server in China. This makes sense if you are a Chinese user, but it may seem very strange that your camera “calls home” to a server in China. A non-modded camera might do the same, simply because it is cheaper for the manufacturer to run the STUN service there, and it allows the government to eavesdrop on the traffic. If you are Chinese (I am not), you could argue that you don’t trust Amazon, Microsoft or Google because they might work with the NSA. Therefore, using your own server would be preferred.

Apart from the STUN functionality, the camera may follow commands that are sent from B to C to A. This puts a lot of responsibility in the hands of the guys maintaining this server. If it is breached, a lot of cameras will be vulnerable.

Depending on the end user, P2P may not be appropriate at all. For some users, though, the cost of a breach is small compared to the hassle of installing a fully secure system, and the convenience might be worth the risk.

While yours truly has abandoned all attempts to appear professional over the years, the truth is that most big installations have their shit together. Unfortunately, the volume of DIYers and amateurish installers who don’t really know what they are doing is much bigger (in terms of headcount, not commercial volume), and if there’s one thing we all want to do, it’s to blame someone else.

Caveat Emptor.

[Image: “this is fine” meme]

 

RAM Buffers

About 13 years ago, we had a roundtable discussion about using RAM for the pre-buffering of surveillance video. I was against it. Coding-wise, it would make things more complicated (we’d essentially have two databases), and the desire to support everything, everywhere, at any time made this a giant can of worms that I was not too happy to open. At the time, physical RAM was limited, and chances were that the OS would decide to push your RAM buffer to the swap file, causing severe degradation of performance. Worst of all, it would not be deterministic when things got swapped out, so all things considered, I said nay.

As systems grew from the 25 cameras that were the maximum supported on the flagship platform (called XXV), to 64 and above, we started seeing severe bottlenecks in disk IO. Basically, since pre-buffering was enabled by default, every single camera would pass through the disk IO subsystem, only for the data to be deleted 5 or 10 seconds later. A quick fix was to disable pre-buffering entirely, which would help enormously if the system only recorded on event, and the events were not correlated across many cameras.

However, RAM buffering was recently added to the Milestone recorders, which makes sense now that you have 64-bit OSes with massive amounts of RAM.

I always considered “record on event” a bit of a compromise. It came about because people were annoyed with the way the system would trigger when someone passed through a door. Instead of the clip starting with the door closed, the door would usually already be 20% open by the time motion detection triggered, so the beginning of the door opening would be missing.

A pre-buffer was a simple fix, but some caveats came up: systems that were set up to record on motion would often record all through the night, due to noise in the images. If the system also triggered notifications, the user would often turn down the motion detection sensitivity until the false alarms disappeared. This had the unfortunate side effect of making the system too insensitive to properly detect motion in daylight, and thus you’d get missing video, people and cars “teleporting” all over the place, and so on. Quite often the user would not realize the mistake until an incident actually did occur, and by then it’s too late.

Another issue is that the video requires a lot more bandwidth when there is a lot of noise in the scene. This meant that at night, all the cameras would trigger motion at the same time, and the video would take up the maximum bandwidth allocated.

[Graph: video bandwidth over several days. The high-bandwidth areas are night-time recordings.]
[Graph: the same data, zoomed to 24 hours.]

Notice that the graph above reaches the bandwidth limit set in the camera’s configuration, and then seemingly drops through the night. This is because the camera switches to black and white, which requires less bandwidth. Then, in the morning, you see a spike as the camera switches back to color mode, after which the bandwidth drops off dramatically during the day.

Sticking this in a RAM-based pre-buffer won’t help. You’ll be recording noise all through the night, from just about every camera in your system, completely bypassing the RAM buffer. So you’ll see a large number of channels trying to record high-bandwidth video – which is the worst-case scenario.

Now, you may have the best server-side motion detection available in the industry, but what good does it do if the video is so grainy you can’t identify anyone in it (it’s a human – sure – but which human?).

During the day (or in well-lit areas), the RAM buffer will help: most of the time, the video will be sent over the network, reside in RAM for 5, 10 or 30 seconds, and then be deleted, never to be seen again – ever. This puts zero load on disk IO and is basically the way you should do this kind of thing.
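As a concrete illustration, here is a minimal sketch of what such a pre-buffer might look like. The class name, the 10-second default and the callback names are my own inventions for this example, not Milestone’s (or anyone else’s) actual implementation: frames go into a queue in RAM, anything older than the pre-buffer window is dropped, and only when an event fires does the buffered video get handed to the recorder.

```javascript
// Sketch of a RAM-based pre-buffer (illustrative only, not any vendor's code).
class PreBuffer {
  constructor(windowMs = 10000) {
    this.windowMs = windowMs; // how much video to keep in RAM
    this.frames = [];         // { timestamp, data } in arrival order
  }

  add(frame) {
    this.frames.push(frame);
    // Drop everything older than the pre-buffer window. During quiet periods
    // this is all that ever happens, and the disk is never touched.
    const cutoff = frame.timestamp - this.windowMs;
    while (this.frames.length && this.frames[0].timestamp < cutoff) {
      this.frames.shift();
    }
  }

  // On an event, hand the buffered frames to the recorder, so the clip
  // starts *before* the trigger (e.g. with the door still closed).
  flush(writeToDisk) {
    this.frames.forEach(writeToDisk);
    this.frames = [];
  }
}

// Usage sketch (onFrame/onMotionEvent/recorder are hypothetical hooks):
// const buf = new PreBuffer(10000);
// onFrame(frame => buf.add(frame));
// onMotionEvent(() => buf.flush(frame => recorder.write(frame)));
```

The point is simply that during quiet periods the disk is never touched, which is exactly what the disk IO subsystem wants.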

But this begs the question – do you really want to do that? You are putting a lot of faith in the system’s ability to determine what might be interesting now, and possibly later, and in your own ability to configure the system correctly. It’s very easy to see when the system is creating false alarms; it is something entirely different to determine whether it missed something. The first problem is annoying, the latter makes your system useless.

My preference is to record everything for 1-2-3 days, and rely on external sensors for detection, which then also determine what to keep in long-term storage. This way, I have a nice window to go back and review the video if something did happen, and can then manually mark the pre-buffer for long-term storage.

“Motion detection” does provide some meta-information that can be used later when I manually review the video, but relying 100% on it to determine when and what to record makes me a little uneasy.

But all else being equal, it is an improvement…

 

TileMill and Ocularis

A long, long time ago, I discovered TileMill. It’s an app that lets you import GIS data, style the map and create a tile-pyramid, much like the tile pyramids used in Ocularis for maps.

[Image: TileMill]

There are 2 ways to export the map:

  • Huge JPEG or PNG
  • MBTiles format

So far, the only supported mechanism for getting maps into Ocularis is via a huge image, which is then decomposed into a tile pyramid.

Ocularis reads the map tiles the same way Google Maps (and most other mapping apps) reads the tiles. It simply asks for the tile at x,y,z and the server then returns the tile at that location.
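For the curious, the request pattern is usually nothing more than a URL template over a fixed grid; here is a small sketch of the idea (the URL below is just an example, not Ocularis’ or Google’s actual scheme):

```javascript
// "Slippy map" style tile addressing: at zoom level z the map is a grid of
// 2^z x 2^z tiles, each typically 256x256 pixels.
const TILE_SIZE = 256;

// e.g. template = 'http://tiles.example.local/{z}/{x}/{y}.png' (example URL)
function tileUrl(template, z, x, y) {
  return template.replace('{z}', z).replace('{x}', x).replace('{y}', y);
}

// Which tile covers a given pixel position (in that zoom level's pixel space)?
function tileForPixel(px, py) {
  return { x: Math.floor(px / TILE_SIZE), y: Math.floor(py / TILE_SIZE) };
}
```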

We’ve been able to import Google Maps tiles since 2010, but we never released it for a few reasons:

  • Buildings with multiple levels
  • Maps that are not geospatially accurate (subway maps for example)
  • Most maps in Ocularis are floor plans, so going through Google Maps is an unnecessary extra step
  • Reliance on an external server
  • Licensing
  • Feature creep

If the app relies on Google’s servers to provide the tiles, and your internet connection is slow, or perhaps goes offline, then you lose your mapping capability. To avoid this, we cache a lot of the tiles. This is very close to bulk downloading, which is not allowed. In fact, at one point I downloaded many thousands of tiles, which caused our IP to be blocked on Google Maps for 24 hours.

Using MBTiles

Over the weekend I brought back TileMill, and decided to take a look at the MBTiles format. It’s basically a SQLite DB file, with each tile stored as a BLOB. Very simple stuff, but how do I serve the individual tiles over HTTP so that Ocularis can use them?

Turns out, Node.js is the perfect tool for this sort of thing.

Creating an HTTP server is trivial, and opening a SQLite database file is just a couple of lines. So with less than 50 lines of code, I had made myself an MBTiles server that would work with Ocularis.

[Image: the tile server]

A few caveats: Ocularis has the Y axis pointing down, while MBTiles has the Y axis pointing up. Flipping the Y axis is simple. Ocularis has the highest-resolution layer at layer 0; MBTiles has that inverted, so the “world tile” is always at layer 0.
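For illustration, here is a minimal sketch of what such a server can look like (not my actual code). It uses the sqlite3 npm package; the tiles table and its column names come from the MBTiles spec, while the port and URL scheme are just examples. It includes the Y flip mentioned above; the zoom-level inversion is left to the client, since that is just another remapping of z.

```javascript
// Minimal MBTiles-over-HTTP server sketch (Node.js + the sqlite3 npm package).
// Serves tiles as /z/x/y.png, flipping Y between XYZ (Y down) and TMS (Y up).
const http = require('http');
const sqlite3 = require('sqlite3');

const db = new sqlite3.Database('map.mbtiles', sqlite3.OPEN_READONLY);

http.createServer((req, res) => {
  const match = req.url.match(/^\/(\d+)\/(\d+)\/(\d+)\.png$/);
  if (!match) {
    res.writeHead(404);
    return res.end('not found');
  }
  const z = Number(match[1]);
  const x = Number(match[2]);
  // MBTiles uses the TMS convention (Y up); most clients use XYZ (Y down):
  const y = Math.pow(2, z) - 1 - Number(match[3]);

  db.get(
    'SELECT tile_data FROM tiles WHERE zoom_level = ? AND tile_column = ? AND tile_row = ?',
    [z, x, y],
    (err, row) => {
      if (err || !row) {
        res.writeHead(404);
        return res.end('no tile');
      }
      res.writeHead(200, { 'Content-Type': 'image/png' });
      res.end(row.tile_data); // tile_data is stored as a BLOB (a Buffer here)
    }
  );
}).listen(8080, () => console.log('tile server on http://localhost:8080'));
```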

So with a few minor changes, this is what I have.

 

I think it would be trivial to add support for ESRI tile servers, but I don’t really know if this belongs in a VMS client. The question is whether the time would be better spent making it easy for the GIS guys to add video capabilities to their apps, rather than having the VMS client attempt to be a GIS planning tool.

 

Razor Software

It dawned on me that VMS software is, in many cases, equivalent to razors… or pizzas, or burgers. It’s reaching the point where it’s really hard to sell anything better than just “good enough”. Sure, there are 5-blade razors with gels to soothe the skin, and there are “gourmet” burgers and pizzas (I still prefer Wendy’s, and I’ve been to a lot of burger places). So unless you can sell some sort of “experience”, when peddling easily fungible goods you’ll be competing on price.

I was recently on vacation in Greece, and I noticed that the cash register software was (apparently) DOS based, and hadn’t been updated since 2004. The service was as fast as anywhere else, and the software had no issue looking up the price of an item as quickly as the operator could scan the barcode or manually key in the item. A local hardware store here in Copenhagen has a similar system, and the Luddite discount supermarket chain Aldi didn’t have barcode scanners until recently.

In the good old days (of scurvy and gout), the store would simply write down what was sold, and at what price, in a little black book. Back then, I doubt there were price tags on items, as it was the Quakers who introduced fixed prices.

[Image: Cash register by the National Cash Register Co., Dayton, Ohio, United States, 1915.]

The first register came about when saloon owner James Ritty got fed up with his employees stealing from him and invented a contraption he called “Ritty’s Incorruptible Cashier”. Soon after, the invention was sold to the company that later became NCR.

NCR went on to add bookkeeping features to the cash register, so not only would it prevent the employees from stealing (its intended purpose), it would also make your life easier as a shopkeeper.

Electronic cash registers were the next step. Clearly, going from a mechanical contraption to an electronic version made sense. It was cheaper, more reliable, faster and had more features.

Then came the DOS based POS systems that offered yet more flexibility, were easier to use and could interface with more printers and cash drawers.

Today, the hipster hangout coffee shop down the street is using an iPad with a credit card scanner attached.

As I see it, the embedded electronic register is equivalent to a DVR with some NTSC cameras. The DOS-based register corresponds to the Windows-based IP video recorders that are fairly common today. And finally, a plug’n’play video surveillance solution in the DropCam vein corresponds to the iPad POS software.

My thinking is this: a tool that solves a particular, well-defined problem won’t be upgraded, at least not until the new tool offers a substantially better ROI. Adding bookkeeping to the mechanical cash register made sense, but I doubt people replaced their old, working mechanical registers with the slightly improved variant. New customers will buy the feature-rich variant if the price points are similar, but it’s very hard to charge a substantial premium for features that really aren’t needed. And it’s even harder to convince people to replace a working system with something more expensive that only offers marginally useful features. I suspect that the reason I am seeing a lot of stores with DOS (or DOS-like) registers is that the functionality is simply good enough, and there’s no meaningful ROI in upgrading to, say, a Windows 10 or iOS version.

And that brings me to VMS systems. Clearly, the analog systems were a pain in the ass. Tapes that were worn out, rendering the system useless when something did happen. Terrible low-light performance, and so on. The DVR systems were substantially better, causing people to upgrade. And for a while, IP-based cameras were very, very expensive, but as prices came down, they started to offer a substantially better ROI than DVRs (in many respects, but not all).

The key point here is “good enough”: I don’t think we are quite there yet, but eventually we will arrive at a point where all VMSs/NVRs will be pretty much plug and play. It’ll be a piece of cake to replace a camera (why the hell is this still a challenge on some VMSs?), and hopefully no one is going to spend hours tweaking resolutions, messing around with VBR and CBR, or fiddling with “motion sensitivity” settings.

If I had my way, I’d let the “other guys” come up with new condiments and new ways to serve their turd-sandwiches, and instead offer cupcakes. Some people don’t like cupcakes, some prefer donuts, but I bet you that most people would pick a cupcake over any turd-sandwich. There are always people who complain, Monday-morning quarterbacks (some have made a career of it), who demand some weird condiment that only comes with turd-sandwich #2, and in some cases, sinking your teeth into a turd-sandwich is simply the only way to achieve some narrow goal.

I bet that once someone offers a cupcake, it’s game over for the turd-sandwich stores.

NVR Appliances Will Change The Landscape

In the good old days, if you were a VMS vendor, you’d offer two things: a little database engine, and a client to view video (live and recorded). In return, the camera guys would fret over sensor sizes and optics, but not worry about how video was stored and retrieved.

The core competency of the VMS guys was writing code that ran on a PC, while the camera guys would get their code to run on a small embedded platform.

What we are seeing now is that the lines are getting blurry. Milestone offers code that the camera guys can embed in their systems, and the camera guys now offer small databases embedded directly in the cameras, and obviously there’s a client to view the video too.

From an architectural point of view, the advances in camera hardware capabilities open up a whole host of interesting configurations. In a sense, the cameras are participating in a giant BYOB party, only instead of beer, the cameras bring processing power and storage. Previously, the host would have to buy more “beer” (storage and processing) as more “guests” (cameras) arrived. Now, the host just needs enough physical room (switches and power).

Cameras now have a sufficiently high level of sophistication that they can act almost completely autonomously. While my private setup does require a regular server and storage – somewhere – the camera pretty much takes care of itself. It does motion detection, and decides on its own when to store footage. When it does store something, it notifies me, and I can check it out from anywhere in the world. The amount of processing needed on the PC is minuscule.

So how do NVR appliances fit in?

A few years ago, the difference between a regular PC and an “appliance” was that the appliance came with Windows XP Embedded. That was pretty much it. The hardware and software were almost 100% identical to the traditional PC platform, but the cost was usually substantially higher (even though an XP Embedded license is actually cheaper).

Now, though, we are seeing storage manufacturers offer NAS boxes with little *nix kernels that can connect to IP cameras and store video. There’s usually a pretty cumbersome and somewhat slow web interface to access the video, but no enterprise-level management tool to determine who gets to see what, and when.

The advanced DIY user can buy one of these boxes, hook up a bunch of cheap Dahua cameras, and sleep like a baby, not worrying about Windows Update wreaking havoc during the night. As a cheapskate, I can live without the plethora of features and configuration options that a traditional VMS offers. It’s OK that it is a little bit slow and cumbersome to extract video from the system, because I only very, very rarely need to do so. I am not going for the very cheap (and total crap) solutions that Home Depot and Best Buy offer. Basically, if it doesn’t work, it’s too expensive.

Further up the pyramid, though, we have the medium-sized businesses: retail stores, factories, offices, places where you don’t DIY, but instead call a specialist. And this, I think, is where the NAS boxes might make a dent in the incumbent VMS vendors’ revenue stream.

The interests of the DIY’er and the specialist align, in the sense that both want a hassle-free solution. The awesome specialist is perfectly capable of buying the parts and building and configuring a PC, but honestly, once you tally up the cost and the frustration when things don’t work as expected, I think I’d pick the slightly more expensive but hassle-free solution. Even if it didn’t support as many different cameras as the hyper-flexible solution that I am used to. I do not see an advantage to being able to play Hearts on my video surveillance system.

The question then is: for how long will the specialist also sell a traditional VMS to go along with the NAS? How long before the UI of the NAS becomes so good that you really don’t need or want a VMS to go along with it? Some of the folks I’ve spoken to show the fancy VMS, but advise, and prod, the customer in the direction of a less open solution. Any of their employees can install and maintain (or replace) the appliance, but managing the VMS takes training, in some cases certification, and generally takes longer to deploy.

We are not there yet, but will we get there? I think so.