Today, you can get object tracking running with OpenCV in just a few hours, and the same goes for face detection and YOLO. Object tracking and recognition is no longer "magic", and it doesn't require custom hardware; most coders can whip something together in a day or two that will run on a laptop. Naturally, the research behind these algorithms is the work of some extremely clever guys who, commendably, are sharing their knowledge with the world (the YOLO license is legendary).
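To give a sense of how little code that takes, here's a minimal single-object tracking sketch with OpenCV. The video file name is just a placeholder, and depending on your OpenCV build the tracker constructor may live under cv2.legacy instead:

```python
# Minimal single-object tracking sketch (requires opencv-contrib-python;
# on some builds the constructor is cv2.legacy.TrackerCSRT_create()).
import cv2

cap = cv2.VideoCapture("sample.mp4")   # placeholder video file
ok, frame = cap.read()

# Draw a box around the thing you want to follow.
bbox = cv2.selectROI("select object", frame, showCrosshair=False)
tracker = cv2.TrackerCSRT_create()
tracker.init(frame, bbox)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    found, box = tracker.update(frame)
    if found:
        x, y, w, h = [int(v) for v in box]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("tracking", frame)
    if cv2.waitKey(1) & 0xFF == 27:   # Esc quits
        break

cap.release()
cv2.destroyAllWindows()
```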
But there’s a catch.
During a test of YOLO, it showed me a couple of boxes: one around a face, where YOLO was about 51% certain it was looking at a person, and another around my sock, where it was 54% sure that it, too, was a person. Meanwhile, a second face in the frame wasn't identified at all.
It’s surprising and very cool that an algorithm can recognize a spoon on the table. But when the algorithm is more confident in my sock than in my face, and misses another face entirely, are you actually going to make tactical decisions in a security system based on it?
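Part of the problem is that those percentages are just numbers compared against an arbitrary cutoff. A typical demo boils down to something like the sketch below; the model and image paths are placeholders, and the layout assumed is the classic Darknet YOLOv3 output, one row per detection:

```python
# Sketch of the confidence filter most YOLO demos rely on (paths are placeholders).
import cv2
import numpy as np

CONF_THRESHOLD = 0.5   # the knob everything hinges on

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
frame = cv2.imread("frame.jpg")

blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

for output in outputs:
    for det in output:               # det = [cx, cy, w, h, objectness, class scores...]
        scores = det[5:]
        class_id = int(np.argmax(scores))
        confidence = float(scores[class_id])
        # 0.54 (the sock) passes and 0.49 would not; neither number tells
        # you whether the box actually contains a person.
        if confidence > CONF_THRESHOLD:
            print(class_id, confidence)
```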
Charlatans will always make egregious claims about what the technology can do, and gullible consumers and government agencies are being sold a dream that eventually turns out to be a nightmare.
Recently I saw a commercial where a “journalist” was interviewing a vendor about their analytics software (it wasn’t JH). Example footage showed a terrorist unpacking a gun and opening fire down the street, captured from your typical corner store in a Middle Eastern country. The video systems in these stores are almost always pretty awful: bad cameras and heavy compression.
The claim made in the advert was that their technology could identify the terrorist and determine his path through the city in a few hours. As a demonstration of how quick and easy this would be, they offered a canned demo of the photographer walking through the vendor’s offices.
I call bullshit!
-village fool
First of all, most of the cameras along the path are going to record footage at a quality similar to that corner-store video, which makes recognition a lot harder (useless/impossible?).
Second, if you’re not running object tracking while you record, you’ll need to process all of the recorded video after the fact. With potentially thousands of cameras, recorded on different equipment and in different formats, running tracking across that footage is going to take some time; a rough estimate follows below.
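To put rough numbers on it, here's a back-of-envelope estimate. Every figure in it is an assumption picked for illustration, not a measurement:

```python
# Back-of-envelope estimate of re-processing recorded footage.
# Every number here is an assumption, not a measurement.
cameras = 1000                # cameras along the suspected path
hours_per_camera = 6          # window you need to search on each one
fps = 15                      # typical recording frame rate
frames = cameras * hours_per_camera * 3600 * fps

detector_fps_per_gpu = 200    # optimistic detector throughput on one GPU
gpus = 10

seconds = frames / (detector_fps_per_gpu * gpus)
print(f"{frames:,} frames, roughly {seconds / 3600:.0f} hours of detection "
      f"on a {gpus}-GPU cluster, before any re-identification or tracking")
```

With those (charitable) assumptions you're looking at nearly two days of pure detection, which makes "a few hours" for an entire city sound optimistic.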
Tracking a single person walking down a well-lit hallway, with properly calibrated, high-quality cameras, is one thing. Doing it with low-resolution, heavily compressed video from a bad sensor, on a street full of movement, overlaps, and so on, is a totally different ballgame.
You don’t know anything about marketing!
-arbitrary marketing person, yelling at Morten
Sure, I understand that this sort of hyperbole is just how things are done in this business. You come up with something that sounds fantastic yet plausible to the uneducated user, and hope it makes someone buy your stuff. And if your magical tool doesn’t work, well, by then it’s probably too late, and who defines “works” anyway? If it can do the job 20% of the time, then it “works”, doesn’t it? Just like a car that can’t drive in the rain also “works”.
If you want to test this stuff, show up with real footage from your environment and demand a demo on that content (if the vendor/integrator can’t do it, they need to educate themselves!). Keep an eye on the CPU and GPU load, and ask whether this will run on 300 cameras in your mall or airport without having to buy 100 new PCs with 3 top-of-the-line GPUs in them; a crude way to check is sketched below.
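One crude way to sanity-check the load question during a demo is to time the analytics on your own footage and extrapolate. The detect() function here is a hypothetical stand-in for whatever pipeline the vendor is showing:

```python
# Time a detector on sample frames and estimate how many camera streams
# one machine can realistically sustain. detect() is a hypothetical stand-in
# for the vendor's analytics call.
import time

def streams_supported(detect, frames, per_camera_fps=12):
    start = time.perf_counter()
    for frame in frames:
        detect(frame)
    per_frame = (time.perf_counter() - start) / len(frames)
    total_fps = 1.0 / per_frame          # frames per second the machine can process
    return int(total_fps / per_camera_fps)
```

If that number comes out at, say, 8 streams per machine, the "300 cameras on your existing servers" pitch deserves a few follow-up questions.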
I’m not saying that it doesn’t ever work. I’m saying that my definition of “works” is probably more dogmatic than that of a lot of people in this industry.