It stands for “Real Time Streaming Protocol” and it is pretty much the de-facto protocol for IP security cameras. It’s based on a “pull” principle; anyone who wants to get the feed must ask for it first.
Part of the RTSP protocol describes how the camera and the client exchange information about how the camera sends its data. You might assume that the video is sent back via the same socket as the one used for the RTSP negotiation. This is not the case, at least not by default.
So, in the bog-standard usage, the client will have to set up ports that it will use to receive the data from the camera. At this point, you could say that the client actually becomes a server, as it is now listening on two different ports (one for RTP, one for RTCP). If you were to capture the communication, you might see something like this:
C->S:
SETUP rtsp://example.com/media.mp4/streamid=0 RTSP/1.0
CSeq: 3
Transport: RTP/AVP;unicast;client_port=8000-8001
S->C:
RTSP/1.0 200 OK
CSeq: 3
Transport: RTP/AVP;unicast;client_port=8000-8001;server_port=9000-9001;ssrc=1234ABCD
Session: 12345678
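To make the roles concrete, here is a rough Python sketch of the client side of that exchange. The camera address and stream URL are just the placeholders from the trace above; a real client would also send OPTIONS/DESCRIBE first, parse the SDP and handle authentication.

```python
import socket

CAMERA_HOST = "example.com"  # placeholder, as in the trace above

# The client opens two local UDP ports *before* asking for the stream:
# an even port for RTP and the next odd port for RTCP.
rtp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rtcp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rtp_sock.bind(("", 8000))
rtcp_sock.bind(("", 8001))

# The RTSP negotiation itself happens over a TCP connection (port 554 by default).
rtsp = socket.create_connection((CAMERA_HOST, 554))
request = (
    "SETUP rtsp://example.com/media.mp4/streamid=0 RTSP/1.0\r\n"
    "CSeq: 3\r\n"
    "Transport: RTP/AVP;unicast;client_port=8000-8001\r\n"
    "\r\n"
)
rtsp.sendall(request.encode("ascii"))
print(rtsp.recv(4096).decode("ascii"))  # expect the "200 OK" reply with server_port and Session

# After a PLAY request, the camera starts firing UDP packets at port 8000
# (RTP, the video) and 8001 (RTCP, statistics), and rtp_sock.recvfrom() will see them.
```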
If both devices are on the same LAN (and you tolerate the client app opening two new ports in the firewall of the PC), the camera will start sending UDP packets to those two ports. The camera has no idea if the packets arrive or not (it’s UDP), so it’ll just keep spewing those packets until the client tears down the session or the camera stops receiving keep-alive messages (usually just a dummy request via the RTSP channel).
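That keep-alive is trivial. A rough sketch is below; the GET_PARAMETER method and the 30-second interval are assumptions (many cameras accept a plain OPTIONS just as happily, and the camera may advertise its actual timeout in the Session header).

```python
import socket
import time

def keep_alive(rtsp_sock: socket.socket, session_id: str, interval: int = 30) -> None:
    """Send a periodic 'dummy request' over the RTSP channel so the camera
    keeps the session (and the UDP stream) alive."""
    cseq = 100  # arbitrary starting sequence number for this sketch
    while True:
        request = (
            "GET_PARAMETER rtsp://example.com/media.mp4 RTSP/1.0\r\n"
            f"CSeq: {cseq}\r\n"
            f"Session: {session_id}\r\n"
            "\r\n"
        )
        rtsp_sock.sendall(request.encode("ascii"))
        rtsp_sock.recv(4096)  # discard the reply, we only care that the session stays alive
        cseq += 1
        time.sleep(interval)
```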
But what if they’re not on the same network?
My VLC player on my home network may open up ports 8000 and 8001 locally on my PC, but the firewall in my router has no idea this happened (there are rare exceptions to this). So the VLC player says “hey, camera out there on the internet, just send video to me on port 8000”, but that isn’t going to work because my external firewall will filter out those packets.
To solve this issue, we have RTSP over TCP.
RTSP over TCP lets the client initiate all the connections; in fact, the camera simply sends the video back over the same TCP connection that carries the RTSP requests, interleaved with them. Most firewalls have no issue accepting data via a connection as long as it was initiated from the inside (UDP hole punching takes advantage of this). RTSP over HTTP is an additional layer to handle hysterical system admins who only tolerate “safe” HTTP traffic through their firewalls.
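In practice this is done with an “interleaved” Transport header (e.g. RTP/AVP/TCP;unicast;interleaved=0-1): the camera wraps each RTP/RTCP packet in a tiny frame consisting of a ‘$’ byte, a one-byte channel id and a two-byte length, and pushes it down the same TCP connection. A rough sketch of reading that framing:

```python
import socket
import struct

def read_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("camera closed the connection")
        buf += chunk
    return buf

def read_interleaved_packet(rtsp_sock: socket.socket) -> tuple[int, bytes]:
    header = read_exact(rtsp_sock, 4)
    if header[0:1] != b"$":
        # Not interleaved data; this is an RTSP reply (e.g. to a keep-alive),
        # which a real client would parse separately. Skipped in this sketch.
        raise ValueError("expected interleaved data")
    channel, length = struct.unpack("!BH", header[1:4])
    return channel, read_exact(rtsp_sock, length)  # channel 0 = RTP, 1 = RTCP here
```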
So, is that the only reason to use TCP?
Well… Hopefully, you know that UDP is unacknowledged; the sender has no idea if the receiver got the packet or not. This is useful for broad- and multicasting, where collecting acks from every receiver would be impossible. So the server is happily spamming the network, not a care in the world if the packets make it or not.
TCP has acknowledgement and retransmission; in other words, the sender knows if the receiver is getting the data and will retransmit if packets were dropped.
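A toy illustration of that difference (assuming nothing is listening on the port used here):

```python
import socket

udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"frame data", ("127.0.0.1", 9999))   # returns happily, the packet goes nowhere
print("UDP: sent, no idea if anyone heard it")

tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    tcp.connect(("127.0.0.1", 9999))              # TCP insists on a handshake first
except ConnectionRefusedError:
    print("TCP: refused, the sender knows immediately that nobody is listening")
```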
Now, imagine that we have two people sitting in a quiet meeting room, one of them reads a page of text to the other. The guy reading is the camera and the guy listening is the recorder (or VLC player).
So the reader starts reading “It was the best __ times, it was ___ worst of times“, and since this is UDP, the listener is not allowed to say “can you repeat that“. Instead, the listener simply has to make do. Since there’s not a lot of entropy in the message, we can make sense of the text even if we drop a few words here and there.
But imagine we have 10 people reading at the same time. Will the listener be able to make sense of what is being said? What about 100 readers? While this is a simplified model, this is what happens when you run every camera on UDP.
Using TCP, you kinda have the listener going “got it, got it, got it” as the reader works his way down the page. If the room is quiet, the listener will rarely have to say “can you repeat that”. In other words, the transmission will be just as fast as UDP.
If you have 10 readers, and some of them are not getting the “got it” message, they may decide to scale back and read a little more slowly (this is essentially TCP congestion control). In the end, though, the listener will have a verbatim copy of what was being read, even if there are 1000 readers.
Modern video codecs are extremely efficient; H.264 and H.265 throw away almost everything that is not needed (and then some). This means that dropped packets hurt a lot more: without those missing packets, all you get is a gray blur, because that is all the receiver heard when 100 idiots were yelling on top of each other.
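Since the redundancy is gone, the receiver can’t reconstruct what a lost packet contained; the best it can do is notice the hole. RTP makes that easy, because every packet carries a 16-bit sequence number in bytes 2–3 of its header. A rough sketch of counting the gaps (it ignores reordering, which a real tool must handle):

```python
def count_rtp_gaps(packets: list[bytes]) -> int:
    """Count missing sequence numbers in a list of raw RTP packets."""
    lost = 0
    previous = None
    for pkt in packets:
        seq = int.from_bytes(pkt[2:4], "big")  # RTP sequence number
        if previous is not None and seq != (previous + 1) % 65536:
            lost += (seq - previous - 1) % 65536
        previous = seq
    return lost
```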
So TCP solves the firewall issue, and in a “quiet” environment it is just as efficient as UDP. In a noisy environment, it will slow things down because of retransmissions, but isn’t that a lot better than getting a blurry mess? Would it not be better if the cameras were able to adjust their settings when the receiver can’t keep up? Isn’t that better than just spewing a huge number of HD video packets into the void, never to be seen?
In my opinion, for IP surveillance installations, you should pick RTSP over TCP, and only switch to UDP if you don’t care about losing video.
As an experiment, I set up my phone to blast UDP packets to my server to determine the packet loss on LTE, assuming it would be massive. Turns out that LTE actually has retransmission of packets on the radio layer (at least for data packets); I don’t know if it does the same for voice data.
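If you want to repeat the experiment, it boils down to numbering the datagrams on one end and counting the holes on the other. A rough sketch; the host, port and packet count are placeholders:

```python
import socket
import sys

HOST, PORT, COUNT = "my-server.example", 5005, 10000

def sender() -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for i in range(COUNT):
        sock.sendto(i.to_bytes(4, "big"), (HOST, PORT))

def receiver() -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", PORT))
    sock.settimeout(10)  # stop once the sender has been quiet for a while
    received = set()
    try:
        while True:
            data, _ = sock.recvfrom(64)
            received.add(int.from_bytes(data[:4], "big"))
    except socket.timeout:
        pass
    print(f"loss: {COUNT - len(received)} of {COUNT} packets")

if __name__ == "__main__":
    sender() if sys.argv[1:] == ["send"] else receiver()
```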
The difference may be academic for a lot of installations as the network is actually pretty quiet, but for large/poorly architected solutions it may make a real difference.