On Multithreaded Code

Say you want to build a wall. You’ve poured a foundation, and now you need to lay some bricks. If it’s just you working, you can start from one end, and lay bricks one by one, from A to B, one layer at a time.

This is single threaded work.

Since it’s just you, there’s no risk of collision.

Now, you could also lay one brick at one end, then move to the other end, lay one there, move back again, and lay one next to the previous. The time spent moving from side to side is context switching. You are doing work, just not work that puts more bricks on the wall. Alas, the total time to completion is higher than if you had just worked from one end. This is multitasking on one CPU. It can be either preemptive (someone tells you to stop and do something else, much like a regular job) or cooperative (older versions of Windows used this), where you ask “I’m done, what now?” at regular intervals.
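The cooperative flavor is easy to sketch. Here is a minimal toy scheduler (names like `lay_bricks` are made up for illustration): each task is a generator that yields, which is its way of saying “I’m done, what now?”, and a round-robin loop decides who runs next. One worker, many tasks, bricks interleaving at both ends of the wall.

```python
# A toy cooperative scheduler: each "task" is a generator that yields
# control back at regular intervals; the scheduler runs them round-robin.
def lay_bricks(side, count, log):
    for i in range(count):
        log.append(f"{side}:{i}")   # lay one brick
        yield                       # cooperatively hand control back

def run(tasks):
    while tasks:
        task = tasks.pop(0)
        try:
            next(task)              # let the task do one unit of work
            tasks.append(task)      # not finished: back of the line
        except StopIteration:
            pass                    # task finished, drop it

log = []
run([lay_bricks("left", 3, log), lay_bricks("right", 3, log)])
print(log)  # → ['left:0', 'right:0', 'left:1', 'right:1', 'left:2', 'right:2']
```

Note that a task which never yields starves everyone else, which is exactly why cooperative multitasking fell out of favor for preemptive scheduling.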

Again, since it is just you, there’s little risk of messing things up (I’ve laid a few bricks… plenty of ways to mess things up alone).

If someone came by and saw the wall being built (assuming they couldn’t see you moving back and forth), it would look as if two people were working on it. The natural assumption would be that the wall would be done 2x faster. That person would be misled.

Now, say you trick your brother-in-law into helping you out, so now we have 2 CPUs or cores. It doesn’t really matter what you call them; in the old days it was 2 physical chips side by side, today it’s – usually – all inside one chip. So you both start at different ends, work towards the middle, and when you meet, you go back to your end and start a new layer. You could also have your brother-in-law mix the mortar, hand you bricks, etc.

But now you have to coordinate things. You have to keep track of what the hell that brother-in-law of yours is doing. To do this we have a bunch of tools at our disposal: mutexes, semaphores, and events. These are control mechanisms that need to be applied carefully.

For example, say you tell your brother-in-law to sit and wait until you’ve laid a brick before he can lay his. So you lay one while he is waiting, then you wait while he lays one, and then it’s back to you.

This may look like multithreading, and in a sense it is, but it is no faster than if you were doing the work alone. To an outsider, like before, it’s really hard to tell the difference. Clearly two people are working on the wall, but is it any faster than just one? That depends on how they work – on how the mutexes, semaphores, and events are applied.

In Windows, there’s a bunch of fakery going on behind the scenes. One common problem is that the Windows UI can usually only be updated from one thread. If you create a worker thread, and that thread needs to update a progress bar, you have to jump through some hoops to get that bar to update.

The hoops usually involve putting your command into a single-threaded queue and then waiting for the work to be done. So now all of the workers are waiting in line to fill out a sheet, instead of the sheets being handed to all the workers so they can fill them out in parallel.
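The shape of that pattern looks roughly like this sketch (using Python’s `queue.Queue` as a stand-in for the Windows message queue; the `ui_thread`/`worker` names and the list standing in for a progress bar are made up for illustration). Workers never touch the “UI” themselves; they post update commands, and one dedicated thread drains the queue:

```python
import threading, queue

# Marshaling work to a single "UI" thread: workers post update commands
# to a queue, and exactly one thread applies them, one at a time.
updates = queue.Queue()
progress = []  # stands in for the progress bar; touched by one thread only

def ui_thread():
    while True:
        cmd = updates.get()
        if cmd is None:          # sentinel: shut down
            break
        progress.append(cmd)     # apply the update on the "UI" thread

def worker(n):
    updates.put(f"worker {n}: done")  # ask the UI thread to update for us

ui = threading.Thread(target=ui_thread)
ui.start()
workers = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for w in workers: w.start()
for w in workers: w.join()
updates.put(None)
ui.join()
print(len(progress))  # → 4: all updates applied, by exactly one thread
```

This keeps the UI consistent, but notice the cost: however many workers you have, the updates are applied strictly one at a time.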

But this technique is used behind the scenes too. If, for example, you are not careful about how you design your COM objects (ancient technology, but useful nonetheless), then your carefully crafted multithreaded code might actually be single-threaded, and only appear multithreaded because you called CreateThread somewhere.

Some people tend to think that code that keeps the CPU load at a low percentage is always good. But you have to look at the output too. I had this shitty piece of code that ran at 25% CPU utilization and gave me 200 fps. But I had 4 cores, and I was just not able to get to 100%. Obviously, at 100% I should be able to get at least 600 fps. Sure enough, there was a long line at the office, and every worker was standing in it. Applying the mutexes correctly allowed me to use all of the CPU if I wanted to.
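One common fix for that “long line at the office” is to stop making every worker grab a single shared lock per unit of work. A minimal sketch of the idea (the names and the trivial work loop are made up for illustration): each worker tallies results locally, writing to its own slot, and the totals are combined once at the end. The answer is identical to the fully-locked version, but the single point of contention is gone:

```python
import threading

# Contention-free accumulation: each worker writes only to its own slot,
# so no lock is needed during the hot loop; totals are combined at the end.
N_WORKERS, N_ITEMS = 4, 100_000
totals = [0] * N_WORKERS

def worker(idx):
    local = 0
    for _ in range(N_ITEMS):
        local += 1          # do the work with no lock at all
    totals[idx] = local     # one write per worker, to a distinct slot

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()
print(sum(totals))  # → 400000, same result, no line at the office
```

(In CPython specifically, the GIL limits how much CPU-bound threads gain, but the locking pattern itself carries over directly to C++, Java, or anywhere else the “one shared mutex for everything” mistake shows up.)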

The same applied to Visual Studio: compilations took forever because of the way we were importing some files. After reorganizing the build sequence, we were able to max out all cores when compiling, significantly reducing compile time.

Multithreaded code can be fast, but you have to do it right, and you have to check your assumptions (this is almost always true).


4 thoughts on “On Multithreaded Code”

  1. cunhaw says:

    Nice article. What about having more threads than cores? Waste of time and cpu cycles?

  2. prescienta says:

    Not really a “waste”. Most desktop PCs run many processes; each process has at least one thread, and in most cases spawns a lot more than one, yet we don’t consider this a waste. In many cases, the threads sit idle, waiting for some event that kicks them into action.

    As a user of a PC, I’d prefer this architecture, because it means that I – the user – can kick off a long-running, costly task and still be able to send emails, read articles, etc., even if the result of the background task arrives later than if it had all the cores to itself.

    From a programmer’s perspective, you have to be careful too. Here’s an example: say we have 4 physical cores, and at idle the background processes are using <1% of the total CPU resources. Let’s say that we then start decoding a 30 fps H.264 stream. At 30 fps, we have a budget of about 33 ms to decode each frame.

    So we do a single-threaded version, and it turns out that this version can decode the video in, say, 10 ms per frame, using 1 core at 100%, while the others sit idle at 0%.

    Now, a clever programmer spends 3 months splitting the decoding process into multiple threads, so that it can run in parallel on all 4 cores. So now we can do the decoding in 5 ms, using all 4 cores at 100%.

    Was that an improvement?

    The answer is no, not really. While we increased decoding speed for 1 feed, we actually increased the cost per feed as well – from 10 core-ms per frame to 5 ms × 4 cores = 20 core-ms per frame – and since most of our users stream more than one feed at a time, we actually made the performance worse.

    • cunhaw says:

      Agreed. But to be a lot more specific, I was talking about opening 1 thread per client socket in a VMS architecture. Considering that a camera needs to open multiple sockets (RTSP for video streaming, RTSP for audio streaming, I/O push notifications, PTZ, etc.), you can easily have 5 or more open sockets per camera.

      In a 200-camera installation, you cannot afford to have 1,000 threads: a thread usually costs 2 MB of RAM, plus the context-switching overhead.

      That’s something the developer needs to think about too.
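The feed-capacity argument in the H.264 example above can be checked with some back-of-the-envelope arithmetic, using the numbers from the comment (30 fps budget, 10 ms single-threaded on 1 core, 5 ms on all 4 cores in parallel; the packing here is idealized, real decoders lose some more to scheduling):

```python
# Back-of-the-envelope feed capacity under a real-time 30 fps budget.
budget_ms = 1000 / 30                 # ≈33.3 ms of wall time per frame
cores = 4

single_core_ms = 10                   # 10 ms on 1 core = 10 core-ms/frame
parallel_core_ms = 5 * cores          # 5 ms on 4 cores = 20 core-ms/frame

# How many feeds fit in real time?
feeds_single = cores * int(budget_ms // single_core_ms)    # 3 per core
feeds_parallel = int((cores * budget_ms) // parallel_core_ms)

print(feeds_single, feeds_parallel)   # → 12 6
```

So the “2x faster” parallel decoder halves the number of feeds the box can serve, which is the point of the comment: per-feed latency went down, but total throughput went down with it.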
