25% CPU load is a bug

The full title of this article should be the substantially less catchy “on a multi-processor system, if you observe a process using a single whole CPU1 for an extended period of time, then, from that one observation, you can deduce that the process is buggy.” On a relatively common four CPU system, this will show as 25% CPU load. On an eight CPU system it would show as 12.5% CPU load. On a twenty-four physical core system with hyper-threading it would show as 2% CPU load.

Summary

Correctly operating processes fall into one of two categories. Either, they’re trying to deliver a result the user is actively waiting for, in which case they should be using all the CPUs that are available so they can deliver the result four (or eight, or forty-eight) times quicker; or, they’re doing background processing and trying to avoid loading the system, in which case, what was magical about using precisely one CPU. Well designed processes that are running for extended periods should thus be using either more or less than one whole CPU. Consequently, if you see a process that’s fully loading a single CPU for an extended period of time, it’s a bug.

Major OS vendors could use this observation to find buggy background processes that cause system responsiveness or battery-life issues.

Explanation

Despite the time for which multi-processor systems have been available, much software hasn’t caught up. Much software is written in a simple single-threaded manner. For software that isn’t doing very much, that’s fine. For software than needs to run for an extended period of time, it’s not.

Foreground processing

The easy case to argue is the case of software that’s producing a result that the user is actively waiting for. For example, suppose the software takes one CPU-minute to produce its result and the user has a forty-eight CPU system. If the user looks at their CPU monitoring application (say, Task Manager or top) and sees the system at 2% CPU load, then they’re perfectly entitled to ask why 98% of their system is being left idle. They paid good money for those CPUs. They’re completely correct here. Rope in the other CPUs: if you can go perfectly parallel, then you’d complete the user’s task in about one second. The user will be happy. Even if the user has only a four CPU system, then the task would take just fifteen seconds and you can give the user forty five seconds of their life back.

I don’t think anyone will disagree with the basic idea that if the user is waiting for a result it is desirable if all available resources can be brought to bear on the problem.

There is a difficulty that not all algorithms parallelise perfectly, and that’s true. There are two reasons for this:

The first reason you might give is that the algorithm is genuinely parallel but it has hit another bottleneck, such as memory2, disk or network. However, that cannot be the explanation of the process using precisely one CPU’s worth of load. What are the chances that the next bottleneck supplies data at exactly the right rate that one CPU can process it? What are the chances that it’s one CPU’s worth on all systems regardless of underlying technology? So, you can be sure that when you see a process sitting at exactly one CPU’s worth of load, then it’s because no attempt has been made to parallelise the algorithm.

You don’t have to get perfect parallelisation, even getting the CPU up to 40% on a four CPU processor would be an improvement.

The second reason is that the best known algorithm is a pure sequential algorithm. I’m going to argue that this is almost certainly still a bug. Firstly, in the real world, you’re very rarely solving a problem where the entirety of what you’re doing is one of the known hard-to-parallelise algorithms. Secondly, even if it is, perhaps it’s an artificial constraint. Maybe you need to look at the problem differently.

Some programs manage this very well. Microsoft Excel will quite happily use all the CPUs you have available when recalculating a spreadsheet.

Sometimes you can’t parallelise your process, but you can parallelise a group of processes. For example, your compiler may be single-threaded, but you might be able to use parallel Make when rebuilding your large project.

Background processing

The argument for background processing is a little more complex. For background processing you’re often trying to avoid unduly loading the system. Having multiple CPUs is one of the primary reasons modern systems often seem much more responsive than older systems.3 Background processing often doesn’t have an impact on the latency of interactive tasks. In turn, this is because many background tasks are single-threaded so that if such a task uses a few seconds, minutes or even hours of processing, not only is the bulk of the processing (75% of the CPUs on a four CPU system) still available, but the system doesn’t need to context switch from the existing task. This is particularly important if the foreground task just needed to do a short burst of processing (say, an editor responding to a key press).

This advantage comes from having a CPU sitting idle ready to take on a task at low latency. This idle CPU isn’t wasted, it’s being spent4 to guarantee system latency.

The question here is whether the idle CPUs were deliberately spent with a thought to the overall system performance or whether it’s an accident of implementation. That is, did someone sit down and think “for my long running background task not to have an impact on a four CPU system, I should leave three CPUs idle, not two, not one; and on a forty-eight CPU system, I should leave forty-seven CPUs idle, not forty-six, not twenty-four, not three, not one.” Or did someone say, “it doesn’t matter if I peg a CPU because all modern systems are multi-CPU so I won’t be hurting system performance.” Or, more likely, did someone not think about it at all. I think it’s unlikely that, if anyone did proper analysis, then they’d come to the conclusion that one CPU is the perfect amount for all systems.

I think with analysis, we’d see some cases where one CPU is too few. Are we really saying we don’t care how long this background activity takes? What if it takes forever? If that’s acceptable, why are we running this processing at all? On the other hand, if it does need to complete in some finite time, will one CPU be enough on all systems? Is there no benefit from completing more quickly? What if we are completing the task fast enough for user interaction (say, we’re running overnight), but we’re preventing the system from going to sleep and saving power.

There are also cases where one CPU is too many. CPU cycles aren’t the be-all and end-all of system performance impact. Running processing can put pressure on RAM and caches (memory caches and file caches), can put a load on shared devices (if your HDD is seeking to retrieve a file for the background process, then it can’t be seeking for a foreground process) and can lock shared resources (such as holding locks in a single-threaded part of the kernel). If you’re making heavy use of techniques that impact on other processes, you may need to get out of the way when the system is busy.5 You’ll then need to make sure you don’t fall into the too-little-CPU problem described in the previous paragraph, so at some point you may need to stop getting out of the way.

Short-lived processing

I’m not saying every single process should be multi-threaded. If your processing is short-lived, it’s perfectly reasonable to run single-threaded. You may end up at 25% CPU load but only for a short period of time. It’s specifically extended processing that’s the problem.

However, one day you may find your supposedly short-lived process sitting at 25% CPU load for an extended period of time. Assuming your original calculations were correct, that is, that the process is supposed to be short lived, restricting yourself to a single thread has stopped you breaking the rest of the system. At this point you know you have a bug: something has gone wrong, your program wasn’t supposed to take this long to run. The single-threaded nature of your processing was a mitigation of the bug. In comparison, if your process had run away with memory rather than CPU it probably would have consumed all the memory on your system and damaged other processes.

Miscellanea

I’m trying to keep this article short so that I don’t distract from the main point. This means I haven’t covered the whole topic. I just want to draw attention to the symptom. Knowing the problem is there is the first step towards fixing it. I suspect 90% of the time I see 25% CPU load on my system it’s not the complex foreground or background cases, it’s rogue short-lived processing.

Despite a little digression into some of the bigger traps, I’ve avoided talking about fixing this problem.

There’s clearly a class of processing between foreground and background where even though the user is actively waiting for the result of your process you want to leave the rest of the system responsive.

I’ve deliberately avoided specifying what constitutes an “extended period”. If you’re a user looking at a CPU monitoring tool your probably won’t notice the problem until it lasts tens of seconds (for foreground processing) or minutes (for background processing). Battery life probably won’t be affected until the period is many minutes. In contrast, if you’re investigating your own program and you know how long processing should take, then even a few seconds may be long enough to indicate a problem.6

I’ve seen a system show 25% CPU load but split between two processes (say, one at 10% and one at 15%). In that case one of the two processes was a service. The application was getting a service to do some work and sleeping waiting for a result, the service would spin up, do the work, process the result and go back to sleep. The CPU would ping-pong between the two processes. It’s an example of this same problem.

The main point of this article is CPU load but, as I touched on, there are other limits. Some operating systems make it easy to limit the amount of memory your task can consume, triggering a program crash if it’s exceeded. Unattended processes (background tasks, CGI scripts and so on) should probably use this. Short-lived tasks (as opposed to long-lived tasks with short-lived bursts of processing) could use a total run-time limit too (setrlimit, ulimit or Windows Task Scheduler elapsed time limits). This doesn’t have to be tightly placed on your current usage, just something that detects rogue tasks.

OS Telemetry

Up till now this has all been an interesting opinion piece aimed at program authors, but there is one other group that can take action.

I’ve seen a number of computer systems get laggy or spin up their fans for hours when they’re supposed to be idle. Investigation has always shown this to be background services, that normally do widely-spaced short bursts of processing, going rogue. I won’t name names.

It might not always seem it, but consumer grade computer system stability has improved massively over the past few decades. I’m going to assert that one of the reasons for this is telemetry about crashing programs.

However, programs crashing is not the only cause of end-user dissatisfaction. Laggy systems make the user miserable. Running fans aren’t merely noisy but are the sound of your battery life draining away: I’ve had laptops with eight hour battery life die before lunch because a background process went mad.

This suggests a new form of telemetry: monitoring for rogue background processes. For a major OS provider it should be easy to work out from observation which services, including third party services, are normally at very low CPU load and then flag when they go mad. If a service normally takes less than 1% CPU or completes its work in a handful of milliseconds each time it wakes up, then seeing it peg a CPU for minutes probably warrants investigation.

I would expect that, as with error reporting, the Pareto principle applies or is even exceeded. Clearing out the pure CPU issues should be the big win. Finding two tasks handing off to each other, memory hogs or indirect loading via disks can follow later.

1 I’m going to be a little vague on the terminology between CPUs, cores and hyper-threads (logical cores). For the purpose of this discussion, if one single-threaded CPU-bound process can consume ¹⁄_N^th of what shows as 100% total system processing load, then the system has N CPUs.

2 A memory bottleneck might show up as CPU load, at least on some measurements. This isn’t a reason to give up. Perhaps you need to look at how you’re using the caches or whether you can split the algorithm into multiple threads each with better cache coherency.

3 Mainstream consumer grade PCs went through this about a decade ago, mainstream smart phones did the same thing a little later. You may well remember the difference.

4 Note the difference between this and the foreground processing case. With foreground processing the user was annoyed that the CPUs they’d spent money on were not being used. The CPUs were wasted. In the background case the user is happy that the background processing is leaving some CPUs idle. The user spent money on having CPUs available to improve latency. The CPUs are being used for their intended purpose.

5 Although do make sure you only get out of the way of interactive (or otherwise high priority) tasks lest you fall foul of the what if two programs did this problem (or the variant where you leave some CPUs unused).

6 On the Random ASCII blog, Bruce Dawson talks about tracking down a hang in Chrome when writing an email. One of the symptoms was WmiPrvSE.exe unexpectedly hogging a single CPU.

Up to the welcome page.

Comments should be addressed to webmaster@pertinentdetail.org