Ben Bowen's Blog [Xenoprimate] • Home / Blog • • About • • Subscribe •

Common Multithreading Mistakes in C# - II: Unnecessary Contention 

In this series, I'm going to be elucidating some common errors C# programmers make when working with multithreaded systems. I'll be exploring issues from a lower-level point-of-view, and not dabbling too much in higher-level abstractions and libraries (such as the TPL and similar); hopefully the points raised will be pertinent to anyone hoping to write concurrent or parallelized applications.

What is Contention? 

In multi-threaded code, contention occurs when two or more threads are attempting to access the same resource simultaneously (in other words, they are contending for access to the resource). In incorrectly-written code, this usually results in errors; like we explored in Part I with the unprotected access around a word-count dictionary.

However, even when the code is written in a thread-safe manner, contention is still something we want to avoid as much as possible. This is because contention almost always means a degradation in throughput: Whether it's done in program code or at the CPU level, synchronizing access to a shared resource adds overhead. One obvious example is the lock keyword, which forces all but one thread to wait for access to the critical (locked-over) scope.

Now, it might appear then that we must make a choice between a slow program or a bugged one when writing multi-threaded code. However, that's a false dichotomy. The best answer is to still write thread-safe code (as always), but instead to reduce the amount of contention! Let's explore how.

Working Locally 

By far the most important technique for reducing contention is to reduce the amount of time each thread spends reading or writing to shared memory. Let's revisit the word count calculator from Part I; except let's imagine that management has now requested that all words be turned to uppercase (so we don't count "hello" and "Hello" as two different words), have hyphens and apostrophes removed, and any trailing spaces trimmed. As well as typifying management's tendency to add features at the last minute, this will give us even more to parallelize!

Why did we need 'more' to parallelize? Well, on a sufficiently fast machine, any amount of work will be sufficiently fast single-threaded. In our case, that means that our word-count program as it stood may benefit from parallelization on a desktop or older machine, but the overhead of multithreading would become a burden on a high-powered workstation or server. One of the important lessons to learn when starting to multi-thread code is knowing when not to multi-thread.

That can include times where the actual work done is just a handful of CPU operations per item (such as our word-counter); or when the data has a strict sequence dependence (e.g. each value relies on the previous calculation).

By adding more string-manipulation work to be done for each word in our word list, we've ensured that there really is enough work to be done for each word in order to justify spreading that work across multiple cores.

So let's look back at where we left off in the end of Part I with our heavily over-contented ThreadDoWork method for our word count calculator, but with management's extra stipulations added in:

private static List<string> wordList;
private static List<string> curseWords;
private static int currentWordIndex;
private static Dictionary<string, int> wordCountDict;
private static readonly object wordCountCalculatorSyncObj = new object();

public static void CalculateWordCounts() {
	wordList = GetWordList();
	curseWords = GetCurseWords();
	currentWordIndex = 0;
	wordCountDict = new Dictionary<string, int>();

	Thread threadA = new Thread(ThreadDoWork);
	Thread threadB = new Thread(ThreadDoWork);
	Thread threadC = new Thread(ThreadDoWork);
	Thread threadD = new Thread(ThreadDoWork);
	threadA.Start();
	threadB.Start();
	threadC.Start();
	threadD.Start();
	threadA.Join();
	threadB.Join();
	threadC.Join();
	threadD.Join();
}

private static void ThreadDoWork() {
	bool atLeastOneWordRemaining;
	int thisWordIndex;
	lock (wordCountCalculatorSyncObj) {
		thisWordIndex = currentWordIndex;
		currentWordIndex = currentWordIndex + 1;
	}
	if (thisWordIndex >= wordList.Count) atLeastOneWordRemaining = false;
	else atLeastOneWordRemaining = true;

	while (atLeastOneWordRemaining) {
		string thisWord = wordList[thisWordIndex]
			.ToUpper()
			.Replace("-", String.Empty)
			.Replace("'", String.Empty)
			.Trim();

		if (curseWords.Contains(thisWord)) Console.WriteLine("Curse word detected!");

		lock (wordCountCalculatorSyncObj) {
			bool firstOccurrenceOfWord = !wordCountDict.ContainsKey(thisWord);
		
			if (firstOccurrenceOfWord) wordCountDict.Add(thisWord, 1);
			else wordCountDict[thisWord] = wordCountDict[thisWord] + 1;
			thisWordIndex = currentWordIndex;
			currentWordIndex = currentWordIndex + 1;
		}
		if (thisWordIndex >= wordList.Count) atLeastOneWordRemaining = false;
	}
}
View Code Fullscreen • "ThreadDoWork with Additional Work"
The biggest problem with this code is that the majority of it is spent inside the lock over wordCountCalculatorSyncObj. This means that the code performs worse than a single-threaded implementation (due to the added overhead that threading and lock contention adds to any application). However, this application is still ripe for parallelization, it's just that we're doing it wrong. The key to reducing contention in multithreaded code is to delegate work to be done locally on each thread. Rather than each thread vying for the lock (e.g. wordCountCalculatorSyncObj) every time they want to update one measly word's count, why not split the work beforehand into equal pieces beforehand and let each thread simply tally a 'local' count before finally updating the global one? Let's take a look at a rewritten implementation using these principles:

private struct ThreadWorkBlock {
	public readonly int StartingIndex;
	public readonly int Count;

	public ThreadWorkBlock(int startingIndex, int count) {
		StartingIndex = startingIndex;
		Count = count;
	}
}

private static void CalculateWordCounts() {
	wordList = GetWordList();
	curseWords = GetCurseWords();
	currentWordIndex = 0;
	wordCountDict = new Dictionary<string, int>();

	Thread threadA = new Thread(ThreadDoWork);
	Thread threadB = new Thread(ThreadDoWork);
	Thread threadC = new Thread(ThreadDoWork);
	Thread threadD = new Thread(ThreadDoWork);
	int numWords = wordList.Count;
	int quarterOfWords = numWords / 4;
	threadA.Start(new ThreadWorkBlock(quarterOfWords * 0, quarterOfWords));
	threadB.Start(new ThreadWorkBlock(quarterOfWords * 1, quarterOfWords));
	threadC.Start(new ThreadWorkBlock(quarterOfWords * 2, quarterOfWords));
	threadD.Start(new ThreadWorkBlock(quarterOfWords * 3, numWords - quarterOfWords * 3));
	threadA.Join();
	threadB.Join();
	threadC.Join();
	threadD.Join();
}

private static void ThreadDoWork(object threadStartParameter) {
	ThreadWorkBlock localWorkBlock = (ThreadWorkBlock) threadStartParameter;

	Dictionary<string, int> localWordCountDict = new Dictionary<string, int>();
	for (int i = localWorkBlock.StartingIndex; i < localWorkBlock.StartingIndex + localWorkBlock.Count; ++i) {
		string thisWord = wordList[i].ToUpper().Replace("-", String.Empty).Replace("'", String.Empty).Trim();
		bool firstLocalOccurrenceOfWord = !localWordCountDict.ContainsKey(thisWord);
		if (firstLocalOccurrenceOfWord) localWordCountDict.Add(thisWord, 1);
		else localWordCountDict[thisWord] = localWordCountDict[thisWord] + 1;
	}
	lock (wordCountCalculatorSyncObj) {
		foreach (var kvp in localWordCountDict) {
			bool firstGlobalOccurrenceOfWord = !wordCountDict.ContainsKey(kvp.Key);
			if (firstGlobalOccurrenceOfWord) wordCountDict.Add(kvp.Key, kvp.Value);
			else wordCountDict[kvp.Key] += kvp.Value;
		}
	}
}
View Code Fullscreen • "Local-Working Word Calculator"
In the code above, CalculateWordCounts() divides the work in to four equal parts (four threads for my four-core machine). This is done with a helper type called ThreadWorkBlock. One ThreadWorkBlock is passed as an argument to each thread's start function; each containing where in the wordList to start working from, and how many words to work on.

Each thread begins in ThreadDoWork() and does the usual word-counting stuff we've already seen- except using a local dictionary rather than the global/shared wordCountDict. This means that we don't need to use any sort of locking or synchronization for this part- we can just update localWordCountDict freely- but don't forget that there are four threads doing this simultaneously, each on a separate part of wordList! This is where the speedup happens- we're essentially tallying the list four times as fast!

Eventually though, each thread must update the global state; and that is where we still require a lock. However, by pre-calculating all of the string-manipulation and word counts in a local dictionary, we can reduce the amount of time spent inside the lock (therefore lowering the total contention of the algorithm) by turning it in to nothing much more than a simple copy of results from local to global.

Note about 'local'/'global': You may have heard the term 'local' before when referring to variables that are declared inside methods or even narrower scopes such as inside a for-loop (e.g. 'a local int' or similar). In C# particularly, local variables are those declared on a thread's stack rather than somewhere accessible more globally. In multi-threading parlence, the term local usually refers to the callstack pertinent/visible only to the thread currently executing some given code (the local thread). Of course, more than one thread may be executing a certain piece of code at the same time, but a local variable is one that is only visible to the thread that is using it- there just may be multiple 'instances' of that variable in different threads' stacks, one for each thread.

The usage of the term above in firstLocalOccurenceOfWord then should begin to make sense: The variable will be true when this is the first occurrence of a given word on the local thread, but it may not be the first occurrence of that word across all threads. It's a somewhat irrelevant detail at this point- but it is important to be explicit when working with threads, and to understand the significance of local variables: They can never be shared across threads and therefore will always be threadsafe.

Although the idea of callstacks are technically an implementation detail, they are ubiquitous in many programming languages including all implementations of C# and the CLR; and also are a fundamental building block of the threading implementation in most OSs. They are also where the terms 'Stack Overflow' and 'Stack Trace' come from. If you'd like to learn more about them in your spare time, I would recommend the Call Stack Wikipedia page and the following Youtube video: Call Stack.

There's still a huge block of contention at the end though where all threads will generally finish their workload at roughly the same time and 'queue up' around the wordCountCalculatorSyncObj. Interestingly, this can be helped by 'blocking up' (i.e. splitting in to chunks) the work even further on each thread. This helps because although each thread will most likely still reach the first wordCountCalculatorSyncObj lock simultaneously, after that their progress is likely to be staggered as they process their own workloads in chunks, returning to the lock at times when it's less likely to be heavily contended. Here's an example I wrote which exhibits a further 20% speedup over the more naive implementation:

private static void ThreadDoWorkSectioned(object threadStartParameter) {
	ThreadWorkBlock localWorkBlock = (ThreadWorkBlock) threadStartParameter;

	Dictionary<string, int> localWordCountDict = new Dictionary<string, int>();
	int countPerSection = localWorkBlock.Count / NUM_SECTIONS_PER_THREAD;
	for (int section = 0; section < NUM_SECTIONS_PER_THREAD; ++section) {
		int startingIndex = localWorkBlock.StartingIndex + countPerSection * section;
		int count = section < NUM_SECTIONS_PER_THREAD - 1 ? countPerSection : localWorkBlock.Count - countPerSection * (NUM_SECTIONS_PER_THREAD - 1);
		localWordCountDict.Clear();

		for (int i = startingIndex; i < startingIndex + count; ++i) {
			string thisWord = wordList[i].ToUpper().Replace("-", String.Empty).Replace("'", String.Empty).Trim();
			bool firstLocalOccurrenceOfWord = !localWordCountDict.ContainsKey(thisWord);
			if (firstLocalOccurrenceOfWord) localWordCountDict.Add(thisWord, 1);
			else localWordCountDict[thisWord] = localWordCountDict[thisWord] + 1;
		}
		lock (wordCountCalculatorSyncObj) {
			foreach (var kvp in localWordCountDict) {
				bool firstGlobalOccurrenceOfWord = !wordCountDict.ContainsKey(kvp.Key);
				if (firstGlobalOccurrenceOfWord) wordCountDict.Add(kvp.Key, kvp.Value);
				else wordCountDict[kvp.Key] += kvp.Value;
			}
		}
	}
}
View Code Fullscreen • "Chunked Word Counter"
By updating the global wordCountDict more frequently but for less time each instance, we morph the contention-pattern of access from being one big contended queue at the end of the algorithm to something more staggered and fluid. As is the theme for this post; less contention therefore means an increase of speed. In fact, this pattern of parallelization can be improved and generalized further in to a branch of algorithms called work stealing algorithms; where instead of threads being allocated work at the beginning, they 'steal' work off a global queue (often in chunks/blocks), process + update any global state, and then take the next. We will look further at work-stealing in Part VI however, as they integrate some more advanced topics that we're not ready to explore yet.

Buffering Data 

We can now look at another way of reducing contention: Buffering. In fact, what we just did was a kind of buffering, but now we can expand it to a more general approach. Let's move on to a completely new example of a producer/consumer paradigm.

In a producer/consumer relationship, one or more threads are "producing" data to be processed somehow, and the "consumer" threads are taking that data and doing said processing. Let's imagine a message queue where a single producer is taking data from an I/O stream and adding messages to the queue; and multiple consumer threads are taking those messages from the queue and dealing with them appropriately as fast as possible.

Let's take a look at a simple implementation of this concept:

private static void ProducerThreadEntry() {
	while (queueRunning) {
		byte[] data = ReceiveWholeMessageFromStream();
		IMessage message = DeserializeMessage(data);
		lock (producerConsumerLock) {
			messageQueue.Enqueue(message);	
			Monitor.Pulse(producerConsumerLock);
		}
	}
}
private static void ConsumerThreadEntry() {
	IMessage message;

	while (queueRunning) {
		lock (producerConsumerLock) {
			while (messageQueue.Count == 0) {
				if (!queueRunning) return;
				Monitor.Wait(producerConsumerLock);
			}
			message = messageQueue.Dequeue();
		}
		DispatchMessage(message);
	}
}
View Code Fullscreen • "Producer/Consumer Example"
In this simple implementation, the producer thread cycles around in a loop (in ProducerThreadEntry). The first thing it does on each iteration is wait until an entire message has been received over the I/O stream. Then, once the raw byte[] data has been returned, a new IMessage object is deserialized (by passing that data to a DeserializeMessage method). Then, the new message object is added to a messageQueue, and one consumer thread is "pulsed" (more on that in a second).

Conversely, there are multiple consumer threads, each starting in ConsumerThreadEntry. Each one "waits" until "pulsed" by the producer (indicating that a message has been added). At that point, a single consumer will take the enqueued message object from the queue, and dispatch it to the rest of our application.

If you've never seen Monitor.Wait and Monitor.Pulse don't fret: They are low-level methods that can be used for inter-thread signalling. One or more threads can use Wait to stop execution until another thread calls Pulse (which will "wake" one of the waiting threads) or PulseAll (which will wake all of them).

In real-life code, higher-level constructs such as AutoResetEvent and ManualResetEvent/ManualResetEventSlim should be used. However, Wait and Pulse are still used somewhat commonly, and also I will be touching upon them again in Part IV, so I wanted to introduce them here.

Furthermore, just as it's important to understand arrays in order to fully comprehend how something like a List is implemented; it can be important to understand the lower-level building blocks of higher-level threading constructs, in order to gain a fuller understanding.

However, this code may prove to be a poor implementation. The potential problem lies in the fact that we have multiple consumers all waiting for a single message before continuing on to the next iteration of the while-loop and re-entering the contended code. This could prove to be very wasteful if the producer and consumers are producing/consuming messages at roughly an equal rate- this is because rather than doing useful work, our multiple consumer threads must all jostle with each other, waiting one-at-a-time to take the next message out of the queue. Further contention is introduced by the fact that no consumer threads may advance while the producer is adding the next message to the queue.

We can ameliorate this by introducing local buffering. On the producer thread, we'll 'buffer up' a collection of messages locally, before writing them all to the message queue in one big group once we've acquired enough. For each consumer, we'll take a big chunk of messages at a time from the queue, and then dispatch them while outside the producerConsumerLock. Here's the code:

private static void ProducerThreadEntry() {
	IMessage[] localQueue = new IMessage[PRODUCER_BUFFER_SIZE];
	int localQueueIndex = 0;

	while (queueRunning) {
		while (localQueueIndex < PRODUCER_BUFFER_SIZE) {
			byte[] data = ReceiveWholeMessageFromStream();
			IMessage message = DeserializeMessage(data);
			localQueue[localQueueIndex++] = message;
		}
				
		lock (producerConsumerLock) {
			for (int i = 0; i < PRODUCER_BUFFER_SIZE; ++i) {
				messageQueue.Enqueue(localQueue[i]);
			}
			localQueueIndex = 0;
						
			Monitor.Pulse(producerConsumerLock);
		}
	}
}
private static void ConsumerThreadEntry() {
	IMessage[] localQueue = new IMessage[CONSUMER_BUFFER_SIZE];

	while (queueRunning) {
		lock (producerConsumerLock) {
			while (messageQueue.Count < CONSUMER_BUFFER_SIZE) {
				if (!queueRunning) return;
				Monitor.Wait(producerConsumerLock);
			}

			for (int i = 0; i < CONSUMER_BUFFER_SIZE; ++i) {
				localQueue[i] = messageQueue.Dequeue();
			}
		}

		for (int i = 0; i < CONSUMER_BUFFER_SIZE; ++i) {
			IMessage message = localQueue[i];
			DispatchMessage(message);
		}
	}
}
View Code Fullscreen • "Local Buffering Example"
In the right cases, this code can be much, much faster. This is because the amount of time that each thread spends inside the producerConsumerLock, proportionally, is greatly reduced. Most of the time for each method is spent either buffering up messages, or dispatching them- instead of entering and exiting the lock in a loop, adding/removing one message at a time.

It's worth noting that buffering in this way is not always better. In fact, if the producer and consumers are being filled at a much lower rate than they can be processed, you'll simply cause 'lumpy processing', where nothing happens until the buffers are filled, a short amount of time is spent processing them, and so on.

The example above is a great demonstration for learning the power of buffering to reduce contention in multithreaded code. However, since .NET version 4,there have been some built-in collections that make use of advanced techniques in order to provide a fast, thread-safe queue manipulation that also blocks the consumer threads while the producer fills the buffer.

When writing your own producer/consumer code, you should definitely consider these types (BlockingCollection, usually backed by a ConcurrentQueue) before writing your own bespoke implementation.

However, most real-life producer/consumer scenarious require more than just a standalone collection and some threads, and therefore the techniques discussed above can definitely still be put to good use in your multithreaded implementations- it will just be up to you to identify areas where contention may be excessive.

Alternatives To Locking? 

Some of you who have tackled threading problems before may be by this point thinking "well what about alternatives to locking?". Some of you may even have come across or heard of the Interlocked static class.

The Interlocked class allows us to write threadsafe operations on a single variable location without using locks. Here's an example showing two different approaches to safely incrementing a counter across multiple threads:

private static void AThreadStart() {
	while (true) {
		lock (lockObj) {
			if (counter < COUNTER_TARGET) counter++;
			else break;
		}
	}
}

private static void BThreadStart() {
	int prevValue = counter;
	while (prevValue < COUNTER_TARGET) {
		Interlocked.CompareExchange(ref counter, prevValue + 1, prevValue);
		prevValue = counter;
	}
}
View Code Fullscreen • "Interlocked Example"
Both functions do exactly the same thing- increment counter in a thread-safe manner up to COUNTER_TARGET, and then return. The first function does something you should be familiar with by now- locking over the read/write of counter so that only one thread at a time can access it. BThreadStart however uses a different paradigm. The Interlocked class's CompareExchange method lets us read-modify-write counter in a single operation, with no risk of any other threads interfering while said operation is executing. This requires CPU support (see the CMPXCHG op for example).

However, many people also believe that using lock-free constructs (such as the Interlocked class) automatically guarantees an improvement in performance over traditional locking methods. This is simply not true- in fact on my i5, with four threads, AThreadStart is a faster performing alternative to BThreadStart. This is because constructs like CompareExchange don't magically 'remove' contention, they simply push it lower down the chain-of-responsibility, on to the CPU.

I will be talking more about lock-free programming in Part VI; but for now it's sufficient to say that using something like Interlocked requires profiling and a deep understanding of what is happening on each level of the system (CPU, Kernel, Application). Also, as we'll explore in Part III, it's much easier to make threading mistakes when you take away locks and other higher-level constructs. So, in 99% of cases, I would take Eric Lippert's advice: "always lock every access to a field that can be shared across threads, even if you think that eliding the lock is 99.99% likely to be harmless. The nano-scale benefit is simply not worth the macro-scale costs of verifying that the elision is correct, and certainly not worth the risk that the code is incorrect now or might become incorrect in the future".

Summary 

In summary, when writing multithreaded code we must be careful to ensure that our threads are doing as much work as possible at all times. It may feel counter-intuitive, but well-parallelized code should be hitting 100% CPU usage (or close) on all active cores while processing. A vastly reduced value indicates excessive stopping and starting while threads co-ordinate with one another.

However, it's also worth remembering that in a lot of cases, the co-ordination mechanism chosen will be a tiny fraction of the overall CPU usage time. In fact, in my personal experience, I tend to find that the co-ordination mechanism used for a multithreaded implementation (be it locks, semaphores, spinlocks and membars, or whatever) either cause extreme slowdown or have almost completely no impact: It's very rare to see any middle-ground here (though not impossible). Applying the techniques discussed above then should be about taking care not to hit the 'extreme slowdown' path- and not about chasing an extra 5% or so performance benefit. As we'll see in the upcoming parts in fact, chasing that last 5% can be notoriously error-prone and even when correct, can often actually make things worse.

In the next installment we'll explore some assumptions people make about what is and is not thread-safe; and some examples of code that many incorrectly assume to be correct that can, in certain situations, give incorrect results.


Edit 14th Oct '16: Altered a sentence to remove ambiguity about the way locks work. Thanks to /u/philsredditaccount on Reddit.