Clearly Too Slow (Postmortems) • Ben Bowen's Blog

A Performance Problem

In my game engine, the following code was called once every frame for every material in the game. We had an average of about 500 materials in most scenes; so on a high-end machine this one function was being called about 100,000 times a second. C# can definitely handle it (ask any high-scale web developer), but as you can imagine the method had to be fast (especially as we wanted to support average-to-low-end desktops):

(Don't spend too long looking at it- it's just here for demonstration :))

private void RenderCache_IterateMaterial(int materialIndex) {
	// Set up context variables
	KeyValuePair<Material, ModelInstanceManager.MIDArray> currentKVP = currentInstanceData[materialIndex];
	Material currentMaterial = currentKVP.Key;
	ModelInstanceManager.MIDArray currentMID = currentKVP.Value;

	// Skip this material if it or its shader are disposed
	if (currentMaterial.IsDisposed || currentMaterial.Shader.IsDisposed) return;

	// Skip this material if we're not using it
	bool inUse = false;
	for (int i = 0; i < currentMID.Length; ++i) {
		if (currentMID.Data[i].InUse) {
			inUse = true;
			break;
		}
	}
	if (!inUse) return;

	// Prepare shader according to material params, and switch to it or update it
	if (lastSetFragmentShader != currentMaterial.Shader || lastFrameNum != frameNum) {
		lastSetFragmentShader = currentMaterial.Shader;
		lastFrameNum = frameNum;
		QueueShaderSwitch(lastSetFragmentShader);
	}
	var queuedSRP = currentMaterial.FragmentShaderResourcePackage;
	if (lastSetFragmentShader == geomFSWithShadowSupport) {
		if (modifiedSRP == null) modifiedSRP = new ShaderResourcePackage();
		modifiedSRP.CopyFrom(queuedSRP);
		modifiedSRP.SetValue((ResourceViewBinding) lastSetFragmentShader.GetBindingByIdentifier("ShadowMap"), previousShadowBufferSRV);
		queuedSRP = modifiedSRP;
	}
	QueueShaderResourceUpdate(lastSetFragmentShader, queuedSRP);

	// Filter & sort
	if (materialFilteringWorkspace == null || materialFilteringWorkspace.Length < currentCache.NumModels) {
		materialFilteringWorkspace = new List<Transform>[currentCache.NumModels];
		for (int i = 0; i < materialFilteringWorkspace.Length; ++i) materialFilteringWorkspace[i] = new List<Transform>();
	}
	for (int i = 0; i < materialFilteringWorkspace.Length; ++i) materialFilteringWorkspace[i].Clear();

	SortByProximityToCamera(currentMID);
	uint numInstances = 0U;
	for (uint i = 0U; i < currentMID.Length; ++i) {
		ModelInstanceData curMID = sortedModelData[i];
		if (!curMID.InUse) continue;
		SceneLayer layer = currentSceneLayers[curMID.SceneLayerIndex];
		if (layer == null || !layer.GetRenderingEnabled() || !addedSceneLayers.Contains(layer)) continue;

		if (curMID.ModelIndex == __VEGG_MH.ModelIndex && currentCache.ID == __VEGG_MH.GeoCacheID) {
			int instanceIndex = 0;
			for (int j = 0; j < currentMID.Length; ++j) {
				if (currentMID.Data[j].Transform == curMID.Transform) {
					instanceIndex = j;
					break;
				}
			}
			Quaternion rot = Quaternion.IDENTITY;
			foreach (var kvp in __VEGG_MIH_ARR) {
				if (kvp.Key.InstanceIndex == instanceIndex) {
					rot = kvp.Value;
					break;
				}
			}
			materialFilteringWorkspace[curMID.ModelIndex].Add(curMID.Transform.RotateBy(rot));
		}
		else materialFilteringWorkspace[curMID.ModelIndex].Add(curMID.Transform);
		++numInstances;
	}

	// Concatenate & queue render commands
	if (instanceConcatWorkspace == null || instanceConcatWorkspace.Length < numInstances) {
		instanceConcatWorkspace = new Matrix[numInstances << 1]; // x2 so we don't create loads of garbage if the count keeps increasing by 1
	}

	uint instanceStartOffset = RenderCache_IterateMaterial_ConcatReserve(numInstances);
	uint nextWorkspaceIndex = 0;
	uint outVBStartIndex, outIBStartIndex, outVBCount, outIBCount;

	for (uint mI = 0U; mI < materialFilteringWorkspace.Length; ++mI) {
		List<Transform> filteredTransformList = materialFilteringWorkspace[mI];
		int numFilteredTransforms = filteredTransformList.Count;
		if (numFilteredTransforms == 0) continue;

		currentCache.GetModelBufferValues(mI, out outVBStartIndex, out outIBStartIndex, out outVBCount, out outIBCount);

		QueueRenderCommand(RenderCommand.DrawIndexedInstanced(
			(int) outVBStartIndex,
			outIBStartIndex,
			outIBCount,
			nextWorkspaceIndex + instanceStartOffset,
			(uint) numFilteredTransforms
		));

		for (int iI = 0; iI < numFilteredTransforms; ++iI) {
			if (mI == __EGGHACK_MH.ModelIndex && currentCache.ID == __EGGHACK_MH.GeoCacheID) {
				instanceConcatWorkspace[nextWorkspaceIndex++] = filteredTransformList[iI].RotateBy(__EGGHACK_ROT).AsMatrixTransposed;
			}
			else instanceConcatWorkspace[nextWorkspaceIndex++] = filteredTransformList[iI].AsMatrixTransposed;
		}
	}

	RenderCache_IterateMaterial_Concat(instanceConcatWorkspace, instanceStartOffset, numInstances);
}

View Code Fullscreen • "Material Iteration"
Unfortunately, at one point, profiling indicated that almost half of our CPU budget was being spent in this one method. Although I expected the method to certainly show up on the radar, I really needed to reduce that down to at most ~10%. In fact, one simple change got me almost all the way there! Notice the following line:

for (int i = 0; i < materialFilteringWorkspace.Length; ++i) materialFilteringWorkspace[i].Clear();

View Code Fullscreen • "Clearing Material Filtering Workspaces"
It might not look like much, but this was the cause of most of the slowdown! materialFilteringWorkspace is an array of List<T>s. To find out why this was giving us so much grief, let's look at the actual implementation of List<T>.Clear():

// Clears the contents of List.
public void Clear() {
	if (_size > 0)
	{
		Array.Clear(_items, 0, _size); // Don't need to doc this but we clear the elements so that the gc can reclaim the references.
		_size = 0;
	}
	_version++;
}

View Code Fullscreen • ".NET List<T> Clear"
(c) Microsoft .NET; https://referencesource.microsoft.com/#mscorlib/system/collections/generic/list.cs,ca7bce81a50b0aeb

The critical element here is that the inbuilt List<T> must clear the array that it wraps when we call .Clear(). As the in-code comment alludes to, the reason for this is to make sure that references to items we've placed in the list's internal backing array (usually via Add()) are no longer considered reachable to the garbage collector.

Unfortunately, an array zero/clear is not a zero-cost operation. For most everyday usages it is absolutely neglible, but our List<T> array 'materialFilteringWorkspace' was quite long usually (somewhere around 600 usually, sometimes over 1000) and although each list would usually only have a length of around <= 10 items; some would be >= 100.

This means that 100,000 times a second, we were zeroing the memory for at least 600 lists. It's at this point that List<T>.Clear() becomes a bit too slow. :)

The Solution

One real solution to this is to re-architect the way the code works. Obviously, clearing a list 60M times a second (and therefore zeroing array memory 60M times a second) isn't really a situation you want to be in too often anyway. However, sometimes part of being a solo engine/game programmer means knowing how to cut corners.

Luckily for me, the actual types that I'm storing in these lists are structs, and there's no danger of lingering references clogging up my memory. This meant there was a fast fix: I could just write my own List<T> stub replacement to drop-in:

public sealed class FastClearList<T> where T : struct {
	private T[] storage;
	private int counter;

	public int Count {
		get {
			return counter;
		}
	}

	public FastClearList() {
		storage = new T[4];
		counter = 0;
	}

	public void Add(T item) {
		if (counter >= storage.Length) {
			var newStorage = new T[storage.Length << 1];
			Array.Copy(storage, 0, newStorage, 0, storage.Length);
			storage = newStorage;
		}
		storage[counter++] = item;
	}

	[MethodImpl(MethodImplOptions.AggressiveInlining)]
	public void Clear() {
		counter = 0;
	}

	public T this[int index] {
		[MethodImpl(MethodImplOptions.AggressiveInlining)]
		get {
			return storage[index];
		}
		[MethodImpl(MethodImplOptions.AggressiveInlining)]
		set {
			storage[index] = value;
		}
	}
}

View Code Fullscreen • "Fast Clear List"
This is a totally minimal List<T> replacement (doesn't even have a Remove()!) with one key difference: Clear() doesn't clear the backing array. By replacing usages of List<T>s with FastClearList<T>s in the code above, I managed to reduce RenderCache_IterateMaterial's overhead almost entirely.

Finally, a point to note: Although I have the generic constraint where T : struct for my FastClearList<T>, struct types in C# can still themselves contain fields of managed types. So if you're wondering why FastClearList<T> isn't used more widely- that may be why (also because not many people will come across zeroing memory as a performance issue, I imagine!).

Table of Contents

Post Details

Recommended Posts

Postmortems - Clearly Too Slow

A Performance Problem

The Solution

Read Next: Postmortems - Tale of Two Casts