Defender's Quest: Valley of the Forgotten DX has always had lingering issues with performance, and I've finally fixed them. Our main impetus to massively improve performance was our PlayStation Vita port. The game had been out on PC already, and ran alright on Xbox One and PS4, if not perfectly. But unless we seriously upped our game, there was no way it would run on the Vita.
When a game is slow, internet commenters tend to blame the programming language or engine. It's true that languages like C# and Java incur more overhead than C and C++, and that tools like Unity have longstanding issues with e.g. garbage collection, but the real reason people reach for these explanations is that they're the software's most visible properties. The actual performance killers are just as likely to be stupid little details that have nothing to do with architecture.
0. Profiling Tools
There's only one real way to make a game faster -- profile it. Figure out where the computer is spending too much time, and make it spend less time to do the same thing, or even better -- make it spend no time by doing nothing at all.
The simplest profiling tool is just the Windows performance monitor:
This is actually a pretty versatile tool, and it's dead simple to run: Ctrl+Alt+Delete, open Task Manager, and hit the "Performance" tab. Make sure you're not running a bunch of other programs, and you should easily be able to spot CPU spikes and even memory leaks if you're watching carefully. It's low-tech, but it's the first step for finding slow spots, besides just playing the game and noticing when stuff is sluggish.
Defender's Quest is written in Haxe, a high-level language that compiles to other languages, my chief target being C++. This means that any tool that can profile C++ can profile my Haxe-generated C++ code. So when I want to dig into what's causing problems, I boot up Visual Studio's Performance explorer:
The various consoles also have their own performance profiling tools, which are pretty awesome, but due to NDAs I'm not allowed to tell you anything about them. If you have access to them, definitely use them!
Rather than write a crappy tutorial on how to use profiling tools like Performance Explorer, I'll just link to the official docs and get to the main event -- surprising things that yielded huge speedups, and how I found them!
1. Identifying the Problem
Performance is as much about perception as it is actual speed. Defender's Quest is a tower defense game that renders at 60 FPS, but with a variable game speed anywhere from 1/4x to 16x. Regardless of game speed, the simulation uses a fixed timestep of 60 updates per 1 second of 1x simulation time. So if you run the game at 16x, you're actually running the update logic 960 times per second. That's honestly a lot to ask of a game! But I'm the one who made this mode possible, and if it's slow, players aren't wrong to notice.
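The fixed-timestep-with-a-multiplier setup can be sketched like this (a hypothetical simplification, not the game's actual code -- all names here are made up):

```haxe
// Hypothetical sketch: fixed timestep with a variable speed multiplier.
// At 16x, 16 fixed updates run per rendered frame: 16 * 60 = 960 per second.
class Simulation {
    static inline var STEP:Float = 1 / 60; // one tick of 1x simulation time

    public var speed:Float = 1.0; // anywhere from 0.25 to 16.0
    var accumulator:Float = 0;    // fractional ticks owed to the simulation

    public function new() {}

    // Called once per rendered frame (60 FPS).
    public function advanceFrame():Void {
        accumulator += speed;
        while (accumulator >= 1) {
            update(STEP); // the fixed-size tick never changes, only how often it runs
            accumulator -= 1;
        }
    }

    function update(dt:Float):Void {
        // one fixed tick of game logic: targeting, collision, movement...
    }
}
```

The accumulator is what makes fractional speeds like 1/4x work: at 0.25, an update fires only every fourth frame, while at 16x the inner loop spins 16 times per frame.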
And then there's this level:
That's the final bonus battle, "Endless 2", AKA the bane of my existence. This screenshot is from the New Game+ mode, where enemies are not only a lot stronger, but also gain powers like regenerating health. A favorite player strategy here is to spec dragons with maximum roar (an AOE attack that stuns enemies), and follow it up with a row of knights specced for maximum knockback to push anything that gets past the dragons back into range. The cumulative effect is that you can keep an enormous group of monsters frozen in place indefinitely, well past the point you could survive if you had to actually kill them. Since you only have to reach waves to earn rewards and achievements, not kill them, this is a perfectly valid and frankly brilliant strategy, exactly the kind of thing I want to encourage.
Unfortunately, it's also a pathological case for performance, especially when players try to run it at 16x or 8x speed. Sure, only the hardest of the hardcore are going to try for the Endless 2 New Game+ Wave 100 achievement, but they also tend to be the ones who talk loudest about the game, so I'd like for them to be happy.
"I mean, it's just a 2D game with a bunch of sprites, what's so hard about that?"
Indeed. Let's find out.
2. Collision decision
Look at this screenshot:
See that donut shape around the ranger? That's her range -- note also the dead zone where she can't target things. Every class has its own range shape, and each defender can have a differently sized range, depending on boost level and individual stats. And every defender can in theory target any enemy on the board if it's in range, and vice versa, for certain enemy types. You can have up to 36 defenders on the map at any time (not including Azra, the main character), and there's no upper limit to the number of enemies. Every defender and enemy maintains a list of eligible targets based on range check calls every update step (minus sensible culling for those who can't possibly attack right now, etc).
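A donut-shaped range boils down to two squared-distance comparisons -- here's a sketch of what such a test might look like (hypothetical names; the game's actual check surely differs):

```haxe
// Hypothetical sketch of a donut ("annulus") range test.
// A target is in range if it's inside the outer radius but outside the
// inner dead zone. Comparing squared distances avoids a sqrt per check,
// which matters when this runs for every defender/enemy pair, every tick.
class RangeCheck {
    public static function inDonutRange(x:Float, y:Float, tx:Float, ty:Float,
            innerRadius:Float, outerRadius:Float):Bool {
        var dx = tx - x;
        var dy = ty - y;
        var distSq = dx * dx + dy * dy;
        return distSq <= outerRadius * outerRadius
            && distSq >= innerRadius * innerRadius;
    }
}
```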
Nowadays GPUs are really fast -- unless you're pushing things to the screaming bleeding edge, they can handle just about as many polygons as you want to throw at them. But CPUs, even fast ones, are extremely easy to bottleneck on simple subroutines, particularly ones that grow quadratically, like checking every object against every other object. This is why you can have a 2D game that is slower than a much prettier 3D one -- not because the programmer sucks (well, maybe that too, at least in my case), but principally because logic can sometimes be more expensive than drawing! It's not really a function of "how many things you've got on screen", so much as it is what those things are doing.
So let's dive in and speed up collision detection. For perspective, before optimization, collision detection took up ~50% of the main battle loop's CPU time. Afterwards it was less than 5%.
All About QuadTrees
The basic fix for slow collision detection is Space Partitioning -- and we'd already been using a decent QuadTree since day one. These basically divide up space efficiently so you can skip a bunch of unnecessary collision checks.
Every frame, you update the entire QuadTree to track where everyone is, and whenever an enemy or defender wants to target anything, it asks the QuadTree for a list of who's nearby. But the profiler said both of these operations were far slower than they had to be.
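The core idea of a QuadTree query, sketched very loosely (this is conceptual code, not the game's implementation -- `bounds`, `intersects`, and `contains` are assumed helper names):

```haxe
// Conceptual sketch: each node covers a rectangle of space. A query only
// recurses into children whose bounds overlap the search range, so whole
// regions of the map are skipped without checking a single object in them.
class QuadTreeSketch {
    var bounds:AABB;          // the region this node covers
    var points:Array<XY>;     // objects stored directly in this node
    var nw:QuadTreeSketch;    // four children, null until subdivided
    var ne:QuadTreeSketch;
    var sw:QuadTreeSketch;
    var se:QuadTreeSketch;

    public function queryRange(range:AABB, result:Array<XY>):Array<XY> {
        if (!bounds.intersects(range)) return result; // prune this whole subtree
        for (p in points) {
            if (range.contains(p.x, p.y)) result.push(p);
        }
        if (nw != null) { // subdivided? ask the children too
            nw.queryRange(range, result);
            ne.queryRange(range, result);
            sw.queryRange(range, result);
            se.queryRange(range, result);
        }
        return result;
    }
}
```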
What was wrong?
A lot of things, as it turned out.
Stringly Typed
Since I store both enemies and defenders in the same QuadTree, the caller needs to specify what it's asking for, and I was doing it like this:
var things:Array<XY> = _qtree.queryRange(zone.bounds, "e"); //"e" is for "enemy"
This is called Stringly Typed code, and it's bad for several reasons, not least that string comparison is much slower than integer comparison.
I rigged up some quick integer constants and changed it to this instead:
var things:Array<XY> = _qtree.queryRange(zone.bounds, QuadTree.ENEMY);
(Yeah, I probably should have used an Enum Abstract for maximum type safety but I was in a hurry and it got the job done.)
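For the curious, the enum abstract version might look something like this (a sketch, not the shipped code) -- you keep the raw integer's speed, but the compiler now rejects stray values:

```haxe
// Hypothetical sketch: an enum abstract over Int compiles down to plain
// integer comparisons, but the type checker only accepts these named values.
enum abstract QueryFilter(Int) {
    var ENEMY = 0;
    var DEFENDER = 1;
}

// Usage would look the same as the integer-constant version:
//   _qtree.queryRange(zone.bounds, QueryFilter.ENEMY);
// but passing a bare 2 or a string is now a compile error.
```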
This one change made a big difference because this function gets called all the time, recursively, every time anybody needs a new list of targets.
Array vs Vector
See this?
var things:Array<XY>
Haxe Arrays are quite similar to ActionScript and JS arrays in that they're a resizable collection of objects, but in Haxe they're strongly typed.
There is, however, another data structure that is more performant on static targets like cpp, which is haxe.ds.Vector. Haxe Vectors are basically the same as Arrays, except that you pick a fixed size when you create them.
Since my QuadTrees had a fixed capacity already, I swapped Arrays for Vectors for a noticeable performance increase.
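The swap looks roughly like this (a sketch; the capacity value and field names are hypothetical):

```haxe
import haxe.ds.Vector;

class QuadTreeNode {
    static inline var CAPACITY:Int = 4; // hypothetical fixed per-node capacity

    // Before: var points:Array<XY> = [];  -- resizable, reallocates as it grows.
    // After: size is fixed at creation, so the backing storage never moves.
    var points:Vector<XY> = new Vector(CAPACITY);
    var count:Int = 0; // track how many slots are actually in use

    public function new() {}
}
```

The trade-off is that you must track the fill count yourself, since a Vector's length is always its capacity -- a fair price when the capacity is known up front anyway.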
Ask only for what you need
Previously my queryRange function was returning a list of objects, XY instances. These held the x/y location of the referenced game object and its unique integer identifier (a lookup index in a master array). The requesting game object would take those XY's, extract the integer id to get its target, and then forget the rest.
So why was I passing all these XY object references around for each QuadTree node, recursively, up to 960 times per second? All I needed to return was a list of integer ids.
PROTIP: Integers are way faster to pass around than basically anything else!
Compared to the other fixes this one was rather unsophisticated, but the performance gains were still noticeable because it was in such a hot inner loop.
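The change amounts to something like this (a sketch with hypothetical field names like `p.id` and `p.filter`; the real signature surely differs):

```haxe
// Hypothetical sketch: the tree still stores XY objects internally,
// but callers only ever receive the integer ids they actually use.
function queryRange(range:AABB, filter:Int, result:Array<Int>):Array<Int> {
    for (p in points) {
        if (p != null && p.filter == filter && range.contains(p.x, p.y)) {
            result.push(p.id); // push the Int -- no XY reference escapes the tree
        }
    }
    // ...recurse into child nodes as before...
    return result;
}
```

The caller then looks each id up in the master array to find its target, which it was doing anyway.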
Tail-call optimization
There's this fancy thing called Tail-call optimization which is kind of hard to explain, so I'll just show you.
Before:
nw.queryRange(Range, -1, result);
ne.queryRange(Range, -1, result);
sw.queryRange(Range, -1, result);
se.queryRange(Range, -1, result);
return result;
After:
return se.queryRange(Range, filter,
    sw.queryRange(Range, filter,
        ne.queryRange(Range, filter,
            nw.queryRange(Range, filter, result))));
These return the same logical results, but according to the profiler, the second one is faster, at least on the cpp target. Both are doing the exact same logic -- make some changes to the "result" data structure and pass it into the next function until we return. The difference is that in the second version, the recursive call is the very last thing the function does, so the compiler can return the inner call's result directly -- reusing the current stack frame instead of holding on to it for an extra step. Or something like that -- I'm not 100% clear on how it works, so read the link above.
(To my knowledge, the current version of the Haxe compiler does not perform tail-call optimization itself, so this was probably the C++ compiler at work -- don't be surprised if this trick does nothing on non-cpp targets.)
Object Pooling
If I want accurate results, I have to tear down my QuadTree and build it back up again for every update call. Creating new QuadTree instances was fairly tame, but the vast number of new AABB and XY objects those QuadTrees depended on was causing some real memory thrashing. Since these are such simple objects, it makes sense to allocate a bunch ahead of time and just keep reusing them over and over. This is called Object Pooling.
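A minimal pool for something like AABB might look like this (a sketch of the general technique -- FlxPool's actual API and our real AABB class differ in the details):

```haxe
// Hypothetical sketch of a pooled AABB. get() reuses a spare instance when
// one exists; put() hands an instance back instead of leaving it to the GC.
class AABB {
    public var x:Float;
    public var y:Float;
    public var width:Float;
    public var height:Float;

    static var pool:Array<AABB> = [];

    public static function get(x:Float, y:Float, w:Float, h:Float):AABB {
        // Pop a recycled instance if available, otherwise allocate a new one.
        var aabb = (pool.length > 0) ? pool.pop() : new AABB();
        aabb.x = x;
        aabb.y = y;
        aabb.width = w;
        aabb.height = h;
        return aabb;
    }

    public function put():Void {
        pool.push(this); // recycle rather than letting the GC collect it
    }

    function new() {} // private: force allocation through get()
}
```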
Previously I was doing stuff like this:
nw = new QuadTree( new AABB( cx - hs2x, cy - hs2y, hs2x, hs2y) );
ne = new QuadTree( new AABB( cx + hs2x, cy - hs2y, hs2x, hs2y) );
sw = new QuadTree( new AABB( cx - hs2x, cy + hs2y, hs2x, hs2y) );
se = new QuadTree( new AABB( cx + hs2x, cy + hs2y, hs2x, hs2y) );
I replaced it with this:
nw = new QuadTree( AABB.get( cx - hs2x, cy - hs2y, hs2x, hs2y) );
ne = new QuadTree( AABB.get( cx + hs2x, cy - hs2y, hs2x, hs2y) );
sw = new QuadTree( AABB.get( cx - hs2x, cy + hs2y, hs2x, hs2y) );
se = new QuadTree( AABB.get( cx + hs2x, cy + hs2y, hs2x, hs2y) );
We use the open source framework HaxeFlixel, so we implemented this using HaxeFlixel's FlxPool class. For narrow-case optimizations like this I often found myself replacing core Flixel elements with my own implementations (as I did with collision detection and QuadTrees), but FlxPool was better than anything I could have written myself and did exactly what I needed.