In my last post, I concluded that the best way to store most things is to use a large unsorted array. I sneakily avoided mentioning that there is one thing that large unsorted arrays are exceptionally bad at – finding things! But maybe emphasizing that it was a large, unsorted array gave you a hint?
In this post, I’ll try to remedy that.
Finding things
“Finding things” is pretty vague, but it might mean things like:
Find all the sounds that are playing from a certain sound source (so that we can move the sounds when the sound source moves).
Find all the sounds that are playing a certain sound resource (so that we can stop them before unloading the resource).
Find all effects that are being applied to a certain playing sound (so that we can cancel the effects when the sound stops playing).
Etc.
In traditional object-oriented design, we typically do this by keeping explicit lists. I.e., a sound source may have an std::vector of all the sounds playing from it. A playing sound may have an std::vector of all the effects being applied to it, etc, etc.
This can work, but it spreads the data out in lots and lots of small arrays. As I wrote in the previous post, I prefer to have a few large arrays that hold all the data in the system.
Let’s look at a concrete example that we’ll use throughout the rest of this article.
Suppose we want a system that implements the observer pattern for entities. The system lets an entity “subscribe” to events from one or more other entities. When an entity is modified, all the subscribers are notified. We call the subscribers “observers” and we call the entities they are watching “observeds”.
Again, the traditional object-oriented approach would be to let the observed entity have a list of all the entities that are observing it. When the entity is changed it goes through this list to notify them. In addition to creating lots of small arrays, this also has another problem – when an object dies, how does it get removed from the lists of all the objects it is observing? Maybe we can use weak references, so we don’t have to remove it, but that will still waste memory. Or maybe all the observing objects also need a list, a list of all the objects they are observing. It’s starting to get messy.
What would the data-oriented approach be? Well, we want everything in a big array, so why not simply make a big array of “observations”, where each observation connects an observed entity to its observer entity:
```c
typedef struct {
    entity_t *observed;
    entity_t *observer;
} observation_t;

uint32_t num_observations;
observation_t *observations;
```
I’ve shown the data as an array here, just because it is simplest, but we could use any of the methods described in the previous post for storing bulk data. I assume too that whatever method we use for storing entities gives us permanent pointers, so we can use an entity_t * to permanently refer to an entity. If that is not the case, we would have to use something else that would give us a permanent reference instead, such as an index or an ID.
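For illustration, if entity pointers were not permanent, the observation record could store IDs instead. This is just a sketch; entity_id_t is a stand-in for whatever permanent handle your entity storage hands out, not something defined in this series:

```c
// Hypothetical variant of observation_t for the case where pointers are not
// stable. entity_id_t is a stand-in type for a permanent entity handle.
typedef uint64_t entity_id_t;

typedef struct {
    entity_id_t observed;
    entity_id_t observer;
} observation_t;
```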
An entity might appear several times in the observations array if it is watching multiple entities, or being watched by multiple entities. Note that this way of organizing the data is a lot more similar to a relational database than to a traditional object-oriented model.
Assuming this data layout, what kinds of “searches” will we need to do? Two typical situations are:
Entity A has changed, notify all its observers:
SELECT * WHERE (observed == A)
Entity B has been deleted, remove it from the system:
SELECT * WHERE (observed == B OR observer == B)
The most straightforward way of resolving queries like these would be with a brute force loop:
```c
for (observation_t *o = observations; o != observations + num_observations; ++o) {
    if (o->observed == a) {
        // notify(...)
    }
}
```
The cost here is O(n) in the number of observations. This might be alright for very small datasets, but we want to find something better.
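For completeness, here is a brute-force sketch of the second query as well – entity B has been deleted, so every observation involving it should be removed. This assumes the plain-array storage above; the swap-with-last removal strategy is my assumption, just one simple way to erase from an unsorted array:

```c
// Brute-force removal of every observation involving the deleted entity `b`.
// Swap in the last element to keep the array tightly packed, and only
// advance `i` when nothing new was swapped into slot `i`. Still O(n).
uint32_t i = 0;
while (i < num_observations) {
    const observation_t *o = observations + i;
    if (o->observed == b || o->observer == b)
        observations[i] = observations[--num_observations];
    else
        ++i;
}
```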
To improve the performance, we can add additional data structures that let us find things and resolve queries faster. Since this already looks a lot like a database, we will borrow from that terminology and call these accelerating data structures indices.
Indices
A database index is usually a list of references to database records, sorted by one of the fields. This can be used to answer range queries (salary > 50 AND salary < 100) or find items using binary search (which is O(log n)).
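As a rough sketch of what such an index looks like (the record layout and names here are made up for illustration), the index is just an array of record indices kept sorted by the field, and a binary search finds the start of a range:

```c
// Hypothetical example of a classic database-style index: `by_salary` holds
// indices into `records`, kept sorted by the salary field. A binary search
// finds the first record with salary >= the given value in O(log n).
typedef struct { uint32_t salary; /* ... */ } record_t;

extern record_t *records;
extern uint32_t num_records;
extern uint32_t *by_salary;   // indices into `records`, sorted by salary

uint32_t first_with_salary_at_least(uint32_t salary)
{
    uint32_t lo = 0, hi = num_records;
    while (lo < hi) {
        const uint32_t mid = lo + (hi - lo) / 2;
        if (records[by_salary[mid]].salary < salary)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;   // position in `by_salary`; scan forward to answer a range query
}
```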
In game development, we usually don’t need to perform range queries. Typically, we’re only looking for items with a specific value. For example, we may look for all observations with observed == A. This is nice because it means we can use hash tables instead of sorted arrays as our lookup structure and accelerate the performance from O(log n) → O(1).
(As an exception, range queries on positions are often useful, e.g., “find all enemies within 50 m of the player”. I’ll talk a little bit later about how those can be implemented.)
So basically, what we want from our acceleration index is that, given an entity A, it should quickly give us a list of all the observations where observed == A. I.e., something like this:
A → (o1, o2, o3, …)
Before we get into this, let’s start with a simpler case. Let’s assume, just for the sake of simplicity, that, for the property we’re interested in, all the objects have unique values, so our query will at most return a single item:
A → o1
We will come back to the more complex case in a little bit, once we’ve sorted this one out.
If we only need a single value, all we need is a lookup table from one object to another. We can do this with a regular hash table. It could look something like this:
```c
chash32_t observation_by_observed;

observation_t *find_by_observed(entity_t *e)
{
    const uint64_t key = (uint64_t)e;
    const uint32_t idx = chash32_get(&observation_by_observed, key, UINT32_MAX);
    return idx == UINT32_MAX ? 0 : observations + idx;
}
```
Here, observation_by_observed is our index. It stores a lookup from an entity (represented by a pointer) to the observation that has that entity in its observed field (represented by an index into the observations array).
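Just to show how it would be used, here is a sketch of the notification path – notify() is a placeholder for whatever notification mechanism the system actually has:

```c
// Entity `a` has changed -- look up its (single, for now) observation and
// notify the observer. notify() is a stand-in, not a real function here.
observation_t *o = find_by_observed(a);
if (o)
    notify(o->observer, a);
```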
It’s a little bit arbitrary that we’re referencing entities by pointers and observations by indices in this example; we could just as easily do it the other way around – assuming we are using the bulk data strategy laid out in the previous post, where both object pointers and indices are permanent.
Note that the way I’m using hash tables is a bit unusual. Typically, in C++, you would define a hash table as a template hash<K,V> and the hash table would call an overloaded hash() function on the key objects K to compute numerical keys.
In my approach, the hash table does not use templates, but instead just performs a lookup from uint64_t → uint32_t. If we need to hash the key, we do that before we perform the lookup. And instead of storing our values directly in the hash table, we just store indices into an array of values. Since we like to store all our bulk data in big arrays anyway, this works out perfectly. I’ve written a bit more about this approach before.
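To make the “hash the key first” part concrete, here is a sketch of how a string-keyed lookup might look under this scheme. murmur_hash_64() is a stand-in for whatever 64-bit hash function you prefer, and resource_by_name/resources are assumed to be set up just like observation_by_observed/observations above:

```c
// Hypothetical string-keyed lookup: hash the key to a uint64_t ourselves,
// then do the plain uint64_t -> uint32_t lookup in the hash table.
// murmur_hash_64() is a stand-in for your hash function of choice.
typedef struct sound_resource_t sound_resource_t;

extern chash32_t resource_by_name;
extern sound_resource_t *resources;

sound_resource_t *find_resource_by_name(const char *name)
{
    const uint64_t key = murmur_hash_64(name, strlen(name), 0);
    const uint32_t idx = chash32_get(&resource_by_name, key, UINT32_MAX);
    return idx == UINT32_MAX ? 0 : resources + idx;
}
```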
Note that in the entity case, we don’t need to hash the key at all – the entity pointer is already a unique value that fits in a uint64_t, so we can just cast it and use it directly as the key, which is what the code above does.