I’ve talked a little bit before about The Truth — the system we use for representing data in The Machinery. But I haven’t said that much about the rationale behind it — why we chose this approach over something else. Jascha Wedowski wrote to us on Twitter and wanted to know more about that and I think it makes for a really interesting blog topic, so here we go!
What is a data model and why should you have one?
Let’s start at the beginning. Most software programs need some way of representing and storing data. To have a word for it, we call it the application’s data model. There are many possible kinds of data models: text files such as JSON or XML, custom binary formats, databases, plain in-memory objects and structs, and so on.
An application can have multiple representations of the same data. For example, a program may use JSON configuration files on disk, but when the program is booted those files are read into in-memory data structures. The JSON files are permanent, have a well-defined structure and can be easily shared between programs. The in-memory representation is faster to access, but temporary, lacks a high-level structure and can’t be easily shared.
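As a minimal sketch of that duality (the audio settings here are a made-up example, not anything from The Machinery), the same data might look like this in its two representations:

```c
#include <stdbool.h>

// On disk (permanent, well-defined structure, easy to share):
//
//   { "volume": 0.8, "muted": false }
//
// In memory (fast to access, but temporary and without any self-describing
// structure):
typedef struct audio_settings_t {
    float volume;
    bool muted;
} audio_settings_t;
```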
In this post, when I talk about the data model I’m mostly talking about these more permanent, structured models.
Most programs do not handle that much data, and computers are pretty fast, so those programs can afford to convert between a structured file representation and a high-performance in-memory representation each time the program is run, or each time you open a file. But games are different. They often deal with gigabytes of data and need to run really fast. Converting all that data on each boot of the game would lead to really long startup times. Therefore, games often convert from the structured format to the in-memory format in a separate data-compile step. They store the in-memory representation on disk (it needs to be stored somewhere), but in such a way that it can be quickly streamed into memory and used immediately, without any costly parsing or conversion.
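As a minimal sketch of what “no costly parsing” can look like (the compiled_mesh_t resource is hypothetical, not an actual Machinery or Bitsquid format), assume the compiled data is a flat, pointer-free struct: the compile step writes its bytes to disk and the runtime reads them straight back.

```c
#include <stdio.h>
#include <stdint.h>

// A hypothetical compiled resource: plain-old-data with a fixed layout and no
// pointers, so the bytes on disk are exactly the bytes used at runtime.
typedef struct compiled_mesh_t {
    uint32_t num_vertices;
    uint32_t num_indices;
    float bounds[6];
} compiled_mesh_t;

// Data-compile step: write the in-memory representation as-is.
static void save_compiled(const char *path, const compiled_mesh_t *mesh)
{
    FILE *f = fopen(path, "wb");
    if (!f)
        return;
    fwrite(mesh, sizeof(*mesh), 1, f);
    fclose(f);
}

// Runtime load: one read, no parsing or conversion, the data is usable right away.
static compiled_mesh_t load_compiled(const char *path)
{
    compiled_mesh_t mesh = {0};
    FILE *f = fopen(path, "rb");
    if (f) {
        fread(&mesh, sizeof(mesh), 1, f);
        fclose(f);
    }
    return mesh;
}

int main(void)
{
    const compiled_mesh_t original = {.num_vertices = 8, .num_indices = 36};
    save_compiled("mesh.bin", &original);
    const compiled_mesh_t loaded = load_compiled("mesh.bin");
    printf("%u vertices, %u indices\n", loaded.num_vertices, loaded.num_indices);
    return 0;
}
```

A real compiled format would also deal with versioning, endianness and variable-sized data (typically by storing offsets instead of pointers), but the core idea is the same: loading is a copy, not a parse.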
Why not just use the in-memory format all the time and skip the compile step? You could, but typically the more structured format (whatever form it takes) has some advantages. For example, the compiled data might be one big .pak file, which doesn’t work well with version control. It might not merge well and thus not be a good fit when more than one person works on the project. It might also throw away information such as debug strings or compress textures to reduce the size of the final game.
Having a structured data model is useful because it allows us to implement features that we want our data to have on the data model itself, rather than on the systems that use it. This means that we only have to implement the feature once, and all systems will get it, rather than having to do a separate implementation in each system.
For example, consider backward compatibility. Backward compatibility means that a future version of our program is able to open files from an older version, even if the data has changed in some way (for example, we may have added new properties to an object). It is a pretty essential feature, because without it, an application update would break all the users’ old files.
Without support in the data model, backward compatibility might mean keeping around all the code for parsing every past version of the data. Your code might look something like:

```c
if (version == VERSION_1_0) {
    // parse the 1.0 format
    ...
} else if (version == VERSION_1_1) {
    // parse the 1.1 format
    ...
}
...
```
In contrast, if your data model handles backward compatibility you don’t have to do anything. As an example of how that might work, consider JSON. As long as you are just adding and removing properties, and give any new properties reasonable default values, JSON will automatically be backward compatible. JSON can also do a decent job of forward compatibility — i.e., allowing old executables to open newer data. The old executables will just ignore any properties they don’t understand. Forward compatibility is hard to achieve without some sort of structured data model, since you can’t do an if (version == ???) test for unknown future versions of the data.
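As a sketch of why this works, assume the parsed JSON object is just a list of (name, value) pairs (a stand-in for whatever representation your JSON library gives you; the property names below are made up). Reading properties by name, with a default for anything that is missing, handles both directions: old files without a property still load, and properties the executable doesn’t know about are simply never asked for.

```c
#include <stdio.h>
#include <string.h>

// Stand-in for a parsed JSON object: an array of (name, value) string pairs.
typedef struct property_t {
    const char *name;
    const char *value;
} property_t;

// Look up a property by name, falling back to a default if it doesn't exist.
// Old files that lack the property still load (backward compatibility), and
// properties this executable doesn't know about are simply ignored
// (forward compatibility).
static const char *property_or_default(const property_t *props, int n,
    const char *name, const char *def)
{
    for (int i = 0; i < n; ++i) {
        if (strcmp(props[i].name, name) == 0)
            return props[i].value;
    }
    return def;
}

int main(void)
{
    // A "newer" file with a property this version of the code doesn't know about.
    const property_t file[] = {
        {"name", "lamp"},
        {"cast_shadows", "true"},   // unknown property: simply ignored
    };
    const int n = sizeof(file) / sizeof(file[0]);

    printf("name:  %s\n", property_or_default(file, n, "name", "unnamed"));
    printf("color: %s\n", property_or_default(file, n, "color", "#ffffff"));
    return 0;
}
```

The important part is that compatibility falls out of how the data is accessed, not from explicit version checks.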
In addition to backward and forward compatibility there are a bunch of other things that a data model can potentially help with:
Dependency tracking. If the data model has a consistent way of representing references, you can use it to detect missing or orphaned objects (objects not used by anyone).
Copy/paste. If the data model supports cloning objects, copy/paste operations can be implemented on top of that. This means that you don’t have to write custom copy/paste code for all your different objects.
Undo/redo. If the data model keeps track of a history of changes, undo can be implemented simply by rewinding the history (see the sketch after this list). This is a lot simpler than using something like the Command Pattern to implement undo.
Real-time collaboration. If the data model has a synchronization protocol, you get collaboration for free. Users can just make local changes to their data, and through the replication protocol, those changes will be propagated to other users in the same session.
Offline collaboration. By offline collaboration, I mean collaboration where you explicitly push and pull changes from collaborators (instead of all changes happening in real-time). In other words, the regular version control model. Since most version control tools are based around text-based merging, in order to support offline collaboration nicely, your file formats must be human readable and merge easily (unless you want to write your own merge tools).
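To make the undo/redo point concrete, here is a minimal sketch, assuming a toy data model where every edit goes through a set_property() call that records an (object, property, old value, new value) entry; this is just an illustration of the idea, not The Truth’s actual API. Undo is nothing more than rewinding that history.

```c
#include <stdio.h>

// A toy object with a few numeric properties.
typedef struct object_t {
    float properties[2];
} object_t;

// Every edit made through the data model is recorded here.
typedef struct change_t {
    object_t *object;
    int property;
    float old_value;
    float new_value;
} change_t;

static change_t history[256];   // fixed-size log, enough for this sketch
static int history_size;

// All edits go through the data model, so recording history is automatic.
static void set_property(object_t *object, int property, float value)
{
    history[history_size++] = (change_t){
        .object = object,
        .property = property,
        .old_value = object->properties[property],
        .new_value = value,
    };
    object->properties[property] = value;
}

// Undo just restores the old value of the most recent change.
static void undo(void)
{
    if (!history_size)
        return;
    const change_t c = history[--history_size];
    c.object->properties[c.property] = c.old_value;
}

int main(void)
{
    object_t o = {0};
    set_property(&o, 0, 1.0f);
    set_property(&o, 0, 2.0f);
    undo();
    printf("%f\n", o.properties[0]);   // prints 1.000000
    return 0;
}
```

Redo is the mirror image: keep a cursor into the history instead of shrinking it, and re-apply new_value when stepping forward.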
In short, by putting a lot of responsibilities on the data model you can make the UI code for the editor a lot simpler. This is really important to us, because one of the problems we had in Bitsquid/Stingray was that we spent a lot of time developing UI and tools. Sometimes we would spend 30 minutes adding a feature to the runtime and then a week creating a UI for it. In The Machinery we wanted to address that imbalance and make sure that we could write tool and UI code as efficiently as runtime code. (Of course, anything involving human interface design will always be somewhat time-consuming.)
The Bitsquid/Stingray data model
Picking a data model means balancing a range of different concerns. How fast does the code need to run? How much data does it need to handle? Do we need the model to support undo, copy/paste, collaboration, and so on?
It’s not an easy choice, and once you’ve made it you’re usually stuck with it. You can’t change the data model without either breaking all your users’ data or writing a data migration tool, which can be tricky and time-consuming.
To understand our choices for The Machinery, it helps to compare it to the data model we used for our last big project, Bitsquid/Stingray. The choices we made for The Machinery are in part a reaction to the problems we saw with that model.
In the Bitsquid engine, data was represented as JSON files on disk (with some exceptions: things like textures were stored as binary data). The data was read by a federation of independent but co-operating executables, such as an Animation Editor, Sound Editor, Level Editor, etc. For the runtime, this JSON data was compiled into efficient .pak files that could be streamed from disk directly into memory.
The Bitsquid/Stingray data model.