A tale of two bugs

Sept. 20, 2018
protect

We recently released a closed pre-alpha of The Machinery. As a result, we of course discovered a few new bugs. In this blog post, I’ll take a look at the two hairiest ones.

In general, I find that it is hard to find good information on debugging. You’re just expected to “pick it up as you go”. But it is a complicated topic and it’s hard to be a good programmer if you’re not good at debugging. I hope this post helps.

 

The ghost in the network

The first bug was actually discovered just before we released the alpha. We noticed that collaborative editing didn’t work between mine and Tobias’ machine. The editing session would start up and synchronize the scene, but on certain actions after that — such as moving or cloning an entity — the editor would crash.

The weird thing about this is that collaboration worked locally. I.e., if either me or Tobias started up two editors on one of our machines, those two editors could collaborate without any issues at all. What’s weird about that is that to the software there should be no difference between a local and a remote session. A local session is just a session that connects on 127.0.0.1 and a remote session is a session that connects to another IP. The difference just exists in the socket layer, so what could possibly be going on?

We spent some time just verifying the facts. This is always a good idea when something seems really weird. But yes, remote sessions were crashing and local sessions worked, consistently.

I investigated the code around the crash on the client side. The collaboration code is based on synchronizing objects with properties. (In The Truth, which I’ve talked about before.) Whenever a property on an object is changed, the change is serialized and sent to the other collaborators in the session. The crash happened when the client got a message from the server that a set property of an object had changed. The problem was that according to the data from the server, the set had billions of elements. Since the client was unable to allocate the data for this, the application crashed. So it would seem like the data from the server got corrupted somehow.

Our next idea was that maybe it was a Debug/Release difference. Turns out Tobias was running the Debug version of the engine and I was running the Release version. Maybe the layout of some struct was different in Debug or Release causing serialization differences? Or maybe we had uninitialized memory somewhere? But no, when we switched to running the same version we still had the crash.

Since there seemed to be data corruption somewhere, our next step was to try to figure out where. Maybe there was a bug in our network layer when packages were split over multiple send and recv calls. Or maybe the network itself was corrupting packages, wifi had been spotty due to a bad router. Furthermore, if we could show that there was no network corruption we knew that the problem would have to be in the server serialization code, since the data was alread bad when it got to the client.

To check that, I started logging MD5 checksums of all packages as they entered and left the network layer of our application. Turns out the checksums all matched, so there were no bugs in the network layer code.

At this point I knew that the problem must be in serialization/deserialization, so I decided to step through the code as a bad packet was being serialized on the server and then step through the deserialization of the same packet on the client.

This gave me the first real clue to the bug. I noticed that an object property was being serialized as a reference (one of the data types in The Truth), but on the client side it was deserialized as a set. Thus, the object ID of the reference was interpreted as the element count of the set, which caused the bug.

On closer examination, this was because the client and the server did not agree on the object type. The index of the type (which was what was sent over the wire) was the same. But that index had different meanings on the server and the client. The type table looked something like this:

Type table on client and server

There were only a few types in the list that had switched places. So collaboration worked fine until one of these types was sent, then it would crash.

When I created the collaboration system I knew that I would eventually have to deal with situations where the type table was different between the server and the client. This can happen, for example, if the client and server uses different plugins and some of these plugins introduce new types.

However, I put this off as something to do “later”. After all, for now, all users will be running with the same set of plugins (the ones we include in the alpha). Also, I’m not 100 % sure what collaboration even means when plugins differ. One of the users can create objects that the other users have no idea what they are. Does this even make sense?

So I’d found the bug, but I still wanted to figure out why our type lists differed. After all, both me and Tobias were running with the same set of DLLs — the default DLLs that we distribute with the alpha. Was there something non-deterministic in the loading of the DLLs that caused them to be loaded in a different order on different machines?

As we dug into this, we noticed it was actually something else. At one point during the development of the pre-alpha we decided to rename our sample_plugin.dll to tesselated_plane_plugin.dll. Tobias had deleted his bin directory after the rename, while I had not. So I had both the sample_plugin.dll and the tesselated_plane_plugin.dll, while he had only the latter one. Since plugins are loaded in alphabetical order, this caused the plugin to load and register its types earlier on my system, causing them to appear earlier in the list.

Once I deleted the sample_plugin.dll from my bin directory, collaboration worked as expected.

In the end, turns out there was nothing to fix for this bug. Of course, we eventually need to fix the problem where users can’t collaborate if their plugins don’t match, but it’s too big of a fix for the alpha. And users will only run into this bug if they write their own plugins that define new types and they run collaboration sessions between different machines and they only put these plugins on some of the machines.

At the end of a debugging session like this, I like to reflect back to see what can be done to prevent similar bugs in the future, and/or to make them easier to debug.

As for the debugging session itself, I think it went pretty well. The important things when debugging are:

  • Verify your hypotheses. If you think something is true, find a way of proving or disproving it (by logging, single-stepping, etc).

  • Find ways of getting more information. How can you increase your knowledge about the bug?

When debugging, you really need to think like a scientist and keep track (in your head or on paper) of the things you know, the things you believe (but don’t know for sure) and then find “experiments” that can add to what you know, and confirm or rule out the things you believe.

As for how to prevent bugs like this in the future, it is clear that I made a pretty big mistake when writing the original collaboration code. I was very aware that the code would only work (for now) if the type lists matched between the client and the server. So this was a pre-requisite, but I didn’t put that pre-requisite into the code.

If I had just added an assert verifying this pre-requisite, the assert would have triggered immediately and we would have known right away what was going on and not have to spend two hours debugging it.

Adding an assert is easy, we can just hash the entire type table and send that checksum as part of the initial synchronization message. If the checksums don’t match between client and server, we print an error.

So the lesson we can learn from this is clear:

  • If your code is assuming something — especially if you know that there are situations where that assumption might not be true — put that assumption into the code in the form of an assert.

The curiously recursing child entities

The second bug was discovered by one of our users. Luckily it had a 100 % reproduction case, but the reproduction steps were kind of complicated. First, you needed to open one of our more complicated projects, then open a big entity with hundreds of sub-entities and make that entity a child of itself. Finally, create a new entity, and make the complex entity a child of that entity.

Having an entity be a child of itself is kind of a weird setup. It happens when you use an entity as a prototype and place an instance of that prototype in the entity itself. (Note: In our system, all entities can act as prototypes, there is no distinction between an “entity” and a “prototype”.) Normally, a situation like that would lead to infinite recursion. However, we have a rule that in cases of self-reference, we only expand the first level in the runtime. So if a room is a child of itself, you only see that first room child, you don’t see the child of that room. (You can force another level of expansion by overriding the instance locally, but that’s another story.)

An entity that includes a smaller copy of itself, expanded to level 4.

The reason we allow entities to be children of themselves is not that we think it is a super useful feature, it is just that it is nicer to allow it than to throw a big error box saying “ILLEGAL OPERATION” in the user’s face.

When I reproduced the bug I could see that the crash happened in a piece of code that enumerated the entities in the scene. That code went into infinite recursion which smashed the stack.

Thus, it seemed pretty likely that there was a problem somewhere with how we handled these recursive entities. Probably I’d missed something that made us still go into infinite recursion.

My next step was to try to create a simpler test case. With the hundreds of objects in the initial scene, it was kind of hard to tell what was going on. I figured that with a simpler scene, I could just step through the code and it would be easier to understand the bug.

However, when I tried to simplify the problem, I could no longer reproduce it. I created a fresh entity and added the same kind of recursion that was in the original repro case — no crash. I then started with the original repro case and tried to delete a bunch of sub-entities to make it simpler — no crash anymore. Weird.

Again, I figured I needed more information, so I started to add logging to the application. In the code that was smashing the stack I added logging of object names, so I could see what was going on. I noticed something weird.

According to the log, the room entity was a child of the cellar_door entity. And the room had a props entity, which was the parent of the cellar_door. This was causing the loop. However, this was weird for two reasons.

JikGuard.com, a high-tech security service provider focusing on game protection and anti-cheat, is committed to helping game companies solve the problem of cheats and hacks, and providing deeply integrated encryption protection solutions for games.

Read More>>