Recording Statistics - An Exercise in Minimalism

April 28, 2020

I recently added a Statistics view to The Machinery. It can be used to display various real-time statistics from the engine:

Frame time plotted.

In addition to drawing graphs, data can also be presented in tabular form, for more precise inspection:

Frame time shown as table.

Graphs are useful because they present a lot of information in a form that the human visual cortex can process quickly. Things like glitches, anomalies, and bad patterns stand out in a way they don’t when you are just looking at numbers in a profiler:

  • If you see occasional spikes in the frame rate but have a hard time pinpointing where they are coming from, you can plot the time spent in some of the major engine systems and see if one of them has spikes that coincide with the frame rate spikes. If it does, you have found the main source of the frame hitches.

  • Plotting memory usage over time is a good way of finding leaks and other problematic patterns. Now, we already have memory leak detection in the engine, but it only detects memory that never gets freed. There are other kinds of “leaks” that can also cause problems. For example, you might have an array that just grows and grows, because elements keep getting added to it without being removed. This is not a true memory leak, because we are still keeping track of all allocations and the memory eventually gets freed when the array is destroyed, perhaps at the end of the level. But memory use still grows without bound in this situation, and the application eventually runs out of memory. This type of “leak” can happen even in garbage-collected languages.

In the Statistics view, this shows up as a typical pattern of memory use that keeps rising while a level is being played and then drops sharply when the level is restarted. Finding the root cause can still be tricky, since these kinds of “leaks” are harder to fix than the “real” ones, but looking at the statistics can put you on the right track and let you narrow it down to a specific system.

  • Another example — suppose you are noticing an unusually high number of draw calls in a certain part of a level. By keeping the draw call statistics on screen as you pan around with the camera, you can check if you notice any sudden jumps as certain objects enter and leave the view frustum. In this way, you can quickly pinpoint the problematic object.
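The spike matching described in the first bullet is normally done by eye, but the same idea can be sketched in code. The example below is only an illustration, not part of The Machinery: the function name `count_coinciding_spikes` and the definition of a "spike" as a sample above `factor` times the series mean are both assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

// Returns the mean of `n` samples (0 for an empty series).
static double series_mean(const double *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += v[i];
    return n ? sum / (double)n : 0.0;
}

// Counts the frames where BOTH series exceed `factor` times their own mean,
// i.e. frames where a frame-time spike coincides with a spike in one system.
// Hypothetical helper: the "factor * mean" threshold is an assumption.
size_t count_coinciding_spikes(const double *frame_ms, const double *system_ms,
                               size_t n, double factor)
{
    assert(frame_ms != NULL && system_ms != NULL);
    const double frame_limit = factor * series_mean(frame_ms, n);
    const double system_limit = factor * series_mean(system_ms, n);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (frame_ms[i] > frame_limit && system_ms[i] > system_limit)
            ++count;
    }
    return count;
}
```

A system whose count is close to the total number of frame-time spikes is a good first suspect. A system with a flat timing curve will report zero.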

In my case, I wanted to use the Statistics view to investigate a weird “jerkiness” in the camera that I was sometimes seeing. As I was panning the camera smoothly, it kept jumping back, almost as if time was going backward. Weirdly, I couldn’t capture the phenomenon with any screen capturing tools, only by filming the screen with my phone.

So I wanted to plot the frame rate and camera position over time to see if I could spot any patterns that could explain the behavior.

(Spoiler: I didn’t. Both the frame time and the camera position moved smoothly. The jerkiness seems to have something to do with running on a secondary monitor; it always runs smoothly on my main screen. This, together with the fact that I can’t screen capture the issue, seems to indicate that frames somehow get “reordered” as they are sent to the secondary monitor. Now, that sounds super weird to me and I’m not sure how that is even possible. Might be a topic for a future blog post if we can figure it out.)

But anyway, that was the impetus for creating this system. And even though it couldn’t resolve this particular issue, it’s still a very useful thing to have in the engine. In this post, I’ll talk a little about how it’s implemented.

Drawing the graphs is pretty straightforward, so let’s focus on the more interesting problem: how do we collect the data that we draw?

There are two main requirements on the data collection:

  • First, it must be extensible. The Machinery is plugin-based and each plugin should be able to add its own counters to the system. In addition, there is a lot of game-specific data that is interesting to track too, such as the number of enemies, A* searches, etc.

  • Second, the system must be fast enough that you can add statistics everywhere and not worry about leaving them in the code. Having the statistics always available lets you do exploratory investigations that you might not bother with if you had to explicitly insert and remove stats collection every time you wanted to check something. In particular, I find it very useful to have statistics counters for all profiling scopes, so we can examine them for hitches. This tends to be a lot of data, so the system needs to be efficient.

A straightforward API for data collection might look something like this:


add_to_frame_counter("renderer/primitive-count", n);

Here we use a string to identify the data counter. (We need some human-readable identification so that we can browse for the counter in the UI.) The function call simply adds n to the current frame’s value for the counter.
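To make the semantics concrete, here is a minimal sketch of what could sit behind such a call. It is not The Machinery's implementation: the fixed-size table, the `counter_t` struct, and the helper functions `frame_counter_value()` and `end_frame()` are all hypothetical names introduced for illustration. The linear `strcmp()` lookup is the simplest thing that works and makes no attempt to meet the performance requirement above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Hypothetical fixed capacities, chosen arbitrarily for this sketch.
enum { MAX_COUNTERS = 256, MAX_NAME_LEN = 64 };

typedef struct counter_t {
    char name[MAX_NAME_LEN]; // Human-readable id, e.g. "renderer/primitive-count".
    double frame_value;      // Total accumulated so far during the current frame.
} counter_t;

static counter_t counters[MAX_COUNTERS];
static uint32_t num_counters;

// Returns the counter with the given name, creating it on first use.
// Returns NULL if the table is full or the name does not fit.
static counter_t *find_or_create_counter(const char *name)
{
    assert(name != NULL);
    for (uint32_t i = 0; i < num_counters; ++i) {
        if (strcmp(counters[i].name, name) == 0)
            return &counters[i];
    }
    if (num_counters == MAX_COUNTERS || strlen(name) >= MAX_NAME_LEN)
        return NULL;
    counter_t *c = &counters[num_counters++];
    strcpy(c->name, name); // Safe: the length was checked above.
    c->frame_value = 0.0;
    return c;
}

// Adds `value` to the current frame's total for the named counter.
void add_to_frame_counter(const char *name, double value)
{
    counter_t *c = find_or_create_counter(name);
    if (c)
        c->frame_value += value;
}

// Reads the total accumulated so far this frame (0 for unknown counters).
double frame_counter_value(const char *name)
{
    for (uint32_t i = 0; i < num_counters; ++i) {
        if (strcmp(counters[i].name, name) == 0)
            return counters[i].frame_value;
    }
    return 0.0;
}

// Call once per frame, after the totals have been stored or drawn,
// to reset every counter for the next frame.
void end_frame(void)
{
    for (uint32_t i = 0; i < num_counters; ++i)
        counters[i].frame_value = 0.0;
}
```

With this sketch, calling `add_to_frame_counter("renderer/primitive-count", 10)` and then `add_to_frame_counter("renderer/primitive-count", 5)` in the same frame leaves the counter at 15 until `end_frame()` resets it.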

Note that I’ve chosen to accumulate all the data for each counter over the frame. This is not an entirely self-evident choice. Some sources might produce multiple data points per frame, or fewer than one data point per frame. For example, if you are looking at the size of inbound network packets, you might have a lot of frames where you get no packet at all and some frames where you get multiple packets. Instead of tracking the packet size per frame, we could treat each incoming packet as its own data point.

Here’s what the data might look like using the two different methods.

Creating a data point for each incoming packet:

When (frame #)    3.5    6.1    6.3    9.2
Packet size       782   1003    450    510

Accumulating the data per frame:

Frame #    0    1    2    3    4    5    6
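The relationship between the two representations can be sketched in a few lines of C. The function below folds per-packet data points into per-frame totals. The `packet_event_t` struct and the function name are hypothetical, and the sketch assumes that a packet arriving at time 6.3 belongs to frame 6 (the timestamp is rounded down).

```c
#include <assert.h>
#include <stddef.h>

// One data point in the "per packet" representation (hypothetical type).
typedef struct packet_event_t {
    double when;   // Arrival time measured in frames, e.g. 6.3.
    unsigned size; // Packet size.
} packet_event_t;

// Sums the sizes of all packets that arrive during each frame.
// `per_frame` must have room for `num_frames` entries. Events that fall
// outside the range [0, num_frames) are ignored.
void accumulate_per_frame(const packet_event_t *events, size_t num_events,
                          unsigned *per_frame, size_t num_frames)
{
    assert(events != NULL || num_events == 0);
    assert(per_frame != NULL || num_frames == 0);
    for (size_t f = 0; f < num_frames; ++f)
        per_frame[f] = 0;
    for (size_t i = 0; i < num_events; ++i) {
        const double when = events[i].when;
        if (when < 0.0 || when >= (double)num_frames)
            continue;
        // Truncation equals floor for non-negative values.
        const size_t frame = (size_t)when;
        per_frame[frame] += events[i].size;
    }
}
```

Feeding in the four packets from the first table with ten frames gives 782 for frame 3, 1453 for frame 6 (1003 + 450), 510 for frame 9, and 0 for every other frame. The per-frame form loses the exact arrival times and merges packets that share a frame, but it gives every counter exactly one value per frame.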
