Optimization: Tradewarz part 1

Instead of the planned post on volume rendering, today's topic is optimization. I've wanted to write about it for a while, but I tend to either start with optimized primitives like sb-cga, or optimize incrementally as I add features, so didn't really have any good example code to work with. Conveniently axion on #lispgames has been working on a game to learn OpenGL, and due to a combination of implementing things from scratch to learn them better and the rest of us simplifying things to be able to explain them in steps, the code has lots of room for both low- and high-level optimization.

Tradewarz screenshot

In part 1, I will start with relatively low-level optimization since the goal is to explain the optimization techniques rather than just make it faster. Higher level optimizations would have much more effect here, so normally they would be a priority. They would make the result of low-level optimization less obvious though, so we will pretend they aren't available for now.

The code so far renders a hex-based grid with some debug annotations (the red numbers in the screenshot above), and a few externally created models. I'm starting a few commits back from HEAD at the time of writing, since the actual code already has a few similar optimizations and I want to show the process leading up to them. The code for this post is in the 3b-optimization-part-1 branch of my fork on github.

Some general comments about optimizing for SBCL

I mostly use SBCL, particularly for code that needs to be fast, so my optimizations tend to focus on things that work well on sbcl.

When optimizing for sbcl, there are a few things worth keeping in mind:

declarations are treated as assertions unless the safety optimization property is set to 0.

This is great for development, since catching errors early saves a lot of time, leaving more time to think of better high-level optimizations. When doing low-level optimizations, you need to keep it in mind though. A function that just does 1 addition will probably spend as much time in the type check as it gained from being able to use an optimized addition. Some people would use a (safety 0) declaration here to get rid of the type check, but I usually prefer to just inline the function. When the function is inlined SBCL is smart enough to skip the check if it knows the type from the surrounding code, and we still get errors when sbcl isn't sure and the code is wrong. For larger functions, the improvements from generating better code usually outweigh the type checks.

Other lisps tend to just trust them, as SBCL does at (safety 0).
SBCL is pretty good at type propagation/optimizations.

For some lisps, you want to sprinkle declarations and THE around the code to make sure the compiler knows what is going on. In sbcl I usually try to avoid THE, since that tends to be a sign that either something else isn't declared well enough or that you are lying to the compiler and sometimes it will break.

For example, when adding small integers, rather than saying (the (unsigned-byte 32) (+ a b)) to get it to "act like C", in SBCL you should specify that you only care about 32 bits of the result with (ldb (byte 32 0) (+ a b)). If it also know the inputs are 32 bit integers, it will compile that down to a single instruction. Alternately, if it knows the inputs to (+ a b) are less than 32 bits, for example (unsigned-byte 10) or (integer 10 20), it can figure out the output is 32bits or less and again add them directly with a single instruction.

With (optimize (speed 3)) SBCL will provide notes about things it could have compiled more efficiently if it knew more about the types.

Sometimes the types aren't known in advance, but we have some common inputs we want to be fast. For example we might want to optimize for vectors of single-float, or (unsigned-byte 8) image data while still handling 16bit images correctly. In that case, duplicating the processing code within a typecase can improve speed quite a bit at the expense of increasing code size, since each branch can have optimized code for a particular type while the T branch keeps the slow but generic version.

SBCL is also good at figuring out types from how previous functions react to a value. For example check-type will have mostly the same effect as a declaration on following code, or if you do (+ a 1) (unless a ...), it knows that if + returns, A must have been a number and can't be NIL.
"boxing"

Most lisps need to store type information with values, frequently by storing it in a few bits of a pointer to the value. Some values smaller than a pointer (like small integers) can be stored directly, and still leave room for the type information. Converting between the form with type information and the form without is called "boxing" and "unboxing".

Boxing usually isn't too noticeable, but when working with values right on the edge of being able to store directly (like (unsigned-byte 32) and single-float on 32 bit platforms, or (unsigned-byte 64) and double-float on 64bit), the extra overhead can make up a large fraction of the cost of a call to a small function. Similarly to the overhead of type checks, inlining is an easy way to avoid that problem. In some cases, it can also be useful to provide a pre-existing 'box' in the form of an appropriately typed array the function can write results to instead of returning them directly.

Step 1: Profiling

Before trying to optimize things, first we need to profile it. I usually use the sb-sprof statistical profiler in sbcl (it also has a deterministic profiler, so you might want to try both and see which gives more insight into a particular problem.)

We also want some overall measurement of performance. In this case we will use average time per frame. On my hardware, it gets about 90FPS = 11ms/frame with 100% of a CPU core, and about 12% GPU utilization according to nvidia-settings (and including everything else displaying on the system). Knowing that we are limited by CPU and not GPU is important, since optimizing CPU wouldn't help much if it was mostly waiting for the GPU to draw things.

Note that both measurements are fairly noisy. There are other things running on the system, and just recompiling a function can cause a few FPS gain or loss due to things like how a loop aligns to cache or whatever. In this case, we should be able to get a few orders of magnitude improvement once it gets to some of the higher level optimizations, so just relative amounts and general trends is enough. If we were working towards a few percent improvement it would be time for more precise measurements like running specific functions in a loop, averaging results over multiple runs, etc.

;; load the statistical profiler
(require 'sb-sprof)

;; quick utility to profile all threads for N seconds (default 10)
(defun run-profile (&optional (time 10))
  (sb-sprof:start-profiling :threads :all :mode :cpu)
  (sleep time)
  (sb-sprof:stop-profiling)
  (sb-sprof:report :type :graph))

;; start the code
(tradewarz:start-game)

;; profile it
(run-profile)

The output is pretty verbose, so I usually use the slime-sprof slime contrib which formats the output and makes it easier to navigate. Here, I'll just cut out chunks of the full output and point out the interesting parts.

With (sb-sprof:report :type :graph), the output is broken into 2 main parts: the "call graph" section and the "flat profile section". We'll start with the latter

           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
------------------------------------------------------------------------
   1    157  23.7    521  78.7    157  23.7        -  TRADEWARZ::RENDER-NODE
   2     69  10.4     69  10.4    226  34.1        -  SB-KERNEL:%SINGLE-FLOAT
   3     32   4.8    649  98.0    258  39.0        -  TRADEWARZ::LOOP-SCENE
   4     30   4.5     51   7.7    288  43.5        -  CL-OPENGL:COLOR
   5     23   3.5     23   3.5    311  47.0        -  SB-KERNEL:TWO-ARG-*
   6     21   3.2     21   3.2    332  50.2        -  CL-OPENGL-BINDINGS:CHECK-ERROR
   7     20   3.0     20   3.0    352  53.2        -  (LAMBDA (SB-PCL::.ARG0.) :IN "...")
   8     14   2.1     27   4.1    366  55.3        -  TRADEWARZ::CONVERT-TO-OPENGL
   9     12   1.8     14   2.1    378  57.1        -  (FLET #:BODY-FUN-640 :IN SB-IMPL::GETHASH3)
  10     11   1.7     39   5.9    389  58.8        -  TRADEWARZ::VECTOR-MULTIPLY-TO

Here we see under the self column that tradewarz::render-node accounts for 23.7% of our time, so we will start there. We also note that sb-kernel:%single-float is number 2, which is a sign we are doing a lot of type conversions somewhere, and sb-kernel:two-arg-* at number 5, which is a routine used to multiply numbers of unknown type.

Next step is to find tradewarz::render-node in the call-graph section of the report (we want the block where tradewarz::render-node isn't indented as far):

                               Callers
                 Total.     Function
 Count     %  Count     %      Callees
------------------------------------------------------------------------
     1   0.2                   TRADEWARZ::RENDER-NODE [1]
   521  78.7                   TRADEWARZ::LOOP-SCENE [3]
   157  23.7    521  78.7   TRADEWARZ::RENDER-NODE [1]
     1   0.2                   TRADEWARZ::CURRENT-MAP [45]
     2   0.3                   KEYWORDP [25]
     1   0.2                   "foreign function closure_tramp" [38]
     1   0.2                   TRADEWARZ::CURRENT-SCENE [34]
     1   0.2                   SB-C:RETURN-MULTIPLE [46]
    10   1.5                   (LAMBDA (SB-PCL::.ARG0.) :IN "...") [7]
    14   2.1                   CL-OPENGL-BINDINGS:CHECK-ERROR [6]
     5   0.8                   (SB-PCL::FAST-METHOD CFFI:TRANSLATE-TO-FOREIGN (T CFFI::FOREIGN-ENUM)) [33]
     6   0.9                   TRADEWARZ::MAKE-VECTOR [19]
     3   0.5                   (FLET #:CLEANUP-FUN-91 :IN CL-OPENGL-BINDINGS:BEGIN) [72]
     7   1.1                   (SB-PCL::FAST-METHOD TRADEWARZ::GET-SIZE (TRADEWARZ::MODEL)) [43]
     3   0.5                   "foreign function glColor4f" [32]
     8   1.2                   TRADEWARZ::GET-MODEL [29]
     6   0.9                   SB-VM::GENERIC-+ [10]
     5   0.8                   (FLET #:CLEANUP-FUN-76 :IN CL-OPENGL-BINDINGS:BIND-TEXTURE) [68]
     2   0.3                   "foreign function funcallable_instance_tramp" [31]
    19   2.9                   CFFI::%FOREIGN-ENUM-VALUE [35]
     5   0.8                   TRADEWARZ::VECTOR-MODIFY [18]
     5   0.8                   (LAMBDA (SB-PCL::.ARG0.) :IN "...") [23]
     6   0.9                   (LAMBDA (SB-PCL::.ARG0. SB-PCL::.ARG1.) :IN "...") [21]
     1   0.2                   TRADEWARZ::RENDER-NODE [1]
    39   5.9                   TRADEWARZ::VECTOR-MULTIPLY-TO [11]
     6   0.9                   SB-IMPL::GETHASH3 [28]
    46   6.9                   TRADEWARZ::CONVERT-TO-OPENGL-NEW [27]
    45   6.8                   SB-KERNEL:%SINGLE-FLOAT [2]
    33   5.0                   CL-OPENGL:MULT-MATRIX [17]
    50   7.6                   CL-OPENGL:COLOR [4]
------------------------------------------------------------------------

The right column can be a bit confusing, the way to read it is that the less indented line (tradewarz::render-node) is the function being called, lines above that describe relation to functions that called it, and below describe functions it calls. The numbers like [1] at the end of the line indicate which block in the graph describes that function, here render-node is the first entry in the call-graph report.

In this case, render-node is called from loop-scene, and 78.7% of the time within calls to loop-scene is spent in render-node. Similarly, we can see that render-node spends a lot of time in gl:color, tradewarz::convert-to-opengl-new, sb-kernel:%single-float, tradewarz::vector-multiply-to, and cl-opengl:mult-matrix.

Step 2: some low-hanging fruit

It is interesting that gl:color shows up in the profile, but similar functions like gl:tex-coord which are called the same amount and have nearly the same definition don't. Looking at the source for render-node, we see that gl:color is called through apply while the others are called directly. Replacing that with (gl:color (first c) (second c) (third c) (or (fourth c) 1.0) removes gl:color from the profile completely (since it is inlined by default) and leaves us with just the "foreign function glColor4f" still taking 0.05%, and saving us 0.7ms from average frame time. Since gl:color was inlined it could have just moved the processing time into render-node, which is one reason we need an absolute measure of time in addition to the profiler.

Next is convert-to-opengl-new, which converts a tradewars::ax-matrix struct into a vector to pass to OpenGL. Later commits in the original modified the struct to use a (simple-array single-float (16)) for storage, so we will do that here as well. Once that is done, all convert-to-opengl-new does is create a new matrix and transpose the existing one into it. Conveniently, GL 1.3 added gl:mult-transpose-matrix which lets us use the original order directly. Using (gl:mult-transpose-matrix (world-basis node)) brings us to about 102FPS, another 0.61ms improvement.

Step 3: some actual computation

From there we move to vector-multiply-to, which is a component-wise multiplication of 2 vectors into a third, in this case scaling a vertex. The proper high-level optimization would be to just include the scaling into the modelview matrix and let GL deal with it, probably much more efficiently. For now, let's go ahead and optimize the existing code a bit anyway.

If we look at the entry for vector-multiply-to in the flat profile, we see something like

------------------------------------------------------------------------
    48   7.4                   TRADEWARZ::RENDER-NODE [1]
    17   2.6     48   7.4   TRADEWARZ::VECTOR-MULTIPLY-TO [7]
     5   0.8                   SB-VM::GENERIC-+ [14]
    26   4.0                   SB-KERNEL:TWO-ARG-* [4]
------------------------------------------------------------------------

As mentioned above, sb-kernel:two-arg-* is a function that handles arbitrary types of numbers, including complex, ratio, and various combinations of 2 number types. If we knew we always had 2 single-floats, we should be able to multiply them with a single CPU instruction. Not sure where sb-vm::generic-+ comes from in that profile, but it is the same sort of thing.

We'll start by modifying ax-vector to use a typed vector as storage, similar to the previous modification of ax-matrix. Some of the code tries to create vectors with integers, so we will also add a more convenient version of make-vector that converts the arguments to single-float, and inline it so if the compiler knows the inputs are already single-float it can skip that step.

(defstruct (ax-vector
            (:type (vector single-float))
            (:constructor %make-vector (&optional x y z))
            (:conc-name v))
  (x 0.0)
  (y 0.0)
  (z 0.0))

(declaim (inline make-vector))
(defun make-vector (&optional (x 0.0) (y 0.0) (z 0.0))
  (%make-vector (float x 0.0) (float y 1.0) (float z 1.0)))

The data files also have integers in them, so let's modify vector-modify to convert to single-floats as well, and add some type declarations.

(defun vector-modify (src &optional x y z)
  "Assign new components to a vector"
  (declare (type (simple-array single-float (3)) src))
  (when x (setf (vx src) (float x 1.0)))
  (when y (setf (vy src) (float y 1.0)))
  (when z (setf (vz src) (float z 1.0)))
  src)

That doesn't affect performance much, but even without modifying vector-multiply-to it has dropped to 2-3% in our profile from almost 6%. In this case it is because SBCL is smart enough to know that vx, vy, and vz only accept ax-vectors, and ax-vectors only store single-floats, so if vx returns, the result must be a single-float and can be multiplied directly. Unfortunately the extra performance there was matched by vector-modify getting a bit slower due to converting types. Switching from apply to a direct call to vector-modify and noticing that copying n to the normal variable isn't needed brings us to about 108FPS, another 0.5ms.

Step 4: clean up the data

At this point, the top of the profile looks something like

           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
------------------------------------------------------------------------
   1    230  23.5    721  73.6    230  23.5        -  TRADEWARZ::RENDER-NODE
   2    102  10.4    102  10.4    332  33.9        -  SB-KERNEL:%SINGLE-FLOAT
   3     44   4.5     71   7.3    376  38.4        -  TRADEWARZ::VECTOR-MODIFY
   4     43   4.4    954  97.4    419  42.8        -  TRADEWARZ::LOOP-SCENE

and the main problems in render-node are

   109  11.1                   CL-OPENGL:MULT-TRANSPOSE-MATRIX [6]
    70   7.2                   TRADEWARZ::VECTOR-MODIFY [3]
    59   6.0                   SB-KERNEL:%SINGLE-FLOAT [2]
    50   5.1                   CFFI::%FOREIGN-ENUM-VALUE [14]
    32   3.3                   CL-OPENGL-BINDINGS:CHECK-ERROR [7]
    21   2.1                   TRADEWARZ::VECTOR-MULTIPLY-TO [9]
    16   1.6                   TRADEWARZ::GET-MODEL [28]
    15   1.5                   (SB-PCL::FAST-METHOD TRADEWARZ::GET-SIZE (TRADEWARZ::MODEL)) [29]
    11   1.1                   SB-IMPL::GETHASH3 [24]

along with a few percent from bits of CLOS internals.

We can't do much about the mult-transpose-matrix call at this point. Possibly it could be faster for this specific case, but it is in a separate library, so we will leave it for now.

Next on the list are vector-modify and sb-kernel:%single-float. In this case, both are translating static data into a specific form. Instead of doing this every time we use it, we should just convert the data to the final form when we load it.

Some of the data is created by load-obj which uses make-vector internally, so there we just need to tell it not to convert them to lists at the end. Other objects are specified directly in the sexp-based data files, so we need to translate them by hand.

;; convert a list of lists of lists to a list of lists of ax-vector
(defun translate-geometry (geometry)
  (flet ((x (x)
           (etypecase x
             (list
              (make-vector (or (first x) 0) (or (second x) 0) (or (third x) 0)))
             ((simple-array single-float (3))
              x)
             (vector
              (make-vector (aref x 0) (aref x 1) (aref x 2))))))
    (loop for v in geometry
          ;; some entries don't have normals, but also don't disable
          ;; lighting so add a reasonable default normal instead of
          ;; just using last normal from previous thing drawn
          if (first v)
            collect (mapcar #'x v)
          else collect (cons (make-vector 0 -1 0)
                             (mapcar #'x v)))))

Once that is done we can modify render-node to use the vectors directly, bringing us to about 135FPS, 1.85ms less than previously and 3.7ms better than where we started. That is a good example of why we compare in ms instead of FPS: we have 2 sets of optimizations that both improved frame time by about 1.85ms, which we can trivially compare. Measured in FPS, one was 18FPS improvement, and the other was 27FPS. Even if we know FPS is non-linear and know those gains were relative to 90FPS and 108FPS respectively, it would be difficult to intuitively see that both are equivalent improvements.

The model method on get-size is creating a new ax-vector on every call, so lets just store the size in that format to start with as well. In addition, we are still spending a few % scaling vertices at runtime even though the scale doesn't change within an object (or even over time), so lets let GL handle it as suggested in step 3. Since it is a property of the object and not the node, we can't put it in the world-basis matrix directly without some way to propagate changes from objects to nodes (and if it never changes, we'd be better off just multiplying the vertex data on load.) Instead, we will just call gl:scale for each node (and enable :normalize and make sure we don't have 0 scale, since gl:scale affects normals as well).

With the more efficient scaling, we are up another 0.7ms or so.

Step 5: some implementation-specific tricks

Now our profile looks like

           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
------------------------------------------------------------------------
   1    589  30.0   1493  76.0    589  30.0        -  TRADEWARZ::RENDER-NODE
   2    104   5.3    104   5.3    693  35.3        -  CL-OPENGL-BINDINGS:CHECK-ERROR
   3     97   4.9   1939  98.7    790  40.2        -  TRADEWARZ::LOOP-SCENE
   4     95   4.8     95   4.8    885  45.1        -  (LAMBDA (SB-PCL::.ARG0.) :IN "...")
   5     80   4.1    303  15.4    965  49.1        -  CL-OPENGL:MULT-TRANSPOSE-MATRIX
   6     73   3.7     73   3.7   1038  52.9        -  (LAMBDA (SB-PCL::.ARG0. SB-PCL::.ARG1. SB-PCL::.ARG2.) :IN "...")
   7     63   3.2     74   3.8   1101  56.1        -  (FLET #:BODY-FUN-640 :IN SB-IMPL::GETHASH3)
   8     46   2.3     46   2.3   1147  58.4        -  SB-KERNEL:HAIRY-DATA-VECTOR-REF
   9     44   2.2     44   2.2   1191  60.6        -  SB-IMPL::OPTIMIZED-DATA-VECTOR-REF
  10     38   1.9     38   1.9   1229  62.6        -  (LAMBDA (SB-PCL::.ARG0. SB-PCL::.ARG1.) :IN "...")
  11     36   1.8     36   1.8   1265  64.4        -  SB-KERNEL:%SINGLE-FLOAT

That is starting to look better, loop-scene is just iterating the scene, so that being at #3 suggests we aren't wasting too much time per item. cl-opengl-bindings:check-error is how cl-opengl implements its automatic error checking. We could turn that off for a 5% improvement for final release, but for development it is better to keep it on. Aside from that we also have a few bits of CLOS internals, mult-transpose-matrix, some array internals and still some calls to %single-float.

sb-kernel:hairy-data-vector-ref is a sign SBCL doesn't know anything about the type of an array it is working with (for example whether it is displaced, adjustable, etc). That is usually a sign we could either use a more efficient array type somewhere, or give it more declarations, or both. In this case, if we look at the implementation of mult-transpose-matrix it has to make a copy of the matrix in a form that can be passed to the OpenGL api. Looking at the call-graph part of the profile, the *-vector-ref and %single-float calls are all in mult-transpose-matrix. In this case, we know we have a (simple-array single-float (16)) so we can do better than the generic code in cl-opengl, particularly on platforms where we can pass the contents of typed arrays directly to C functions.

CFFI has a wrapper for abstracting that ability, cffi:with-pointer-to-vector-data. Unfortunately it does extra work on platforms where using CL vectors directly isn't supported, since it copies the array back in case the foreign call made any changes to it, which we don't need. If we limit it to a few specific implementations it is still easier than adding code for each of them specifically, and we can fall back to using cl-opengl otherwise.

        (let ((mat (world-basis node)))
          #+(or cmucl sbcl ccl)
          (if (typep mat '(simple-array single-float (16)))
              (cffi:with-pointer-to-vector-data (p mat)
                (%gl:mult-transpose-matrix-f p))
              (gl:mult-transpose-matrix mat))
          #-(or cmucl sbcl ccl)
          (gl:mult-transpose-matrix mat))

That gives us another 0.5ms, bringing the total to under 6ms from a start of over 11ms.

A note about cl-opengl packages

Cl-opengl has 2 APIs, exposed by the gl: and %gl: packages. The "high-level" gl: package tries to be "easy to use" and accepts CL data types, translating as needed. The "low-level" %gl: package is an automatically generated set of CFFI wrappers for the whole OpenGL (and OpenGL-ES) API, along with all extensions. Functions in %gl: usually take c-style pointers, which we usually create with CFFI or static-vectors, but which need managed by hand.

Both gl: and %gl: packages are officially supported APIs. High performance code will probably need to use both, since the extra translations done by gl: functions frequently add overhead that can be avoided as in the above example. Libraries implementing their own "high-level" APIs will probably also want to use %gl: directly.

Step 6: CFFI tricks (don't do this one in normal code)

There are a few things related to sb-impl::gethash3 near the top of the profile, which is part of SBCL's hash table implementation. Checking the call graph, we see they come from cffi::%foreign-enum-value, which is also responsible for some of the unidentifiable CLOS internals we have been ignoring until now. They are due to converting symbols like :texture-2d and :triangles in the CL code to the numbers used in the C API. In particular, we pass (primitive model) to gl:with-primitive, so there is no way cl-opengl and CFFI could translate it at compile time.

It usually isn't worth worrying about that sort of thing, since we should be trying to minimize GL calls anyway. If it does end up being a problem, we can do the translation in advance with cffi:foreign-enum-value. It requires us to specify which "foreign enum" to use for looking up the symbol, in this case %gl:enum. We can find that by looking at the definition of with-primitive, seeing it passes the value to gl:begin, then checking the definition of that to see that the type of the mode argument is enum. (%gl:enum is usually the right thing, since GL doesn't break the type up, but a few things in cl-opengl use more specific enums.)

CL-OpenGL could theoretically optimize the translation of the constant :texture-2d argument to gl:bind-texture, but currently doesn't. In that case there isn't a convenient place to store it in advance like the primitive type in model, but we can look it up at read time or define a constant.

That leaves the CLOS bits, which are due to CFFI not handling enums as well as it should. With a small patch to CFFI, we can get rid of those as well, bringing FPS to about 207, and saving another 1ms.