Optimization: Tradewarz part 2

In part 1, we looked at some lower-level optimizations. This time we will look at the higher-level optimizations we should have done to start with, mostly focusing on using the OpenGL API more efficiently. Code for this part is in 3b-optimization-part-2.

Some general notes about OpenGL optimization/history

When optimizing OpenGL rendering, we need to take into account that the rendering process is broken up into a bunch of steps handled by different bits of hardware. The first (and usually most important) division is between CPU and GPU. For rendering we usually want to try to keep the GPU as busy as possible, both because it is the part doing most of the work and because we tend to have other things we want to use the CPU for, like AI in a game.

The previous version of the code is using what is known as "Immediate mode", calling gl:color, gl:tex-coord, gl:normal, and gl:vertex for each vertex on each object. When OpenGL (and its ancestor IRIS GL) was originally developed, those operations mapped fairly closely to the rendering hardware. That no longer applies today, and while "immediate mode" is convenient, it is very inefficient. To start with, it requires lots of function calls which is fairly expensive even without the extra overhead of calling the GL API from another language. On top of that, it usually leaves the GPU idling while it waits for more work.

The other extreme is called "retained mode", where you pass the entire scene to some API and let that manage all the rendering. That isn't very flexible, so we usually do something in between (which is still frequently classified as "retained mode".) We group up data as much as possible, store the data in GPU memory, and tell the GPU to draw things in large batches. For example we might store all the geometry for a specific model together and tell GL to draw that with a few calls, instead of making thousands of calls to tell it how to draw each vertex.

Over time, OpenGL has added a few different ways of grouping data into chunks that can be drawn at once. Originally, there were "display lists", which basically just recorded a sequence of immediate mode commands for later playback. Later, there were additions to allow storing a block of vertex data in CPU memory and telling the API to draw some or all of the vertices in that buffer. That reduced the amount of overhead from calling the API. Once GPUs started having lots of very fast ram, that still left transferring the data to the GPU every frame as a problem. The most recent version is the "Vertex Buffer Object" or "VBO". A VBO is a block of memory managed by the API so the API knows when the data changes, unlike the previous versions that used memory allocated by the calling code. This allows the API to store the data in GPU ram once for the usual case where the data doesn't change.

There is also a "Vertex Array Object" or "VAO" API built on top of VBOs, which allows us to combine a particular VBO configuration into a single object, again reducing the number of calls needed to draw a particular object.

Step 1: convert data to VBOs

To use a VBO, we need to do a few things: create a VBO object, format the data properly, copy it into the VBO, tell GL how to interpret the data, and tell it which parts of the data to draw.

Creating a VBO object works like creating a texture, we just call gl:gen-buffers to allocate 1 or more VBO IDs, then gl:delete-buffers when we are done with them.

We have a few options for "proper" formats for the data. When we tell GL how to interpret it, we specify what type it is (single-float is a good choice for most things), how many elements of that type make up the data for a particular vertex (for example a normal would have 3 values per vertex), and how far apart the data for a vertex is in the buffer. The last one lets us skip some space between entries, which we could use to store the other parts for that vertex. That is generally described as "interleaved" vertex data, so for example the buffer might have 3 floats of position, then 3 floats of normal, then 2 floats of UV, and 4 bytes of color data, then continue with position,normal,etc for next vertex. The other option is to have all of the position data together, all the normal data together, etc. They can either be in separate parts of a single VBO or in separate VBOs for each type of data. We can also combine the various strategies, for example if we had multiple objects that were identical except color we might have color in 1 buffer and everything else interleaved in another, with the objects sharing the interleaved buffer.

It doesn't usually matter too much which way we format it, so we will use one interleaved buffer just to demonstrate how to set it up.

Since we store all the data on the GPU, we can't use the length of a CL array to tell us how many vertices to draw. We will add that to the model class, and we also need somewhere to store the VAO ID.

   (vao :accessor vao
        :initarg :vao
        :initform nil)
   (vertex-count :accessor vertex-count
                 :initarg :vertex-count
                 :initform nil)

For now we will keep the geometry in the model as well since we aren't sure if anything else in the game will need it. If not, we could get rid of it as soon as the VAO is created.

Next is building and configuring the actual VBO/VAO:

(defun create-vao (model)
  (let* ((vao (gl:gen-vertex-array))
         (vbo (car (gl:gen-buffers 1)))
         (number-of-vertices (length (geometry model)))
         ;; 3 floats for normal, 3 for position, 2 for UV, 3 for color
         (floats-per-vertex (+ 3 3 2 3))
         (float-size 4)
         (bytes-per-vertex (* floats-per-vertex float-size)))
           ;; fill a buffer with vertex data
           (cffi:with-foreign-object (buffer :float (* floats-per-vertex
             ;; add a helper function to fill the buffer, I starts from -1 so
             ;; we can directly use the value returned from INCF as the index
             (let ((i -1))
               (labels ((add (a n &key swap)
                          (assert (< i (* floats-per-vertex
                          (loop for x below n
                                do (setf (cffi:mem-aref buffer :float (incf i))
                                         (if swap
                                             (aref a (- n x 1))
                                             (aref a x))))))
                 (loop for (n v uv c) in (geometry model)
                       do (add v 3)
                          (add n 3)
                          (add uv 2 :swap t)
                          (add c 3))))
             ;; copy it into the VBO
             (gl:bind-vertex-array vao)
             (gl:bind-buffer :array-buffer vbo)
             (%gl:buffer-data :array-buffer (* bytes-per-vertex
                              buffer :static-draw)
             ;; set up the VAO
             (gl:enable-client-state :vertex-array)
             (%gl:vertex-pointer 3 :float bytes-per-vertex
             (gl:enable-client-state :normal-array)
             (%gl:normal-pointer :float bytes-per-vertex
                                 (* 3 float-size))
             (gl:enable-client-state :texture-coord-array)
             (%gl:tex-coord-pointer 2 :float bytes-per-vertex
                                    (* 6 float-size))
             (gl:enable-client-state :color-array)
             (%gl:color-pointer 3 :float bytes-per-vertex
                                (* 8 float-size))
             ;; done modifying the VAO, so turn it off again
             (gl:bind-vertex-array 0)
             ;; we don't need the VBO object anymore, since the VAO keeps
             ;; a reference to it
             (gl:delete-buffers (list vbo))
             (setf vbo nil)
             ;; store the VAO and number of vertices in model
             (setf (vertex-count model) number-of-vertices
                   (vao model) (shiftf vao nil))))
      ;; make sure VAO and VBO get deleted in case we had an error
      (when vbo (gl:delete-buffers (list vbo)))
      (when vao (gl:delete-vertex-arrays (list vao))))))

First we create the GL objects, with gl:gen-vertex-array and gl:gen-buffers. (there should probably be a version of gl:gen-buffers that only makes 1, but there isn't yet, so we get a list of 1).

Next we calculate some sizes, assuming all of our data is stored as 4-byte floats (which is probably the same as a single-float in most CL implementations).

Once we know how much data we have, we allocate a temporary "foreign" array with cffi:with-foreign-object. We pass it the name of a variable (buffer), the foreign type we want (:float), and the number of elements to allocate. We fill the buffer with cffi:mem-aref which works sort of like aref except we have to specify what sort of data is in buffer. The one complicated bit is that the original data has UV swapped, so we need to handle that now since we can't swap it as easily during rendering. (We could swap it with a texture matrix or shader, but just modifying the data is easier.)

After the buffer is filled, we can send it to GL. We bind our VAO and VBO with gl:bind-vertex-array and gl:bind-buffer respectively. We use the :array-buffer target for the VBO, since that is where it looks for vertex data. %gl:buffer-data is used to copy the data from the foreign pointer buffer into the currently bound VBO. We specify the size in bytes, and specify :static-draw for the 'usage' parameter, which tells GL we will be sending the data once ("static") and using it as the source data for drawing ("draw").

The last step is setting up the VAO. Here we are using the older "fixed-function" style of rendering instead of shaders, so we use functions like gl:enable-client-state and %gl:vertex-pointer to specify where our vertex components are relative to the currently bound VBO.

(gl:enable-client-state :vertex-array) tells GL it should look for position data in the array instead of from calls to gl:vertex, similarly for :normal-array and gl:normal, :texture-coord-array and gl:tex-coord, etc.

%gl:vertex-pointer specifies where it should look for the position data. The first 2 arguments tell it we are storing 3 :float values per vertex. The next argument tells it that there is bytes-per-vertex (in this case 44) bytes between the start of the data for each vertex. The extra data leaves space for the other components. If we had separate VBOs for each component we could leave that argument as 0 telling GL to calculate it from the type and count. The last argument tells GL that that the position data starts 0 bytes from the start of the buffer.

Similarly, we use gl:normal-pointer to specify the first normal starts 12 bytes into the start of the buffer, right after the first position (normals always have 3 components, so we don't need to specify that), and so on for %gl:tex-coord-pointer and %gl:color-pointer.

The various *-pointer functions work relative to the currently bound VBO when they are called, so if we were using multiple VBOs, we would just bind them in turn before setting up the corresponding pointer.

Once GL knows how our VAO is arranged, we unbind it so we don't accidentally modify it, and delete the VBO object since we don't need to access it other than through the VAO. Finally, we store the VAO and size in the model object and move on to actually rendering them.

Step 2: use the VBOs/VAOs

This part is easy, instead of gl:with-primitive and a loop specifying each vertex, we just bind the VAO and tell it to draw some number of vertices from it.

        (gl:bind-vertex-array (vao model))
        (%gl:draw-arrays (primitive model) 0 (vertex-count model))

The function %gl:draw-arrays tells GL to draw sequential vertices from the currently enabled arrays (which arrays are enabled is stored in the VAO, so we just need to bind that to enable them again). The first argument specifies what type of primitive to draw, as in gl:with-primitives. The second argument lets us specify an offset in case we want to skip some vertices, for example if we had multiple objects in 1 buffer. The last is the number of vertices to draw.

Then at some point we probably want to bind VAO 0 again in case there is other drawing happening that it might confuse, but we can save a bit of time by only doing that once after we finish drawing all the objects.

Skipping all of the individual gl:vertex, gl:normal, etc calls brings us up to about 300FPS, about 1.4 ms faster (leaving about 3.3ms per frame).

Step 3: Trade some memory for fewer calls

Right now, we are using about 30% of GPU, which is better than where we started, but still wasting a lot of power. We are currently drawing 1133 objects: 1 axis marker, 1 tank, 1 jet, 300 hex tiles, and 830 digits indicating the hex coordinates. Unlike the jets and tanks. the tiles and coordinates will probably never move. Assuming we can use a single texture (or texture-array) for the whole set, we can combine the whole map into 1 or 2 objects.

Converting the hexes is a bit messy since the hexes use :triangle-strip and we want :triangle for the combined object. Just appending the triangle strips won't work, since they might not be directly adjacent to each other, and we don't want random triangles connecting separate tiles.

(defparameter *map-data* nil)

(defun add-geometry (model hash size offset)
  (let ((tristrip (member (primitive model)
                            ,(cffi:foreign-enum-value '%gl:enum
        (old1 nil)
        (old2 nil)
        (odd nil))
    (flet ((de-strip (x)
             (if tristrip
                 (let ((v1 old1)
                       (v2 old2))
                   (when v1
                     (when odd (rotatef v1 v2))
                     (setf odd (not odd))
                     (push v1 (gethash (image model) hash))
                     (push v2 (gethash (image model) hash))
                     (push x (gethash (image model) hash)))
                   (shiftf old1 old2 x))
                 (push x (gethash (image model) hash)))))
      (loop for (n v uv c) in (geometry model)
            do (de-strip (list n
                               (vector-add (vector-multiply size v)
                                           (apply #'make-vector offset))

(defmethod draw-tile :around (shape x y)
  (if *map-data*
      ;; building a VBO, add the geometry of the node to an existing
      ;; array
      (let* ((tile-id (aref (tiles (current-map)) y x))
             (tile (get-model (find-tile tile-id)))
             (size (tile-size (current-map)))
             (offset (map 'list '* size (call-next-method shape x (- y)))))
        (add-geometry tile *map-data* size offset)
        (draw-tile-coords x y offset))
      ;; otherwise just add separate nodes
      (let* ((tile-id (aref (tiles (current-map)) y x))
             (node (make-node (find-tile tile-id)))
             (size (tile-size (current-map)))
             (offset (map 'list '* size (call-next-method shape x (- y)))))
        (add-node node)
        (apply #'vector-modify (dv node) offset)
        (draw-tile-coords x y offset))))

(defun generate-map ()
  (let ((shape (tile-shape (current-map))))
    (let ((*map-data* (make-hash-table :test 'equal)))
      (loop with (h w) = (array-dimensions (tiles (current-map)))
            for x below (or w 0)
            do (loop for y below (or h 0)
                     do (draw-tile shape x y)))
       (lambda (k v)
         (let* ((map-name (intern (format nil "~:@(map-~a~)" k)))
                (model (make-instance 'model :name map-name :image k
                                             :geometry (reverse v)
                                             :primitive :triangles
                                             :size '(1.0 1.0 1.0)
           (find-radial-extent model)
           (create-vao model)
           (setf (gethash map-name (models (current-scene))) model)
           (add-node (make-node map-name))))

Not too much interesting to talk about here. Instead of adding a node for each tile, we loop through the geometry and add it to a list of vertices. We make a separate list for each texture used so it will properly handle tiles with a different texture, though we will want to keep the number of textures down or we won't gain any performance. It is pretty common to have maps like this drawn with 1 big texture containing a bunch of tile-sized textures within it, so that shouldn't be a problem. Alternately, we could have a single texture that stretches across the whole map.

Once we have the buffers of geometry we just build a new model object using same VAO functions as before and add it to the scene.

Combining tiles into 1 object brings us to about 650FPS, saving another 1.8ms.

Combining the tile coordinates into 1 object is similar, aside from needing to convert :line-loop to :lines. With all of the coordinates in 1 buffer, we save another 1.45ms bringing us to around 11500FPS and 80% GPU utilization.

Step 4: give it more work to do and see how it scales

Now we are down to about 5 objects, so lets add some more models:

helicopter fractal

With a thousand or so helicopters (each made of 3 objects) in a big animated hierarchy, we are still running around 250FPS with most of the time spent updating the transform hierarchy rather than rendering.