[insert blog here]
Optimization: Tradewarz part 2Fri, 26 Dec 2014 19:49:52 CST

In part 1, we looked at some lower-level optimizations. This time we will look at the higher-level optimizations we should have done to start with, mostly focusing on using the OpenGL API more efficiently. Code for this part is in 3b-optimization-part-2.

Some general notes about OpenGL optimization/history

When optimizing OpenGL rendering, we need to take into account that the rendering process is broken up into a bunch of steps handled by different bits of hardware. The first (and usually most important) division is between CPU and GPU. For rendering we usually want to try to keep the GPU as busy as possible, both because it is the part doing most of the work and because we tend to have other things we want to use the CPU for, like AI in a game.

The previous version of the code is using what is known as "Immediate mode", calling gl:color, gl:tex-coord, gl:normal, and gl:vertex for each vertex on each object. When OpenGL (and its ancestor IRIS GL) was originally developed, those operations mapped fairly closely to the rendering hardware. That no longer applies today, and while "immediate mode" is convenient, it is very inefficient. To start with, it requires lots of function calls which is fairly expensive even without the extra overhead of calling the GL API from another language. On top of that, it usually leaves the GPU idling while it waits for more work.

The other extreme is called "retained mode", where you pass the entire scene to some API and let that manage all the rendering. That isn't very flexible, so we usually do something in between (which is still frequently classified as "retained mode".) We group up data as much as possible, store the data in GPU memory, and tell the GPU to draw things in large batches. For example we might store all the geometry for a specific model together and tell GL to draw that with a few calls, instead of making thousands of calls to tell it how to draw each vertex.

Over time, OpenGL has added a few different ways of grouping data into chunks that can be drawn at once. Originally, there were "display lists", which basically just recorded a sequence of immediate mode commands for later playback. Later, there were additions to allow storing a block of vertex data in CPU memory and telling the API to draw some or all of the vertices in that buffer. That reduced the amount of overhead from calling the API. Once GPUs started having lots of very fast ram, that still left transferring the data to the GPU every frame as a problem. The most recent version is the "Vertex Buffer Object" or "VBO". A VBO is a block of memory managed by the API so the API knows when the data changes, unlike the previous versions that used memory allocated by the calling code. This allows the API to store the data in GPU ram once for the usual case where the data doesn't change.

There is also a "Vertex Array Object" or "VAO" API built on top of VBOs, which allows us to combine a particular VBO configuration into a single object, again reducing the number of calls needed to draw a particular object.

Step 1: convert data to VBOs

To use a VBO, we need to do a few things: create a VBO object, format the data properly, copy it into the VBO, tell GL how to interpret the data, and tell it which parts of the data to draw.

Creating a VBO object works like creating a texture, we just call gl:gen-buffers to allocate 1 or more VBO IDs, then gl:delete-buffers when we are done with them.

We have a few options for "proper" formats for the data. When we tell GL how to interpret it, we specify what type it is (single-float is a good choice for most things), how many elements of that type make up the data for a particular vertex (for example a normal would have 3 values per vertex), and how far apart the data for a vertex is in the buffer. The last one lets us skip some space between entries, which we could use to store the other parts for that vertex. That is generally described as "interleaved" vertex data, so for example the buffer might have 3 floats of position, then 3 floats of normal, then 2 floats of UV, and 4 bytes of color data, then continue with position,normal,etc for next vertex. The other option is to have all of the position data together, all the normal data together, etc. They can either be in separate parts of a single VBO or in separate VBOs for each type of data. We can also combine the various strategies, for example if we had multiple objects that were identical except color we might have color in 1 buffer and everything else interleaved in another, with the objects sharing the interleaved buffer.

It doesn't usually matter too much which way we format it, so we will use one interleaved buffer just to demonstrate how to set it up.

Since we store all the data on the GPU, we can't use the length of a CL array to tell us how many vertices to draw. We will add that to the model class, and we also need somewhere to store the VAO ID.

   (vao :accessor vao
        :initarg :vao
        :initform nil)
   (vertex-count :accessor vertex-count
                 :initarg :vertex-count
                 :initform nil)

For now we will keep the geometry in the model as well since we aren't sure if anything else in the game will need it. If not, we could get rid of it as soon as the VAO is created.

Next is building and configuring the actual VBO/VAO:

(defun create-vao (model)
  (let* ((vao (gl:gen-vertex-array))
         (vbo (car (gl:gen-buffers 1)))
         (number-of-vertices (length (geometry model)))
         ;; 3 floats for normal, 3 for position, 2 for UV, 3 for color
         (floats-per-vertex (+ 3 3 2 3))
         (float-size 4)
         (bytes-per-vertex (* floats-per-vertex float-size)))
           ;; fill a buffer with vertex data
           (cffi:with-foreign-object (buffer :float (* floats-per-vertex
             ;; add a helper function to fill the buffer, I starts from -1 so
             ;; we can directly use the value returned from INCF as the index
             (let ((i -1))
               (labels ((add (a n &key swap)
                          (assert (< i (* floats-per-vertex
                          (loop for x below n
                                do (setf (cffi:mem-aref buffer :float (incf i))
                                         (if swap
                                             (aref a (- n x 1))
                                             (aref a x))))))
                 (loop for (n v uv c) in (geometry model)
                       do (add v 3)
                          (add n 3)
                          (add uv 2 :swap t)
                          (add c 3))))
             ;; copy it into the VBO
             (gl:bind-vertex-array vao)
             (gl:bind-buffer :array-buffer vbo)
             (%gl:buffer-data :array-buffer (* bytes-per-vertex
                              buffer :static-draw)
             ;; set up the VAO
             (gl:enable-client-state :vertex-array)
             (%gl:vertex-pointer 3 :float bytes-per-vertex
             (gl:enable-client-state :normal-array)
             (%gl:normal-pointer :float bytes-per-vertex
                                 (* 3 float-size))
             (gl:enable-client-state :texture-coord-array)
             (%gl:tex-coord-pointer 2 :float bytes-per-vertex
                                    (* 6 float-size))
             (gl:enable-client-state :color-array)
             (%gl:color-pointer 3 :float bytes-per-vertex
                                (* 8 float-size))
             ;; done modifying the VAO, so turn it off again
             (gl:bind-vertex-array 0)
             ;; we don't need the VBO object anymore, since the VAO keeps
             ;; a reference to it
             (gl:delete-buffers (list vbo))
             (setf vbo nil)
             ;; store the VAO and number of vertices in model
             (setf (vertex-count model) number-of-vertices
                   (vao model) (shiftf vao nil))))
      ;; make sure VAO and VBO get deleted in case we had an error
      (when vbo (gl:delete-buffers (list vbo)))
      (when vao (gl:delete-vertex-arrays (list vao))))))

First we create the GL objects, with gl:gen-vertex-array and gl:gen-buffers. (there should probably be a version of gl:gen-buffers that only makes 1, but there isn't yet, so we get a list of 1).

Next we calculate some sizes, assuming all of our data is stored as 4-byte floats (which is probably the same as a single-float in most CL implementations).

Once we know how much data we have, we allocate a temporary "foreign" array with cffi:with-foreign-object. We pass it the name of a variable (buffer), the foreign type we want (:float), and the number of elements to allocate. We fill the buffer with cffi:mem-aref which works sort of like aref except we have to specify what sort of data is in buffer. The one complicated bit is that the original data has UV swapped, so we need to handle that now since we can't swap it as easily during rendering. (We could swap it with a texture matrix or shader, but just modifying the data is easier.)

After the buffer is filled, we can send it to GL. We bind our VAO and VBO with gl:bind-vertex-array and gl:bind-buffer respectively. We use the :array-buffer target for the VBO, since that is where it looks for vertex data. %gl:buffer-data is used to copy the data from the foreign pointer buffer into the currently bound VBO. We specify the size in bytes, and specify :static-draw for the 'usage' parameter, which tells GL we will be sending the data once ("static") and using it as the source data for drawing ("draw").

The last step is setting up the VAO. Here we are using the older "fixed-function" style of rendering instead of shaders, so we use functions like gl:enable-client-state and %gl:vertex-pointer to specify where our vertex components are relative to the currently bound VBO.

(gl:enable-client-state :vertex-array) tells GL it should look for position data in the array instead of from calls to gl:vertex, similarly for :normal-array and gl:normal, :texture-coord-array and gl:tex-coord, etc.

%gl:vertex-pointer specifies where it should look for the position data. The first 2 arguments tell it we are storing 3 :float values per vertex. The next argument tells it that there is bytes-per-vertex (in this case 44) bytes between the start of the data for each vertex. The extra data leaves space for the other components. If we had separate VBOs for each component we could leave that argument as 0 telling GL to calculate it from the type and count. The last argument tells GL that that the position data starts 0 bytes from the start of the buffer.

Similarly, we use gl:normal-pointer to specify the first normal starts 12 bytes into the start of the buffer, right after the first position (normals always have 3 components, so we don't need to specify that), and so on for %gl:tex-coord-pointer and %gl:color-pointer.

The various *-pointer functions work relative to the currently bound VBO when they are called, so if we were using multiple VBOs, we would just bind them in turn before setting up the corresponding pointer.

Once GL knows how our VAO is arranged, we unbind it so we don't accidentally modify it, and delete the VBO object since we don't need to access it other than through the VAO. Finally, we store the VAO and size in the model object and move on to actually rendering them.

Step 2: use the VBOs/VAOs

This part is easy, instead of gl:with-primitive and a loop specifying each vertex, we just bind the VAO and tell it to draw some number of vertices from it.

        (gl:bind-vertex-array (vao model))
        (%gl:draw-arrays (primitive model) 0 (vertex-count model))

The function %gl:draw-arrays tells GL to draw sequential vertices from the currently enabled arrays (which arrays are enabled is stored in the VAO, so we just need to bind that to enable them again). The first argument specifies what type of primitive to draw, as in gl:with-primitives. The second argument lets us specify an offset in case we want to skip some vertices, for example if we had multiple objects in 1 buffer. The last is the number of vertices to draw.

Then at some point we probably want to bind VAO 0 again in case there is other drawing happening that it might confuse, but we can save a bit of time by only doing that once after we finish drawing all the objects.

Skipping all of the individual gl:vertex, gl:normal, etc calls brings us up to about 300FPS, about 1.4 ms faster (leaving about 3.3ms per frame).

Step 3: Trade some memory for fewer calls

Right now, we are using about 30% of GPU, which is better than where we started, but still wasting a lot of power. We are currently drawing 1133 objects: 1 axis marker, 1 tank, 1 jet, 300 hex tiles, and 830 digits indicating the hex coordinates. Unlike the jets and tanks. the tiles and coordinates will probably never move. Assuming we can use a single texture (or texture-array) for the whole set, we can combine the whole map into 1 or 2 objects.

Converting the hexes is a bit messy since the hexes use :triangle-strip and we want :triangle for the combined object. Just appending the triangle strips won't work, since they might not be directly adjacent to each other, and we don't want random triangles connecting separate tiles.

(defparameter *map-data* nil)

(defun add-geometry (model hash size offset)
  (let ((tristrip (member (primitive model)
                            ,(cffi:foreign-enum-value '%gl:enum
        (old1 nil)
        (old2 nil)
        (odd nil))
    (flet ((de-strip (x)
             (if tristrip
                 (let ((v1 old1)
                       (v2 old2))
                   (when v1
                     (when odd (rotatef v1 v2))
                     (setf odd (not odd))
                     (push v1 (gethash (image model) hash))
                     (push v2 (gethash (image model) hash))
                     (push x (gethash (image model) hash)))
                   (shiftf old1 old2 x))
                 (push x (gethash (image model) hash)))))
      (loop for (n v uv c) in (geometry model)
            do (de-strip (list n
                               (vector-add (vector-multiply size v)
                                           (apply #'make-vector offset))

(defmethod draw-tile :around (shape x y)
  (if *map-data*
      ;; building a VBO, add the geometry of the node to an existing
      ;; array
      (let* ((tile-id (aref (tiles (current-map)) y x))
             (tile (get-model (find-tile tile-id)))
             (size (tile-size (current-map)))
             (offset (map 'list '* size (call-next-method shape x (- y)))))
        (add-geometry tile *map-data* size offset)
        (draw-tile-coords x y offset))
      ;; otherwise just add separate nodes
      (let* ((tile-id (aref (tiles (current-map)) y x))
             (node (make-node (find-tile tile-id)))
             (size (tile-size (current-map)))
             (offset (map 'list '* size (call-next-method shape x (- y)))))
        (add-node node)
        (apply #'vector-modify (dv node) offset)
        (draw-tile-coords x y offset))))

(defun generate-map ()
  (let ((shape (tile-shape (current-map))))
    (let ((*map-data* (make-hash-table :test 'equal)))
      (loop with (h w) = (array-dimensions (tiles (current-map)))
            for x below (or w 0)
            do (loop for y below (or h 0)
                     do (draw-tile shape x y)))
       (lambda (k v)
         (let* ((map-name (intern (format nil "~:@(map-~a~)" k)))
                (model (make-instance 'model :name map-name :image k
                                             :geometry (reverse v)
                                             :primitive :triangles
                                             :size '(1.0 1.0 1.0)
           (find-radial-extent model)
           (create-vao model)
           (setf (gethash map-name (models (current-scene))) model)
           (add-node (make-node map-name))))

Not too much interesting to talk about here. Instead of adding a node for each tile, we loop through the geometry and add it to a list of vertices. We make a separate list for each texture used so it will properly handle tiles with a different texture, though we will want to keep the number of textures down or we won't gain any performance. It is pretty common to have maps like this drawn with 1 big texture containing a bunch of tile-sized textures within it, so that shouldn't be a problem. Alternately, we could have a single texture that stretches across the whole map.

Once we have the buffers of geometry we just build a new model object using same VAO functions as before and add it to the scene.

Combining tiles into 1 object brings us to about 650FPS, saving another 1.8ms.

Combining the tile coordinates into 1 object is similar, aside from needing to convert :line-loop to :lines. With all of the coordinates in 1 buffer, we save another 1.45ms bringing us to around 11500FPS and 80% GPU utilization.

Step 4: give it more work to do and see how it scales

Now we are down to about 5 objects, so lets add some more models:

helicopter fractal

With a thousand or so helicopters (each made of 3 objects) in a big animated hierarchy, we are still running around 250FPS with most of the time spent updating the transform hierarchy rather than rendering.

Optimization: Tradewarz part 1Wed, 24 Dec 2014 14:30:46 CST

Instead of the planned post on volume rendering, today's topic is optimization. I've wanted to write about it for a while, but I tend to either start with optimized primitives like sb-cga, or optimize incrementally as I add features, so didn't really have any good example code to work with. Conveniently axion on #lispgames has been working on a game to learn OpenGL, and due to a combination of implementing things from scratch to learn them better and the rest of us simplifying things to be able to explain them in steps, the code has lots of room for both low- and high-level optimization.

In part 1, I will start with relatively low-level optimization since the goal is to explain the optimization techniques rather than just make it faster. Higher level optimizations would have much more effect here, so normally they would be a priority. They would make the result of low-level optimization less obvious though, so we will pretend they aren't available for now.

The code so far renders a hex-based grid with some debug annotations (the red numbers in the screenshot above), and a few externally created models. I'm starting a few commits back from HEAD at the time of writing, since the actual code already has a few similar optimizations and I want to show the process leading up to them. The code for this post is in the 3b-optimization-part-1 branch of my fork on github.

Some general comments about optimizing for SBCL

I mostly use SBCL, particularly for code that needs to be fast, so my optimizations tend to focus on things that work well on sbcl.

When optimizing for sbcl, there are a few things worth keeping in mind:

  • declarations are treated as assertions unless the safety optimization property is set to 0.

    This is great for development, since catching errors early saves a lot of time, leaving more time to think of better high-level optimizations. When doing low-level optimizations, you need to keep it in mind though. A function that just does 1 addition will probably spend as much time in the type check as it gained from being able to use an optimized addition. Some people would use a (safety 0) declaration here to get rid of the type check, but I usually prefer to just inline the function. When the function is inlined SBCL is smart enough to skip the check if it knows the type from the surrounding code, and we still get errors when sbcl isn't sure and the code is wrong. For larger functions, the improvements from generating better code usually outweigh the type checks.

    Other lisps tend to just trust them, as SBCL does at (safety 0).

  • SBCL is pretty good at type propagation/optimizations.

    For some lisps, you want to sprinkle declarations and THE around the code to make sure the compiler knows what is going on. In sbcl I usually try to avoid THE, since that tends to be a sign that either something else isn't declared well enough or that you are lying to the compiler and sometimes it will break.

    For example, when adding small integers, rather than saying (the (unsigned-byte 32) (+ a b)) to get it to "act like C", in SBCL you should specify that you only care about 32 bits of the result with (ldb (byte 32 0) (+ a b)). If it also know the inputs are 32 bit integers, it will compile that down to a single instruction. Alternately, if it knows the inputs to (+ a b) are less than 32 bits, for example (unsigned-byte 10) or (integer 10 20), it can figure out the output is 32bits or less and again add them directly with a single instruction.

    With (optimize (speed 3)) SBCL will provide notes about things it could have compiled more efficiently if it knew more about the types.

    Sometimes the types aren't known in advance, but we have some common inputs we want to be fast. For example we might want to optimize for vectors of single-float, or (unsigned-byte 8) image data while still handling 16bit images correctly. In that case, duplicating the processing code within a typecase can improve speed quite a bit at the expense of increasing code size, since each branch can have optimized code for a particular type while the T branch keeps the slow but generic version.

    SBCL is also good at figuring out types from how previous functions react to a value. For example check-type will have mostly the same effect as a declaration on following code, or if you do (+ a 1) (unless a ...), it knows that if + returns, A must have been a number and can't be NIL.

  • "boxing"

    Most lisps need to store type information with values, frequently by storing it in a few bits of a pointer to the value. Some values smaller than a pointer (like small integers) can be stored directly, and still leave room for the type information. Converting between the form with type information and the form without is called "boxing" and "unboxing".

    Boxing usually isn't too noticeable, but when working with values right on the edge of being able to store directly (like (unsigned-byte 32) and single-float on 32 bit platforms, or (unsigned-byte 64) and double-float on 64bit), the extra overhead can make up a large fraction of the cost of a call to a small function. Similarly to the overhead of type checks, inlining is an easy way to avoid that problem. In some cases, it can also be useful to provide a pre-existing 'box' in the form of an appropriately typed array the function can write results to instead of returning them directly.

Step 1: Profiling

Before trying to optimize things, first we need to profile it. I usually use the sb-sprof statistical profiler in sbcl (it also has a deterministic profiler, so you might want to try both and see which gives more insight into a particular problem.)

We also want some overall measurement of performance. In this case we will use average time per frame. On my hardware, it gets about 90FPS = 11ms/frame with 100% of a CPU core, and about 12% GPU utilization according to nvidia-settings (and including everything else displaying on the system). Knowing that we are limited by CPU and not GPU is important, since optimizing CPU wouldn't help much if it was mostly waiting for the GPU to draw things.

Note that both measurements are fairly noisy. There are other things running on the system, and just recompiling a function can cause a few FPS gain or loss due to things like how a loop aligns to cache or whatever. In this case, we should be able to get a few orders of magnitude improvement once it gets to some of the higher level optimizations, so just relative amounts and general trends is enough. If we were working towards a few percent improvement it would be time for more precise measurements like running specific functions in a loop, averaging results over multiple runs, etc.

;; load the statistical profiler
(require 'sb-sprof)

;; quick utility to profile all threads for N seconds (default 10)
(defun run-profile (&optional (time 10))
  (sb-sprof:start-profiling :threads :all :mode :cpu)
  (sleep time)
  (sb-sprof:report :type :graph))

;; start the code

;; profile it

The output is pretty verbose, so I usually use the slime-sprof slime contrib which formats the output and makes it easier to navigate. Here, I'll just cut out chunks of the full output and point out the interesting parts.

With (sb-sprof:report :type :graph), the output is broken into 2 main parts: the "call graph" section and the "flat profile section". We'll start with the latter

           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
   1    157  23.7    521  78.7    157  23.7        -  TRADEWARZ::RENDER-NODE
   2     69  10.4     69  10.4    226  34.1        -  SB-KERNEL:%SINGLE-FLOAT
   3     32   4.8    649  98.0    258  39.0        -  TRADEWARZ::LOOP-SCENE
   4     30   4.5     51   7.7    288  43.5        -  CL-OPENGL:COLOR
   5     23   3.5     23   3.5    311  47.0        -  SB-KERNEL:TWO-ARG-*
   6     21   3.2     21   3.2    332  50.2        -  CL-OPENGL-BINDINGS:CHECK-ERROR
   7     20   3.0     20   3.0    352  53.2        -  (LAMBDA (SB-PCL::.ARG0.) :IN "...")
   8     14   2.1     27   4.1    366  55.3        -  TRADEWARZ::CONVERT-TO-OPENGL
   9     12   1.8     14   2.1    378  57.1        -  (FLET #:BODY-FUN-640 :IN SB-IMPL::GETHASH3)
  10     11   1.7     39   5.9    389  58.8        -  TRADEWARZ::VECTOR-MULTIPLY-TO

Here we see under the self column that tradewarz::render-node accounts for 23.7% of our time, so we will start there. We also note that sb-kernel:%single-float is number 2, which is a sign we are doing a lot of type conversions somewhere, and sb-kernel:two-arg-* at number 5, which is a routine used to multiply numbers of unknown type.

Next step is to find tradewarz::render-node in the call-graph section of the report (we want the block where tradewarz::render-node isn't indented as far):

                 Total.     Function
 Count     %  Count     %      Callees
     1   0.2                   TRADEWARZ::RENDER-NODE [1]
   521  78.7                   TRADEWARZ::LOOP-SCENE [3]
   157  23.7    521  78.7   TRADEWARZ::RENDER-NODE [1]
     1   0.2                   TRADEWARZ::CURRENT-MAP [45]
     2   0.3                   KEYWORDP [25]
     1   0.2                   "foreign function closure_tramp" [38]
     1   0.2                   TRADEWARZ::CURRENT-SCENE [34]
     1   0.2                   SB-C:RETURN-MULTIPLE [46]
    10   1.5                   (LAMBDA (SB-PCL::.ARG0.) :IN "...") [7]
    14   2.1                   CL-OPENGL-BINDINGS:CHECK-ERROR [6]
     5   0.8                   (SB-PCL::FAST-METHOD CFFI:TRANSLATE-TO-FOREIGN (T CFFI::FOREIGN-ENUM)) [33]
     6   0.9                   TRADEWARZ::MAKE-VECTOR [19]
     3   0.5                   (FLET #:CLEANUP-FUN-91 :IN CL-OPENGL-BINDINGS:BEGIN) [72]
     7   1.1                   (SB-PCL::FAST-METHOD TRADEWARZ::GET-SIZE (TRADEWARZ::MODEL)) [43]
     3   0.5                   "foreign function glColor4f" [32]
     8   1.2                   TRADEWARZ::GET-MODEL [29]
     6   0.9                   SB-VM::GENERIC-+ [10]
     5   0.8                   (FLET #:CLEANUP-FUN-76 :IN CL-OPENGL-BINDINGS:BIND-TEXTURE) [68]
     2   0.3                   "foreign function funcallable_instance_tramp" [31]
    19   2.9                   CFFI::%FOREIGN-ENUM-VALUE [35]
     5   0.8                   TRADEWARZ::VECTOR-MODIFY [18]
     5   0.8                   (LAMBDA (SB-PCL::.ARG0.) :IN "...") [23]
     6   0.9                   (LAMBDA (SB-PCL::.ARG0. SB-PCL::.ARG1.) :IN "...") [21]
     1   0.2                   TRADEWARZ::RENDER-NODE [1]
    39   5.9                   TRADEWARZ::VECTOR-MULTIPLY-TO [11]
     6   0.9                   SB-IMPL::GETHASH3 [28]
    46   6.9                   TRADEWARZ::CONVERT-TO-OPENGL-NEW [27]
    45   6.8                   SB-KERNEL:%SINGLE-FLOAT [2]
    33   5.0                   CL-OPENGL:MULT-MATRIX [17]
    50   7.6                   CL-OPENGL:COLOR [4]

The right column can be a bit confusing, the way to read it is that the less indented line (tradewarz::render-node) is the function being called, lines above that describe relation to functions that called it, and below describe functions it calls. The numbers like [1] at the end of the line indicate which block in the graph describes that function, here render-node is the first entry in the call-graph report.

In this case, render-node is called from loop-scene, and 78.7% of the time within calls to loop-scene is spent in render-node. Similarly, we can see that render-node spends a lot of time in gl:color, tradewarz::convert-to-opengl-new, sb-kernel:%single-float, tradewarz::vector-multiply-to, and cl-opengl:mult-matrix.

Step 2: some low-hanging fruit

It is interesting that gl:color shows up in the profile, but similar functions like gl:tex-coord which are called the same amount and have nearly the same definition don't. Looking at the source for render-node, we see that gl:color is called through apply while the others are called directly. Replacing that with (gl:color (first c) (second c) (third c) (or (fourth c) 1.0) removes gl:color from the profile completely (since it is inlined by default) and leaves us with just the "foreign function glColor4f" still taking 0.05%, and saving us 0.7ms from average frame time. Since gl:color was inlined it could have just moved the processing time into render-node, which is one reason we need an absolute measure of time in addition to the profiler.

Next is convert-to-opengl-new, which converts a tradewars::ax-matrix struct into a vector to pass to OpenGL. Later commits in the original modified the struct to use a (simple-array single-float (16)) for storage, so we will do that here as well. Once that is done, all convert-to-opengl-new does is create a new matrix and transpose the existing one into it. Conveniently, GL 1.3 added gl:mult-transpose-matrix which lets us use the original order directly. Using (gl:mult-transpose-matrix (world-basis node)) brings us to about 102FPS, another 0.61ms improvement.

Step 3: some actual computation

From there we move to vector-multiply-to, which is a component-wise multiplication of 2 vectors into a third, in this case scaling a vertex. The proper high-level optimization would be to just include the scaling into the modelview matrix and let GL deal with it, probably much more efficiently. For now, let's go ahead and optimize the existing code a bit anyway.

If we look at the entry for vector-multiply-to in the flat profile, we see something like

    48   7.4                   TRADEWARZ::RENDER-NODE [1]
    17   2.6     48   7.4   TRADEWARZ::VECTOR-MULTIPLY-TO [7]
     5   0.8                   SB-VM::GENERIC-+ [14]
    26   4.0                   SB-KERNEL:TWO-ARG-* [4]

As mentioned above, sb-kernel:two-arg-* is a function that handles arbitrary types of numbers, including complex, ratio, and various combinations of 2 number types. If we knew we always had 2 single-floats, we should be able to multiply them with a single CPU instruction. Not sure where sb-vm::generic-+ comes from in that profile, but it is the same sort of thing.

We'll start by modifying ax-vector to use a typed vector as storage, similar to the previous modification of ax-matrix. Some of the code tries to create vectors with integers, so we will also add a more convenient version of make-vector that converts the arguments to single-float, and inline it so if the compiler knows the inputs are already single-float it can skip that step.

(defstruct (ax-vector
            (:type (vector single-float))
            (:constructor %make-vector (&optional x y z))
            (:conc-name v))
  (x 0.0)
  (y 0.0)
  (z 0.0))

(declaim (inline make-vector))
(defun make-vector (&optional (x 0.0) (y 0.0) (z 0.0))
  (%make-vector (float x 0.0) (float y 1.0) (float z 1.0)))

The data files also have integers in them, so let's modify vector-modify to convert to single-floats as well, and add some type declarations.

(defun vector-modify (src &optional x y z)
  "Assign new components to a vector"
  (declare (type (simple-array single-float (3)) src))
  (when x (setf (vx src) (float x 1.0)))
  (when y (setf (vy src) (float y 1.0)))
  (when z (setf (vz src) (float z 1.0)))

That doesn't affect performance much, but even without modifying vector-multiply-to it has dropped to 2-3% in our profile from almost 6%. In this case it is because SBCL is smart enough to know that vx, vy, and vz only accept ax-vectors, and ax-vectors only store single-floats, so if vx returns, the result must be a single-float and can be multiplied directly. Unfortunately the extra performance there was matched by vector-modify getting a bit slower due to converting types. Switching from apply to a direct call to vector-modify and noticing that copying n to the normal variable isn't needed brings us to about 108FPS, another 0.5ms.

Step 4: clean up the data

At this point, the top of the profile looks something like

           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
   1    230  23.5    721  73.6    230  23.5        -  TRADEWARZ::RENDER-NODE
   2    102  10.4    102  10.4    332  33.9        -  SB-KERNEL:%SINGLE-FLOAT
   3     44   4.5     71   7.3    376  38.4        -  TRADEWARZ::VECTOR-MODIFY
   4     43   4.4    954  97.4    419  42.8        -  TRADEWARZ::LOOP-SCENE

and the main problems in render-node are

   109  11.1                   CL-OPENGL:MULT-TRANSPOSE-MATRIX [6]
    70   7.2                   TRADEWARZ::VECTOR-MODIFY [3]
    59   6.0                   SB-KERNEL:%SINGLE-FLOAT [2]
    50   5.1                   CFFI::%FOREIGN-ENUM-VALUE [14]
    32   3.3                   CL-OPENGL-BINDINGS:CHECK-ERROR [7]
    21   2.1                   TRADEWARZ::VECTOR-MULTIPLY-TO [9]
    16   1.6                   TRADEWARZ::GET-MODEL [28]
    15   1.5                   (SB-PCL::FAST-METHOD TRADEWARZ::GET-SIZE (TRADEWARZ::MODEL)) [29]
    11   1.1                   SB-IMPL::GETHASH3 [24]

along with a few percent from bits of CLOS internals.

We can't do much about the mult-transpose-matrix call at this point. Possibly it could be faster for this specific case, but it is in a separate library, so we will leave it for now.

Next on the list are vector-modify and sb-kernel:%single-float. In this case, both are translating static data into a specific form. Instead of doing this every time we use it, we should just convert the data to the final form when we load it.

Some of the data is created by load-obj which uses make-vector internally, so there we just need to tell it not to convert them to lists at the end. Other objects are specified directly in the sexp-based data files, so we need to translate them by hand.

;; convert a list of lists of lists to a list of lists of ax-vector
(defun translate-geometry (geometry)
  (flet ((x (x)
           (etypecase x
              (make-vector (or (first x) 0) (or (second x) 0) (or (third x) 0)))
             ((simple-array single-float (3))
              (make-vector (aref x 0) (aref x 1) (aref x 2))))))
    (loop for v in geometry
          ;; some entries don't have normals, but also don't disable
          ;; lighting so add a reasonable default normal instead of
          ;; just using last normal from previous thing drawn
          if (first v)
            collect (mapcar #'x v)
          else collect (cons (make-vector 0 -1 0)
                             (mapcar #'x v)))))

Once that is done we can modify render-node to use the vectors directly, bringing us to about 135FPS, 1.85ms less than previously and 3.7ms better than where we started. That is a good example of why we compare in ms instead of FPS: we have 2 sets of optimizations that both improved frame time by about 1.85ms, which we can trivially compare. Measured in FPS, one was 18FPS improvement, and the other was 27FPS. Even if we know FPS is non-linear and know those gains were relative to 90FPS and 108FPS respectively, it would be difficult to intuitively see that both are equivalent improvements.

The model method on get-size is creating a new ax-vector on every call, so lets just store the size in that format to start with as well. In addition, we are still spending a few % scaling vertices at runtime even though the scale doesn't change within an object (or even over time), so lets let GL handle it as suggested in step 3. Since it is a property of the object and not the node, we can't put it in the world-basis matrix directly without some way to propagate changes from objects to nodes (and if it never changes, we'd be better off just multiplying the vertex data on load.) Instead, we will just call gl:scale for each node (and enable :normalize and make sure we don't have 0 scale, since gl:scale affects normals as well).

With the more efficient scaling, we are up another 0.7ms or so.

Step 5: some implementation-specific tricks

Now our profile looks like

           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
   1    589  30.0   1493  76.0    589  30.0        -  TRADEWARZ::RENDER-NODE
   2    104   5.3    104   5.3    693  35.3        -  CL-OPENGL-BINDINGS:CHECK-ERROR
   3     97   4.9   1939  98.7    790  40.2        -  TRADEWARZ::LOOP-SCENE
   4     95   4.8     95   4.8    885  45.1        -  (LAMBDA (SB-PCL::.ARG0.) :IN "...")
   5     80   4.1    303  15.4    965  49.1        -  CL-OPENGL:MULT-TRANSPOSE-MATRIX
   6     73   3.7     73   3.7   1038  52.9        -  (LAMBDA (SB-PCL::.ARG0. SB-PCL::.ARG1. SB-PCL::.ARG2.) :IN "...")
   7     63   3.2     74   3.8   1101  56.1        -  (FLET #:BODY-FUN-640 :IN SB-IMPL::GETHASH3)
   8     46   2.3     46   2.3   1147  58.4        -  SB-KERNEL:HAIRY-DATA-VECTOR-REF
   9     44   2.2     44   2.2   1191  60.6        -  SB-IMPL::OPTIMIZED-DATA-VECTOR-REF
  10     38   1.9     38   1.9   1229  62.6        -  (LAMBDA (SB-PCL::.ARG0. SB-PCL::.ARG1.) :IN "...")
  11     36   1.8     36   1.8   1265  64.4        -  SB-KERNEL:%SINGLE-FLOAT

That is starting to look better, loop-scene is just iterating the scene, so that being at #3 suggests we aren't wasting too much time per item. cl-opengl-bindings:check-error is how cl-opengl implements its automatic error checking. We could turn that off for a 5% improvement for final release, but for development it is better to keep it on. Aside from that we also have a few bits of CLOS internals, mult-transpose-matrix, some array internals and still some calls to %single-float.

sb-kernel:hairy-data-vector-ref is a sign SBCL doesn't know anything about the type of an array it is working with (for example whether it is displaced, adjustable, etc). That is usually a sign we could either use a more efficient array type somewhere, or give it more declarations, or both. In this case, if we look at the implementation of mult-transpose-matrix it has to make a copy of the matrix in a form that can be passed to the OpenGL api. Looking at the call-graph part of the profile, the *-vector-ref and %single-float calls are all in mult-transpose-matrix. In this case, we know we have a (simple-array single-float (16)) so we can do better than the generic code in cl-opengl, particularly on platforms where we can pass the contents of typed arrays directly to C functions.

CFFI has a wrapper for abstracting that ability, cffi:with-pointer-to-vector-data. Unfortunately it does extra work on platforms where using CL vectors directly isn't supported, since it copies the array back in case the foreign call made any changes to it, which we don't need. If we limit it to a few specific implementations it is still easier than adding code for each of them specifically, and we can fall back to using cl-opengl otherwise.

        (let ((mat (world-basis node)))
          #+(or cmucl sbcl ccl)
          (if (typep mat '(simple-array single-float (16)))
              (cffi:with-pointer-to-vector-data (p mat)
                (%gl:mult-transpose-matrix-f p))
              (gl:mult-transpose-matrix mat))
          #-(or cmucl sbcl ccl)
          (gl:mult-transpose-matrix mat))

That gives us another 0.5ms, bringing the total to under 6ms from a start of over 11ms.

A note about cl-opengl packages

Cl-opengl has 2 APIs, exposed by the gl: and %gl: packages. The "high-level" gl: package tries to be "easy to use" and accepts CL data types, translating as needed. The "low-level" %gl: package is an automatically generated set of CFFI wrappers for the whole OpenGL (and OpenGL-ES) API, along with all extensions. Functions in %gl: usually take c-style pointers, which we usually create with CFFI or static-vectors, but which need managed by hand.

Both gl: and %gl: packages are officially supported APIs. High performance code will probably need to use both, since the extra translations done by gl: functions frequently add overhead that can be avoided as in the above example. Libraries implementing their own "high-level" APIs will probably also want to use %gl: directly.

Step 6: CFFI tricks (don't do this one in normal code)

There are a few things related to sb-impl::gethash3 near the top of the profile, which is part of SBCL's hash table implementation. Checking the call graph, we see they come from cffi::%foreign-enum-value, which is also responsible for some of the unidentifiable CLOS internals we have been ignoring until now. They are due to converting symbols like :texture-2d and :triangles in the CL code to the numbers used in the C API. In particular, we pass (primitive model) to gl:with-primitive, so there is no way cl-opengl and CFFI could translate it at compile time.

It usually isn't worth worrying about that sort of thing, since we should be trying to minimize GL calls anyway. If it does end up being a problem, we can do the translation in advance with cffi:foreign-enum-value. It requires us to specify which "foreign enum" to use for looking up the symbol, in this case %gl:enum. We can find that by looking at the definition of with-primitive, seeing it passes the value to gl:begin, then checking the definition of that to see that the type of the mode argument is enum. (%gl:enum is usually the right thing, since GL doesn't break the type up, but a few things in cl-opengl use more specific enums.)

CL-OpenGL could theoretically optimize the translation of the constant :texture-2d argument to gl:bind-texture, but currently doesn't. In that case there isn't a convenient place to store it in advance like the primitive type in model, but we can look it up at read time or define a constant.

That leaves the CLOS bits, which are due to CFFI not handling enums as well as it should. With a small patch to CFFI, we can get rid of those as well, bringing FPS to about 207, and saving another 1ms.

3bgl-shader updates: compute shaders, etcSat, 13 Dec 2014 23:36:19 CST

Working on cleaning up/documenting the changes made to 3bgl-shaders due to recent experiments with compute shaders. (some of these were already committed but not documented)

New features/changes:

Don't :use :cl anymore

Instead of worrying about which symbols to shadow when :useing both :cl and :3bgl-glsl, (:use :3bgl-glsl/cl) handles it automatically.

:3bgl-glsl/cl exports everything exported from :3bgl-glsl and :cl, preferring the former where there are conflicts.

The name isn't final yet though, might switch to using the shorter name for the combined package.

DEFMACRO works from shader code

3bgl-glsl:defmacro (exported by :3bgl-glsl/cl) allows defining macros for shader code, and should work as expected. Macro functions are evaluated by host, so can use arbitrary CL code at compile time.

macrolet also works, also allowing arbitrary CL code for computing the expansion.

More control over uniforms

The uniform macro accepts new arguments :layout and :qualifiers, for things like restrict and specifying image formats (:rg32f, etc)

shared memory

The new shared macro allows defining glsl shared variables for use in compute shaders, for example (shared temp (:float 512)) declares an array of 256 floats to be shared within each workgroup of a compute shader invocation.

Local-size-* declarations for compute shaders

Compute shaders need the workgroup size specified in the kernel, which can be done with the layout declaration:

(defun foo()
  (declare (layout (:in nil :local-size-x 16 :local-size-y 16 :local-size-z 1)))
Partial support for arrays

Arrays are supported for shared variables, and partially for local variables.

Type inference doesn't detect arrays yet so type must be declared explicitly, and array variables can't be initialized yet.

Both currently use the syntax (<base-type> <count>) rather than CL vector or array type specifiers.

(let (a)
  (declare ((:float 32) a))
  (setf (aref a 12) 34)
  (aref a 12))
MOD works on float and integer types

Previously mod had conflicting definitions, so only worked on some types, now it expands to % or mod() depending on the derived argument type.

Compute shaders / SmoothLife 3dFri, 12 Dec 2014 21:21:33 CST

After getting 3bgl-shader to a state approximating usable, I wanted to do something to exercise it a bit, as well as try out the (relatively) new compute shaders added in OpenGL 4.3.

2d SmoothLife

I decided to try an implementation of 3d SmoothLife, which is a generalization of the Game of Life cellular automaton to floating point, with influence from an arbitrary sized "neighborhood". The floating point values make the Hashlife optimization from normal "Life" impractical, and the wider neighborhood makes direct evaluation of the rules too expensive, particularly in 3d. Instead, the neighborhood calculations can be interpreted as convolutions with a kernel representing the shape of the neighborhood. The convolutions can then be implemented as element-wise multiplications in frequency domain, which is more efficient than direct convolution for the range of kernel sizes used even with the cost of the FFT and IFFT.

Somehow that ended up with spending a bunch of time on an educational but time consuming exploration of how FFTs work, along with experimenting on optimizing my compute-shader FFT implementations.

The general idea of 3bgl-shader seems to be working out well in practice though, in particular macros are nice for repetitive code like this. For example, I generated a macro FFT16 that takes as arguments the name of functions/macros to load/save a specific element, then I can do things like

;;; load specific rows from an image, do an FFT on them, and same to
;;; shared memory
(macrolet ((in (i)
             `(.xy (image-load tex (ivec3 x (+ y ,(* 16 i)) z))))
           (out (i re im)
                (setf (scratch lx ly 16 ,i) ,re)
                (setf (scratch lx ly 16 ,i t) ,im))))
  (fft16 in out))

;;; load specific elements from shared memory, do an FFT on them, and save
;;; back to the image
(macrolet ((in (i)
             `(vec2 (scratch2 lx ly 1 ,(* 16 i))))
           (out (i re im)
             `(image-store out1
                           (ivec3 (+ y ,(* 16 i)) x z)
                           (vec4 ,re ,im 0 0))))
  (fft16 in out))

without duplicating the FFT code or modifying the generator every time I wanted to change the access patterns or where it loads/saves. scratch and scratch2 are macros abstracting the layout of the elements in the shared memory.


(defmacro with-image-vars ((x y z base-var count) &body body)
  `(let (,@(loop for i below count
                 for s =  (alexandria:format-symbol t
                                                    "~a~d" base-var i)
                 collect (list s `(image-load tex (ivec3 (+ ,y ,i)
                                                         ,x ,z)))))

lets me define a set of repetitive bindings like (R0 (IMAGE-LOAD TEX (IVEC3 (+ Y 0) X Z))) to (R15 (IMAGE-LOAD TEX (IVEC3 (+ Y 15) X Z))) without getting errors where I missed/typoed a number while cut&pasting or typing them manually (which happened in one of the earlier versions, and wasted quite a bit of debugging time).

On the bad side, there is something wrong with dependency tracking in 3bgl-shader, so I have to explicitly reference some uniforms in the main shader to get them included in the output, and most error messages are very uninformative. It also could use some way to hook into slime for arglist hinting and source locations... having to search manually is pretty annoying after being used to just hitting M-..

When I finally managed to convince myself to stop trying to optimize it, I was getting about 26ms or so for a 256^3 FFT, divided into 3 passes (one for each dimension) of around 8-9ms each on my GTX780. Judging by the cuda benchmark results I've seen online I think it should be able to do about 2-3x as fast, but I'm not sure how much of that is just my implementation and how much is CUDA being more flexible than GL compute shaders.

One update of 3d smoothlife needs 2 convolutions and a pass to implement the birth/death rules using the results of the convolutions. The convolutions can share the FFT, leaving the total work at 1 FFT, 2 complex multiplies per pixel, 2 IFFTs, the running the rules per pixel. The multiplies take a few ms each, so in the final result I ended up using 1 frame per axis of the FFT and IFFTs, 1 frame for the multiplies, and 1 frame for the rules for a total of 11 frames per update. That leaves enough time for a simple draw at 60FPS as long as it isn't drawing too much (256X overdraw is a lot even for a modern GPU, particularly if it is spending most of a frame on other things). Smarter rendering is next step, since volume rendering is the other part I wanted to experiment with for this project.

Results so far:

3bgl-shader alpha releaseSun, 21 Sep 2014 17:13:17 CDT

One of the things I have been working on off and on for the last few years is a DSL for translating something resembling Common Lisp to GLSL. Recently, I finally got around to combining the various pieces into something approximately usable, and the result is 3bgl-shader.

It is still a bit rough, in particular most of the error messages are pretty bad. The main features (type-inference, dependency tracking, and interactive usage) seem to be working reasonably well though.

The video also shows something else I've been working on recently, which is a hack to show an xterm within a 3d scene, using the X Composite extension. That part is even rougher, to the point it doesn't even start up reliably. When it does start, it works reasonably well, and I even managed to get a nested draw loop working on *debugger-hook* so it can be used to debug code running in an emacs displayed on the xterm from within the code being debugged.