2d Sprites, Part 3: benchmarking and optimization

So far, we've built 2 versions of a simple particle system, and made some vague claims that one of them is (or at least could be) faster than the other. With the current code, we can't really tell though, just click a bunch and try to judge how it looks.

In order to start making meaningful comparisons, the first thing we need is some sort of objective performance measurement.

For now, let's just count frames, and output average framerate/frame time once in a while. Eventually we would probably want to track more details like instantaneous framerate/time, longest frame time or a histogram of frame times, time spent in GC, etc.

First we need some variables to store statistics in:

(defparameter *frame-count* 0)
(defparameter *last-fps-message-time* 0)
(defparameter *last-fps-message-frame-count* 0)
(defparameter *fps-message-interval* 2000) ;; in milliseconds

Then we need to reset them during initialization, and while we are modifying init, let's change the particle texture a bit. We'll make it white so we can tint it per draw call, and fade the edges a bit.

(defun init ()
  (setf *frame-count* 0)
  (setf *last-fps-message-time* (sdl:sdl-get-ticks))
  (vecto-texture (:click-particle 64 64)
    (vecto:set-rgba-fill 0 0 0 0)
    (vecto:clear-canvas)
    (loop for i below 16
         do
         (vecto:set-rgba-stroke 1 1 1 (/ 1 16.0))
         (vecto:set-line-width i)
         (vecto:centered-circle-path 32 32 20)
         (vecto:stroke)))
  (reload-textures *texture-manager*))

Remember that we can use the next-frame macro to call init while the main loop is still running

(next-frame
  (init))

Now we can modify the update function to print the average FPS for the previous few seconds every once in a while.

(defun update-fps ()
  ;; update the frame count
  (incf *frame-count*)
  ;; handle tick count wrapping to 0
  (when (< (sdl:sdl-get-ticks) *last-fps-message-time*)
    (setf *last-fps-message-time* (sdl:sdl-get-ticks)))
  ;; see if it is time for next message
  (when (>= (sdl:sdl-get-ticks)
           (+ *last-fps-message-time* *fps-message-interval*))
    (let ((frames (- *frame-count* *last-fps-message-frame-count*))
          (seconds (/ (- (sdl:sdl-get-ticks) *last-fps-message-time*) 1000.0)))
      (format t "~s seconds: ~s fps, ~s ms per frame~%"
              seconds
              (if (zerop seconds) "<infinite>" (/ frames seconds))
              (if (zerop frames) "<infinite>" (* 1000 (/ seconds frames)))))
    (setf *last-fps-message-time* (sdl:sdl-get-ticks))
    (setf *last-fps-message-frame-count* *frame-count*)))

(defun update (delta-t)
  (update-fps)
  (update-object *particle-manager* delta-t))

then we get something like 2.013 seconds: 30.30303 fps = 33.0 ms per frame and realize lispbuilder-SDL defaults to a 30 FPS cap, so let's turn that off and see how fast it goes

(next-frame
  (setf (sdl:frame-rate) 0))

2.0 seconds: 42868.5 fps = 0.023327151 ms per frame, much better. Now we can see how much it slows down when we click a few times, possibly after adjusting *click-particles-count*, but it still isn't repeatable enough for useful comparison. To deal with that, let's add a little benchmark, and try to make it more interesting to look at in the process.

First a note about measuring FPS though: Since FPS is inversely proportional to time, comparing absolute FPS numbers will obscure the actual useful information, which is how long it takes to render a frame (or part of a frame). For example, above I was getting 42000 fps. If I add an effect and it drops to 21000 fps, that might seem like a lot, but it really isn't... that effect renders as fast as clearing the window, about 0.02 ms. You could run 500 of those per frame and still maintain 100fps. On the other hand, if we were at 150 fps, and an effect dropped it to 120fps, that is a lower percentage drop, and much lower absolute drop in FPS, but it is actually a much more expensive effect, at about 1.6ms. Running 10 of those would take up an entire frame at 60fps.

With that out of the way, let's get back to the benchmarking.

The next thing we need is a way to trigger a repeatable load and measure how many frames it renders during the benchmark.

We start with some more statistics variables:

(defparameter *benchmark-start-time* nil)
(defparameter *benchmark-start-frames* 0)
(defparameter *benchmark-duration-ms* *click-particles-lifetime-ms*)
(defparameter *benchmark-message* "")

Then add a function to run every frame that checks to see if a benchmark is running, updates the stats, and when it finishes prints out the results.

(defun update-benchmark ()
  (when (and *benchmark-start-time*
             (> (sdl:sdl-get-ticks)
                (+ *benchmark-start-time* *benchmark-duration-ms*)))
    (let ((frames (- *frame-count* *benchmark-start-frames*))
          (seconds (/ (- (sdl:sdl-get-ticks) *benchmark-start-time*) 1000.0)))
      (format t "~arendered ~s frames in ~s seconds, ~s fps, ~sms/frame~%"
              *benchmark-message*
              frames seconds
              (if (zerop seconds) "infinite" (/ frames seconds))
              (if (zerop frames) "infinite" (* 1000.0 (/ seconds frames))))
      (setf *benchmark-start-time* nil))))

We need to call that from the update function:

(defun update (delta-t)
  (update-fps)
  (update-benchmark)
  (update-object *particle-manager* delta-t))

Finally, we need to add a function to initialize the statistics and start the benchmark running, then call that from a key handler.

For the benchmark, we will just start a bunch of particle systems at once, using the same random seed so we get the same set of particles every time.


;; you may want to set these lower :)
(defparameter *benchmark-particles-per-system* 1000)
(defparameter *benchmark-system-count* 50)
;; copy the random state so we get repeatable results
(defparameter *benchmark-random-state* (make-random-state))

(defun start-benchmark (type)
  (setf *benchmark-start-time* (sdl:sdl-get-ticks))
  (setf *benchmark-start-frames* *frame-count*)
  (setf *benchmark-message* (format nil "~s:" type))
  (let ((*random-state* (make-random-state *benchmark-random-state*)))
    (loop repeat *benchmark-system-count*
       do (let ((*mouse-x* (random *screen-width*))
                (*mouse-y* (random *screen-height*)))
            (push (make-instance type :count *benchmark-particles-per-system*)
                  (systems *particle-manager*))))))

(defun key-up (key state mod-key scancode unicode)
  (declare (ignore state mod-key scancode unicode))
  (case key
    (:sdl-key-escape (sdl:push-quit-event))
    (:sdl-key-1 (start-benchmark 'click-particles))
    (:sdl-key-2 (start-benchmark 'vbo-click-particles))))

We turned up the particle counts a bit compared to what we were using for clicks, so let's change the pattern a bit to make it more interesting, and make it more colorful as well.


(defclass click-particles (particle-system)
  ((age :initform 0 :accessor age)
   (positions :accessor positions)
   (velocities :accessor velocities)
   ;; add a color slot per particle system instance
   (color :initform (list (random 1.0) (random 1.0) (random 1.0) 0.01) :accessor color))
  (:default-initargs :count *click-particles-count*
    :speed *click-particles-speed*))

;; set the per-system color before drawing the system
(defmethod draw-object :before ((o click-particles))
  (gl:blend-func :src-alpha :one)
  (apply 'gl:color (color o)))

(defmethod initialize-instance :after ((o click-particles) &key count speed &allow-other-keys)
  (let ((positions
         (make-array (list (* 2 count)) :element-type 'single-float :initial-element 0.0))
        (velocities
         (make-array (list (* 2 count)) :element-type 'single-float :initial-element 0.0)))
    (loop
       for i below count
       for j = (* i 2)
       ;; we vary the speed randomly per particle so it isn't just a solid ring
       for v = (* (+ 0.1 (random 1.5)) speed)
       do (setf (aref positions (+ j 0)) (float *mouse-x* 1.0))
       do (setf (aref positions (+ j 1)) (float *mouse-y* 1.0))
       do (setf (aref velocities (+ j 0))
                (float (* v (sin (* i (/ (* 2 pi) count)))) 1.0))
       do (setf (aref velocities (+ j 1))
                (float (* v (cos (* i (/ (* 2 pi) count)))) 1.0)))
    (setf (positions o) positions)
    (setf (velocities o) velocities)))

benchmark running Now we can get some repeatable numbers. With the original immediate mode code, I get about 30fps (33ms/frame) for 50 systems @ 1000 particles each. With the VBO code, I get about 34fps (28 ms). Measurably faster, but not as good as we can get. Time to start optimizing...

One of the original goals of switching to VBOs was to cut down on function calls, but we still do a function call to vbo-rectangle for every particle (and probably a bunch of function calls to math routines since we didn't declare types inside it).

Let's start out by converting that to an FLET so it can get inlined, and turn on speed optimizations. (I'm mostly assuming SBCL for this part, some compilers might need more help.)

(defmethod draw-object ((o click-particles))
  (declare (optimize speed))
  (flet ((rectangle (x y width height
                       &optional (u1 0.0) (v1 0.0) (u2 1.0) (v2 1.0))
           (let* ((w/2 (/ width 2.0))
                  (h/2 (/ height 2.0))
                  (x1 (- x w/2))
                  (x2 (+ x w/2))
                  (y1 (- y h/2))
                  (y2 (+ y h/2)))
             (gl:tex-coord u1 v2)
             (gl:vertex x1 y1 0.0)
             (gl:tex-coord u2 v2)
             (gl:vertex x2 y1 0.0)
             (gl:tex-coord u2 v1)
             (gl:vertex x2 y2 0.0)
             (gl:tex-coord u1 v1)
             (gl:vertex x1 y2 0.0))))
    (bind-texture :click-particle)
    (gl:with-primitive :quads
      (loop with positions = (positions o)
         for i below (length positions) by 2
         for x = (aref positions i)
         for y = (aref positions (1+ i))
         do (rectangle x y 32 32)))))

(defmethod draw-object ((o vbo-click-particles))
  (declare (optimize speed))
  (flet ((vbo-rectangle (pointer offset x y width height
                                 &optional (u1 0.0) (v1 0.0) (u2 1.0) (v2 1.0))
           (let* ((w/2 (/ width 2.0))
                  (h/2 (/ height 2.0))
                  (x1 (- x w/2))
                  (x2 (+ x w/2))
                  (y1 (- y h/2))
                  (y2 (+ y h/2)))
             (macrolet
                 ((store-values (&rest v)
                    `(progn ,@(loop for i in v
                                 collect `(setf (cffi:mem-aref pointer
                                                               :float (1- (incf offset)))
                                                (float ,i 0.0))))))
               (store-values u1 v2)
               (store-values x1 y1)
               (store-values u2 v2)
               (store-values x2 y1)
               (store-values u2 v1)
               (store-values x2 y2)
               (store-values u1 v1)
               (store-values x1 y2)))))
    (let ((components-per-vertex 4)
          (size-of-float 4)
          (vertices-per-sprite 4))
      ;; activate the VBO
      (gl:bind-buffer :array-buffer (vbo o))
      ;; allocate space for the data
      (%gl:buffer-data :array-buffer
                       (* (/ (length (positions o)) 2)
                          vertices-per-sprite
                          components-per-vertex
                          size-of-float)
                       (cffi:null-pointer) ;; just allocating space
                       :stream-draw)
      ;; get a pointer to the VBO memory
      (gl:with-mapped-buffer (p :array-buffer :write-only)
        ;; copy the vertex data to the buffer
        (loop with positions = (positions o)
           for i below (length positions) by 2
           for offset from 0 by (* components-per-vertex vertices-per-sprite)
           for x = (aref positions i)
           for y = (aref positions (1+ i))
           do (vbo-rectangle p offset x y 32 32)))
      ;; draw the object
      (%gl:tex-coord-pointer 2 :float
                             (* components-per-vertex size-of-float)
                             (cffi:null-pointer))
      (%gl:vertex-pointer 2 :float
                          (* components-per-vertex size-of-float)
                          (cffi:make-pointer (* 2 size-of-float)))
      (gl:enable-client-state :texture-coord-array)
      (gl:enable-client-state :vertex-array)
      (%gl:draw-arrays :quads 0 (* vertices-per-sprite
                                   (/ (length (positions o)) 2)))
      (gl:disable-client-state :vertex-array)
      (gl:disable-client-state :texture-coord-array)))
  ;; finally disable the buffer
  (gl:bind-buffer :array-buffer 0))

That sped it up a bit, about 37 FPS for immediate mode and 39 FPS for VBO now, but SBCL gave us lots of hints about places it couldn't compile efficiently, let's add some declarations to get rid of some of the compiler notes.

(defmethod draw-object ((o click-particles))
  (declare (optimize speed))
  (flet ((rectangle (x y width height
                       &optional (u1 0.0) (v1 0.0) (u2 1.0) (v2 1.0))
           (let* ((w/2 (/ width 2.0))
                  (h/2 (/ height 2.0))
                  (x1 (- x w/2))
                  (x2 (+ x w/2))
                  (y1 (- y h/2))
                  (y2 (+ y h/2)))
             (gl:tex-coord u1 v2)
             (gl:vertex x1 y1 0.0)
             (gl:tex-coord u2 v2)
             (gl:vertex x2 y1 0.0)
             (gl:tex-coord u2 v1)
             (gl:vertex x2 y2 0.0)
             (gl:tex-coord u1 v1)
             (gl:vertex x1 y2 0.0))))
    (bind-texture :click-particle)
    (gl:with-primitive :quads
      ;; SBCL is smart, so declaring the type of this array takes care
      ;; of this function
      (loop with positions of-type (simple-array single-float (*)) = (positions o)
         for i below (length positions) by 2
         for x = (aref positions i)
         for y = (aref positions (1+ i))
         do (rectangle x y 32 32)))))

(defmethod draw-object ((o vbo-click-particles))
  (declare (optimize speed))
  (flet ((vbo-rectangle (pointer offset x y width height
                                 &optional (u1 0.0) (v1 0.0) (u2 1.0) (v2 1.0))
           (let* ((w/2 (/ width 2.0))
                  (h/2 (/ height 2.0))
                  (x1 (- x w/2))
                  (x2 (+ x w/2))
                  (y1 (- y h/2))
                  (y2 (+ y h/2)))
             (macrolet
                 ;; we rewrite the macro here a bit to simplify type
                 ;; inference on the offset values
                 ((store-values (&rest v)
                    `(progn ,@(loop for i in v
                                 for j from 0
                                 collect `(setf (cffi:mem-aref pointer
                                                               :float (+ offset ,j))
                                                (float ,i 0.0))))))
               (store-values u1 v2  x1 y1
                             u2 v2  x2 y1
                             u2 v1  x2 y2
                             u1 v1  x1 y2)))))
    (let ((components-per-vertex 4)
          (size-of-float 4)
          (vertices-per-sprite 4))
      (gl:bind-buffer :array-buffer (vbo o))
      (%gl:buffer-data :array-buffer
                       (* (/ (length (positions o)) 2)
                          vertices-per-sprite
                          components-per-vertex
                          size-of-float)
                       (cffi:null-pointer)
                       :stream-draw)
      (gl:with-mapped-buffer (p :array-buffer :write-only)
        ;; we specify the type for the array here too
        (loop with positions of-type (simple-array single-float (*)) = (positions o)
           with length = (length positions)
           for i below length by 2
           ;; and a type for offset
           for offset fixnum from 0 by (* components-per-vertex vertices-per-sprite)
           for x = (aref positions i)
           for y = (aref positions (1+ i))
           do (vbo-rectangle p offset x y 32 32)))
      ;; draw the object
      (%gl:tex-coord-pointer 2 :float
                             (* components-per-vertex size-of-float)
                             (cffi:null-pointer))
      (%gl:vertex-pointer 2 :float
                          (* components-per-vertex size-of-float)
                          (cffi:make-pointer (* 2 size-of-float)))
      (gl:enable-client-state :texture-coord-array)
      (gl:enable-client-state :vertex-array)
      (%gl:draw-arrays :quads 0 (* vertices-per-sprite
                                   (floor (length (positions o)) 2)))
      (gl:disable-client-state :vertex-array)
      (gl:disable-client-state :texture-coord-array)))
  (gl:bind-buffer :array-buffer 0))

That still leaves a few warnings, but they don't seem to affect performance much, so they can stay. Now we have about 50fps/20ms for immediate mode and about 70fps/14ms for VBO, time to ask the profiler for advice.

Rather than restart the main loop and run it in sb-sprof:with-profiling, we will enable profiling for all threads, start the benchmark running, wait for a bit, then print out the profiling report.

(require 'sb-sprof)

(defun profile-benchmark (type &key (count 1))
  (sb-sprof:reset)
  (sb-sprof:start-profiling)
  (loop repeat count
     do (next-frame (start-benchmark type))
     do (sleep 3))
  (sb-sprof:stop-profiling)
  (format t "profile for ~s~%" type)
  (sb-sprof:report :type :graph))

(profile-benchmark 'click-particles :count 2)
(profile-benchmark 'vbo-click-particles :count 2)


With immediate mode:

           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
------------------------------------------------------------------------
   1     41  11.0    180  48.1     41  11.0        -  MAP-INTO
   2     31   8.3     40  10.7     72  19.3        -  ELT
   3     29   7.8     60  16.0    101  27.0        -  (SB-PCL::FAST-METHOD DRAW-OBJECT (CLICK-PARTICLES))
   4     27   7.2     27   7.2    128  34.2        -  CL-OPENGL3-BINDINGS:CHECK-ERROR
   5     21   5.6     21   5.6    149  39.8        -  SB-KERNEL:TWO-ARG-*
   6     18   4.8     18   4.8    167  44.7        -  SB-KERNEL:SEQUENCEP
   7     14   3.7     14   3.7    181  48.4        -  SB-IMPL::OPTIMIZED-DATA-VECTOR-REF
   8     13   3.5     13   3.5    194  51.9        -  SB-KERNEL:TWO-ARG-+
   9     12   3.2     13   3.5    206  55.1        -  SB-VM::GENERIC-+
  10     11   2.9     11   2.9    217  58.0        -  SB-KERNEL:HAIRY-DATA-VECTOR-REF/CHECK-BOUNDS

With VBOs:

           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
------------------------------------------------------------------------
   1     55  14.4    249  65.4     55  14.4        -  MAP-INTO
   2     46  12.1     58  15.2    101  26.5        -  ELT
   3     37   9.7     43  11.3    138  36.2        -  (SB-PCL::FAST-METHOD DRAW-OBJECT (VBO-CLICK-PARTICLES))
   4     35   9.2     35   9.2    173  45.4        -  SB-KERNEL:TWO-ARG-*
   5     27   7.1     27   7.1    200  52.5        -  SB-KERNEL:HAIRY-DATA-VECTOR-REF/CHECK-BOUNDS
   6     19   5.0     19   5.0    219  57.5        -  SB-KERNEL:SEQUENCEP
   7     16   4.2     16   4.2    235  61.7        -  SB-VM::GENERIC-+
   8     15   3.9     15   3.9    250  65.6        -  SB-KERNEL:TWO-ARG-+
   9     13   3.4     16   4.2    263  69.0        -  SB-KERNEL:%SETELT
  10     13   3.4     13   3.4    276  72.4        -  SB-IMPL::OPTIMIZED-DATA-VECTOR-REF

We called MAP-INTO, + and * from update-element, so let's optimize that a bit:

(defmethod update-object ((o click-particles) delta-t)
  (incf (age o) delta-t)
  (setf delta-t (float delta-t 1.0))
  (if (< (age o) *click-particles-lifetime-ms*)
      (let ((positions (positions o))
            (velocities (velocities o)))
        ;; optimize the inner loop for speed, and declare the types
        ;; for the arrays
        (declare (optimize speed)
                 ((simple-array single-float (*)) positions velocities))
        (map-into positions (lambda (p v)
                              (+ p (* v delta-t 0.001)))
                  positions velocities))
      nil))

With that, we are up to about 115fps for immediate mode, and 275fps for VBOs, and it is time to discuss some of the different issues you need to consider when optimizing GL code, since we've run into one here.

When running GL code on current hardware, there are usually a few pieces of software and hardware involved: the user code, the driver code, the main CPU, the bus, and various parts of the GPU.

So far, we have been optimizing for the CPU, mostly in user code, but also reducing the driver workload by trying to make fewer GL calls. On the GPU side, there are a few things to consider:

First is how fast the GPU can access the geometry data, and transform the geometry from world coordinates into screen coordinates, etc. We won't worry about that too much, since we have fairly simple geometry here.

Next is texture accesses and fill rate, which we need to take a closer look at. We mentioned before that using mipmaps can be faster than using nearest filtering, since it lets the GPU sample texels closer together, which makes better use of the special texture caches on most GPUs. Since we already use mipmaps, we move on to fill-rate, which basically refers to how fast the GPU can render pixels to the screen. Here, we are much worse than the usual 3d app, since everything we draw is visible, and needs blended with what is already there. In a 3d app, we could for example draw things roughly from front to back, and the GPU could do a quick depth test to decide it didn't need to worry about any pixels that are hidden by closer objects, so it only has to draw each pixel at most a few times. With our particle system, we are potentially drawing every pixel hundreds or thousands of times.

One easy way to test if fill rate is the limiting factor in an application, is to resize the window (while drawing the same thing, at the same size relative to the window size) and see how it affects performance. In this case, a smaller window makes the VBO case faster, but doesn't really affect the immediate mode code, and increasing the window to 640x400 drops both versions to about 100fps.

If we wanted to fix that, we'd probably need to either change the effect, for example by drawing smaller particles, or try to draw it more efficiently. In this case, we could probably use a texture with a bunch of particles on it, and use that on a smaller number of particles, and get about the same effect with much lower fill-rate cost. We could even extend that further, and just render it in advance and play the entire system as a single sequence of textures.

One other thing we can try, is enabling alpha testing:

(next-frame
  (gl:enable :alpha-test)
   (gl:alpha-func :notequal 0.0))

Here, we use gl:alpha-func to tell GL to only draw texels where the alpha value is :notequal to 0.0. It doesn't seem to help with my hardware, but might make a difference on some systems.

That's it for sprites for now, so here is the usual link back to the index, and a link to the final code