[insert blog here]
3bgl-shader updates: compute shaders, etc
Sat, 13 Dec 2014 23:36:19 CST

Working on cleaning up and documenting the changes made to 3bgl-shader during recent experiments with compute shaders. (Some of these were already committed but not documented.)

New features/changes:

Don't :use :cl anymore

Instead of worrying about which symbols to shadow when :useing both :cl and :3bgl-glsl, (:use :3bgl-glsl/cl) handles it automatically.

:3bgl-glsl/cl exports everything exported from :3bgl-glsl and :cl, preferring the former where there are conflicts.

The name isn't final yet though, might switch to using the shorter name for the combined package.
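
As a rough illustration of what "preferring the former where there are conflicts" means, here is a minimal sketch in plain Common Lisp that builds a combined package from two toy packages. The package names toy-glsl and toy-host and the helper make-combined-package are made up for this example, and this is not necessarily how 3bgl-shader defines :3bgl-glsl/cl.

```lisp
;; Toy stand-ins for :3bgl-glsl and :cl (hypothetical names).
;; Both export a symbol named DEFUN, so they conflict on that name.
(defpackage #:toy-glsl (:use) (:export #:defun #:vec3))
(defpackage #:toy-host (:use) (:export #:defun #:car))

(defun make-combined-package (name preferred fallback)
  "Create package NAME exporting every external symbol of PREFERRED and
FALLBACK, taking PREFERRED's symbol when both export the same name."
  (let ((combined (make-package name :use '())))
    ;; Everything from the preferred package goes in first.
    (do-external-symbols (s preferred)
      (import (list s) combined)
      (export (list s) combined))
    ;; Fallback symbols are added only when the name is still free.
    (do-external-symbols (s fallback)
      (unless (nth-value 1 (find-symbol (symbol-name s) combined))
        (import (list s) combined)
        (export (list s) combined)))
    combined))
```

A package that :uses the combined package then sees a single DEFUN (the one from the preferred package) without needing any :shadow or :shadowing-import-from clauses.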

DEFMACRO works from shader code

3bgl-glsl:defmacro (exported by :3bgl-glsl/cl) allows defining macros for shader code, and should work as expected. Macro functions are evaluated by the host, so they can use arbitrary CL code at compile time.

macrolet also works, also allowing arbitrary CL code for computing the expansion.
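
To show the kind of thing "arbitrary CL code at compile time" allows, here is a small sketch of a macro that computes an FFT twiddle factor on the host and expands to literal constants. It is written with the standard cl:defmacro so it runs without 3bgl-shader loaded. The macro name twiddle is made up, and list stands in for whatever vector constructor the shader code would use.

```lisp
(defmacro twiddle (k n)
  "Expand to a literal (re im) pair for exp(-2*pi*i*K/N).
K and N must be literal numbers, since the trig runs at macroexpansion time."
  (let ((angle (/ (* -2 pi k) n)))
    `(list ,(coerce (cos angle) 'single-float)
           ,(coerce (sin angle) 'single-float))))

;; The expansion contains only constants, so no sin/cos call is left
;; in the generated code.
(defparameter *twiddle-4-of-16* (macroexpand-1 '(twiddle 4 16)))
```

Here *twiddle-4-of-16* holds a form of the shape (list re im), with re very close to 0 and im equal to -1 up to rounding.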

More control over uniforms

The uniform macro accepts new arguments :layout and :qualifiers, for things like restrict and specifying image formats (:rg32f, etc).

shared memory

The new shared macro allows defining glsl shared variables for use in compute shaders, for example (shared temp (:float 512)) declares an array of 512 floats to be shared within each workgroup of a compute shader invocation.

Local-size-* declarations for compute shaders

Compute shaders need the workgroup size specified in the kernel, which can be done with the layout declaration:

(defun foo ()
  (declare (layout (:in nil :local-size-x 16 :local-size-y 16 :local-size-z 1)))
  ...)

Partial support for arrays

Arrays are supported for shared variables, and partially for local variables.

Type inference doesn't detect arrays yet so type must be declared explicitly, and array variables can't be initialized yet.

Both currently use the syntax (<base-type> <count>) rather than CL vector or array type specifiers.

(let (a)
  (declare ((:float 32) a))
  (setf (aref a 12) 34)
  (aref a 12))

MOD works on float and integer types

Previously mod had conflicting definitions, so it only worked on some types. Now it expands to % or mod() depending on the derived argument type.
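
The reason for the split is that GLSL's % operator is only defined for integer types, while mod() is the floating point builtin. A minimal sketch of that kind of type-based choice, in plain Common Lisp, might look like the following. The function name emit-mod and the type keywords are hypothetical, and 3bgl-shader's actual code generation is presumably more involved.

```lisp
(defun emit-mod (a b type)
  "Return a GLSL expression string for A mod B, given the derived TYPE
of the arguments. Integer types use the % operator, floats use mod()."
  (ecase type
    ((:int :uint) (format nil "(~a % ~a)" a b))
    ((:float) (format nil "mod(~a, ~a)" a b))))
```

For example, (emit-mod "i" "16" :int) produces the % form, and (emit-mod "x" "1.0" :float) produces a call to mod().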

Compute shaders / SmoothLife 3d
Fri, 12 Dec 2014 21:21:33 CST

After getting 3bgl-shader to a state approximating usable, I wanted to do something to exercise it a bit, as well as try out the (relatively) new compute shaders added in OpenGL 4.3.

2d SmoothLife

I decided to try an implementation of 3d SmoothLife, which is a generalization of the Game of Life cellular automaton to floating point, with influence from an arbitrary sized "neighborhood". The floating point values make the Hashlife optimization from normal "Life" impractical, and the wider neighborhood makes direct evaluation of the rules too expensive, particularly in 3d. Instead, the neighborhood calculations can be interpreted as convolutions with a kernel representing the shape of the neighborhood. The convolutions can then be implemented as element-wise multiplications in frequency domain, which is more efficient than direct convolution for the range of kernel sizes used even with the cost of the FFT and IFFT.
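
The equivalence between convolution and element-wise multiplication in the frequency domain can be checked on a tiny 1d example. The sketch below uses a naive O(N^2) DFT in plain Common Lisp rather than a real FFT, and has nothing to do with the compute shader implementation. The function names are made up for this example.

```lisp
(defun dft (xs &key inverse)
  "Naive O(N^2) discrete Fourier transform of the vector XS."
  (let* ((n (length xs))
         (sign (if inverse 1 -1))
         (out (make-array n)))
    (dotimes (k n out)
      (setf (aref out k)
            (/ (loop for j below n
                     sum (* (aref xs j)
                            (cis (/ (* sign 2 pi j k) n))))
               (if inverse n 1))))))

(defun circular-convolve (xs ks)
  "Direct circular convolution: out[i] = sum_j xs[j] * ks[(i - j) mod n]."
  (let* ((n (length xs))
         (out (make-array n)))
    (dotimes (i n out)
      (setf (aref out i)
            (loop for j below n
                  sum (* (aref xs j) (aref ks (mod (- i j) n))))))))

(defun dft-convolve (xs ks)
  "Same convolution via element-wise multiplication in frequency domain."
  (map 'vector #'realpart
       (dft (map 'vector #'* (dft xs) (dft ks)) :inverse t)))
```

Calling both circular-convolve and dft-convolve on the same pair of equal-length vectors should give the same values up to floating point error. The direct version costs O(N) work per output element for a full-size kernel, which is what makes the transform approach attractive once a fast FFT is available.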

Somehow that ended up with spending a bunch of time on an educational but time consuming exploration of how FFTs work, along with experimenting on optimizing my compute-shader FFT implementations.

The general idea of 3bgl-shader seems to be working out well in practice though, in particular macros are nice for repetitive code like this. For example, I generated a macro FFT16 that takes as arguments the names of functions/macros to load/save a specific element, then I can do things like


;;; load specific rows from an image, do an FFT on them, and save to
;;; shared memory
(macrolet ((in (i)
             `(.xy (image-load tex (ivec3 x (+ y ,(* 16 i)) z))))
           (out (i re im)
             `(progn
                (setf (scratch lx ly 16 ,i) ,re)
                (setf (scratch lx ly 16 ,i t) ,im))))
  (fft16 in out))

;;; load specific elements from shared memory, do an FFT on them, and save
;;; back to the image
(macrolet ((in (i)
             `(vec2 (scratch2 lx ly 1 ,(* 16 i))))
           (out (i re im)
             `(image-store out1
                           (ivec3 (+ y ,(* 16 i)) x z)
                           (vec4 ,re ,im 0 0))))
  (fft16 in out))

without duplicating the FFT code or modifying the generator every time I wanted to change the access patterns or where it loads/saves. scratch and scratch2 are macros abstracting the layout of the elements in the shared memory.

Similarly

(defmacro with-image-vars ((x y z base-var count) &body body)
  `(let (,@(loop for i below count
                 for s =  (alexandria:format-symbol t
                                                    "~a~d" base-var i)
                 collect (list s `(image-load tex (ivec3 (+ ,y ,i)
                                                         ,x ,z)))))
     ,@body))

lets me define a set of repetitive bindings like (R0 (IMAGE-LOAD TEX (IVEC3 (+ Y 0) X Z))) to (R15 (IMAGE-LOAD TEX (IVEC3 (+ Y 15) X Z))) without getting errors where I missed/typoed a number while cut&pasting or typing them manually (which happened in one of the earlier versions, and wasted quite a bit of debugging time).
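
To check what a macro like this generates, macroexpand-1 with a small count is handy. The sketch below restates the macro with the standard cl:defmacro and plain intern instead of alexandria:format-symbol so it has no dependencies (that substitution and the -sketch name are mine, not from the original code). image-load, tex, and ivec3 are just symbols here since the expansion is never evaluated.

```lisp
(defmacro with-image-vars-sketch ((x y z base-var count) &body body)
  ;; Builds COUNT bindings named <base-var>0, <base-var>1, ... at
  ;; macroexpansion time, each loading one row offset from Y.
  `(let (,@(loop for i below count
                 for s = (intern (format nil "~a~d" base-var i))
                 collect (list s `(image-load tex (ivec3 (+ ,y ,i) ,x ,z)))))
     ,@body))

;; Expand a 3-element version and keep the resulting form for inspection.
(defparameter *expansion*
  (macroexpand-1 '(with-image-vars-sketch (x y z r 3) (+ r0 r1 r2))))
```

Printing *expansion* shows a let form with one binding per generated variable, which makes a skipped or duplicated index easy to spot.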

On the bad side, there is something wrong with dependency tracking in 3bgl-shader, so I have to explicitly reference some uniforms in the main shader to get them included in the output, and most error messages are very uninformative. It also could use some way to hook into slime for arglist hinting and source locations... having to search manually is pretty annoying after being used to just hitting M-..

When I finally managed to convince myself to stop trying to optimize it, I was getting about 26ms or so for a 256^3 FFT, divided into 3 passes (one for each dimension) of around 8-9ms each on my GTX780. Judging by the CUDA benchmark results I've seen online I think it should be able to run about 2-3x faster, but I'm not sure how much of that is just my implementation and how much is CUDA being more flexible than GL compute shaders.

One update of 3d smoothlife needs 2 convolutions and a pass to implement the birth/death rules using the results of the convolutions. The convolutions can share the FFT, leaving the total work at 1 FFT, 2 complex multiplies per pixel, 2 IFFTs, then running the rules per pixel. The multiplies take a few ms each, so in the final result I ended up using 1 frame per axis for the FFT and each of the 2 IFFTs (3 transforms x 3 axes = 9 frames), 1 frame for the multiplies, and 1 frame for the rules, for a total of 11 frames per update. That leaves enough time for a simple draw at 60FPS as long as it isn't drawing too much (256x overdraw is a lot even for a modern GPU, particularly if it is spending most of a frame on other things). Smarter rendering is the next step, since volume rendering is the other part I wanted to experiment with for this project.

Results so far:

3bgl-shader alpha release
Sun, 21 Sep 2014 17:13:17 CDT

One of the things I have been working on off and on for the last few years is a DSL for translating something resembling Common Lisp to GLSL. Recently, I finally got around to combining the various pieces into something approximately usable, and the result is 3bgl-shader.

It is still a bit rough, in particular most of the error messages are pretty bad. The main features (type-inference, dependency tracking, and interactive usage) seem to be working reasonably well though.

The video also shows something else I've been working on recently, which is a hack to show an xterm within a 3d scene, using the X Composite extension. That part is even rougher, to the point it doesn't even start up reliably. When it does start, it works reasonably well, and I even managed to get a nested draw loop working on *debugger-hook* so it can be used to debug code running in an emacs displayed on the xterm from within the code being debugged.

Just a quick note so I can find it again:
Sun, 21 Sep 2014 00:00:00 CDT

To get a full backtrace in sbcl debugger (including from slime sldb) with all of the internals involved in signalling the error:

(setf sb-debug:*stack-top-hint* 'sb-debug::invoke-interruption)

before triggering the error.

Useful when there is a problem with the internals, and the stack trace wouldn't otherwise show the interesting frames.

Blogging again
Sat, 28 Jun 2014 00:00:00 CDT

Trying to get started blogging again, so finally uploaded some old posts that had been waiting on me to tweak the colorization a bit, and some info about recent 3bmd changes.

3bmd updates

3bmd is a Markdown processor implemented in Common Lisp, with various extensions.

Thanks to Vsevolod Dyomkin, it now has an extension to support some of the PHP Markdown Extra table format, so

| Left  | Center Aligned  | Right |
| :------------ |:---------------:| -----:|
| *marked up*      | some wordy text | 123 |
| plain | **centered** | 4 |

renders as

Left         Center Aligned    Right
marked up    some wordy text     123
plain           centered           4

Unlike the original PHP Markdown Extra tables, the | are required on both sides of the table.

Thanks to Gábor Melis, 3bmd can also output markdown, for example after making modifications to the parsed document.

Unfortunately the internal parse tree is still pretty ugly and not an officially supported API yet, so expect it to change at some point.