Go: Speeding up image resizing with SIMD

Go 1.26.0 was released a few hours ago. Among other things, it introduces the SIMD experiment, with the simd/archsimd package.

I wanted to take it for a spin, so I tried applying it to a simple use case: image resizing.

Establishing a baseline

First, a baseline. Go’s standard library doesn’t have image scaling functions, but two well-known packages outside it do: github.com/nfnt/resize and golang.org/x/image/draw. For both I’ll use bilinear interpolation, the fastest method that still gives acceptable results.

import "github.com/nfnt/resize"

func ResizeNFNT(w, h uint, img image.Image) image.Image {
	return resize.Resize(w, h, img, resize.Bilinear)
}

or

import "golang.org/x/image/draw"

func ResizeDraw(w, h uint, img image.Image) image.Image {
	dst := image.NewRGBA(image.Rect(0, 0, int(w), int(h)))
	draw.BiLinear.Scale(dst, dst.Bounds(), img, img.Bounds(), draw.Over, nil)
	return dst
}

Let’s consider a typical resizing task: taking a large (25 MPix) image and generating a small (400x300) thumbnail.
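For reference, the numbers below come from a benchmark harness along these lines (the test image path and setup details are illustrative, not the exact code I ran):

import (
	"image/jpeg"
	"os"
	"testing"
)

func BenchmarkResizeNFNT(b *testing.B) {
	// Load the large source image once, outside the timed loop.
	f, err := os.Open("testdata/large.jpg") // 5824x4368 JPEG
	if err != nil {
		b.Fatal(err)
	}
	img, err := jpeg.Decode(f)
	f.Close()
	if err != nil {
		b.Fatal(err)
	}
	b.ResetTimer()
	for range b.N {
		ResizeNFNT(400, 300, img)
	}
}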

BenchmarkResizeNFNT-16      78033728 ns/op
BenchmarkResizeDraw-16     371551918 ns/op

NFNT it is.

Can we go faster?

Progressive downscaling

Peering at NFNT’s code, I noticed that it always computes the scaled image in a single interpolation pass over the source. I figured that doing some progressive downscaling passes before that final interpolation could help.

The idea is simple. Take our 5824x4368 benchmark image. If you compute the 400x300 output in one interpolation pass, each output pixel covers a 14.56x14.56-pixel block of the input (5824 / 400 = 4368 / 300 = 14.56). Oops, that’s not a whole number! Which is more or less why interpolation needs floating point arithmetic.

But halving an image is trivial: just average blocks of 2x2 pixels, pure integer math. So we can keep halving until the image is small enough, and only then do the expensive bilinear interpolation on what’s left.

// instead of:
5824x4368
 \--- bilinear interpolation --> 400x300

// do this:
5824x4368
 \--- halve --> 2912x2184
                 \--- halve --> 1456x1092
                                 \--- halve --> 728x546
                                                 \--- bilinear interpolation --> 400x300

Or, in terms of code:

func ResizeDownscaling(w, h uint, img image.Image) image.Image {
	for {
		size := img.Bounds().Size()
		if uint(size.X) <= 2*w || uint(size.Y) <= 2*h {
			return ResizeNFNT(w, h, img)
		}
		img = downscaleHalf(img)
	}
}

It looks like more steps, but each halving is so cheap that the total work ends up being less. The final interpolation pass now operates on a much smaller image, which is where the real savings come from. And as a bonus, averaging 2x2 blocks of bytes is the kind of operation that should be very amenable to vectorization… but let’s not get ahead of ourselves.

For the downscaling logic, I couldn’t be bothered to write it myself so I asked an LLM to do it. Since our source image comes from a JPEG, it’s a YCbCr image.


func downscaleHalf(img image.Image) image.Image {
	src := img.(*image.YCbCr)

	srcCH := len(src.Cb) / src.CStride
	dstYW := src.Bounds().Dx() / 2
	dstYH := src.Bounds().Dy() / 2
	dstCW := src.CStride / 2
	dstCH := srcCH / 2

	yLen := dstYW * dstYH
	cLen := dstCW * dstCH
	buf := make([]uint8, yLen+2*cLen)

	dst := &image.YCbCr{
		Y:              buf[:yLen],
		Cb:             buf[yLen : yLen+cLen],
		Cr:             buf[yLen+cLen:],
		YStride:        dstYW,
		CStride:        dstCW,
		SubsampleRatio: src.SubsampleRatio,
		Rect:           image.Rect(0, 0, dstYW, dstYH),
	}

	halfPlane(dst.Y, src.Y, dstYW, dstYH, src.YStride)
	halfPlane(dst.Cb, src.Cb, dstCW, dstCH, src.CStride)
	halfPlane(dst.Cr, src.Cr, dstCW, dstCH, src.CStride)

	return dst
}

func halfPlane(dst, src []uint8, dstW, dstH, srcStride int) {
	for y := range dstH {
		row0 := src[2*y*srcStride:]
		row1 := src[(2*y+1)*srcStride:]
		for x := range dstW {
			dst[y*dstW+x] = uint8((uint16(row0[2*x]) + uint16(row0[2*x+1]) +
				uint16(row1[2*x]) + uint16(row1[2*x+1]) + 2) / 4)
		}
	}
}

downscaleHalf is mostly bookkeeping: computing sizes, allocating a buffer for the three YCbCr planes, and building the new image struct. The interesting part is halfPlane, which simply iterates over pairs of rows to average 2x2 blocks.
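
To make the rounding concrete: for the 2x2 block {10, 20, 30, 41}, halfPlane computes (10+20+30+41+2)/4 = 25, i.e. a rounded average. A minimal sanity-check sketch:

import "testing"

func TestHalfPlaneRounding(t *testing.T) {
	// One 2x2 source block, flattened as two rows of two pixels.
	src := []uint8{
		10, 20,
		30, 41,
	}
	dst := make([]uint8, 1)
	halfPlane(dst, src, 1, 1, 2) // dstW=1, dstH=1, srcStride=2
	if dst[0] != 25 {
		t.Fatalf("got %d, want 25", dst[0])
	}
}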

Let’s benchmark:

BenchmarkResizeNFNT-16          78033728 ns/op
BenchmarkResizeDownscaling-16   38407643 ns/op

Nice, already more than a 2x speedup!

Can we go faster?

Adding vectors

That halving step, averaging 2x2 blocks of bytes, is exactly the kind of thing that SIMD should excel at. Let’s vectorize halfPlane using the new simd/archsimd package.

Recall that we operate on two rows of pixels at a time:

a0 b0 c0 d0 e0 f0 g0 h0 ...
a1 b1 c1 d1 e1 f1 g1 h1 ...

And we want to compute1:

(a0+b0+a1+b1)/4   (c0+d0+c1+d1)/4   (e0+f0+e1+f1)/4  ...

Here’s how we2 vectorized the process.

  1. Load the first row into a vector3:
    row0 = [ a0 b0 c0 d0 e0 f0 g0 h0 ]
    
  2. Use PermuteOrZero with a vector of the even (resp. odd) indices to get only the even (resp. odd) pixels in the lower half:
    row0even = row.PermuteOrZero([ 0 2 4 6 -1 -1 -1 -1 ]) -> [ a0 c0 e0 g0 0 0 0 0 ]
    row0odd  = row.PermuteOrZero([ 1 3 5 7 -1 -1 -1 -1 ]) -> [ b0 d0 f0 h0 0 0 0 0 ]
    
  3. Use Average to compute the averages of pairs of pixels:
    row0even.Average(row0odd) -> [ (a0+b0)/2 (c0+d0)/2 (e0+f0)/2 (g0+h0)/2 0 0 0 0]
    
  4. Do the same steps for row1
  5. Average the two rows with the same Average function, and we get what we want!

Here’s the complete code:

import "simd/archsimd"

var evenShuf = [16]int8{0, 2, 4, 6, 8, 10, 12, 14, -1, -1, -1, -1, -1, -1, -1, -1}
var oddShuf = [16]int8{1, 3, 5, 7, 9, 11, 13, 15, -1, -1, -1, -1, -1, -1, -1, -1}

var (
	evenIdx = archsimd.LoadInt8x16((*[16]int8)(&evenShuf))
	oddIdx  = archsimd.LoadInt8x16((*[16]int8)(&oddShuf))
)

func halfPlane(dst, src []uint8, dstW, dstH, srcStride int) {
	simdW := dstW &^ 7
	for y := range dstH {
		row0 := src[2*y*srcStride:]
		row1 := src[(2*y+1)*srcStride:]
		dstRow := dst[y*dstW:]
		x := 0
		for ; x < simdW; x += 8 {
			r0 := archsimd.LoadUint8x16Slice(row0[2*x:])
			avg0 := r0.PermuteOrZero(evenIdx).Average(r0.PermuteOrZero(oddIdx))

			r1 := archsimd.LoadUint8x16Slice(row1[2*x:])
			avg1 := r1.PermuteOrZero(evenIdx).Average(r1.PermuteOrZero(oddIdx))

			avg0.Average(avg1).StoreSlicePart(dstRow[x:])
		}
		for ; x < dstW; x++ {
			dstRow[x] = uint8((uint16(row0[2*x]) + uint16(row0[2*x+1]) +
				uint16(row1[2*x]) + uint16(row1[2*x+1]) + 2) / 4)
		}
	}
}

And here’s the new benchmark:

BenchmarkResizeNFNT-16              78033728 ns/op
BenchmarkResizeDownscaling-16       38407643 ns/op
BenchmarkResizeDownscalingSIMD-16   12544658 ns/op

A 3x speedup from SIMD alone!

Can we squeeze out even more?

Pushing even further

During previous attempts at SIMD, I noticed that {Load,Store}SlicePart (which handles slices shorter than the vector width) was measurably slower than a plain {Load,Store}Slice. I haven’t dug into why; I guess the non-Part variant causes less branching or something (it still has bounds checks and panics when needed, though!). Either way, switching to the faster variant only needs some extra logic to avoid writing past the end of the buffer:

                row0 := src[2*y*srcStride:]
                row1 := src[(2*y+1)*srcStride:]
                dstRow := dst[y*dstW:]
+
+               // StoreSlice writes 16 bytes (8 useful + 8 junk). On non-last rows
+               // the overflow lands in the next row and gets overwritten. On the
+               // last row we stop early and let the scalar tail handle the rest.
+               safeSimdW := simdW
+               if y == dstH-1 && simdW > 0 {
+                       safeSimdW = simdW - 8
+               }
+
                x := 0
-               for ; x < simdW; x += 8 {
+               for ; x < safeSimdW; x += 8 {
                        r0 := archsimd.LoadUint8x16Slice(row0[2*x:])
                        avg0 := r0.PermuteOrZero(evenIdx).Average(r0.PermuteOrZero(oddIdx))
 
                        r1 := archsimd.LoadUint8x16Slice(row1[2*x:])
                        avg1 := r1.PermuteOrZero(evenIdx).Average(r1.PermuteOrZero(oddIdx))
 
-                       avg0.Average(avg1).StoreSlicePart(dstRow[x:])
+                       avg0.Average(avg1).StoreSlice(dstRow[x:])
                }
                for ; x < dstW; x++ {
                        dstRow[x] = uint8((uint16(row0[2*x]) + uint16(row0[2*x+1]) +

And the result:

BenchmarkResizeNFNT-16              78033728 ns/op
BenchmarkResizeDownscaling-16       38407643 ns/op
BenchmarkResizeDownscalingSIMD-16   12544658 ns/op
BenchmarkResizeDownscalingSIMD2-16   9588322 ns/op

Another ~25% shaved off, for a few lines of extra bookkeeping.

Conclusion

All in all, progressive downscaling + SIMD gave us an 8x speedup over the baseline bilinear resize. This was a quick and dirty experiment, but I think it shows that even straightforward uses of SIMD can have a very real impact.

SIMD is a very welcome addition to Go, and I can’t wait to see what happens when these techniques get applied properly in libraries!


  1. well actually we want to compute the rounded averages (a+b+c+d+2)/4; that’s more or less what the SIMD code ends up doing so all is fine, I just simplified the explanations ^

  2. we: Claude and I ^

  3. Again for conciseness, I show 8-wide vectors; in practice they’re 16-wide ^