Since I started playing around with Pixel Bender in Flash, I’ve been trying out some different approaches here and there and learned a thing or two on performance optimizations (and quirks). As many people use PB specifically for its performance, and not much has been written on the subject, I thought I’d share my experiences and back them up with some benchmarks. Some of the things here are pretty obvious, yet others can be surprising and even frustrating.
Remember that this concerns Shaders in Flash Player, not Photoshop or After Effects, and that results could change in future versions. All benchmarks were performed on my crummy pc (AMD Athlon 64 X2 Dual, 2.21Ghz, 2GB Ram, Win XP), using 500×500 data with 4 channels, each performing 10 consecutive kernel executions. The kernel itself is just a read, a multiplication, a division, and a sqrt. ShaderJobs are performed synchronously.
Let’s get the obvious out of the way first (I won’t go into common sense optimalizations too much).
- Use 4 channels only if necessary. No transparency? Ditch it.
- Precalculate recurring constant calculations in Flash and pass them as parameters (such as width*height). Sure, it makes the “interface” of your Kernel potentially harder to read, but since Flash doesn’t support dependents (I hope it will some day), this should be a no-brainer if performance is really important.
- If only a part of a BitmapData needs to be processed, isolate it into a new BitmapData using copyPixels. Even when using applyFilter, sourceRect is buggy.
Told you it was obvious :p Now, some better ones.
Use ShaderJob, not ApplyFilter
- ShaderJob (on BitmapData) benchmark: 92-99ms
- ApplyFilter benchmark: 104-109ms
- ShaderJob ~ 10% faster
BitmapData is faster than ByteArray is faster than Vector.<Number> !
I’ve seen (and been guilty of) a lot of copying BitmapData into a Vector to harness “the power of Vector”. But look at this:
- ShaderJob on BitmapData: 92-99ms
- ShaderJob on ByteArray: 147-172ms
- ShaderJob on Vector.<Number>: 167-192ms
- BitmapData is ~40% faster than ByteArray
- BitmapData is ~47% faster than Vector.<Number>!!
Use BitmapData unless you have no other choice, or if complete floating point precision is important.
Conditionals are expensive!
This one annoys me quite a bit. Imagine you’re doing some calculations that you don’t need to do when alpha == 0 (which, as it happens, is usually the case). It can be a good idea to do them anyway in favour of dropping the alpha == 0 check. For the benchmark, I used values that had alpha set to 0 for about half of the data! Compare results to the previous benchmark.
- BitmapData: 134-192ms – ~47% speed loss!!!
- ByteArray: 147-172ms – ~22% speed loss
- Vector: 192-213ms – ~27% speed loss
In practice, test a version with and one without conditional. The results vary heavily depending on how many times calculations are omitted, and how many calculations are otherwise performed. Still, with half the (although slightly trivial) calculations omitted in this case, it’s stupefying that there’s so much increase in execution speed.
Do not use the input as the output
When using a ShaderJob or ApplyFilter, don’t use the same BitmapData/ByteArray/Vector instance that functions as the source. If you need iteration, you’re better of swapping two buffers. What happens is that Flash Player will need to make a temporary copy of the source, which slows things down.
Edit: The results here were compared to the normal ShaderJob test, while they’re using the alpha test. Percentages have been updated
- BitmapData: 207-218ms – ~30% speed loss
- ByteArray: 256-271ms – ~65% speed loss
- Vector: 276-293ms – ~40% speed loss
Update: Asynchronous ShaderJob
I just tested it, and the results indicate that asynchronous calls (waitForCompletion=false) are slower than synchronous calls. I suppose that’s mainly because of the event handling flow. Another thing I tested was to run 2 asynchronous calls with data of half the size, but it seems only 1 asynchronous ShaderJob can be started at the same time.
That’s it, see for yourself!
In closing, I’ll mention something I usually do but doesn’t seem to have any effect (it’s actually a habit from ActionScript). When reading from the same coordinate multiple times, I often store outCoord() in a variable and use that in the sample function. Well, I tested it, and it doesn’t have any impact at all :)
That’s it, at least for now, I hope it’s helpful! Check the benchmark and its source (the source is in fact pretty ugly, but does the trick). I’d be happy to know what kind of results other hardware yields.