Since I started playing around with Pixel Bender in Flash, I’ve been trying out some different approaches here and there and learned a thing or two on performance optimizations (and quirks). As many people use PB specifically for its performance, and not much has been written on the subject, I thought I’d share my experiences and back them up with some benchmarks. Some of the things here are pretty obvious, yet others can be surprising and even frustrating.
Remember that this concerns Shaders in Flash Player, not Photoshop or After Effects, and that results could change in future versions. All benchmarks were performed on my crummy PC (AMD Athlon 64 X2 Dual, 2.21GHz, 2GB RAM, Win XP), using 500×500 data with 4 channels, each test performing 10 consecutive kernel executions. The kernel itself is just a read, a multiplication, a division, and a sqrt. ShaderJobs are performed synchronously.
Let’s get the obvious out of the way first (I won’t go into common-sense optimizations too much).
- Use 4 channels only if necessary. No transparency? Ditch it.
- Precalculate recurring constant calculations in Flash and pass them as parameters (such as width*height). Sure, it makes the “interface” of your Kernel potentially harder to read, but since Flash doesn’t support dependents (I hope it will some day), this should be a no-brainer if performance is really important.
- If only a part of a BitmapData needs to be processed, isolate it into a new BitmapData using copyPixels first. Even when using applyFilter, the sourceRect parameter is buggy.
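To make those last two tips concrete, here’s a minimal ActionScript 3 sketch. The kernel, input, and parameter names (`area`, `src`, `rect`) are hypothetical, not from the benchmark source:

```
// Precalculate width*height once in ActionScript and pass it as a
// parameter, instead of recomputing it per pixel in the kernel.
var shader:Shader = new Shader(kernelBytes);
shader.data.src.input = source;
shader.data.area.value = [source.width * source.height];

// Isolate the region of interest into its own BitmapData with
// copyPixels, rather than relying on applyFilter's sourceRect.
var region:BitmapData = new BitmapData(rect.width, rect.height, true, 0);
region.copyPixels(source, rect, new Point());
```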
Told you it was obvious :p Now, some better ones.
Use ShaderJob, not ApplyFilter
- ShaderJob (on BitmapData) benchmark: 92-99ms
- ApplyFilter benchmark: 104-109ms
- ShaderJob is ~10% faster
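For reference, here’s what the two approaches look like side by side in ActionScript 3. This is a sketch; it assumes a shader with a single input named `src` and a 500×500 source:

```
var output:BitmapData = new BitmapData(500, 500, true, 0);

// applyFilter route: wrap the shader in a ShaderFilter.
output.applyFilter(source, source.rect, new Point(),
                   new ShaderFilter(shader));

// ShaderJob route (~10% faster in these tests):
shader.data.src.input = source;
var job:ShaderJob = new ShaderJob(shader, output, 500, 500);
job.start(true); // true = waitForCompletion (synchronous)
```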
BitmapData is faster than ByteArray is faster than Vector.<Number> !
I’ve seen (and been guilty of) a lot of copying BitmapData into a Vector to harness “the power of Vector”. But look at this:
- ShaderJob on BitmapData: 92-99ms
- ShaderJob on ByteArray: 147-172ms
- ShaderJob on Vector.<Number>: 167-192ms
- BitmapData is ~40% faster than ByteArray
- BitmapData is ~47% faster than Vector.<Number>!!
Use BitmapData unless you have no other choice, or full floating-point precision is important.
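If you do need the precision, targeting a `Vector.<Number>` (or ByteArray) looks like this. A sketch, assuming 500×500 4-channel data: the job reads and writes 4 floats per pixel, so the buffer must hold width × height × 4 Numbers:

```
// Fixed-size buffer: 500 * 500 pixels * 4 channels.
var result:Vector.<Number> = new Vector.<Number>(500 * 500 * 4, true);
var job:ShaderJob = new ShaderJob(shader, result, 500, 500);
job.start(true);
// result now holds one float per channel, full precision,
// at the cost of the ~47% slowdown measured above.
```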
Conditionals are expensive!
This one annoys me quite a bit. Imagine you’re doing some calculations that you don’t need to do when alpha == 0 (which, as it happens, is usually the case). It can be a good idea to do them anyway in favour of dropping the alpha == 0 check. For the benchmark, I used values where alpha was set to 0 for about half of the data! Compare these results to the previous benchmark.
- BitmapData: 134-192ms – ~47% speed loss!!!
- ByteArray: 147-172ms – ~22% speed loss
- Vector: 192-213ms – ~27% speed loss
In practice, test one version with the conditional and one without. The results vary heavily depending on how often the calculations are omitted, and how many calculations are otherwise performed. Still, with half the (admittedly trivial) calculations omitted in this case, it’s stupefying that execution time increases this much.
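In Pixel Bender kernel code, the trade-off looks like this. A hypothetical kernel mirroring the benchmark’s read/multiply/divide/sqrt (the `factor` and `divisor` parameter names are mine, not from the benchmark source):

```
kernel ConditionalTest < namespace:"test"; vendor:"demo"; version:1; >
{
    input image4 src;
    output pixel4 dst;
    parameter float factor;
    parameter float divisor;

    void evaluatePixel()
    {
        pixel4 px = sampleNearest(src, outCoord());
        // Conditional version: skips the math when alpha == 0, yet was
        // up to ~47% slower on BitmapData in these tests. The branch-free
        // version simply removes the if and always does the math.
        if (px.a != 0.0) {
            px.rgb = sqrt(px.rgb * factor / divisor);
        }
        dst = px;
    }
}
```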
Do not use the input as the output
When using a ShaderJob or applyFilter, don’t use the same BitmapData/ByteArray/Vector instance that functions as the source. If you need iteration, you’re better off swapping two buffers. Otherwise, Flash Player needs to make a temporary copy of the source, which slows things down.
Edit: The results here were originally compared to the normal ShaderJob test, while they use the alpha test. Percentages have been updated.
- BitmapData: 207-218ms – ~30% speed loss
- ByteArray: 256-271ms – ~65% speed loss
- Vector: 276-293ms – ~40% speed loss
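A minimal ActionScript 3 sketch of the ping-pong pattern (the `iterations` count and the `src` input name are assumptions):

```
// Iterate by swapping two buffers instead of writing a ShaderJob's
// result back into its own source.
var bufA:BitmapData = source.clone();
var bufB:BitmapData = new BitmapData(source.width, source.height, true, 0);

for (var i:int = 0; i < iterations; i++) {
    shader.data.src.input = bufA;
    new ShaderJob(shader, bufB, bufB.width, bufB.height).start(true);
    // Swap: last output becomes next input.
    var tmp:BitmapData = bufA;
    bufA = bufB;
    bufB = tmp;
}
// bufA now holds the result of the final iteration.
```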
Update: Asynchronous ShaderJob
I just tested it, and the results indicate that asynchronous calls (waitForCompletion = false) are slower than synchronous calls. I suppose that’s mainly because of the event handling flow. Another thing I tested was running 2 asynchronous calls on data of half the size, but it seems only 1 asynchronous ShaderJob can run at a time.
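For completeness, the asynchronous variant looks like this in ActionScript 3 (a sketch; `output` and the job dimensions are assumed from the benchmark setup):

```
// start(false) returns immediately; the job dispatches
// ShaderEvent.COMPLETE when it finishes.
var job:ShaderJob = new ShaderJob(shader, output, 500, 500);
job.addEventListener(ShaderEvent.COMPLETE, onJobComplete);
job.start(false); // waitForCompletion = false

function onJobComplete(e:ShaderEvent):void {
    // output is now filled; in these tests this path was slower
    // overall than start(true), presumably due to event handling.
}
```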
That’s it, see for yourself!
In closing, I’ll mention something I usually do but doesn’t seem to have any effect (it’s actually a habit from ActionScript). When reading from the same coordinate multiple times, I often store outCoord() in a variable and use that in the sample function. Well, I tested it, and it doesn’t have any impact at all :)
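The habit in question, inside `evaluatePixel()` (a Pixel Bender sketch; the input name `src` is assumed):

```
// Caching outCoord() in a local versus calling it per sample:
// measured, and it makes no difference either way.
float2 coord = outCoord();
pixel4 a = sampleNearest(src, coord);
pixel4 b = sampleNearest(src, coord + float2(1.0, 0.0));
```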
That’s it, at least for now, I hope it’s helpful! Check the benchmark and its source (the source is in fact pretty ugly, but does the trick). I’d be happy to know what kind of results other hardware yields.
17 thoughts on “Some Flash Pixel Bender performance tips + benchmarks”
Conditionals are expensive because in that case both branches are computed because multiple pixels are computed simultaneously, this is similar to how it works on GPUs. Your suggestion to try it both ways is spot on. This trade-off may change later though…
Were the vectors all created with a fixed size first? I’ve found this gives a big performance boost.
Kevin: Thanks for giving input on why that is the case; it makes sense why it happens in Pixel Bender. Though in the specific case of the Flash Player implementation, that kind of behaviour can be confusing and as such worth pointing out.
Nathan: Yes, all Vector instances are instantiated as fixed size. The performance may look unexpected since it’s the opposite of ActionScript, but it’s probably a result of marshalling. The data is structured differently per type in memory, while the kernel expects the input/output to be in a specific format.
Thanks for sharing David! I am in the middle of optimising a filter where I am using a lot of sin and cos calculations, do you think that caching them (i.e. as bitmapData pixel values) is a good idea?
My results on a 2.0GHz MacBook Pro:
- ShaderJob on BitmapData: 49ms
- ApplyFilter on BitmapData: 60ms
- ShaderJob on Vector: 130ms
- ShaderJob on ByteArray: 97ms
- SJ on BitmapData with alpha test: 87ms
- AF on BitmapData with alpha test: 96ms
- SJ on Vector with alpha test: 181ms
- SJ on ByteArray with alpha test: 130ms
- SJ BitmapData to self: 121ms
- SJ Vector to self: 211ms
- SJ ByteArray to self: 156ms
Damn! How can I beat you, David? ;D Btw, how about processing 2 ShaderJobs (or more) for 1 bitmap (sliced to half size, like SLI) in 1 loop? Ah yes, maybe I’ll have to try it myself ;)
Og2T: Thanks for posting the results! Seems they’re in line with mine. About using a BitmapData lookup for sin and cos: you’ll be left with some overhead in that case. Since channel values are in the [0, 1] interval, you need to a) sample the image, and b) map the result back to [-1, 1] (although that is trivial). I haven’t tested it, but I’d say that cos and sin should be fast enough compared to the sampling, and they have proper floating point precision (if that’s a concern).
Katopz: Ah right, I was planning to test multiple asynchronous ShaderJobs and see if the multithreading still has a positive effect on the overall time – but I forgot about it! I’ll try to do some testing later today. In case of 2 synchronous calls, it’ll probably be a bit slower because of calling/marshalling overhead. But… who knows :)
It is not a surprise that Vector.<Number> is slower than both other alternatives. But what worries me is that it is that much slower, and especially that ByteArray is slower than BitmapData.
Conversion from BitmapData to float can be done using a simple mapping function with a lookup table. E.g. f(x) = t[x]; where t is initialized with t[i] = i / 255.0f for all i in [0,255].
For Vector.<Number> to float this is a different story. Something like ((float)x) might be done, but that should not cause such a big difference.
Now what is really strange is that ByteArray is slower since no conversion is needed at all.
Thank you for the tests. For a new ImageProcessing framework I am using Vector.<Number>. But I will keep using it, since pre-multiplied alpha is a pain and I am shooting for high quality :o)
Yeah, for Farbe I’m also still sticking to ByteArrays for most of the paint simulations for the same reasons. Especially those that require fluid sims; lower precision causes strong dissipation.
I’ve been thinking of several scenarios that could explain the slowdown for each type, but none were really satisfactory or logical. The fact that I don’t have any insight into how _exactly_ data is passed to the kernel doesn’t help, either. One theory was that the kernel expects its data as straight memory chunks, so it can easily receive screen buffer data for filters or blend modes (which are in fact relatively fast). But there has to be a conversion to float at some point, so for Vectors and ByteArrays that would mean doing 2 conversions. And such an implementation would not make sense in the context of Flash.
good job David, a lot of ideas now in my mind (and no need for canavis this time :p)
Thanks for these tips! I ran into an optimization problem today doubling an image’s size and wrote it up here:
Any idea what the bottleneck is? I am guessing it is overhead of moving from as3 into pixelbender, but really have no idea. Would be super to understand why it is suffering and how to help.
Erik: I think the second answer in that list says it best (quote: “I think the problem is that you are really comparing Pixel Bender against native player code, not against “actionscript”. I doubt Pixel Bender will ever win on that scenario.”). And he’s right, BitmapData::draw is coded in C++ directly in the player (so running natively on the “system”), so it’ll always have the upper hand :)
Speaking on performance, be sure to check out this recent article on Flash 10.3 and AIR 2.6 performance: