The Buddhabrot

Author’s Note:

This blog post got out of hand. There is a lot of detail, so feel free to skim; if a heading catches your eye, read deeper. There are pictures along the way.

Introduction

This is a follow-up post to my post about Playing With The Mandelbrot Set.

To set the stage, the Mandelbrot set is the set of all points on the complex plane where the function below can be iterated forever without the result ever exceeding 2. It is the set of all complex numbers c for which, under repeated application of this function, both the real and imaginary parts stay constrained.

z(n+1) = z(n)^2 + c

The Buddhabrot is the inverse: the set of all numbers where the function outlined above does escape the bound of 2. The image is a representation of the complex plane where each iteration of every escaping point is plotted in sequence. In the Mandelbrot calculation each pixel represents the number of iterations for that position; here each pixel is a hit counter recording the number of times a trajectory calculation has passed through that point, regardless of its starting position. The brighter the pixel, the more times it was “hit” by a Mandelbrot trajectory.

The important distinction is that the Mandelbrot shows the contours of the shape, while the Buddhabrot shows all the intermediate data points generated by doing the Mandelbrot calculation. They are simply different ways of displaying the same data.
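The hit-counter idea can be sketched in a few lines of PHP. This is my own illustration, not the post's render code; the -2..2 viewport and the pixel mapping are assumptions for the sketch. Every point an escaping trajectory visits bumps a counter for the pixel it lands in.

```php
<?php

// Illustration only: accumulate trajectory points into a hit-count grid.
function plotTrajectory(array &$hits, array $trajectory, int $size): void
{
    foreach ($trajectory as [$re, $im]) {
        // Map the complex plane (-2..2 on both axes) onto pixel coordinates.
        $x = (int) floor(($re + 2.0) / 4.0 * $size);
        $y = (int) floor(($im + 2.0) / 4.0 * $size);

        if ($x >= 0 && $x < $size && $y >= 0 && $y < $size) {
            $hits[$y][$x] = ($hits[$y][$x] ?? 0) + 1;
        }
    }
}

$hits = [];
plotTrajectory($hits, [[0.0, 0.0], [0.5, 0.5], [0.5, 0.5]], 8);
// $hits[5][5] is now 2: that pixel was "hit" twice by the same trajectory.
```

Brightness then falls out of the counts: the final image is just this grid normalized into the displayable color range.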

Computation Challenges

I was struggling with noise in the image. The problem is less apparent at lower resolutions, but at high resolutions the noise is a real problem. It was clear I needed to iterate on the code to resolve it, and I initially suspected floating point errors.

The problem is that a full-resolution render takes my laptop a smidge under 4 days, and that is before cranking the quality slider. Making a code change and then waiting literal days to see what (if any) impact it made is not going to let this project conclude anytime soon. Sounds like a side-quest!

The good news is that each calculation point is independent of each other. In theory it should be possible to break the work up to one “work unit” per “input-pixel” and then map the output heat-map onto an image.

Swoole

My first stop on my optimization journey was Swoole, a PHP extension that lets you write concurrent code. Using my previous render (let’s say 4 days) as a benchmark, and assuming I was able to divide the work evenly among CPU cores, a c8g.12xlarge AWS EC2 instance with 48 cores would need about 2 hours. That napkin-maths out to about $4 per run. If I forget to turn the machine off, the $2/hr charge is going to get spendy quick.

Costs aside, maybe I could go slower on smaller instances. What would that look like?
Well, we would need a “work dispenser” that yields work units from a generator. We would also need each worker to send its results back via some sort of channel. The last step would be to create an image out of a reduction of all the results.
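The “work dispenser” half of that sketch is just a generator. Something like this (the function name and the -2..2 viewport are my own, for illustration):

```php
<?php

// Lazily yield one work unit per input pixel; nothing is materialized up
// front, so the dispenser itself uses almost no memory.
function workUnits(int $width, int $height): Generator
{
    for ($y = 0; $y < $height; $y++) {
        for ($x = 0; $x < $width; $x++) {
            yield [
                're' => -2.0 + 4.0 * $x / $width,
                'im' => -2.0 + 4.0 * $y / $height,
            ];
        }
    }
}

$units = iterator_to_array(workUnits(4, 4), false);
```

Each worker would pull from this generator and push its trajectory back over a channel for the final reduction.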

I had a quick chat with Greg in the PHP Freelancer’s Guild Discord, and we came to the conclusion that because of the necessity of the shared data, Swoole may not be the best fit for the project. With Swoole we must scale vertically, and this type of work doesn’t fit that model well.

If you have an idea for how to solve this problem with Swoole, leave a comment!

Queues And Workers

What if, instead of having a shared work dispenser, we broke the script up into a few distinct steps?

  1. Pre-calculate the work to be done and insert each work unit into a queue.
  2. Workers pick up the work from the queue and write the resulting trajectory coordinates into a shared database.
  3. Finally, the database is queried for all coordinates in the image and some post-processing is done.
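Stripped of any framework, the three steps reduce to a produce/consume/reduce pipeline. A toy end-to-end version (everything here is a stand-in; the real version used a queue driver and a shared database):

```php
<?php

$queue = [];
$database = []; // stand-in for the shared results database

// Step 1: pre-calculate the work and enqueue one unit per input pixel.
foreach ([[0.0, 0.0], [0.3, 0.3]] as [$re, $im]) {
    $queue[] = ['re' => $re, 'im' => $im];
}

// Step 2: workers drain the queue and store each resulting trajectory.
while (($unit = array_shift($queue)) !== null) {
    // A real worker would run the trajectory calculation here; this sketch
    // stores an empty trajectory just to show the data flow.
    $database[] = ['unit' => $unit, 'trajectory' => []];
}

// Step 3: query all stored coordinates back out and post-process.
$stored = count($database);
```

The appeal is that step 2 can run on as many machines as you can afford, since each unit is independent.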

Let’s do some napkin math on what each simulation would cost.

  • 6x c8g.2xlarge instances for two hours = $3.828 per run
  • 1x db.t3.small MariaDB instance = $36.32 per month

Ok, so the numbers don’t look great. One way or another the compute is going to cost about $4/run, but here we also have the added overhead of the database.

The other drawback is there will necessarily be a network round-trip request for each work unit. Millions of work units is a lot of round-trip requests.

Trade Offs

With the Swoole method I’m limited by the number of cores on any given physical machine. With the Queue route I’m limited by the network overhead to distribute the work. Both methods will cost about the same (minus the database).

The advantage of the queue worker method is that I can scale elastically. In theory I could set up an auto-scaling-group to hold a baseline (or even zero) number of workers, and set it to scale up to a bunch of workers to consume the workload.

In theory, at least, if I went the queue route I could tune how quickly an image is generated by tuning the maximum number of instances it is allowed to spin up. Each image would still have the same overall price tag, but I would have a control for how quickly I can iterate.

The Code

OK, let’s talk about the code.

We are calculating a $trajectory for a $mandelbrot calculation which moves through a series of complex $points. If we exceed our maximum number of iterations let’s return false as this point is inside the Mandelbrot set. Otherwise, keep appending the $points within the bounds of the set to our $trajectory until we find a point outside the set.

It may make more sense to just look at the code:

public static function trajectory(Mandelbrot $mandelbrot): false|array
{
    $trajectory = [];

    foreach ($mandelbrot->sequence() as $nth => $point) {
        if ($nth >= $mandelbrot->maxIterations) {
            return false;
        }

        if ($point->getReal()->abs()->isGreaterThanOrEqualTo(2) || $point->getImaginary()->abs()->isGreaterThanOrEqualTo(2)) {
            break;
        }

        $trajectory[] = $point;
    }

    return $trajectory;
}

Noisy Image

The image generated so far has a bit of noise in it. It’s not as pretty as the images you can find on Wikipedia, for example. This simply will not do: the goal is print quality, and “noise” is not print quality.

I suspected the noise was coming from floating point errors, so I spent some time implementing my library, which I link to below, and rewriting everything to use my implementation instead. The gist of the change is that instead of working with ~14 significant decimal digits, I’m working with 64 and rounding the remainder off. I’m sure nothing bad will happen that will be talked about later in this blog post.

The initial results were promising. Here are side-by-side 250x250px renders using the built-in math and the version using fancy math.

Buddhabrot image with included noise
Buddhabrot image with much of the noise removed

The Impossible Task

Here’s the line of thought: I need arbitrary-precision calculations to get the really precise outline of a shape with an infinite curve.
The problem: you cannot have both arbitrary precision and a render that finishes.
The reason is simple. Calculating precisely is necessary to get a clear image, but it blows up your memory usage: every squaring roughly doubles the number of decimal places, so a few thousand iterations into a 5000+ iteration run you are out of RAM. A technically solvable problem, but not in my lifetime.
I wonder if this is how physicists feel about trying to pin down an electron: you can measure its position or its momentum, but not both. Here you can either tell a lie to get an image, or accept the limitations of floating point numbers. That, or use all the compute power in the universe to calculate the shape of an infinite curve.
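The digit blow-up is easy to see with PHP’s bundled bcmath. Integers are used here for simplicity; the same doubling applies to decimal places. Squaring a d-digit number produces up to 2d digits, so unrounded precision grows exponentially with iteration count.

```php
<?php

// Square a 17-digit number repeatedly and watch the digit count double.
// bcmath keeps integer products exact, so nothing is rounded away.
$z = '12345678901234567';
$digits = [];

for ($i = 0; $i < 3; $i++) {
    $digits[] = strlen($z);
    $z = bcmul($z, $z);
}
// $digits is [17, 33, 65]: roughly doubling every squaring, which is why
// thousands of iterations without rounding is hopeless.
```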

I’m not opposed to telling a small lie and putting a disclaimer on the finished product. Or I could be honest and go to print with the errors included. But if I had to tell a small lie, let’s make it as small as possible within reason. I’m not afraid of a small compute bill. It’s entertainment dollars.

The Actual Code

The first thing I had to do was figure out how to do arbitrarily precise calculations on complex numbers. Turns out brick/math doesn’t support complex numbers. That’s OK, I can bring my own.

The next stop was to remove all references to division. The division path will blow up your memory usage if not carefully contained; the answer is to use rationals. “Oh boy, I’m smart,” I thought.
The immediate next step was realizing that some numbers are going to be irrational by the nature of what we are trying to draw. Suddenly not feeling so smart.
Some massaging between rationals where appropriate and decimals where irrational got me over the hurdle. “It’s only a small lie,” I tell myself.
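For flavor, here is roughly what the bring-your-own complex arithmetic boils down to. This sketch uses PHP’s bundled bcmath rather than my actual library, and the $scale parameter stands in for the 64-digit rounding described earlier:

```php
<?php

// One Mandelbrot step, z^2 + c, on arbitrary-precision components:
// (a + bi)^2 + c = (a^2 - b^2 + Re(c)) + (2ab + Im(c))i
function complexSquareAdd(array $z, array $c, int $scale): array
{
    [$a, $b] = $z;
    [$cr, $ci] = $c;

    return [
        bcadd(bcsub(bcmul($a, $a, $scale), bcmul($b, $b, $scale), $scale), $cr, $scale),
        bcadd(bcmul('2', bcmul($a, $b, $scale), $scale), $ci, $scale),
    ];
}

// (1 + 1i)^2 + 0 = 0 + 2i, carried to 10 decimal places.
[$re, $im] = complexSquareAdd(['1', '1'], ['0', '0'], 10);
```

Because the components stay as strings, precision is whatever you ask for, and the rounding at $scale is exactly the "small lie" being told.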

Elevating To The Cloud

We are not just throwing compute muscle behind a problem, we are being smart about it.

The Work Queue

Database Queue Driver

Right out of the box, since I needed a database to store the intermediate results anyway, let’s try Laravel’s database queue driver.

The first thing I noticed is that the throughput of the job that queues the work is terrible. TERRIBLE. It’s worse than local development on an SQLite database.

$ time php artisan app:queue-bhuddabrot-jobs 1000 1000
real    255m40.108s

Memory Store Queue Driver

A relational database is a bit overkill for the problem we are trying to solve. We are creating a big heap of work, then pulling the work off the queue as fast as we can.
While this is doable with a relational database, we quickly hit a ceiling on how fast we can pull items off the top of the pile.

Here’s the catch, though: adding and removing work from the queue is not the slowest part of the operation. Running the actual calculations is, so while there is room for improvement, optimization efforts here are chipping away at a relatively small slice of the total time spent.

That said, the final render has 243 million jobs to process. If each job is 1.5 KB, that adds up to 364.5 GB just to hold the tasks.

So I don’t blow the budget on cache storage space, I wrote some code to throttle the queue size. This ties the producer to how fast the work can be done, but queueing is not the slowest part of the work anyways.
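The throttle itself is simple. A sketch, where the callables are hypothetical stand-ins for the real queue's push and size operations: the producer backs off whenever the queue is at capacity, so storage is bounded by the cap instead of by all 243 million jobs.

```php
<?php

// Keep the pending queue at or below $max: check the depth before each push
// and sleep until workers drain enough room.
function fillQueueThrottled(iterable $units, callable $push, callable $depth, int $max): int
{
    $queued = 0;
    foreach ($units as $unit) {
        while ($depth() >= $max) {
            usleep(100_000); // back off and let the workers catch up
        }
        $push($unit);
        $queued++;
    }

    return $queued;
}
```

With a cap of, say, 100k jobs, the storage footprint stays in the hundreds of megabytes regardless of render size.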

Infrastructure Benchmarks

To get some numbers I launched two servers, a c8g.medium and a c8g.large – each was configured to run with 4 workers. That means that there are 8 workers total spread across 3 compute cores.

The “uppening” was upon us. If this worked, then all I’d have to do is create an auto scaling group of the proper size, pick a sensible worker-per-core ratio, and maybe bump the database specs.

A 1k render of the Buddhabrot image

It ended up taking 10 hours to generate a 1,000px by 1,000px image, which seems alright. It’s a start.

The next day I ran a similar test, but I tweaked the RGB iteration counts. I set up an auto-scaling group to target 65% utilization on a maximum of 3 c8g.large spot-instance nodes, plus the c8g.medium “work dispensing” server. All (up to) 16 workers wrote to a t4g.small database back end. This test took around 7 hours from start to finish.

Scaling this up won’t be a problem. The reality is I just need to tweak the auto-scaling group maximum node count to make it go faster. Maybe tune it to buy spot instances so I don’t pay full price.

And Then Shit Hit The Fan

I queued up the final render and set the auto-scaling group to allow up to 20 spot instances. A week and $75 later, the progress bar says I’m at 10%. The easiest 10%. An estimated $750+ compute bill for a pretty render is not in the budget at the moment.

According to the AWS billing dashboard, over the course of about 8 days I had sucked down 1368 compute hours. That’s about 8 weeks of computer time.

So I called up my buddy Chris for help. Surely I’m using the wrong tool for the job, he’s smart, he’ll have an opinion.

We talked about GPU computing, but PHP doesn’t run on GPUs. That, and doing arbitrary-precision math on a different computing architecture sounds like a lot of learning, and this project is this close to being done. I don’t want to start over from scratch.

Back To The Drawing Board

I stood back and beheld my creation. It had a work dispenser, a caching layer, work servers, and a database layer. What if, just like lawn mower racers stripping parts to go faster, we stripped parts off this monstrosity and saw how fast we could make it go?

Web-sockets Are The Answer

I slapped together some code to act as a “library” of sorts such that a worker could “check out” a unit of work.

Then I put together a basic websocket server that distributes work from the library. Once all the work results are reduced down to the resulting image, it saves it.

Finally, a client script that connects to the server, checks out work, and returns the necessary calculations.
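The “library” metaphor maps to a small amount of state. A sketch (class and method names are mine, not the post’s code): units wait on a shelf, get checked out to clients, and can be reshelved if a client disappears without returning results.

```php
<?php

// A stripped-down "library" of work units. Names are hypothetical.
class WorkLibrary
{
    private array $shelf = [];      // id => work unit, waiting to be done
    private array $checkedOut = []; // id => work unit, currently with a client

    public function add(string $id, array $unit): void
    {
        $this->shelf[$id] = $unit;
    }

    /** Hand the next unit to a client, or null if the shelf is empty. */
    public function checkOut(): ?array
    {
        if ($this->shelf === []) {
            return null;
        }
        $id = array_key_first($this->shelf);
        $unit = $this->shelf[$id];
        unset($this->shelf[$id]);
        $this->checkedOut[$id] = $unit;

        return ['id' => $id, 'unit' => $unit];
    }

    /** A client returned results; the unit is done. */
    public function complete(string $id): void
    {
        unset($this->checkedOut[$id]);
    }

    /** Put an abandoned checkout back on the shelf (client vanished). */
    public function reshelve(string $id): void
    {
        if (isset($this->checkedOut[$id])) {
            $this->shelf[$id] = $this->checkedOut[$id];
            unset($this->checkedOut[$id]);
        }
    }

    public function remaining(): int
    {
        return count($this->shelf) + count($this->checkedOut);
    }
}
```

The websocket server just wraps this in message handlers: a “checkout” message calls checkOut(), a “result” message calls complete() and feeds the reduction.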

Further Optimizations

There were several optimizations along the way. The biggest optimization one can make is to simply “not do the work” in the first place.

I thought more about the work being done. Each input pixel coordinate has its calculation run three times with different iteration-count limits; this is how you get the red, green, and blue channels.
What if, instead of doing the same work three times with different iteration counts, we did the work once and counted the outputs? Then color the red, green, and blue channels depending on the iteration count.

This optimization alone cut the work to render each image to a third. Being able to calculate all color channels from a single trajectory calculation was a big milestone.
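The bucketing can be sketched like this (the per-channel limits are made-up numbers, not the ones used for the print): run the trajectory once, note the iteration at which it escaped, and credit every channel whose iteration limit covers that count.

```php
<?php

// One trajectory feeds all three channels: a trajectory that escapes after
// $escapedAt iterations counts toward each channel whose limit it fits under.
function channelsForEscape(int $escapedAt, array $limits): array
{
    $channels = [];
    foreach ($limits as $name => $max) {
        $channels[$name] = $escapedAt <= $max;
    }

    return $channels;
}

// Illustrative limits only.
$limits = ['red' => 5000, 'green' => 500, 'blue' => 50];
$channels = channelsForEscape(200, $limits);
// A trajectory escaping at iteration 200 lands in red and green, not blue.
```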

Noisy Image Part 2

Remember how I joked that “this rounding decision surely won’t haunt us later”? Well, now is later, and this is that section.

This one took some dialogue to work out. The new system was blazingly fast but the image noise came back. I hunted for (and found) all sorts of bugs in the math libraries but render after render gave the same noise.

I was running simulations at 500x500px. Occasionally I’d run a 1000x1000px calculation, but the noise was present there too.
By happenstance, I wanted an intermediate-quality image, so I ran 768×768 and the noise was gone. There might be something here.

I consulted with ChatGPT and it helped guide me to the answer. It turns out that the overlaid grid is due to a small bias introduced by the combination of a base-10 numbering system and the necessity of rounding to the nearest base-10 decimal. This bias makes it subtly more likely that an escaping trajectory will land on a divisible-by-10 pixel boundary.

The solution is so simple: the render cannot be of a size divisible by 10. Illustrated below are a 500px and a 502px wide image. Note that not only do the lines disappear, but the image is much sharper.

500x500px Buddhabrot with overlaying grid.
502x502px Buddhabrot with no overlaying grid visible.

The Final Render

Through the magic of “publishing when it’s done,” you didn’t have to wait the week or so it took for the final render to finish.

A server monitoring dashboard observing several client workers.

Anyways, enough stalling, here’s the final print, framed and ready to hang on the wall:

Me sitting in front of a framed poster.
