production on despatches

What's a Circuit Breaker?

Wed, 06 May 2026 10:53:27 +0100

I’d not been interviewed very often. For my first job, sure, I sent around my CV to a bunch of places and did a bunch of interviews.

I then started my own company and ran that for nearly two decades, everything largely self-taught. After that, because of the peculiar shape of my CV, and “falling between a lot of stools” as a friend pointed out, interviews were… tricky!

I remember one such interview a few years ago. The interviewer was good at defusing any anxiety and it was a pretty relaxed interview. He wanted to know if I knew architectural design patterns - a term I hadn’t heard before. He gave me examples: circuit breaker, pub / sub and a third one I can’t remember anymore. I recognised pub / sub and, I think, the other one - but was not aware of circuit breaker. He explained it to me - something about reducing the risk of a downstream service failing.

It made sense - but I couldn’t remember any instance where I’d done that and I said as much.

No big deal! We moved on.

As you may have noticed from my recent flurry of posts, I’ve been going through a bit of an excavation of all things kraya, and discovering little nuggets that I’d forgotten all about.

Claude Code went through all of my emails and all the code repositories, and put together interesting things I had done. The purpose was mainly to pick out things that could be good blog posts, stories, or just reminders.

Claude had found two circuit breakers.

The first one was built by the seat of my pants in 2004. There was an active marketing campaign on megabus and the system was struggling. I’d plumbed the depths of the tech stack - the web servers, the database server to get every last ounce of performance out of the system.

What I really needed was to slow down the deluge of people coming into the site - just a little bit. I needed to prevent the snowball effect, and I tried a simple way to achieve it. I wrote the following around my 23rd birthday.

```php
$timestamp = time() - 120;
$sTimestamp = date("d-F-Y H:i");
$sql = "SELECT count(*) FROM tSearches WHERE Script_Start > '$sTimestamp' AND Script_End IS NULL;";

$numSearches = $dbh->getOne($sql);
if ($numSearches > 10) sleep(5);

$numSearches = $dbh->getOne($sql);
if ($numSearches > 10) sleep(5);

$numSearches = $dbh->getOne($sql);
if ($numSearches > 10) sleep(5);
```

(The eagle eyed among you might notice a bug in the above code. It wasn’t resolved for years. Coding live on a production system tends to create bugs. See if you can find it before reading on.)

(Regular readers may also remember a part of this story from I Know People Like You)

The system already tracked searches - so I just needed to check it.

This bit of code checks to see how many of the searches that started in the last two minutes were incomplete. Except that’s not what it did - I forgot to actually use $timestamp so it only checked the current minute’s worth.

I considered failing and letting the user retry manually - but I didn’t want to force the user to take action unless absolutely necessary. This one waited up to 15 seconds and then let the search happen anyway.

If the user was still waiting after 15 seconds, might as well try and do a search and see what happens. In hindsight, it might have been better to fail at that point. If there were too many searches after 15 seconds, the database server was likely already snowballing.
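The "fail at that point" alternative can be sketched as a small loop. This is a minimal illustration, not the code that ran on megabus: `waitForCapacity`, `$countInFlight` and `$pause` are hypothetical names, and the counting query is assumed to be wrapped in a callable that returns the number of unfinished searches.

```php
<?php
declare(strict_types=1);

/**
 * Wait for the number of in-flight searches to drop to the limit or below.
 *
 * $countInFlight is a hypothetical callable returning how many recent
 * searches have not finished yet. $pause is injectable so the wait can be
 * exercised without really sleeping; it defaults to PHP's sleep().
 *
 * Returns true if capacity appeared, false if the caller should give up.
 */
function waitForCapacity(
    callable $countInFlight,
    int $limit = 10,
    int $maxChecks = 3,
    int $delaySeconds = 5,
    ?callable $pause = null
): bool {
    $pause = $pause ?? 'sleep';

    for ($check = 0; $check < $maxChecks; $check++) {
        if ($countInFlight() <= $limit) {
            return true; // Quiet enough: let the search through.
        }
        $pause($delaySeconds); // Busy: hold this request back a little.
    }

    // One last look after the final pause; still busy means fail fast.
    return $countInFlight() <= $limit;
}
```

With the defaults this holds a request for up to 15 seconds, as the original did. The difference is the return value: a caller could show a "we're busy, please try again" page when it gets `false`, instead of piling one more search onto a database that is already struggling.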

This bit of code survived through to the end of the PHP codebase apart from minor tweaks. The Java version had layers of circuit breakers: session limits, rate limiters and thread pool configuration for scaling up. None of it, though, was called a circuit breaker - not by me or by the team.

So, have I built a circuit breaker? Thinking back, what threw me off was the wording. I had the feeling that the circuit breaker would “protect” another service.

In my mind, the database wasn’t another service - it was a part of the same service.
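For comparison, here is a minimal sketch of the pattern as it is usually described: after a run of failures the breaker "opens" and rejects calls outright for a cool-down period, then lets a single trial call through. It assumes PHP 8; the class name, thresholds and the explicit `$now` parameter are illustrative assumptions and not taken from any kraya codebase.

```php
<?php
declare(strict_types=1);

/**
 * Minimal circuit breaker sketch. All names here are illustrative.
 */
final class CircuitBreaker
{
    private int $failures = 0;
    private ?int $openedAt = null;

    public function __construct(
        private int $failureThreshold = 3,
        private int $coolDownSeconds = 30
    ) {
    }

    /**
     * Run $operation unless the circuit is open. $now is a Unix timestamp,
     * passed in explicitly so the behaviour is easy to follow and to test.
     */
    public function call(callable $operation, int $now): mixed
    {
        if ($this->openedAt !== null
            && $now - $this->openedAt < $this->coolDownSeconds) {
            // Open: fail fast without touching the struggling dependency.
            throw new RuntimeException('Circuit open');
        }
        // Otherwise the circuit is closed, or the cool-down has passed
        // (half-open) and this call acts as the trial.

        try {
            $result = $operation();
        } catch (Throwable $e) {
            $this->failures++;
            if ($this->openedAt !== null
                || $this->failures >= $this->failureThreshold) {
                $this->openedAt = $now; // Trip, or re-trip after a failed trial.
            }
            throw $e;
        }

        // Success closes the circuit and clears the failure count.
        $this->failures = 0;
        $this->openedAt = null;
        return $result;
    }
}
```

The 2004 snippet differs in two ways: it watched load (unfinished searches) rather than failures, and it delayed requests rather than rejecting them. The protective intent is the same, whether the thing being protected is "another service" or the database behind your own.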

Holding the Fort

Wed, 15 Apr 2026 15:26:10 +0100

For many years, I loved my job. I was working on a production system that saw tens of thousands of orders across the world.

By the time it was 2015, I did not.

I had poured blood, sweat and tears into a ticketing system that kraya built — first for megabus, then for Polskibus. It had broken me, and we were now just limping along.

By 2015, we were spending 101–286 hours each month on a support contract that paid for 60. I raised this with the client and suggested a minimum of a 100% increase. They refused. Instead, they minimised their requests to just about 80 hours each month. Without the development work, I was now making a loss each month providing the resources for this support contract.

I could cancel the contract, but that would involve letting them have the source code, which was otherwise kraya’s property. There was a six month notice period — six months of providing all the support they need to run and to migrate the system to their team.

At some point, I went from taking on the challenge to holding the fort.

I could see that any effort to make things easier would backfire. I built them a feature for free — they complain about one bug in it — and they need it fixed urgently. I loosen the rules and deploy an additional server just before the weekend, letting them know about the risks. Something goes wrong and they are surprised.

The hardest thing for me to change was the belief that the client’s wins were my wins. If anything, the client’s wins were my loss. It meant more unpaid work in the support contract. It meant more complex systems to support and maintain — while every penny was being questioned.

I started to say no. No, we will not deploy new servers on a Thursday because the weekend is too close. No, we will not reduce the QA time on this bit of functionality you want. No, we will not restore reports which could display erroneous data.

It all came to a head in one specific instance where the CEO insisted on speaking to me because they needed something deployed urgently. They needed us to work through the evening or the weekend. I said no. They offered to pay double. I said no. They were not happy.

A few months later, they cancelled the contract.

I remember the meeting. It was amicable and friendly. They asked if I could keep the staff on till the end of the year - they wanted backup in case anything went wrong during the migration.

I wouldn’t want to let them go before Christmas anyway.

I had a brief conversation with my brother to decide what to do next. I already knew that there was nothing left to keep running. We kept it running for three more months until the end of December.

I remember going into the main office, gathering everyone around and delivering the news. Polskibus had cancelled. We are shutting kraya down. We’ll keep going until the end of December.

I remember the Christmas party. It had often ended up being drinks, and being out all night. It didn’t this year - we went home after the meal. The atmosphere was one of sadness and relief. We’d all been through the fires and made it out the other end. It was over.

It was the end of what I’d built over 15 years.

I felt nothing.

I Chose to Keep Going

Tue, 14 Apr 2026 20:52:10 +0100

In 2008, we all watched Pivotal crash and burn. They’d taken a year and nearly £900k to build a new ticketing system for the fringe. On launch, they realised that it could serve only one customer at a time. We built an interim ticketing system for them over the weekend.

It was time for kraya to take a leap. We should take the megabus.com ticketing system to the next level and build a distributed ticketing system that could potentially scale infinitely.

I’d picked JBoss because it was backed by Red Hat - it had all these features and capabilities - a lot of which we needed. The other option we considered was Glassfish, but it just didn’t have the features we needed. There were other options but those involved prohibitive licensing fees.

We costed it out at £650k and a year. It was unrealistic but I could not imagine Stagecoach paying more, or giving us more time. They wanted it for £500k and in six months. I should’ve pushed back, but we’d built a booking system over the weekend, only a few months back - this should be possible, right?

It wasn’t.

We got a year in the end — we didn’t ask for it, and we didn’t know until we were most of the way down the path. When I heard about the delay it was a mix of relief and regret. We’d burned through most of the budget on getting in contractors who were leaving imminently. We could have got fewer people on board, but as permanent staff.

We launched to Canada first, and that wasn’t too bad. Then it was the US, and the nightmare started. The UK launched last on my birthday in 2010.

In the intervening years, we had a budget shortfall of £150k and because we’d rushed to build the product, we’d cut corners, and all the contractors had left. We even had to let go of some of the permanent staff. I asked Stagecoach if they’d fund us the extra £150k as we’d asked at the start. They said no. They renegotiated the contract. They demanded more oversight - and increased our reporting obligations.

Load testing was already a part of our process and we tested the new system under load. We identified issues and fixed them. On paper, it looked good.

It wasn’t.

The problem wasn’t load per se. It was the stability of the system. It struggled to stay up for extended periods of time. The more nodes there were, the worse it was.

It was exactly the kind of problem that was hard to replicate and hard to test. The main option we had was to think through what all it could be and to take stabs in the dark. I put together a hit list and worked through it methodically. Convincing Stagecoach to spend the money was sometimes harder than solving the problem.

Ultimately, the one thing that pushed us over the line in terms of stability was staggered nightly automated restart of each node.

I thought I was done working with software that needed regular restarts to stay functional and at first resisted this. We are running enterprise level software, and that too on Linux. If I was comfortable with such shenanigans, I might have stayed with Windows. But we were running out of options. It was a Hail Mary - it worked.

I had been working with Linux and related software for years by then. In fact, I had servers that had not been restarted for literal years, with services that had been running just as long. A lot of these services didn’t even need a restart on config change - just a reload.

None of these services even had a paid tier. That should have been the clue.

I cannot imagine having to restart PostgreSQL nightly or even Apache. I still do not understand how something slated for the enterprise market can have leaks that would warrant a regular restart to keep it working.

Years later, when I looked up issues around JBoss, I realised that it was notorious for a whole slew of problems with JGroups and clustering. I remember scouring the internet for details on any kind of issues and coming up empty.

I’d fallen into a marketing trap. JBoss was no Apache — it was the commercial product of Red Hat. I thought all enterprise open source software would be of the same calibre. It wasn’t.

From what I understand, Stagecoach spent millions and maybe two years building the whole ticketing system in-house. kraya limped along for a few more years before being shuttered.

The people involved, though, fortunately seem to have gotten through it largely unscathed. It gives me a great deal of joy to see so many of the juniors I’d hired now CTOs, VPs, Directors.

I believed that every victory was ours but when something went wrong, it belonged to me. After all, I made every choice. I chose to pursue the Java EE ticketing system. I chose JBoss. I chose to keep going.

I was in a narrowing path, with fewer and fewer options.

I knew I was the only one holding that line - I didn’t know that there was another option.

Bit by bit, I lost all sense of what I was paying to fix these mistakes.

Always On

Tue, 14 Apr 2026 19:51:09 +0100

I knew as soon as my phone rang what it was about.

It was the same every time. I would drag myself up to answer the phone - my body, my mind screamed at me, but I had gotten good at overriding every instinct through sheer willpower. I could hear the apologetic tone on the other side, and I could recognise some of the voices after a while. I mustered up all of my strength to be and sound as awake as possible. I needed to be professional even if I was still in my underwear.

I had a glass of water on my bedside table. I’d pick that up and head to my office in the spare room. The computer was always on and always ready to go - like me, I guess. I’d log on to the servers and check the logs. If I could identify which one had fallen out of the group, I could restart just that one. If I was too late or if the issue had escalated, I’d have to restart the whole cluster - shut them all down, give them a few seconds, then bring each one up, while keeping an eye on them. I could do it half asleep after a while.

Falling back asleep wasn’t a breeze either - I was tired - exhausted - but I was now also wired. Waking up in the morning was harder - the alarm would go off and my body would be limp. I still remember the sheer power of will to drag myself into the shower, then carry on with the rest of the day.

Of 266 incidents over about two years, I answered 156.

I remember one particular night, though I do not remember how many times I’d woken up beforehand. I was already tired.

megabus.com had gone offline. I got an alert. “But someone else is on call tonight,” I told them. “We already tried them twice,” came the reply. I had to deal with this. I had to deal with this.

I remember sitting at my desk at home working on fixing it. At some point, something was different, though I don’t remember what. While I was working on fixing it, I remember being overcome with an overwhelming impulse to get up from the chair and walk away — I almost imagined myself walking away. I resisted and shut down that impulse. I fixed megabus as I had always done. In fixing megabus though, something broke inside me, somewhere deep, in the very core of my being. I was never the same again.

I analysed the system top to bottom, inside and out. I even waded through JVM internals.

It got incrementally better, more stable. I think I rewrote every component that wasn’t the core ticketing system. In the end, what pushed it over the line were two unexpected changes: automated nightly restarts of each node in the cluster, and a rate limiter.

On the 10th December 2012, the system, now serving Polskibus, had a sale event. We had a bank of screens on a wall with all the key stats for the system. It looked cool, and we felt a bit like we were on a TV show. At peak, nearly 15,000 concurrent sessions — six or seven times the average. Over 30,000 bookings in a single day, three times more than the normal amounts across all systems.

We watched it closely, all day. Nothing broke. Nothing screamed. Everyone smiled, but there was no celebration.

Did They Have a Problem That Year?

Tue, 14 Apr 2026 10:24:26 +0100

2008 was a heck of a year for kraya, and for me. We were already operating megabus.com in the UK, USA, and Canada, along with Oxford Tube and the sales website for Coach USA - all for Stagecoach.

We were also working on the fringe website. We integrated the website with the brand spanking new ticketing system - which cost nearly £900k.

We were also hosting websites for Boots, Kellogg’s Food Service and dozens of other clients.

All of this was held together by three or four developers, two systems administrators and me.

On 13 June (incidentally, I got married on the same date years later), as I was just getting ready for a wild night on the town, a call comes through - which John answers.

I still remember them laughing and then doing a double take: “Oh, you’re serious? Let me get Shri.”

It was the fringe. We’d already known that they were having trouble with their ticketing system. I’d even pitched in, made suggestions - looked at their code to try and help, but none of that was enough. I expected an update.

They wanted to know if we could put together an interim booking system for them over the weekend. I wasn’t sure. I told them I’d speak to my team and get back to them.

I wasn’t involved with the work on the fringe up until this point. I knew very little about it. I was focused on megabus.com. The US version of the site had a big marketing campaign happening in a few days and that was what I was meant to be focused on.

By the time I put the phone down, Chris, who had been the lead on the fringe, already had an answer. “We can do it!”

“But how - it’s got to take more than a weekend - right?”

The fringe website was already built well and had a clean layer interfacing with the new ticketing system. In fact, that was the bulk of the work that year.

So, Chris told me, all we would have to do is implement the functionality within that thin layer, fattening it up.

He believed we could do it. I believed him.

I hopped in a cab, headed over to the fringe to talk it through. I didn’t promise them we’d be able to get something ready by Monday, but I promised we’d do our best.

We were in the office on the weekend, writing code. I remember working on the basket, sending diffs over email and generally having a good time.

I even had some megabus US fun to keep me entertained in the form of issues with loading sheets - I was already in the office, so it was one step easier to fix.

One of the bits of functionality which took a surprising amount of time was the seat allocation. None of us had to worry about that before - it was just capacity management on megabus. For the fringe though, we had to allocate actual seats with seat numbers and everything.

With the fringe we had multiple tables which all joined together (thanks, Hibernate) to encode a tremendous amount of detail about the seating plans - including their physical location on a map.

It was too much detail for us, so we had to simplify it all down to get it working in the timeframe. We kept most of the rest of the structures intact to keep the data migration easier once the ticketing system was fixed.

By Monday, we had each managed at most six hours of sleep on each of the previous three nights. I still have vivid memories of a suit of armour that we put together using packing material while we were waiting for bits of data or details of logic.

I broke the MySQL replication at 01:24, fixed by 01:32

I remember the delirium setting in. Email sent to the client with “fun fun fun fun fun fun fun” as the subject

There were random emails to my brother “I’m still here!”

I also remember making makeshift beds with bubblewrap to get a wee nap here and there. We were all so exhausted - pumped up on coffee and nicotine.

Finally, at 03:35 on the Tuesday, an email to the client: “DONE DONE DONE DONE DONE NODE NODE NODE NODE”

Then at 05:33, requesting a PostgreSQL server rebuild for megabus US for their marketing campaign.

At 10am on Tuesday, the fringe was finally able to sell tickets. The website promptly fell over from the load, but we nursed it back and it sold 65k+ tickets in the first week.

It would be at least two more weeks before the ticketing system was fixed and brought back in.

For the work we did for them that year, and the previous one, we effectively only charged about 30% - because that’s all they could afford. This year, we asked if they could put our name on the website.

Over the next two weeks (while megabus US was on their marketing campaign), we fought many battles. There were 750 duplicate bookings. Numerous customer complaints (thanks to our name being on the website) - almost all of them blaming us for the failure of the ticketing system. People did not understand that we put in the interim one, not the one that failed.

Press releases went out from the fringe - only two credited us. Both misspelt the company name. Both called us a web design company - which we were not, had never been, and had no interest in becoming.

In truth, I wanted to be a hero - I think we all did. What we really wanted was an acknowledgement of what we had done - which was nowhere to be found. We got paid though - at least for a part of our effort.

For many years after that, I would tell people with pride - “did you know - I saved the fringe, back in 2008,” which was inevitably met with something like “oh, did they have a problem that year?”

What we Carried

Mon, 13 Apr 2026 20:02:04 +0100

I started my company in 2000. I was 17. I built megabus.com in 2003. I was 21.

It started off small, and little by little, I carried more and more. I became we, and we carried more and more. Before we realised, we were carrying a great deal. Ultimately, though, if something went seriously wrong, it would be on my shoulders.

The chart does not capture the scaling of the organisation, or other departments like tech support or hosting, which had dozens of clients.

Data mined by Claude from my emails, issue trackers and code repos

The section at the top is the number of active committers for that quarter. You can see me on my tod at the start, and the rise and the fall of the dev team.

The very peak of it was in 2008. We built a booking system over the weekend for the Edinburgh festival fringe because their brand new £800k+ system could only handle one person at a time.

There were a handful of us building it while I was simultaneously prepping and managing the megabus US systems for a sales campaign. At the same time we were operating megabus across the UK, USA, and Canada, Oxford Tube, Coach USA, the Fringe website itself and numerous other smaller hosting clients like Boots, Kelloggs Food Service, and so on.

We were a total of ~4 developers and two systems administrators holding all of these together.

I was recently reminded of a story a friend of mine told me. Before he was my friend, he worked with me, and one of the things we did together was one of the big megabus deployments, when we migrated to a Java EE ticketing system.

One part of the migration was the data. I had developed a tool to migrate the data and on the evening - everything was prepared, we were off peak, and we had taken the site offline. He ran the script, which went on for a wee while and it failed. It was not meant to do that.

The way he tells the story, he told me about the failure. I come over, look at the errors, say “hmmm, that’s interesting,” and head off to have a cigarette. A few minutes later I go over to my desk, type away furiously, then ask him to run it again.

It worked, and completed.

I remember that night. I don’t remember what the problem was or how I fixed it, but I do remember that moment when I went over to see how it had failed. In the short walk from my desk to his, I reiterated in my mind, all the possible backup plans - with the worst case scenario being to call off the migration on that day. We would do it another day. It would cost money, but it would be do-able. I was ok with that.

I was curious as to where my limits were, so I kept pushing, until I would meet with a wall.

There were no rails and there were no railings - just a cliff edge, unmarked… I didn’t know that - I expected a brick wall.

The worst of it would be only a few years later, in 2011. We built a new Java EE ticketing system for a fraction of what it should have cost in about 30% of the time it needed.

I personally responded to over 250 out of hours emergency tickets over an 18 month period. That was hard!

I had run off a cliff edge and, like the coyote in the Road Runner cartoons, it took a while before I realised there was no ground beneath me.

A few years after I ran off the cliff, the company shut down. A few years later, I would start my active recovery journey through therapy. A few years later still, when I felt ready for a leadership role, I was asked to lead a problematic team - a role they struggled to fill for a while.

The team worked hard and delivered but struggled with the perception of poor delivery. Trust was thin, stress was high and morale was low. The situation was so bad that the week before I was supposed to start, the scrum master who was supposed to be my guide through it all quit.

I was warned by multiple people that this job was loaded with problems. I took on the job anyway, without a real guide, straight into multiple serious issues.

I loved it and managed to turn the whole thing around in my first week. Delivered key items, laid the foundations of trust and improved morale. It took a bit longer to bed everything down. Within weeks, I was asked if I would take on leading the entire digital team.

It was here, many years later, I got a sense of how unusual it was for such a tiny team to do so much.

It was here, many years later, I got a sense of how it is to have guardrails, to have support, peers to lean on.

It was here, for the first time I realised that the job didn’t have to be a lonely one.

Whatcha Thinking?

Mon, 13 Apr 2026 12:05:48 +0100

I loved working on megabus. I was in love with it. My girlfriend at the time had a habit of asking what I was thinking about when I looked deep in thought. The answer - every single time, was inevitably megabus. She eventually stopped asking.

I was 22 years old.

When I built the original prototype for megabus.com, I built it using PHP + PostgreSQL. I put together a document detailing my reasoning for these choices. I quoted 33 days for it, built it over six weeks and charged £13,200.

The support contract was £350/month - for one day a month. On the first day, megabus.com sold 200 orders.

When megabus had its first expansion, I was up overnight bringing new servers online and scaling it live. I loved it - my code was finally being tested.

Over a week, I’d probably burned through many days of effort. I remember the project manager specifically asking me to invoice for the extra work I put into it. I even said that I would - except I didn’t.

I’ve had a long time to think about this - why did I not send that invoice? I even had approval.

The answer, as with most things of this nature, is complicated. I loved the work and I didn’t want it to end. I didn’t want a potential conflict trying to figure out what a reasonable amount was to charge. I felt that I should have done a better job in the first place - I felt responsible that I had not told them that scaling of this nature would not have worked without prep work.

I had not scaled anything before.

I was 22 years old.

I was super grateful that someone believed in me. I naively assumed that they saw all the extra effort I was putting in and that they would reward me for it - that they would have my back.

I remember adding a bunch of different bits of functionality because I wanted it there. I didn’t want to go through the process of quoting for it, and it getting potentially rejected, not to mention the waiting for decisions. One key bit of functionality I remember is adding in a percentage load column for the loading sheets. I built it, showed it - they loved it! It went live. I did not charge for it.

At this point, the vast majority of my time was spent on megabus - very little of it actually paid for.

At a glance, based on the emails sent, I probably spent a minimum of 10 days each month supporting megabus when I was charging for one day.

In Jan 2004 - I proposed doubling the contract to two days for £4,800/year. It probably kicked in in Feb 2004. By March 2004, the site exceeded that revenue each day.

In the following months, I probably spent, on average, at least double the time I was paid for. I should have charged for it.

I grew the team, and the support contract based on the minimum I needed to maintain the product - not based on the amount of time I was spending.

For my 28th birthday, my girlfriend at the time organised a cake which was an image representing kraya - which was basically megabus. I felt bad that she thought that kraya was the most important thing in my life - she was right - but it still felt bad. kraya had other clients at the time, but my time wasn’t monopolised by other clients, or indeed by kraya - my heart still belonged to megabus.

And it would all have been fine too, except for a grave miscalculation I made.

In 2010, after trying to rebuild the ticketing system for £500k, and making some mistakes with people I trusted, kraya ended up £150k in the hole. We needed some money urgently.

I was desperate and, naively, I reached out to Stagecoach for help. I thought they were my friend - that they would have my back.

They understandably lost a great deal of trust in my ability to manage and lead my company. I trusted the wrong person - but that was still my mistake. They were right.

I thought that I’d built up enough goodwill that they would help me through this. I’d felt I would have way more than that “in the bank” in terms of goodwill. That dark afternoon, standing outside my office on the phone, in the rain, I learned that professional relationships do not work that way.

They didn’t make my life easier. Instead, I’d ended up rattling the cage - they were now panicked - realising their over-reliance on an organisation that could disappear at any point.

Instead of support, I had further actions, renegotiating the contract and what felt like punitive, and definitely invasive reporting obligations.

I was hurt and angry. I had poured my heart, my soul - hey, my very life into this product that I loved.

Suffice it to say - I got no help - no loan, no offer of investment - though they did suggest buying us outright - which I rejected.

I signed a contract under circumstances I would not wish on anyone.

The best I got from them was a challenge - if we were really spending more time than we were charging for - prove it. I did! We documented every minute we were spending - I wasted my time on spreadsheets, pointless meetings and work to try and rebuild the broken trust.

We went from £300k in the hole to £200k profit within a year. For a whole year of support, we charged around 20% of what the system made in a day.

I was 28 years old.

Around the same time, I was also dealing with the operational aftermath of trying to build a Java EE ticketing system over six months for £500k. I thought it would take a year and cost £1m. In hindsight, it needed two years and probably three million pounds.

Over 18 months, I personally answered over 150 out of hours emergency calls. We had a rota and others on call too - but I took the vast majority of these calls. I felt bad putting others through what I knew was gruelling.

All of this led me down a narrower and narrower path to a serious breakdown - though I didn’t know enough to name it until many years later. All I knew - all I felt was that something broke in me.

We managed to resolve all of the issues, but the deployment of that version kept getting pushed.

Stagecoach cancelled the contract in 2012. They had started building a ticketing system in-house two years prior - the cost of my grave mistake. I wasn’t able to make the meeting - I was in India, and at the same time as the meeting, I was meeting for the first time the one who is now my wife.

I was 28 years old. I spent the next 15 years putting myself back together.

How much did it cost them to build it in-house? If I had charged for my time from the start, would we all have been better off?

I still feel something deep inside me every time I see a megabus - a sense of pride mixed in with a deep sense of sadness - not for what I lost - but for what could have been.

I am 44 years old, and I am starting again.

Priced In

Tue, 07 Apr 2026 09:55:24 +0100

Two ticketing systems. Same client. Same payment provider. We were moving fast — that was the explicit choice, theirs and mine. The kind of fast where you know something will go wrong eventually and you price it in rather than try to prevent it.

We’d done the sensible thing and shared the payment code between them — DRY, less surface area for error, obvious call.

Then the larger system needed PostAuth. We updated the code, added a scheduled task to catch anything the non-deterministic bits missed, moved on.

A few months later: why has no money come through on the smaller system?

We’d ported the PostAuth flow across when we updated the shared code. We hadn’t added the scheduled task. The payment provider, chosen for cheap and cheerful rather than reliability, failed silently rather than erroring. The accounting department, running at the same pace as everyone else, hadn’t caught the gap.

Four separate things had to go wrong simultaneously. Any one of them holding would have meant no loss at all.

The client lost money. Not a catastrophic amount, but real money. I braced for the call.

Try and let me know the next time you decide to run a sale.

He already knew the cost. He’d known before the mistake happened.

I Know People Like You

Tue, 31 Mar 2026 10:41:29 +0100

A few years ago, I was interviewed for a role. I was talking about a ticketing system I’d built - originally in Spring, then rewritten to use EJB 3.2. The interviewer didn’t look impressed.

The team had already written a lot of stuff in Spring - but I really did not like it. There was all this XML all over the place which was annoying, but what I really didn’t like was that the code and configuration for each component was spread out all over the place. It meant that to understand how something worked, I had to go hunting. Eventually, I got sick of it, and ported it to EJB myself.

Later in the same interview, he said: “I know people like you - you come in, shake things up and get things done - but that’s not what I’m looking for.”

He was right. I understand that. But I’ve been thinking about what he was actually describing.

When megabus.com was still a PHP site, search was the problem. It returned quickly when the database was healthy and slowed to a crawl when the database was struggling. The load came in spikes. Even within a minute, there were peaks and troughs.

My fix was simple. Before the search query ran, I added a small SQL check: how many queries are currently active on the database server? If too many, wait a second and try again. A few retries, then send it anyway.
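The check-and-wait logic can be sketched roughly as below. This is an illustration in Python rather than the original PHP, and every name in it (`wait_for_capacity`, `count_active_queries`, `max_active`, `max_retries`, `delay_seconds`) is hypothetical, as are the threshold values. The load check is passed in as a function because the exact SQL used is not given here.

```python
import time
from typing import Callable


def wait_for_capacity(
    count_active_queries: Callable[[], int],
    max_active: int = 20,
    max_retries: int = 3,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Pause while the database looks too busy, up to max_retries times.

    Returns the number of pauses taken. Once the retries are used up the
    caller proceeds anyway, so a search is delayed but never refused.
    """
    waits = 0
    while waits < max_retries and count_active_queries() > max_active:
        sleep(delay_seconds)
        waits += 1
    return waits


def throttled_search(run_search, count_active_queries, **limits):
    # Hypothetical wrapper: check the load first, then always run the query.
    wait_for_capacity(count_active_queries, **limits)
    return run_search()
```

Sending the query anyway after the last retry is what keeps this a brake rather than a gate: under sustained load the worst case is a few extra seconds per search, not a failed one.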

A rate limiter baked into a search algorithm, written live on a production server.

There were edge cases to consider, not to mention the load the rate limiter would add to the database server. I knew, though, that if it broke, I could fix it - I could just remove it - live, if needed. At the time, not having the rate limiter was more expensive than having it.

It worked. It got us through more than one hump.

The database was still the ceiling. We were on PostgreSQL 7 - no replication support. Getting a more powerful server was possible but disproportionately expensive. So I built something.

Two database servers. All writes went to both. Reads were distributed randomly between them. Everything funnelled through one section of code.
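A minimal sketch of that single section of code, assuming each server object exposes an `execute(statement, params)` method. The class names `DualWriteRouter` and `RecordingServer` are made up for illustration, and the original was PHP rather than Python.

```python
import random
from typing import Any, Callable, Sequence


class RecordingServer:
    """Stand-in for a database connection; it only records what it was sent."""

    def __init__(self, name: str):
        self.name = name
        self.log = []

    def execute(self, statement: str, params: tuple = ()):
        self.log.append((statement, params))
        return self.name


class DualWriteRouter:
    """Send every write to all servers and each read to one chosen at random."""

    def __init__(self, servers: Sequence[Any], choose: Callable = random.choice):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._choose = choose

    def write(self, statement: str, params: tuple = ()) -> None:
        # Apply the same statement to every server, always in the same order.
        for server in self._servers:
            server.execute(statement, params)

    def read(self, query: str, params: tuple = ()):
        # Any server can answer, since all of them received every write.
        return self._choose(self._servers).execute(query, params)
```

The sketch does nothing about a write that succeeds on one server and fails on the other. That is exactly the divergence case, and it is left to the operator.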

I didn’t do this live. I tested it. I knew what failure looked like: if the servers diverged badly enough, I’d pick a primary and reset the other. That was the contingency. It wasn’t a safety net someone else would pull - it was mine.

The data integrity held better than I expected. Under very high load there were edge cases - ticket IDs for the same customer could be in a different order across the two servers on a return purchase - but because the IDs were consistent within each server, it never caused a real problem. It held the fort until I could replace it with something better.

I occasionally lie awake at night imagining the databases diverging and figuring out how I would fix it.

I picked PostgreSQL over MySQL when MySQL was the obvious choice. Under heavy load it stays up — it slows to a crawl, but it keeps going. And it had transactions. I was building an ecommerce site; I needed transaction support. MySQL was fast and popular. It also had a habit of giving up under sustained load. I still pick PostgreSQL - but nowadays, so do most other people.

The thing these decisions had in common was that I was the person who’d be fixing them at 3am if they went wrong. When you’re personally accountable for the consequences, the risk calculus changes. You think harder about what failure looks like. You build the contingency before you go live. You know which direction to pull if it goes sideways.

Caution that’s never personally tested isn’t rigour. It’s consequence-avoidance dressed up as responsibility.

“I know people like you - you come in, shake things up and get things done - but that’s not what I’m looking for.”

It Gets Everywhere

Tue, 24 Mar 2026 20:52:21 +0000

In 1999, I was building websites in ASP (before there was .NET) and MSSQL Server. We had a Windows NT server that I had to restart every week — not because of updates, because it would get slower and slower until a restart was the only thing that would fix it.

We had one ADSL connection coming into the office and three of us. I wanted to share the internet. Windows NT didn’t support it cleanly — it had a way, but it was clunky enough that no internet was arguably better. We’d paid hundreds of pounds for it.

I’d heard about Linux. Downloaded Red Hat, installed it, configured it for NAT. It worked — it was like magic. I’m pretty sure I had to recompile the kernel to get some bits working, but there were instructions and they were honest. It did what it said.

Here was software that was completely free — free enough that I could read the source code, make changes, run it however I wanted. It did more than the hundreds of pounds worth of garbage sitting on the desk. And once I set it up, I never had to restart it. Never. Compared to once a week on the NT box.

The difference, in my mind, was simple. Linux was built responsibly. NT was built as a money-making enterprise.

That held for a long time. I moved to Debian, then celebrated when Ubuntu arrived and made things more accessible. I’ve recently been able to abandon Windows altogether — gaming on Linux is finally viable. I came back full time and felt mostly at home.

But there were minor niggles. Things that felt slightly off but that I couldn’t quite name.

Then I started digging into systemd.

I remembered feeling odd about having to run specific commands to read logs. Odd about one tool doing many different things — which ran contrary to the Unix philosophy that had made Linux what it was. When I looked into the history of the opposition to systemd, it was revelatory.

systemd becoming process 1 is, in a word, irresponsible. It makes everything easier and more accessible, which is why it won. But unlike the Linux of old, the tradeoff isn’t visible upfront, and there’s no real choice. The responsible option isn’t the default anymore — it’s the thing you have to go looking for.

I thought I had already done the work. I thought I had found the alternative.

While I was celebrating Linux becoming mainstream, I hadn’t considered what it would cost.

The Linux ecosystem had started optimising for mainstream at the expense of responsibility. It works now, for far more people. But it’s a different thing than it was. When Linux was really taking off, there was a joke going around (before memes were called memes) about Microsoft Linux. Turns out the joke was on us!

It is always a tradeoff between security and convenience — something convenient is rarely secure, and vice versa. I think something similar applies to responsibility. The more accessible you make something, the harder it becomes to hold the line on what it was built to do.

There was a time when software going wrong meant losing your work. Now it means losing your money, your reputation, or — in a car, in a hospital — your life.

The context has changed. The attitudes haven’t. And the places that once had better attitudes — the ones built on responsibility, on craft, on caring about the thing itself — are being pulled in the same direction. It gets everywhere.

Do you want your car running Windows? What about systemd?

Even Light Gets Heavier

Tue, 24 Mar 2026 10:56:05 +0000

A dedicated input type is better than reusing your domain model at the API boundary. Test layers matter. Writing log statements as you go saves the poor soul (probably you) debugging blind at 10pm. You know all of this.

This isn’t about any of that.

It’s about the fact that none of those decisions show up in the metrics that matter to the people making hiring and delivery calls. The cost is immediate and visible. The return is delayed, quiet, and arrives in the form of things that didn’t happen — the investigation that took two hours instead of two days, the API change that didn’t bleed into the domain model, the bug that the structure caught before it shipped.

Sprint velocity captures the extra day. It doesn’t capture what that day bought.

This is not a new problem. Most engineers who’ve been around long enough have felt it from both sides - made the careful call and got measured on the slowness, or inherited the codebase built entirely for speed and paid the tax. The measurement system was already broken. It has been rewarding the appearance of velocity over the thing velocity is supposed to serve.

This was true long before anyone was generating code with AI. The PR process in a lot of teams was already largely theatrical — review comments on naming conventions while the architectural decisions slipped through unquestioned, approvals given because the diff was too large to meaningfully read. The gate was already not doing much. We brushed it under the carpet and moved on.

AI tooling is changing the volume of code moving through that process by an order of magnitude. The pressure to remove the gate entirely - to trust the output, to ship faster - is only growing. The faster-is-better incentive that was already making review ineffective is about to be handed a much larger surface to work on.

Many years ago, I pitched a full redevelopment of a ticketing system, from a PHP-based system to a Java EE system, because it was struggling to scale.

It probably needed a couple of years to build. They wanted it in six months. I accepted the challenge.

We built and deployed the system in eight months. We spent the next year fixing it.

The client then rebuilt it in-house.

When AI runs this experiment at scale, who takes it back?

We Optimised Ourselves to Death

Wed, 11 Feb 2026 09:55:18 +0000

I once worked on a gaming website.

It collected structured metadata about games - tags for features, screenshots, videos, reviews. Users contributed information. We gamified participation and rewarded it with games and gifts.

It started making money through “similar games” lists.

All of our traffic came from Google.

Then we needed more revenue.

So we did what teams do.

We added features.
Integrated Steam, Xbox and PSN.
Pulled in achievements.
Expanded recommendation lists.
Tweaked advertising.
Worked on SEO.

Traffic crept upward.

Still not enough.

Eventually we decided the problem was perception.

The site looked too much like a community project. It needed to feel more premium. More authoritative. More modern.

So we renamed it.
Changed the domain.
Redesigned it from the ground up.

Months of work.

We launched.

Traffic collapsed.

We never recovered.

In hindsight, the failure wasn’t technical.

It wasn’t branding.

It wasn’t SEO.

It was that we never made a hard decision about what the product actually was.

Was it:

A participatory community?
A structured data engine?
A search destination?
A content property optimised for Google?
A recommendations platform?

It was all of them.

Weakly.

What Google valued wasn’t polish. It valued volatility.

Our homepage changed many times a day because users were contributing.
Those contributions created fresh internal links, fresh content, fresh signals.

Participation was the engine.

When we redesigned for the information consumer instead of the contributor, we stabilised the surface.

We accidentally killed the engine.

We optimised the visible layer and ignored the system feeding it.

I first heard the phrase “we’ll fix it in post” from my filmmaker brother.

Something wasn’t quite right during filming, but they moved on anyway. It could be corrected later.

In film, that’s sometimes true.

In product development, it’s usually self-deception.

Lean encourages delaying decisions to the last responsible moment.
That’s discipline.

What most teams practice is delaying decisions until they become painful.
That’s avoidance.

An MVP is not the smallest thing you can push out.
It is the smallest thing that is coherent and viable.

Viable means it has a clear shape.
It respects constraints.
It closes more questions than it opens.

If you ship something that only works on the happy path, with undefined edges and postponed trade-offs, you haven’t preserved optionality.

You’ve preserved ambiguity.

Ambiguity spreads.

In code, as defensive layers.
In design, as half-committed patterns.
In product, as multiple possible futures carried at once.

Teams don’t slow down because they’re weak.
They slow down because no one chose.

Every postponed constraint becomes cognitive load.
Every “temporary” rule becomes precedent.

Lean does not say “don’t decide.”

It says: decide at the point where delaying further increases cost.

Most teams drift past that point because deciding feels like loss.

Loss of flexibility.
Loss of imagined futures.
Loss of political safety.

But momentum comes from commitment.

Once something is decided, energy frees up.
The system becomes legible.
Subsequent decisions compound instead of conflict.

We didn’t fail because we built the wrong feature.

We failed because we never chose what we were.

Most startups don’t die from lack of effort.

They die from unmade decisions.

“We’ll fix it later” is not iteration.

It is hope disguised as strategy.

Microservices vs Monolith: Real World Tradeoffs

Wed, 17 Jul 2024 09:48:25 +0100

When starting a new backend system for a contract I was on, one of the early decisions I had to make was whether to lean into a monolith or adopt a microservices approach. While common wisdom offers strong opinions on both ends of the spectrum, in reality, the choice often hinges on organizational constraints as much as on technical purity.

Reactive vs Traditional Spring Web

I began by reviewing performance comparisons between Spring MVC and WebFlux. Reactive Web generally comes out ahead in benchmarks, but that doesn’t tell the whole story.

In our use case—web notifications—the benefit of reactive patterns depends heavily on how data is delivered. If we were polling, the advantage would be limited. However, with Server-Sent Events (SSE), Spring’s support aligns directly with Reactive Web, making WebFlux the more appropriate choice for this part of the system.

The Deployment Constraint

Ideally, I would have started with a monolith: a single deployable artifact combining both the Kafka Streams logic and the API. This option would have simplified initial development and allowed us to iterate quickly. But at the client, the platform does not allow deploying a Kafka Streams app and an API within the same Kubernetes deployment.

This effectively rules out a true monolith, even for a prototype.

Options Considered

Shared Library with Thin Deployments

A middle ground was to build the core logic in a shared library and have lightweight deployments wrap around it. This would allow the streams app and the API to share code without needing to make HTTP calls between them.

The downside: these services are no longer independently deployable. But given our team size and velocity goals, this compromise might be acceptable.

Full Microservices

Another option was to separate the services entirely:

Streams service (Kafka, plus domain-specific logic)
Web API (for delivering notifications)
Subscription API (managing notification subscriptions)

This adheres more closely to the single responsibility principle, especially as we move from PoC to MVP. However, it adds deployment and coordination overhead.

Application Profiles

A third, hackier option was to control which parts of the app run using environment-based profiles. For example, we could disable Kafka in dev or use conditional beans to keep deployments clean. While not ideal long-term, it offers flexibility for early stages.

Conclusion

Constraints matter. While I lean toward monoliths for rapid delivery in small teams, platform limitations forced a hybrid approach. We intend to evolve into microservices over time, but only when the benefits clearly outweigh the cost.

Have you faced similar deployment constraints that shaped your architecture? I’d love to hear how you navigated them.

PostgreSQL performing huge updates

Sun, 06 Nov 2011 12:45:41 +0000

PostgreSQL is a pretty powerful database server and will work with almost any settings thrown at it. It is really good at making do with what it has and performing as it is asked.

We recently found this as we were trying to update every row in a table that had over eight million entries. We found in the first few tries that the update was taking over 24 hours to complete, which was far too long for an update script.

Our investigation of this led us to the pgsql_tmp folder and the work_mem configuration parameter.

While the query was executing, we checked the pgsql_tmp folder to see how space was being utilised in there. We already knew about the pgsql_tmp folder from past experience: we once had a server that was rapidly running out of disk space, and we had narrowed it down to this folder. By cancelling the query referenced by the tmp files in there, we were able to free up literally gigabytes of disk space...

We had found roughly half a gig of temporary files in here. This led us to investigate the configuration file.

The one parameter that stuck out was work_mem, which was set to the default of 1MB. I guess that might make sense under most circumstances, but not in this one. According to the PostgreSQL documentation:

work_mem (integer)

Specifies the amount of memory to be used by internal sort operations and hash tables before switching to temporary disk files. The value defaults to one megabyte (1MB). Note that for a complex query, several sort or hash operations might be running in parallel; each one will be allowed to use as much memory as this value specifies before it starts to put data into temporary files. Also, several running sessions could be doing such operations concurrently. So the total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value. Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of IN subqueries.

This would tell us that the total memory usage with work_mem could be several times the value set here and setting it to half a gig would probably be a terrible idea for a heavily utilised production server. However, for the migration process when we need to update over 8,000,000 rows, it might be a good temporary fix.
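To make the "several times the value" point concrete, here is a rough worst-case calculation. The function name and the session and operation counts are made-up figures for illustration, not measurements from our servers.

```python
def worst_case_work_mem_mb(work_mem_mb: int, operations_per_query: int, concurrent_sessions: int) -> int:
    """Upper bound, in MB, on memory that sort and hash operations may claim."""
    return work_mem_mb * operations_per_query * concurrent_sessions


# One migration session running a query with two sort/hash steps.
migration = worst_case_work_mem_mb(512, operations_per_query=2, concurrent_sessions=1)

# The same setting on a busy server: 50 sessions, two sort/hash steps each.
busy_server = worst_case_work_mem_mb(512, operations_per_query=2, concurrent_sessions=50)

print(migration, busy_server)  # 1024 51200
```

A gigabyte for a one-off migration is tolerable; fifty gigabytes across a busy server is not. PostgreSQL also allows work_mem to be changed for a single session with SET, which would keep the larger value away from other connections.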

After updating the work_mem to 512MB, we found that no more tmp files were created and the whole thing was done in memory.

When updating so many rows, there are a few other things to consider.

Firstly, autovacuum will likely kick in several times to vacuum the table. You'll probably want to disable this for the duration of the update statement and run a vacuum afterwards.

```sql
-- disable autovacuum
ALTER TABLE sometable SET (
  autovacuum_enabled = false,
  toast.autovacuum_enabled = false
);
```

You can switch autovacuum back on after the update statement has completed:

```sql
-- enable autovacuum
ALTER TABLE sometable SET (
  autovacuum_enabled = true,
  toast.autovacuum_enabled = true
);
```

A few other things you will want to take a look at are:

fsync: I usually have this set to off anyway since the servers are practically fully redundant
checkpoint_segments: I changed this to roughly 5 times the original value (check the log to see if it says that it is checkpointing too often)
checkpoint_completion_target: I changed this to 0.9

With all of these updates, we were able to bring the total time of the update down to a few hours.

Tracking progress of an update statement

Wed, 02 Nov 2011 19:59:02 +0000

Sometimes there is a need to execute a long-running update statement. This update statement might be modifying millions of rows, as was the case when we went hunting for a way to track the progress of the update. Hunting around took us to http://archives.postgresql.org/pgsql-admin/2002-07/msg00286.php

In our particular case, we are using PostgreSQL, but this should work with any database server that provides sequences. Our original SQL was of the form:

```sql
update only table1 t1
set amount = t2.price
from table2 t2
where t1.id = t2.id;
```

There is of course no way of figuring out how many rows have been updated already. The first step was to create a sequence:

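```sql
-- Not TEMPORARY: a temporary sequence is visible only to the session that created it.
CREATE SEQUENCE seq_progress START 1;
```

Because this is a regular sequence, drop it with DROP SEQUENCE seq_progress once the update has finished.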

We can then use this sequence in the update statement to ensure that each row updated also increments the sequence:

```sql
update only table1 t1
set amount = t2.price
from table2 t2
where nextval('seq_progress') != 0
  and t1.id = t2.id;
```

Once the query is running, you can open another connection to the database. To get an indication of how far it has got, you can just run the following:

```sql
select nextval('seq_progress');
```

Bear in mind that this will also increment it by 1, but if you have millions of rows, which is really the only case in which this would be useful, a few additional increments are hardly going to make a difference.
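Two readings of the sequence taken some time apart are enough for a rough completion estimate. The sketch below is one way to do that arithmetic; the function name and the figures are hypothetical, and it assumes you know roughly how many rows the update will touch. Treat the result as an approximation, since the sequence may not advance exactly once per updated row.

```python
def estimate_progress(first_value, second_value, seconds_between, total_rows):
    """Estimate fraction complete and seconds remaining from two sequence readings."""
    if seconds_between <= 0:
        raise ValueError("seconds_between must be positive")
    rows_per_second = (second_value - first_value) / seconds_between
    fraction_done = min(second_value / total_rows, 1.0)
    remaining_rows = max(total_rows - second_value, 0)
    # With no movement between readings there is no basis for an estimate.
    seconds_left = remaining_rows / rows_per_second if rows_per_second > 0 else None
    return fraction_done, seconds_left


# Made-up readings: 2.0M, then 2.6M a minute later, out of 8M rows.
done, eta = estimate_progress(2_000_000, 2_600_000, 60, 8_000_000)
print(f"{done:.1%} done, about {eta / 60:.0f} minutes left")
# prints: 32.5% done, about 9 minutes left
```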

Good luck and have fun!

Java Object Size In Memory

Mon, 25 Apr 2011 15:58:00 +0000

Anyone who has worked with Java in a high-end application will be well aware of the double-edged sword that is Java garbage collection. When it works, it is awesome, but when it doesn’t, it is an absolute nightmare. We work on a ticketing system where it is imperative that the system is as near real-time as possible. The biggest issue that we have found is running out of memory in the JVM, which causes a stop-the-world garbage collection. This then results in cluster failures, since an individual node is inaccessible for long enough that it is kicked out of the cluster.

There are various ways to combat this issue, and the first instinct would be to suggest that there is a memory leak. After eliminating this as a possibility, the next challenge was to identify where the memory was being taken up. This took some time and effort, and the Hibernate second level cache was identified as the culprit. We were storing far too much in the second level cache.

This is another double-edged sword. The Hibernate second level cache is absolutely imperative to a high performance system. It does, however, come with a price. The cache needs to be managed carefully to ensure the right balance between performance and memory requirements.

To this end, it was important to be able to identify what was taking up all the memory in the cache. Each object might only take a couple of hundred bytes, but with our second level cache set to store hundreds of thousands of items, this quickly takes up hundreds of megabytes. With the metadata of the cache, this could easily hike it up near a gigabyte of memory usage. This gets substantially worse with cache evictions and the adding of new items into the cache.
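As a back-of-the-envelope check on those numbers, the arithmetic looks like this. The item count, the per-item size and the metadata multiplier are all assumed figures chosen for illustration, not measurements from our cache.

```python
def cache_footprint_mb(items: int, bytes_per_item: int, overhead_factor: float = 1.0) -> float:
    """Approximate cache size in megabytes (1 MB = 1024 * 1024 bytes)."""
    return items * bytes_per_item * overhead_factor / (1024 * 1024)


# Hypothetical figures: 800,000 cached items at roughly 400 bytes each.
payload_only = cache_footprint_mb(800_000, 400)

# Assume, purely for illustration, that cache metadata triples the footprint.
with_metadata = cache_footprint_mb(800_000, 400, overhead_factor=3.0)

print(round(payload_only), round(with_metadata))  # 305 916
```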

The correct way to resolve this is to identify specific object types that “overload” the cache, i.e. types that have a large number of instances stored in the cache. Identifying classes that store a large number of items is easy enough - we just traverse the cache and count up the number of items. However, there might be a class that stores a smaller number of items but takes a sizeable amount of memory. For this reason, it is important to understand the object sizes in memory as well.
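The tally itself is simple once you have a way to measure one object. The sketch below shows the shape of it in Python rather than Java: `summarise_by_type` and `heaviest_first` are illustrative names, and `size_of` is a placeholder for whatever measurement you trust (in the example it is just `len`).

```python
from collections import defaultdict
from typing import Callable, Iterable


def summarise_by_type(items: Iterable[object], size_of: Callable[[object], int]):
    """Return {type name: (instance count, total bytes)} for the given items."""
    totals = defaultdict(lambda: [0, 0])
    for item in items:
        entry = totals[type(item).__name__]
        entry[0] += 1
        entry[1] += size_of(item)
    return {name: (count, size) for name, (count, size) in totals.items()}


def heaviest_first(summary):
    # Sort by total bytes, so a type with few but large instances is not missed.
    return sorted(summary.items(), key=lambda kv: kv[1][1], reverse=True)


# Three tiny strings and one large bytes object, measured with len().
cached = ["a", "b", "c", b"x" * 1000]
print(heaviest_first(summarise_by_type(cached, size_of=len)))
# [('bytes', (1, 1000)), ('str', (3, 3))]
```

Sorting by count alone would put `str` first here; sorting by total size surfaces the single large object instead.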

If you have ever tried to find a way to identify object sizes, you will know that this is no easy task. You can calculate to some degree of accuracy the size of an object based on the data it stores but this is a manual process.

The only real way to get this information is to use a Java agent to calculate a more accurate memory usage. For this purpose, we used the classmexer agent, which requires a simple installation step: adding the parameter -javaagent:classmexer.jar to the java command. You can then figure out the memory utilisation of an object by calling:

```java
MemoryUtil.deepMemoryUsageOf(objectInstance)
```

You can also pass in a collection of objects:

```java
MemoryUtil.deepMemoryUsageOfAll(objectInstanceCollection)
```

This was the simple part.

Traversing the node structure of JBoss Cache and collating statistics on the number of each type of object and its memory utilisation was a little more interesting.

I will cover this separately.