<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>production on despatches</title><link>https://icle.es/tags/production/</link><description>Recent content in production on despatches</description><generator>Hugo</generator><language>en</language><atom:link href="https://icle.es/tags/production/index.xml" rel="self" type="application/rss+xml"/><item><title>What's a Circuit Breaker?</title><link>https://icle.es/2026/05/06/whats-a-circuit-breaker/</link><pubDate>Wed, 06 May 2026 10:53:27 +0100</pubDate><guid>https://icle.es/2026/05/06/whats-a-circuit-breaker/</guid><description>&lt;p>I&amp;rsquo;d not been interviewed very often. For my first job, sure, I sent around my CV
to a bunch of places and did a bunch of interviews.&lt;/p>
&lt;p>I then started my own company and ran that for nearly two decades, everything
largely self-taught. After that, because of the peculiar shape of my CV, and
&amp;ldquo;falling between a lot of stools&amp;rdquo; as a friend pointed out, interviews were&amp;hellip;
tricky!&lt;/p>
&lt;p>I remember one such interview a few years ago. The interviewer was good at defusing any
anxiety and it was a pretty relaxed interview. He wanted to know if I knew
architectural design patterns - a term I hadn&amp;rsquo;t heard before. He gave me
examples: circuit breaker, pub / sub and a third one I can&amp;rsquo;t remember anymore. I
recognised pub / sub and the other one I think - but was not aware of circuit
breaker. He explained it to me - something about reducing the risk of a
downstream service failing.&lt;/p></description><content:encoded><![CDATA[<p>I&rsquo;d not been interviewed very often. For my first job, sure, I sent around my CV
to a bunch of places and did a bunch of interviews.</p>
<p>I then started my own company and ran that for nearly two decades, everything
largely self-taught. After that, because of the peculiar shape of my CV, and
&ldquo;falling between a lot of stools&rdquo; as a friend pointed out, interviews were&hellip;
tricky!</p>
<p>I remember one such interview a few years ago. The interviewer was good at defusing any
anxiety and it was a pretty relaxed interview. He wanted to know if I knew
architectural design patterns - a term I hadn&rsquo;t heard before. He gave me
examples: circuit breaker, pub / sub and a third one I can&rsquo;t remember anymore. I
recognised pub / sub and the other one I think - but was not aware of circuit
breaker. He explained it to me - something about reducing the risk of a
downstream service failing.</p>
<p>It made sense - but I couldn&rsquo;t remember any instance where I&rsquo;d done that and I
said as much.</p>
<p>No big deal! We moved on.</p>
<p>As you may have noticed from my recent flurry of posts, I&rsquo;ve been going through
a bit of an excavation of all things kraya, and discovering little nuggets that
I&rsquo;d forgotten all about.</p>
<p>Claude Code went through all of my emails, all the code repositories and put
together interesting things I had done. The purpose was mainly to pick out
things that could be good blog posts, stories, or just reminders.</p>
<p>Claude had found two circuit breakers.</p>
<p>The first one was built by the seat of my pants in 2004. There was an active
marketing campaign on megabus and the system was struggling. I&rsquo;d plumbed the
depths of the tech stack - the web servers, the database server - to get every
last ounce of performance out of the system.</p>
<p>What I really needed was to slow down the deluge of people coming into the
site - just a little bit. I needed to prevent the snowball effect, and I tried a
simple way to achieve it. I wrote the following around my 23rd birthday.</p>
```php
$timestamp = time() - 120;
$sTimestamp = date("d-F-Y H:i");
$sql = "SELECT count(*) FROM tSearches WHERE Script_Start > '$sTimestamp' AND Script_End IS NULL;";

$numSearches = $dbh->getOne($sql);
if ($numSearches > 10)
    sleep(5);

$numSearches = $dbh->getOne($sql);
if ($numSearches > 10)
    sleep(5);

$numSearches = $dbh->getOne($sql);
if ($numSearches > 10)
    sleep(5);
```
<p>(The eagle-eyed among you might notice a bug in the above code. It wasn&rsquo;t
resolved for years. Coding live on a production system tends to create bugs. See
if you can find it before reading on.)</p>
<p>(Regular readers may also remember a part of this story from
<a href="https://icle.es/simple-wins.md">I Know People Like You</a>)</p>
<p>The system already tracked searches - so I just needed to check it.</p>
<p>This bit of code was meant to check how many of the searches that started in the last
two minutes were still incomplete. Except that&rsquo;s not what it did - I forgot to
actually use <code>$timestamp</code>, so the cutoff was the start of the current minute rather than two minutes ago.</p>
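<p>A minimal sketch of the intended cutoff, for comparison. It assumes the same
<code>tSearches</code> table and date format as the snippet above, and PHP 7.0 or later
for the type declarations. <code>searchCutoff</code> is a hypothetical helper name, not
something from the original codebase. The only real change is passing the
adjusted timestamp as the second argument to <code>date()</code>.</p>

```php
// Hypothetical helper: format the start of the look-back window.
// date() formats the current time when it is called with one argument,
// which is why the original only ever looked at the current minute.
function searchCutoff(int $now, int $windowSeconds = 120): string
{
    return date("d-F-Y H:i", $now - $windowSeconds);
}

$sTimestamp = searchCutoff(time());
$sql = "SELECT count(*) FROM tSearches WHERE Script_Start > '$sTimestamp' AND Script_End IS NULL;";
```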
<p>I considered failing and letting the user retry manually - but I didn&rsquo;t want to
force the user to take action unless absolutely necessary. This one waited up to
15 seconds and then let the search happen anyway.</p>
<p>If the user was still waiting after 15 seconds, might as well try and do a
search and see what happens. In hindsight, it might have been better to fail at
that point. If there were too many searches after 15 seconds, the database
server was likely already snowballing.</p>
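<p>The fail-at-that-point alternative could look something like the sketch below.
It is an illustration in modern PHP (7.0 or later), not code from the megabus
system: <code>waitForCapacity</code> and its parameters are hypothetical names, and the
limit of 10 and the three five-second waits are carried over from the snippet
above. The counter and the pause are passed in as callables so the logic can be
exercised without a database.</p>

```php
// Hypothetical helper, not the original code: wait for the backlog to clear
// and report failure if it never does.
function waitForCapacity(
    callable $countInFlight,   // returns the number of unfinished searches
    callable $pause,           // called with the number of seconds to wait
    int $limit = 10,
    int $attempts = 3,
    int $pauseSeconds = 5
): bool {
    for ($i = 0; $i !== $attempts; $i++) {
        if ($countInFlight() > $limit) {
            $pause($pauseSeconds);   // still busy: back off and check again
            continue;
        }
        return true;                 // backlog has cleared: let the search run
    }
    // Still overloaded after every wait: refuse instead of searching anyway.
    return false;
}
```

<p>Wired into the original, the counter would be a closure around
<code>$dbh-&gt;getOne($sql)</code>, the pause would be <code>'sleep'</code>, and a
<code>false</code> result would mean showing a &ldquo;busy, please try again&rdquo; page
instead of running the search.</p>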
<p>This bit of code survived through to the end of the PHP codebase apart from
minor tweaks. The Java version had layers of circuit breakers: session limits,
rate limiters and thread pool configuration for scaling up. None of it, though,
was called a circuit breaker - not by me or by the team.</p>
<p>So, have I built a circuit breaker? Thinking back, what threw me off was the
wording. I had the feeling that the circuit breaker would &ldquo;protect&rdquo; another
service.</p>
<p>In my mind, the database wasn&rsquo;t another service - it was a part of the same
service.</p>
]]></content:encoded></item><item><title>Holding the Fort</title><link>https://icle.es/2026/04/15/holding-the-fort/</link><pubDate>Wed, 15 Apr 2026 15:26:10 +0100</pubDate><guid>https://icle.es/2026/04/15/holding-the-fort/</guid><description>&lt;p>For many years, I loved my job. I was working on a production system that saw
tens of thousands of orders across the world.&lt;/p>
&lt;p>By the time it was 2015, I did not.&lt;/p>
&lt;p>I had poured blood, sweat and tears into a ticketing system that kraya built —
first for megabus, then for Polskibus. It had broken me, and we were now just
limping along.&lt;/p>
&lt;p>By 2015, we were spending 101–286 hours each month on a support contract that
paid for 60. I raised this with the client and suggested a minimum of a 100%
increase. They refused. Instead, they minimised their requests to just about 80
hours each month. Without the development work, I was now making a loss each
month providing the resources for this support contract.&lt;/p></description><content:encoded><![CDATA[<p>For many years, I loved my job. I was working on a production system that saw
tens of thousands of orders across the world.</p>
<p>By the time it was 2015, I did not.</p>
<p>I had poured blood, sweat and tears into a ticketing system that kraya built —
first for megabus, then for Polskibus. It had broken me, and we were now just
limping along.</p>
<p>By 2015, we were spending 101–286 hours each month on a support contract that
paid for 60. I raised this with the client and suggested a minimum of a 100%
increase. They refused. Instead, they minimised their requests to just about 80
hours each month. Without the development work, I was now making a loss each
month providing the resources for this support contract.</p>
<p>I could cancel the contract, but that would involve letting them have the source
code, which was otherwise kraya&rsquo;s property. There was a six month notice period
— six months of providing all the support they need to run and to migrate the
system to their team.</p>
<p>At some point, I went from taking on the challenge to holding the fort.</p>
<p>I could see that any effort to make things easier would backfire. I built them a
feature for free — they complain about one bug in it — and they need it fixed
urgently. I loosen the rules and deploy an additional server just before the
weekend, letting them know about the risks. Something goes wrong and they are
surprised.</p>
<p>The hardest thing for me to change was the belief that the client&rsquo;s wins were my
wins. If anything, the client&rsquo;s wins were my loss. It meant more unpaid work in
the support contract. It meant more complex systems to support and maintain —
while every penny was being questioned.</p>
<p>I started to say no. No, we will not deploy new servers on a Thursday because
the weekend is too close. No, we will not reduce the QA time on this bit of
functionality you want. No, we will not restore reports which could display
erroneous data.</p>
<p>It all came to a head in one specific instance where the CEO insisted on
speaking to me because they needed something deployed urgently. They needed us
to work through the evening or the weekend. I said no. They offered to pay
double. I said no. They were not happy.</p>
<p>A few months later, they cancelled the contract.</p>
<p>I remember the meeting. It was amicable and friendly. They asked if I could keep
the staff on till the end of the year - they wanted backup in case anything went
wrong during the migration.</p>
<p>I wouldn&rsquo;t want to let them go before Christmas anyway.</p>
<p>I had a brief conversation with my brother to decide what to do next. I already
knew that there was nothing left to keep running. We kept it running for three
more months until the end of December.</p>
<p>I remember going into the main office, gathering everyone around and delivering
the news. Polskibus had cancelled. We are shutting kraya down. We&rsquo;ll keep going
until the end of December.</p>
<p>I remember the Christmas party. It had often ended up being drinks, and being
out all night. It didn&rsquo;t this year - we went home after the meal. The atmosphere
was one of sadness and relief. We&rsquo;d all been through the fires and made it out
the other end. It was over.</p>
<p>It was the end of what I&rsquo;d built over 15 years.</p>
<p>I felt nothing.</p>
]]></content:encoded></item><item><title>I Chose to Keep Going</title><link>https://icle.es/2026/04/14/i-chose-to-keep-going/</link><pubDate>Tue, 14 Apr 2026 20:52:10 +0100</pubDate><guid>https://icle.es/2026/04/14/i-chose-to-keep-going/</guid><description>&lt;p>In 2008, we all watched Pivotal crash and burn. They&amp;rsquo;d taken a year and nearly
£900k to build a new ticketing system for the fringe. On launch, they realised
that it could serve only one customer at a time.
&lt;a href="https://icle.es/saving-the-fringe.md">We built an interim ticketing system for them over the weekend&lt;/a>.&lt;/p>
&lt;p>It was time for kraya to take a leap. We should take the megabus.com ticketing
system to the next level and build a distributed ticketing system that could
potentially scale infinitely.&lt;/p></description><content:encoded><![CDATA[<p>In 2008, we all watched Pivotal crash and burn. They&rsquo;d taken a year and nearly
£900k to build a new ticketing system for the fringe. On launch, they realised
that it could serve only one customer at a time.
<a href="https://icle.es/saving-the-fringe.md">We built an interim ticketing system for them over the weekend</a>.</p>
<p>It was time for kraya to take a leap. We should take the megabus.com ticketing
system to the next level and build a distributed ticketing system that could
potentially scale infinitely.</p>
<p>I&rsquo;d picked JBoss because it was backed by Red Hat - it had all these features
and capabilities - a lot of which we needed. The other option we considered was
Glassfish, but it just didn&rsquo;t have the features we needed. There were other
options but that involved prohibitive licensing fees.</p>
<p>We costed it out at £650k and a year. It was unrealistic but I could not imagine
Stagecoach paying more, or giving us more time. They wanted it for £500k and in
six months. I should&rsquo;ve pushed back, but we&rsquo;d built a booking system over the
weekend, only a few months back - this should be possible, right?</p>
<p>It wasn&rsquo;t.</p>
<p>We got a year in the end — we didn&rsquo;t ask for it, and we didn&rsquo;t know until we
were most of the way down the path. When I heard about the delay it was a mix of
relief and regret. We&rsquo;d burned through most of the budget on getting in
contractors who were leaving imminently. We could have got fewer people on
board, but as permanent staff.</p>
<p>We launched to Canada first, and that wasn&rsquo;t too bad. Then it was the US, and
the nightmare started. The UK launched last on my birthday in 2010.</p>
<p>In the intervening years, we had a budget shortfall of £150k and because we&rsquo;d
rushed to build the product, we&rsquo;d cut corners, and all the contractors had left.
We even had to let go of some of the permanent staff. I asked Stagecoach if
they&rsquo;d fund us the extra £150k as we&rsquo;d asked at the start. They said no. They
renegotiated the contract. They demanded more oversight - and increased our
reporting obligations.</p>
<p>Load testing was already a part of our process and we tested the new system
under load. We identified issues and fixed them. On paper, it looked good.</p>
<p>It wasn&rsquo;t.</p>
<p>The problem wasn&rsquo;t load per se. It was the stability of the system. It struggled
to stay up for extended periods of time. The more nodes there were, the worse it
was.</p>
<p>It was exactly the kind of problem that was hard to replicate and hard to test.
The main option we had was to think through what all it could be and to take
stabs in the dark. I put together a hit list and worked through it methodically.
Convincing Stagecoach to spend the money was sometimes harder than solving the
problem.</p>
<p>Ultimately, the one thing that pushed us over the line in terms of stability was
staggered nightly automated restart of each node.</p>
<p>I thought I was done working with
<a href="https://icle.es/it-gets-everywhere.md">software that needed regular restarts to stay functional</a>
and at first resisted this. We are running enterprise level software, and that
too on Linux. If I was comfortable with such shenanigans, I might have stayed
with Windows. But we were running out of options. It was a hail Mary - it
worked.</p>
<p>I had been working with Linux and related software for years by that point. In
fact, I had servers that had not been restarted for literal years,
with services that were running just as long. A lot of these services didn&rsquo;t
even need a restart on config change - just a reload.</p>
<p>None of these services even had a paid tier. That should have been the clue.</p>
<p>I cannot imagine having to restart PostgreSQL nightly or even Apache. I still do
not understand how something slated for the enterprise market can have leaks
that would warrant a regular restart to keep it working.</p>
<p>Years later, when I looked up issues around JBoss, I realised that it was
notorious for a whole slew of problems with JGroups and clustering. I remember
scouring the internet for details on any kind of issues and coming up empty.</p>
<p>I&rsquo;d fallen into a marketing trap. JBoss was no Apache — it was the commercial
product of Red Hat. I thought all enterprise open source software would be of
the same calibre. It wasn&rsquo;t.</p>
<p>From what I understand, Stagecoach spent millions and maybe two years building
the whole ticketing system inhouse. kraya limped along for a few more years
before being shuttered.</p>
<p>The people involved, though, fortunately seem to have gotten through it largely
unscathed. It gives me a great deal of joy to see so many of the juniors I&rsquo;d
hired now CTOs, VPs and Directors.</p>
<p>I believed that every victory was ours but when something went wrong, it
belonged to me. After all, I made every choice. I chose to pursue the Java EE
ticketing system. I chose JBoss. I chose to keep going.</p>
<p>I was in a narrowing path, with fewer and fewer options.</p>
<p>I knew I was the only one holding that line - I didn&rsquo;t know that there was
another option.</p>
<p>Bit by bit, I lost all sense of what I was paying to fix these mistakes.</p>
]]></content:encoded></item><item><title>Always On</title><link>https://icle.es/2026/04/14/always-on/</link><pubDate>Tue, 14 Apr 2026 19:51:09 +0100</pubDate><guid>https://icle.es/2026/04/14/always-on/</guid><description>&lt;p>I knew as soon as my phone rang what it was about.&lt;/p>
&lt;p>It was the same every time. I would drag myself up to answer the phone - my
body, my mind screamed at me, but I had gotten good at overriding every instinct
through sheer willpower. I could hear the apologetic tone on the other side, and
I could recognise some of the voices after a while. I mustered up all of my
strength to be and sound as awake as possible. I needed to be professional even
if I was still in my underwear.&lt;/p></description><content:encoded><![CDATA[<p>I knew as soon as my phone rang what it was about.</p>
<p>It was the same every time. I would drag myself up to answer the phone - my
body, my mind screamed at me, but I had gotten good at overriding every instinct
through sheer willpower. I could hear the apologetic tone on the other side, and
I could recognise some of the voices after a while. I mustered up all of my
strength to be and sound as awake as possible. I needed to be professional even
if I was still in my underwear.</p>
<p>I had a glass of water on my bedside table. I&rsquo;d pick that up and head to my
office in the spare room. The computer was always on and always ready to go —
like me I guess. I&rsquo;d log on to the servers, and check the logs. If I could
identify which one had fallen out of the group, I could restart just that one. If I was
too late or if the issue had escalated, I&rsquo;d have to restart the whole cluster —
shut them all down, give them a few seconds, then bring each one up, while
keeping an eye on them. I could do it half asleep after a while.</p>
<p>Falling back asleep wasn&rsquo;t a breeze either - I was tired - exhausted - but I was
now also wired. Waking up in the morning was harder - the alarm would go off and
my body would be limp. I still remember the sheer power of will to drag myself
into the shower, then carry on with the rest of the day.</p>
<p>Of 266 incidents over about two years, I answered 156.</p>
<p>I remember one particular night, though I do not remember how many times I&rsquo;d
woken up beforehand. I was already tired.</p>
<p>megabus.com had gone offline. I got an alert. &ldquo;But someone else is on call
tonight,&rdquo; I told them. &ldquo;We already tried them twice,&rdquo; came the reply. I had to
deal with this. I had to deal with this.</p>
<p>I remember sitting at my desk at home working on fixing it. At some point,
something was different, though I don&rsquo;t remember what. While I was working on
fixing it, I remember being overcome with an overwhelming impulse to get up from
the chair and walk away — I almost imagined myself walking away. I resisted and
shut down that impulse. I fixed megabus as I had always done. In fixing megabus
though, something broke inside me, somewhere deep, in the very core of my being.
I was never the same again.</p>
<p>I analysed the system top to bottom, inside and out. I even waded through JVM
internals.</p>
<p>It got incrementally better, more stable. I think I rewrote every component that
wasn&rsquo;t the core ticketing system. In the end, what pushed it over the line were
two unexpected changes. Automated nightly restarts of each node in the cluster
and a rate limiter.</p>
<p>On the 10th December 2012, the system, now serving Polskibus, had a sale event.
We had a bank of screens on a wall with all the key stats for the system. It
looked cool, and we felt a bit like we were on a TV show. At peak, nearly 15,000
concurrent sessions — six or seven times the average. Over 30,000 bookings in a
single day, three times more than the normal amounts across all systems.</p>
<p>We watched it closely, all day. Nothing broke. Nothing screamed. Everyone
smiled, but there was no celebration.</p>
]]></content:encoded></item><item><title>Did They Have a Problem That Year?</title><link>https://icle.es/2026/04/14/did-they-have-a-problem-that-year/</link><pubDate>Tue, 14 Apr 2026 10:24:26 +0100</pubDate><guid>https://icle.es/2026/04/14/did-they-have-a-problem-that-year/</guid><description>&lt;p>2008 was a heck of a year for kraya, and for me. We were already operating
megabus.com in the UK, USA, and Canada, along with Oxford Tube, the sales
website for coach usa - all for Stagecoach.&lt;/p>
&lt;p>We were also working on the fringe website. We integrated the website with the
brand spanking new ticketing system - which cost nearly £900k.&lt;/p>
&lt;p>We were also hosting websites for Boots, Kellogg&amp;rsquo;s Food Service and dozens of
other clients.&lt;/p></description><content:encoded><![CDATA[<p>2008 was a heck of a year for kraya, and for me. We were already operating
megabus.com in the UK, USA, and Canada, along with Oxford Tube, the sales
website for coach usa - all for Stagecoach.</p>
<p>We were also working on the fringe website. We integrated the website with the
brand spanking new ticketing system - which cost nearly £900k.</p>
<p>We were also hosting websites for Boots, Kellogg&rsquo;s Food Service and dozens of
other clients.</p>
<p>All of this was held together by three or four developers, two systems
administrators and me.</p>
<p>On the 13 June (incidentally, I got married on the same date years later), as I
was just getting ready for a wild night on the town, a call comes through -
which John answers.</p>
<p>I still remember them laughing and then doing a double take &ldquo;oh, you&rsquo;re serious?
let me get Shri&rdquo;</p>
<p>It was the fringe. We&rsquo;d already known that they were having trouble with their
ticketing system. I&rsquo;d even pitched in, made suggestions - looked at their code
to try and help, but none of that was enough. I expected an update.</p>
<p>They wanted to know if we could put together an interim booking system for them
over the weekend. I wasn&rsquo;t sure. I told them I&rsquo;d speak to my team and get back
to them.</p>
<p>I wasn&rsquo;t involved with the work on the fringe up until this point. I knew very
little about it. I was focused on megabus.com. The US version of the site had a
big marketing campaign happening in a few days and that was what I was meant to
be focused on.</p>
<p>By the time I put the phone down, Chris, who had been the lead on the fringe,
already had an answer. &ldquo;We can do it!&rdquo;</p>
<p>&ldquo;But how - it&rsquo;s got to take more than a weekend - right?&rdquo;</p>
<p>The fringe website was already built well and had a clean layer interfacing with
the new ticketing system. In fact, that was the bulk of the work that year.</p>
<p>So.. Chris told me - all we would have to do is to implement the functionality
within that thin layer, fattening it up.</p>
<p>He believed we could do it. I believed him.</p>
<p>I hopped in a cab, headed over to the fringe to talk it through. I didn&rsquo;t
promise them we&rsquo;d be able to get something ready by Monday, but I promised we&rsquo;d
do our best.</p>
<p>We were in the office on the weekend, writing code. I remember working on the
basket, sending diffs over email and generally having a good time.</p>
<p>I even had some megabus US fun to keep me entertained in the form of issues with
loading sheets - I was already in the office, so it was one step easier to fix.</p>
<p>One of the bits of functionality which took a surprising amount of time was the
seat allocation. None of us had to worry about that before - it was just
capacity management on megabus. For the fringe though, we had to allocate actual
seats with seat numbers and everything.</p>
<p>With the fringe we had multiple tables which all joined together (thanks
hibernate) to encode a tremendous amount of detail about the seating plans -
including their physical location on a map.</p>
<p>It was too much detail for us, so we had to simplify it all down to get it
working in the timeframe. We kept most of the rest of the structures intact to
keep the data migration easier once the ticketing system was fixed.</p>
<p>By Monday, none of us had managed more than six hours of sleep on any of the previous
three nights. I still have vivid memories of a suit of armour that we put
together using packing material while we were waiting for bits of data or
details of logic.</p>
<p>I broke the MySQL replication at 01:24, fixed by 01:32.</p>
<p>I remember the delirium setting in. Email sent to the client with &ldquo;fun fun fun
fun fun fun fun&rdquo; as the subject</p>
<p>There were random emails to my brother &ldquo;I&rsquo;m still here!&rdquo;</p>
<p>I also remember making makeshift beds with bubblewrap to get a wee nap here and
there. We were all so exhausted - pumped up on coffee and nicotine.</p>
<p>Finally at 03:35 on the Tue email to client: &ldquo;DONE DONE DONE DONE DONE NODE NODE
NODE NODE&rdquo;</p>
<p>Then at 05:33, requesting a PostgreSQL server rebuild for megabus US for their
marketing campaign.</p>
<p>At 10am on Tuesday, the fringe is finally able to sell tickets. The website
promptly fell over from the load, but we nurse it back and it sells 65k+ tickets
in the first week.</p>
<p>It would be at least two more weeks before the ticketing system is fixed and
brought back in.</p>
<p>For the work we did for them that year, and the previous one, we effectively
only charged about 30% - because that&rsquo;s all they could afford. This year, we
asked if they could put our name on the website.</p>
<p>Over the next two weeks (while megabus US was on their marketing campaign), we
fought many battles. There were 750 duplicate bookings. Numerous customer
complaints (thanks to our name being on the website) - almost all of them
blaming us for the failure of the ticketing system. People did not understand
that we put in the interim one, not the one that failed.</p>
<p>Press releases went out from the fringe - only two credited us. Both misspelt
the company name. Both called us a web design company - which we were not, had
never been, and had no interest in becoming.</p>
<p>In truth, I wanted to be a hero - I think we all did. What we really wanted was
an acknowledgement of what we had done - which was nowhere to be found. We got
paid though - at least for a part of our effort.</p>
<p>For many years after that, I would tell people with pride - &ldquo;did you know - I
saved the fringe, back in 2008,&rdquo; which was inevitably met with something like
&ldquo;oh, did they have a problem that year?&rdquo;</p>
]]></content:encoded></item><item><title>What we Carried</title><link>https://icle.es/2026/04/13/what-we-carried/</link><pubDate>Mon, 13 Apr 2026 20:02:04 +0100</pubDate><guid>https://icle.es/2026/04/13/what-we-carried/</guid><description>&lt;p>I started my company in 2000. I was 17. I built megabus.com in 2003. I was 21.&lt;/p>
&lt;p>It started off small, and little by little, I carried more and more. I became
we, and we carried more and more. Before we realised, we were carrying a great
deal. Ultimately, though, if something went seriously wrong, it would be on my
shoulders.&lt;/p>
&lt;p>The chart does not capture the scaling of the organisation, or other departments
like tech support or hosting, which had dozens of clients.&lt;/p></description><content:encoded><![CDATA[<p>I started my company in 2000. I was 17. I built megabus.com in 2003. I was 21.</p>
<p>It started off small, and little by little, I carried more and more. I became
we, and we carried more and more. Before we realised, we were carrying a great
deal. Ultimately, though, if something went seriously wrong, it would be on my
shoulders.</p>
<p>The chart does not capture the scaling of the organisation, or other departments
like tech support or hosting, which had dozens of clients.</p>

<figure >
    <img src="./gantt_combined_1000.png"
        alt="gantt chart of the main projects done by kraya through its life" />
    <figcaption>
        <p style="margin: -0.5rem 0 0 0;">
            Data mined by Claude from my emails, issue trackers and code repos
        </p>
    </figcaption>
</figure>


<p>The section at the top is the number of active committers for that quarter. You
can see me on my tod at the start and the rise and the fall of the dev team.</p>
<p>The very peak of it was in 2008. We
<a href="https://icle.es/saving-the-fringe.md">built a booking system over the weekend for the Edinburgh festival fringe because their brand new £800k+ system could only handle one person at a time.</a></p>
<p>There were a handful of us building it while I was simultaneously prepping and
managing the megabus US systems for a sales campaign. At the same time we were
operating megabus across the UK, USA, and Canada, Oxford Tube, Coach USA, the
Fringe website itself and numerous other smaller hosting clients like Boots,
Kelloggs Food Service, and so on.</p>
<p>We were a total of ~4 developers and two systems administrators holding all of
these together.</p>
<p>I was recently reminded of a story a friend of mine told me. Before he was my
friend, he worked with me, and one of the things we did together was one of the big
megabus deployments when we migrated to a Java EE ticketing system.</p>
<p>One part of the migration was the data. I had developed a tool to migrate the
data and on the evening - everything was prepared, we were off peak, and we had
taken the site offline. He ran the script, which went on for a wee while and it
failed. It was not meant to do that.</p>
<p>The way he tells the story, he tells me about the failure. I come over, look at
the errors, say &ldquo;hmmm, that&rsquo;s interesting,&rdquo; and head off to have a cigarette. A
few minutes later I go over to my desk, type away furiously, then ask him to
run it again.</p>
<p>It worked, and completed.</p>
<p>I remember that night. I don&rsquo;t remember what the problem was or how I fixed it,
but I do remember that moment when I went over to see how it had failed. In the
short walk from my desk to his, I ran through in my mind all the possible backup
plans - with the worst case scenario being to call off the migration on that
day. We would do it another day. It would cost money, but it would be do-able. I
was ok with that.</p>
<p>I was curious as to where my limits were, so I kept pushing, until I would meet
with a wall.</p>
<p>There were no rails and there were no railings - just a cliff edge, unmarked… I
didn&rsquo;t know that - I expected a brick wall.</p>
<p>The worst of it would be only a few years later, in 2011. We built a new Java EE
ticketing system for a fraction of what it should have cost in about 30% of the
time it needed.</p>
<p>I personally responded to over 250 out of hours emergency tickets over an 18
month period. That was hard!</p>
<p>I had run off a cliff edge, and like the coyote in the Road Runner cartoons, it took a
while before I realised there was no ground beneath me.</p>
<p>A few years after I ran off the cliff, the company shut down. A few years later,
I would start my active recovery journey through therapy. A few years later
still, when I felt ready for a leadership role, I was asked to lead a
problematic team - a role they struggled to fill for a while.</p>
<p>The team worked hard and delivered but struggled with the perception of poor
delivery. Trust was thin, stress was high and morale was low. The situation was
so bad that the week before I was supposed to start, the scrum master who was
supposed to be my guide through it all quit.</p>
<p>I was warned by multiple people that this job was loaded with problems. I took
on the job anyway, without a real guide, straight into multiple serious issues.</p>
<p>I loved it and managed to turn the whole thing around in my first week.
Delivered key items, laid the foundations of trust and improved morale. It took
a bit longer to bed everything down. Within weeks, I was asked if I would take
on leading the entire digital team.</p>
<p>It was here, many years later, I got a sense of how unusual it was for such a
tiny team to do so much.</p>
<p>It was here, many years later, I got a sense of how it is to have guardrails, to
have support, peers to lean on.</p>
<p>It was here, for the first time I realised that the job didn&rsquo;t have to be a
lonely one.</p>
]]></content:encoded></item><item><title>Whatcha Thinking?</title><link>https://icle.es/2026/04/13/whatcha-thinking/</link><pubDate>Mon, 13 Apr 2026 12:05:48 +0100</pubDate><guid>https://icle.es/2026/04/13/whatcha-thinking/</guid><description>&lt;p>I loved working on megabus. I was in love with it. My girlfriend at the time had
a habit of asking what I was thinking about when I looked deep in thought. The
answer - every single time, was inevitably megabus. She eventually stopped
asking.&lt;/p>
&lt;p>I was 22 years old.&lt;/p>
&lt;p>When I built the original prototype for megabus.com, I built it using PHP +
PostgreSQL. I put together a document detailing my reasoning for these choices.
I quoted 33 days for it, built it over six weeks and charged £13,200.&lt;/p></description><content:encoded><![CDATA[<p>I loved working on megabus. I was in love with it. My girlfriend at the time had
a habit of asking what I was thinking about when I looked deep in thought. The
answer - every single time, was inevitably megabus. She eventually stopped
asking.</p>
<p>I was 22 years old.</p>
<p>When I built the original prototype for megabus.com, I built it using PHP +
PostgreSQL. I put together a document detailing my reasoning for these choices.
I quoted 33 days for it, built it over six weeks and charged £13,200.</p>
<p>The support contract was £350/month - for one day a month. On the first day,
megabus.com sold 200 orders.</p>
<p>When megabus had its first expansion, I was up overnight bringing new servers
online and scaling it live. I loved it - my code was finally being tested.</p>
<p>Over a week, I&rsquo;d probably burned through many days of effort. I remember the
project manager specifically asking me to invoice for the extra work I put into
it. I even said that I would - except I didn&rsquo;t.</p>
<p>I&rsquo;ve had a long time to think about this - why did I not send that invoice? I
even had approval.</p>
<p>The answer, as with most things of this nature, is complicated. I loved the work
and I didn&rsquo;t want it to end. I didn&rsquo;t want a potential conflict trying to figure
out what a reasonable amount was to charge. I felt that I should have done a
better job in the first place - I felt responsible that I had not told them that
scaling of this nature would not have worked without prep work.</p>
<p>I had not scaled anything before.</p>
<p>I was 22 years old.</p>
<p>I was super grateful that someone believed in me. I naively assumed that they
saw all the extra effort I was putting in and that they would reward me for it -
that they would have my back.</p>
<p>I remember adding a bunch of different bits of functionality because I wanted it
there. I didn&rsquo;t want to go through the process of quoting for it, and it getting
potentially rejected, not to mention the waiting for decisions. One key bit of
functionality I remember is adding in a percentage load column for the loading
sheets. I built it, showed it - they loved it! It went live. I did not charge
for it.</p>
<p>At this point, the vast majority of my time was spent on megabus - very little
of it actually paid for.</p>
<p>At a glance, based on the emails sent, I probably spent a minimum of 10 days
each month supporting megabus when I was charging for one day.</p>
<p>In Jan 2004 - I proposed <em>doubling</em> the contract to two days for £4,800/year. It
probably kicked in in Feb 2004. By March 2004, the site exceeded that revenue
each day.</p>
<p>In the following months, I probably spent, on average, at least double the time
I was paid for. I should have charged for it.</p>
<p>I grew the team, and the support contract based on the minimum I needed to
maintain the product - not based on the amount of time I was spending.</p>
<p>For my 28th birthday, my girlfriend at the time organised a cake which was an
image representing kraya - which was basically megabus. I felt bad that she
thought that kraya was the most important thing in my life - she was right - but
it still felt bad. kraya had other clients at the time, but my time wasn&rsquo;t
monopolised by other clients, or indeed by kraya - my heart still belonged to
megabus.</p>
<p>And it would all have been fine too, except for a grave miscalculation I made.</p>
<p>In 2010, after trying to rebuild the ticketing system for £500k, and making some
mistakes with people I trusted, kraya ended up £150k in the hole. We needed
some money urgently.</p>
<p>I was desperate and naively, I reached out to stagecoach for help. I thought
they were my friend - that they would have my back.</p>
<p>They understandably lost a great deal of trust in my ability to manage and lead
my company. I trusted the wrong person - but that was still my mistake. They
were right.</p>
<p>I thought that I&rsquo;d built up enough goodwill that they would help me through
this. I&rsquo;d felt I would have way more than that &ldquo;in the bank&rdquo; in terms of
goodwill. That dark afternoon, standing outside my office on the phone, in the
rain, I learned that professional relationships do not work that way.</p>
<p>They didn&rsquo;t make my life easier. Instead, I&rsquo;d ended up rattling the cage - they
were now panicked - realising their over-reliance on an organisation that could
disappear at any point.</p>
<p>Instead of support, I had further actions, renegotiating the contract and what
felt like punitive, and definitely invasive reporting obligations.</p>
<p>I was hurt and angry. I had poured my heart, my soul - hey, my very life into
this product that I loved.</p>
<p>Suffice it to say - I got no help - no loan, no offer of investment - though
they did suggest buying us outright - which I rejected.</p>
<p>I signed a contract under circumstances I would not wish on anyone.</p>
<p>The best I got from them was a challenge - if we were really spending more time
than we were charging for - prove it. I did! We documented every minute we were
spending - I wasted my time on spreadsheets, pointless meetings and work to try
and rebuild the broken trust.</p>
<p>We went from £300k in the hole to £200k profit within a year. We charged for a
whole year in support around 20% of what the system made in a day.</p>
<p>I was 28 years old.</p>
<p>Around the same time, I was also dealing with the operational aftermath of
trying to build a Java EE ticketing system over six months for £500k. I thought
it would take a year and cost £1m. In hindsight, it needed two years and
probably three million pounds.</p>
<p>Over 18 months, I personally answered over 150 out of hours emergency calls. We
had a rota and others on call too - but I took the vast majority of these calls.
I felt bad putting others through what I knew was gruelling.</p>
<p>All of this led me down a narrower and narrower path to a serious breakdown -
though I didn&rsquo;t know enough to name it until many years later. All I knew - all
I felt was that something broke in me.</p>
<p>We managed to resolve all of the issues, but the deployment of that version kept
getting pushed.</p>
<p>Stagecoach cancelled the contract in 2012. They had started building a ticketing
system in-house two years prior - the cost of my grave mistake. I wasn&rsquo;t able to
make the meeting - I was in India, and at the same time as the meeting, I was
meeting for the first time the one who is now my wife.</p>
<p>I was 28 years old. I spent the next 15 years putting myself back together.</p>
<p>How much did it cost them to build it in-house? If I had charged for my time from
the start, would we all have been better off?</p>
<p>I still feel something deep inside me every time I see a megabus - a sense of
pride mixed in with a deep sense of sadness - not for what I lost - but for what
could have been.</p>
<p>I am 44 years old, and I am starting again.</p>
]]></content:encoded></item><item><title>Priced In</title><link>https://icle.es/2026/04/07/priced-in/</link><pubDate>Tue, 07 Apr 2026 09:55:24 +0100</pubDate><guid>https://icle.es/2026/04/07/priced-in/</guid><description>&lt;p>Two ticketing systems. Same client. Same payment provider. We were moving fast —
that was the explicit choice, theirs and mine. The kind of fast where you know
something will go wrong eventually and you price it in rather than try to
prevent it.&lt;/p>
&lt;p>We&amp;rsquo;d done the sensible thing and shared the payment code between them — DRY,
less surface area for error, obvious call.&lt;/p>
&lt;p>Then the larger system needed PostAuth. We updated the code, added a scheduled
task to catch anything the non-deterministic bits missed, moved on.&lt;/p></description><content:encoded><![CDATA[<p>Two ticketing systems. Same client. Same payment provider. We were moving fast —
that was the explicit choice, theirs and mine. The kind of fast where you know
something will go wrong eventually and you price it in rather than try to
prevent it.</p>
<p>We&rsquo;d done the sensible thing and shared the payment code between them — DRY,
less surface area for error, obvious call.</p>
<p>Then the larger system needed PostAuth. We updated the code, added a scheduled
task to catch anything the non-deterministic bits missed, moved on.</p>
<p>A few months later: why has no money come through on the smaller system?</p>
<p>We&rsquo;d ported the PostAuth flow across when we updated the shared code. We hadn&rsquo;t
added the scheduled task. The payment provider, chosen for cheap and cheerful
rather than reliability, failed silently rather than erroring. The accounting
department, running at the same pace as everyone else, hadn&rsquo;t caught the gap.</p>
<p>Four separate things had to go wrong simultaneously. Any one of them holding
would have meant no loss at all.</p>
<p>The client lost money. Not a catastrophic amount, but real money. I braced for
the call.</p>
<blockquote>
<p>Try and let me know the next time you decide to run a sale.</p></blockquote>
<p>He already knew the cost. He&rsquo;d known before the mistake happened.</p>
]]></content:encoded></item><item><title>I Know People Like You</title><link>https://icle.es/2026/03/31/i-know-people-like-you/</link><pubDate>Tue, 31 Mar 2026 10:41:29 +0100</pubDate><guid>https://icle.es/2026/03/31/i-know-people-like-you/</guid><description>&lt;p>A few years ago, I was interviewed for a role. I was talking about a ticketing
system I&amp;rsquo;d built - originally in Spring, then rewritten to use EJB 3.2. The
interviewer didn&amp;rsquo;t look impressed.&lt;/p>
&lt;p>The team had already written a lot of stuff in Spring - but I really did not
like it. There was all this XML all over the place which was annoying, but what
I really didn&amp;rsquo;t like was that the code and configuration for each component was
spread out all over the place. It meant that to understand how something worked,
I had to go hunting. Eventually, I got sick of it, and ported it to EJB myself.&lt;/p></description><content:encoded><![CDATA[<p>A few years ago, I was interviewed for a role. I was talking about a ticketing
system I&rsquo;d built - originally in Spring, then rewritten to use EJB 3.2. The
interviewer didn&rsquo;t look impressed.</p>
<p>The team had already written a lot of stuff in Spring - but I really did not
like it. There was all this XML all over the place which was annoying, but what
I really didn&rsquo;t like was that the code and configuration for each component was
spread out all over the place. It meant that to understand how something worked,
I had to go hunting. Eventually, I got sick of it, and ported it to EJB myself.</p>
<p>Later in the same interview, he said: &ldquo;I know people like you - you come in,
shake things up and get things done - but that&rsquo;s not what I&rsquo;m looking for.&rdquo;</p>
<p>He was right. I understand that. But I&rsquo;ve been thinking about what he was
actually describing.</p>
<p>When megabus.com was still a PHP site, search was the problem. It returned
quickly when the database was healthy and slowed to a crawl when the
database was struggling. The load came in spikes. Even within a minute, there
were peaks and troughs.</p>
<p>My fix was simple. Before the search query ran, I added a small SQL check: how
many queries are currently active on the database server? If too many, wait a
second and try again. A few retries, then send it anyway.</p>
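<p>A minimal sketch of that check, written in Python rather than the original PHP.
The function and parameter names, the threshold of 20 and the retry count of 3
are all hypothetical; the original query isn&rsquo;t shown here. In PostgreSQL the
count could plausibly come from something like the pg_stat_activity view, but
that is an assumption too.</p>

```python
import time
from typing import Callable, List


def search_with_backoff(
    count_active_queries: Callable[[], int],  # hypothetical: asks the DB how busy it is
    run_search: Callable[[], List[str]],      # hypothetical: the real search query
    max_active: int = 20,                     # assumed threshold, not from the original
    max_retries: int = 3,                     # assumed "a few retries"
    wait_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Hold the search back while the database looks busy, then run it anyway."""
    for _ in range(max_retries):
        busy = count_active_queries() > max_active
        if not busy:
            break  # load is acceptable, stop waiting
        sleep(wait_seconds)  # give the spike a moment to pass
    # Either the load dropped or the retries ran out: send the query regardless.
    return run_search()
```

<p>The search always runs in the end, so the worst case is a few seconds of
added delay rather than a refused request. The cost is one extra cheap query per
attempt against the same server that is already under load.</p>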
<p>A rate limiter baked into a search algorithm, written live on a production
server.</p>
<p>There were edge cases to consider, not to mention the load the rate limiter
would add to the database server. I knew though that if it broke, I could fix
it - I could just remove it - live, if needed. Not having the rate limiter was,
at the time, more expensive than having it.</p>
<p>It worked. It got us through more than one hump.</p>
<p>The database was still the ceiling. We were on PostgreSQL 7 - no replication
support. Getting a more powerful server was possible but disproportionately
expensive. So I built something.</p>
<p>Two database servers. All writes went to both. Reads were distributed randomly
between them. Everything funnelled through one section of code.</p>
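<p>The shape of that one section of code can be sketched roughly as below, in
Python rather than the original PHP. The class name and the idea of passing each
server in as an &ldquo;execute&rdquo; callable are hypothetical, and the sketch leaves out
transactions and any divergence detection.</p>

```python
import random
from typing import Any, Callable, List, Optional, Sequence


class DualWriteRouter:
    """Send every write to all servers; send each read to one picked at random."""

    def __init__(
        self,
        servers: Sequence[Callable[[str], Any]],  # hypothetical: one execute function per server
        rng: Optional[random.Random] = None,
    ):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._rng = rng or random.Random()

    def write(self, statement: str) -> List[Any]:
        # Apply the same statement to every server, always in the same order.
        # An exception from any server propagates to the caller, who decides
        # whether to retry or to reset the lagging server from a chosen primary.
        return [execute(statement) for execute in self._servers]

    def read(self, query: str) -> Any:
        # A read only needs one copy, so spread reads across the servers.
        return self._rng.choice(self._servers)(query)
```

<p>Keeping both paths in one place is what makes the contingency practical: if the
servers diverge, there is a single point at which to stop reads going to the
bad one.</p>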
<p>I didn&rsquo;t do this live. I tested it. I knew what failure looked like: if the
servers diverged badly enough, I&rsquo;d pick a primary and reset the other. That was
the contingency. It wasn&rsquo;t a safety net someone else would pull - it was mine.</p>
<p>The data integrity held better than I expected. Under very high load there were
edge cases - ticket IDs for the same customer could be in a different order
across the two servers on a return purchase - but because the IDs were
consistent within each server, it never caused a real problem. It held the fort
until I could replace it with something better.</p>
<p>I occasionally lie awake at night imagining the databases diverging and figuring
out how I would fix it.</p>
<p>I picked PostgreSQL over MySQL when MySQL was the obvious choice. Under heavy
load it stays up — it slows to a crawl, but it keeps going. And it had
transactions. I was building an ecommerce site; I needed transaction support.
MySQL was fast and popular. It also had a habit of giving up under sustained
load. I still pick PostgreSQL - but nowadays, so do most other people.</p>
<p>The thing these decisions had in common was that I was the person who&rsquo;d be
fixing them at 3am if they went wrong. When you&rsquo;re personally accountable for
the consequences, the risk calculus changes. You think harder about what failure
looks like. You build the contingency before you go live. You know which
direction to pull if it goes sideways.</p>
<p>Caution that&rsquo;s never personally tested isn&rsquo;t rigour. It&rsquo;s consequence-avoidance
dressed up as responsibility.</p>
<p>&ldquo;I know people like you - you come in, shake things up and get things done - but
that&rsquo;s not what I&rsquo;m looking for.&rdquo;</p>
]]></content:encoded></item><item><title>It Gets Everywhere</title><link>https://icle.es/2026/03/24/it-gets-everywhere/</link><pubDate>Tue, 24 Mar 2026 20:52:21 +0000</pubDate><guid>https://icle.es/2026/03/24/it-gets-everywhere/</guid><description>&lt;p>In 1999, I was building websites in ASP (before there was .NET) and MSSQL
Server. We had a Windows NT server that I had to restart every week — not
because of updates, because it would get slower and slower until a restart was
the only thing that would fix it.&lt;/p>
&lt;p>We had one ADSL connection coming into the office and three of us. I wanted to
share the internet. Windows NT didn&amp;rsquo;t support it cleanly — it had a way, but it
was clunky enough that no internet was arguably better. We&amp;rsquo;d paid hundreds of
pounds for it.&lt;/p></description><content:encoded><![CDATA[<p>In 1999, I was building websites in ASP (before there was .NET) and MSSQL
Server. We had a Windows NT server that I had to restart every week — not
because of updates, because it would get slower and slower until a restart was
the only thing that would fix it.</p>
<p>We had one ADSL connection coming into the office and three of us. I wanted to
share the internet. Windows NT didn&rsquo;t support it cleanly — it had a way, but it
was clunky enough that no internet was arguably better. We&rsquo;d paid hundreds of
pounds for it.</p>
<p>I&rsquo;d heard about Linux. Downloaded Red Hat, installed it, configured it for NAT.
It worked — it was like magic. I&rsquo;m pretty sure I had to recompile the kernel to
get some bits working, but there were instructions and they were honest. It did
what it said.</p>
<p>Here was software that was completely free — free enough that I could read the
source code, make changes, run it however I wanted. It did more than the
hundreds of pounds worth of garbage sitting on the desk. And once I set it up, I
never had to restart it. Never. Compared to once a week on the NT box.</p>
<p>The difference, in my mind, was simple. Linux was built responsibly. NT was
built as a money-making enterprise.</p>
<p>That held for a long time. I moved to Debian, then celebrated when Ubuntu
arrived and made things more accessible. I&rsquo;ve recently been able to abandon
Windows altogether — gaming on Linux is finally viable. I came back full time
and felt mostly at home.</p>
<p>But there were minor niggles. Things that felt slightly off but that I couldn&rsquo;t
quite name.</p>
<p>Then I started digging into systemd.</p>
<p>I remembered feeling odd about having to run specific commands to read logs. Odd
about one tool doing many different things — which ran contrary to the Unix
philosophy that had made Linux what it was. When I looked into the history of
the opposition to systemd, it was revelatory.</p>
<p>systemd becoming process 1 is, in a word, irresponsible. It makes everything
easier and more accessible, which is why it won. But unlike the Linux of old,
the tradeoff isn&rsquo;t visible upfront, and there&rsquo;s no real choice. The responsible
option isn&rsquo;t the default anymore — it&rsquo;s the thing you have to go looking for.</p>
<p>I thought I had already done the work. I thought I had found the alternative.</p>
<p>While I was celebrating Linux becoming mainstream, I hadn&rsquo;t considered what it
would cost.</p>
<p>The Linux ecosystem had started optimising for mainstream at the expense of
responsibility. It works now, for far more people. But it&rsquo;s a different thing
than it was. When Linux was really taking off, there was a joke going around
(before memes were called memes) about Microsoft Linux. Turns out the joke was
on us!</p>
<p>It is always a tradeoff between security and convenience — something convenient
is rarely secure, and vice versa. I think something similar applies to
responsibility. The more accessible you make something, the harder it becomes to
hold the line on what it was built to do.</p>
<p>There was a time when software going wrong meant losing your work. Now it means
losing your money, your reputation, or — in a car, in a hospital — your life.</p>
<p>The context has changed. The attitudes haven&rsquo;t. And the places that once had
better attitudes — the ones built on responsibility, on craft, on caring about
the thing itself — are being pulled in the same direction. <em>It gets everywhere.</em></p>
<p>Do you want your car running Windows? What about systemd?</p>
]]></content:encoded></item><item><title>Even Light Gets Heavier</title><link>https://icle.es/2026/03/24/even-light-gets-heavier/</link><pubDate>Tue, 24 Mar 2026 10:56:05 +0000</pubDate><guid>https://icle.es/2026/03/24/even-light-gets-heavier/</guid><description>&lt;p>A dedicated input type is better than reusing your domain model at the API
boundary. Test layers matter. Writing log statements as you go saves the poor
soul (probably you) debugging blind at 10pm. You know all of this.&lt;/p>
&lt;p>This isn&amp;rsquo;t about any of that.&lt;/p>
&lt;p>It&amp;rsquo;s about the fact that none of those decisions show up in the metrics that
matter to the people making hiring and delivery calls. The cost is immediate and
visible. The return is delayed, quiet, and arrives in the form of things that
didn&amp;rsquo;t happen — the investigation that took two hours instead of two days, the
API change that didn&amp;rsquo;t bleed into the domain model, the bug that the structure
caught before it shipped.&lt;/p></description><content:encoded><![CDATA[<p>A dedicated input type is better than reusing your domain model at the API
boundary. Test layers matter. Writing log statements as you go saves the poor
soul (probably you) debugging blind at 10pm. You know all of this.</p>
<p>This isn&rsquo;t about any of that.</p>
<p>It&rsquo;s about the fact that none of those decisions show up in the metrics that
matter to the people making hiring and delivery calls. The cost is immediate and
visible. The return is delayed, quiet, and arrives in the form of things that
didn&rsquo;t happen — the investigation that took two hours instead of two days, the
API change that didn&rsquo;t bleed into the domain model, the bug that the structure
caught before it shipped.</p>
<p>Sprint velocity captures the extra day. It doesn&rsquo;t capture what that day bought.</p>
<p>This is not a new problem. Most engineers who&rsquo;ve been around long enough have
felt it from both sides - made the careful call and got measured on the
slowness, or inherited the codebase built entirely for speed and paid the tax.
The measurement system was already broken. It has been rewarding the appearance
of velocity over the thing velocity is supposed to serve.</p>
<p>This was true long before anyone was generating code with AI. The PR process in
a lot of teams was already largely theatrical — review comments on naming
conventions while the architectural decisions slipped through unquestioned,
approvals given because the diff was too large to meaningfully read. The gate
was already not doing much. We brushed it under the carpet and moved on.</p>
<p>AI tooling is changing the volume of code moving through that process by an
order of magnitude. The pressure to remove the gate entirely — to trust the
output, to ship faster - is only growing. The faster-is-better incentive that
was already making review ineffective is about to be handed a much larger
surface to work on.</p>
<p>Many years ago, I pitched a full redevelopment of a ticketing system, from a
PHP-based system to a Java EE system, because it was struggling to scale.</p>
<p>It probably needed a couple of years to build. They wanted it in six months. I
accepted the challenge.</p>
<p>We built and deployed the system in eight months. We spent the next year fixing
it.</p>
<p>The client then rebuilt it in-house.</p>
<p>When AI runs this experiment at scale, who takes it back?</p>
]]></content:encoded></item><item><title>We Optimised Ourselves to Death</title><link>https://icle.es/2026/02/11/we-optimised-ourselves-to-death/</link><pubDate>Wed, 11 Feb 2026 09:55:18 +0000</pubDate><guid>https://icle.es/2026/02/11/we-optimised-ourselves-to-death/</guid><description>&lt;p>I once worked on a gaming website.&lt;/p>
&lt;p>It collected structured metadata about games - tags for features, screenshots,
videos, reviews. Users contributed information. We gamified participation and
rewarded it with games and gifts.&lt;/p>
&lt;p>It started making money through “similar games” lists.&lt;/p>
&lt;p>All of our traffic came from Google.&lt;/p>
&lt;p>Then we needed more revenue.&lt;/p>
&lt;p>So we did what teams do.&lt;/p>
&lt;p>We added features.&lt;br>
Integrated Steam, Xbox and PSN.&lt;br>
Pulled in achievements.&lt;br>
Expanded recommendation lists.&lt;br>
Tweaked advertising.&lt;br>
Worked on SEO.&lt;/p></description><content:encoded><![CDATA[<p>I once worked on a gaming website.</p>
<p>It collected structured metadata about games - tags for features, screenshots,
videos, reviews. Users contributed information. We gamified participation and
rewarded it with games and gifts.</p>
<p>It started making money through “similar games” lists.</p>
<p>All of our traffic came from Google.</p>
<p>Then we needed more revenue.</p>
<p>So we did what teams do.</p>
<p>We added features.<br>
Integrated Steam, Xbox and PSN.<br>
Pulled in achievements.<br>
Expanded recommendation lists.<br>
Tweaked advertising.<br>
Worked on SEO.</p>
<p>Traffic crept upward.</p>
<p>Still not enough.</p>
<p>Eventually we decided the problem was perception.</p>
<p>The site looked too much like a community project. It needed to feel more
premium. More authoritative. More modern.</p>
<p>So we renamed it.<br>
Changed the domain.<br>
Redesigned it from the ground up.</p>
<p>Months of work.</p>
<p>We launched.</p>
<p>Traffic collapsed.</p>
<p>We never recovered.</p>
<p>In hindsight, the failure wasn’t technical.</p>
<p>It wasn’t branding.</p>
<p>It wasn’t SEO.</p>
<p>It was that we never made a hard decision about what the product actually was.</p>
<p>Was it:</p>
<ul>
<li>A participatory community?</li>
<li>A structured data engine?</li>
<li>A search destination?</li>
<li>A content property optimised for Google?</li>
<li>A recommendations platform?</li>
</ul>
<p>It was all of them.</p>
<p>Weakly.</p>
<p>What Google valued wasn’t polish. It valued volatility.</p>
<p>Our homepage changed many times a day because users were contributing.<br>
Those contributions created fresh internal links, fresh content, fresh signals.</p>
<p>Participation was the engine.</p>
<p>When we redesigned for the information consumer instead of the contributor, we
stabilised the surface.</p>
<p>We accidentally killed the engine.</p>
<p>We optimised the visible layer and ignored the system feeding it.</p>
<p>I first heard the phrase “we’ll fix it in post” from my filmmaker brother.</p>
<p>Something wasn’t quite right during filming, but they moved on anyway. It could
be corrected later.</p>
<p>In film, that’s sometimes true.</p>
<p>In product development, it’s usually self-deception.</p>
<p>Lean encourages delaying decisions to the last responsible moment.<br>
That’s discipline.</p>
<p>What most teams practice is delaying decisions until they become painful.<br>
That’s avoidance.</p>
<p>An MVP is not the smallest thing you can push out.<br>
It is the smallest thing that is coherent and viable.</p>
<p>Viable means it has a clear shape.<br>
It respects constraints.<br>
It closes more questions than it opens.</p>
<p>If you ship something that only works on the happy path, with undefined edges
and postponed trade-offs, you haven’t preserved optionality.</p>
<p>You’ve preserved ambiguity.</p>
<p>Ambiguity spreads.</p>
<p>In code, as defensive layers.<br>
In design, as half-committed patterns.<br>
In product, as multiple possible futures carried at once.</p>
<p>Teams don’t slow down because they’re weak.<br>
They slow down because no one chose.</p>
<p>Every postponed constraint becomes cognitive load.<br>
Every “temporary” rule becomes precedent.</p>
<p>Lean does not say “don’t decide.”</p>
<p>It says: decide at the point where delaying further increases cost.</p>
<p>Most teams drift past that point because deciding feels like loss.</p>
<p>Loss of flexibility.<br>
Loss of imagined futures.<br>
Loss of political safety.</p>
<p>But momentum comes from commitment.</p>
<p>Once something is decided, energy frees up.<br>
The system becomes legible.<br>
Subsequent decisions compound instead of conflict.</p>
<p>We didn’t fail because we built the wrong feature.</p>
<p>We failed because we never chose what we were.</p>
<p>Most startups don’t die from lack of effort.</p>
<p>They die from unmade decisions.</p>
<p>“We’ll fix it later” is not iteration.</p>
<p>It is hope disguised as strategy.</p>
]]></content:encoded></item><item><title>Microservices vs Monolith: Real World Tradeoffs</title><link>https://icle.es/2024/07/17/microservices-vs-monolith-real-world-tradeoffs/</link><pubDate>Wed, 17 Jul 2024 09:48:25 +0100</pubDate><guid>https://icle.es/2024/07/17/microservices-vs-monolith-real-world-tradeoffs/</guid><description>&lt;p>When starting a new backend system for a contract I was on, one of the early
decisions I had to make was whether to lean into a monolith or adopt a
microservices approach. While common wisdom offers strong opinions on both ends
of the spectrum, in reality, the choice often hinges on organizational
constraints as much as on technical purity.&lt;/p>
&lt;h3 id="reactive-vs-traditional-spring-web">Reactive vs Traditional Spring Web&lt;/h3>
&lt;p>I began by reviewing
&lt;a href="https://filia-aleks.medium.com/microservice-performance-battle-spring-mvc-vs-webflux-80d39fd81bf0">performance comparisons&lt;/a>
between Spring MVC and WebFlux. Reactive Web generally comes out ahead in
benchmarks, but that doesn’t tell the whole story.&lt;/p></description><content:encoded><![CDATA[<p>When starting a new backend system for a contract I was on, one of the early
decisions I had to make was whether to lean into a monolith or adopt a
microservices approach. While common wisdom offers strong opinions on both ends
of the spectrum, in reality, the choice often hinges on organizational
constraints as much as on technical purity.</p>
<h3 id="reactive-vs-traditional-spring-web">Reactive vs Traditional Spring Web</h3>
<p>I began by reviewing
<a href="https://filia-aleks.medium.com/microservice-performance-battle-spring-mvc-vs-webflux-80d39fd81bf0">performance comparisons</a>
between Spring MVC and WebFlux. Reactive Web generally comes out ahead in
benchmarks, but that doesn’t tell the whole story.</p>
<p>In our use case—web notifications—the benefit of reactive patterns depends
heavily on how data is delivered. If we were polling, the advantage would be
limited. However, with Server-Sent Events (SSE), Spring’s support aligns
directly with Reactive Web, making WebFlux the more appropriate choice for this
part of the system.</p>
<h3 id="the-deployment-constraint">The Deployment Constraint</h3>
<p>Ideally, I would have started with a monolith: a single deployable artifact
combining both the Kafka Streams logic and the API. This option would have
simplified initial development and allowed us to iterate quickly. But at the
client, the platform does not allow deploying a Kafka Streams app and an API
within the same Kubernetes deployment.</p>
<p>This effectively rules out a true monolith, even for a prototype.</p>
<h3 id="options-considered">Options Considered</h3>
<h4 id="shared-library-with-thin-deployments">Shared Library with Thin Deployments</h4>
<p>A middle ground was to build the core logic in a shared library and have
lightweight deployments wrap around it. This would allow the streams app and the
API to share code without needing to make HTTP calls between them.</p>
<p>The downside: these services are no longer independently deployable. But given
our team size and velocity goals, this compromise might be acceptable.</p>
<h4 id="full-microservices">Full Microservices</h4>
<p>Another option was to separate the services entirely:</p>
<ul>
<li><strong>Streams service</strong> (Kafka, plus domain-specific logic)</li>
<li><strong>Web API</strong> (for delivering notifications)</li>
<li><strong>Subscription API</strong> (managing notification subscriptions)</li>
</ul>
<p>This adheres more closely to the single responsibility principle, especially as
we move from PoC to MVP. However, it adds deployment and coordination overhead.</p>
<h4 id="application-profiles">Application Profiles</h4>
<p>A third, admittedly hacky, option was to control which parts of the app run using
environment-based profiles. For example, we could disable Kafka in dev or use
conditional beans to keep deployments clean. While not ideal long-term, it
offers flexibility for early stages.</p>
<h3 id="conclusion">Conclusion</h3>
<p>Constraints matter. While I lean toward monoliths for rapid delivery in small
teams, platform limitations forced a hybrid approach. We intend to evolve into
microservices over time, but only when the benefits clearly outweigh the cost.</p>
<p>Have you faced similar deployment constraints that shaped your architecture? I&rsquo;d
love to hear how you navigated them.</p>
]]></content:encoded></item><item><title>PostgreSQL performing huge updates</title><link>https://icle.es/2011/11/06/postgresql-performing-huge-updates-1106/</link><pubDate>Sun, 06 Nov 2011 12:45:41 +0000</pubDate><guid>https://icle.es/2011/11/06/postgresql-performing-huge-updates-1106/</guid><description>&lt;p>PostgreSQL is a pretty powerful database server and will work with almost any
settings thrown at it. It is really good at making do with what it has and
performing as it is asked.&lt;/p>
&lt;p>We recently found this as we were trying to update every row in a table that had
over eight million entries. We found in the first few tries that the update was
taking over 24 hours to complete which was far too long for an update script.&lt;/p>
&lt;p>Our investigation of this led us to the pgsql_tmp folder and the work_mem
configuration parameter.&lt;/p>
&lt;p>When the query was being executed, we checked the pgsql_tmp folder to see how
space was being utilised in there. We already knew about the pgsql_tmp folder from
past experience: we once had a server rapidly running out of disk space and had
narrowed it down to this folder. By cancelling the query referenced by the tmp
files in here, we were able to free up literally gigabytes of disk space...&lt;/p></description><content:encoded><![CDATA[<p>PostgreSQL is a pretty powerful database server and will work with almost any
settings thrown at it. It is really good at making do with what it has and
performing as it is asked.</p>
<p>We recently found this as we were trying to update every row in a table that had
over eight million entries. We found in the first few tries that the update was
taking over 24 hours to complete which was far too long for an update script.</p>
<p>Our investigation of this led us to the pgsql_tmp folder and the work_mem
configuration parameter.</p>
<p>When the query was being executed, we checked the pgsql_tmp folder to see how
space was being utilised in there. We already knew about the pgsql_tmp folder from
past experience: we once had a server rapidly running out of disk space and had
narrowed it down to this folder. By cancelling the query referenced by the tmp
files in here, we were able to free up literally gigabytes of disk space...</p>
<p>We had found roughly half a gig of temporary files in here. This led us to
investigate the configuration file.</p>
<p>The one parameter that stuck out was work_mem, which was set to its default of
1MB. That might make sense under most circumstances, but not in this one.
According to the PostgreSQL documentation:</p>
<blockquote>
<p><code>work_mem</code> (<code>integer</code>)</p>
<p>Specifies the amount of memory to be used by internal sort operations and hash
tables before switching to temporary disk files. The value defaults to one
megabyte (<code>1MB</code>). Note that for a complex query, several sort or hash
operations might be running in parallel; each one will be allowed to use as
much memory as this value specifies before it starts to put data into
temporary files. Also, several running sessions could be doing such operations
concurrently. So the total memory used could be many times the value
of <code>work_mem</code>; it is necessary to keep this fact in mind when choosing the
value. Sort operations are used for <code>ORDER BY</code>, <code>DISTINCT</code>, and merge joins.
Hash tables are used in hash joins, hash-based aggregation, and hash-based
processing of <code>IN</code> subqueries.</p></blockquote>
<p>This would tell us that the total memory usage with work_mem could be several
times the value set here and setting it to half a gig would probably be a
terrible idea for a heavily utilised production server. However, for the
migration process when we need to update over 8,000,000 rows, it might be a good
temporary fix.</p>
<p>After raising work_mem to 512MB, we found that no more tmp files were
created and the whole thing was done in memory.</p>
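<p>As a rough sketch of an alternative to editing the configuration file, the
setting can also be raised for a single transaction with <code>SET LOCAL</code>, which
leaves the server-wide default untouched. The column <code>somecolumn</code> and the
update itself are placeholders, not our actual migration:</p>
```sql
BEGIN;

-- SET LOCAL only lasts until the end of this transaction,
-- so other sessions keep the server-wide default.
SET LOCAL work_mem = '512MB';

-- Confirm the value in effect for this session.
SHOW work_mem;

-- Placeholder for the real migration statement (hypothetical column).
UPDATE sometable SET somecolumn = upper(somecolumn);

COMMIT;
```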
<p>When updating so many rows, there are a few other things to consider.</p>
<p>Firstly, autovacuum will likely kick in several times to vacuum the table.
You'll probably want to disable this for the duration of the update statement
and run a vacuum afterwards.</p>
```sql
    --disable auto vacuum
    ALTER TABLE sometable SET (
      autovacuum_enabled = false, toast.autovacuum_enabled = false
    );
```
<p>You can switch autovacuum back on after the update statement has completed:</p>
```sql
    --enable auto vacuum
    ALTER TABLE sometable SET (
      autovacuum_enabled = true, toast.autovacuum_enabled = true
    );
```
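<p>The vacuum mentioned earlier could look something like the sketch below,
reusing the <code>sometable</code> placeholder. Note that <code>VACUUM</code> cannot run inside a
transaction block, so it has to be issued on its own:</p>
```sql
-- Reclaim the dead row versions left behind by the update
-- and refresh the planner statistics in the same pass.
VACUUM ANALYZE sometable;

-- Optional check: dead tuples remaining and when the table was last vacuumed.
SELECT relname, n_dead_tup, last_vacuum
FROM pg_stat_user_tables
WHERE relname = 'sometable';
```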
<p>A few other things you want to take a look at are the</p>
<ul>
<li>fsync parameter (I usually have this set to off anyway since the servers are
practically fully redundant)</li>
<li>checkpoint_segments: I changed this to roughly 5 times the original value
(check the log to see if it says that it's checkpointing too often)</li>
<li>checkpoint_completion_target: I changed this to 0.9</li>
</ul>
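<p>To see what these settings are currently set to before changing them, a query
against <code>pg_settings</code> along these lines should do. One caveat:
<code>checkpoint_segments</code> was removed in PostgreSQL 9.5 in favour of
<code>max_wal_size</code>, so on newer servers the query simply returns the rows that
exist:</p>
```sql
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN (
  'fsync',
  'checkpoint_segments',          -- PostgreSQL 9.4 and earlier
  'max_wal_size',                 -- PostgreSQL 9.5 and later
  'checkpoint_completion_target'
)
ORDER BY name;
```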
<p>With all of these updates, we were able to bring the total time of the update
down to a few hours.</p>]]></content:encoded></item><item><title>Tracking progress of an update statement</title><link>https://icle.es/2011/11/02/tracking-progress-of-an-update-statement-1101/</link><pubDate>Wed, 02 Nov 2011 19:59:02 +0000</pubDate><guid>https://icle.es/2011/11/02/tracking-progress-of-an-update-statement-1101/</guid><description>&lt;p>Sometimes there is a need to execute a long running update statement. This
update statement might be modifying millions of rows as was the case when we
went hunting for a way to track the progress of the update. Hunting around took
us to &lt;a href="http://archives.postgresql.org/pgsql-admin/2002-07/msg00286.php">http://archives.postgresql.org/pgsql-admin/2002-07/msg00286.php&lt;/a>. In our
particular case, we are using PostgreSQL, but this should work with any database
server that provides sequences. Our original SQL was of the form:&lt;/p>
```sql
update only table1 t1
set amount = t2.price
from table2 t2
where t1.id = t2.id;
```
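&lt;p>There is of course no way of figuring out how many rows have been updated
already. The first step was to create a sequence. It has to be a regular
sequence rather than a temporary one, because a temporary sequence is only
visible to the session that created it, and we want to read it from a second
connection:&lt;/p>
```sql
CREATE SEQUENCE seq_progress START 1;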
```</description><content:encoded><![CDATA[<p>Sometimes there is a need to execute a long running update statement. This
update statement might be modifying millions of rows as was the case when we
went hunting for a way to track the progress of the update. Hunting around took
us to <a href="http://archives.postgresql.org/pgsql-admin/2002-07/msg00286.php">http://archives.postgresql.org/pgsql-admin/2002-07/msg00286.php</a>. In our
particular case, we are using PostgreSQL, but this should work with any database
server that provides sequences. Our original SQL was of the form:</p>
```sql
update only table1 t1
set amount = t2.price
from table2 t2
where t1.id = t2.id;
```
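<p>There is of course no way of figuring out how many rows have been updated
already. The first step was to create a sequence. It has to be a regular
sequence rather than a temporary one, because a temporary sequence is only
visible to the session that created it, and we want to read it from a second
connection:</p>
```sql
CREATE SEQUENCE seq_progress START 1;
```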
<p>We can then use this sequence in the update statement to ensure that each row
updated also increments the sequence:</p>
```sql
update only table1 t1
set amount = t2.price
from table2 t2
where nextval('seq_progress') != 0
and t1.id = t2.id;
```
<p>Once the query is running, you can open another connection to the database. To
get an indication of how far it has got, you can just run the following:</p>
```sql
select nextval('seq_progress');
```
<p>Bear in mind that this will also increment it by 1, but if you have millions of
rows, which is really the only case in which this would be useful, a few
additional increments are hardly going to make a difference.</p>
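<p>If the extra increments bother you, a sketch of an alternative is to read the
sequence's <code>last_value</code> instead, which does not advance it. This assumes the
sequence is visible to the session running the query. The percentage relies on
the planner's row estimate for <code>table1</code> in <code>pg_class</code>, which is only
approximate, and depending on the query plan <code>nextval</code> may be called for rows
that never get updated, so treat the figure as a rough guide:</p>
```sql
-- Read the counter without incrementing it.
SELECT last_value FROM seq_progress;

-- Rough percentage, based on the estimated row count of table1.
SELECT s.last_value AS rows_seen,
       round(100.0 * s.last_value / NULLIF(c.reltuples::bigint, 0), 1) AS percent_done
FROM seq_progress AS s,
     pg_class AS c
WHERE c.relname = 'table1';

-- Once the update has finished, remove the sequence.
DROP SEQUENCE seq_progress;
```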
<p>Good luck and have fun!</p>]]></content:encoded></item><item><title>Java Object Size In Memory</title><link>https://icle.es/2011/04/25/java-object-size-in-memory/</link><pubDate>Mon, 25 Apr 2011 15:58:00 +0000</pubDate><guid>https://icle.es/2011/04/25/java-object-size-in-memory/</guid><description>&lt;p>Anyone who has worked with java in a high end application will be well aware of
the double edged sword that is java garbage collection. When it works - it is
awesome but when it doesn&amp;rsquo;t - it is an absolute nightmare. We work on a
ticketing system where it is imperative that the system is as near real-time as
possible. The biggest issue that we have found is running out of memory in the
JVM, which causes a stop-the-world garbage collection. This then results in
cluster failures since an individual node is inaccessible for long enough that
it is kicked out of the cluster.&lt;/p>
&lt;p>There are various ways to combat this issue and the first instinct would be
to suggest that there is a memory leak. After eliminating this as a possibility,
the next challenge was to identify where the memory was being taken up. This
took some time and effort and the hibernate second level cache was identified.
We were storing far too much in the second level cache.&lt;/p>
&lt;p>This is another double edged sword. The hibernate second level cache is
absolutely imperative to a high performance system. It does, however, come with a
price. The cache needs to be managed carefully to ensure the right balance between
performance and memory requirements.&lt;/p></description><content:encoded><![CDATA[<p>Anyone who has worked with java in a high end application will be well aware of
the double edged sword that is java garbage collection. When it works - it is
awesome but when it doesn&rsquo;t - it is an absolute nightmare. We work on a
ticketing system where it is imperative that the system is as near real-time as
possible. The biggest issue that we have found is running out of memory in the
JVM, which causes a stop-the-world garbage collection. This then results in
cluster failures since an individual node is inaccessible for long enough that
it is kicked out of the cluster.</p>
<p>There are various ways to combat this issue and the first instinct would be
to suggest that there is a memory leak. After eliminating this as a possibility,
the next challenge was to identify where the memory was being taken up. This
took some time and effort and the hibernate second level cache was identified.
We were storing far too much in the second level cache.</p>
<p>This is another double edged sword. The hibernate second level cache is
absolutely imperative to a high performance system. It does, however, come with a
price. The cache needs to be managed carefully to ensure the right balance between
performance and memory requirements.</p>
<p>To this end, it was important to be able to identify what was taking up all the
memory in the cache. Each object might only take a couple of hundred bytes, but
with our second level cache set to store hundreds of thousands of items, this
quickly takes up hundreds of megabytes. With the metadata of the cache, this
could easily hike it up near a gigabyte of memory usage. This gets substantially
worse with cache evictions and the adding of new items into the cache.</p>
<p>The correct way to resolve this is to identify specific object types that
&ldquo;overload&rdquo; the cache, i.e. items that have a large number of instances stored
in the cache. Identifying classes that store a large number of items is easy
enough - we just traverse the cache and count up the number of items. However,
there might be a class that stores a smaller number of items but takes a sizeable
amount of memory. For this reason, it is important to understand the object
sizes in memory as well.</p>
<p>If you have ever tried to find a way to identify object sizes, you will know
that this is no easy task. You can calculate to some degree of accuracy the size
of an object based on the data it stores but this is a manual process.</p>
<p>The only real way to get this information is to use a Java agent and use that to
calculate a more accurate memory usage. For this purpose, we used the
<a href="http://www.javamex.com/classmexer/" title="ClassMexer Java Profiling Agent">classmexer agent</a>,
which requires a simple installation step of adding the following parameter to
the java command line: <code>-javaagent:classmexer.jar</code>. You can then figure out the
memory utilisation of an object by calling:</p>
```java
MemoryUtil.deepMemoryUsageOf(objectInstance)
```
<p>You can also pass in a collection of objects:</p>
```java
MemoryUtil.deepMemoryUsageOfAll(objectInstanceCollection)
```
<p>This was the simple part.</p>
<p>Traversing the node structure of JBoss Cache and collating statistics on the
number of each type of object and its memory utilisation was a little more
interesting.</p>
<p>I will cover this separately</p>]]></content:encoded></item></channel></rss>