AI for the rest of us

Welcome to the AI community for everyone.

Hello friends

Charles’ youngest is a fanatical gamer who has been developing his own mods and, more recently, his own games in Unity. So on Sunday the two of them went to GamersLounge at the Boiler Room in Guildford. The club was set up to preserve the way retro games used to be played, on the original consoles they were developed for. Using modern flashcarts, they have archived lots of games so that everyone who comes to the club, young or old, can understand and enjoy gaming as it was 20 to 30 years ago. It was enormous fun, and it made Charles think he should get a small machine and set up the Commodore 64 games of his own childhood for his son.

Charles also wants to tell any readers who, like him, are fans of Baroque music (surely there must be someone) that this recording of the Four Seasons with La Serenissima and Adrian Chandler (violin/dir), one of squillions that came out last year to celebrate the piece’s 300th birthday, is completely astonishing.

Hannah arrived home from Japan on Tuesday, repacked her case and set off on another adventure: she’s writing this newsletter from Rome! In between, there was just enough time to recover from jet lag and get down to London for the 16th AI for the rest of us meetup on Thursday evening, and it was another packed event. Thank you to Robert and Sal for their inspiring talks! Hannah has covered more on both topics below.

Agent Craft is just around the corner and we’re thrilled to welcome Always Further as a sponsor for the event. Always Further are the team behind nono.sh, a secure, isolated way to run AI agents. If you’re interested in building agents, I’d love to see you in London on June 12th! With only 100 tickets available, don’t miss out!

If you're interested in sponsoring Agent Craft or this newsletter, please reach out to hannah@aifortherestofus.live

Have a wonderful week,

Hannah & Charles

Try Leapter's Visual Language

You don't need to read code to start building with Leapter. Try Leapter today and let them know what you think at showcase.leapter.com

What’s Hannah reading this week?

During my talk at QCon earlier this year I proposed three anchors that help me navigate AI transformation in software teams. One of them is “People Matter”; hopefully you agree. Whatever the role of a developer becomes, it has to remain a profession worth pursuing as a human, something that enough people want to spend 40 hours a week doing. I also proposed that reviewing thousands of lines of AI-generated code is miserable work and a practice that is likely to die out. We are problem solvers, and this is just another problem we need to solve.

I was interested to see that Emma Burrows, founder of Rezonant and formerly CTO of Stripe, agrees with me.

“But we’ve found that trying to keep up with code reviews was slowing us down and with agents writing 99% of our code, we were ending up reviewing 2 rounds of human review (and a round of agent review) which was duplicating work. I expected fewer delays from waiting on approvals, but what I didn’t fully appreciate was how much time gets eaten up by context switching, both for the person writing the code and the person reviewing it. It’s cool to see things speed up significantly more than we thought they would.”

In her recent post on LinkedIn, Emma shares that her team have moved peer review earlier in the process, doing a thorough spec review instead of a code review. This makes sense to me, and I’ve seen a similar shift in some open source projects, where maintainers are looking for well-crafted requirements rather than code contributions.

At this week’s AI for the rest of us meetup, Robert Werner, CTO of Leapter, shared a similar perspective. In his talk he showed us how visualisation tools can help us reason about and validate code faster, and how combining this with literate programming techniques, where a coding agent documents intent alongside the executable code, can make code review less miserable and more human. You can check it out at showcase.leapter.com
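
For readers who want to see what “documenting intent alongside executable code” can look like, here is a minimal sketch using Python's built-in doctest module. It is not Leapter's approach (which is visual), and the function apply_discount and its discount rule are made up for illustration. The docstring states the intent in plain English and the examples underneath it are executed as tests, so a reviewer can check behaviour against the stated intent instead of re-deriving it from the arithmetic.

```python
import doctest


def apply_discount(total: float, loyalty_years: int) -> float:
    """Return the order total after a loyalty discount.

    Intent: customers get 5% off for each full year of loyalty,
    capped at 20%, and the result is rounded to pennies.

    >>> apply_discount(100.0, 0)
    100.0
    >>> apply_discount(100.0, 2)
    90.0
    >>> apply_discount(100.0, 10)  # the cap applies
    80.0
    """
    # The cap described in the docstring, expressed as code.
    rate = min(0.05 * loyalty_years, 0.20)
    return round(total * (1 - rate), 2)


if __name__ == "__main__":
    # Execute every example embedded in this module's docstrings.
    print(doctest.testmod())
```

In this style, an agent that changes the code would be expected to update the intent and the examples in the same change, so the reviewer reads one coherent story rather than a bare diff.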

The topic of code review in the open source space keeps rumbling around. Last month the Kubernetes project, one of the largest and most active open source projects in the world, updated their Pull Request Process with more direct guidance on the acceptable use of AI. Kat Cosgrove, who sits on the Kubernetes Steering Committee, explains:

“Do not leave the first review of AI generated changes to the reviewers. Verify the changes (code review, testing, etc.) before submitting your PR. Reviewers may ask questions about your AI-assisted code, and if you cannot explain why a change was made, the PR will be closed.”

Sadly, I can’t make it to Edinburgh for the next edition of State of Open Con, but the lineup looks incredible and I highly recommend it if you’re free on 5th June. The theme of the event is “Opening the AI Stack”, bringing together digital and AI leaders to dive deep into the key topics in AI and openness.

The benefits of multi-agent debate are something I’ve covered in this newsletter before, based on my work at Futuria last year building multi-agent teams for their clients. The foundation models have come a long way in the 12 months since then, and one of the constraints I was working around last summer was tool calling. To reliably call a tool from my multi-agent team, I had to create a dedicated agent that was an “expert” in that tool and acted as the single interface between the team and the tool. Today the most advanced models can handle 10x more instructions than they could 12 months ago, which makes a huge difference to agentic development. Laurie Voss shared his findings:
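
To make the “dedicated tool expert” pattern a bit more concrete, here is a minimal sketch with no model calls at all. Every name in it (convert_currency, ToolExpertAgent, AnalystAgent) is hypothetical, and a real agent would use an LLM to decide what to request. The point is only the shape: team members never call the tool directly, and the single expert checks each request against the tool's required arguments before calling it, so malformed calls are caught in one place.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


def convert_currency(amount: float, rate: float) -> float:
    """Hypothetical tool the team needs: convert an amount at a given rate."""
    return round(amount * rate, 2)


@dataclass
class ToolExpertAgent:
    """The only team member allowed to call the tool.

    It checks each request against the tool's required arguments before
    calling it, so malformed calls are caught in one place.
    """
    tool: Callable[..., Any]
    required: tuple
    log: list = field(default_factory=list)

    def handle(self, request: dict) -> dict:
        missing = [name for name in self.required if name not in request]
        if missing:
            self.log.append(f"rejected request, missing {missing}")
            return {"ok": False, "error": f"missing arguments: {missing}"}
        # Pass through only the arguments the tool expects.
        args = {name: request[name] for name in self.required}
        result = self.tool(**args)
        self.log.append(f"called {self.tool.__name__} with {args}")
        return {"ok": True, "result": result}


@dataclass
class AnalystAgent:
    """A team member that needs the tool but never calls it directly."""
    name: str
    expert: ToolExpertAgent

    def ask(self, request: dict) -> dict:
        return self.expert.handle(request)


expert = ToolExpertAgent(tool=convert_currency, required=("amount", "rate"))
analyst = AnalystAgent(name="analyst", expert=expert)
print(analyst.ask({"amount": 10.0, "rate": 1.5}))  # {'ok': True, 'result': 15.0}
print(analyst.ask({"amount": 10.0}))               # rejected: rate is missing
```

With models that can now follow far more instructions, this extra layer may no longer be necessary, which is exactly the trade-off the paragraph above describes.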

“So that's our headline finding: a year ago, frontier models started losing track of instructions at somewhere around 200-300 simultaneous constraints. Depending on what model you pick, that boundary is now closer to 2,000 instructions.”

I was absolutely thrilled to see Sal Kimmich demo a live multi-agent debate on Thursday night at the AI for the rest of us meetup. Sal drew on the fundamentals of cognitive science (how we humans make good, correct decisions with incomplete data all day, every day) and built that structure into their agent team. They also created a persona for each agent and equipped each persona with the best academic knowledge in its domain. The result was a highly effective decision-making team, with evidence that the multi-agent debate consistently produced higher-quality answers than the model alone! You can check out the cyberneutics project on GitHub.
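
The control flow of a multi-agent debate can be sketched in a few lines. This is not Sal's implementation: each persona here is a plain Python function standing in for a model call with a persona-specific prompt, and the names stubborn, open_minded and debate are hypothetical. Each agent proposes an answer, then for a fixed number of rounds sees its peers' latest answers and may revise, and the final answer is a majority vote.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# A persona sees the question, its own previous answer (None in the first
# round) and its peers' latest answers, and returns its answer. In a real
# system this would be a model call with a persona-specific prompt.
Persona = Callable[[str, Optional[str], List[str]], str]


def stubborn(initial: str) -> Persona:
    """Hypothetical persona that never changes its mind."""
    def respond(question: str, previous: Optional[str], peers: List[str]) -> str:
        return initial
    return respond


def open_minded(initial: str) -> Persona:
    """Hypothetical persona that switches only if a strict majority of peers agree."""
    def respond(question: str, previous: Optional[str], peers: List[str]) -> str:
        current = initial if previous is None else previous
        if not peers:
            return current
        top, count = Counter(peers).most_common(1)[0]
        return top if count > len(peers) / 2 else current
    return respond


def debate(question: str, personas: List[Persona], rounds: int = 2) -> Tuple[str, List[str]]:
    """Run a fixed number of debate rounds, then take a majority vote."""
    # Round 0: independent proposals, no peer input.
    answers = [p(question, None, []) for p in personas]
    for _ in range(rounds):
        # Every agent revises at once, based on the previous round's answers.
        answers = [
            p(question, answers[i], answers[:i] + answers[i + 1:])
            for i, p in enumerate(personas)
        ]
    winner = Counter(answers).most_common(1)[0][0]
    return winner, answers


team = [stubborn("B"), stubborn("B"), open_minded("A"), open_minded("C")]
print(debate("Which option should we pick?", team))
```

Ties in the final vote go to the answer that appears first. The cognitive-science structure and domain knowledge Sal described would live inside each persona's prompt, which this sketch does not attempt to model.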

Phil Winder has also shared his initial learnings from recent experiments at HelixML, where they are building out an agentic workforce. His post “Four Lessons from Building an Agentic Workforce” covers challenges like overly chatty agents and agents bringing human-like politics into their agent-to-agent collaboration. Apparently, telling an AI agent that it’s not human is an important step not to be missed.

Phil closes the post by saying “I've been thinking more in terms of jobs to be done, which are easier to specify and more concrete. This will be my next focus.” I agree! This is the way! I’ve had a lot more success decomposing roles (the human org) into jobs and tasks (the agent org), but I do wonder how much decomposition is still needed, given the stats above suggesting agents can handle 10x more instructions. I really want to start playing with this again, but there are just not enough hours in the day… especially when I keep booking more travel!

Phil will be sharing his Agentic Org at Agent Craft on June 12th alongside the trailblazers from Futuria. We are going to have some incredible people in the room and you can be there too!

What's Charles reading this week?

I've been particularly busy doing some research work for a client and haven't had much reading time as such, but I did manage to listen to this episode of the Asynchronous and Unreliable podcast on my way back from the Twofish studio on Saturday morning. Host Anne Currie, a technologist and sci-fi author, talks to veteran technologist Martin Davidson of Tollens.AI about what it actually looks like to use LLMs to write high-quality, production-ready code. Martin (along with Sam Aaron of Sonic Pi fame) is one of the people really pushing what LLMs can do in this space, and the conversation is fascinating. He covers everything from "oracle-driven development" (defining what good looks like before you write a line) to using AI to build things like SIP stacks and chip emulators in Rust, which would previously have taken teams of engineers years. The closing analogy involving The Terminator is one I'll be thinking about for a while. Well worth an hour of your time.
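
One way to read "oracle-driven development" (my reading, not Martin's definition) is: write the check for what good looks like first, then let a person or an LLM produce implementations that must pass it. Here is a minimal sketch, assuming a made-up task (removing duplicates from a list while keeping the order of first appearance) and using a slow but obviously correct reference as the oracle, checked against many random inputs. The names oracle, candidate_dedupe and check are hypothetical.

```python
import random
from typing import Callable, List


def oracle(xs: List[int], out: List[int]) -> bool:
    """What 'good' looks like, written before any implementation.

    The output must contain each distinct input value exactly once,
    in order of first appearance. Slow, but obviously correct.
    """
    expected: List[int] = []
    for x in xs:
        if x not in expected:
            expected.append(x)
    return out == expected


def candidate_dedupe(xs: List[int]) -> List[int]:
    """An implementation under test (could come from a person or an LLM)."""
    # dicts preserve insertion order, so this keeps first appearances.
    return list(dict.fromkeys(xs))


def check(impl: Callable[[List[int]], List[int]], trials: int = 500, seed: int = 0) -> bool:
    """Compare an implementation with the oracle on many random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(0, 9) for _ in range(rng.randint(0, 20))]
        if not oracle(xs, impl(xs)):
            return False
    return True


print(check(candidate_dedupe))            # True
print(check(lambda xs: sorted(set(xs))))  # should fail: sorting loses first-appearance order
```

The design choice is that the oracle, not the implementation, is the artefact a human scrutinises; once it is trusted, any number of generated candidates can be accepted or rejected mechanically.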

I was also very struck by this report in The New York Times that the Trump administration has swung back towards AI safety. The review process “could be similar to one being developed in Britain, which has assigned several government bodies to ensure that A.I. models meet certain safety standards, people in the tech industry and the administration said,” according to the report.

The trigger for this apparent change of heart appears to be Anthropic's new "Mythos" model, which we’ve covered extensively in the newsletter and which alarmed cybersecurity researchers with its ability to find and exploit software vulnerabilities. The White House has been scrambling to understand the hacking capabilities Mythos possesses, and that national security anxiety seems to have overridden the administration's previous instinct to let the industry self-regulate.

If you are thinking to yourself, “Isn't this basically what Biden wanted to do?”, the answer is largely “Yes”. As so often with Trump, he ripped something up, in this case revoking the Biden administration's AI executive order on Day 1, and is now slowly but surely putting much of it back as it dawns on him (or perhaps the people around him) that it was a good idea. See also the Obama nuclear deal with Iran and the current war.

In fairness, I think at the heart of this sits a common issue for Western governments, all of which are desperate for economic growth and see AI as a way to achieve it, while also being pulled in different directions by very real concerns over risks and, for many people not called Trump or Musk, ethics and the environment too. Economic and tech policy voices are worried about any policy changes that could complicate deployments, while the national security community is worried about the possibility of a major AI-enabled cyberattack. Meanwhile, independent analysts point to a deeper problem: model review bodies without teeth do not amount to meaningful oversight, and a working group co-designed with the companies being reviewed risks being too cosy to be genuinely independent, much as we don't let pharmaceutical companies run their own clinical trials without external oversight.

While this was going on, OpenAI announced Daybreak, their direct answer to Anthropic's Project Glasswing and Mythos. Both rest on the same premise: that frontier AI models can meaningfully shift the advantage toward defenders in cybersecurity. The key difference is philosophy. Anthropic chose to restrict Mythos entirely, citing the model's capabilities as a reason not to release it broadly, while OpenAI took the opposite approach: Daybreak is accessible to any organisation that submits a contact form, with higher-capability tiers gated by verification rather than invitation. Those tiers include a permissive model explicitly designed for penetration testing and red teaming (a proactive, adversarial simulation in which security experts act as authorised attackers to test an organisation’s people, processes, and technology). Major names like Cloudflare, Cisco, and CrowdStrike are already integrating the capabilities.

Whether OpenAI's bet on broad access with controls proves wiser than Anthropic's caution is essentially a live experiment, and the US Government's interest in pre-deployment review, discussed above, is partly a response to exactly this question.

I was speaking about AI sustainability at MLCon this week. The event had about five different conference brands running concurrently, including API Conference, DevOpsCon and MLCon. With so many areas being affected by AI, I thought the format of several events under one roof worked rather well.

I had a lively conversation with Neil Douek, who I sat next to at the speaker dinner, and was pleased to catch his talk the next day. He gave an ambitious sweep through computing history, framing it as three overlapping eras: the trailblazers of the mainframe and space age; the rock stars of the internet and cloud; and now the AI-native engineers of today. His central point was that software development is approaching what he calls an "architecture of uncertainty", moving from deterministic systems to probabilistic ones and from writing code to expressing intent. His prediction is that within 25 years, humans will be stewards of systems rather than builders of them. I’m looking forward to reading his book.

I also thoroughly enjoyed a panel/podcast session featuring Dr. Pieter Buteneers (from a start-up called Emma Legal), Michael Dowden, and Martin Stypinski. Their consensus was that prompt engineering can now get you surprisingly far, but for specialised domains such as legal documents, computer vision, and low-latency edge cases, custom models still matter. One of the panellists was candid that they're currently selling below cost, betting that models get cheaper before their runway runs out. My view on this is that nobody knows; the AI model providers are also selling at a considerable loss, and it isn’t clear if they’ll be able to reduce their operating costs significantly or whether their prices will go up substantially.

Elsewhere, Steve Pereira of Visible Flow argued that as organisations rush to adopt AI tools, they risk falling into old traps: measuring what's easy rather than what matters, tracking outputs instead of outcomes, and collecting data nobody acts on. His advice was simple but hard to follow in practice: work backwards from what you're actually trying to achieve; pick a handful of genuinely useful metrics (lead time, throughput, work item age); and make sure someone owns each one. The watermelon problem of metrics (green on the outside, red within) is as common as ever.
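
The three metrics Steve mentioned are simple to compute once you record when each piece of work was created, started and finished. Here is a minimal sketch with made-up data. Definitions vary between teams; this sketch assumes lead time runs from creation to finish, throughput counts items finished in a period, and work item age is how long a started but unfinished item has been in progress. The WorkItem type and helper names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Dict, List, Optional


@dataclass
class WorkItem:
    name: str
    created: date
    started: Optional[date] = None   # None: not started yet
    finished: Optional[date] = None  # None: still open


def lead_times(items: List[WorkItem]) -> List[int]:
    """Days from creation to finish, for completed items."""
    return [(i.finished - i.created).days for i in items if i.finished]


def throughput(items: List[WorkItem], start: date, end: date) -> int:
    """Number of items finished between start and end, inclusive."""
    return sum(1 for i in items if i.finished and start <= i.finished <= end)


def work_item_ages(items: List[WorkItem], today: date) -> Dict[str, int]:
    """Days each started-but-unfinished item has been in progress."""
    return {
        i.name: (today - i.started).days
        for i in items
        if i.started and i.finished is None
    }


# Made-up data for illustration only.
items = [
    WorkItem("login bug", date(2025, 5, 1), date(2025, 5, 2), date(2025, 5, 6)),
    WorkItem("CSV export", date(2025, 5, 1), date(2025, 5, 5), date(2025, 5, 12)),
    WorkItem("audit log", date(2025, 5, 3), date(2025, 5, 8)),
    WorkItem("dark mode", date(2025, 5, 9)),
]

print(median(lead_times(items)))                               # 8.0
print(throughput(items, date(2025, 5, 5), date(2025, 5, 11)))  # 1
print(work_item_ages(items, date(2025, 5, 14)))                # {'audit log': 6}
```

Work item age is the one that catches the watermelon: an item can look fine on a status board while quietly getting older every day.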

Mehreen Tahir from monitoring/observability vendor New Relic offered a practical warning for anyone running infrastructure: your platforms were designed for human developers, and agents use them very differently. Agents make mistakes at machine speed, don't read error messages the way people do, and create audit and observability challenges nobody has fully solved yet. Her advice was to think in three horizons: instrument everything now, build evaluation into your workflows, and eventually let agents help manage the platform itself. The organisations that treat this as purely a tooling problem, she suggested, will struggle.
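
As a small illustration of "instrument everything now", here is a minimal sketch of an audit decorator that records every call an agent makes to a platform action: who called it, with what arguments, whether it succeeded and how long it took. The names audited, AUDIT_LOG, scale_service and "deploy-bot" are hypothetical, and a real platform would send these records to its observability stack rather than keep them in a list in memory.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# In-memory audit trail; a real platform would ship these records to its
# logging or observability backend instead.
AUDIT_LOG: List[Dict[str, Any]] = []


def audited(agent: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Record who called an action, with what, whether it worked and how long it took."""
    def wrap(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def inner(*args: Any, **kwargs: Any) -> Any:
            entry: Dict[str, Any] = {
                "agent": agent, "action": fn.__name__, "args": args, "kwargs": kwargs,
            }
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
                entry["ok"] = True
                return result
            except Exception as exc:
                # Keep the error so a human can see what the agent tried.
                entry["ok"] = False
                entry["error"] = repr(exc)
                raise
            finally:
                entry["seconds"] = time.perf_counter() - start
                AUDIT_LOG.append(entry)
        return inner
    return wrap


@audited(agent="deploy-bot")
def scale_service(name: str, replicas: int) -> str:
    """Hypothetical platform action an agent might call."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return f"{name} scaled to {replicas}"


scale_service("checkout", 3)
try:
    scale_service("checkout", -1)  # a mistake made at machine speed
except ValueError:
    pass

for entry in AUDIT_LOG:
    print(entry["agent"], entry["action"], entry["args"], entry["ok"])
```

Failed calls are recorded as well as successful ones, which is the part that matters most when the caller is an agent that may retry the same mistake many times a second.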

If I had to sum up the overall theme of the sessions I attended, it was that AI isn't removing the hard thinking, but it is making it easier to build the wrong thing faster.