HumanWORKS

The Cost Of Counting – Losing The Apple Watch And Trying To Not Lose My Mind In AI Cost Calculations

Ten days ago I did something that, for a person who describes himself as a metrics-driven autistic with a chronic pain condition and a relationship with sleep that could generously be described as “complicated,” might have seemed genuinely unfathomable.

I took my Apple Watch off.

Not just to jump in the shower. Not because the battery died. Not because the strap was chafing, although if you’ve ever spent three days staring at a sleep score whilst simultaneously developing a mild repetitive strain injury from anxiously rotating your wrist to check your sleep score, you’ll understand that chafing is the least of your problems.

I took it off and put it on my bedside table. Then I thought about whether to put it back on. Then I didn’t.

I haven’t worn it since.

The reason was a conversation with a good friend – one of the people who knows me well enough to know about the fibromyalgia, about the sleep apnoea diagnosis that arrived in the same window as a global pandemic as though the universe decided that if it was going to do bad timing it might as well commit to the bit – and we were talking about my sleep score and how I could improve it.

My sleep score was a conversation. I had plumbed new depths.

If you’ve used any consumer health wearable in the last five years, you’ll know exactly the flavour of psychological torture I’m describing.

Every morning, before you’ve had coffee, before your nervous system has fully negotiated the terms of another day, you’re presented with a number that tells you how well your body performed its most fundamental biological process whilst you were unconscious and therefore unable to do anything about it. This number is presented as the truth in the same way as a judge might deliver a sentence – and the sentence was broadly “do better, Matthew”.

For me specifically – and I’ll acknowledge upfront that my relationship with numbers is somewhat more intimate than most people’s, which is both a superpower and a source of entirely self-generated misery that I’ve had to spend considerable therapeutic energy untangling – the sleep score had become its own recursive problem.

I was tracking my sleep quality in order to manage my sleep quality, which was being negatively affected by the anxiety generated by tracking my sleep quality. Getting told you’ve had a bad night sets you up to believe you have had a bad night, even if you didn’t feel it to start with. The problem there is that I’m relying on a machine more than my own thought process – something I’ll come to later in this article.

The insight that finally shifted things was less about some massive realisation. Instead, it was embarrassingly simple. I was explaining to my friend why I was miserable about my sleep score, and I heard myself saying, out loud, that the score was making me more anxious about sleep, which was making the sleep worse, which was making the score lower, which was making me more anxious. I’d built myself a perfect closed loop of quantified suffering. A feedback spiral with excellent data integrity.

She was giving me advice on how to improve my score, whereas I felt like I wanted to just… well, see what happens without a number to chide me further.

I took the watch off straight after that chat.


the inverted machine, or: whilst i was trying to stop being a machine, a machine was trying to become me

Now for something related but inverted: the difference between a machine tracking me and me tracking a machine, a question that matters now that the business has invested heavily in Codex and the broader OpenAI partnership.

In the same fortnight I divested from my personal biometric surveillance apparatus, I found myself spending a rather significant amount of time – intellectually, financially, and in terms of the conversations I was having with my colleagues – on the question of machines. Specifically, on what it costs to operate them. Even more specifically, on why that cost is considerably harder to calculate than the people selling you the machines would like you to know.

The contrast, when I finally noticed it after our weekly retro at work on Friday, was almost too neat for a person who enjoys a good structural irony.

I had been the human attempting to quantify myself like a machine. Importing the logic of the dashboard, the metric, the score, onto biological systems that were not designed to be measured that way and were communicating their objection through the medium of increasingly poor sleep scores and the particular brand of 3am anxiety that feels like your nervous system has hired a crisis communications team.

Simultaneously, I had been watching – and paying for – a machine that is very keen to have you believe it’s something closer to human than it actually is.

Not because it’s malicious, and not because the people building it are deliberately deceptive (though we can revisit that particular conversation later), but because the cultural framing around it has collectively decided to conflate output with intent in a way that we really should know better than by now.

We have all, at some point, been the child saying sorry for the thing we weren’t actually sorry for. We understood instinctively that producing the appropriate social output does not necessarily constitute the internal state the output implies.

The AI is not doing that consciously. Which is, in some ways, worse.

We are anthropomorphising a stochastic process because the outputs are fluent and occasionally uncanny, which is exactly as robust as deciding your washing machine has a personality because it sometimes makes an unusual sound during the spin cycle. The groan isn’t the washing machine expressing ennui – it’s a noise.

Anyway. Machines. Costs. On to the Friday conversation, and why you can’t calculate those costs as simply as you’d like.


tokens: not one thing, and definitely not as cheap as advertised

Last Friday I was sitting in a catch-up with Steve, who runs the business I now co-lead on the technical advisory side, and he asked a question that is increasingly being asked by senior leaders across every organisation currently in the process of discovering that deploying AI at scale costs rather more than the pilot suggested.

“How do you pre-calculate token usage to avoid spending excessively?”

OK, technically it was “does anyone know how to present the relationship between price, OpenAI credits, and tokens” but the underlying question was the same.

So how do you know what you’re going to spend? Is it possible to assess ahead of time?

It’s a fair question. It’s also a question that, once you start answering it properly, reveals a catch: you might be able to line up the logic of “one credit equals 150 tokens”, but the tokens themselves can’t be pre-calculated when you use a thinking model.

One of my colleagues, who was also on the call, made a reasonable attempt at an explanation. It wasn’t quite right. This is not a criticism of them – it’s genuinely not an obvious thing, and the way the industry presents it doesn’t help, because the industry’s commercial interests are served by keeping certain aspects of the cost structure somewhat opaque. We’ll come back to that. On that front, my colleague was bang on about one point: it’s less about value assessment, and more about “just one more token”, which is the architecture of illegal product distribution mechanisms.

Or, to put it bluntly, it has the same psychological profile as that of a drug dealer – don’t worry about the cost, just enjoy it.

Anyway, getting back to the mundanity, let me explain what tokens actually are, because the word is being used in about four different ways simultaneously and this is, I would argue, not entirely a coincidence.

A token is, in the simplest possible terms, a unit of text. It’s roughly – and I want to stress roughly, because precision here is itself part of the problem – about three-quarters of a word in English, though this varies considerably depending on what you’re asking the model to process. Code is tokenised differently to prose. Non-Latin scripts tokenise differently again. If you’ve ever wondered why querying AI in certain languages costs disproportionately more, this is part of the answer.

Now. Input tokens and output tokens. These are the ones most people have a vague handle on. You send the model some text (input tokens, charged at one rate). The model sends back some text (output tokens, charged at a different – typically higher – rate). Simple enough. Calculable in advance if you know your prompt length and can estimate your expected output. Not trivial to predict with precision, but manageable.
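For the simple case, the arithmetic fits in a few lines. What follows is a minimal sketch rather than anyone's official calculator: the per-million-token rates are made-up numbers for illustration, the function names are my own, and the three-quarters-of-a-word conversion is the same rough rule of thumb as above.

```python
import math

# Rough English average from the rule of thumb above; code and
# non-Latin scripts will tokenise differently, so treat as a guess.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count: int) -> int:
    """Approximate a token count from a word count (heuristic only)."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate_per_million: float,
                  output_rate_per_million: float) -> float:
    """Cost of one request, in whatever currency the rates are quoted in."""
    return (input_tokens * input_rate_per_million
            + output_tokens * output_rate_per_million) / 1_000_000


# Hypothetical rates, chosen only to make the sums easy to follow.
prompt_tokens = estimate_tokens(600)   # a 600-word prompt
reply_tokens = estimate_tokens(300)    # an expected 300-word reply
cost = estimate_cost(prompt_tokens, reply_tokens, 2.00, 8.00)
```

The only real unknown here is the length of the reply, and that is usually something you can bound sensibly.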

So you can work it all out in advance? Hold your horses a little.

This is where I tell you about thinking tokens.

Ultimately, this is where the clean ledger gets complicated, and where the consultancy parallel of my own career is relevant.


internal monologue is expensive: the thinking model problem

When you ask a thinking model to work through a complex problem, it doesn’t simply process your input and generate an output. It does something considerably more interesting, and considerably more expensive.

It talks to itself first.

The technical description is “chain-of-thought reasoning” or “internal scratchpad” depending on which company’s documentation you’re reading, but the phenomenological description is closer to what I’d call a person’s internal monologue. The model generates reasoning steps that it doesn’t necessarily surface to you in the final output, works through competing hypotheses, revises its own approach mid-thought, and arrives at an answer that was produced by a process you can’t fully observe and which – critically – consumes tokens you are paying for whether or not they appear in what you receive.

I explain this to people using a consulting analogy, because I work for a consulting firm and it seems appropriate to use shared vocabulary.

When you retain a consultant, you’re paying for their time. That time includes the hours they spend in meetings with you, the deliverables they produce, and the thinking they do that never appears on a slide.

The good ones – the ones who arrive at answers that seem almost intuitive in retrospect – are often the ones doing the most invisible work. They’re running problems in the background whilst apparently doing something else, noticing patterns that don’t fit, interrogating their own assumptions before they become your advice.

That internal processing is billable, even when it’s unconscious. You don’t see it. You see the output. But the expertise that shaped the output was built in the invisible thinking, and you’re paying for that expertise even when it’s not legible.

This is why my day rate is what it is – you’re not paying for a day of my time, you’re paying for the experience I have that means you only have to spend a day doing something rather than months struggling through treacle making the mistakes I have already made years ago.

So when it comes to using a thinking model, you get to talk to something that can talk to itself about experiences – although this time, the experience is their data set as opposed to the lived experiences of consultants. Only time will tell if these two things are suitably delineated to show the value of consulting over time.

Thinking models work the same way as consultants, except you’re not paying for the expertise directly – instead you’re paying for the actual compute consumed by each step of the internal reasoning process. The challenge is the number of steps isn’t fixed, isn’t easily predictable, and isn’t consistently exposed by the platforms you’re deploying through.

When someone speaks to me, they don’t get to see my internal reasoning – which makes me very much like an AI model, except I do at least have the decency to tell you the day rate first.

Most people want to know that “you can ask 100 questions a week” because that fits well on a financial ledger. The reality is it isn’t that clear. The deeper problem is that not all usage is created equal – we’ve seen some people 10x their productivity, and we’ve seen others make images of Cthulhu.

Anyway, getting back to the calculations… even if we broke down the input tokens – which you can estimate with reasonable accuracy – the thinking tokens and output tokens are functions of what the model decides to do with your prompt, which is itself a function of prompt complexity, temperature settings, and model behaviour that most cloud-based deployments don’t give you clean visibility into.
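One way to make that uncertainty visible is to stop quoting a single figure and quote a range instead. The sketch below is illustrative only: it assumes hidden reasoning tokens are billed at the output rate (check your own platform's pricing), and both the rates and the 20,000-token reasoning ceiling are hypothetical numbers, not measurements.

```python
def cost_range(input_tokens: int, visible_output_tokens: int,
               min_reasoning_tokens: int, max_reasoning_tokens: int,
               input_rate_per_million: float,
               output_rate_per_million: float) -> tuple[float, float]:
    """Return (lowest, highest) plausible cost for one request.

    Assumption: reasoning tokens are charged like output tokens,
    even though they never appear in the reply you receive.
    """
    def cost(reasoning_tokens: int) -> float:
        billed_output = visible_output_tokens + reasoning_tokens
        return (input_tokens * input_rate_per_million
                + billed_output * output_rate_per_million) / 1_000_000

    return cost(min_reasoning_tokens), cost(max_reasoning_tokens)


# Same prompt, same visible answer; only the hidden thinking varies.
low, high = cost_range(
    input_tokens=800,
    visible_output_tokens=400,
    min_reasoning_tokens=0,
    max_reasoning_tokens=20_000,   # hypothetical ceiling
    input_rate_per_million=2.00,   # hypothetical rate
    output_rate_per_million=8.00,  # hypothetical rate
)
```

With these invented numbers the same request lands anywhere between roughly 0.005 and 0.165, which is the whole problem in miniature: the part you can estimate is the small part.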

Temperature, for those unfamiliar, is roughly the dial between “deterministic and predictable” and “creative and variable” – and higher temperature means higher variance in output length, which means higher variance in cost.
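For the curious, the dial can be sketched in a few lines. This is the textbook temperature-scaled softmax, not any particular vendor's implementation, and the three scores are invented for illustration. It doesn't model output length directly; it just shows where the extra variability comes from.

```python
import math


def softmax_with_temperature(scores: list[float],
                             temperature: float) -> list[float]:
    """Turn raw scores into sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Three candidate next tokens with made-up scores.
scores = [2.0, 1.0, 0.0]
cool = softmax_with_temperature(scores, 0.5)  # favourite dominates
warm = softmax_with_temperature(scores, 1.5)  # choices spread out
```

At the lower setting the favourite option takes about 87% of the probability; at the higher one it drops to about 56%, so the less likely continuations get picked far more often.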

I don’t generally have time to sit down and talk about the finer points of AI model parameters in chats because most people aren’t actually interested in the mechanics. However, I’m getting into it now, because this is a blog post rather than a two-minute summary and I have considerably more latitude.

There are ways to spend less than the meter implies – such as getting the model to hand you a deterministic script that runs without calling it again, rather than paying for a fresh invocation every single time.
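The break-even on that trade is simple enough to sketch. The figures are hypothetical, the function name is mine, and the sketch assumes that running the finished script costs effectively nothing compared with a model call.

```python
import math


def runs_until_script_is_cheaper(one_off_generation_cost: float,
                                 cost_per_model_call: float) -> int:
    """Smallest number of runs at which calling the model every time
    costs strictly more than generating a script once.

    Assumption: executing the script itself is free by comparison.
    """
    if cost_per_model_call <= 0:
        raise ValueError("cost_per_model_call must be positive")
    return math.floor(one_off_generation_cost / cost_per_model_call) + 1


# Hypothetical: the script costs 1.00 to generate, each call costs 0.25.
payoff_run = runs_until_script_is_cheaper(1.00, 0.25)
```

With those numbers the script has paid for itself by the fifth run, and every run after that is money the meter never sees.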

The commercial reasons the platforms would rather you didn’t dwell on that option are a piece in their own right, and a later one. For now it’s enough to sit with the smaller, stranger fact: the cost structure resists clean calculation, and that resistance is not an accident.

Or, in simple terms, think before you prompt.


the unified problem: when does measurement serve you, and when does it consume you

Here is the thing I’ve been working towards.

More data creates more pressure to measure what we are doing. I’m already seeing it with clients. Everyone wants clear pricing and clear cost control the same way they wanted to do this when the business moved to the cloud and realised 100 people doing the same thing 100 times over was not cheap. However, AWS at least had the decency to tell you what an hour of compute cost.

So this is not a new observation, but it’s worth sitting with in the context of both of the things I’ve been describing, because the vector of the problem is different even as the structure is identical.

With the Apple Watch, I was the thing being measured. The data proliferation was about me – my sleep cycles, my heart rate variability, my blood oxygen levels, the number of times I apparently shifted position between 2am and 4am. All of that data was real. None of it was particularly actionable beyond “sleep better,” which I was already motivated to do and which the data was, empirically, making harder rather than easier. The measurement was consuming the thing it was supposed to serve.

With AI spend, the organisation is the thing being measured – or rather, the organisation would like to do the measuring but can’t, cleanly, because the cost structure actively resists clean measurement.

Steve’s question was the right question. The honest answer is that you can get close with careful prompt engineering, with moving deterministic workloads into scripts, with understanding which use cases benefit from thinking models, and which are overpaying for probabilistic variance they don’t need.

However, you cannot fully pre-calculate AI spend with the transparency a finance function would like, because the architecture doesn’t expose the parameters cleanly, and the commercial interests of the people building the architecture are not aligned with exposing them.

These are different manifestations of the same dysfunction. In one case, the data is too present and too actionable-feeling, creating anxiety loops that degrade the system being measured. In the other, the data is deliberately obscured, creating cost overruns in systems that were sold on efficiency.

There’s a third face of it, and it’s the one that closes the loop back to where I started.

A thinking model doesn’t always know when to stop thinking. Past a certain point, more recursive self-questioning buys you more hedged, more verbose, occasionally more confused output rather than a better answer – and the boundary between “this benefits from extended reasoning” and “this is disappearing up its own chain-of-thought” is not marked on any dashboard, while the meter runs either way. I say this as someone who has been reliably informed by multiple therapists that I have precisely the same problem. The model over-measuring its own reasoning is me over-measuring my own sleep: neither of us knows when enough is enough, and the not-knowing is the expensive part.

(Cynics may well say “this sounds a lot like you when you’re talking philosophy in a tech context” and I’d probably say cynics are right. It’s one of my many idiosyncrasies.)

Anyway, sanity, in every case, is the same discipline: knowing which data is load-bearing, which is noise, and what the cost of measuring versus not measuring actually is.

I know what my sleep quality is like without a number. I know when I’ve slept badly because I wake up in pain and my cognition operates at the level of a drowsy golden retriever. The number added precision without adding insight, and charged me my equanimity at 7am for the privilege.

I know what good AI deployment looks like without a per-token ledger. A thinking model writing compliance documentation is justified spend. A thinking model answering “what’s the capital of France” is a contribution to OpenAI’s quarterly results that adds nothing to mine. In that case, there’s nothing to “think” about – some things just are, whereas other things need thinking, and the inference that goes with it is what these models sell us for a subscription price.

The watch taught me this and cost me nothing except the embarrassment of having taken this long to notice – and the £800 or so for the Apple Watch Ultra.

The AI cost structures, if they teach the same lesson, will cost rather more to the companies that adopt them before the penny drops.

Some things, it turns out, are better measured with experience than with instruments. The trick is knowing which is which before you’ve spent three months of compute budget and a year of 3am anxiety finding out the hard way.

And here, finally, is the symmetry I couldn’t see while I was living inside it.

I was a human trying to run myself like a machine – importing the dashboard, mistaking the score for the sleep.

In the same fortnight, I was paying for a machine running the trick in reverse: a stochastic process very keen to have you believe there’s a someone in there, fluent enough that we conflate the output with an intent it doesn’t possess.

Two lies from opposite ends of the same mirror – mine, that a number could tell me how rested I was; the machine’s, that a confident answer is the same as a considered one, and that its cost, like its interior, is simpler than it looks.

I stopped letting one machine measure me. The other one is always measuring me back – fluent, confident, and believing I won’t notice. When it comes to using AI effectively, noticing is the whole of the work.


Next week: what we’re actually doing to push past the current limits of what thinking models can do, why those limits are more interesting than the hype suggests, and why “just use a bigger model” is the AI equivalent of “just try harder” – technically true, immediately useless, and beloved of people who haven’t looked at the cost structure recently.
