LLMs: Good For and Bad For

What LLMs Do

Large language models repeatedly perform one easy-to-understand operation. Given a sequence of tokens (roughly, words), they predict a next “best enough” token. They use a large collection of “weights” to calculate “best enough”. They accept a random best-enough token to cut down on computation versus picking the very best. This willingness to accept “best enough” is what makes the completion algorithm, as it is called, practical. It’s also what makes it interesting. Even for answers that aren’t very long, you will get a different word-by-word answer every time. You will get different classes of similarly themed answers as well.
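The loop just described can be sketched in a few lines. The vocabulary and scores below are made up for illustration; a real model samples from tens of thousands of tokens with learned probabilities, but the “random among the best enough” idea is the same:

```python
import random

# Toy vocabulary with hypothetical "next token" scores a model
# might assign after a prompt like "The cat sat on the".
scores = {"mat": 0.40, "couch": 0.25, "floor": 0.20, "roof": 0.10, "piano": 0.05}

def next_best_enough_token(scores, k=3):
    """Pick randomly among the top-k candidates instead of always
    taking the single highest scorer: "best enough", not "best"."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    weights = [scores[t] for t in top_k]
    return random.choices(top_k, weights=weights, k=1)[0]

# Run it repeatedly: the chosen token varies from call to call,
# which is why the same prompt yields a different completion every time.
samples = {next_best_enough_token(scores) for _ in range(50)}
print(samples)  # some subset of {"mat", "couch", "floor"}
```

Generating a full answer just repeats this step, appending each chosen token to the context before picking the next one.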

The completion algorithm is a lot like your phone’s autocomplete: it offers three word choices, and you pick one that will work. Google rolled out its version of the feature, search suggestions, in late 2004, and people have made a game of it since day one. When you play that game, you sometimes end up with a plausible sentence, though rarely a sensible one. With LLMs, with their billions of weights (aka “parameters”) and a long context window to evaluate, most sentences and paragraphs, even multi-paragraph answers, seem sensible too. That is the magic of LLMs.

The magic ends there. There is no mechanism in LLMs that guarantees that completions, as the outputs are called, are correct in any factual sense. An obvious question is why AI researchers didn’t design one in. It turns out they don’t know how to model truth in a manner compatible with the accidental efficiency of the completion algorithm.

The most important reason we have settled on this algorithm is that it is, accidentally, efficiently computable by vector hardware packaged as graphics processing units (GPUs). It’s the third computational task GPUs have been really good at, following graphics circa 1990 and cryptographic hashing circa 2010. We used these same GPUs for gaming and Bitcoin mining; they did the underlying computational tasks much faster than CPUs could.

The point here is that when we talk about generative AI for text, we are stuck in and with this model of computation. We are stuck with what it is good at. We are stuck with its limitations. People who pretend otherwise are flat out lying to you or, more charitably, creating convincing surface level illusions that crack under close inspection.

“But ChatGPT says that it’s thinking, so it must be thinking!”

I’ve heard this reasoning from otherwise very intelligent people. No, it is not thinking. What it is actually doing in that step is stuffing the context window with possibilities so that completion down the road might pick one and expand on it. That’s the illusion. It is not at all how smart people think.


Do LLMs Solve the Task at Hand?

With this article, I want to give you a sense of what LLMs are good at and not good at. Now that we understand exactly what they do, for any task we can ask:

“If it’s generating one random best enough token at a time, yielding a linguistically plausible completion, does (or can) that solve the task at hand?”

I am a fan of Elon Musk. I would like him to succeed at everything he has decided is important. He is a savant at picking worthy big goals. That said, his claim of a “truth seeking AI” built on the completion algorithm is total bullshit. As noted above, there is no mechanism in the completion algorithm to ensure truth. There is no checking that another LLM could perform to evaluate truthiness. Perhaps there is a way to measure, likely by hand, the truthiness of a large sample of outputs, then tune training to optimize for that measurement. In practice, such a measurement lands much closer to 50% (heads or tails) than to 100%, and that tuned training is both computationally and humanly very expensive.

To put that into context, we can train simple neural networks to drive a car (Tesla “Full Self Driving”) to an error rate around one human intervention per 100-ish miles, and an accident-and-fatality rate per mile driven about one-tenth that of the human fleet. But truthiness of language models peaks around (call it) 60% and is easily steered or derailed by strategic human token injection mid-completion. While LLMs are based on neural networks, at operational scale, driving and writing words are very different tasks with very different levels of achievable mastery.

Let’s call Tesla FSD 96% solved, and recognize that chipping away at the remaining 4% will get more and more expensive, perhaps exponentially so (in the true mathematical sense). We have to this point achieved great utility from that 96%. Such is not the case with truthiness in LLMs, where we are already at a point of rapidly diminishing returns at 60%. The problems, and the algorithms at our disposal, are just different.


Good For and Bad For List

We now have a comparative sense of LLMs’ limitations on their problem domain versus a different, quite successful “AI” application. From that, we can start to characterize applications that might work well with LLMs and applications that most definitely will not.

  • Story writing works well. It works best with a few open-ended suggestions rather than numerous, detailed restrictions. The restrictions become the “truth” for the story. While the context window is a powerful force in shaping token production, it can also be contradictory, or hold too many instructions for any single instruction to consistently have real force. “Make up a story about my dog Mona, Eeyore from fiction, and Paul Bunyan from fiction saving the forest using the 7zip application” results in delightful, if absurd, stories! These generated stories obviously aren’t true. To work as outputs, they just need to integrate the three characters and the tool. Everything else is gravy. Unexpected twists are welcome in great stories!
  • Drafting emails or LinkedIn posts works well if the instructions aren’t too detailed and specific. In a weird coincidence, those end up being the kinds of emails and posts that get the most engagement. A sender telling a recipient exactly what to do and how to do it isn’t a friendly communication.
  • Summarizing works well if the source is coherent and the reader of the summary is familiar with the source. Because the reader knows the material, no blind trust in the summary is required; the reader can evaluate it for consistency with the source. A reader unfamiliar with the source material has no basis to evaluate the quality of the summary.
  • Translation works well with multilingual models trained on enough translation material, because sentence-by-sentence translation largely covers full-document translation. In my own work, bilingual evaluators of translated stories consistently rated them “A” work, very good but not quite perfect with nuance, even from small, private models in the 4B-parameter range.
  • LLMs suck at automation. Tool calling is great when it’s sequenced correctly, and the demos are amazing when they work. The problem is that sequencing by an LLM is still random, not deterministic. The funny thing is that we already know how to automate deterministically: code the exact sequence, and the computer follows the exact sequence. The best we can hope for when an LLM handles sequencing is that it might find a sequence for something we don’t already know how to sequence. This suggests that LLMs might be good for exploring or prototyping sequences we don’t already know about. The limiting factor is the downside risk of sequencing incorrectly. In processes worth automating, those downside risks tend to be quite high. In plain English, mistakes are very costly.
  • How about so-called vibe coding? That looks like applied automation plus storytelling. For prototyping, where we don’t fully know what the system should do, vibe coding is a great tool for presenting possibilities, provided we’re willing to treat the result as a prototype and not try to massage it into a working system.

    An unexpected turn that vibe coding has taken is into specification-based automated code generation. This is a mistake, because we’re not good at writing detailed specifications, and LLMs are not good at faithfully following all the directions in them in correct proportion.
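The deterministic alternative mentioned in the automation bullet is worth seeing concretely. In ordinary code, the sequence of steps is fixed at write time and runs identically on every execution; the functions below are hypothetical stand-ins for real tools an LLM might otherwise be asked to call in order:

```python
def fetch_invoice(order_id):
    # Hypothetical step 1: look up the invoice record.
    return {"order_id": order_id, "amount": 120.0}

def apply_discount(invoice, pct):
    # Hypothetical step 2: adjust the amount.
    invoice["amount"] *= (1 - pct)
    return invoice

def send_email(invoice):
    # Hypothetical step 3: notify the customer.
    return f"Emailed invoice for order {invoice['order_id']}: ${invoice['amount']:.2f}"

# Deterministic automation: the sequence is coded once and followed
# exactly, every run. An LLM choosing which tool to call next
# effectively re-rolls this ordering on every completion.
result = send_email(apply_discount(fetch_invoice(42), 0.10))
print(result)  # Emailed invoice for order 42: $108.00
```

Three lines of plumbing, zero sequencing risk. The only reason to hand this ordering to an LLM is when we genuinely don’t know the right sequence yet.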

It is tempting to suggest putting a human in the loop for all of these activities, since none of them works perfectly every iteration. There are two problems with this. The first is that it may be less expensive to have a human do all the work to a higher standard of quality than to have a human evaluate every answer and repair the bad ones. This is probably the case with vibe coding. The other problem is that competent humans may not enjoy a role that sets their creativity aside.

“Suck it up, buttercup, this is what we’re paying you to do now.”

The problem with this approach to defining work is that it atrophies the skills that make great reviewers and fixers great at their work, and alienates the reviewers who have other options.


Rule of Thumb

Here is a simple rule of thumb. LLMs are great at right-brained tasks, where any linguistically plausible answer is a good answer to the task. They suck at left-brained tasks, where only a small subset of the possible random answers is acceptable. When you work against this rule of thumb, you make your project expensive, or not winnable in the first place. Comfort with this rule of thumb is the “chill” you need if you’re going to make good decisions about what this technology should do.

“It shouldn’t be limited this way, and besides, version next will be better.”

Well, you lack the chill to make good decisions, and that lack of chill will absolutely get your ass kicked in this game. Downside risk.


Conclusion: Who Knows?

Let’s conclude with a personal observation from three years of making this argument, tracing the actual task back to the completion algorithm:

  • 16/20 professionals in the AI space have no sense whatsoever of which tasks LLMs are good for and which they are bad for. They haven’t considered that there is a dichotomy, let alone a continuum. They are the people most associated with AI hype and phrases like “it’s just a toddler” and “the next version will be even better”.
  • 3/20 feel that there are some good applications and some bad applications, but have no idea what they are and no inclination to investigate.
  • 1/20 have considered asking what LLMs actually do to try to figure out what they are good at.
  • A much smaller segment have put together a working theory.

These are my rough estimates based on my very active engagement with these people. They’re usually not bad people. They just aren’t thinking this through.

You have now seen a working theory. You are aware that at least a handful of people are paying attention to this. We’re paying attention because there is an opportunity to kick a lot of ass of a lot of people who just have their heads in the sand. As a thank you for reading this, I hope that you will pull your head out. Of the sand. You know, so you’re not an easy target for a total ass kicking.

My friend Pete A. Turner shared something with me on a private phone call the other day:

“People are making a lot of decisions based on what a robot thinks the next word should be.”

Pete is not a tech guy. He is very right-brained. He has a better holistic sense of what is going on here than any tech guy I know.


I would appreciate your reactions and comments on my LinkedIn repost.