Modern artificial intelligence tools are largely rooted in text-based interactions. Text has many advantages. But the history of user experience, information and even humanity shows us that AI will have to go beyond text if it’s going to become relevant to wider society.
Wordcels and shape rotators
A few years ago there was a meme within techbro circles about “wordcels” and “shape rotators”. This is the idea that people broadly fall into one of two camps: those who think in words, and those who think in pictures. Of course, reality is much more complicated than that, but the idea resonates.
Much of my work in information architecture involves working with semantic concepts that, by their nature, resist simplistic visualisation. On the other hand, there are those for whom a picture paints a thousand words, and for whom a concept cannot be grasped unless it is visualised.
My wife Alex and I complement each other in this way. I work with words, and she is awesome at visuospatial problems.
A few years ago we did the Crystal Maze Live Experience together. At one point my heart sank when Alex was sent into a room to do a word puzzle — one that literally involved cells of words. While the rest of the team barked a cacophony of competing instructions at her, she and I both knew it wouldn’t go well.
A few games later, she was doing a 3D puzzle with Tetris-style blocks — literal shape rotation. Jackpot. I told our team mates to be quiet and leave her to it. Sure enough, she completed it all by herself in an unbelievably fast time.
Text-based and visual advances underpin the history of information
The tension between the needs of verbal thinkers and visuospatial thinkers can be seen throughout the history of user experience, information and even humanity. It is almost as if a pendulum has swung between text-based information and visual information over time, each swing building on previous advances.
Chris Haughton’s book The History of Information shows how humans have developed increasingly sophisticated ways of working with information, culminating in the artificial intelligence of the present day. Even just scanning the contents page, you can sense how verbal solutions and visual solutions have built on each other to reach this point.
It all started with the development of spoken language (verbal). Speech may have first developed 1.75 million years ago. Then came mark making and drawing (visual). That was followed by writing.
When printing was invented, it became possible to reproduce both text and images using woodblocks.
But in 1440, when Johannes Gutenberg invented the movable type printing press, it suddenly became much more efficient to reproduce text at scale.
Because the Latin alphabet has only 26 letters, the printing press made it far easier to reproduce European texts than East Asian writing, which uses a large number of logographic characters. Chinese, for example, required more than 8,000 characters (today, the Unicode standard includes over 100,000 Chinese characters). For this reason, most East Asian books were still produced by woodblock until the 19th century.
Arguably, the invention of the printing press put verbal thinkers on the front foot for hundreds of years. Methods that suited this style of thinking boomed, and written ideas could be disseminated widely. The publishing boom in Europe led to the rise of science, data collection, and attempts to classify knowledge through encyclopedias and dictionaries. Next came newspapers.
It was not until the emergence of photography in the 19th century that visual media could undergo a similar sort of revolution.
Then came advances in text communication that verbal thinkers would have thrived with: telegraph, telephone and radio. From there, it was a short leap to television.
The pattern I observe is that text is relatively easy to reproduce and disseminate in comparison to images. Advances in text or verbal communication technologies always seem to come before similar advances in visual communication technologies — sometimes significantly before.
Text and visuals in computing
As computing rose to prominence in the 20th century, you could see the same interplay between textual ideas and visual ideas.
At their core, computers are neither textual nor visual — they are mathematical. So the earliest interfaces sometimes involved people literally working in binary. The first user interface was a punch card.
In the 1960s came the command-line interface. This was a text-based system that enabled people to interact with a computer using a specialised language. For many computer users, this type of interface is the most efficient way to get things done, and most operating systems still contain a text-based terminal like this to this day.
But computers remained inaccessible to a wider audience until the advent of the graphical user interface, which became dominant by the 1980s. This ushered in the personal computing era.
Shortly afterwards, Tim Berners-Lee invented the world wide web. In its earliest days, the web was a purely hypertext medium. The earliest versions of the HTML specification did not include images or any form of multimedia. The web was primarily envisaged as a way of sharing and linking text. It is almost as if it was a reaction to the rise of the graphical user interface.
The developers of the Mosaic browser unilaterally decided to display images within webpages, ahead of the official HTML specification. But Mosaic became a hugely popular web browser, opening up the text-based medium to visual thinkers, and paving the way for the multimedia experiences that arguably made the web so appealing to most.
After the web became established as a multimedia platform, the advent of smartphones took us further into visual space. But the skeuomorphism of early touchscreen interfaces proved divisive.
Voice interfaces through smart assistants like Google Assistant and Alexa came next, almost as if in reaction to the highly visual skeuomorphism.
Attempts to render user interfaces as immersive environments like augmented reality and the metaverse felt almost like a counter-reaction to voice interfaces.
Social media is empowering visual thinkers
We have seen a similar layering of text-based and visual communication styles in social media. Blogging was an early favourite among the verbal thinkers of web 2.0. Popular blogging platforms even invited comparison to Gutenberg’s revolutionary invention with names like Movable Type and WordPress. (Interestingly, WordPress’s modern, more visual editing interface is called Gutenberg.)
Microblogging platforms like Twitter soon offered ways for people to share text in a viral fashion.
But shortly after blogging rose to prominence, Flickr arrived to provide the equivalent for photographers. Over time, social media has become highly visual with YouTube, Instagram and TikTok dominant.
Now, billions of people across the world routinely share and reproduce not just images, but videos — at massive scale.
Arguably, high-speed internet combined with modern video codecs is the visual thinkers’ equivalent of Gutenberg’s printing press. Only recently have visual thinkers been able to reproduce rich visual communications at scale. This is humanity’s pivot to video.
It makes me wonder if this may usher in an era that puts visual thinkers on the front foot for a long time to come, just as the printing press benefited verbal thinkers for centuries.
Visual thinking is not universal
Some visual thinkers tend to assume that everyone thinks visually, but this is not at all true. For a start, it sidelines visually impaired people and those with aphantasia.
For all the dominance of visual media, it is clearly not universally favoured. People enjoy reading, radio and podcasts. The most popular art form is music.
There are also many ideas that simply cannot be easily represented visually. For example, Chris Haughton’s History of Information (page 21) notes a constraint of visualisations that necessitated abstractions such as alphabets:
…it is possible to [visually] represent “man” or “woman” but it is very difficult to represent “brother”.
The decline of text is changing society
Some, like Kevin Munger, argue that the modern popularity of visual media is causing a decline in literacy, which in turn is causing a decline of “the institutions that governed society”. I might agree. But then, as a verbal thinker, I would say that.
I agree that the world has felt a lot messier and more dangerous since visual social media came to prominence. But the idea that all the hallmarks of modern society that developed in the wake of the printing press are inherently better than an unknown alternative may be a value judgement or an ideological stance.
It is worth remembering that when writing began, great thinkers like Socrates worried that people’s memories and the art of questioning would be lost.
Presumably it would have been hard for people grappling with the printing press in the 15th century to imagine the rise of science, data and knowledge classification. Who knows what innovations may be unleashed by the ability of visual thinkers to innovate in the medium they thrive in?
Thomas Pettitt has even argued that the printing press merely represented an interruption of a dominant oral culture, an idea he calls the Gutenberg parenthesis.
Text-based interfaces have many advantages
Still, there are clear utilitarian reasons to favour non-visual approaches. Even leaving aside the preferences of more verbal thinkers, there are a few key reasons to favour a text-based interface.
For users of some assistive technologies like screen readers, the benefits are clear. If the interface is text-based in the first place, there is no need to translate visual concepts back into text for the benefit of screen reader users.
Meanwhile, for power-users, text-based interfaces offer a way to make multiple precise changes at speed.
An incredible demonstration of this is in videos of people using Strudel, a browser-based, text-based music making platform. One hypnotic example shows Switch Angel making a trance track by typing code in real time. You can feel the sense of mastery and control the interface provides.
We don’t often think of music as something we can create or manipulate through text-based code. But Strudel shows how, once you reach a certain comfort level, it can be incredibly efficient.
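To make the idea concrete, here is a rough sketch in Python of how a single line of text can encode a rhythm. This is deliberately not Strudel’s real API (Strudel is JavaScript-based, with its own mini-notation); it just illustrates the principle that each word names a drum sample, “~” marks a rest, and position in the string determines timing.

```python
# Hypothetical sketch of pattern-as-text music, loosely inspired by
# live-coding tools like Strudel. Not their real API.

def parse_pattern(pattern: str, cycle_length: float = 2.0):
    """Split a space-separated pattern into (time, sample) events.

    '~' marks a rest; each step gets an equal share of the cycle.
    """
    steps = pattern.split()
    step_duration = cycle_length / len(steps)
    events = []
    for i, step in enumerate(steps):
        if step != "~":  # rests produce no event
            events.append((round(i * step_duration, 3), step))
    return events

# A kick/snare backbeat with hi-hats in between, written as plain text
print(parse_pattern("bd hh sd hh"))
# -> [(0.0, 'bd'), (0.5, 'hh'), (1.0, 'sd'), (1.5, 'hh')]
```

In a live-coding environment, editing that one string re-schedules the whole loop instantly, which is part of what makes the text interface so fast for a practised user.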
Then there are tools like iA Presenter, which allows you to create presentation decks just by editing text. This is refreshing if you have become sick of tediously twiddling about in PowerPoint just to get things to line up properly.
Text-based interfaces are also easier to develop, and less resource-intensive than their graphical counterparts. This is why an advance normally comes first in the text-based space. A visual equivalent follows later — sometimes a long time later. So if you are comfortable using a text interface, you can experience a technology advance first.
Artificial intelligence is still largely based in text
Now we are in the artificial intelligence era. In keeping with technological advances throughout history, the earliest generation of modern AI interfaces have focused on text-based usage. Large language models literally centre on modelling language.
The main interface paradigm in AI tools so far has been chats — conversational interfaces. But while conversational interfaces hinge on the idea that people would find it easy to chat with their computer in natural language, it turns out that to get a half-decent response from an AI tool, you need to know exactly the right way to ask the question. The learning curve for this is steep, giving rise to prompt engineering.
The idea of prompt engineering has always smelled bad to me as a user experience practitioner. Blaming the user is a bad strategy. If a tool demands so much expertise that using it properly counts as “engineering”, then it is not usable enough (unless it is aimed at expert users).
Even as a verbal thinker who works primarily in text, I find it hard to imagine that chat-based interfaces are the ultimate way to interact with powerful machines. Code may be. But having to “chat” by “engineering” my “prompts” is inefficient and cumbersome.
To make matters worse, the outputs of an AI chat tool are normally excessively verbose and cringey. I would never want to chat with a human who talks at me the way an LLM does.
How artificial intelligence may go beyond text
History shows us that artificial intelligence tools will have to go beyond text to become truly usable. Command-line interfaces were a leap forward for expert computer users, but it took the advent of graphical user interfaces for computing to become accessible to everyday users.
So what would future interfaces for AI look like? There are lots of people actively investigating this. It feels like we are about to see some radically new ideas in user interfaces.
Google’s Design blog recently acknowledged the “conceptual gaps around AI”. But their attempt to fill this conceptual gap feels disappointingly superficial.
They compare their work to Susan Kare’s pioneering graphical user interface for the Macintosh. It is a bold comparison. Susan Kare’s concept took a methodical, object-oriented approach, adopting metaphors of real-world things to help people understand the functions of a computer. The ideas pioneered in those early Macintosh interfaces have underpinned every subsequent desktop operating system of note.
Sadly, Google’s ideas don’t live up to their own hype. Their blog post asks: “What is Gemini’s equivalent of Kare’s smiling computer face?” The answer: They “landed on gradients”, because they “might be much more about energy… spirit and directionality.”
This has all the hallmarks of visual designers retrofitting a justification for something they think looks nice. A visual design that “might be about energy” may pass muster for a yoga brand, but it doesn’t feel like a solid footing for the future of human–computer interaction.
Adding structure makes chat more usable
One group that has interesting ideas is the Gov.UK AI Studio. Their work is of interest to me as someone working in information architecture in government. Their ideas appear to be evolving rapidly.
In December they wrote about design patterns for AI-guided service journeys. This involved “using a conversational goal-directed agent, in a chat-like interface” to understand “a user’s intention and constraints”.
The following month, in January, they were “exploring AI as a design material”. There, the AI model took notes as the conversation went on, keeping track of the user’s goals and information, then using that to decide which rules and structures to apply in its response.
In February, they started talking about deterministic UI. Here, the chat interface is augmented with form elements like radio buttons and checkboxes to collect specific information and manipulate data in a controlled way.
This moves the conversation from informal to formal. It allows users to provide precise information in exchange for precise responses, and enables the agent to reliably perform functions on the user’s behalf.
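As a hypothetical sketch of what deterministic UI could mean in practice (the JSON shape and field names here are my invention, not Gov.UK’s published design), the model can emit a structured description of a control instead of free prose, and the interface then renders it deterministically:

```python
import json

# Hypothetical sketch: the model returns a machine-readable control spec
# rather than free text, so the interface can render real radio buttons
# and constrain the user's answer. The schema below is invented.
model_output = json.dumps({
    "say": "Which service do you need?",
    "control": {
        "type": "radio",
        "name": "service",
        "options": ["Renew a passport", "Replace a lost passport"],
    },
})

def render(raw: str) -> str:
    """Turn the structured response into a (text-mock) radio group."""
    spec = json.loads(raw)
    lines = [spec["say"]]
    for option in spec["control"]["options"]:
        lines.append(f"( ) {option}")
    return "\n".join(lines)

print(render(model_output))
```

The key design choice is that the free-text conversation ends at the model’s boundary: what reaches the user is a constrained control, so the answer that comes back is guaranteed to be one of the known options.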
Today, they have published a new blog post describing “actions, skills and modes”. These are pitched at a high level as “design surfaces”, but the detail describes how they are “combinations of context, instructions, tool calls or principles”.
These advances suggest that the open-ended nature of chat interfaces is exactly the problem. Open-ended chat sounds powerful and liberating at first. But it gives users nothing to latch onto.
The real world has constraints, and these need to be represented in the system. When these constraints are not explicitly acknowledged, the user can be led into typing a vague or invalid prompt. Instead of recognising the gap and offering an opportunity to course-correct, the LLM tries to fill the gap — and that’s where hallucinations come in.
Open-ended chat gives users no anchor points, and few clues as to what an effective input would look like. As such, users start to lose their bearings. If the AI tool doesn’t even know the user has lost their way, the situation can become a vicious cycle. This may be one reason why many users report AI responses becoming worse the longer the conversation goes on.
Artificial intelligence needs affordances
Jorge Arango recently pointed out that while many chat interfaces are open-ended, effective user interfaces offer clear distinctions and affordances — the qualities that tell us how to use something. Or, as Erika Hall says in Conversational Design (page 77): “context makes the conversation”.
This ought not to be news. Donald Norman wrote almost everything we need to know about affordances in the 1980s, in his book The Psychology of Everyday Things. But in the rush to slap AI slop onto everything, people have taken the lazy route in the hope that a chatbot will solve all the problems.
One approach to making AI tools less open-ended is retrieval augmented generation (RAG). Here, an AI tool indexes your content and its relationships (often in the form of a knowledge graph), rather like a search engine does, and retrieves relevant material when answering. This gives the AI tool a better grounding in reality, reducing the risk of hallucinations.
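A minimal sketch of the retrieval step, using bag-of-words cosine similarity as a stand-in for the vector embeddings a real RAG system would use (the documents and function names are illustrative):

```python
import math
from collections import Counter

# Toy corpus standing in for an indexed knowledge base
documents = [
    "The printing press was invented by Johannes Gutenberg in 1440.",
    "Strudel is a browser-based live coding platform for music.",
    "Screen readers translate interfaces into speech or braille.",
]

def similarity(a: str, b: str) -> float:
    """Cosine similarity between word-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list:
    """Return the k documents most similar to the query."""
    return sorted(documents, key=lambda d: similarity(query, d), reverse=True)[:k]

# The retrieved passage is prepended to the prompt, grounding the answer
context = retrieve("Who invented the printing press?")[0]
prompt = f"Answer using only this context:\n{context}\n\nQuestion: Who invented the printing press?"
print(context)
```

The model then answers from the retrieved passage rather than from whatever it half-remembers, which is where the reduction in hallucinations comes from.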
But this doesn’t yet get us beyond text-based interfaces. As AI has evolved, it has become more agentic. This is where it makes use of tools to take certain actions on your behalf in a variety of ways, rather than just executing a chat. This opens up a world of new possibilities. Here is where things get really interesting for the future of user interfaces.
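A hypothetical sketch of that agentic loop (the reply format and tool names are invented for illustration): the model either requests a tool call or gives a final answer, and a small harness dispatches accordingly.

```python
# Hypothetical sketch of an agentic loop. The message schema and tool
# names are invented; real systems define their own tool-call formats.

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real API call

TOOLS = {"get_weather": get_weather}

def run_agent(model_replies):
    """Dispatch tool calls until the model produces a final answer."""
    for reply in model_replies:
        if reply["type"] == "tool_call":
            result = TOOLS[reply["name"]](**reply["arguments"])
            # In a real system, `result` is fed back to the model here
        elif reply["type"] == "final":
            return reply["text"]

# Scripted replies standing in for a live model
answer = run_agent([
    {"type": "tool_call", "name": "get_weather", "arguments": {"city": "London"}},
    {"type": "final", "text": "It is sunny in London."},
])
print(answer)
```

The interesting part for interface design is that each tool is a point where a real, constrained action happens, and each of those actions is something a user interface could surface, confirm or visualise.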
The skeuomorphism of artificial intelligence
There is a good chance that the ultimate AI interface will look rather like a website. In Conversational Design (pages 40–44), Erika Hall argues that traditional Google search is the ultimate conversational interface.
Luke Wroblewski wrote about a series of recent experiments with different interface patterns for agentic AI. Kanban, dashboard, inbox, tasklist, calendar. These are all familiar concepts from the past.
This familiarity can be helpful, as Luke Wroblewski notes. It works like the metaphors of the graphical user interfaces of the 1980s, grounding new functionality in familiar concepts. Susan Kare’s trash can intuitively told users how to get rid of their files, because the desktop interface was modelled on real-life offices.
These interfaces for agentic AI perform the same function as skeuomorphism, by mimicking familiar web interface patterns.
But this rehashing of old patterns is unlikely to be the end state of AI interfaces. Luke Wroblewski noted Scott Jenson’s 2011 quote in relation to the shift to mobile devices:
…copy, extend, and finally, discovery of a new form. It takes a while to shed old paradigms.
This sort of digital skeuomorphism may help users learn how to use new technologies. But history shows us that skeuomorphism can wear thin quickly. It begins to make users feel infantilised, as well as limiting their understanding of the true capabilities of the technology.
I have a feeling that in a few years’ time we will all be interacting with AI tools in radically visual ways we have not yet imagined.
Then, sometime later, the wordcels will get their revenge once again.
