Generative AI tools like Midjourney, Stable Diffusion and DALL-E 2 have amazed us with their ability to produce stunning images in seconds.
Despite their prowess, however, a perplexing disparity remains between what AI image generators can produce and what we can. For example, these tools often fail at seemingly simple tasks such as counting objects and rendering accurate text.
If generative AI has reached such unprecedented heights in creative expression, why does it struggle with tasks that even an elementary school student could complete?
Exploring the underlying reasons helps shed light on the complex technical nature of AI and the nuances of its capabilities.
Limitations of AI with writing
Humans can easily recognize text symbols (such as letters, numbers, and characters) written in various fonts and handwriting. We can also produce text in different contexts and understand how context can change meaning.
Current AI image generators lack this inherent understanding. They have no real understanding of what text symbols mean. These generators are built on artificial neural networks trained on huge amounts of image data, from which they learn associations and make predictions.
The shape combinations in the training images are associated with various entities. For example, two inward-pointing lines meeting could represent the point of a pencil or the roof of a house.
But when it comes to text and quantity, the associations have to be incredibly accurate, as even small imperfections are noticeable. Our brains can overlook slight deviations in a pencil tip or a roof, but not so much when it comes to how a word is spelled or the number of fingers on a hand.
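This asymmetry can be made concrete with a toy sketch: altering one stroke of a drawn pencil rarely changes what it depicts, but altering one letter of a word almost always breaks or changes its meaning. The tiny word list below is invented purely for illustration.

```python
# Changing a single character of a word almost never yields another
# valid word -- text tolerates virtually no imprecision, unlike the
# loose shape associations that suffice for drawing objects.
words = {"apple", "house", "pencil", "roof"}  # illustrative mini-vocabulary

def one_char_variants(word):
    """All strings differing from `word` in exactly one position."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    for i, original in enumerate(word):
        for c in alphabet:
            if c != original:
                yield word[:i] + c + word[i + 1:]

variants = list(one_char_variants("apple"))
still_valid = [v for v in variants if v in words]
print(len(variants), len(still_valid))  # 125 one-letter variants, 0 still valid
```

Every one of the 125 single-letter perturbations of "apple" falls outside the vocabulary, whereas a comparably small perturbation to the shape of a roof or pencil tip would pass unnoticed.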
To a text-to-image model, text symbols are just combinations of lines and shapes. Because text comes in so many different styles, and because letters and numbers are used in seemingly endless arrangements, the model often fails to learn to render text effectively.
The main reason for this is insufficient training data: AI image generators require far more training data to represent text and quantities accurately than they do for other tasks.
The tragedy of the AI hands
Problems also arise when dealing with smaller objects that require intricate detailing, such as hands.
In training images, hands are often small, holding objects, or partially obscured by other elements. This makes it difficult for the AI to associate the term "hand" with an exact representation of a human hand with five fingers.
As a result, AI-generated hands often look misshapen, have too many or too few fingers, or are partially covered by objects such as sleeves or handbags.
We see a similar problem when it comes to quantity. AI models lack a clear understanding of quantities, such as the abstract concept of four.
Thus, an image generator can answer a prompt for four apples by drawing on what it learned from myriad images containing varying quantities of apples, and return output with the wrong quantity.
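A toy sketch of this failure mode: imagine a model that, lacking a grounded concept of "four", simply samples an apple count imitating how often each count appeared in its training images. The distribution below is entirely invented for illustration.

```python
import random

# Hypothetical distribution of apple counts seen in training images
# (invented numbers for illustration only).
training_counts = {1: 500, 2: 300, 3: 120, 4: 50, 5: 20, 6: 10}

def generate_apple_count():
    """Sample a count the way an ungrounded model might: by imitating
    the training distribution rather than obeying the prompt."""
    counts = list(training_counts)
    weights = list(training_counts.values())
    return random.choices(counts, weights=weights)[0]

# A prompt for "four apples" succeeds only when the sampled count
# happens to be 4 -- under this made-up distribution, about 5% of the time.
random.seed(0)
samples = [generate_apple_count() for _ in range(10_000)]
hit_rate = samples.count(4) / len(samples)
print(f"prompts satisfied: {hit_rate:.1%}")
```

The point of the sketch is that without an abstract notion of quantity, the output count tracks the statistics of the training data, not the number requested in the prompt.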
In other words, the huge diversity of associations within the training data affects the accuracy of the quantities in the outputs.
Will artificial intelligence ever be able to write and count?
It is important to remember that text-to-image and text-to-video generation are relatively new concepts in AI. Current generative models are low-resolution versions of what we can expect in the future.
With advances in training processes and AI technology, future AI image generators will likely be much better able to produce accurate visualizations.
It’s also worth noting that most publicly accessible AI platforms don’t offer the highest level of capabilities. Generating accurate text and quantities requires highly optimized and tailored networks, so paid subscriptions to more advanced platforms will likely deliver better results.
Image Source: theconversation.com