Recently, apps like ‘Be My Eyes’ and ‘Seeing AI’ have added the ability to generate image descriptions using multi-modal Large Language Models (models which can interpret images as well as text) like GPT4Vision. I’m sure everyone has tried them, but for anyone who hasn’t, here’s the description I got by sharing a Toot to Be My Eyes this morning:
“The picture you've shared is a beautiful watercolour painting of flowers. The flowers have delicate, ruffled petals in shades of pink, purple, and white. The center flower is particularly striking with deep magenta at the base of its petals, which fades into a lighter pink. The leaves are painted in soft greens with hints of yellow and blue. The background is white, which makes the colors of the flowers and leaves pop. In the bottom right corner, there is a signature that reads "Wendy Craig" with the word "Photographer" underneath. The painting has a very soft and elegant feel to it.”
I have no way of knowing if the original image looks anything like this. But it doesn’t really matter; it fits with what I was expecting from the Toot. I’ve really enjoyed ‘looking’ at all the Caturday photographs without Alt Text today – I can’t lie, I love cats!
Now, we all know the ‘scene description’ in Seeing AI was complete and utter rubbish – I don’t think that is me being harsh. The descriptions you get out of these new apps and models are, at the least, very entertaining. At best, they are life-changing!
For the first time, I – we – can ‘almost’ be a part of the visual internet. I’m not quite using Instagram daily, but you can share posts directly to Be My Eyes there too, and the descriptions are just as good.
I’m very excited, but I’m also aware that there are risks. These include inaccurate or misleading descriptions, so nobody should rely solely on this technology for critical tasks. But it is only going to get better. There is such an incentive for the world to develop computer vision, and we will benefit. It is only ever going to improve on what we have right now – which is already pretty fantastic.
So, here’s my question – are you as excited as I am? Are you keen to jump into the visual internet? Does the idea of GPT5Vision working with video make you want it to be here now? It’s certainly made using apps like Instagram, Ivory and Threads more inclusive.
Finally, on a personal note: I could see up to the age of sixteen. I’m using the text-to-image features of these models, together with the image-to-text, to create my own art. The descriptions you get from Google’s Bard are quite different from those from Be My Eyes and Seeing AI – the differences can be quite interesting.