Update 03/26/2024
I've just released what I'm tentatively calling Version 1 of both my Describe Screenshot and Describe Photo shortcuts.
They can both be found on their new dedicated site!
Some of the new changes include:
- Conversations: You can now reply to the descriptions you're given. To do so, press the Okay button on the description's alert. Have nothing to say? No worries! Hit Cancel, and the Shortcut will leave you in peace!
- Slash commands: When typing a reply, you can use /save with either Shortcut, and the last photo or screenshot taken will be saved to the photo album of your choosing. Additionally, Describe Photo also has /add, which will allow you to take another picture to accompany your replies.
- Describe Photo now supports the Apple Vision Pro! If you run the shortcut on Vision Pro, it will grab the latest photo from your camera roll rather than having you take one, because the Shortcuts app on Vision Pro doesn't support taking photos within shortcuts. If you intend to use this shortcut with other smart glasses, or prefer to take your photos in the Camera app, you can make grabbing the latest photo the default behavior on the setup screen.
That's everything. Share and Enjoy! :)
Update
There are now two Shortcuts.
- Describe Screenshots: Can be found here: Describe Screenshots. Once assigned to a VoiceOver gesture, this one will take a screenshot when run and have GPT-4 generate a description for you. It also gives you the opportunity to ask a question before sending your image.
- Describe Photo: Can be found here: Describe Photo. This one can also be assigned to a VoiceOver gesture; when run, it pulls up the iOS or macOS camera interface for you to take a photo, which is then described for you. Additionally, you can share pictures to this Shortcut, either from the iOS and macOS share sheets, or from macOS's Quick Actions menu.
Setting both Shortcuts up is identical to before, though now you can configure the system prompt and other parameters from the setup screen if you so choose. I did this because I hate editing shortcuts directly, and the setup screen can be brought back up whenever you want, even long after you've originally installed the shortcut.
On iOS, the setup screen can be reached by editing the shortcut, tapping Shortcut Info in the bottom right, then tapping Set Up in the top right (immediately beneath the Done button).
From here, you can tap the Customize Shortcut button, and you'll be asked all the setup questions again.
Note: The API key field will be blank when setting up your shortcut again, but as long as you've entered it once before, you don't have to fill this field out again. The rest of the setup process and usage is identical, so I'll leave the original post up as well.
Original Post
Hi all! The other day, it occurred to me that getting screenshots described is a pain with Be My Eyes and/or the ChatGPT app. You have to take the screenshot, hit the screenshot button before it disappears, hit Share, then hit Describe with Be My AI, which is far too many steps for me.
I've written a shortcut using the built-in Apple Shortcuts app that takes a screenshot and describes it using the same technology Be My AI uses. The best part is that since it's a Shortcut, you can assign it to a VoiceOver gesture. This works on both iOS and macOS; I just put it in the iOS forum because I figure more people are likely to see it here. Anyway! The Shortcut can be found right here! Unfortunately, I'm not rich and can't afford to pay for everyone's usage, so this does cost money (two to three cents per image), and there is a bit of setup involved.
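For the curious, the heavy lifting behind a shortcut like this is a single HTTPS request to OpenAI's vision-capable chat endpoint; the Shortcut's Get Contents of URL action sends roughly the same thing. Here's a minimal Python sketch of that request. The model name reflects the GPT-4 vision model available at the time of writing and may have changed, and the `build_request` helper is my own illustration, not something from the Shortcut itself.

```python
import base64
import json

API_URL = "https://api.openai.com/v1/chat/completions"

def build_request(image_bytes, question, api_key,
                  system_prompt="You describe images in detail for blind users.",
                  max_tokens=1500, temperature=0.7):
    """Build the headers and JSON body for a GPT-4 vision request.

    Parameter names (model, max_tokens, temperature, messages) follow
    OpenAI's chat completions API; the image travels as a base64 data URL.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {
        "model": "gpt-4-vision-preview",  # vision model of the era; check current docs
        "max_tokens": max_tokens,
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "text",
                 "text": question or "Describe this image in detail."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]},
        ],
    }
    return headers, json.dumps(body)
```

Sending it is then a single POST to `API_URL`; the description comes back in `choices[0].message.content` of the JSON response.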
So how do I set this thing up?
I'm glad you asked! Before you install this Shortcut, you need to do a few things:
1. Create an OpenAI account. This can be done at platform.openai.com. If you already have a ChatGPT account, you can skip this step. Otherwise, just head to that site, press the sign up button, and follow the instructions.
2. Sign into your OpenAI account (if you're not already) and head to their billing page. Here, you'll follow their instructions to set up a billing plan with them. It's not as complicated as it sounds: you basically just load your account with money ahead of time, and every image you have described pulls a couple of cents from that balance until it reaches 0, at which point you can refill it or never use the account again. This is not the same thing as a ChatGPT Plus subscription; if you have a ChatGPT Plus subscription, you still have to do this.
3. Acquire an API key. You can do this on their API Keys page. Just hit the Create button, type a name for it, and hit Create. A text box will then appear with your key. Copy this key and save it somewhere safe. OpenAI will not show you this key again, so if you lose it, you'll have to create another.
Also, don't share this key with anyone. Anyone who has access to your key can use their services pretending to be you, which will cost you money. If somebody does get their hands on your key, you can delete it on this page.
4. Install the shortcut! Once again, the Shortcut can be found here. When you install it, it will ask for your API key. Paste it into the box and tap Install. At this point, it should be ready to use.
Assigning it to a VoiceOver gesture
This part's pretty easy. Just go to Settings, Accessibility, VoiceOver, Commands, All Commands, Shortcuts, then select the Shortcut's name (Describe Screenshot). You'll then be given the option to add a gesture or keyboard shortcut. Once you add either or both, any time you use that gesture or keyboard shortcut, the Shortcut will run.
I've installed the shortcut and set up a VoiceOver gesture but how do I use it?
Pretty simple: Whenever you want your screen described, make sure Screen Curtain is off, then use your VoiceOver gesture to activate the shortcut. Your phone will take the screenshot, then open the Shortcuts app so you can include a question with the image. Type in your question (if you have one), then tap Done. Then you can return to what you were doing. The description will take somewhere between 10 and 30 seconds to come back, but you don't have to wait in the Shortcuts app; just go back to your YouTube video or whatever.
Once the description appears (the shortcut should play the Tri-Tone notification sound to let you know it's there), feel around the top center of your screen until VoiceOver focus lands on the description field. From there, you can swipe through the description and hit the Done button when you're finished reading. At the end of the description, you'll be told exactly how much that description cost you, so if you're conscious about money, be sure to read through to the end. If enough people want me to move the total cost to the top of the description, I can definitely do that.
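In case you're wondering how a per-description cost can be computed, the API's response includes a `usage` object with prompt and completion token counts, and you just multiply those by the published rates. A rough sketch; the per-1K prices below are assumptions from the GPT-4 Turbo era, so substitute the current numbers from OpenAI's pricing page:

```python
def description_cost(prompt_tokens, completion_tokens,
                     input_price_per_1k=0.01, output_price_per_1k=0.03):
    """Estimate the dollar cost of one description from the token counts
    OpenAI returns in the response's "usage" object.

    The default per-1K prices are assumptions, not guaranteed current rates.
    """
    return (prompt_tokens / 1000) * input_price_per_1k \
         + (completion_tokens / 1000) * output_price_per_1k

# A typical screenshot: roughly 1000 prompt tokens (mostly the image itself)
# plus roughly 300 completion tokens lands right around the
# "two to three cents" mentioned earlier.
cost = description_cost(1000, 300)  # about $0.019
```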
I don't like how it talks! Can I change it?
Yes, you absolutely can. Go into your Shortcuts app, find the Describe Screenshot shortcut, and hit the Edit action (using the rotor); the first four or five text fields of the shortcut are all parameters you can modify to your heart's content. If you specifically want to change the way it talks, just edit the text in the system prompt field. There's a comment box immediately before it that will tell you which one it is.
Dude, you talk a LOT!
I know! I know! I hope this Shortcut is as useful to all of you as it is to me. Please let me know what you think, and if you like it, share it with your friends who might benefit! :)
Comments
Very cool! Thanks for making…
Very cool! Thanks for making this!
Really cool but you may want to change the default system prompt
Thanks for this useful shortcut. I had to go in and change the "snarky" tone of the system prompt. You may want to do the same, at least as a default option!
Once again though, very cool and thank you!
default system prompt
I've updated the original post with a link to the updated shortcut with the formal / professional system prompt.
Good on you for going in and editing it yourself though, that's what I want to see from people using the Shortcut, and why I provided those comments before each text field.
Knowing how to edit and write system prompts is a really powerful tool because you can better tailor the AI to your use-cases.
Also, the more each of us learns, the more we can experiment to find the best prompts for each use case or app.
The best part is writing prompts isn't actually that difficult. Anybody who's decently okay at writing in their native language can do it. :)
Awesome!
This is great stuff indeed! I also changed your original prompt, though I haven't edited out its snarky and humorous tone yet. I live in Norway, so I just copied your prompt into a prompt-optimizer thread I have on ChatGPT (in the OpenCat app, to be specific), told it to translate it into Norwegian and to add that I want all answers back in Norwegian unless otherwise specified, and then pasted it right back into the shortcut.
I also translated your question phrase asking what I want to know about the picture. It works reasonably well, but I've seen a few issues, e.g. the screenshot being taken of the Siri interface, which then gets described instead of the actual screen content I wanted. I don't know why that happened a few times, but it must have something to do with the timing of the actual screenshot, and probably relates to where that action sits in the shortcut's order of steps.
I'll probably edit out the snarkier tone soon, to keep the results more efficient and hopefully shorter. And that raises another question: I saw there were one or more values for max tokens used when running the shortcut. I've had answers from this shortcut go on and on for ages, which probably gets unnecessarily expensive in the long run. I think I saw the token value set to 1500. Would it help to decrease this number a great deal, to keep the answers shorter?
And also, is there any way the response from ChatGPT could be sent directly to Siri and read out by the standard Siri voice, instead of having to take the route via the Shortcuts app to get the answer?
But this is great work, and the effort you've put into making this is well worth the time! This can also be a very good starting point for the rest of us geeks to tweak and hammer on to our own liking! 👍🏻
Good job, mister! :)
updated with a new Shortcut and easier configuration
Hi all.
I just updated the original post to include my new Shortcut, which works basically the same way, except rather than taking a screenshot, it takes a picture with your camera, or receives one from your share sheet / macOS Quick Actions.
Additionally, all the questions (API key, system prompt, max_tokens, and temperature) are now asked on the setup screen, so you can edit them whenever you want without having to edit the actual Shortcut.
Let me know what you all think!
I have to run to work, but I'll respond to you all during one of my breaks or after work.
very neat
I tried this on macos, but I wonder if it could be set to not require opening the window chooser?
I'm going to have a look and see if modifications can be made to allow the description to appear as an alert with "ask more", "copy" and "okay" buttons afterward. This would also apply to iOS.
Additionally, for whatever reason the "describe photo" shortcut is not listed in the sharesheet.
Has anyone else had this problem?
Again, Thank you Aaron for sharing this.
I'm looking forward to experimenting more and using what you've made as a starting point.
replies
Cliff said:
How are you finding its descriptions in Norwegian? How do they compare to English descriptions? I've always wondered how well GPT-4 would do with image descriptions in other languages. I've tried Spanish, and it seems to do okay.
Cliff said:
This shouldn't be happening. Siri is never used during the execution of this Shortcut, unless you made adjustments to it.
Are you activating the shortcut from Siri? If so, the Siri screen will cover the app's screen, yes. You have to activate it from a VoiceOver gesture, a Back Tap on your phone, or anything else that won't put anything on the screen before the shortcut runs.
Cliff said:
Absolutely. 1500 tokens of English text is around 1100 words. You can estimate how many words a given number of tokens will express by taking 75% of that number.
For example, 1000 tokens will usually be around 750 words. That being said, languages other than English are tokenized less efficiently, since the language models had less exposure to them during training, so it could be that 1000 tokens are required to write 500 Norwegian words, for example. I don't know if anyone has published a list of languages along with their average token count per word anywhere, but I can look around for you.
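As a back-of-the-envelope sketch of that 75% rule (the exact ratio varies by language and tokenizer, and the 0.5 figure for less efficiently tokenized languages is only a rough assumption):

```python
def estimated_words(tokens, words_per_token=0.75):
    """Rough word-count estimate from a token budget.

    English averages about 0.75 words per token; languages that tokenize
    less efficiently (e.g. Norwegian) may be closer to 0.5 words per token.
    Both ratios are approximations, not exact measurements.
    """
    return int(tokens * words_per_token)

# A max_tokens of 1500 caps an English answer at roughly 1125 words,
# which is why lowering it is an effective way to shorten (and cheapen)
# the descriptions.
english_cap = estimated_words(1500)
norwegian_cap = estimated_words(1500, words_per_token=0.5)
```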
Cliff said:
As I said earlier in this message, this won't work with Siri the way it's set up. When you ask Siri to run the shortcut, the Siri interface covers the app's interface, so the screenshot is taken of Siri rather than the app you wanted to capture.
That being said, the Shortcuts app does have a Speak Text action, so you could replace the Quick Look action at the bottom of the shortcut with Speak Text. I don't know how configurable that action is, though, in terms of what voices and speech rates you can use.
Quinton Williams said:
It depends on whether the Show Alert action is accessible on macOS or not. The reason I use Quick Look is that Show Alert doesn't work super well on iOS. I also like Quick Look because it lets you easily copy the text to the clipboard or share it anywhere as a text file, on both iOS and macOS.
Anyway, it wouldn't be difficult to use Show Alert on macOS and Quick Look on iOS. Give me a couple of minutes and I can do that and see if it's better that way.
The other problem with Show Alert on iOS is that all the text is presented as one giant block, so you can't swipe through it. I personally like swiping through, though I know some people would probably prefer to have the whole thing read in one giant chunk.
Quinton Williams said:
Unfortunately, you're kinda stuck with the blocks Apple provides for you. Quick Look has a share button, with which you can then hit Copy; Show Alert might too, I can't remember. The problem is you can't add an "ask more" button to the Show Alert or Quick Look screen, as they're not modifiable.
One thing you can do is after dismissing the alert or quick look window you can use an input action to ask users if they want to ask a follow up question, but looping with Shortcuts is frustratingly limited.
There is a repeat action, but it can only be set to an integer, e.g. you can do repeat three times. You can also repeat for each item in a list, but you can't do repeat until x = false.
You can have users choose how many times they want the loop to execute at runtime, but it's hard to know how many follow up messages you'll need to send.
I considered setting it to loop 100 times or something ridiculous, but then if you don't want to ask a follow-up question, you shouldn't have to dismiss the text box each time.
I'm sure there's an ideal solution to this problem, but I decided not to mess with it. I hate the interface Apple provides for these actions so the less I have to deal with them the happier I am. Plus, conversations can get really expensive. I find it cheaper to just send the same screenshot again with a question that will better direct the AI to provide the information I want. Definitely keep us posted if you find something that works well though!
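To see why conversations get expensive: every follow-up resends the entire history, image included, so the prompt-token bill grows roughly quadratically with the number of turns. A toy model of this, where the token counts are illustrative assumptions rather than measured values:

```python
def conversation_prompt_tokens(turns, image_tokens=1000, turn_tokens=300):
    """Total prompt tokens billed across a conversation where each new
    turn resends the image plus all prior messages.

    image_tokens and turn_tokens are made-up but representative sizes.
    """
    total = 0
    history = image_tokens  # the image sits in the context from turn one
    for _ in range(turns):
        total += history        # everything accumulated so far is billed again
        history += turn_tokens  # the new exchange joins the history
    return total

# One-shot: 1000 prompt tokens billed. A 5-turn conversation bills
# 1000 + 1300 + 1600 + 1900 + 2200 = 8000 prompt tokens for the same image,
# which is why resending the screenshot with a sharper question is cheaper.
one_shot = conversation_prompt_tokens(1)
five_turns = conversation_prompt_tokens(5)
```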
Quinton Williams said:
Hmm. It's actually not in my share sheet on macOS, even though Share Sheet is checked. It's definitely coming up on iOS though.
On macOS, I just press VO-Shift-M on a file, go to the Quick Actions menu, and select Describe Image from there.
updated both shortcuts
I've updated both shortcuts, so now, on macOS, they will use Show Alert by default.
If we can get Apple to fix the accessibility issues with Shortcuts on iOS, we can use Show Alert on iOS too.
The only other thing I want to figure out at this point is resizing the images.
Part of what makes the shortcut as expensive as it is is that Macs and iPhones either have high-resolution screens or take high-res photos; either way, we're sending huge images to OpenAI.
Shortcuts does have an action to resize a photo, but I'm scared to mess with that because I don't want to accidentally ruin the photos. lol
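For anyone who does want to experiment with the Resize Image action, the aspect-ratio math is simple, so the photos won't be "ruined" as long as both dimensions are scaled by the same factor. Here's a dependency-free sketch of the dimension calculation; the 1536px cap is my own guess at a reasonable quality/cost trade-off, not a value from the Shortcut:

```python
def fit_within(width, height, max_side=1536):
    """Compute new dimensions that fit inside a max_side x max_side box
    while preserving aspect ratio (images already small enough are
    left alone rather than upscaled).

    The default max_side is an assumption for illustration only.
    """
    scale = min(1.0, max_side / max(width, height))
    return round(width * scale), round(height * scale)

# A 12MP iPhone photo (4032 x 3024) shrinks to 1536 x 1152: same shape,
# far fewer pixels to upload and bill.
new_size = fit_within(4032, 3024)
```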
This is an excellent idea. Thank you.
This is a truly excellent idea. Thank you very much. I’m using it and finding it useful. Moreover, I’m very grateful for the effort and work you’re putting in. I hope we can have more of what you’re offering and more innovative ways to use our devices.
appreciate your effort
Having used Be My AI and Seeing AI for a long time, I barely see the need for such a shortcut, but I do appreciate you creating this one and sharing it with the rest of us here. For a long time I've wanted to automate an action; I tried really hard, including asking ChatGPT and Bard to guide me through creating an automation.
The automations are for WhatsApp; I don't think it would be possible, as WhatsApp is not a native iOS app.
Automation 1: When receiving a message in WhatsApp from a specific sender, with certain keywords or a phrase, reply to the sender with a custom predefined response message.
Automation 2: Get Siri to send my live location to a group chat in WhatsApp. Ask before running. Triggered by an alarm being stopped.
These automations are essential these days, as no one uses the Messages app on iOS to send out messages anymore. Its only use now is to receive OTPs, alerts, and marketing crap. Thank you for reading my comment, have a good one!
Can it be used with Gemini?
As far as I know, Gemini Pro can also process images and output text, and can be used via an API key. But unlike GPT-4, it also offers a free tier. Is it possible to update this shortcut to use the Gemini API?
delayed responses!
Hi all!
First, to the WhatsApp question: WhatsApp does support the Shortcuts app, so if you head to the Automation tab in the Shortcuts app, this should be doable.
As for Gemini: Gemini 1.0 Pro is really, really bad at image descriptions. Additionally, though it is free, it's only free because they're using the data we send to train future models, so you have no guarantee of privacy whatsoever. I don't feel comfortable building on that platform. Their more capable models, 1.0 Ultra and 1.5 Pro, aren't available to everyone on their API yet, so if I built shortcuts for them, most people wouldn't be able to use them.
I also tried Anthropic's new models, but though they're all amazing, even better than GPT-4 at text-based tasks, they all made a ton of things up for just about every image I passed to them.
Official v1.0 release is out!
I've just released what I'm tentatively calling Version 1 of both my Describe Screenshot and Describe Photo shortcuts.
They can both be found on their new dedicated site!
That's everything. Share and Enjoy! :)
Shortcuts and Signal app?
Hi,
Your shortcuts sound excellent. Just the other day, I wanted to take a photo using the Signal app's camera, and I'd obviously like to have it described as well. Can this be done?
I'm a newcomer to this. How do you know which camera (front or back) is active when you take a photo?
Thanks.
Question about Comparing images
Sometimes, I have two or more images, and I would like to compare them. Is it possible to do this with the shortcut? If yes, do I need to select one image first, then add the next images to the conversation? I know it's not possible to do this in the Be My Eyes app. Also, would it be possible for you to add a save button? I think this would be easier than having to type a command to get that action done, if you're able to.
Replies
Cordelia Said:
It depends. From my brief google search on this, it looks like Signal's camera doesn't save photos to your phone's camera roll by default. You have to "save" each photo before it'll show up there.
Once you've saved it to your camera roll, you can run the Take Photo shortcut and have it grab the most recently taken photo. Alternatively, you can use the share sheet to share the photo with the shortcut.
Grabbing the most recent photo is probably quicker, since you can hit save on the photo in signal, then run the shortcut with your VO gesture of choice and get your description immediately.
Cordelia Said:
I'm not sure how the Signal camera is laid out, but with the regular iOS one, you'll have a button labeled something like "Switch to back camera." If it says that, it means the front camera is the one being used. Similarly, if it says "Switch to front camera," it means the back camera is the one being used.
Winter Roses said:
Not at the moment, though I'm not opposed to adding this. It just might take me a week or two to get it done. :)
Winter Roses said:
This is unfortunately one thing I can't do. The Shortcuts app doesn't let you add buttons to things. I could add an alert that asks if you want to save or add another photo every time you send a message, but that means it would show up every single time you sent a message, regardless of whether you were interested in saving or adding another photo. This would slow things down... a lot!
Have you tried dictating /save? It works for me.
Thank you, and a question
Regarding the save option, no worries. I don't mind typing the command. I was only wondering if that would be a possibility.
This might be a bit off-topic, but the reason I was asking about comparing pictures is that, even though I'm totally blind, I sometimes like to use the image creation application, you know, Copilot? Even though I use the Be My Eyes app to tell me what the pictures are after they've been generated, I'd like to be able to compare the pictures that I like and have saved to my device. I'd like to know which one is more detailed, or which one best illustrates the point I'm trying to get across.
Anyway, I was wondering if it would be possible for you to create a shortcut for Copilot. When I generate a picture, I get four samples, and that's great, but sometimes different pictures have different aspects of the prompt I'd like to keep, and I don't know how to specify that. For example, the first picture might have the background I wanted, but the second picture has the people I requested. I don't know how to tell the app to combine those two elements into one picture going forward. It would be nice to be able to use a shortcut to create one single picture, and then give the app specific instructions to build on that particular picture, instead of working with four different images. I'm not able to specify which aspects to keep or remove without losing everything in the process. Even if you can't get this done, this is the context of where I was coming from yesterday; that's why I was asking about comparing images. The problem with Copilot is that you can't specify which pictures to keep, and it's hard to specify which aspects you want from the created images instead of always starting from scratch.
Can't be done, unfortunately
In the world of AI, there are problems that regular people like you and me could spend thousands of hours and dollars trying to solve, but such an attempt would be completely pointless. Why? Because the field of AI is developing so rapidly that within about six months the landscape will look completely different, and the problems we were trying to solve will be problems no longer.
The issue you're having is one such problem. It's a difficult enough problem that neither of us have the resources to try to solve, but somebody else does and eventually will.
Let me walk you through what's going on, so you have an idea of why it happens at least.
When you're talking to Copilot, you're talking to GPT-4, which is a large language model. It's what does all the cool writing stuff that is sometimes hard to distinguish from human writing.
GPT-4 is pretty smart, in that it can understand what you want from it, and it knows how and when to use tools.
DALL-E is one such tool. It is an AI system that can turn written text into art, based on the millions if not billions of text-image pairs it saw in its training. It, for example, knows what a house looks like, because it has seen many images that are associated with the word house.
The problem is, unlike GPT-4, DALL-E isn't very smart. It doesn't have the same grasp of the English language that GPT-4 does, and converting from words to pictures is really, really hard. It also can't reflect on its work, so it might be asked to make a house with windows made of cotton candy, and it will happily churn out the image you requested, but it can't look at it after the fact and say "Hey, the candy doesn't look as cloudy as I wanted it to. Let's fix that!"
This is why you'll have some images that follow your prompt exactly and others that don't. This is also why you'll usually have more than one image generated, because the developers know that you're unlikely to get what you were looking for on your first go.
It's possible that when we have DALL-E 4 or 5, this won't be a problem anymore.
The other thing to keep in mind, and the reason I mentioned GPT-4, is that you can write all the prompts you want, but there's no guarantee those are the prompts DALL-E will receive.
When you ask Copilot to generate art for you, GPT-4 comes up with the prompt for you based on your specifications, sends the request to DALL-E, DALL-E creates those four images then sends them back to GPT-4, and finally, they're given back to you. So you're basically playing telephone and all you can do is hope that your original message is preserved as it's passed between these two machines.
In the early days of DALL-E 3, there were ways to force it to generate the exact same image again and again, so you could ask it to make adjustments to that one image, but as far as I know, these methods don't work anymore.
There would also be no way to build a shortcut that merged these images together, because each image is different and you would need to apply different techniques to merge them based on how the images look.
You could also theoretically see better results by trying different prompts. If this is something you want to keep exploring, feel free to send me an email here on Applevis and we can see if we can figure something out!