New AI app for describing images and video: PiccyBot

Re: GPT-5

Thanks a lot! Yes, GPT-5 is pretty much unusable right now because the processing takes so long. GPT-5 Mini seems to work well, though.
I found the copy image option in Facebook at last. It is really weird: I knew it used to be there, but earlier today it didn't show up at all for some reason. Anyway, the copy/paste feature works well. I agree with Winter Roses, though: is it possible to only have to give permission once?

GPT-5 Chat

There is a model in the API called GPT-5 Chat, which is the same version of GPT-5 that powers ChatGPT. When I test it, it responds MUCH faster than standard GPT-5, although with rather short descriptions. It might be something to look into and add instead of the standard one. Here is the page about it:
https://platform.openai.com/docs/models/gpt-5-chat-latest

A Question About Model Storage of Sent Pictures

Hi Martijn,

First, thank you for all of the work you have done and continue to do for our community with PiccyBot. I never thought that such a service would exist, especially having access to multiple models.

I do have one question about how the AI models use the data we share through PiccyBot. I know that you/PiccyBot do not store or save the pictures uploaded by users; I am wondering, however, if you have any information about what the various AI companies do with the pictures that PiccyBot sends? I enjoy comparing descriptions from the various models, but I am uncomfortable having described pictures of family/friends/anyone else if these services are storing/utilizing the pictures users send. Regardless of what the AI companies do, please understand that this is not a reflection on PiccyBot or the work that you have done for our community.

Thanks for any insight!

Re: Grok 4 will stay

I live in a region where cellular connection is more reliable than Wi-Fi but don't think this has anything to do with connection stability. I can't even get a description when I upload a photo. I just get a Retry button but the result doesn't change no matter how many times I retry. Other apps like Be My Eyes can provide image descriptions without any problems though.

Storage of images and connectivity

Enes, if you feel PiccyBot is 'stuck' while the network is fine, please either restart your phone or even reinstall the app. It should work again. It's an elusive issue that I will try to fix soon.

Michael, as said, PiccyBot doesn't store any media or prompts. And there is an additional layer of privacy, since all requests to the providers come from the PiccyBot address, not yours. However, the AI providers can use your data in some cases. OpenAI says they won't, but you never know. Anthropic (Claude) has quite a good reputation, and Mistral, being European, is very privacy conscious. The safest is Llama 4, since it runs on a local server and all data is removed immediately after use. The worst is likely Google, but they are hard to avoid, especially with the Gemini 3 model around the corner, of which I have high expectations.

Re: pasting

Firstly, thank you so much for this new feature - I have been wanting an easy way to get facebook images described for ages.

I think I am being thick though as I can't find the option to paste.

In Facebook, I go to an image, double tap to view it, then double tap and hold for a bit to get the menu, and then I select Copy image. I then switch to PiccyBot... but where is paste? I presume I am repeating the same action as per Facebook - double tap and holding. But I can't find the option to paste. What should have the focus when I do this? I've tried the text box, heading and some of the buttons.

Sorry I know I'm always the last one to figure these things out. I think I am on the latest version - there was an update pending so I installed it before trying.

Purchased the premium features but cannot pick AI model

I can't interact with the AI model selection dropdown in the settings. Double-tapping does nothing. Also, there's this button labeled as "gear.badge.questionmark" that should be labeled more properly, likely "Help".

AI model selection

Enes, you somehow cannot access the Firebase database with the PiccyBot settings. Can you use a VPN or other network and try again?
Actually, I have included a built-in offline list, but that backup feature wasn't included in the latest release. I'll provide an update by Monday.

Mr Grieves, you can paste in the main view in PiccyBot with a long press. Press in the middle of the screen. It will then prompt 'PiccyBot would like to paste from Facebook', 'Do you want to allow this?', and then you can select 'Allow paste'.

Data storage

I checked the API terms of the different companies, and they say basically the same thing. All of them store your data for a limited time to check whether it complies with their usage policies. How long this time is varies (and some are vague about it). AFAIK the PiccyBot server is in the EU, which means the companies have to follow GDPR, but of course you never really know. I'm not sure that any of the big companies are "better" or "worse" than others in that regard.

Using a VPN worked but...

I could access the model list and select Grok 4 to find out how it would describe images and videos but then I forgot that it wouldn't be able to describe videos and captured a video. I did get a description afterwards, but the video was probably described by GPT4 or whatever the default/free model is. And now when I open the app, I have the free version interface with the Subscription button and an ad on the screen. I may try restarting the device or reinstalling the app but just wanted to inform you in case you work on fixing such issues. Also, I'd love to know whether I will have to keep the VPN on even after selecting the AI model, to access the server and retrieve descriptions at all times, or only once. Another thing is, why don't we have DeepSeek, Qwen or other models among the available ones? What models do provide video descriptions if not Grok 4? Can I not select any other model apart from GPT4 if I want to be able to get video descriptions as well as image descriptions? And can I not customize the default/initial system prompt? This would be quite handy. It's actually somewhat strange that this feature is missing when we can even customize the personality of the voice and possibly the description as well. Or does the personality customization thing apply to the style and intonation of the voice only rather than the content of the description? Finally, adding Piper voices as an option might create a free option and help reduce costs for you. They're also neural voices even though they lack style customization. They're also open-source and can be deployed on any server. Wait, why not just use the system voice then?
* Update: I did uninstall and reinstall the app while writing this, but now the Restore Purchases button doesn't bring up the App Store screen to let me restore the purchase. Let me also add that the Turkish localization is incomplete.
* Update 2: Just disabled the VPN and finally got the premium screen back after double-tapping on the Restore Purchase button several times.

Another Question

It appears that my model configuration is stored on the server, not the device itself. I completely uninstalled and reinstalled the app as I mentioned above, and the other settings were reset to the defaults, but it was still Grok 4 that was selected as the AI model in use.
And here comes the question: What is PiccyBot Mix and how is it supposed to work?
And here's a suggestion: Can we not set the description to match the language of the content if it is in any of the languages we specify in the settings? This could be useful for bilingual/multilingual people, those learning foreign languages, etc.
Update (more questions): What is "Blind native Style"? Is it a model? How exactly does the length parameter work? Does the number let you set the number of words per response? If so, should the description length depend on the content itself to a certain degree? What if we prompt PiccyBot to describe a long video? Will it still stick to the same description length and truncate the response?

Answers..

Thanks for the feedback!

The available models vary from time to time. PiccyBot had DeepSeek, but I replaced it with Llama 4, as that is a similar open-source model and I want to keep the list manageable. I also removed GPT-4o mini recently, as we now have the GPT-5 models.

Regarding the video descriptions, only the Gemini, Amazon and Reka models do that. The other models are image only. PiccyBot will default to Gemini 2.5 Flash for a video description when a different main model has been selected.
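
To make the fallback rule concrete, here is a minimal Python sketch. It is only an illustration: the model identifier strings, the prefix list, and the function name `pick_model` are assumptions made up for this example, not PiccyBot's actual code.

```python
# Hypothetical identifiers; the real model names used by PiccyBot are not known.
VIDEO_CAPABLE_PREFIXES = ("gemini", "amazon", "reka")
DEFAULT_VIDEO_MODEL = "gemini-2.5-flash"


def pick_model(selected_model: str, media_type: str) -> str:
    """Return the model that should handle a request.

    Images always go to the user's selected model. Videos go to the selected
    model only if it belongs to a video-capable family; otherwise the request
    falls back to the default video model.
    """
    if media_type != "video":
        return selected_model
    if selected_model.lower().startswith(VIDEO_CAPABLE_PREFIXES):
        return selected_model
    return DEFAULT_VIDEO_MODEL
```

With this sketch, a user who has an image-only model selected still gets a video description, just from the default video model rather than the one they picked.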

The personality affects the tone of the voice and will have some adjustments in the style of the content. Turn it off for a clean description. I will likely add a few more voice options in the coming week. For the system voice, you can set the voice to 'None'.

I hope the network and VPN issues will improve. I will add more local settings and backup options to ensure the settings remain accessible even if the app cannot connect to the Firebase server or the PiccyBot server.

PiccyBot Mix uses a combination of descriptions given by OpenAI, Google and Mistral models, and uses only the elements that are common to all. This should in principle all but avoid any hallucination in the description. So use this model for the most accurate description. Image only.
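
As a rough illustration of the "common to all" idea, here is a small Python sketch. It assumes each model's description has already been reduced to a set of short element labels; that extraction step is hypothetical and not shown, and the function name and sample labels are invented for the example. It is not how PiccyBot Mix is actually implemented.

```python
def common_elements(descriptions: list[set[str]]) -> set[str]:
    """Keep only the elements that every model reported."""
    if not descriptions:
        return set()
    # Normalize case and whitespace so "Dog" and "dog" count as the same element.
    normalized = [{e.strip().lower() for e in d} for d in descriptions]
    return set.intersection(*normalized)


# Made-up element sets standing in for three model outputs.
openai_elements = {"dog", "red ball", "grass"}
google_elements = {"Dog", "grass", "fence"}
mistral_elements = {"dog", "grass", "red ball"}

print(sorted(common_elements([openai_elements, google_elements, mistral_elements])))
# prints ['dog', 'grass']
```

An element mentioned by only one or two models (the fence, the red ball) is dropped, which is why this approach trades some detail for fewer hallucinations.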

Blind Native style uses an inbuilt prompt to ensure the description is relevant for people born blind, with more focus on touch and no reference to colors etc.

The length parameter basically determines the number of tokens used with the model. Set to 100, it will result in more lengthy descriptions, while 10 will give a concise description. The response speed will be slower with a higher length setting. For a long video, set length to 100 for the maximum detail in the description.
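
One way to picture this is as a simple mapping from the length setting to a token budget. The Python sketch below assumes the setting runs from 10 to 100 and uses a made-up factor of 10 tokens per unit; both the function name and the numbers are illustrative assumptions, not the values PiccyBot uses.

```python
def max_tokens_for_length(length: int, tokens_per_unit: int = 10) -> int:
    """Map the length setting to a token budget for the model.

    The 10 to 100 range and the tokens_per_unit factor are assumptions
    chosen for illustration only.
    """
    clamped = max(10, min(100, length))  # keep the setting inside the assumed range
    return clamped * tokens_per_unit
```

A larger budget lets the model write more, which is also why higher settings respond more slowly.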

The video quality setting determines the amount of compression of the video when sending it to the server. Low is high compression (for free users) while high is no compression. Setting it to high will give more exact results at a cost of slower processing.

Hope this helps!

Re: pasting

I'm not sure how that works with VoiceOver. I don't really have a "main form" that I can give focus to as far as I know. I can select all the elements in it, but not the form itself.

I have managed to get it to work a couple of times but I think it was pure luck.

Has anyone managed to do this with VoiceOver?

Re: pasting

Mr Grieves, it should be double tap and hold. But you are not the only one having trouble getting it to work. I will try to make it automatic in the next update, so that if PiccyBot finds you have an image on your clipboard, it will ask whether you wish to paste it.
However, Apple is tricky with this, as they want only user-initiated actions, not automatic ones, so they may not approve. Let's see..

Copy paste solution, maybe?

I use VoiceOver, and yes, I have gotten the copy-paste feature on Facebook to work with the app. I will say though, it can be tricky, because from what I understand, you have to be positioned right at the start of the line in the text box for it to work, and it has to be done pretty precisely. Isn’t there a way this could be part of the rotor?

You know how, when using the phone, there’s usually a box or menu with edit options on the rotor that include “Select,” “Select All,” “Copy,” “Paste,” “Share,” and other relevant commands? Is there a way you could enable something similar so that, for example, if I have an image on the clipboard, I could go into the text box manually, switch to the rotor, go to “Edit,” and then double-tap the “Paste” option? This would essentially put the picture into the box, just like how it works on the iPhone directly.

Right now, if I copy a picture directly to my clipboard from my iPhone camera, I can paste that image into the Notes app without a problem, but it doesn’t seem to work anywhere else. I don’t know if this could be implemented here, but it’s an option worth looking into.
As it stands, the edit menu is there, but none of the options show up. It would also be good to have some kind of text representation to show that there’s content in the box after pasting the image from the clipboard. Maybe it could display something like “Image” or even a short code such as JPG, GIF, PNG, or a series of numbers and letters. Basically, anything that would give an indication that there’s media content processing in the app. Having a completely blank box with no indication feels a bit strange, because there’s no way to know that there’s actually media there if you can’t see it.

Re: pasting

Thanks for the reply. I think the problem with VoiceOver is that it needs a specific child element to interact with and if it needs to be done on the background container then it becomes a bit tricky.

I wonder if popping up when an image is detected could get annoying. If I am using my phone I don't typically do much copy/pasting unless I am also on my Mac. So if I have an image in clipboard, it's likely to stay there for a long time. So if PiccyBot prompts me every time, then I would need to try to find some text to copy just to stop it happening?

Is there enough space on the screen to add a paste button amongst the other buttons, but only display it if there is an image to paste? Or maybe do something with the rotor actions?

Anyway, thanks again for this - once it becomes a bit easier this is going to be another really big advantage of PiccyBot compared to everything else. I usually ignore Facebook as I just feel excluded and I'm too lazy to save files all round the place just to have them described. I've been wanting something like this for ages.

Copying Descriptions to the Clipboard

Hi Martijn,

Could you please implement an easy way to copy just the image description to the clipboard? Right now, this is accomplished by pressing the Share button and copying the described text to the clipboard. Once I get to where I want to paste the description (usually into a message to someone), I have to do some editing to remove the link to PiccyBot. Would it be possible to add a "Copy" function to the VoiceOver Rotor when focus is on the text containing the description, and to please remove the PiccyBot App link from what is copied?

Re: Answers..

Thanks...
Question 1: Does the "None" option not disable the voice entirely? How does that let you use the system voice unless you have VoiceOver or Speak Screen enabled and use the appropriate gestures to hear the description? What I mean by "system voice", however, is the ability to have the system voice read out the description even if no such feature is enabled.
Question 2: What LLM does PiccyBot Mix use to perform text generation and generate the response? What do you mean by "elements"? Does PiccyBot Mix compare the text responses provided by the different models you mentioned, or does it compare the raw image processing results, find the elements common to all of them, and then generate the text response itself?

I don't want to be that person, but...

So the video description is done by Gemini? Why did I think it was a variant of ChatGPT? Wasn’t this a thing once, or did it change? I personally prefer the descriptions from ChatGPT, so I’m wondering if that could be done for video descriptions too.

Also, would it be possible—without making it overly complicated—to have a PiccyBot mix for videos as well? Basically, the idea would be to run the video through different models, then compile the details that at least three or more of them agree on. I know that’s probably super complex, and you’d have to figure out how to merge everything, but I think it could be really useful.

Enes, I'll look into system…

Enes, I'll look into system voices as part of the addition of new voices, soon. PiccyBot Mix currently uses Gemini 2.5 Flash, GPT 4.1 and Pixtral as models, and Gemini 2.0 Flash takes input from all elements that are present in each of these descriptions to put together a combined description. These exact models were chosen for their speed to be able to generate an accurate description but not take too long either. I may change them now with the introduction of GPT5 and soon Gemini 3.

Winter, a mix for videos can be done as well, let me experiment with that. But OpenAI doesn't do video descriptions unfortunately. The only thing it can do is take screenshots at intervals and describe these separately, but that takes long and the result is not too great anyhow.
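
For anyone curious what the screenshots-at-intervals workaround involves, here is a minimal Python sketch of the first step: choosing when to grab frames. The function name and parameters are made up for illustration, and the actual frame extraction and per-frame description calls are left out.

```python
def frame_timestamps(duration_s: float, interval_s: float) -> list[float]:
    """Return the times (in seconds) at which to grab a still frame.

    Frames are taken at 0, interval, 2 * interval, ... for every time that
    falls strictly before the end of the video.
    """
    if duration_s < 0 or interval_s <= 0:
        raise ValueError("duration must be non-negative and interval positive")
    timestamps = []
    i = 0
    while i * interval_s < duration_s:
        timestamps.append(i * interval_s)
        i += 1
    return timestamps
```

Under these assumptions, a 60-second clip sampled every 5 seconds produces 12 frames, so 12 separate image requests, which helps explain why this route is slow and loses the motion between frames.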

An iOS update is under review by Apple currently. That implements the automatic paste in PiccyBot, which should help make it a far easier feature to use. The app should also be more stable.

Thanks all!

Suggestion: Adding our own API key

Just a suggestion request / throwing it out there.
I really want longer descriptions on videos. Even though I have it set to 100, it's still fairly short, at least textually. I was wondering whether those of us who want longer, more detailed descriptions could maybe add our own API keys, so you don't have as much server cost? I think that would be in some ways more cost effective for you, and give those who want really long descriptions, such as myself, another option to get those.
This would of course be an option, I'm not suggesting pivoting away from the way things are done at all, just an optional extra.

Longer video descriptions, and other thoughts

So I don’t know if this could work, but I had given you a list of suggestions at one point in the Facebook group, and you were like, “Yeah, these could totally work, but you have to keep it affordable.” I’m going to copy and paste some of that discussion again myself because I like having all my points in one place. Would any of these suggestions work here? I know you said they were feasible, but it really depends on the cost, and that’s the one aspect that might hold us back. I’m going to paste the questions again just to refresh my mind and compare everything side by side. Could any of these work? For example, could you have Gemini analyze the video but then have ChatGPT give the output, or would that be too complicated? Let me explain. The thing is, the description—even though I set it to 100—still ends up being very short when it comes to video descriptions. That’s why I thought ChatGPT was the one giving the descriptions at one point, because I know ChatGPT tends to give out longer text in comparison to Gemini. If I remember correctly, it’s probably the model that gives the longest descriptions overall. Sometimes I’ll get a nice, long description, and then other times I won’t. Even when I specifically prompt for a longer video description, Gemini tends to summarize the content instead, which I think it’s kind of known for. So that’s why I thought ChatGPT was the one providing the description, not Gemini.

Basically, I don’t know if it’s even possible to combine the models—have Gemini analyze the video, then have ChatGPT provide the actual text. Maybe that doesn’t make sense technically, but it would be nice to somehow get longer, more detailed descriptions for videos. Also, when I ask follow-up questions about the video, even when I’m being very specific, I never seem to get the kind of response I actually want or need. I’m not sure if I’m doing something wrong here, or if it’s a limitation of the model. Like, I can analyze a video and get a fairly long description on the very first attempt. But if I analyze that same video again, I never seem to be able to get that same long, satisfactory description. I don’t know why, but artificial intelligence is kind of famous for this—you know, where the very first result you get is often the best one, and everything after that usually pales in comparison, sadly. I know you said ChatGPT doesn’t do video descriptions—which is strange—but what about Claude? Could you do a mix between Claude and Gemini? You’re still working on the audio-description model, right? I haven’t seen much discussion about it on here lately, so that’s why I was wondering. Anyway, I’m excited to see where this goes—you’re doing great.

Right now, the app has quite a few models, which is great, but I was wondering if maybe the number of models could be cut down a bit. For example, ChatGPT might be great with backgrounds and emotional details, Claude might excel at emotions, and Gemini might be better with faces. So what if you could combine the best parts of each model into one, or maybe two or three models? What you could do is either combine the best parts of each model across all the companies into one unified model that brings together their strengths, or if that’s not feasible, you could combine the best parts of each model within each individual company and then offer one or two optimized models per company in the app. In other words, there are two main ways to approach this: One, create a unit of cross-company models that merge the strengths and compensate for the weaknesses of all available models from different providers. This would mean blending the best features from ChatGPT, Claude, Gemini, and so forth into one or two highly capable “super models.”
Two, if that’s technically too complex or not possible, then at least streamline by merging the best features of models within each company. For example, take the best elements from the various ChatGPT models and combine those into one or two main ChatGPT options; do the same for Claude, Gemini, etc. Then the app would offer a couple of optimized models per company, instead of overwhelming users with a long list of individual models. Also, you could improve the naming scheme. Instead of showing confusing model identifiers like “ChatGPT-4 mini” or “Gemini Flash 2.5 NIL,” you could show the company name plus a simple model name and a short description. For example: we would see clean, descriptive names such as:
• ChatGPT plus – Best for technical images and detailed descriptions
• Gemini Flash – Best for dynamic images with emotions and faces
• Claude Sonnet – Best for videos or images with text and graphic overlays

This would help users easily understand the strengths of each model without having to guess which version or identifier means what. Keep in mind that this is a general overview, but we'd still be able to use the models for whatever we wanted to have described. Think about it—the models you have in the app are pretty similar, and most people usually stick to their one or two favorite models. I don’t think a lot of users are switching between models as much as possible. What I’m saying is, you could keep access to all the baseline models behind the scenes but not give users direct access to each one. All the processing could happen in the background, so users wouldn’t have to pick or switch models directly. You could even pull from other AI sources that don’t have chat interfaces or other AI developers who have strong description models but no conversational UI. We would get access to those capabilities through the underlying description process. Any missing details or aspects that don’t come through in the initial description could be fixed with follow-up questions. I don’t think this would sacrifice the quality of the descriptions. You could also consider having the app pull from the two or three latest models from each company, but that might complicate things and risk losing some details early on. Each model keeps improving anyway, and honestly, many of them are pretty similar with only minor differences. I don’t think most users can tell the subtle distinctions between, say, one version of Gemini versus another, or each version of Claude or ChatGPT. Usually, models fit into a couple of broad styles, and the letters and numbers in their names don’t mean much to the average user. Kind of like what you did with the PiccyBot mix or the native blind style, but on a larger scale.

PiccyBot Mix

Don't know how fast it is, but Llama provides surprisingly accurate descriptions despite being open-source, even though it lacks the detail provided by GPT5. It's still more detailed than Grok4, and doesn't hallucinate like Claude4 Sonnet. Reka, Pixtral and Amazon Nova fail to provide sufficiently detailed and accurate descriptions. So I'd say it's Llama instead of Pixtral that should be the third model beside a certain version of Gemini and GPT, provided it's fast enough. And I will list some other suggestions that I noted down as I continued to use the app. I'll have posted twice in a row but I don't want to make things complicated so the suggestions I will share in a separate post.

Suggestions

Here are some observations:

The volume settings for the mixer are not saved permanently; they just revert to the defaults when I close and reopen the app.
The audio description sounds okay when I hear it within PiccyBot but the beginning is trimmed/truncated when I export the video mixed with the description. Also, the first sentence is heard at the very end so the description starts over from the beginning if it ends before the video does. Some finer adjustments seem necessary to better synchronize the descriptions with the scenes and fit the description in the video duration so that it is not truncated or repeated from the beginning to fill in the gap at the end and align the end of the description with that of the video.

And below are some other suggestions:

A more compact interface that opens in an applet when bringing up description windows if an image or video is exported to PiccyBot from another app: We should be able to select something like a Dismiss button to quit it without having to close the app from the App Switcher or having all the elements and clutter of the full interface that could make the layout/navigation more complicated for some users.
Separate models, length and prompt customization for images and videos, and different contexts/types of content (i.e., OCR for images with text as the main focus, more concise descriptions for fast-paced action videos, more elaborate descriptions for artistic stuff etc.)
Concise description mode for videos (brief sentences/phrases describing the current scene/event/moment in between dialog) as in audio-described movies, instead of a whole long description that ignores what happens at a certain moment
Description history: The ability to store and access a certain number of descriptions along with the images or videos, customize that number in the settings and disable it entirely
Similarly, the ability to save a particular description along with the image or video and keep it indefinitely unless the user deletes it
The ability to customize the style/intonation and voice independently without having to stick to a certain style in order to be able to use a particular voice, or vice versa

Grok 4 - what a blessing + some tips to Enes

Martijn, thanks very much for adding Grok 4 to the model list, and for deciding that it will stay there! It does have a very low censorship level indeed! I have an image set of quite peculiar adult content. It forms quite a good "test set", as I was part of the experiences pictured in them, so I can form quite a good opinion on the descriptions given about them. Grok 4 didn't reject any of them, did a very good initial description, listened accurately to all my follow-up questions, and clarified all the details I wanted to know! Gemini 2.5 Pro, my earlier "go-to" model, errored out on all of them, without saying a word about the reasons of course (which were the adult content)! And what's more, all this in my native tongue, Hungarian! The wording, grammar and expressiveness of Grok 4 in Hungarian is totally on par. There were mild description inaccuracies sometimes (not really hallucinations, because the described element was really there, albeit a bit differently). But 1, or at most 2, follow-up questions always cleared those up. Earlier (in autumn of 2024, when I subscribed to lifetime PiccyBot) Pixtral was the model to choose for such images. But it didn't understand and generate in Hungarian, so I needed to change to English for such content. Furthermore, the descriptions of Pixtral weren't as good as Grok 4, though by no means bad. I don't know, by the way, whether Pixtral has retained that low censorship level in its current iteration. So Grok 4 is truly a blessing, as I have the perfect right to "see" adult content, because I am way into my adulthood. And now I can do that with really high quality and in my native tongue. So thanks again, Martijn!
To Enes: customizing the "what's in this image / video" prompt (which PiccyBot allows lately) and follow-up questions may be the real solution to some of your proposals. PiccyBot has recently got a prompt history feature, so you can recall the prompts you use / need often, like "Just read the text in this image accurately!" for OCR purposes (or its Turkish equivalent) etc. As we all know, we humans fortunately differ, and so do our needs, tastes etc., and visual content has an endless variety. And there are multilingual needs / setups for some people, like me. So I don't think that a heap of super-tailored modes / options is the way to go. There will always be a ton of cases that just don't fit. I prefer user freedom of creativity and experimentation more, although it requires some more work on the user's end, but it may well be worth it. By the way, I am completely satisfied with how the model list looks right now: I find it neither too long, nor in any way uncomfortable or unpalatable.

See post: "Another Question"

I already suggested a solution I found useful to get descriptions in multiple languages. I didn't know Pixtral also provided uncensored descriptions by the way.

prompt storage increase?

Really enjoying all the new changes going into the app. The copy and paste option is awesome, and if we ever get a desktop/website version it'd be really nice for sure.
BTW, not to take anything away from Piccy, but if you guys have been looking for a desktop uncensored describer option while waiting for Piccy to go that route, go to the miniapps website and search image describer and you'll find lots of uncensored describers you can use. You have to pay a bit to use them, but for now it's a nice option.
Now to my main request :) Can we have more prompts stored? I want to have a set of them for videos and a set for images, but the allowed number ATM basically is only enough for my image based prompts. Also, is there a way to make stored prompts accessible in the "ask more" section?
Thanks for the great work!

I downloaded the update. I have some questions

I downloaded the update a while ago. It’s on the App Store. The copy-paste feature is working pretty well here—whenever I copy a picture to the clipboard, once I enter the app, it prompts me to allow pasting. Hopefully, you can make it so it only asks once to paste instead of every time.
I still think having the copy-paste option on the rotor, if possible, would give us more control if I don’t want it to happen automatically. Because as it stands, like I said, when I paste a picture in the box without any text indicator that the picture is there, if I wanted to ask a specific question about the picture before I start analyzing, how would this work? Once I paste the photo, it automatically sends the query. I don't want to be picky, but maybe this could be optional? Like in the settings, do you want to paste manually, or automatically? Plus, when the picture is pasted in the box, can I ask a follow up question, when this issue is resolved, without overriding the picture? It seems silly, but I thought I would ask, just in case.

Update available - added pasting of video links

The latest update will also paste video links (from Facebook and YouTube mainly), which should make social media video description more flexible.
The volume settings for the mixer are now saved permanently.

Good luck, let me know how it works for you!

Thank you, and a question

Hey, the copy paste feature, with the links, is working very well.
I thought I would ask this question just in case — is there a way to enable a feature so that when I paste a link to an Instagram post, the AI could grab all of the photos from that post and describe them back-to-back?

What I’m thinking is: if I share the link to a post, it could automatically describe each image in sequence. But this could also be optional — so a person could choose to have it describe each picture automatically, or have it wait until they type something like “next picture” before moving on.

For posts with multiple images, maybe it could handle up to five at a time. If there are ten pictures in the post, it would describe the first five back-to-back, then wait for confirmation before describing the second five.

It could also provide a quick summary of each image in the post first, so the user could choose which ones they want more details about, instead of getting full descriptions for all of them by default. You could even make this customizable in the settings — for example, I might choose to always describe the first four pictures automatically, then wait for me to say “next” before going further.

Not sure how helpful this would be for others, but for me, when I’m on Instagram, sometimes a post has multiple pictures and I have to go through them one at a time to get descriptions. This kind of feature would make it much smoother. Just thought I’d throw it out there.

re: batch photos

"is there a way to enable a feature so that when I paste a link to an Instagram post, the AI could grab all of the photos from that post and describe them back-to-back?"
Totally agree- a batch download/describe would be very helpful for me too since I often deal with pages with multiple photos of the product I'm trying to buy and doing them one at a time is very tedious.

Something's wonky with copying texts from the APP

Normally I have the APP generate a description and copy the description it gave me and use that as the image's title. For some reason, it's not copying now. I copy it, go to the file, rename file, paste, nothing. Just to test it out, I copy the description, try to paste it into a msg to myself, nothing. To be sure it's the APP and not other things, I randomly copied texts from safari, google, dropbox, home screen...and paste them into a msg to myself. They all work, but when I try to copy the description from the APP, it doesn't paste.

Copy and paste

Sorry bit slow to post this, but just wanted to say thanks for the change to copy and paste. This works really well. I was worried it would keep hassling me if I switched back to PiccyBot but it seems to know that once it has described what is on the clipboard then it doesn't need to do it again. If I quit the app and restart then it does, but that's not a big deal.

This is a fantastic new feature and really opens up Facebook. Thanks for continuing to release such great updates.

Problems with descriptions on the app

I think something got broken in the latest update with the description feature on the app. Not sure exactly when it started happening, but here’s what’s going on: when I try to copy the text description—whether it’s through the option on the screen, the share sheet, or using the direct method on iPhone by tapping four times with four fingers—it doesn’t actually copy the description.

The only way I can currently get the text description is by sharing it to another app like WhatsApp, Messenger, or my email, and then sending it to myself. So yes, I just wanted to comment and confirm that this is definitely a bug happening on the app right now.

Winter Roses

Tapping the screen four times with 4 fingers? Is that something you set up? Only asking because, by default, it should be tapping the screen four times with—3—fingers.
Just a heads up. :-)

Grok 4 issue

In the last 3, maybe 4 days, I've not been able to get Grok4 to work hardly at all. 9 out of 10 times it times out or just sits there "processing" forever. Tis a shame cause Grok4 is by far my favorite AI model ATM.

Copy paste gesture

I don't remember changing the setting, but I must have at some point, so, yeah. Maybe that's why.

Fair enough

Just wanted to clarify in case you were performing the wrong gesture by mistake. 🙂

Having the same issue with Grok 4

All I get is some timeout message in response to subsequent follow-up questions even if and after I upload a photo to Grok 4 and do get an initial description.

Grok and copy

Guys, thanks for pointing out the copy issue. This was a side effect of handling the automatic paste. It will be fixed in an update within a day or two. The Grok 4 model should be working again fine now.

how exactly does share description work?

it seems share description only works with the last answer. i send a pic and get a description, i use the "ask more" function and ask two more questions. when i share the description and save it to a file, i only get the second (last) response from the "ask more" section, is this how it is supposed to work? if that's the case, can we add an option to share the entire transcript from beginning to end?

Share description

LaBoheme, this is how it currently works. After each response there is a trio of buttons, including a share button. If you use that, it will share that response. So if you want all responses, you'd have to share all of them one by one. I'll see if I can change that for the general share button on the main screen, for example.

I have just released another update of PiccyBot, which streamlines the settings screen somewhat and adds a few more options. It also includes short descriptions of each AI model, which was a common request.

please add .webp processing

it has become very common among many sites. since PiccyBot doesn't support it, sharing won't work, one would need to save it to photo album first.

Updates

LaBoheme, I'll look into support of .webp, hope to add that soon.

The Claude 4.5 Sonnet model is now available within PiccyBot, replacing the Claude 4 Sonnet one. I find it particularly good in describing emotions within scenes.

The PiccyBot interface has had an overhaul and should work smoother now. Some earlier video processing glitches have been resolved as well.

I received a request for guidance while taking a selfie within PiccyBot. It's quite some effort to do this though, giving audio feedback to get your face in the correct frame. Do you think it is worth adding or are there alternatives for this?

Thanks for all support!

re: Updates

Hello, thank you very much for constantly updating the software. Yes, it would be excellent if we could receive an audio guide for taking pictures, whether with the rear camera or the front camera. It would be absolutely great if we could then save the photo in Photos.

My thoughts, and a couple of suggestions

I know you said you don’t want to remove any of the models, but I’m making a suggestion here. You already have a pretty active model base, but most of the models don’t really differ that much from each other. I was only suggesting this because I thought you’d maybe like to simplify the platform a bit — that way we could focus on getting higher-quality models without having so many versions doing almost the same thing. There’s virtually no major difference between most of them, but I was thinking maybe you could keep the latest few and remove the ones that aren’t needed. The best way to handle it might be to ask for feedback first — then decide which models are worth keeping as standalone and which could be consolidated. You’re never going to be able to please everyone, of course, but asking the community what they think might help. People could vote on which models should stay and which ones they hardly use. That way, we’d get a more consistent setup and more consolidated descriptions, especially for videos, since these descriptions tend to be pretty short even when the output is set to 100.
As for the question you asked about feedback and taking photos — I’m not entirely sure how that would work. I had an idea, but here’s what I was thinking: some apps out there guide you while taking a photo — they tell you cues such as if your face is in the frame, then instruct you to move up or down. The only issue is, if your hands shake or shift slightly while following the instructions, by the time it tells you to move, you might’ve already missed the perfect position, and the photo could come out blurry or slightly off. Then again, I don’t take photos of myself for social media or public sharing. If I do take one, I usually have someone else take it for me — family or friends, people I know personally. I’m not one to post much on social media, but that’s a separate topic. Anyway, I don’t think this exact idea is possible because I don’t know if VoiceOver speaks while you’re taking photos. What I was thinking instead is maybe there could be a way for the model to record a short video instead of taking a static photo. For example, when you’re using your iPhone’s camera, you can record a video and also capture a photo at the same time.

So maybe the model could use that approach — where instead of snapping a single image, it starts recording a short video. While you’re moving the phone to find the right angle, the system could detect when your face is properly in frame, then automatically capture the perfect still image from that video. That way, you wouldn’t have to worry about exact timing. It could either (a) take the photo automatically once your face is detected clearly in frame, or (b) analyze the short video afterward and extract the best single image from it.

You could also include a timer function — like how the phone’s camera takes a photo after 3 or 10 seconds — except this would be automated once the framing is correct. It wouldn’t need to be complicated. You could just have the user hold the phone, let it record briefly while adjusting the angle, and once it detects the face is clear, it captures the image automatically or selects the best frame from the clip.

You might need to talk to the users to see if this is technically possible, but that’s the general idea I was thinking of. It could be a useful feature — maybe something premium, if it takes more processing power or development work.

Maybe changing voicing to Grok4?

In the "voice mode" in Grok you can prompt Grok to "read this with a Chinese accent" or "Read this in an ASMR style soothing tone" and it'd do really well. Is there a way for us to do this with the current voices for Piccy, or maybe just change the voice set up to be voiced by Grok? Personally I find the Grok voices more human and expressive especially in other languages like Mandarin Chinese.

Guidance when taking pictures

Yeah, I was the one who suggested that feature the other day, I would even be willing to pay for it, but I think it’s a very important feature! It is true that there are other apps that can kind of help, but not really! Because even once you point out a face correctly, by the time you actually take the picture, the phone has moved and the picture comes out terrible! So what we need is the app to guide us and once we’re pointing correctly for it to automatically take the picture!🤓 as for the models, I don’t change them anymore because I don’t really see the difference, I don’t really get it so ever since I reinstalled the app I’ve been using the model that was already being used when I downloaded the app

I have to say

I have to say I am a huge fan of the Grok 4 model. I hope dictation is saying that correctly. Natalie doesn’t give extremely detailed descriptions, but I can view things that may be filtered out in the other AI models.
