This is a key aspect of how current multimodal AI works: the AI processes the image and your text together in a single turn.
It doesn't "remember" the image in the next turn unless you send it again. To ask follow-up questions, you have two options:
- Provide all your questions in the same message as the image.
- Use the image analysis mode, which keeps the image in context while you ask questions.
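
As a rough illustration of the first option and of resending the image, here is a minimal Python sketch. It calls no real API: the message layout (`image_b64`, `text`) and the helper names `build_message` and `build_followup` are hypothetical, chosen only to show that each turn has to carry the image itself.

```python
import base64


def build_message(image_bytes: bytes, questions: list[str]) -> dict:
    """Bundle one image with one or more questions into a single message.

    The dict layout is an assumption for illustration, not a real API schema.
    """
    return {
        "image_b64": base64.b64encode(image_bytes).decode("ascii"),
        "text": "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1)),
    }


def build_followup(image_bytes: bytes, question: str) -> dict:
    """Re-attach the same image so a later turn can still refer to it."""
    return build_message(image_bytes, [question])


# Stand-in bytes; in practice this would be the contents of your image file.
image = b"\x89PNG placeholder bytes"

# Option 1: ask everything in the same message as the image.
first = build_message(image, ["What is in this picture?", "What color is the car?"])

# Later follow-up: the image is sent again alongside the new question.
later = build_followup(image, "Is there any text visible?")
```

Both messages contain the full encoded image, which is the point: nothing from the first turn is assumed to carry over to the second.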