Image captions are the blueprint your style training learns from. They define the subject, perspective, and art style, ensuring your trained model produces consistent, high-quality outputs.
Large Language Models (LLMs) like ChatGPT and Gemini are powerful tools for captioning. With the right prompts, you can generate captions that are structured, descriptive, and consistent, without writing each one by hand.
This guide walks through best practices and provides ready-to-use system prompts for different asset types: single characters, multiple characters, in-game items, and backgrounds. You’ll also learn how to decide when to use prefixes and suffixes, and how to lock them in before generating captions.
Best Practices for Captioning with an LLM
Consistency over creativity: Captions aren’t stories; they’re structured descriptions.
Define structure early: Use the same order in every caption (Perspective → Subject → Pose/Details → Art Style).
Be explicit: Call out angles (“isometric view,” “front-facing,” “three-quarter shot”).
Repeat the art style: End every caption with the same suffix.
Stay under 1024 characters: Enough detail to be descriptive, but concise.
Iterate: Test a batch of captions, refine the system prompt, and rerun.
For single character datasets, always begin captions with the character’s name. This reinforces identity and ensures the model consistently associates captions with the subject.
Example: “Eldrin is an elderly wizard with a long flowing white beard…”
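The checks above are easy to automate before training. The sketch below is one possible way to do it, not an official tool: the function name `validate_caption` and the constant `MAX_CAPTION_CHARS` are made up for illustration, and it assumes your suffix is repeated word for word in every caption.

```python
from typing import List, Optional

MAX_CAPTION_CHARS = 1024  # limit stated in this guide


def validate_caption(caption: str, suffix: str,
                     character_name: Optional[str] = None) -> List[str]:
    """Return a list of problems found in one caption (empty list = passes)."""
    problems = []
    text = caption.strip()
    # The guide asks for captions that stay under 1024 characters.
    if len(text) >= MAX_CAPTION_CHARS:
        problems.append(f"too long: {len(text)} characters")
    # Every caption should end with the same art style suffix.
    if not text.endswith(suffix.strip()):
        problems.append("does not end with the locked suffix")
    # Single character datasets: the caption should begin with the name.
    if character_name and not text.startswith(character_name):
        problems.append("does not begin with the character's name")
    return problems
```

Run it over a test batch, fix whatever it reports in the system prompt, and rerun.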
When to Use Prefix and Suffix (and When Not To)
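Both Prefix + Suffix | Suffix Only |
Fixed subject across the dataset (one character, one item, one environment) | Varied subject (for example, multiple characters) |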
Never Skip the Suffix: Even if a prefix isn’t needed, always use a suffix. It’s the “style glue” that ties the dataset together.
Step-by-Step: Captioning with an LLM
Prepare your image data set
Decide whether you’re captioning characters, multiple characters, items, or backgrounds.
Define prefix/suffix
Send ~10 images with rough descriptions to the LLM. Ask it to generate:
A Prefix: subject identity (if consistent across images).
A Suffix: the repeating art style sentence.
Choose prefix + suffix or suffix only
Fixed subject = use both.
Varied subject = suffix only.
Generate captions
Use the ready-to-go system prompts below. The LLM will create structured captions for your dataset.
Review & edit
Ensure perspective, structure, and art style are consistent.
Finalize for training
Lock captions under 1024 characters, all with the same suffix.
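If you assemble the final captions yourself rather than letting the LLM repeat the locked text, the prefix/suffix decision becomes a single optional argument. This is a minimal sketch; `assemble_caption` is a hypothetical helper, not part of any training tool.

```python
from typing import Optional


def assemble_caption(details: str, suffix: str,
                     prefix: Optional[str] = None) -> str:
    """Join the locked prefix (if any), per-image details, and locked suffix.

    Fixed subject  -> pass a prefix (prefix + details + suffix).
    Varied subject -> leave prefix as None (details + suffix only).
    """
    parts = [prefix, details, suffix] if prefix else [details, suffix]
    # Strip each part so every caption ends up with identical spacing.
    return " ".join(part.strip() for part in parts)
```

For example, `assemble_caption("The chest is slightly open with a blue glow.", suffix, prefix=prefix)` produces a prefix + suffix caption, while leaving out `prefix` produces a suffix-only caption.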
System Prompt: Define Prefix & Suffix
Use this first with ~10 images:
You are helping me prepare captions for style training of images.
First, I will give you ~10 example images from my image dataset.
Your task is to identify and define two things that will remain consistent across ALL captions:
1. A PREFIX: a detailed subject definition that always introduces the character, item, or object. This should capture permanent traits (appearance, colors, materials, defining features).
2. A SUFFIX: a single art style description sentence that will be repeated at the end of every caption. This should describe the rendering style, outlines, shading, lighting, and overall aesthetic.
Rules:
Do not vary these once they are defined.
Keep the prefix and suffix concise but descriptive.
Captions for the rest of the dataset will follow the structure: [Prefix], [Variable details like pose/expression/perspective], [Suffix].
After analyzing my sample images, output ONLY the locked Prefix and the locked Suffix.
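Once the LLM replies, you may want to store the locked text so later steps reuse it character for character. The sketch below assumes the reply contains one line starting with "Prefix:" and one starting with "Suffix:", as in the example workflows in this guide; real replies can be formatted differently, and `parse_locked_parts` is a hypothetical name.

```python
import re
from typing import Dict


def parse_locked_parts(reply: str) -> Dict[str, str]:
    """Extract the locked prefix and suffix from an LLM reply.

    Assumption: the reply has lines that start with "Prefix:" and "Suffix:".
    A label that is not found is simply missing from the result.
    """
    locked = {}
    for label in ("prefix", "suffix"):
        match = re.search(rf"^\s*{label}\s*:\s*(.+)$", reply,
                          flags=re.IGNORECASE | re.MULTILINE)
        if match:
            # Drop surrounding straight or curly quotes the model may add.
            locked[label] = match.group(1).strip().strip('"“”')
    return locked
```

Check the parsed values by eye before using them. A suffix-only reply that explains why no prefix was chosen would still be picked up as a prefix by this simple pattern.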
Ready-to-Use Captioning Prompts
Once your prefix and suffix are locked in, use these system prompts to generate captions for your full dataset.
Single Character Captioning
You are writing captions for images to train an AI art style.
The character’s name is [Character Name].
Every caption MUST begin with the character’s name.
Each caption must follow this structure:
[Character name + prefix with fixed character definition], [Variable details such as pose, expression, clothing changes, or shot type], [Suffix with art style].
Rules:
Do not alter the prefix or suffix.
Do not omit the character’s name.
Keep captions under 1024 characters.
Use the same sentence structure for every caption.
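If you caption several characters, filling in this prompt by hand gets error-prone. Here is a small sketch that builds it from the locked values. Substituting the actual prefix and suffix text into the structure line is an assumption about how you want to pass them to the LLM, and `build_single_character_prompt` is a hypothetical helper.

```python
SINGLE_CHARACTER_TEMPLATE = """You are writing captions for images to train an AI art style.
The character's name is {name}.
Every caption MUST begin with the character's name.
Each caption must follow this structure:
[{prefix}], [Variable details such as pose, expression, clothing changes, or shot type], [{suffix}].
Rules:
Do not alter the prefix or suffix.
Do not omit the character's name.
Keep captions under 1024 characters.
Use the same sentence structure for every caption."""


def build_single_character_prompt(name: str, prefix: str, suffix: str) -> str:
    """Fill the single character system prompt with the locked values."""
    # The guide asks for the name to come first, so reject a prefix without it.
    if not prefix.startswith(name):
        raise ValueError("the locked prefix should begin with the character's name")
    return SINGLE_CHARACTER_TEMPLATE.format(name=name, prefix=prefix, suffix=suffix)
```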
Multiple Characters Captioning (Suffix Only)
You are writing captions for images to train an AI art style featuring multiple characters.
Each caption must describe each character individually with the same structure:
[Character 1 physical description, outfit, pose, expression], [Character 2 physical description, outfit, pose, expression]. Always end with the locked art style suffix.
Rules:
Do not invent new characters — only describe what is in the image.
Do not change the suffix.
Keep captions under 1024 characters.
In-Game Item Captioning
You are writing captions for images to train an AI art style.
Each caption must follow this structure:
[Prefix describing the perspective and fixed identity of the item], [Variable details like condition, glow, or upgrades], [Suffix with art style].
Rules:
Do not change the prefix or suffix.
Always include the perspective in the caption.
Keep captions under 1024 characters.
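To confirm that the perspective rule held across a batch, a simple keyword scan is often enough. The sketch below uses the perspective phrases mentioned in this guide as its default list; swap in the exact wording from your own prefix. The name `missing_perspective` is hypothetical.

```python
from typing import List, Sequence

# Perspective phrases used as examples in this guide.
PERSPECTIVE_TERMS = ("isometric view", "front-facing", "three-quarter shot")


def missing_perspective(captions: Sequence[str],
                        terms: Sequence[str] = PERSPECTIVE_TERMS) -> List[int]:
    """Return the indexes of captions that mention none of the terms."""
    return [
        index
        for index, caption in enumerate(captions)
        if not any(term in caption.lower() for term in terms)
    ]
```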
Background Captioning
You are writing captions for images to train an AI art style.
Each caption must follow this structure:
[Prefix describing the overall environment and perspective], [Variable details such as time of day, lighting, weather, or added scenery], [Suffix with art style].
Rules:
Do not change the prefix or suffix.
Keep captions detailed (at least 900 characters, but still under 1024).
Use the same sentence structure for every caption.
Example Workflows
Single Character Example (with Character Name in Prefix)
Input (10 images):
Image 1: A tall wizard with long white beard, blue robe, holding a staff.
Image 2: Same wizard, arms raised, casting a spell.
Image 3: Same wizard, seated on a chair, looking serious.
Image 4: Same wizard, walking with a cane.
Image 5: Same wizard, smiling, holding a book.
Output:
Prefix: “Eldrin is an elderly wizard with a long flowing white beard, bushy eyebrows, and a pointed blue robe decorated with silver runes. He always carries a wooden staff and has a wise, commanding presence.”
Suffix: “The art style is vibrant and stylized, featuring bold outlines, smooth shading, and a polished, high-fidelity fantasy aesthetic suitable for mobile games.”
Example caption using prefix + suffix:
“Eldrin is an elderly wizard with a long flowing white beard, bushy eyebrows, and a pointed blue robe decorated with silver runes. He always carries a wooden staff and has a wise, commanding presence. In this image, he raises both hands dramatically while casting a spell, his expression stern and focused. The art style is vibrant and stylized, featuring bold outlines, smooth shading, and a polished, high-fidelity fantasy aesthetic suitable for mobile games.”
In-Game Item Example (with Isometric Perspective)
Input (10 images):
Image 1: Golden treasure chest in isometric view, closed with lock.
Image 2: Isometric chest slightly open with a blue glow.
Image 3: Isometric chest with curved lid, metal frame.
Image 4: Isometric chest reinforced with metal corners.
Image 5: Isometric chest glowing faintly from inside.
Output:
Prefix: “An isometric view of a golden treasure chest with reinforced metal corners, a curved lid, and an ornate lock.”
Suffix: “The art style is toon-like, with clean outlines, glossy surfaces, and bright, saturated colors that create a playful, high-quality finish.”
Background Example
Input (10 images):
Image 1: Forest clearing with grass and wildflowers.
Image 2: Same clearing, with a stone path leading forward.
Image 3: Clearing with tall pine trees in background.
Image 4: Clearing lit by golden sunlight.
Image 5: Clearing at sunset, warm tones.
Output:
Prefix: “A bright forest clearing surrounded by tall pine trees, with a grassy floor and scattered wildflowers.”
Suffix: “The art style is soft and whimsical, featuring smooth digital shading, rounded shapes, and vibrant colors with a dreamy, storybook-like atmosphere.”
Multiple Characters Example (Suffix Only)
Input (10 images):
Image 1: Tall knight in silver armor next to a rogue in dark leather.
Image 2: Same knight and rogue, both facing forward, knight serious, rogue smirking.
Image 3: Knight and rogue standing side by side, knight holding a sword, rogue with daggers.
Image 4: Both in three-quarter view, knight stern, rogue playful.
Image 5: Both standing close, lit dramatically from above.
Output:
Prefix: (Not used for this dataset — characters vary too much to lock a single prefix.)
Suffix: “The art style is dramatic and stylized, with strong directional lighting from above and a distinctive green rim light on the right side of every character, creating high contrast and a bold, polished fantasy aesthetic.”
Advanced Tips
Perspective control: Lock it in with “isometric view,” “front-facing,” “three-quarter shot.”
Lighting control: Use suffixes like “lit from the left,” “warm golden glow,” “green rim light on the right side.”
Stroke/outline control: Reinforce style with “bold outlines,” “thin sketch strokes,” or “no outline.”
Thematic control: Use suffix phrases like “toon-like, glossy finish” or “dreamy watercolor aesthetic.”
Suffix is non-negotiable: Even when skipping prefixes, always define a suffix.
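One way to keep these controls organized is to compose the suffix from the phrases you picked, then lock the result. This is only a sketch of that idea; `build_suffix` and its sentence pattern are assumptions modeled on the example suffixes in this guide, and you should still read the result and adjust the wording.

```python
def build_suffix(theme: str, outline: str, lighting: str = "") -> str:
    """Compose one art style sentence from the chosen control phrases."""
    # Keep only the controls that were actually provided.
    traits = [phrase.strip() for phrase in (outline, lighting) if phrase.strip()]
    sentence = f"The art style is {theme.strip()}"
    if traits:
        sentence += ", with " + " and ".join(traits)
    return sentence + "."


# Example: theme, outline, and lighting controls combined into one suffix.
print(build_suffix("toon-like", "bold outlines", "a warm golden glow"))
# The art style is toon-like, with bold outlines and a warm golden glow.
```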
Summary
Using an LLM for captioning makes style training faster and more reliable. With clear prefixes and suffixes, structured prompts, and repeatable phrasing, you can:
Scale up captioning across dozens of images.
Maintain consistent perspectives, angles, lighting, and art style.
Avoid drift between images and lock in your style’s identity.
Captions are the foundation of your training set — treat them like rules. The clearer and more consistent they are, the stronger your model will be.
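As a final check for drift, you can compare how every caption ends. The sketch below treats the last sentence of each caption as its suffix (the guide asks for a single-sentence suffix) and flags captions whose ending differs from the most common one. `find_suffix_drift` is a hypothetical helper, and the sentence split is deliberately simple.

```python
import re
from collections import Counter
from typing import List, Sequence, Tuple


def find_suffix_drift(captions: Sequence[str]) -> Tuple[str, List[int]]:
    """Return (most common last sentence, indexes of captions ending differently)."""
    if not captions:
        return "", []

    def last_sentence(caption: str) -> str:
        # Split after sentence-ending punctuation that is followed by whitespace.
        return re.split(r"(?<=[.!?])\s+", caption.strip())[-1]

    endings = [last_sentence(caption) for caption in captions]
    expected, _count = Counter(endings).most_common(1)[0]
    drifted = [i for i, ending in enumerate(endings) if ending != expected]
    return expected, drifted
```

Any index it returns points at a caption where the LLM reworded or dropped the suffix, which is worth fixing before training.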