<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Dashtoon Insiders]]></title><description><![CDATA[Dashtoon Insiders]]></description><link>http://insiders.dashtoon.com/</link><image><url>http://insiders.dashtoon.com/favicon.png</url><title>Dashtoon Insiders</title><link>http://insiders.dashtoon.com/</link></image><generator>Ghost 5.80</generator><lastBuildDate>Wed, 08 Apr 2026 13:15:49 GMT</lastBuildDate><atom:link href="http://insiders.dashtoon.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[No more face mashups: UMO meets Qwen Edit]]></title><description><![CDATA[<p><em>We rebuilt multi&#x2011;identity image editing on Qwen Edit around global matching and reward-driven training. If you&#x2019;ve tried to edit images with multiple people, you&#x2019;ve felt it: faces average together, attributes leak across subjects, and &#x201C;who&#x2019;s who?&#x201D; becomes a guessing game.</em></p>]]></description><link>http://insiders.dashtoon.com/no-more-face-mashups-umo-meets-qwen-edit/</link><guid isPermaLink="false">6909a0de10b1c00065077c6f</guid><dc:creator><![CDATA[Ayushman Buragohain]]></dc:creator><pubDate>Tue, 04 Nov 2025 06:59:03 GMT</pubDate><media:content url="http://insiders.dashtoon.com/content/images/2025/11/x2--1-.png" medium="image"/><content:encoded><![CDATA[<img src="http://insiders.dashtoon.com/content/images/2025/11/x2--1-.png" alt="No more face mashups: UMO meets Qwen Edit"><p><em>We rebuilt multi&#x2011;identity image editing on Qwen Edit around global matching and reward-driven training. If you&#x2019;ve tried to edit images with multiple people, you&#x2019;ve felt it: faces average together, attributes leak across subjects, and &#x201C;who&#x2019;s who?&#x201D; becomes a guessing game. 
As the headcount grows, most pipelines crumble. Humans are ultra-sensitive to faces; one-to-one heuristics don&#x2019;t scale.</em></p><div class="kg-card kg-header-card kg-v2 kg-width-regular " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="the-challenge" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">The Challenge</span></h2>
                    
                    
                </div>
            </div>
        </div><p>Modern AI image editing has mastered the art of the solo portrait. You can conjure a photorealistic CEO or a fantasy warrior from a simple text prompt with stunning fidelity. But add a second or third person to the scene, and the magic often breaks. The core challenge?&#xA0;<strong>Maintaining distinct identities.</strong></p><p>Current models struggle to keep track of &quot;who&apos;s who.&quot; When prompted to create an image of multiple specific people, the AI often gets confused. This isn&apos;t just about placing people in a scene; it&apos;s about preserving the unique facial features, hair color, and characteristics of each individual without them blending together.</p><p>Imagine you&apos;re trying to generate a photo of three friends: Alice, Bob, and Carol.</p><p><strong>1. Facial Averaging and Feature Bleed:</strong></p><p>Instead of creating three distinct faces, the AI might produce individuals who look like strange composites of each other. Alice might end up with Bob&apos;s jawline, or Carol&apos;s nose might appear on Alice&apos;s face. The model creates an &quot;average&quot; face that incorporates features from all the references, rather than assigning them correctly. It&#x2019;s as if the AI took all the source photos and digitally mashed them together, resulting in three vaguely similar, uncanny strangers.</p><p><strong>2. Identity Swapping:</strong></p><p>Another common failure is &quot;identity swapping.&quot; The AI might generate three perfectly clear faces, but it gets their identities mixed up. If your prompt was &quot;Alice on the left, Bob in the middle, and Carol on the right,&quot; you might get Bob on the left, Alice on the right, and Carol in the middle. The model understands the features but fails to correctly map them to the positions and descriptions specified in the prompt. This makes it impossible to control the composition of a scene.</p><p><strong>3. 
The &quot;Uncanny Valley&quot; Effect:</strong></p><p>Because humans are incredibly sensitive to facial details, even minor errors can make a generated image feel &quot;off&quot; or unsettling. When features are blended or slightly distorted, the resulting faces can fall into the &quot;uncanny valley&quot;&#x2014;looking almost human, but with subtle inaccuracies that are deeply unnerving. Skin can appear overly smooth and plastic-like, and lighting can feel unnatural, further distancing the image from reality.</p><p>Solving this &quot;identity leakage&quot; problem is one of the most significant challenges in generative AI today. It&apos;s the barrier that stands between creating simple solo portraits and generating complex, believable scenes with multiple, specific individuals&#x2014;a critical step for everything from personalized family photos to commercial ad creation.</p><div class="kg-card kg-header-card kg-v2 kg-width-regular " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="umo-a-new-approach" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">UMO: A new approach</span></h2>
                    
                    
                </div>
            </div>
        </div><p>We integrated UMO (Unified Multi&#x2011;identity Optimization) into Qwen Edit to turn this multi-person chaos into crisp, consistent portraits that still follow your prompt and style. UMO reframes multi&#x2011;identity generation as a global assignment problem with reinforcement learning on diffusion models&#x2014;so every generated face matches the best reference face, and non&#x2011;matches are actively discouraged. Then we add a final realism&#x2011;alignment RL pass to remove the plastic &#x201C;AI sheen&#x201D; without sacrificing identity.</p><h2 id="how-umo-works">How UMO Works</h2><p>We detect faces in the generated image, embed them with a lightweight face encoder, and compute pairwise similarities to the reference faces. The reward boosts the matched edges and penalizes mismatches (MIMR). We apply ReReFL on the late denoising steps only&#x2014;where identity signals are stable&#x2014;to align <strong>Qwen Edit</strong> with these rewards using parameter&#x2011;efficient LoRA finetuning. Net result: fewer swaps, higher fidelity, better separation.</p>
<!--kg-card-begin: html-->
<img src="https://content.dashtoon.ai/stability-images/0656fcbe-5e3b-468a-9fdc-449b104fa616.png" alt="No more face mashups: UMO meets Qwen Edit">
<!--kg-card-end: html-->
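<p>The detect&#x2011;embed&#x2011;match&#x2011;reward loop described above can be sketched in a few lines. This is a minimal, illustrative version: <code>mimr_reward</code>, the brute-force assignment, and the unit &#x3BB; weights are simplifications of ours; the actual pipeline solves the assignment with the Hungarian algorithm over face-encoder embeddings.</p>

```python
from itertools import permutations
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mimr_reward(gen_faces, ref_faces, lam_match=1.0, lam_mismatch=1.0):
    """Toy Multi-Identity Matching Reward (MIMR).

    gen_faces / ref_faces: equal-length lists of face embeddings.
    For the small headcounts of a typical scene we brute-force the
    bipartite assignment; the real pipeline uses the Hungarian algorithm,
    which scales polynomially.
    """
    n = len(gen_faces)
    sim = [[cosine(g, r) for r in ref_faces] for g in gen_faces]
    # Global matching: the assignment that maximizes total similarity.
    best = max(permutations(range(n)),
               key=lambda p: sum(sim[i][p[i]] for i in range(n)))
    matched = sum(sim[i][best[i]] for i in range(n))
    total = sum(map(sum, sim))
    # Boost matched edges, penalize every cross-identity edge.
    return lam_match * matched - lam_mismatch * (total - matched), best
```

<p>Because the matching is global rather than positional, two reference faces presented in swapped order are still paired with the right generated faces instead of being penalized as mismatches.</p>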
<ul><li><strong>Multi&#x2011;to&#x2011;multi matching</strong>: Instead of pinning each output face to a single reference, UMO maximizes overall matching quality using a bipartite assignment (Hungarian) step. That balances intra&#x2011;identity variation (pose, expression, lighting) with inter&#x2011;identity separation.</li><li><strong>Reference Reward Feedback Learning (ReReFL)</strong>: Late&#x2011;step reward optimization on diffusion models&#x2014;where identity cues stabilize&#x2014;so gradients are clean and impactful.</li><li><strong>MIMR (Multi&#x2011;Identity Matching Reward)</strong>: Positive signal for matched pairs, negative signal for mismatches. This suppresses cross&#x2011;identity leakage and makes each person stand on their own.</li></ul><h2 id="what-it-means-for-creators">What it means for creators</h2><ul><li>Multi&#x2011;person scenes that look like your actual people&#x2014;not averaged composites.</li><li>Scales from single portraits to 3&#x2013;5 person compositions, and early wins on crowded scenes.</li></ul><div class="kg-card kg-header-card kg-v2 kg-width-regular " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="qwen-image-edit-training-recipe" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">Qwen Image Edit Training Recipe</span></h2>
                    
                    
                </div>
            </div>
        </div><ul><li><strong>Optimizer</strong>: AdamW, learning rate 5e&#x2011;6; LoRA rank 512 on attention blocks; effective batch 8 on 8&#xD7;H100.</li><li><strong>Scheduler</strong>: 30&#x2013;50 denoising steps</li><li><strong>Late&#x2011;step RL window</strong>: We choose the last ~30&#x2013;40% of steps where identity reward variance flattens. A typical example:<ul><li>If T=30 steps: focus RL on t &#x2208; [1, 10&#x2013;12] (where t=1 is the last step).</li><li>If T=50 steps: focus RL on t &#x2208; [1, 18&#x2013;20].</li></ul></li><li><strong>Rewards</strong>:<ul><li><strong>SIR (single&#x2011;identity)</strong>: cosine similarity in face&#x2011;embedding space.</li><li><strong>MIMR (multi&#x2011;identity)</strong>: &#x3BB;1=+1 for matched pairs, &#x3BB;2=&#x2212;1 for non&#x2011;matches.</li></ul></li><li><strong>Loss</strong>: total = pretrain loss + (&#x2212; reward).</li><li><strong>Data</strong>: A mix of real multi&#x2011;person video frames (with per&#x2011;ID retrieval across clips) and filtered synthetic scenes for pose/lighting diversity, all filtered with strict face&#x2011;similarity thresholds.</li></ul><div class="kg-card kg-header-card kg-v2 kg-width-regular " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="results" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">Results</span></h2>
                    
                    
                </div>
            </div>
        </div>
<!--kg-card-begin: html-->
<img src="https://content.dashtoon.ai/stability-images/a71938c4-0ca3-408a-8dd4-669703425748.png" alt="No more face mashups: UMO meets Qwen Edit">
<!--kg-card-end: html-->
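<p>To make the late&#x2011;step window from the training recipe concrete, here is a small sketch. The <code>frac=0.38</code> midpoint is our assumption, chosen only to reproduce the recipe's example windows; in practice the cutoff is set where identity-reward variance flattens.</p>

```python
def late_step_window(total_steps, frac=0.38):
    """Steps that receive reward feedback, indexed so t=1 is the LAST
    denoising step. The recipe targets roughly the final 30-40% of steps,
    where identity cues have stabilized; frac=0.38 is an assumed midpoint."""
    k = max(1, round(total_steps * frac))
    return list(range(1, k + 1))

def total_loss(pretrain_loss, reward):
    """Recipe objective: total = pretrain loss + (- reward)."""
    return pretrain_loss - reward

# T=30 -> RL on t in [1, 11]; T=50 -> RL on t in [1, 19]
```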
<div class="kg-card kg-header-card kg-v2 kg-width-regular " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="our-next-big-push-tackling-the-ai-look" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><b><strong style="white-space: pre-wrap;">Our Next Big Push: Tackling the &#x201C;AI Look&#x201D;</strong></b></h2>
                    
                    
                </div>
            </div>
        </div><p>While UMO dramatically improves identity consistency, we know the work isn&apos;t done.&#xA0;<strong>What we saw</strong>&#xA0;in some outputs were tell-tale signs of AI generation: over-smooth skin, uncanny lighting, and subtle plastic textures that break realism.</p><p><strong>Fixing this is our top priority, and it&apos;s what we are actively working on for our next release.</strong></p><p>Our solution is a final, realism-alignment RL stage that will sit on top of the UMO framework. Here&#x2019;s the plan:</p><ul><li><strong>Realism Reward:</strong>&#xA0;This new reward function will be trained on a mixture of human preference labels and a separate realism scorer. It will penalize non-photorealistic textures, unrealistic lighting, and other AI artifacts.<ul><li><strong>Identity Guardrails:</strong>&#xA0;Crucially, we will keep the MIMR reward active as a guardrail during this phase. This ensures that in our quest for realism, we don&apos;t sacrifice the identity fidelity we worked so hard to achieve.</li></ul></li><li><strong>Low-level Priors:</strong>&#xA0;We are also experimenting with auxiliary penalties that target over-smoothing directly by looking at edge and texture statistics in the generated images.</li></ul><p>This next phase will bridge the final gap between &quot;consistent&quot; and &quot;indistinguishable from a real photo.&quot; Stay tuned for updates.</p><div class="kg-card kg-header-card kg-v2 kg-width-regular " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="references" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">References</span></h2>
                    
                    
                </div>
            </div>
        </div><pre><code>@misc{wu2025qwenimagetechnicalreport,
      title={Qwen-Image Technical Report}, 
      author={Chenfei Wu and Jiahao Li and Jingren Zhou and Junyang Lin and Kaiyuan Gao and Kun Yan and Sheng-ming Yin and Shuai Bai and Xiao Xu and Yilei Chen and Yuxiang Chen and Zecheng Tang and Zekai Zhang and Zhengyi Wang and An Yang and Bowen Yu and Chen Cheng and Dayiheng Liu and Deqing Li and Hang Zhang and Hao Meng and Hu Wei and Jingyuan Ni and Kai Chen and Kuan Cao and Liang Peng and Lin Qu and Minggang Wu and Peng Wang and Shuting Yu and Tingkun Wen and Wensen Feng and Xiaoxiao Xu and Yi Wang and Yichang Zhang and Yongqiang Zhu and Yujia Wu and Yuxuan Cai and Zenan Liu},
      year={2025},
      eprint={2508.02324},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.02324}, 
}
</code></pre>
<pre><code>@article{cheng2025umo,
  title={UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward},
  author={Cheng, Yufeng and Wu, Wenxu and Wu, Shaojin and Huang, Mengqi and Ding, Fei and He, Qian},
  journal={arXiv preprint arXiv:2509.06818},
  year={2025}
}
</code></pre>
<pre><code>@article{wu2025omnigen2,
  title={OmniGen2: Exploration to Advanced Multimodal Generation},
  author={Chenyuan Wu and Pengfei Zheng and Ruiran Yan and Shitao Xiao and Xin Luo and Yueze Wang and Wanli Li and Xiyan Jiang and Yexin Liu and Junjie Zhou and Ze Liu and Ziyi Xia and Chaofan Li and Haoge Deng and Jiahao Wang and Kun Luo and Bo Zhang and Defu Lian and Xinlong Wang and Zhongyuan Wang and Tiejun Huang and Zheng Liu},
  journal={arXiv preprint arXiv:2506.18871},
  year={2025}
}
</code></pre>
]]></content:encoded></item><item><title><![CDATA[Improving Control in Flux-Driven Image Generation]]></title><description><![CDATA[<p>In our continuous effort to push the boundaries of controllable image generation, we&apos;ve identified and addressed a critical gap in how current ControlNet models interact with the Flux pipeline. Despite the power of ControlNet, existing models &#x2014; even when paired with Flux Ultra &#x2014; fell short in several</p>]]></description><link>http://insiders.dashtoon.com/improving-control-in-flux-driven-image-generation-with-custom-controlnet-integration/</link><guid isPermaLink="false">68a6bb4eea542a006531b383</guid><dc:creator><![CDATA[Ayushman Buragohain]]></dc:creator><pubDate>Thu, 21 Aug 2025 06:54:50 GMT</pubDate><media:content url="http://insiders.dashtoon.com/content/images/2025/08/image--15--1.png" medium="image"/><content:encoded><![CDATA[<img src="http://insiders.dashtoon.com/content/images/2025/08/image--15--1.png" alt="Improving Control in Flux-Driven Image Generation"><p>In our continuous effort to push the boundaries of controllable image generation, we&apos;ve identified and addressed a critical gap in how current ControlNet models interact with the Flux pipeline. Despite the power of ControlNet, existing models &#x2014; even when paired with Flux Ultra &#x2014; fell short in several key areas such as structural accuracy, prompt fidelity, and response to control signals.</p><p>To address these constraints, we&apos;ve developed a custom FluxDev fine-tune and a newly trained ControlNet variant that together produce markedly better results in structure-aware generation tasks like pose guidance, depth rendering, and edge detection conditioning.</p><div class="kg-card kg-header-card kg-v2 kg-width-full kg-content-wide " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="-the-challenge" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">&#x1F50D; The Challenge</span></h2>
                    
                    
                </div>
            </div>
        </div><p>While ControlNet-based models are generally effective at conditioning on structure-like inputs, we observed several limitations when applied within the Flux and Flux Ultra environments:</p><ul><li>Weak correlation between the input control maps and the generated features</li></ul><p><img src="https://content.dashtoon.ai/stability-images/0a9e910f-13ad-4aca-b15f-683100a64b64.png" alt="Improving Control in Flux-Driven Image Generation" loading="lazy"></p>
<p><img src="https://content.dashtoon.ai/stability-images/64837c93-b4aa-4dcb-be38-26cd28b1c5cb.png" alt="Improving Control in Flux-Driven Image Generation" loading="lazy"></p>
<ul><li>Poor overall aesthetics in the outputs</li></ul><p><img src="https://content.dashtoon.ai/stability-images/1fc402e9-f66e-4c73-939b-2f0fa94b7400.png" alt="Improving Control in Flux-Driven Image Generation" loading="lazy"></p>
<p><img src="https://content.dashtoon.ai/stability-images/f38e1fd2-513c-4ba8-b16b-ac1653c44696.png" alt="Improving Control in Flux-Driven Image Generation" loading="lazy"></p>
<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">These issues were consistent across multiple pretrained ControlNet checkpoints and became more pronounced under heavier Flux configurations.</div></div><p>&#x1F4F7;&#x2002;Sample Output &#x2014; Existing ControlNet with Flux Ultra</p><p><img src="https://content.dashtoon.ai/stability-images/cabd0598-140b-4ed6-81f1-58927a3319ce.png" alt="Improving Control in Flux-Driven Image Generation" loading="lazy"></p>
<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">Note how the generated pose diverges significantly from control input under standard models.</em></i></div></div><div class="kg-card kg-header-card kg-v2 kg-width-full kg-content-wide " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="-our-solution" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">&#x1F6E0;&#xFE0F; Our Solution</span></h2>
                    
                    
                </div>
            </div>
        </div><p>We initiated a comprehensive re-architecture of the Flux conditioning pipeline:</p><ul><li>Fine-tuned a FluxDev variant optimized for multi-stream attention flow</li><li>Trained a custom ControlNet model on hybrid internal+public datasets across multiple conditioning maps (pose, depth, canny)</li><li>Introduced dynamic control strength adaptation to maintain guidance integrity across a range of prompt lengths and noise thresholds</li></ul><p>These updates significantly improve signal fidelity while still preserving generative flexibility.</p><h3 id="%F0%9F%93%8A-performance-benchmarks">&#x1F4CA; Performance Benchmarks</h3><p>We evaluated our new control stack along three key dimensions: Control Adherence, Prompt Consistency, and Structural Error.</p>
<!--kg-card-begin: html-->
<table>
<thead>
<tr>
<th>Metric</th>
<th>Baseline (Existing ControlNet + Flux Ultra)</th>
<th>New Method (Custom FluxDev + ControlNet)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Control Map Adherence (SSIM)</td>
<td>0.68</td>
<td>0.84</td>
</tr>
<tr>
<td>Prompt/Control Harmony (%)</td>
<td>71%</td>
<td>92%</td>
</tr>
<tr>
<td>Structural Deviation (low = better)</td>
<td>0.342</td>
<td>0.112</td>
</tr>
<tr>
<td>Artifact Rate (per 100 imgs)</td>
<td>18.2</td>
<td>4.7</td>
</tr>
</tbody>
</table>
<!--kg-card-end: html-->
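<p>The dynamic control strength adaptation mentioned above can be illustrated with a toy schedule. Every constant and the decay shape here are our assumptions for illustration only; the production schedule is not disclosed.</p>

```python
def control_strength(step, total_steps, base=1.0,
                     prompt_tokens=0, max_tokens=77, floor=0.2):
    """Hypothetical per-step ControlNet conditioning scale.

    Decays the control signal linearly toward `floor` over the denoising
    trajectory, and damps it further for long prompts so text guidance
    is not drowned out late in sampling. Illustrative only.
    """
    progress = step / total_steps                 # 0.0 at start, 1.0 at end
    time_decay = 1.0 - (1.0 - floor) * progress   # linear decay to `floor`
    prompt_damp = 1.0 - 0.3 * min(prompt_tokens / max_tokens, 1.0)
    return base * time_decay * prompt_damp
```

<p>A schedule of this shape keeps structural guidance strong early (when layout is decided) while letting the text prompt dominate the final, detail-refining steps.</p>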
<p>Additionally, average inference latency stayed within &#xB1;5% compared to the previous method, indicating no significant trade-offs in speed.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">Benchmarks conducted on internal validation set using consistent prompts, control types, and seeds.</div></div><h3 id="%F0%9F%A7%A0-architecture-notes"><strong>&#x1F9E0; Architecture Notes</strong></h3><p>Below is a high-level overview of what has changed in the architecture layout:</p><p><img src="https://content.dashtoon.ai/stability-images/55c3cc71-15a1-4d63-9004-1c24da53b941.png" alt="Improving Control in Flux-Driven Image Generation" loading="lazy"></p>
<ul><li><strong>ControlNet Input Pathway</strong>:<ul><li>Swapped out standard adapter layers for multi-head attention fusion blocks</li><li>Introduced gated skip connections from early encoder positions for stronger pose retention</li></ul></li><li><strong>Training Setup</strong><ul><li>Dataset: 1.3M control-labeled image pairs (pose, depth, etc.)</li><li>Target losses: MSE for control adherence, perceptual loss for image fidelity</li></ul></li></ul><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">We&#x2019;ve intentionally abstracted some fine-grain implementation specifics to preserve internal IP, but we plan to share more experimental results in an upcoming technical deep-dive.</div></div><div class="kg-card kg-header-card kg-v2 kg-width-full kg-content-wide " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="-whats-next" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">&#x2795; What&#x2019;s Next</span></h2>
                    
                    
                </div>
            </div>
        </div><p>We&#x2019;re continuing to iterate on other control domains, including semantic maps and sketch-based prompts, using similar architectural principles. Additionally, we&#x2019;re exploring interpolation guidance &#x2014; where users can blend between multiple control signals dynamically during generation.</p><p>However, one issue we&#x2019;ve observed with this model is that generations often adopt an overly blueish color tone.</p><style>
  .title-container {
    display: flex;
    justify-content: center;
    align-items: center;
    height: 100vh; /* Adjust this value to position the title vertically */
  }
  
  .title {
    font-size: 2.5em;
    text-align: center;
    color: #333;
    font-family: 'Helvetica Neue', sans-serif;
    letter-spacing: 0.1em;
    padding: 0.5em 0;
    background: transparent;
  }
  
  .title span {
    background: -webkit-linear-gradient(45deg, #7ed56f, #28b485);
    -webkit-background-clip: text;
    -webkit-text-fill-color: transparent;
  }
  
  .masonry-grid {
    column-count: 2;
    column-gap: 15px;
  }

  .masonry-item {
    margin-bottom: 15px;
    break-inside: avoid;
  }

  .custom-image {
    width: 100%;
    height: auto;
    display: block;
    border-radius: 10px;
    transition: transform .7s;
  }

  .custom-image:hover {
    transform: scale(1.05);
  }
</style>
<div class="masonry-grid">
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/stability-images/13d163c8-a800-46c3-8932-328664367ab2.png" alt="Improving Control in Flux-Driven Image Generation">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/stability-images/41714a7f-99fa-4ed5-80aa-cc1d717ad244.png" alt="Improving Control in Flux-Driven Image Generation">
  </div>
</div><p>This unintended color bias reduces the naturalness of outputs and makes them appear more stylized than realistic. The effect is especially noticeable on skin tones, clothing, and ambient lighting, where cooler hues dominate regardless of the intended palette.</p><p>We suspect a few possible causes for this issue:</p><ul><li><strong>Training data imbalance</strong> &#x2013; if a significant portion of the dataset contains cooler/blue-tinted lighting conditions, the model may overfit to that distribution.</li><li><strong>Conditioning signal leakage</strong> &#x2013; certain control modalities (e.g., depth maps or sketch prompts) might bias the network toward cooler tones due to how they were preprocessed or normalized.</li><li><strong>Interpolation interactions</strong> &#x2013; blending multiple control signals dynamically may amplify subtle biases, leading to a systematic shift toward blueish palettes.</li></ul><p>Addressing this requires better color consistency handling within the control framework to ensure that user-specified prompts and reference conditions are respected without introducing systematic tinting artifacts.</p>]]></content:encoded></item><item><title><![CDATA[Introducing Hunyuan Keyframe LoRA: Open-Source Keyframe-Based Video Generation]]></title><description><![CDATA[<div class="kg-card kg-header-card kg-v2 kg-width-regular " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="introduction" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">Introduction</span></h2>
                    
                    
                </div>
            </div>
        </div><p>In the realm of AI-driven video creation, tools like Runway&apos;s Gen-3 Alpha Turbo and Kling have showcased the potential of keyframe-based generation, enabling smooth transitions between specified frames. Inspired by this approach, we present <strong>Hunyuan Keyframe LoRA</strong>, an open-source solution built upon the <a href="https://github.com/Tencent/HunyuanVideo?ref=insiders.dashtoon.com" rel="noreferrer"><strong>Hunyuan Video</strong></a> framework. This</p>]]></description><link>http://insiders.dashtoon.com/introducing-hunyuan-keyframe-lora-open-source-keyframe-based-video-generation/</link><guid isPermaLink="false">67bc1fdfcfaace006599d680</guid><dc:creator><![CDATA[Ayushman Buragohain]]></dc:creator><pubDate>Thu, 27 Feb 2025 11:08:51 GMT</pubDate><media:content url="http://insiders.dashtoon.com/content/images/2025/02/Frame-1.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card kg-header-card kg-v2 kg-width-regular " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="introduction" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">Introduction</span></h2>
                    
                    
                </div>
            </div>
        </div><img src="http://insiders.dashtoon.com/content/images/2025/02/Frame-1.png" alt="Introducing Hunyuan Keyframe LoRA: Open-Source Keyframe-Based Video Generation"><p>In the realm of AI-driven video creation, tools like Runway&apos;s Gen-3 Alpha Turbo and Kling have showcased the potential of keyframe-based generation, enabling smooth transitions between specified frames. Inspired by this approach, we present <strong>Hunyuan Keyframe LoRA</strong>, an open-source solution built upon the <a href="https://github.com/Tencent/HunyuanVideo?ref=insiders.dashtoon.com" rel="noreferrer"><strong>Hunyuan Video</strong></a> framework. This model empowers creators to define keyframes and generate seamless video sequences, all within an open-source ecosystem.</p><table>
<thead>
<tr>
<th>Image 1</th>
<th>Image 2</th>
<th>Generated Video</th>
</tr>
</thead>
<tbody>
<tr>
<td><img src="https://content.dashtoon.ai/stability-images/41aeca63-064a-4003-8c8b-bfe2cc80d275.png" alt="Introducing Hunyuan Keyframe LoRA: Open-Source Keyframe-Based Video Generation" loading="lazy"></td>
<td><img src="https://content.dashtoon.ai/stability-images/28956177-3455-4b56-bb6c-73eacef323ca.png" alt="Introducing Hunyuan Keyframe LoRA: Open-Source Keyframe-Based Video Generation" loading="lazy"></td>
<td><video controls autoplay src="https://content.dashtoon.ai/stability-images/14b7dd1a-1f46-4c4c-b4ec-9d0f948712af.mp4"></video></td>
</tr>
<tr>
<td><img src="https://content.dashtoon.ai/stability-images/ddabbf2f-4218-497b-8239-b7b882d93000.png" alt="Introducing Hunyuan Keyframe LoRA: Open-Source Keyframe-Based Video Generation" loading="lazy"></td>
<td><img src="https://content.dashtoon.ai/stability-images/b603acba-40a4-44ba-aa26-ed79403df580.png" alt="Introducing Hunyuan Keyframe LoRA: Open-Source Keyframe-Based Video Generation" loading="lazy"></td>
<td><video controls autoplay src="https://content.dashtoon.ai/stability-images/b00ba193-b3b7-41a1-9bc1-9fdaceba6efa.mp4"></video></td>
</tr>
<tr>
<td><img src="https://content.dashtoon.ai/stability-images/5298cf0c-0955-4568-935a-2fb66045f21d.png" alt="Introducing Hunyuan Keyframe LoRA: Open-Source Keyframe-Based Video Generation" loading="lazy"></td>
<td><img src="https://content.dashtoon.ai/stability-images/722a4ea7-7092-4323-8e83-3f627e8fd7f8.png" alt="Introducing Hunyuan Keyframe LoRA: Open-Source Keyframe-Based Video Generation" loading="lazy"></td>
<td><video controls autoplay src="https://content.dashtoon.ai/stability-images/0cb84780-4fdf-4ecc-ab48-12e7e1055a39.mp4"></video></td>
</tr>
<tr>
<td><img src="https://content.dashtoon.ai/stability-images/69d9a49f-95c0-4e85-bd49-14a039373c8b.png" alt="Introducing Hunyuan Keyframe LoRA: Open-Source Keyframe-Based Video Generation" loading="lazy"></td>
<td><img src="https://content.dashtoon.ai/stability-images/0cef7fa9-e15a-48ec-9bd3-c61921181802.png" alt="Introducing Hunyuan Keyframe LoRA: Open-Source Keyframe-Based Video Generation" loading="lazy"></td>
<td><video controls autoplay src="https://content.dashtoon.ai/stability-images/ce12156f-0ac2-4d16-b489-37e85c61b5b2.mp4"></video></td>
</tr>
</tbody>
</table>
<div class="kg-card kg-header-card kg-v2 kg-width-regular " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="architecture-enhancing-keyframe-integration" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">Architecture: Enhancing Keyframe Integration</span></h2>
                    
                    
                </div>
            </div>
        </div><p>&#x200B;Our architecture builds upon existing models, introducing key enhancements to optimize keyframe-based video generation:&#x200B;</p><ul><li><strong>Input Patch Embedding Expansion</strong>: We modify the input patch embedding projection layer to effectively incorporate keyframe information. By adjusting the convolutional input parameters, we enable the model to process image inputs within the Diffusion Transformer (DiT) framework.&#x200B;</li><li><strong>LoRA Integration</strong>: We apply Low-Rank Adaptation (LoRA) across all linear layers and the convolutional input layer. This approach facilitates efficient fine-tuning by introducing low-rank matrices that approximate the weight updates, thereby preserving the base model&apos;s foundational capabilities while reducing the number of trainable parameters.</li><li><strong>Keyframe Conditioning</strong>: The model is conditioned on user-defined keyframes, allowing precise control over the generated video&apos;s start and end frames. This conditioning ensures that the generated content aligns seamlessly with the specified keyframes, enhancing the coherence and narrative flow of the video.&#x200B;</li></ul><p>These architectural modifications collectively enhance the model&apos;s ability to generate high-quality videos that adhere closely to user-defined keyframes, all while maintaining computational efficiency.</p><div class="kg-card kg-header-card kg-v2 kg-width-regular " style="background-color: #000000;" data-background-color="#000000">
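<p>In terms of the weights, the patch-embedding expansion and the LoRA updates described above come down to simple linear algebra. The following is a minimal numpy sketch with illustrative dimensions, not the model's real shapes (the actual patch embedding is a 3D convolution and training would be done in PyTorch):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Input patch-embedding expansion ---
# The conv-in layer originally projects C_in latent channels; widening it
# to 2 * C_in lets keyframe latents ride alongside the video latents. The
# new channels are zero-initialised so the expanded layer initially
# behaves exactly like the original.
C_in, C_out = 16, 64                      # illustrative channel counts
W_orig = rng.normal(size=(C_out, C_in))   # flattened conv kernel
W_expanded = np.concatenate([W_orig, np.zeros((C_out, C_in))], axis=1)

x_video = rng.normal(size=(C_in,))
x_keyframe = np.zeros(C_in)               # no keyframe signal yet
x = np.concatenate([x_video, x_keyframe])
assert np.allclose(W_expanded @ x, W_orig @ x_video)  # behaviour preserved

# --- LoRA update on a linear layer ---
# W' = W + (alpha / r) * B @ A, with B zero-initialised so training
# starts from the frozen base weights.
d, r, alpha = 64, 8, 16
W = rng.normal(size=(d, d))
A = rng.normal(size=(r, d))               # down-projection
B = np.zeros((d, r))                      # up-projection, zero init
W_adapted = W + (alpha / r) * (B @ A)
assert np.allclose(W_adapted, W)          # identity at initialisation
print("trainable params per layer:", A.size + B.size, "vs full:", W.size)
```

<p>Only <code>A</code> and <code>B</code> are trained, which is why the adapter stays small while the base model's capabilities are preserved.</p>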
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="curating-highquality-motion-sequences" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">Curating High-Quality Motion Sequences</span></h2>
                    
                    
                </div>
            </div>
        </div><p>A meticulously curated dataset is essential for training effective keyframe-based models. Our data collection strategy includes:&#x200B;</p><ul><li><strong>OpenVideo1M Subset</strong>: We selected approximately 20,000 samples from <a href="https://huggingface.co/datasets/nkp37/OpenVid-1M?ref=insiders.dashtoon.com" rel="noreferrer">OpenVideo1M</a>, focusing on clips with high aesthetic value and dynamic motion.&#x200B;</li><li><strong>Dashtoon Internal Dataset</strong>: An additional 5,000 samples were incorporated from Dashtoon&apos;s proprietary dataset, primarily centered on human subjects.&#x200B;</li><li><strong>Data Filtering</strong>: Utilizing scripts from <a href="https://github.com/aigc-apps/EasyAnimate/tree/main?ref=insiders.dashtoon.com" rel="noreferrer">EasyAnimate</a>, we filtered the dataset to exclude low-quality or repetitive frames, ensuring a diverse and high-quality training set.&#x200B;</li></ul><div class="kg-card kg-header-card kg-v2 kg-width-regular " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="leveraging-keyframes-for-seamless-transitions" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">Leveraging Keyframes for Seamless Transitions</span></h2>
                    
                    
                </div>
            </div>
        </div><p>Our training process is designed to fully leverage the power of keyframes, ensuring high-quality, temporally consistent video generation:</p><ul><li><strong>Keyframe Sampling:</strong> We condition the model using static keyframes, specifically selecting the initial and final frames of videos as anchors for generation. To maintain consistency, we ensure that these keyframes do not contain significant motion, preventing unwanted artifacts in the generated sequence.</li><li><strong>Motion-Aware Sampling:</strong> During training, careful selection of keyframes is crucial&#x2014;ensuring that the start and end frames are truly static helps establish a clear visual reference, allowing the model to infer smooth motion transitions more effectively.</li><li><strong>Temporal Consistency:</strong> The model is trained to generate intermediate frames that naturally bridge the defined keyframes, ensuring smooth, coherent transitions. By optimizing for temporal stability, the model maintains consistency in motion, structure, and subject appearance across the generated sequence.</li></ul><p>These refinements enable <strong>Hunyuan Keyframe LoRA</strong> to produce high-quality, controlled video generations that align closely with user-defined keyframes while preserving natural motion dynamics.</p><p>&#x200B;During inference, <strong>Hunyuan Keyframe LoRA</strong> offers enhanced flexibility and control, enabling users to craft videos that align closely with their creative vision:&#x200B;</p><ul><li><strong>Keyframe Specification</strong>: Users can define the initial and final frames, setting precise visual anchors that guide the video&apos;s narrative and flow.&#x200B;</li><li><strong>Variable Video Length</strong>: Trained on sequences of varying lengths, the model can generate 33 to 121 frames between the specified keyframes, offering adaptability in video duration.&#x200B;</li><li><strong>Textual Prompts</strong>: While the model can operate without 
textual input, incorporating descriptive prompts significantly enriches the generated content&apos;s detail and relevance.&#x200B;</li></ul><p>By combining keyframe control with flexible video lengths and optional textual guidance, <strong>Hunyuan Keyframe LoRA</strong> empowers creators to produce dynamic and tailored video content.</p><div class="kg-card kg-header-card kg-v2 kg-width-regular " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="conclusion" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">Conclusion</span></h2>
                    
                    
                </div>
            </div>
        </div><p>&#x200B;<strong>Hunyuan Keyframe LoRA</strong> democratizes keyframe-based video generation by offering an open-source alternative to proprietary models. We are fully open-sourcing the model weights, enabling the community to access, utilize, and build upon our work. Additionally, we are collaborating with the Hugging Face Diffusers library to integrate Hunyuan Keyframe LoRA, streamlining its adoption and fostering a more intuitive creative process. Future developments will focus on refining keyframe conditioning techniques, expanding dataset diversity, and enhancing user interfaces to make the creative process more seamless and accessible.</p><ul><li>Model weights are available on HuggingFace: <a href="https://huggingface.co/dashtoon/hunyuan-video-keyframe-control-lora?ref=insiders.dashtoon.com" rel="noreferrer">[Control Lora]</a></li><li>Training code is available on <a href="https://github.com/dashtoon/hunyuan-video-keyframe-control-lora/?ref=insiders.dashtoon.com" rel="noreferrer">[Github]</a></li></ul>]]></content:encoded></item><item><title><![CDATA[DashTailor - Training Free Clothing and Object Transfer for AI Comics]]></title><description><![CDATA[<p>While making comics using generative AI, achieving consistency in clothing and specific objects across panels is essential. It&#x2019;s easier to do with characters and art style by using LoRAs. 
But it&#x2019;s impractical to train a LoRA for clothing because using too many concept LoRAs together creates</p>]]></description><link>http://insiders.dashtoon.com/dashtailor-training-free-clothing-and-object-transfer-for-ai-comics/</link><guid isPermaLink="false">67b7224ccfaace006599d615</guid><dc:creator><![CDATA[Amogh Vaishampayan]]></dc:creator><pubDate>Sat, 22 Feb 2025 12:37:44 GMT</pubDate><media:content url="http://insiders.dashtoon.com/content/images/2025/02/blog-header-2.png" medium="image"/><content:encoded><![CDATA[<img src="http://insiders.dashtoon.com/content/images/2025/02/blog-header-2.png" alt="DashTailor - Training Free Clothing and Object Transfer for AI&#xA0;Comics"><p>While making comics using generative AI, achieving consistency in clothing and specific objects across panels is essential. It&#x2019;s easier to do with characters and art style by using LoRAs. But it&#x2019;s impractical to train a LoRA for clothing because using too many concept LoRAs together creates unwanted artefacts. A character might wear different outfits in different scenes, so generating the clothing as a part of the character by including it consistently in every image of the LoRA dataset isn&#x2019;t practical either. We needed a method to upload an image of clothing or an object and transfer it seamlessly into the target image.</p><p>To solve this problem, we used an interesting capability of the Flux Fill inpainting model that transfers concepts from one part of an image to another part of the&#xA0;<em>same image</em>&#xA0;remarkably well. The workflow for this technique can be found <a href="https://github.com/dashtoon/dashtailor-workflow?ref=insiders.dashtoon.com" rel="noreferrer">here</a>.</p><h2 id="approach">Approach</h2><p><strong>Masking:</strong>&#xA0;Draw masks on the object and target images</p>
<p><strong>Object image</strong> &#x2013; the mask covers the piece of clothing the character needs to wear</p>
<table>
<thead>
<tr>
<th>Object image</th>
<th>Object mask</th>
</tr>
</thead>
<tbody>
<tr>
<td><img src="https://github.com/user-attachments/assets/548788ae-e64a-4804-9411-2a97d12e1e77" alt="DashTailor - Training Free Clothing and Object Transfer for AI&#xA0;Comics" loading="lazy"></td>
<td><img src="https://github.com/user-attachments/assets/a265ef6a-02dc-4354-a440-8313b2c6c833" alt="DashTailor - Training Free Clothing and Object Transfer for AI&#xA0;Comics" loading="lazy"></td>
</tr>
</tbody>
</table>
<p><strong>Target image</strong> &#x2013; the mask should cover only the part of the image where the object must be applied, such as the character&#x2019;s body</p>
<table>
<thead>
<tr>
<th>Target image</th>
<th>Target mask</th>
</tr>
</thead>
<tbody>
<tr>
<td><img src="https://github.com/user-attachments/assets/229d46b4-3711-4b08-b30d-cda340e182cf" alt="DashTailor - Training Free Clothing and Object Transfer for AI&#xA0;Comics" loading="lazy"></td>
<td><img src="https://github.com/user-attachments/assets/13e64550-509b-4b88-9c4a-f3117df6faf8" alt="DashTailor - Training Free Clothing and Object Transfer for AI&#xA0;Comics" loading="lazy"></td>
</tr>
</tbody>
</table>
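<p>These object and target masks feed the extraction, compositing, and final-crop steps of the approach. A minimal numpy sketch with toy arrays standing in for real images (nearest-neighbour indexing stands in for a proper PIL resize):</p>

```python
import numpy as np

def isolate(image, mask):
    """Keep only the masked pixels; everything else goes blank (white)."""
    out = np.full_like(image, 255)
    out[mask] = image[mask]
    return out

def scale_to_height(image, target_h):
    """Nearest-neighbour resize preserving aspect ratio (a real pipeline
    would use PIL; plain indexing keeps the sketch dependency-free)."""
    h, w = image.shape[:2]
    new_w = round(w * target_h / h)
    rows = (np.arange(target_h) * h / target_h).astype(int)
    cols = (np.arange(new_w) * w / new_w).astype(int)
    return image[rows][:, cols]

# Toy 8x6 "object" image and a mask covering its centre.
obj = np.arange(8 * 6 * 3, dtype=np.uint8).reshape(8, 6, 3)
mask = np.zeros((8, 6), dtype=bool)
mask[2:6, 1:5] = True
target = np.zeros((16, 10, 3), dtype=np.uint8)   # stand-in target image

masked_obj = isolate(obj, mask)
scaled = scale_to_height(masked_obj, target.shape[0])
composite = np.concatenate([scaled, target], axis=1)  # join side by side

# After inpainting, crop the left portion (the scaled object's width)
# to recover the edited target image.
result = composite[:, scaled.shape[1]:]
assert result.shape == target.shape
```

<p>The composite then goes to the inpainting model; the crop at the end is what returns the target image with the transferred clothing.</p>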
<p><strong>Extracting the Masked Object:</strong>&#xA0;Isolate the part of the object image within the mask, leaving the rest of the image blank. The original aspect ratio of the image was retained.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://github.com/user-attachments/assets/89c6c37e-fa3d-443f-944d-79354e9710b4" class="kg-image" alt="DashTailor - Training Free Clothing and Object Transfer for AI&#xA0;Comics" loading="lazy" width="1024" height="1024"><figcaption><span style="white-space: pre-wrap;">Masked object</span></figcaption></figure><p><strong>Using GPT-4o Vision:</strong>&#xA0;Describe the isolated object to use as a prompt. The GPT instruction prompt is: <code>Describe the clothing, object, design or any other item in this image. Be brief and to the point. Avoid starting phrases like &quot;This image contains...&#x201D;</code><br>In this case the extracted object prompt was <code>Metallic plate armor with intricate designs, including a winged emblem on the chest. Brown leather straps and accents secure the armor, complemented by layered pauldrons and arm guards.</code></p><p><strong>Creating a Composite Image:</strong>&#xA0;Join the object image and the target image side by side. 
Scale the object image height up or down to match the target image.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://github.com/user-attachments/assets/b0163e03-219d-4116-862a-0130d4ef90a4" class="kg-image" alt="DashTailor - Training Free Clothing and Object Transfer for AI&#xA0;Comics" loading="lazy" width="2048" height="1024"><figcaption><span style="white-space: pre-wrap;">Composite image</span></figcaption></figure><p><strong>Pose Controlnet:</strong> To maintain the pose of the original subject while inpainting, we passed the composite image into an Openpose annotator and used Flux Union Controlnet with type Openpose and strength value of 0.8.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://github.com/user-attachments/assets/013460c0-3b7a-435a-ae95-e05ed1669fe3" class="kg-image" alt="DashTailor - Training Free Clothing and Object Transfer for AI&#xA0;Comics" loading="lazy" width="1024" height="512"><figcaption><span style="white-space: pre-wrap;">Openpose annotation</span></figcaption></figure><p><strong>Flux Fill Inpainting:</strong>&#xA0;Use the composite image along with the GPT extracted prompt as the conditioning to guide the inpainting process. 
The inpainting parameters are: </p><ul><li>Flux Guidance: 50</li><li>Denoise: 1</li><li>Steps: 20</li><li>CFG: 1</li><li>Sampler: Euler</li><li>Scheduler: Beta</li></ul><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://github.com/user-attachments/assets/10970235-17f3-450d-9477-a53c08d9def2" class="kg-image" alt="DashTailor - Training Free Clothing and Object Transfer for AI&#xA0;Comics" loading="lazy" width="2048" height="1024"><figcaption><span style="white-space: pre-wrap;">Initial Flux Fill output</span></figcaption></figure><p><strong>Cropping the composite image:</strong> Crop the output from the left by the width of the scaled object mask image to get the target image with the transferred output.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://github.com/user-attachments/assets/41b6c465-6b52-4f7b-9628-0d5157985a20" class="kg-image" alt="DashTailor - Training Free Clothing and Object Transfer for AI&#xA0;Comics" loading="lazy" width="1024" height="1024"><figcaption><span style="white-space: pre-wrap;">Final output by inpainting</span></figcaption></figure><p><strong>Improving quality:</strong> This output may still have some rough edges and artifacts. There may also be a style mismatch if the object image was in a different art style than the target image. So we recommend doing one final inpainting pass at 0.15 to 0.2 denoise with the same object prompt from earlier. It&#x2019;s important that this inpainting pass uses the checkpoint, LoRA, and generation configuration suitable for the target image style. This will eliminate any artifacts and ensure style consistency. 
For example, if your target image is in anime style, use an anime checkpoint or Lora for this step.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://github.com/user-attachments/assets/c1e22d4d-cd1f-42f0-b22a-eefa131f84d3" class="kg-image" alt="DashTailor - Training Free Clothing and Object Transfer for AI&#xA0;Comics" loading="lazy" width="1024" height="1024"><figcaption><span style="white-space: pre-wrap;">Cropped initial output</span></figcaption></figure><p><strong>Using Flux Redux: </strong>Optionally, you can use the Flux Redux model in the ComfyUI workflow instead of relying on GPT to write a prompt based on the masked object. To use Redux, change the value in this node in the <em>Generation parameters</em> section from 1 to 2. However, Redux is tricky to use. If your object and target masks do not perfectly match, black patches may be generated. For a general use case I would recommend using a prompt.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://github.com/user-attachments/assets/b6d911b7-2c57-4722-97e4-680d57ea78ad" class="kg-image" alt="DashTailor - Training Free Clothing and Object Transfer for AI&#xA0;Comics" loading="lazy" width="848" height="317"><figcaption><span style="white-space: pre-wrap;">Flux Redux setting</span></figcaption></figure><h2 id="why-does-this-work"><strong>Why Does This Work?</strong></h2><p>When you combine the object and target images into a single composite image, you&#x2019;re providing the model with a unified context. This allows the model to leverage the spatial and visual cues from both images simultaneously. 
Flux Fill is excellent at inpainting because its architecture utilizes the context of the image very well.</p><p>When both the object and the target areas are present in the same composite image, the model can more effectively exploit the relationship between them and ensure the masked area is filled in a way that aligns with the object&#x2019;s characteristics and the overall scene. The model&#x2019;s attention mechanism can more easily correlate the visual features of the object with the masked area, ensuring a precise transfer and guiding the model to fill in the missing parts with high fidelity.</p><p>This method effectively transfers the isolated object into the target image, maintaining consistency across different poses and orientations.</p><h2 id="how-to-use-dashtailor">How to use DashTailor</h2><p>You can <a href="https://github.com/dashtoon/dashtailor-workflow?ref=insiders.dashtoon.com" rel="noreferrer">run this workflow in ComfyUI</a>.</p><p>DashTailor is also available to use on <a href="https://dashtoon.com/studio?ref=insiders.dashtoon.com" rel="noreferrer">dashtoon.com/studio</a>. 
Just Create a new Dashtoon and use the tool in the Editor section.</p><h2 id="more-examples">More examples</h2><figure class="kg-card kg-image-card"><img src="https://github.com/user-attachments/assets/4bd6ebf3-8794-47e9-8d09-247e58abf3f1" class="kg-image" alt="DashTailor - Training Free Clothing and Object Transfer for AI&#xA0;Comics" loading="lazy" width="1024" height="384"></figure><figure class="kg-card kg-image-card"><img src="https://github.com/user-attachments/assets/f4f166cf-cf80-4510-bcd7-95675a285f49" class="kg-image" alt="DashTailor - Training Free Clothing and Object Transfer for AI&#xA0;Comics" loading="lazy" width="1024" height="341"></figure><figure class="kg-card kg-image-card"><img src="https://github.com/user-attachments/assets/bbb38d93-f2e7-4f64-8213-788af846437b" class="kg-image" alt="DashTailor - Training Free Clothing and Object Transfer for AI&#xA0;Comics" loading="lazy" width="1024" height="377"></figure><figure class="kg-card kg-image-card"><img src="https://github.com/user-attachments/assets/d7c6ed5f-a895-44d7-bfca-67b68f2f5162" class="kg-image" alt="DashTailor - Training Free Clothing and Object Transfer for AI&#xA0;Comics" loading="lazy" width="1024" height="365"></figure>]]></content:encoded></item><item><title><![CDATA[A Road Towards Tuning-Free Identity-Consistent Character Inpainting]]></title><description><![CDATA[<p>At Dashtoon, we are dedicated to simplifying and enabling seamless visual storytelling through comic creation, with character creation at the heart of this process. 
Our mission is to make storytelling accessible to everyone by empowering users to craft personalized characters&#x2014;whether entirely original or inspired by real-world personas&#x2014;</p>]]></description><link>http://insiders.dashtoon.com/a-road-towards-tuning-free-id-consistent-character-inpainting/</link><guid isPermaLink="false">6790998bcfaace006599d547</guid><category><![CDATA[research]]></category><dc:creator><![CDATA[Naman Rastogi]]></dc:creator><pubDate>Wed, 22 Jan 2025 10:55:22 GMT</pubDate><media:content url="http://insiders.dashtoon.com/content/images/2025/01/diag1b-1-1.png" medium="image"/><content:encoded><![CDATA[<img src="http://insiders.dashtoon.com/content/images/2025/01/diag1b-1-1.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"><p>At Dashtoon, we are dedicated to simplifying and enabling seamless visual storytelling through comic creation, with character creation at the heart of this process. Our mission is to make storytelling accessible to everyone by empowering users to craft personalized characters&#x2014;whether entirely original or inspired by real-world personas&#x2014;that bring their narratives to life. Comics, by nature, involve creating multi-panel, multi-page stories, where maintaining consistency in character design is essential. Ensuring that characters retain their intrinsic attributes&#x2014;such as facial features, hairstyles, body shapes, and other physical traits&#x2014;across diverse scenes and frames is critical to achieving narrative coherence and visual continuity (Fig A and B). This focus on character consistency underpins the immersive storytelling experience we aim to deliver, offering users the flexibility to shape narratives that reflect their ideas, identities, and imagination.</p>
<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/oz2--1-.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig A: Character: Asian, Female, Adult, Red eyes, Short fringed, Outer dark black hair, Inner yellow hair, Bangs, Skinny.</figcaption>
  </figure>
<!--kg-card-end: html-->

<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/oz3--1-.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig B: Character: Caucasian, Female, Adult, Long Brown Wavy hair, wearing a blue top with a beige blazer.</figcaption>
  </figure>
<!--kg-card-end: html-->
<p>To support this, we are working to adapt Text-to-Image generative models for efficient and scalable ID-driven synthesis, which is formally defined in the next section.</p><h2 id="1-id-consistent-synthesis%E2%80%94what-is-it">1 | ID-Consistent Synthesis&#x2014;What is it?</h2><p>It&#x2019;s a direction within the field of text-to-image generation aimed at creating visually coherent images that maintain a subject&apos;s identity (ID) across varied prompts defining diverse situations or artistic variations. Unlike traditional generative models that focus on general styles or objects, ID-consistent synthesis deals with a relatively subtle task of preserving intricate identity details&#x2014;such as facial features&#x2014;of a person or object from a single or limited reference image. Recent advances in diffusion models like Stable Diffusion [19], DALL-E3 [18], and Flux [15] have significantly improved creative, context-rich image generation. However, ensuring identity consistency remains challenging, as these models balance preserving identity with aligning to diverse prompts.</p><h3 id="11-the-spectrum-of-methods">1.1 | The Spectrum of Methods</h3><p>The problem can fundamentally be addressed through three distinct dimensions, each defined by the role of model training:</p><ul><li><strong>Test-time optimization methods</strong>, exemplified by techniques like DreamBooth [10] and Textual Inversion [11], rely on fine-tuning the model specifically for each subject or object. These methods are both time-consuming and computationally demanding as they require per-subject optimization and often multiple reference images, which limit their scalability in practical applications.</li><li><strong>Tuning-free methods</strong>. The focus of these is to tune specialized ID adapters that inject identity information into the base generative model. 
Approaches like InstantID [2] and PULID [13] leverage these adapters or conditioning strategies to generate consistent images for unseen subjects, avoiding per-subject optimization.</li><li><strong>Training-free approaches</strong> eliminate the need for any training. These methods focus on inference-time manipulations to achieve identity consistency, primarily by adjusting attention layers across batches to align with the reference identity. We will touch on this briefly at the end of our article.</li></ul><p>Among these approaches, the focus of this article lies on the second dimension&#x2014;<strong>tuning-free methods</strong>&#x2014;and the advancements we have contributed in this space.</p><p>Let&#x2019;s begin with how we constructed our data pipeline to facilitate extraction of consistent character images for model training.</p><h2 id="2-dataset-construction-flow">2 | Dataset Construction Flow</h2><p></p>
<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/datasetiddiag--1-.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig 0: Overview of dataset construction pipeline for extracting consistent character images from videos.</figcaption>
  </figure>
<!--kg-card-end: html-->
<p>Curating datasets for consistent character generation presents specific challenges due to the inherent requirement for structured data in the form of triplets. These triplets consist of: <strong>(1)</strong> a reference or ID image that defines the subject to remain consistent across generations, <strong>(2)</strong> a text prompt describing how the subject should appear in the generated image, and <strong>(3)</strong> a target image showcasing the subject as specified by the text prompt.</p><p>To meet our dataset requirements, we opted for a large-scale collection of <strong>movies</strong>. Movies offer a valuable resource, as they consist of numerous image frames featuring various characters across different scenes. By leveraging appropriate deep learning approaches, we can efficiently extract multiple consistent images of distinct characters from a single movie, providing a rich source of structured data for our needs.</p><p>Furthermore, our goal was to train the model using<strong><em> in-the-wild reference images</em></strong>, ensuring that user-provided reference images are not limited to portrait, front-facing, or close-up formats. Character images extracted from movie frames satisfied this requirement for the dataset.</p><h3 id="21-character-aggregation-phase">2.1 | Character aggregation phase</h3><p>Below are details for the dataset construction pipeline:</p><ol><li><strong>Frame extraction:</strong> For each movie, frames were extracted using a frame skip value of 50, chosen as an optimal balance based on empirical observations. A lower value yielded redundant frames with minimal variation, while a higher value risked missing sufficient consistent and diverse frames for character representation. In total, frames were extracted from over 1200 movies scraped from the web.</li><li><strong>Face detection stage:</strong> We use the ArcFace [3] module to detect human faces in the frames extracted from the movie. Frames with no detected faces are discarded. 
For frames where multiple faces are detected, we calculate the pairwise cosine similarity of the face embeddings and sum these similarities for each detected face. The face with the lowest cumulative similarity score is selected, as it is deemed the most distinctive among the detected faces. This ensures consistency in selecting a single representative face per frame, even when multiple faces are present. The selected face&apos;s embedding, bounding box, keypoints, and corresponding frame information are stored for further processing.</li><li><strong>Grouping distinct characters:</strong> After extracting the face embeddings and their associated metadata from the frames, we group the frames into clusters based on cosine similarity of the embeddings. This process uses a similarity threshold to identify frames that likely belong to the same character. Frames are iteratively compared, and those with a similarity score above the defined threshold are grouped together. Each group represents a distinct character, with its frames ordered by descending similarity scores. Once the groups are formed, we apply additional filtering criteria to ensure the quality and usability of the groups. Groups with fewer than the minimum required frames are discarded, while a maximum number of frames is retained for groups exceeding the limit. The final output is a set of structured character data, where each distinct character is represented by its frame IDs, bounding boxes, and facial landmarks.</li></ol><p>      With this, we extracted around 50k IDs from the movies.  </p>
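<p>The per-frame face selection and the character grouping described above reduce to cosine-similarity arithmetic over ArcFace embeddings. A hedged numpy sketch, with random vectors standing in for real embeddings and an illustrative threshold (the production pipeline's threshold and ordering details may differ):</p>

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def most_distinctive(face_embs):
    """With several faces in one frame, pick the one whose summed cosine
    similarity to the others is lowest, i.e. the most distinctive."""
    e = normalize(face_embs)
    sim = e @ e.T                       # pairwise cosine similarity
    return int(np.argmin(sim.sum(axis=1)))

def group_characters(frame_embs, threshold=0.5):
    """Greedy threshold clustering: a frame joins the most similar
    existing group if its similarity clears the threshold, otherwise
    it seeds a new character group."""
    e = normalize(frame_embs)
    groups, centroids = [], []          # each group: list of frame indices
    for i, emb in enumerate(e):
        sims = [float(emb @ c) for c in centroids]
        if sims and max(sims) >= threshold:
            groups[int(np.argmax(sims))].append(i)
        else:
            groups.append([i])
            centroids.append(emb)
    return groups

rng = np.random.default_rng(0)
char_a, char_b = rng.normal(size=512), rng.normal(size=512)
frames = np.stack([char_a + 0.1 * rng.normal(size=512) for _ in range(3)]
                  + [char_b + 0.1 * rng.normal(size=512) for _ in range(2)])
print(group_characters(frames))   # noisy copies of the same ID cluster together
```

<p>Random 512-dimensional embeddings of different identities have near-zero cosine similarity, so even a modest threshold cleanly separates the two characters in this toy example.</p>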
<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/4diffsim-1.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig 1: Subset of characters extracted from the movie Simulant (2023)</figcaption>
  </figure>
<!--kg-card-end: html-->
<h3 id="22-mask-generation">2.2 | Mask Generation</h3><p>For the consistent frames aggregated in the previous step, we generate segmentation masks for the target character using the SAM (Segment Anything Model) [7]. Since each frame may contain multiple faces or human figures, the face bounding box stored for each frame is uniquely associated with the target character. To generate masks, we utilize the midpoint of the face bounding box as the input to the segmentation model. For each input frame, the model produces a multi-mask output, from which we select the mask with the largest area.</p><p><strong>Note 1:</strong> The quality of the generated masks can be further enhanced by incorporating whole-body bounding boxes. This is achieved by using Grounding-DINO or other object detectors to detect all bounding boxes for humans within the frame. The bounding box with the largest overlap with the target face bounding box is then used as the input to the segmentation model, providing a more accurate and refined segmentation of the target character.</p><p><strong>Note 2:</strong> <a href="https://github.com/CartoonSegmentation/CartoonSegmentation?ref=insiders.dashtoon.com">Specialized segmentation models</a> can be employed for anime images, as general-purpose models such as SAM often exhibit suboptimal mask quality in certain cases.</p><p>Fig 2.0 and 2.1: Extracted frames for a character with its corresponding segmentation masks</p>
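<p>The point-prompted mask generation above can be sketched as follows. The commented-out call mirrors the segment-anything <code>SamPredictor</code> interface; dummy candidate masks stand in for real SAM output so the sketch stays self-contained, with no checkpoint required:</p>

```python
import numpy as np

def face_box_midpoint(box):
    """(x1, y1, x2, y2) -> the single prompt point fed to SAM."""
    x1, y1, x2, y2 = box
    return np.array([[(x1 + x2) / 2, (y1 + y2) / 2]])

def pick_largest(masks):
    """SAM's multi-mask output gives several candidates; keeping the
    largest area tends to select the whole character rather than just
    the face region around the prompt point."""
    areas = [m.sum() for m in masks]
    return masks[int(np.argmax(areas))]

# With the real model this would look like:
#   predictor.set_image(frame)
#   masks, scores, _ = predictor.predict(
#       point_coords=face_box_midpoint(face_box),
#       point_labels=np.array([1]),      # 1 = foreground point
#       multimask_output=True)
#   character_mask = pick_largest(masks)

# Dummy stand-ins keep the sketch runnable:
h, w = 64, 64
small = np.zeros((h, w), dtype=bool)
large = np.zeros((h, w), dtype=bool)
small[10:20, 10:20] = True               # face-sized candidate
large[5:60, 15:50] = True                # body-sized candidate
chosen = pick_largest(np.stack([small, large]))
assert chosen.sum() == large.sum()
```

<p>Swapping the prompt point for a whole-body bounding box, as suggested in Note 1, only changes which prompt argument is passed to the predictor.</p>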
<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/frame_2331_eab2b477-6628-45b8-99ed-32a707b8b854-1.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig 2.0: The character inside the bounding box is the extracted character for this set of images. From the movie Encanto.</figcaption>
  </figure>
<!--kg-card-end: html-->

<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/frame_2331_eab2b477-6628-45b8-99ed-32a707b8b854_mask-1.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig 2.1: Corresponding masks for the extracted character as in Fig 2.0.</figcaption>
  </figure>
<!--kg-card-end: html-->
<p>More Samples</p>
<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/WhatsApp-Image-2025-01-22-at-1.59.09-PM.jpeg" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig 3: From Avatar - The Way of Water</figcaption>
  </figure>
<!--kg-card-end: html-->

<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/frame_0809_f86f1265-9f36-4b2a-97b3-b63ebd5a3817.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig 4: From Avatar - The Way of Water</figcaption>
  </figure>
<!--kg-card-end: html-->

<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/frame_29988_c6d0561c-f7ca-49f5-99fc-a27235816ce3.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig 5: From Batman</figcaption>
  </figure>
<!--kg-card-end: html-->
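The selection logic from Section 2.2 can be sketched as follows. This is a minimal NumPy-only illustration; the SAM call itself is omitted (with the segment-anything library it would be `SamPredictor.predict(..., multimask_output=True)`), and these helpers only supply the point prompt and post-select the mask.

```python
import numpy as np

def face_box_midpoint(box):
    """Midpoint of an (x1, y1, x2, y2) face bounding box, used as the
    point prompt for the segmentation model."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def pick_largest_mask(masks):
    """From a multi-mask output of shape (N, H, W) (boolean masks),
    keep the candidate with the largest area, as described above."""
    areas = masks.reshape(masks.shape[0], -1).sum(axis=1)
    return masks[int(np.argmax(areas))]
```

In the actual pipeline, `pick_largest_mask` would be applied to the candidate masks SAM returns for the point prompt.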
<h3 id="23-captioning">2.3 | Captioning</h3><p>In this study, we used <strong>InternVL2-26B</strong> [6] as the Vision-Language Model (VLM) to generate high-quality captions tailored to our needs, providing descriptive textual annotations for target images.</p><p>To ensure precision and relevance, we developed a structured <strong>instruction prompt template</strong> to guide the VLM in producing focused descriptions. The prompt emphasized key aspects such as appearance, clothing, expressions, gestures, and interactions while excluding irrelevant details. Constraints on word count (100&#x2013;150 words) and avoidance of subjective or speculative details further refined the output, ensuring captions aligned with the inpainting task&apos;s requirements.</p><blockquote>The prompt template, shown below, helped meet our training objectives.<br>  <br><em><strong>Prompt Template: </strong><br>Describe the image in detail, STRICTLY keeping it under 100-150 words, in a single coherent paragraph. <br>DO NOT BEGIN WITH &apos;This image shows&apos;, &apos;In this image&apos;, &apos;The image depicts&apos;, etc. Clearly specify the gender (man, woman, boy, girl, or male/female). <br>Focus on the character&apos;s appearance, gestures, poses, clothing, and any accessories they might be wearing. <br>If the character is interacting with any objects, describe the nature of the interaction clearly, including interactions with the background.<br>Describe the character&apos;s expressions and emotions. <br>Mention any text visible on the character or objects (e.g., logos, patterns, or labels). <br>Specify the lighting&#x2019;s direction, intensity, and its effect on the character (e.g., shadows or highlights on the body or clothing). <br>Indicate the style of the image (e.g., cartoon, photograph, 3D render) and avoid adding subjective interpretations or speculation. 
Keep the description strictly factual and focus solely on the observable details within the image</em>.</blockquote><p>Below are some sample images with their generated captions.</p>
<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/o1.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig 6: A man with dark hair and a beard, wearing a black suit and a white shirt with a tie, is kneeling at a wooden table. He is holding an orange in his hands, seemingly about to peel it. On the table in front of him are several peeled orange segments. The background features a green wall, a globe on a stand, and a chair. The lighting is soft and natural, coming from the left side of the image, casting gentle shadows. The style of the image is a detailed anime illustration. The man&apos;s expression is focused and slightly intense.</figcaption>
  </figure>
<!--kg-card-end: html-->

<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/o2.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig 7: A young girl with long brown hair sits on the carpeted stairs, wearing a blue dress with a pattern of small dots. She holds a clear glass in her right hand and looks directly at the camera with a neutral expression. The stairs have white railings and the carpet is a light green color. The background features a yellow wall with a leafy pattern and a decorative fan hanging on the wall. The lighting is soft and natural, coming from the left side of the image, casting gentle shadows. The style of the image is a photograph. There are no other characters or text visible in the image.</figcaption>
  </figure>
<!--kg-card-end: html-->
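The constraints in the prompt template above can also be enforced programmatically before a caption is accepted into the training set. The helper below is a hypothetical post-filter shown for illustration (it is not part of the published pipeline); the banned openers and word budget come directly from the template.

```python
# Openers the prompt template explicitly forbids.
BANNED_OPENERS = ("this image shows", "in this image", "the image depicts")

def caption_ok(caption, min_words=100, max_words=150):
    """Accept a VLM caption only if it stays within the template's
    word budget and does not begin with a banned opener."""
    text = caption.strip()
    lowered = text.lower()
    if any(lowered.startswith(p) for p in BANNED_OPENERS):
        return False
    n_words = len(text.split())
    return min_words <= n_words <= max_words
```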
<h2 id="3-towards-the-approach">3 | Towards the Approach</h2><p>Existing methods struggle to generate characters in diverse poses, facial expressions, and varying backgrounds using a single model with just a reference character image and prompt description&#x2014;an open problem to date. To address this, and in alignment with Dashtoon&#x2019;s custom requirements and internal workflow for comic creation, we viewed the problem through the lens of inpainting, leveraging its suitability for our structured comic creation process.</p><blockquote>Formally, the objective is to repurpose a text-to-image generative model for ID-consistent character inpainting. The task involves generating an inpainted image by utilizing user-provided inputs, including a base image, a binary mask defining the region to be inpainted, a reference image specifying the identity and appearance of the character, and a textual prompt describing the final inpainted image. The approach should ensure that the inpainted character maintains identity consistency with the reference image, adheres to the attributes specified in the prompt, and integrates seamlessly into the masked region of the base image.</blockquote><p>We chose SDXL [12] for this research to meet our specific requirements; however, the proposed method is adaptable and can be extended to other text-to-image models, including Diffusion Transformer-based models such as Flux [15], SD3.5 [17], and others.</p><h3 id="31-preliminaries">3.1 | Preliminaries</h3><p><strong>3.1.1 | Cross-Attention</strong></p><p>In text-to-image models such as Stable Diffusion, text conditioning is achieved through the cross-attention mechanism, where CLIP-generated text embeddings are integrated into the cross-attention layers to guide the generation process. Formally,</p>
<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/image--56-.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
</figure>
<!--kg-card-end: html-->
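For reference, the standard text-conditioning cross-attention (restated here in common notation; symbols may differ from the figure's exact typesetting):

```latex
Z = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,
\qquad Q = Z_{\text{latent}} W_q,\quad K = c_{\text{text}} W_k,\quad V = c_{\text{text}} W_v
```

where the queries come from the U-Net's spatial features and the keys and values are projected from the CLIP text embeddings.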
<p>Theoretically, this cross-attention mechanism can be extended to incorporate information from other modalities, provided that the information can be effectively projected into the model&#x2019;s cross-attention embedding space. Let&#x2019;s discuss this further in the next subsection.</p><p><strong>3.1.2 | Extending Cross-Attention for Image Conditioning</strong></p><p>Reference-based text-to-image generation requires effectively injecting identity information into the base diffusion model through some form of conditioning for ID awareness. <a href="https://github.com/tencent-ailab/IP-Adapter?ref=insiders.dashtoon.com">IP-Adapter</a> [1] introduced a mechanism that enables a pretrained text-to-image diffusion model to generate images with an image as a prompt. It uses a decoupled cross-attention mechanism that injects image features through an additional cross-attention layer in the base model, alongside the text cross-attention layer. Formally, that means:</p>
<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/image--57-.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
  </figure>
<!--kg-card-end: html-->
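In common notation, IP-Adapter's decoupled cross-attention adds a second attention term over image features with its own key/value projections:

```latex
Z^{\text{new}} = \mathrm{Attention}(Q, K_t, V_t) + \lambda \,\mathrm{Attention}(Q, K_i, V_i)
```

with \(K_t, V_t\) projected from the text embeddings, \(K_i, V_i\) from the projected image features, and \(\lambda\) scaling the image contribution (symbols are ours, restating the formula in the figure above).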
<p>While [1] demonstrates the addition of image features through a dedicated cross-attention layer <em>(let&#x2019;s call it <strong>ICA1</strong>)</em> for image conditioning, we can further extend this concept to include more cross-attention layers, each injecting different modalities into the base model to provide supplementary information that might not be fully captured by the earlier ones.</p><p>The caveat is that cross-attention extension essentially operates as a weighted addition, which inherently dilutes the information being integrated: the more cross-attention layers we add, the greater the risk of diminishing the distinctiveness of the injected information. In practice, however, our experiments and empirical observations showed that adding one more cross-attention layer provided additional information that was not captured by the initial image cross-attention layer.</p><p>We will discuss this further in the upcoming sections.</p><h3 id="32-method-discussion">3.2 | Method Discussion</h3>
<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/diag1b-1.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig 8: Overview of the proposed framework for ID-consistent character inpainting. Reference images are processed to extract facial and body embeddings using specialized modules (ArcFace and InternViT, respectively), which are projected into SDXL&#x2019;s text embedding space and injected into SDXL as extra cross-attention layers (ICA1 for facial features and ICA2 for global body-level features). Pose-IdentityNet further refines the output by replacing text embeddings with face embeddings (following [2]) to ensure accurate identity preservation and pose adherence, since it is conditioned on pose keypoint images.</figcaption>
  </figure>
<!--kg-card-end: html-->
<p><strong>3.2.1 | SDXL for inpainting</strong></p><p>To adapt SDXL for the inpainting task, we followed the standard approach of using 9-channel inputs for the U-Net backbone. The base input image and the masked image were encoded into 4-channel latents using the SDXL VAE, with the binary mask downsampled to latent resolution according to the SDXL VAE&apos;s scale factor. These components were then concatenated in the following order: image latents + mask + masked image latents, forming the final input to the U-Net.</p><p><strong>3.2.2 | Injecting ID information - Facial Features</strong></p><p>To incorporate identity information into the SDXL U-Net, we divide the process into two parts. In the first part, we employ an ArcFace module (antelope-v2) [3] to extract facial features from the reference image. These features are then projected into the SDXL cross-attention embedding space via a trainable projection module known as the <strong>Perceiver-Resampler</strong> [4]. The Perceiver-Resampler accepts input image features and transforms them into a fixed number of learnable queries, which serve as the output feature sequence. These features are subsequently used to extract key and value embeddings, and are then combined with the queries obtained from self-attention to form the final cross-attention mechanism, following the approach described in [1].</p><p>In the second part, we follow the approach outlined in [2] to enhance face-ID information within SDXL. Specifically, we employ the <strong>ControlNet (or IdentityNet)</strong> described in [2], initialized from the same checkpoint. 
In this setup, text embeddings are replaced with the facial embeddings extracted in the first part, ensuring the network focuses exclusively on face information for improved identity awareness.</p><p><strong>3.2.3 | Injecting Pose information</strong></p><p>In the original IdentityNet framework described in [2], five facial keypoints (two for the eyes, one for the nose, and two for the mouth) are used as the conditioning image to provide spatial control information to the ControlNet. However, in the context of our problem space, this approach proved suboptimal, as it confines the facial region strictly to the layout of the user reference image. Moreover, the facial expressions of the reference image are perpetuated in the generation process (see Fig. 9). This outcome is at odds with our goal, which requires the generated image to conform to the pose and facial expressions of the target character being inpainted.</p>
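Returning to Section 3.2.1, the 9-channel U-Net input can be sketched at the shape level with dummy tensors (a minimal illustration, not the actual training code; the VAE encoding step is stubbed out with random latents):

```python
import torch

def build_inpaint_unet_input(image_latents, mask, masked_image_latents):
    """Concatenate along the channel dimension in the order used in
    Section 3.2.1: image latents (4ch) + mask (1ch) + masked-image
    latents (4ch) -> 9-channel U-Net input."""
    assert image_latents.shape[1] == 4 and masked_image_latents.shape[1] == 4
    assert mask.shape[1] == 1  # binary mask resized to latent resolution
    return torch.cat([image_latents, mask, masked_image_latents], dim=1)

# Shape check with dummy latents: 1024x1024 image / VAE scale factor 8 = 128x128.
lat = torch.randn(1, 4, 128, 128)
msk = torch.ones(1, 1, 128, 128)
x = build_inpaint_unet_input(lat, msk, torch.randn(1, 4, 128, 128))
```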
<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/WhatsApp-Image-2025-01-22-at-2.53.18-PM.jpeg" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig 9: Examples of problematic generations when using five facial keypoints as the conditioning image are shown. In each sample, the first image represents the reference image, while the subsequent images depict the generated outputs.</figcaption>
  </figure>
<!--kg-card-end: html-->
<p>Hence, we opted to extract a set of 133 pose keypoints in the OpenPose format for the character defined by the mask region in the user-provided base image, which is to be inpainted with the character from the reference image. The extracted keypoint image (Fig 10) is then provided to our version of IdentityNet, referred to as Pose-IdentityNet. This approach offers additional pose information for the target character, while also capturing a richer range of facial expressions through the expanded keypoint set. For this, we used the RTMPose-l (384x288) variant, based on MMPose [5], to extract the required keypoint set. Note that the pose model receives the masked image containing only the target character, allowing keypoints to be extracted exclusively for the intended subject.</p>
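Feeding the pose model only the target character, as described above, amounts to zeroing out everything outside the character mask before keypoint extraction. A minimal NumPy sketch (the actual pipeline then runs RTMPose on the result, which is not reproduced here):

```python
import numpy as np

def isolate_character(image, mask):
    """Zero out pixels outside the character mask so the pose model
    only sees the intended subject.
    image: (H, W, 3) array, mask: (H, W) boolean array."""
    return np.where(mask[..., None], image, 0)
```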
<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/oz--2-.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig 10: Input base image and the corresponding keypoints extracted. This keypoint image is fed as the spatial control image in our version of ControlNet (Pose-IdentityNet).</figcaption>
  </figure>
<!--kg-card-end: html-->
<p><strong>3.2.4 | Injecting ID information - Body features</strong></p><p>In addition to the facial information supplied to SDXL via cross-attention and Pose-IdentityNet, we introduce an additional cross-attention layer to incorporate more global information about the character from the reference image. This step ensures that the generated outputs fully align with the reference subject&#x2019;s overall body appearance, including details such as hairstyle. As discussed in Section 3.1.2, the cross-attention in the SDXL U-Net can be further extended to incorporate additional information. Formally, this can be expressed as:</p>
<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/image--58-.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting">   </figure>
<!--kg-card-end: html-->
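In the same notation as Section 3.1.2, the final attention output combines the text stream with both image streams (facial ICA1 and global-body ICA2); symbols are ours, restating the formula in the figure above:

```latex
Z^{\text{new}} = \mathrm{Attention}(Q, K_t, V_t)
+ \lambda_1\,\mathrm{Attention}(Q, K_f, V_f)
+ \lambda_2\,\mathrm{Attention}(Q, K_b, V_b)
```

where \(K_f, V_f\) are projected from the ArcFace facial features and \(K_b, V_b\) from the InternViT body features.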
<p>The first image cross-attention mechanism is utilized to inject facial ID features, while the second image cross-attention is employed for incorporating global ID features. To extract these global features, we leverage the InternViT-300M-448px [6] model, which processes the segmented character region obtained from the reference image and outputs image embeddings. The extracted features are then passed through a Perceiver-Resampler module, which projects them into the SDXL cross-attention embedding space, similar to what we did for facial ID features.</p><p>This straightforward extension enabled the model to better focus on body-related attributes, including physical details, shapes, and hairstyles.</p><p><strong>3.2.5 | Training details</strong></p><p>Figure 8 provides an overview of the training flow and pipeline, which includes the following trainable components: two Perceiver-Resampler projection modules, a facial ID cross-attention layer, a global ID cross-attention layer, and the entire Pose-IdentityNet. The experiments were conducted using SDXL 1.0 while keeping its base weights frozen. Model training was performed on 8 H100 GPUs (80 GB), with a batch size of one sample per GPU.</p><p>Approximately 50k distinct IDs were extracted from movie footage; although each character had multiple images available, only one was used as a reference image and another as the target image, thereby forming the necessary triplet data points (reference image, target image, and target prompt). An additional set of approximately 30k single-image anime IDs was also incorporated; in these cases, a horizontally flipped version of the reference image served as the target image to create the triplet data.</p><p>To further enhance generalization, we randomly dropped text, face, and body feature embeddings for classifier-free guidance. We also employed multi-resolution training via aspect-ratio bucketing (following [8]) to support multi-resolution inputs. 
Additionally, the model was trained on longer descriptive prompts (following [9]), using a maximum token length of 154.</p><h3 id="33-sample-results">3.3 | Sample Results</h3><p>From left to right (or top to bottom): Base Image, Mask, Extracted Pose Keypoints (from the character in the masked region), Reference Image, Inpainted Image.</p>
<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/oz--3-.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig 11: Inpaint Prompt: A man playing a piano.</figcaption>
  </figure>
<!--kg-card-end: html-->

<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/oz--4-.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig 12: Inpaint Prompt: A boy in a green t-shirt holding a pen.</figcaption>
  </figure>
<!--kg-card-end: html-->

<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/oz--5-.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig 13: Inpaint Prompt: A middle-aged man with silver hair, wearing a beige V-neck sweater over a dark blue collared shirt, smiling.</figcaption>
  </figure>
<!--kg-card-end: html-->
<h1 id="4-conclusion">4 | Conclusion</h1><p>This blog post presented an approach to Zero-Shot ID-Consistent Character Inpainting, addressing the demands of our comic-creation pipelines. To achieve this goal, we introduced a data construction process that extracts consistent character frames from large-scale movie datasets, thereby enabling robust ID-consistent character training on in-the-wild images (flexible reference images that do not need to be portraits, as shown in the sample results). Our proposed inpainting architecture integrates additional cross-attention over global image features, ensuring the faithful injection of the reference character&#x2019;s physical attributes. Furthermore, we incorporated a pose-conditioned ControlNet to promote accurate pose and facial attribute adherence&#x2014;an essential requirement for high-fidelity character inpainting. To the best of our knowledge, this is the first work to explicitly focus on zero-shot ID-preserving character inpainting. The empirical results underscore the effectiveness of our approach in maintaining character identity and pose alignment.</p><p>We also encountered challenges with our proposed model. One primary issue is color saturation (Fig 14), an artifact carried over from the InstantID-based ideas that we leveraged, which is similarly reflected in our outputs. Additionally, the model sometimes struggles with rendering complex hairstyles (Fig 15), indicating the need for more robust handling of fine-grained attributes. Lastly, the reliance on external modules such as ArcFace and pose extraction introduces potential failure points, as inaccuracies in these priors affect subsequent stages of the generation pipeline.</p>
<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/oz2--3-.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig. 14: From left to right: base input image, mask, pose keypoints image, reference character image, and inpainted image. Observe the color saturation in the inpainted image, highlighting an issue identified in the InstantID model.</figcaption>
  </figure>
<!--kg-card-end: html-->

<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/oz--6-.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig. 15: From left to right: base input image, mask, pose keypoints image, reference character image, and inpainted image. The inpainted character fails to accurately replicate the hairstyle of the reference character.</figcaption>
  </figure>
<!--kg-card-end: html-->
<h3 id="41-scaling-data-with-the-synthetic-images"><strong>4.1 | Scaling data with synthetic images</strong></h3><p>As discussed earlier, recent advancements in training-free approaches have demonstrated promising results for achieving ID consistency. These methods tend to leverage cross-image attention manipulations across batches to maintain consistency effectively. A notable example is [14], which also employs Diffusion Features (DIFT) [16] to establish dense visual correspondences across images, ensuring that the subject appears consistent across different generations.</p><p>In the context of data generation for training the architecture proposed in this work, the approach outlined in [14] can be effectively leveraged to further scale the dataset. By providing unique character descriptions&#x2014;generated based on specific requirements and utilizing available LLMs&#x2014;to the model, and enabling the generation of batches of two or more consistent images under varying prompt settings, it can serve as a strong medium for creating diverse training samples.</p><h2 id="what-nextefforts-on-training-free-approaches">What Next - Efforts on training-free approaches</h2><p>As discussed in the preceding section, we are concurrently exploring training-free approaches for zero-shot ID-consistent character synthesis, not only for data scaling but also with the goal of adapting them to support reference images, inpainting, and other tools integral to the character creation process.</p><p>Recent flow-based methods, such as Flux [15], also present a promising foundation for achieving training-free ID-consistent character generation, owing to their better prompt adherence and anatomy fidelity. Our research efforts, therefore, also focus on advancing strategies for encoding reference image information and improving consistency mechanisms within these flow-based models.</p><p><em>Glimpse of what we are currently cooking:</em></p>
<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/yann-lecun-scaled--1---1-.jpg" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig 16: Reference image featuring Yann LeCun.</figcaption>
  </figure>
<!--kg-card-end: html-->

<!--kg-card-begin: html-->
<figure>
<img src="https://content.dashtoon.ai/assets/free-id-consistent-character/oz2--4-.png" alt="A Road Towards Tuning-Free Identity-Consistent Character Inpainting"> 
<figcaption>Fig 17: Images generated from the reference image in Fig 16 using our method in its initial development phase. The model was not trained on the ID-consistency task. These results come from the training-free approaches we briefly discussed in Section 1.1.</figcaption>
  </figure>
<!--kg-card-end: html-->
<p>Stay tuned!</p><h1 id="references">References</h1><p>[1] <a href="https://github.com/tencent-ailab/IP-Adapter?ref=insiders.dashtoon.com">IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models</a><br>[2] <a href="https://github.com/instantX-research/InstantID?ref=insiders.dashtoon.com">InstantID: Zero-shot Identity-Preserving Generation in Seconds</a><br>[3] <a href="https://insightface.ai/arcface?ref=insiders.dashtoon.com">ArcFace: Additive Angular Margin Loss for Deep Face Recognition</a><br>[4] <a href="https://arxiv.org/pdf/2204.14198v2?ref=insiders.dashtoon.com">Flamingo: a Visual Language Model for Few-Shot Learning</a><br>[5] <a href="https://arxiv.org/abs/2303.07399?ref=insiders.dashtoon.com">RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose</a><br>[6] <a href="https://github.com/OpenGVLab/InternVL?ref=insiders.dashtoon.com">OpenGVLab-InternVL</a><br>[7] <a href="https://github.com/facebookresearch/segment-anything?ref=insiders.dashtoon.com">Segment Anything (SAM)</a><br>[8] <a href="https://github.com/PixArt-alpha/PixArt-sigma?ref=insiders.dashtoon.com">PixArt-&#x3A3;: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation</a><br>[9] <a href="https://github.com/kohya-ss/sd-scripts/blob/main/docs/train_SDXL-en.md?ref=insiders.dashtoon.com">kohya-ss sd scripts - SDXL training</a><br>[10] <a href="https://dreambooth.github.io/?ref=insiders.dashtoon.com">DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation</a><br>[11] <a href="https://textual-inversion.github.io/?ref=insiders.dashtoon.com">An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion</a><br>[12] <a href="https://arxiv.org/abs/2307.01952?ref=insiders.dashtoon.com">SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis</a><br>[13] <a href="https://github.com/ToTheBeginning/PuLID?ref=insiders.dashtoon.com">PuLID: Pure and Lightning ID 
Customization via Contrastive Alignment</a><br>[14] <a href="https://github.com/NVlabs/consistory?ref=insiders.dashtoon.com">ConsiStory: Training-Free Consistent Text-to-Image Generation</a><br>[15] <a href="https://github.com/black-forest-labs/flux?ref=insiders.dashtoon.com">Flux</a><br>[16] <a href="https://arxiv.org/abs/2306.03881?ref=insiders.dashtoon.com">Emergent Correspondence from Image Diffusion</a><br>[17] <a href="https://stability.ai/news/introducing-stable-diffusion-3-5?ref=insiders.dashtoon.com">Stable Diffusion 3.5</a><br>[18] <a href="https://openai.com/index/dall-e-3/?ref=insiders.dashtoon.com" rel="noopener noreferrer">DALL-E3</a><br>[19] <a href="https://arxiv.org/abs/2112.10752?ref=insiders.dashtoon.com" rel="noopener noreferrer">High-Resolution Image Synthesis with Latent Diffusion Models</a></p>]]></content:encoded></item><item><title><![CDATA[Insights from Our Adversarial Diffusion Distillation POC]]></title><description><![CDATA[<p>When building a comic creation platform, speed and efficiency are paramount. Users expect rapid feedback, especially when generating images through multiple iterations. To meet this demand, we turned to Adversarial Diffusion Distillation, a cutting-edge technique designed to speed up image generation without sacrificing quality. 
In this blog post, we&apos;</p>]]></description><link>http://insiders.dashtoon.com/exploring-the-future-of-comic-generation-insights-from-our-adversarial-diffusion-distillation-poc/</link><guid isPermaLink="false">6727ad20904fd300654570d3</guid><dc:creator><![CDATA[Ayushman Buragohain]]></dc:creator><pubDate>Mon, 04 Nov 2024 06:14:34 GMT</pubDate><media:content url="http://insiders.dashtoon.com/content/images/2024/11/Dashtoon-Studio-Image--1-.png" medium="image"/><content:encoded><![CDATA[<img src="http://insiders.dashtoon.com/content/images/2024/11/Dashtoon-Studio-Image--1-.png" alt="Insights from Our Adversarial Diffusion Distillation POC"><p>When building a comic creation platform, speed and efficiency are paramount. Users expect rapid feedback, especially when generating images through multiple iterations. To meet this demand, we turned to Adversarial Diffusion Distillation, a cutting-edge technique designed to speed up image generation without sacrificing quality. In this blog post, we&apos;ll share why we chose this approach, how we implemented it using our own in-house models, and the steps we took to streamline our process.</p><div class="kg-card kg-header-card kg-v2 kg-width-full kg-content-wide " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="why-adversarial-diffusion-distillation" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">Why Adversarial Diffusion Distillation?</span></h2>
                    
                    
                </div>
            </div>
        </div><ol><li><strong>Faster Inference Speed</strong>: Reducing the time it takes to generate images improves the user experience. The faster we can deliver outputs, the more seamless the creative process becomes for our users. Additionally, quicker generation times reduce operational costs for us, ensuring a more sustainable business model.</li><li><strong>Efficient Iteration in Comic Creation</strong>: Comic creation is an iterative process. Artists frequently modify images before reaching their final version. Showing image outputs quickly allows creators to make adjustments on the fly, fostering a smoother creative workflow.</li></ol><div class="kg-card kg-header-card kg-v2 kg-width-full kg-content-wide " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="introduction" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">Introduction</span></h2>
                    
                    
                </div>
            </div>
        </div><p>The goal is to generate high-fidelity samples quickly while achieving the quality of top-tier models. The adversarial objective enables fast generation by producing samples in a single step, but scaling GANs to large datasets has shown the importance of not solely relying on the discriminator. Incorporating a pretrained classifier or CLIP network enhances text alignment; however, overuse of discriminative networks can lead to artifacts and reduced image quality.</p><p>To address this, the authors leverage the gradient of a pretrained diffusion model through score distillation to improve text alignment and sample quality. The model is initialized with pretrained diffusion model weights, known to enhance training with adversarial loss. Finally, rather than a decoder-only architecture typical in GAN training, a standard diffusion model framework is adapted, allowing for iterative refinement.</p><div class="kg-card kg-header-card kg-v2 kg-width-full kg-content-wide " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="training-procedure" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">Training Procedure</span></h2>
                    
                    
                </div>
            </div>
        </div><p><img src="https://content.dashtoon.ai/user-uploaded-files/UPL9CiRCtQhUUBP69sw77SyYMtXNw7IUF20.webp" alt="Insights from Our Adversarial Diffusion Distillation POC" loading="lazy"></p>
<p>The training procedure involves three networks: the ADD-student initialized from a pretrained UNet-DM, a discriminator with trainable weights, and a DM teacher with frozen weights. The ADD-student generates samples from noisy data, which are produced from real images through a forward diffusion process. The forward process uses the standard diffusion coefficients, and the student&#x2019;s timesteps are sampled uniformly from a small chosen set (typically four), with the largest timestep corresponding to pure noise.</p><p>For the adversarial objective, the discriminator distinguishes between generated samples and real images. Knowledge is distilled from the DM teacher by diffusing student samples with the teacher&#x2019;s process and using the teacher&#x2019;s denoising prediction as a reconstruction target for the distillation loss. The overall objective combines the adversarial loss and distillation loss.</p><p>The method is formulated in pixel space but can be adapted to Latent Diffusion Models operating in latent space. For LDMs with a shared latent space between teacher and student, the distillation loss can be computed in either pixel or latent space, with pixel space providing more stable gradients for distilling latent diffusion models.</p><h3 id="adversarial-loss"><strong>Adversarial Loss</strong></h3><p>The discriminator design and training procedure use a frozen pretrained feature network, typically ViTs, and a set of trainable lightweight discriminator heads applied to features at different layers of the network. The discriminator can be conditioned on additional information, such as text embeddings in text-to-image settings, or on a given image, especially useful when the ADD-student receives some signal from the input image. In practice, an additional feature network extracts an image embedding to condition the discriminator, encouraging the ADD-student to make effective use of its input. 
The hinge loss is used as the adversarial objective function.</p><h3 id="score-distillation-loss"><strong>Score Distillation Loss</strong></h3><p>The distillation loss measures the mismatch between samples generated by the ADD-student and the outputs from the DM-teacher, using a distance metric. The teacher model is applied to diffused outputs of the student&#x2019;s generations, not directly to the non-diffused student outputs, as these would be out-of-distribution for the teacher.</p><p>The distance function used is the squared L2 norm. The weighting function has two options: exponential weighting, where higher noise levels contribute less, and score distillation sampling weighting.</p><div class="kg-card kg-header-card kg-v2 kg-width-full kg-content-wide " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="dashtoons-experiments" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">Dashtoon&#x2019;s Experiments</span></h2>
                    
                    
                </div>
            </div>
        </div><p>We began by carefully studying the key details from the original Adversarial Diffusion Distillation paper. The goal was to transfer knowledge from a pre-trained multi-step model (the teacher) to a student that can sample in far fewer steps, all while maintaining the quality of generated images.</p><p>For our implementation, we selected diffusers, a flexible and efficient framework for diffusion models. Below, we walk through the main components of our training process.</p><ul><li><strong>Teacher Model</strong>: We used our in-house trained SDXL model, which we call dashanime-xl. This model has been fine-tuned on a wide array of anime images, ensuring that it understands the nuances of comic-style art.</li><li><strong>Student Model</strong>: The student model was initialized from the pretrained weights of dashanime-xl. Starting from a strong base model allowed us to train faster while preserving image quality.</li><li><strong>Discriminator</strong>: We used a discriminator just as described in the paper. The discriminator includes two key components:<ol><li>Text Embeddings: Generated from a pretrained CLIP-ViT-g-14 text encoder, which helps evaluate how well the generated image aligns with the text prompt.</li><li>Image Embeddings: Extracted using the CLS embedding of a DINOv2 ViT-L encoder. This ensures that the images generated by the student model maintain high fidelity and visual quality.</li></ol></li><li><strong>Training Dataset</strong>: Our training dataset consists of 2 million publicly available anime images, curated and filtered using the process we detailed in our blog post about dashanime-xl. 
This dataset was essential in ensuring that our models learned to generate high-quality anime-style artwork.</li></ul><p>Below is the pseudo code outlining our training process using diffusers:</p><ul><li>Compute student predictions</li></ul><p><img src="https://content.dashtoon.ai/user-uploaded-files/UPLmEY760y3rO6qieW03Qi0fH16m0xW8Vbb.webp" alt="Insights from Our Adversarial Diffusion Distillation POC" loading="lazy"></p>
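<p>The student-prediction step can be sketched in a few lines. This is a minimal NumPy sketch of the epsilon-parameterised arithmetic, not our actual diffusers training code; <code>student_eps</code>, <code>alphas</code>, and <code>sigmas</code> are hypothetical stand-ins for the student UNet&#x2019;s noise prediction and the noise-schedule coefficients.</p>

```python
import numpy as np

def student_x0_prediction(x0, alphas, sigmas, student_eps, rng):
    """Diffuse real latents to a randomly sampled student timestep and
    recover the student's x0 prediction (epsilon parameterisation)."""
    # ADD samples s uniformly from a small set of student timesteps
    # (typically four), the largest corresponding to pure noise.
    s = int(rng.integers(len(alphas)))
    noise = rng.standard_normal(x0.shape)
    x_s = alphas[s] * x0 + sigmas[s] * noise          # forward diffusion
    eps_hat = student_eps(x_s, s)                     # student noise prediction
    x0_hat = (x_s - sigmas[s] * eps_hat) / alphas[s]  # predicted clean sample
    return x0_hat, s
```

<p>In the real pipeline <code>student_eps</code> would be the student UNet forward pass on latents, and the coefficients would come from the noise scheduler.</p>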
<ul><li>Compute the teacher predictions on student predicted x0</li></ul><p><img src="https://content.dashtoon.ai/user-uploaded-files/UPLzgoOd03RsA2CTZqTCjN37Y1SzjlGqiTT.webp" alt="Insights from Our Adversarial Diffusion Distillation POC" loading="lazy"></p>
<ul><li>Compute the GAN loss</li></ul><p><img src="https://content.dashtoon.ai/user-uploaded-files/UPLTANQJBGLex0p5lXdn4MKTwDlHpmpFOid.webp" alt="Insights from Our Adversarial Diffusion Distillation POC" loading="lazy"></p>
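<p>The adversarial objective is the hinge loss mentioned above. A minimal sketch, assuming <code>real_logits</code> and <code>fake_logits</code> are the outputs of the discriminator heads on real images and student generations:</p>

```python
import numpy as np

def d_hinge_loss(real_logits, fake_logits):
    """Discriminator hinge loss: push real logits above +1, fakes below -1."""
    return (np.maximum(0.0, 1.0 - real_logits).mean()
            + np.maximum(0.0, 1.0 + fake_logits).mean())

def g_hinge_loss(fake_logits):
    """Adversarial loss for the ADD-student: raise the discriminator's scores."""
    return -fake_logits.mean()
```
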
<ul><li>Compute the distillation loss. We used the exponential weighting scheme for the score distillation loss</li></ul><p><img src="https://content.dashtoon.ai/user-uploaded-files/UPLEJAuvr8HX7DyyeZiEGezleLei5EhnbnK.webp" alt="Insights from Our Adversarial Diffusion Distillation POC" loading="lazy"></p>
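<p>The distillation loss then compares the student&#x2019;s x0 prediction against the teacher&#x2019;s reconstruction target. A minimal sketch of the exponentially weighted squared-L2 form; treat the exact weighting schedule (here the signal coefficient <code>alpha_t</code> at the teacher timestep) as an assumption rather than our precise implementation:</p>

```python
import numpy as np

def distillation_loss(x0_hat, x0_teacher, alpha_t):
    """Squared-L2 distillation loss with exponential weighting: targets
    produced at higher noise levels (smaller alpha_t) contribute less."""
    return alpha_t * ((x0_hat - x0_teacher) ** 2).mean()
```
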
<div class="kg-card kg-header-card kg-v2 kg-width-full kg-content-wide " style="background-color: #000000;" data-background-color="#000000">
            
            <div class="kg-header-card-content">
                
                <div class="kg-header-card-text kg-align-center">
                    <h2 id="experiment-results" class="kg-header-card-heading" style="color: #FFFFFF;" data-text-color="#FFFFFF"><span style="white-space: pre-wrap;">Experiment Results</span></h2>
                    
                    
                </div>
            </div>
        </div><p>By applying Adversarial Diffusion Distillation, we successfully reduced inference times while maintaining the high-quality outputs that are essential for comic creation. The <strong>student model</strong> now generates images significantly faster than the <strong>teacher model</strong>, providing creators with a smoother and more efficient experience.</p><p><img src="https://content.dashtoon.ai/user-uploaded-files/UPLBF8T4DqotbULeat2jtlkZCfffgWXMIMJ.webp" alt="Insights from Our Adversarial Diffusion Distillation POC" loading="lazy"></p>
<p>Due to time constraints, we weren&apos;t able to perform a thorough quantitative evaluation of our new model, but we did manage to test it qualitatively. We&apos;re excited to share some of the results generated by our <strong>dashanime-xl-1.0-turbo</strong>, the outcome of our <strong>Adversarial Diffusion Distillation (ADD)</strong> implementation.</p><p>Our initial tests show that <strong>dashanime-xl-1.0-turbo</strong> performs impressively, producing high-quality anime-style images with significantly reduced generation times. The faster inference speed makes it ideal for rapid iteration during the comic creation process, fulfilling our goal of delivering seamless user experiences.</p><p>While we still plan to conduct comprehensive quantitative tests, these initial qualitative outputs are promising. They highlight <strong>dashanime-xl-1.0-turbo&apos;s</strong> ability to generate intricate and stylistically consistent images rapidly. The model&apos;s ability to maintain image quality while reducing generation time is a significant leap for use cases like comic creation, where speed and creative flexibility are crucial. Our next steps include further fine-tuning and testing on a broader range of comic styles and scenarios. We&#x2019;re also exploring ways to integrate this approach directly into our platform to allow for real-time image generation.</p><p>Stay tuned as we refine the model further and move on to more quantitative analysis to showcase the full potential of <strong>dashanime-xl-1.0-turbo</strong>!</p><hr><p>In this blog post, we&apos;ve highlighted the technical aspects of implementing Adversarial Diffusion Distillation, but the implications for user experience and efficiency are even more exciting. 
As we roll out these enhancements, we&#x2019;re eager to see how our community of creators utilizes this newfound speed and power in their projects.</p>]]></content:encoded></item><item><title><![CDATA[Dashtoon Studio August 2024 Release]]></title><description><![CDATA[<h1 id="what%E2%80%99s-new">What&#x2019;s new?</h1><p>In April 2023, we embarked on the journey of building Dashtoon Studio with a vision to revolutionize the way comics are created. After several iterations and invaluable feedback from our community, we are delighted to announce a host of new features that empower creators to bring</p>]]></description><link>http://insiders.dashtoon.com/dashtoon-studio-august-2024-release/</link><guid isPermaLink="false">66c48f16f3de2e00650bcf1b</guid><dc:creator><![CDATA[Amogh Vaishampayan]]></dc:creator><pubDate>Fri, 23 Aug 2024 14:21:50 GMT</pubDate><media:content url="http://insiders.dashtoon.com/content/images/2024/08/34466035.png" medium="image"/><content:encoded><![CDATA[<h1 id="what%E2%80%99s-new">What&#x2019;s new?</h1><img src="http://insiders.dashtoon.com/content/images/2024/08/34466035.png" alt="Dashtoon Studio August 2024 Release"><p>In April 2023, we embarked on the journey of building Dashtoon Studio with a vision to revolutionize the way comics are created. After several iterations and invaluable feedback from our community, we are delighted to announce a host of new features that empower creators to bring their stories to life with unparalleled control and creativity.</p><h2 id="introducing-story-mode">Introducing Story Mode</h2><p>Every captivating comic begins with a compelling story. However, transforming a narrative into a comic format is far from simple. 
Dashtoon Studio&apos;s Story Mode makes it possible for storytellers and artists to seamlessly convert their stories into engaging comics.</p><p>When you input your story in the Story section, our AI identifies the characters within your narrative and maps each one to a character from our extensive library of consistent characters. This ensures that each character maintains a consistent appearance throughout the comic. </p><figure class="kg-card kg-video-card kg-width-regular kg-card-hascaption" data-kg-thumbnail="http://insiders.dashtoon.com/content/media/2024/08/character-mapping_thumb.jpg" data-kg-custom-thumbnail>
            <div class="kg-video-container">
                <video src="http://insiders.dashtoon.com/content/media/2024/08/character-mapping.mp4" poster="https://img.spacergif.org/v1/1920x1080/0a/spacer.png" width="1920" height="1080" playsinline preload="metadata" style="background: transparent url(&apos;http://insiders.dashtoon.com/content/media/2024/08/character-mapping_thumb.jpg&apos;) 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">0:10</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1&#xD7;</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"/>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            <figcaption><p><span style="white-space: pre-wrap;">Dashtoon Studio automatically detects characters from your story and gives them a face</span></p></figcaption>
        </figure><p>Dashtoon Studio&apos;s AI then adapts your story into a detailed, panel-by-panel screenplay. After that, it generates the comic images, complete with the mapped characters, bringing your story to life in a visually stunning comic. Every panel of the screenplay is editable. You can even add new panels using just text instructions.</p><figure class="kg-card kg-video-card kg-width-regular kg-card-hascaption" data-kg-thumbnail="http://insiders.dashtoon.com/content/media/2024/08/add-panel_thumb.jpg" data-kg-custom-thumbnail>
            <div class="kg-video-container">
                <video src="http://insiders.dashtoon.com/content/media/2024/08/add-panel.mp4" poster="https://img.spacergif.org/v1/1920x1080/0a/spacer.png" width="1920" height="1080" playsinline preload="metadata" style="background: transparent url(&apos;http://insiders.dashtoon.com/content/media/2024/08/add-panel_thumb.jpg&apos;) 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">0:12</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1&#xD7;</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"/>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            <figcaption><p><span style="white-space: pre-wrap;">Add panels using simple text prompts</span></p></figcaption>
        </figure><h2 id="the-all-new-editor">The All-New Editor</h2><p>Our improved Editor brings back the beloved infinite canvas, now with a cleaner interface that balances accessibility and functionality. Whether you&apos;re a newcomer or a seasoned artist familiar with digital painting software like Photoshop and Clip Studio, you&apos;ll find the interface intuitive. The tool bar on the left and settings on the right offer a familiar yet enhanced workspace.</p><h3 id="key-features">Key Features</h3><ul><li><strong>Toolbar on Left</strong>: Easy access to essential tools.</li></ul><figure class="kg-card kg-video-card kg-width-regular kg-card-hascaption" data-kg-thumbnail="http://insiders.dashtoon.com/content/media/2024/08/toolbar-to-left_thumb.jpg" data-kg-custom-thumbnail>
            <div class="kg-video-container">
                <video src="http://insiders.dashtoon.com/content/media/2024/08/toolbar-to-left.mp4" poster="https://img.spacergif.org/v1/1920x1080/0a/spacer.png" width="1920" height="1080" playsinline preload="metadata" style="background: transparent url(&apos;http://insiders.dashtoon.com/content/media/2024/08/toolbar-to-left_thumb.jpg&apos;) 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">0:05</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1&#xD7;</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"/>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            <figcaption><p><span style="white-space: pre-wrap;">A better toolbar positioning</span></p></figcaption>
        </figure><ul><li><strong>Preset Frame Sizes</strong>: Convenient Frame sizes in a variety of aspect ratios to fit your comic&apos;s layout. </li></ul><figure class="kg-card kg-video-card kg-width-regular kg-card-hascaption" data-kg-thumbnail="http://insiders.dashtoon.com/content/media/2024/08/frames_thumb.jpg" data-kg-custom-thumbnail>
            <div class="kg-video-container">
                <video src="http://insiders.dashtoon.com/content/media/2024/08/frames.mp4" poster="https://img.spacergif.org/v1/1920x1044/0a/spacer.png" width="1920" height="1044" playsinline preload="metadata" style="background: transparent url(&apos;http://insiders.dashtoon.com/content/media/2024/08/frames_thumb.jpg&apos;) 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">0:22</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1&#xD7;</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"/>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            <figcaption><p><span style="white-space: pre-wrap;">Create Frames in preset sizes</span></p></figcaption>
        </figure><ul><li><strong>Simplified Image Generation Modes</strong>:<ul><li><strong>Text to Image</strong>: Convert text descriptions into images.</li><li><strong>Storyboard to Comic</strong>: Transform storyboards - hand drawn or uploaded - into fully realized comic panels.</li><li><strong>Inpainting</strong>: Edit specific areas within an image, now with the option of preserving the original composition.</li><li><strong>Image to Image</strong>: Modify existing images.</li><li><strong>Edge to Comic</strong>: Turn edge-detected outlines into detailed comic art.</li></ul></li></ul><figure class="kg-card kg-embed-card kg-card-hascaption"><iframe width="200" height="113" src="https://www.youtube.com/embed/UTd_V27kHY0?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen title="Bring your ideas to life with these generation methods in this Free AI assisted comic creator"></iframe><figcaption><p><span style="white-space: pre-wrap;">Use different image generation methods for complete control over your comic</span></p></figcaption></figure><ul><li><strong>New Upscalers</strong>:<ul><li><strong>Creative</strong>: Increase image size while also enhancing it with artistic flair.</li><li><strong>Non-Creative</strong>: Increase size while maintaining the original image&#x2019;s integrity.</li></ul></li></ul><figure class="kg-card kg-video-card kg-width-regular kg-card-hascaption" data-kg-thumbnail="http://insiders.dashtoon.com/content/media/2024/08/Creative-Upscaler_thumb.jpg" data-kg-custom-thumbnail>
            <div class="kg-video-container">
                <video src="http://insiders.dashtoon.com/content/media/2024/08/Creative-Upscaler.mp4" poster="https://img.spacergif.org/v1/1920x1080/0a/spacer.png" width="1920" height="1080" playsinline preload="metadata" style="background: transparent url(&apos;http://insiders.dashtoon.com/content/media/2024/08/Creative-Upscaler_thumb.jpg&apos;) 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">0:04</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1&#xD7;</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"/>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            <figcaption><p><span style="white-space: pre-wrap;">Upscale your images without losing any details, but rather enhancing them</span></p></figcaption>
        </figure><ul><li><strong>Speech Bubbles</strong>:<ul><li><strong>New Bubble Shapes</strong>: Fresh designs for speech bubbles.</li><li><strong>Flexible Bubble Tails</strong>: New curved tails that are easy to manipulate.</li><li><strong>Connecting Bubbles</strong>: Link multiple speech bubbles for cohesive dialogue flow.</li></ul></li></ul><p>With these new features, Dashtoon Studio combines unprecedented control over AI with an uncompromising storyteller&#x2019;s toolkit, producing comics at one-tenth the time and cost of traditional methods. Our in-house comic production studio creates over 2,000 episodes a month, with one artist producing an entire episode in just one day.</p><h1 id="ai-that-empowers-artists">AI That Empowers Artists</h1><p>In the current landscape, the narrative surrounding generative AI within the artist community is often negative. Many view AI as soulless, as theft of their work, as a source of low-quality output, and as something that takes artistic control away from the user. This perception, coupled with the idea that &quot;anybody can be an artist without effort,&quot; undermines the years of training and hard work that actual artists invest in their craft. Another common fear is that AI image generation will completely replace comic artists. In our experience at Dashtoon Studio, these fears are unfounded.</p><p>What doesn&apos;t get enough attention&#x2014;perhaps because it&apos;s not sensational enough&#x2014;is how artists can leverage generative AI. By learning to manipulate AI, artists can produce high-quality work at astonishing speed.</p><p>Most image generation AI tools today are good at creating individual images, similar to stock imagery. While these images may look impressive on their own, maintaining storytelling continuity between them is challenging. Typically, these services use a single text prompt as their primary input. 
However, generative AI for comics and storytelling requires so much more than just generating standalone images. There&apos;s a vast arsenal of tools available beyond a simple text prompt, enabling richer and more coherent narratives.</p><p>In comic storytelling, where every image is deliberate and conveys something specific, artists need a high degree of control. They need to manage composition tightly, pose characters accurately, and make specific changes like adjusting the color of a character&apos;s clothes. Character and style consistency in AI-generated images is crucial for crafting an engaging narrative. This level of precision cannot be achieved with just prompting; it requires a comprehensive toolkit to harness the AI effectively. That&apos;s why the fundamental philosophy behind Dashtoon Studio is that the AI always works <em>for</em> the artist, rather than <em>against</em> them. Comic artists and storytellers have complete control over how, when, where, and how much the system&apos;s AI affects the images.</p><figure class="kg-card kg-video-card kg-width-regular kg-card-hascaption" data-kg-thumbnail="http://insiders.dashtoon.com/content/media/2024/08/std2comic-1_thumb.jpg" data-kg-custom-thumbnail>
            <div class="kg-video-container">
                <video src="http://insiders.dashtoon.com/content/media/2024/08/std2comic-1.mp4" poster="https://img.spacergif.org/v1/1920x1080/0a/spacer.png" width="1920" height="1080" playsinline preload="metadata" style="background: transparent url(&apos;http://insiders.dashtoon.com/content/media/2024/08/std2comic-1_thumb.jpg&apos;) 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">0:09</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1&#xD7;</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"/>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            <figcaption><p><span style="white-space: pre-wrap;">Artists can upload or draw storyboards to guide the AI image generation</span></p></figcaption>
        </figure><p>At Dashtoon Studio, we include several techniques beyond text prompting to achieve this level of control. For instance, artists can draw storyboards to set up their scenes and use them as guidance for AI image generation. This helps control the composition. They can also command the AI to edit only specific parts of an image according to their rough drawing combined with a text prompt. Advanced object selection using segmentation AI allows artists to extract parts of an image as a layer. Additionally, we have included classic non-AI image manipulation features such as blend modes and control over hue, saturation, brightness, and contrast.</p><figure class="kg-card kg-video-card kg-width-regular kg-card-hascaption" data-kg-thumbnail="http://insiders.dashtoon.com/content/media/2024/08/Segment-and-color-change_thumb.jpg" data-kg-custom-thumbnail>
            <div class="kg-video-container">
                <video src="http://insiders.dashtoon.com/content/media/2024/08/Segment-and-color-change.mp4" poster="https://img.spacergif.org/v1/1920x1080/0a/spacer.png" width="1920" height="1080" playsinline preload="metadata" style="background: transparent url(&apos;http://insiders.dashtoon.com/content/media/2024/08/Segment-and-color-change_thumb.jpg&apos;) 50% 50% / cover no-repeat;"></video>
                <div class="kg-video-overlay">
                    <button class="kg-video-large-play-icon" aria-label="Play video">
                        <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                            <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                        </svg>
                    </button>
                </div>
                <div class="kg-video-player-container">
                    <div class="kg-video-player">
                        <button class="kg-video-play-icon" aria-label="Play video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-pause-icon kg-video-hide" aria-label="Pause video">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                                <rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"/>
                            </svg>
                        </button>
                        <span class="kg-video-current-time">0:00</span>
                        <div class="kg-video-time">
                            /<span class="kg-video-duration">0:06</span>
                        </div>
                        <input type="range" class="kg-video-seek-slider" max="100" value="0">
                        <button class="kg-video-playback-rate" aria-label="Adjust playback speed">1&#xD7;</button>
                        <button class="kg-video-unmute-icon" aria-label="Unmute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 0-1.06-.998Z"/>
                            </svg>
                        </button>
                        <button class="kg-video-mute-icon kg-video-hide" aria-label="Mute">
                            <svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24">
                                <path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"/>
                            </svg>
                        </button>
                        <input type="range" class="kg-video-volume-slider" max="100" value="100">
                    </div>
                </div>
            </div>
            <figcaption><p><span style="white-space: pre-wrap;">Object detection and color changes</span></p></figcaption>
        </figure><h1 id="we-hire-great-artists">We Hire Great Artists</h1><p>Perhaps the strongest proof of this philosophy is that Dashtoon employs a creative staff of over 300 artists, writers, and producers at our in-house comic production studio. If AI image generation alone could make anybody a great comic creator, we would hire anybody off the street. But we don&#x2019;t.</p><p>Dashtoon&#x2019;s artist hiring process is rigorous and extremely competitive. Through our evaluation pipeline, we have trained numerous artists across India, but only a select few have ever been hired at Dashtoon Studio.</p><p>We have observed a significant difference in the quality of comics produced by an artist versus a non-artist using Dashtoon Studio. With both groups leveraging AI, the artist-produced comics are of far higher quality and, by a large margin, more enjoyable to read. We are building the AI equivalent of a Lamborghini, but only an artist can truly drive it.</p><p>Dashtoon Studio is revolutionizing how comic artists can use generative AI, providing them with the tools and control they need to create compelling and high-quality comics. 
By empowering artists rather than replacing them, we are ensuring that the art of storytelling through comics continues to thrive in the age of AI.</p><h1 id="coming-soon%E2%80%A6">Coming soon&#x2026;</h1><ul><li>Character creation<ul><li>Create custom characters that the AI will learn and replicate in any pose, with any expression.</li><li>Create characters that work across several styles</li><li>Addition of humanoid and non-human characters.</li></ul></li><li>New art styles<ul><li>A brand new menu of preset art styles that you can use for your comic</li><li>Enabling artists to train our AI on their signature style, giving them an even greater degree of control over their work</li></ul></li><li>Improved AI capabilities<ul><li>Prompt adherence - how well the AI follows the instructions in your text prompt</li><li>Facial expression and eye gaze control</li><li>Image segmentation</li></ul></li><li>Improved Layer management</li></ul><p>Try using the new Dashtoon Studio today!</p><p>If you are a published author, consider applying to our <a href="https://dashtoon.com/authors?ref=insiders.dashtoon.com">Author&#x2019;s Program</a>.</p><p>If you are a comic book artist, or have previously written content for online platforms, please apply to our <a href="https://dashtoon.com/creators?ref=insiders.dashtoon.com" rel="noreferrer">Creator Program</a>.</p>]]></content:encoded></item><item><title><![CDATA[Introducing DashAnime XL 1.0]]></title><description><![CDATA[<p>Dashtoon Studio is an innovative AI-powered platform designed to empower creators in the art of comic crafting. The immense popularity and profound cultural influence of anime have sparked a fervent demand from our user base for this distinctive style. 
Recognizing this enthusiasm, we embarked on an ambitious journey to develop</p>]]></description><link>http://insiders.dashtoon.com/dashanimexl/</link><guid isPermaLink="false">66c59923f3de2e00650bcf39</guid><category><![CDATA[research]]></category><dc:creator><![CDATA[Ayushman Buragohain]]></dc:creator><pubDate>Wed, 21 Aug 2024 13:46:45 GMT</pubDate><media:content url="http://insiders.dashtoon.com/content/images/2024/08/ComfyUI_temp_vjlhc_00023_.png" medium="image"/><content:encoded><![CDATA[<img src="http://insiders.dashtoon.com/content/images/2024/08/ComfyUI_temp_vjlhc_00023_.png" alt="Introducing DashAnime XL 1.0"><p>Dashtoon Studio is an innovative AI-powered platform designed to empower creators in the art of comic crafting. The immense popularity and profound cultural influence of anime have sparked a fervent demand from our user base for this distinctive style. Recognizing this enthusiasm, we embarked on an ambitious journey to develop a state-of-the-art anime image generation model specifically tailored for storytelling. Our goal was to create a tool that not only meets the high expectations of anime enthusiasts but also enhances the creative process for comic artists working in this beloved medium.</p>
<h1 id="the-art-of-anime-dashtoons-need-for-a-dedicated-model">The Art of Anime: Dashtoon&apos;s Need for a Dedicated Model</h1>
<p>Existing anime models primarily rely on <a href="https://danbooru.donmai.us/tags?commit=Search&amp;search%5Bcategory%5D=1&amp;search%5Bhas_artist%5D=yes&amp;search%5Bhide_empty%5D=yes&amp;search%5Bname_or_alias_matches%5D=blue+archive&amp;search%5Border%5D=date&amp;ref=insiders.dashtoon.com">danbooru</a> tags, which can make it challenging for a novice user to control these models effectively. Additionally, these models often exhibit low prompt adherence. With these challenges in mind, we set out to create an anime model that responds accurately to natural language prompts. To achieve this, we used SDXL as our base architecture.</p>
<h1 id="dataset-curation-captioning">Dataset Curation &amp; Captioning</h1>
<p>We trained our model using a carefully curated dataset of approximately 7 million openly available anime-styled images. In projects like this, a common technique is to recaption all images to eliminate faulty text conditioning in the dataset. To achieve this, we employed our in-house Vision Large Language Model (VLLM) for captioning, which significantly enhanced the model&apos;s ability to follow instructions accurately. This step was crucial to ensure the model&apos;s effectiveness with natural language, as the existing captions for anime-style images are predominantly in the Danbooru style, which we aimed to avoid.<br>
Danbooru-style prompting involves guiding the model using tags from <a href="https://danbooru.donmai.us/tags?commit=Search&amp;search%5Bcategory%5D=1&amp;search%5Bhas_artist%5D=yes&amp;search%5Bhide_empty%5D=yes&amp;search%5Bname_or_alias_matches%5D=blue+archive&amp;search%5Border%5D=date&amp;ref=insiders.dashtoon.com">danbooru</a>. This approach can be particularly challenging for novices: danbooru contains a vast number of tags, and achieving the desired generations often requires experimenting with many of them and extensive knowledge of which tags exist.</p>
<h2 id="balancing-the-dataset-%E2%80%A6">Balancing the dataset &#x2026;</h2>
<p>The anime dataset we curated (like anime datasets in general) exhibits a pronounced long-tail distribution: less common concepts have far fewer samples, whether because they are niche or recently introduced. While many of these concepts are quite interesting, training directly on such a skewed distribution makes it difficult for the model to learn them effectively.</p>
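<p>To make the long-tail effect concrete, here is a minimal, self-contained sketch. The tag counts are made up for illustration (they are not drawn from our dataset), but they show how a couple of head tags can dominate what a uniform sampler ever sees:</p>

```python
from collections import Counter

# Hypothetical tag occurrences; anime datasets show this shape in practice:
# a few "head" tags (e.g. 1girl) dwarf thousands of niche "tail" tags.
tags = (
    ["1girl"] * 5000
    + ["smile"] * 3000
    + ["1boy"] * 1200
    + ["horse"] * 40
    + ["mural"] * 15
    + ["abstract_shapes"] * 5
)

counts = Counter(tags)
total = sum(counts.values())

# Fraction of all samples covered by just the two most common tags.
top2 = sum(n for _, n in counts.most_common(2))
print(f"top-2 tags cover {top2 / total:.0%} of the data")

# A uniform sampler therefore almost never sees tail concepts, which is
# why a prompt like "a horse" can still produce a girl in the output.
for tag, n in counts.most_common():
    print(f"{tag:16s} {n / total:.2%}")
```

<p>Under this toy distribution the two head tags account for roughly 86% of all samples, so gradient updates for tail concepts are vanishingly rare unless the sampling is rebalanced.</p>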
<p>Below are some examples illustrating how this long-tail distribution impacts the final image generation. Even when the prompt does not mention a girl, the model might inadvertently generate one, which is an outcome we want to avoid. The images below were generated using an open-source anime model; images generated by DashAnimeXL are included for comparison.</p>
<table class="custom-table">
  <thead>
    <tr>
      <th class="prompt-column">Prompt</th>
      <th>Open-Source Anime Model</th>
      <th>DashAnimeXL</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td class="prompt-column">a wall</td>
      <td>
        <div class="image-container">
          <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/33205494.png" alt="Introducing DashAnime XL 1.0">
        </div>
      </td>
      <td>
        <div class="image-container">
          <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/33207029.png" alt="Introducing DashAnime XL 1.0">
        </div>
      </td>
    </tr>
    <tr>
      <td class="prompt-column">abstract shapes and colours</td>
      <td>
        <div class="image-container">
          <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/image.png" alt="Introducing DashAnime XL 1.0">
        </div>
      </td>
      <td>
        <div class="image-container">
          <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00177_.png" alt="Introducing DashAnime XL 1.0">
        </div>
      </td>
    </tr>
    <tr>
      <td class="prompt-column">a horse</td>
      <td>
        <div class="image-container">
          <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/33205506.png" alt="Introducing DashAnime XL 1.0">
        </div>
      </td>
      <td>
        <div class="image-container">
          <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00096_.png" alt="Introducing DashAnime XL 1.0">
        </div>
      </td>
    </tr>
    <tr>
      <td class="prompt-column">a book</td>
      <td>
        <div class="image-container">
          <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/33205504.png" alt="Introducing DashAnime XL 1.0">
        </div>
      </td>
      <td>
        <div class="image-container">
          <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00105_.png" alt="Introducing DashAnime XL 1.0">
        </div>
      </td>
    </tr>
    <tr>
      <td class="prompt-column">a tree</td>
      <td>
        <div class="image-container">
          <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/33205492.png" alt="Introducing DashAnime XL 1.0">
        </div>
      </td>
      <td>
        <div class="image-container">
          <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00128_.png" alt="Introducing DashAnime XL 1.0">
        </div>
      </td>
    </tr>
    <tr>
      <td class="prompt-column">A vibrant street mural covering an entire wall</td>
      <td>
        <div class="image-container">
          <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ac613ec9-dadb-493c-9228-3843e299ca59.png" alt="Introducing DashAnime XL 1.0">
        </div>
      </td>
      <td>
        <div class="image-container">
          <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00195_.png" alt="Introducing DashAnime XL 1.0">
        </div>
      </td>
    </tr>
  </tbody>
</table>
<style>
  .custom-table {
    width: 100%;
    border-collapse: collapse;
    table-layout: fixed;
  }

  .custom-table th, .custom-table td {
    padding: 8px;
    text-align: left;
    border-bottom: 1px solid #ddd;
    vertical-align: top;
  }

  .prompt-column {
    width: 15%;
    word-wrap: break-word;
    font-size: 14px;
  }

  .image-container {
    position: relative;
    overflow: hidden;
    border-radius: 10px;
    height: 150px; /* Set a fixed height for images */
    display: flex;
    justify-content: center;
    align-items: center;
    aspect-ratio: 1/1;
  }

  .custom-image {
    max-width: 100%;
    max-height: 100%;
    display: block;
    border-radius: 10px;
    transition: transform .7s;
  }

  .custom-image:hover {
    transform: scale(1.05);
  }
</style>
<p>We implemented specific balancing strategies to smooth out the original data distribution:</p>
<ol>
<li>We partitioned the dataset into specific tag groups based on our internal testing and analysis</li>
<li>We then sampled tags within the selected categories separately. To balance the dataset, we reduced the influence of tags from high-frequency categories and increased the sampling frequency of tags from low-frequency categories. Using these adjusted sampling frequencies, we reconstructed the dataset, resulting in a smoother distribution across the categories.</li>
<li>Finally, all the images were captioned in natural language using our in-house captioning tool.</li>
</ol>
<p>We categorised every image in the dataset into one of several buckets (the full list of buckets is long; here we provide only the most important ones):</p>
<ol>
<li><strong>1 girl with emotion</strong>: Includes images where a girl showing any emotion is the central figure.</li>
<li><strong>1 girl only</strong>: Includes images where a girl is the central figure.</li>
<li><strong>1 boy with emotion</strong>: Includes images where a boy showing any emotion is the central figure.</li>
<li><strong>1 boy only</strong>: Includes images where a boy is the central figure.</li>
<li><strong>multiple girls</strong>: Includes images with multiple girl characters.</li>
<li><strong>multiple boys</strong>: Includes images with multiple boy characters.</li>
<li><strong>animal</strong>: Includes images where an animal is the central figure.</li>
<li><strong>object</strong>: Includes images where an object is the central figure.</li>
<li><strong>is_no_human</strong>: Includes images with no human subjects, focusing on scenery or backgrounds.</li>
</ol>
<p>After categorising the data, we ensured that each image was assigned to exactly one bucket. This step was crucial to avoid inadvertently skewing the weight of any particular category. Once the images were correctly categorised, we adjusted the weightage of each bucket. Specifically, we increased the weight of underrepresented buckets and decreased the weight of overrepresented ones, aiming to balance the distribution and bring each bucket&apos;s weight closer to the median of the overall dataset.</p>
<p>The pseudo-code looks something like this:</p>
<pre style="background-color: #f5f5f5; padding: 10px; border-radius: 5px;"><code># Step 1: Analyze and categorize tags
for each tag in Danbooru dataset:
    assign tag to appropriate category based on its characteristics

# Step 2: Separate and sample tags within each category
for each category in dataset:
    calculate frequency of each tag in the category
    if category is high-frequency:
        reduce sampling frequency
    else if category is low-frequency:
        increase sampling frequency

# Step 3: Reconstruct dataset with balanced sampling
balanced_dataset = []
for each category in dataset:
    sampled_tags = sample_tags(category, adjusted_sampling_frequency)
    add sampled_tags to balanced_dataset

# Step 4: Categorize images into buckets
for each image in balanced_dataset:
    assign image to one of the buckets:
        - 1 girl with emotion
        - 1 girl only
        - 1 boy with emotion
        - 1 boy only
        - multiple girls
        - multiple boys
        - animal
        - object
        - is_no_human

# Step 5: Ensure each image is in exactly one bucket
for each image in balanced_dataset:
    if image is assigned to multiple buckets:
        remove from all but one bucket

# Step 6: Adjust bucket weights to balance distribution
for each bucket in balanced_dataset:
    if bucket is overrepresented:
        decrease weight
    else if bucket is underrepresented:
        increase weight

# Step 7: Final dataset preparation
final_dataset = apply_weighting(balanced_dataset)
caption_images(final_dataset, captioning_tool)
</code>
</pre>
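<p>The median-targeted reweighting in Steps 6 and 7 can be made concrete with a short, runnable sketch. The bucket counts below are illustrative only, and the simple <code>median / count</code> weighting shown here is one natural choice, not necessarily the exact formula used in production:</p>

```python
import statistics

# Illustrative bucket sizes (not our real distribution): heavy head, thin tail.
bucket_counts = {
    "1_girl_with_emotion": 900_000,
    "1_girl_only": 700_000,
    "1_boy_with_emotion": 250_000,
    "1_boy_only": 180_000,
    "multiple_girls": 120_000,
    "multiple_boys": 60_000,
    "animal": 25_000,
    "object": 18_000,
    "is_no_human": 40_000,
}

median = statistics.median(bucket_counts.values())

# Weight each bucket inversely to its size: overrepresented buckets get
# weight < 1, underrepresented buckets get weight > 1.
weights = {bucket: median / n for bucket, n in bucket_counts.items()}

# With per-image sampling probability proportional to its bucket's weight,
# every bucket contributes roughly `median` images' worth of samples per epoch.
expected = {bucket: weights[bucket] * n for bucket, n in bucket_counts.items()}

for bucket, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{bucket:22s} weight={w:6.3f}")
```

<p>In practice, weights this aggressive can over-expose tiny buckets, so clamping each weight to a range (say 0.25 to 4) is a common compromise between balance and fidelity to the original distribution.</p>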
<h3 id="from-long-captions-to-short-captions-%E2%80%A6">From long captions to short captions &#x2026;</h3>
<p>In our initial experiments, we began with very long and descriptive prompts, which significantly enhanced the model&apos;s ability to follow instructions. However, we soon observed a decline in performance when using shorter prompts. This approach also introduced style inconsistencies, as some image captions included style references while others did not. This inconsistency occasionally led the model to generate realistic images when we actually desired a different style. Style inconsistency is a significant concern for us because we are engaged in storytelling through comics. Visual continuity needs to be maintained from panel to panel and across hundreds of chapters. Maintaining a consistent visual style is crucial for ensuring that our readers remain immersed in the story without distractions.</p>
<p>Upon further investigation, we traced this issue back to inherent biases in SDXL and inconsistent captioning in the training dataset.</p>
<p>Below, we present some examples of model generations that exhibit style inconsistencies. In these cases, we intended to produce images in an anime style, but instead received realistic images.</p>
<div class="row-grid">
  <div class="row-item">
    <img class="row-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00365_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="row-item">
    <img class="row-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00447_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="row-item">
    <img class="row-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00455_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="row-item">
    <img class="row-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00458_.png" alt="Introducing DashAnime XL 1.0">
  </div>
</div>
<style>
  .row-grid {
    display: flex;
    justify-content: space-between;
    gap: 15px;
  }

  .row-item {
    flex: 1;
  }

  .row-image {
    width: 100%;
    height: auto;
    display: block;
    border-radius: 10px;
    transition: transform .7s;
  }

  .row-image:hover {
    transform: scale(1.05);
  }
</style>
<p>Below we present some examples that highlight labelling inconsistencies within the dataset. Notice how certain samples explicitly include the &quot;anime&quot; style in the caption, while others do not. This inconsistency leads the models to develop their own biases, resulting in unpredictable behaviour during inference. Specifically, we observed that the output image styles became inconsistent&#x2014;sometimes appearing in an anime style and other times in the default SDXL realistic style as shown above. For instance, some captions include phrases like <strong>animated characters</strong> and <strong>A stylized, illustrated.</strong> Additionally, in our full dataset, we discovered that many images contain keywords such as <strong>A digital illustration</strong>, <strong>A cartoon-style illustration</strong>, and <strong>A stylized, anime-style character</strong>.</p>
<table style="width: 100%; border-collapse: collapse; white-space: wrap !important;">
  <thead>
    <tr>
      <th style="width: 30%; text-align: center;">Image</th>
      <th style="width: 70%; text-align: left;">Caption</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center;">
        <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/image%201.png" alt="Introducing DashAnime XL 1.0" style="max-width: 100%; height: auto;">
      </td>
      <td style="width: 50%; word-wrap: break-word;">
        Two animated characters, one with brown hair and one with green hair, both wearing black and red outfits with accessories. The character on the left holds a vintage-style camera, while the character on the right holds a smartphone. They stand in front of a plain background with a red object visible to the right. The characters have a friendly demeanor, with the brown-haired character smiling and the green-haired character looking at the camera with a slight smile.
      </td>
    </tr>
    <tr>
      <td style="text-align: center;">
        <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/image%202.png" alt="Introducing DashAnime XL 1.0" style="max-width: 100%; height: auto;">
      </td>
      <td style="width: 50%; word-wrap: break-word;">
        A girl character with long, light-colored hair, large blue eyes, and a black and white outfit with a bow tie. She is smiling and appears to be in a cheerful mood. The character is positioned in the foreground against a white background with scattered orange and pink hues.
      </td>
    </tr>
    <tr>
      <td style="text-align: center;">
        <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/image%203.png" alt="Introducing DashAnime XL 1.0" style="max-width: 100%; height: auto;">
      </td>
      <td style="width: 50%; word-wrap: break-word;">
        An illustration of a man with spiky, dark blue hair and striking red eyes. He is wearing a black and white checkered shirt and a black choker. His right hand is raised to his chin, with his index finger resting on his cheek, while his thumb is tucked under his chin. He is holding a white electric guitar with a brown pickguard and six strings. The background is minimalistic, featuring a white wall with a red and white sign partially visible. The style of the image is cartoonish, with a focus on the character&apos;s expressive features and the guitar, suggesting a theme of music and youth culture.
      </td>
    </tr>
    <tr>
      <td style="text-align: center;">
        <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/image%204.png" alt="Introducing DashAnime XL 1.0" style="max-width: 100%; height: auto;">
      </td>
      <td style="width: 50%; word-wrap: break-word;">
        A girl riding a tricycle. She is depicted from the back, wearing a black and white outfit with a hoodie, leggings, and sneakers. Her hair is styled in a bun with two pigtails on top. The background is a solid red color. The girl is positioned on the right side of the image, with her left foot on the front wheel and her right foot on the back wheel.
      </td>
    </tr>
    <tr>
      <td style="text-align: center;">
        <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/image%205.png" alt="Introducing DashAnime XL 1.0" style="max-width: 100%; height: auto;">
      </td>
      <td style="width: 50%; word-wrap: break-word;">
        A stylized, illustrated character with angelic features, standing centrally with arms outstretched. The character has large, feathered wings with a gradient of blue and white, and is adorned with a golden, ornate armor-like outfit with circular designs on the chest and arms. The character&apos;s hair is short and dark, and they have a serene expression. Behind the character is a symmetrical, stained-glass window-like background with intricate patterns and motifs, including a central circular emblem with a face, surrounded by four smaller circles with various symbols. The background is a blend of pastel colors, predominantly in shades of blue, white, and peach. There is no text present in the image. The lighting appears to be emanating from the central emblem, casting a soft glow on the character and the surrounding patterns. The style of the image is a digital illustration with a fantasy or mythical theme. The character appears to be a girl, and the overall mood conveyed is one of tranquility and ethereal beauty.
      </td>
    </tr>
    <tr>
      <td style="text-align: center;">
        <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/image%206.png" alt="Introducing DashAnime XL 1.0" style="max-width: 100%; height: auto;">
      </td>
      <td style="width: 50%; word-wrap: break-word;">
        A male character with spiky blond hair and blue eyes, wearing a green military-style jacket with gold buttons and a white shirt. He holds a sword with a golden hilt and a silver blade, examining it closely. The character&apos;s expression is focused and serious.
      </td>
    </tr>
  </tbody>
</table>
<style>
  .custom-image {
    max-width: 100%;
    height: auto;
    display: block;
    transition: transform 0.7s;
    border-radius: 10px;
  }

  .custom-image:hover {
    transform: scale(1.05);
  }
</style>
<p>To address this, we developed an in-house LLM that compresses these ultra-descriptive captions into short, concise ones, deliberately omitting any style references. In later experiments, we discovered that adding &quot;anime illustration&quot; as a prefix significantly improved the quality of the generated images. As a result, we applied this prefix to all captions in our training dataset.</p>
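<p>The condensing step can be sketched as an instruction template. The in-house LLM and its exact prompt are not public, so the wording below is an illustrative assumption that mirrors the behaviour described above (short caption, style references dropped):</p>

```python
# Illustrative sketch of the caption-condensing instruction. The in-house
# LLM and its actual prompt are not public; this wording is an assumption
# that reflects the stated goal: a short caption with no style references.

CONDENSE_INSTRUCTION = (
    "Rewrite the following image caption as 2-3 concise sentences. "
    "Keep the subject, clothing, pose, and colors. Omit any style, "
    "artist, or franchise references.\n\nCaption: {caption}"
)

def build_condense_prompt(long_caption: str) -> str:
    """Build the instruction sent to the condensing LLM for one caption."""
    return CONDENSE_INSTRUCTION.format(caption=long_caption.strip())
```

<p>Whichever model performs the rewrite, the key constraint is that style and franchise references never survive into the short caption.</p>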
<table style="width: 100%; border-collapse: collapse; white-space: normal !important;">
  <thead>
    <tr>
      <th style="width: 30%; text-align: center;">Image</th>
      <th style="width: 70%; text-align: left;">Caption</th>
    </tr>
  </thead>
  <tbody>
      <tr><td style="text-align: center;">
        <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/0001.png" alt="Introducing DashAnime XL 1.0" style="max-width: 100%; height: auto;">
      </td>
      <td style="width: 50%; word-wrap: break-word;">
        <p><strong>Long Caption:</strong> An animated character with green hair and green eyes, wearing a blue and white outfit with a high ponytail. she is holding a sword with both hands, and the sword is emitting a bright light, suggesting it might be a magical or special weapon. the character is looking directly at the viewer with a focused expression. she is wearing fingerless gloves and has a rope belt around her waist. the background is a warm, brownish color, which could indicate an indoor setting or a warm environment. the character&apos;s attire and the style of the image suggest it is related to the fire emblem series, specifically referencing the game fire emblem: the blazing blade and the character lyn from that game. the image is rich in detail and conveys a sense of action and readiness.</p>
        <hr>
        <p><strong>Short Caption:</strong> A girl with long, dark green hair and striking green eyes. she is adorned in a blue dress with intricate designs and accessories, including earrings and a belt with a pouch. She is holding a sword, which emanates a glowing blue light, suggesting it might be enchanted or imbued with some magical power.</p>
      </td>
    </tr>
    <tr>
      <td style="text-align: center;">
        <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/0002.png" alt="Introducing DashAnime XL 1.0" style="max-width: 100%; height: auto;">
      </td>
      <td style="width: 50%; word-wrap: break-word;">
        <p><strong>Long Caption:</strong> An animated character from the &quot;the idolmaster&quot; franchise, specifically from <br> the &quot;shiny colors&quot; series. the character is labeled as chiyoko sonoda, indicating her name within the series. the character is drawn in a style typical of anime and manga, with large expressive eyes and stylized features. she has brown hair styled in a double bun, with a black choker around her neck. she is wearing a light blue shirt with the word &quot;shiny&quot; written on it, paired with a pink skirt that has a high waist. the skirt has suspenders, and she is wearing a black choker with a small charm. the character is standing with her hands on her hips, and she is looking to the side with a slight blush on her cheeks. her pose and expression suggest a confident and playful demeanor. the background is simple and does not distract from the character, focusing the viewer&apos;s attention on her. the character&apos;s appearance and pose are notable, with her hair style, the color of her clothing, and the accessories she is wearing all contributing to her overall look. the image is cropped at the waist, showing a &quot;cowboy shot&quot; view of the character. the overall impression is that of a cute and fashionable girl, likely designed to appeal to fans of the &quot;idolmaster&quot; series.</p>
        <hr>
        <p><strong>Short Caption:</strong> A girl character with long, vibrant blue hair and striking blue eyes. she&apos;s adorned in a black dress with intricate lace and frill details, complemented by black gloves and knee-high boots. the character holds a black umbrella with a decorative ribbon, and her gaze is directed towards the viewer. the background is a muted grey and the character has a few tattoos on her arm.</p>
      </td>
    </tr>
  </tbody>
</table>
<h2 id="adding-special-tags-for-improved-control-%E2%80%A6">Adding special tags for improved control &#x2026;</h2>
<p>In addition to recaptioning the entire dataset, we retained a select few tags that we identified as crucial for enhancing the model&apos;s learning capabilities.</p>
<p>We carefully curated a list of special tags from the original Danbooru prompts, recognizing that certain tags, such as <strong>lowres</strong>, <strong>highres</strong>, and <strong>absurdres</strong>, were essential for guiding the model toward better generations. We also preserved the character name and series name tags from the original Danbooru prompts.</p>
<p>Furthermore, we introduced special tags to each image in the dataset to provide more control over the model&apos;s output. These tags serve specific purposes:</p>
<ol>
<li>nsfw rating tags: safe, sensitive, general, explicit, nsfw</li>
<li>aesthetic tags: very aesthetic, aesthetic, displeasing, very displeasing</li>
<li>quality tags: masterpiece quality, best quality, high quality, medium quality, normal quality, low quality, worst quality</li>
</ol>
<h2 id="assembling-the-final-prompts">Assembling the final prompts</h2>
<p>Using the above guidelines, we arrive at the final assembled prompt structure:</p>
<pre style="white-space: pre-wrap; word-wrap: break-word; overflow-x: auto; background-color: #f5f5f5; padding: 10px; border-radius: 5px;"><code>anime illustration, [[super descriptive prompt OR condensed prompt]]. [[quality tags]], [[aesthetic tags]], [[nsfw rating tags]]</code>
</pre>
<p>This final prompt structure ensures that each generated image aligns with the desired style, quality, and content, providing a comprehensive framework for creating high-quality anime illustrations.</p>
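<p>The assembly above can be expressed as a small helper. The tag vocabularies come from the lists in this post; the function name and default tag choices are ours:</p>

```python
# Assemble the final prompt from its parts, mirroring the structure
# "anime illustration, <caption>. <quality>, <aesthetic>, <rating>".
# Tag vocabularies are taken from the lists in this post; the helper
# name and defaults are illustrative choices, not official tooling.

QUALITY_TAGS = {"masterpiece quality", "best quality", "high quality",
                "medium quality", "normal quality", "low quality", "worst quality"}
AESTHETIC_TAGS = {"very aesthetic", "aesthetic", "displeasing", "very displeasing"}
RATING_TAGS = {"safe", "sensitive", "general", "explicit", "nsfw"}

def assemble_prompt(caption: str, quality: str = "best quality",
                    aesthetic: str = "aesthetic", rating: str = "safe") -> str:
    """Build the final training/inference prompt from a caption and tags."""
    for tag, vocab in ((quality, QUALITY_TAGS), (aesthetic, AESTHETIC_TAGS),
                       (rating, RATING_TAGS)):
        if tag not in vocab:
            raise ValueError(f"unknown tag: {tag!r}")
    return f"anime illustration, {caption.rstrip('.')}. {quality}, {aesthetic}, {rating}"
```

<p>Validating tags against a fixed vocabulary catches typos early, since a misspelled tag would otherwise silently act as ordinary prompt text.</p>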
<p>Our model leverages special tags to guide the output towards specific qualities, ratings, and aesthetics. While the model can generate images without these tags, including them often results in superior outcomes.</p>
<p>Below we showcase some examples from our training dataset after assembling the final prompt.</p>
<div class="masonry-grid">
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/0a8c907f6e974ec29bdc8d0a80248a22.webp" alt="Introducing DashAnime XL 1.0">
    <div class="caption">anime illustration, a boy with spiky hair, one purple eye, and bandaged arms crouches in a ready pose, his face set in a serious expression. Low quality, very displeasing, safe, monochrome.</div>
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/00aba7a45cef4a91a3b817189c8c6575.webp" alt="Introducing DashAnime XL 1.0">
    <div class="caption">anime illustration, a boy with dark hair and a white headband with a black strap looks to the side with a surprised or intense expression. He wears a plaid shirt. A girl with blonde hair and a blue beret smiles with one eye closed, wearing a purple outfit with a white collar. Low quality, very displeasing, safe.</div>
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/0b3e8ef794594721825662b0343ff29e.webp" alt="Introducing DashAnime XL 1.0">
    <div class="caption">anime illustration, a boy with reddish-brown hair and blue eyes wears a light blue hoodie with a black strap over his shoulder, holding black earphones to his ear in a casual, relaxed pose with a slight smile and direct gaze. best quality, aesthetic, safe.</div>
  </div>
</div>
<style>
  .masonry-grid {
    column-count: 2;
    column-gap: 15px;
  }

  .masonry-item {
    margin-bottom: 15px;
    break-inside: avoid;
  }

  .custom-image {
    width: 100%;
    height: auto;
    display: block;
    border-radius: 10px;
    transition: transform .7s;
  }

  .custom-image:hover {
    transform: scale(1.05);
  }

  .caption {
    margin-top: 10px;
    font-size: 14px;
    line-height: 1.4;
    color: #555;
    text-align: center;
  }
</style>
<h1 id="training-details">Training details</h1>
<p>The model is built on the SDXL architecture. After performing rule-based cleaning and balancing on the dataset (as discussed above), we fine-tuned it for a total of 25 epochs on the following schedule:</p>
<ol>
<li><strong>Epochs 1-10:</strong>
<ol>
<li>We fine-tuned both the UNet and the CLIP text encoders to help the model learn anime concepts.</li>
<li>During this stage, we alternated between using super descriptive captions and short captions in the training process.</li>
<li>While this approach improved the model&apos;s understanding of anime concepts, it also introduced the style inconsistency issue described earlier. We decided to retain the text encoders and the UNet from this stage, prefix &quot;anime illustration&quot; to all the training captions, and remove all style references from the captions for stage 2.</li>
<li>The sample size during this phase was approximately 5M images.</li>
</ol>
</li>
<li><strong>Epochs 10-25:</strong>
<ol>
<li>We shifted focus to fine-tuning only the UNet from stage 1, aiming to refine the model&apos;s art style and improve its rendering of hands and anatomy.</li>
<li>In this stage, we trained exclusively on short captions.</li>
<li>The dataset was further filtered down to ~1.5M images.</li>
</ol>
</li>
</ol>
<p>Both stages of training employed aspect ratio bucketing, so samples were batched at resolutions close to their native aspect ratios rather than being cropped or resized to a single fixed shape, preventing quality degradation in the training data.</p>
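<p>Aspect ratio bucketing can be sketched as follows. The pixel budget and step size below are typical SDXL-style values and are assumptions on our part, not our exact training configuration:</p>

```python
# Minimal aspect-ratio-bucketing sketch. Buckets hold a roughly constant
# pixel budget (1024*1024 here, typical for SDXL) while varying shape;
# each image is routed to the bucket with the closest aspect ratio, so a
# batch never mixes shapes and images are never cropped to a square.
# Values are illustrative assumptions, not the exact training config.

def make_buckets(budget: int = 1024 * 1024, step: int = 64,
                 max_ratio: float = 2.0) -> list[tuple[int, int]]:
    """Enumerate (width, height) buckets within the aspect-ratio limit."""
    buckets = []
    w = step
    while w <= int((budget * max_ratio) ** 0.5):
        h = int(budget / w) // step * step  # snap height to the step grid
        if h >= step and max(w / h, h / w) <= max_ratio:
            buckets.append((w, h))
        w += step
    return buckets

def assign_bucket(width: int, height: int,
                  buckets: list[tuple[int, int]]) -> tuple[int, int]:
    """Pick the bucket whose aspect ratio is closest to the image's."""
    ratio = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))
```

<p>At load time each image is resized to its assigned bucket, and the dataloader samples batches from one bucket at a time.</p>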
<p>Throughout both stages, we used the Adafactor optimizer, which offered the best memory/performance trade-off in our tests. We initially experimented with <strong>kohya&apos;s sd-scripts</strong> for training but soon developed our own tooling around <strong>diffusers</strong>, <strong>mosaicml-streaming</strong>, and <strong>pytorch-lightning</strong> to handle the massive data loads and scale efficiently across multiple nodes.</p>
<p><strong>mosaicml-streaming</strong> in particular was crucial to meeting our dataset requirements: it let us stream large volumes of data in and out of multiple nodes, using local NVMe space as a staging ground, and scaled seamlessly as nodes were added.</p>
<h1 id="evaluation">Evaluation</h1>
<p>We experimented with nearly all of the open-source benchmarks for diffusion models, including <a href="https://github.com/TencentQQGYLab/ELLA/tree/main?ref=insiders.dashtoon.com">DPG</a> and <a href="https://github.com/djghosh13/geneval?ref=insiders.dashtoon.com">Geneval</a>. However, we quickly realised that these benchmarks don&apos;t account for the factors that matter most in show and comic creation: character consistency, multi-character interaction, scene interaction, and overall visual coherence.</p>
<p>Drawing insights from the existing benchmarks, particularly the excellent DPG benchmark from the ELLA team, we built our own benchmarks tailored specifically to show and comic creation. A more detailed blog on this topic will follow. With these considerations in mind, we present our benchmark scores.</p>
<p><img src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/comp_graph.png" alt="Introducing DashAnime XL 1.0" loading="lazy"></p>
<p>On our internal metrics, our model demonstrates superior prompt adherence compared to other anime models on the market. A key advantage is ease of use: there&apos;s no need to rely on Danbooru tags to generate high-quality images; natural language prompting works directly.</p>
<blockquote>
<p><em>We plan to conduct a more comprehensive study in the future to further validate and refine our model&apos;s capabilities. While our model performs well overall, we aim to further enhance its capabilities by incorporating more detailed character and style information. Currently, the model struggles to consistently generate styles beyond the base anime style. We plan to address this in future iterations, ensuring more consistent and diverse stylistic outputs.</em></p>
</blockquote>
<h1 id="results-showcase">Results Showcase</h1>
<div class="masonry-grid">
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/367983.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00024_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00027_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00044_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00055_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00071_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00087_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00099_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00115_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00121_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00129_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00137_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00149_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00153_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00192_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00259_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00267_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00296_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00306_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/ComfyUI_temp_zvkev_00330_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/CR_00414_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/CR_00418_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/CR_00420_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/CR_00421_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/CR_00423_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/CR_00424_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/368340.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/368341.png" alt="Introducing DashAnime XL 1.0">
  </div>
</div>
<style>
  .masonry-grid {
    column-count: 3;
    column-gap: 15px;
  }

  .masonry-item {
    margin-bottom: 15px;
    break-inside: avoid;
  }

  .custom-image {
    width: 100%;
    height: auto;
    display: block;
    border-radius: 10px;
    transition: transform .7s;
  }

  .custom-image:hover {
    transform: scale(1.05);
  }
</style>
<h1 id="how-to-use">How to use</h1>
<p>DashAnimeXL is now publicly available on <a href="https://dashtoon.com/?ref=insiders.dashtoon.com">Dashtoon Studio</a>! Head over there to check it out. Refer to the video below for a guide on how to use it.</p>
<video controls>
  <source src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/studio.mov" type="video/mp4">
  Your browser does not support the video tag.
</video>
<p>The model is also available on <a href="https://huggingface.co/dashtoon/DashAnimeXL-V1?ref=insiders.dashtoon.com">Hugging Face</a> and <a href="https://civitai.com/models/670869/dashanimexl-v1?ref=insiders.dashtoon.com">Civitai</a>.</p>
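<p>For readers who prefer code over the UI, here is a loading sketch with the diffusers library. The generation settings are generic SDXL defaults, not an official recommendation, and a CUDA GPU is assumed:</p>

```python
# Sketch of loading DashAnimeXL-V1 from the Hugging Face Hub via diffusers.
# Imports are deferred inside the function so it can be defined without
# torch/diffusers installed. Width/height and the negative prompt are
# generic SDXL defaults (assumptions), not official recommendations.

def generate(prompt: str, negative_prompt: str = "low quality, worst quality"):
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "dashtoon/DashAnimeXL-V1", torch_dtype=torch.float16
    ).to("cuda")
    # Follow the prompt structure from this post: prefix + caption + tags.
    full_prompt = f"anime illustration, {prompt}. best quality, aesthetic, safe"
    return pipe(full_prompt, negative_prompt=negative_prompt,
                width=1024, height=1024).images[0]
```

<p>The returned object is a PIL image, which can be saved with <code>.save("out.png")</code>.</p>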
<h1 id="what%E2%80%99s-next">What&#x2019;s next?</h1>
<p>Recently, we&apos;ve seen <a href="https://blackforestlabs.ai/announcing-black-forest-labs/?ref=insiders.dashtoon.com">FLUX</a> gaining traction, and our internal benchmarks also found it very promising for show and comic creation.</p>
<p>However, we observed a significant issue with style inconsistency, especially when deviating from the realistic style. The anime style, in particular, tends to be quite inconsistent.</p>
<p>As a way to contribute to the open-source community, we plan to release some cool variants of FLUX in the coming weeks.</p>
<p>Here&#x2019;s a sneak peek at some of the upcoming outputs.</p>
<div class="masonry-grid">
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/CR_00001__(1).png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/CR_00008_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/CR_00012_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/CR_00014_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/CR_00018_.png" alt="Introducing DashAnime XL 1.0">
  </div>
  <div class="masonry-item">
    <img class="custom-image" src="https://content.dashtoon.ai/assets/DashAnimeXL_Blog/CR_00022_.png" alt="Introducing DashAnime XL 1.0">
  </div>
</div>
]]></content:encoded></item><item><title><![CDATA[Enhancing Performance in Dashtoon Studio: A Leap from 8FPS to 55FPS]]></title><description><![CDATA[<h2 id="hey-there-everyone"><strong>Hey There, Everyone!</strong></h2><p>We&apos;ve got an exciting story for you! In just one week, our team turned our comic creation tool, Dashtoon Studio, from kind of slow (think 8FPS slow) to super speedy (a whopping 55FPS!). Want to know how we did it? Read on!</p><h2 id="some-context">Some Context</h2><p>Behind</p>]]></description><link>http://insiders.dashtoon.com/enhancing-performance-in-dashtoon-studio-a-leap-from-8fps-to-55fps/</link><guid isPermaLink="false">66c2e24b8bb9860065517871</guid><category><![CDATA[tech]]></category><dc:creator><![CDATA[Mohammad Zaryab]]></dc:creator><pubDate>Wed, 01 May 2024 13:11:01 GMT</pubDate><media:content url="http://insiders.dashtoon.com/content/images/2024/08/DALL-E-2024-01-02-23.59.44---Create-a-horizontal-tech-themed-image-for-a-blog-post--showcasing-the-difference-between-8FPS-and-55FPS-in-React-programming.-The-left-side-should-sho.png" medium="image"/><content:encoded><![CDATA[<h2 id="hey-there-everyone"><strong>Hey There, Everyone!</strong></h2><img src="http://insiders.dashtoon.com/content/images/2024/08/DALL-E-2024-01-02-23.59.44---Create-a-horizontal-tech-themed-image-for-a-blog-post--showcasing-the-difference-between-8FPS-and-55FPS-in-React-programming.-The-left-side-should-sho.png" alt="Enhancing Performance in Dashtoon Studio: A Leap from 8FPS to 55FPS"><p>We&apos;ve got an exciting story for you! In just one week, our team turned our comic creation tool, Dashtoon Studio, from kind of slow (think 8FPS slow) to super speedy (a whopping 55FPS!). Want to know how we did it? 
Read on!</p><h2 id="some-context">Some Context</h2><p>Behind the scenes, we use Tldraw (<a href="https://web.archive.org/web/20240228134934/http://tldraw.com/?ref=blog.dashtoon.ai">tldraw.com</a>) for its impressive functionality, which served as a solid starting point for our project. However, customizing Tldraw to meet our unique needs, especially in terms of canvas state management, proved to be a significant challenge. Achieving a balance between our specific requirements and Tldraw&apos;s existing features required a thorough understanding of the library, but even so, we encountered some challenges.</p><h2 id="the-problem-dashtoon-studio-our-platform-for-making-comics-with-ai-was-moving-like-a-snail-when-our-canvas-got-filled-up-the-whole-thing-would-just-crawl-not-good-right-we-knew-we-had-to-fix-this-and-fast"><em>The Problem:</em><strong>&#xA0;Dashtoon Studio, our platform for making comics with AI, was moving like a snail. When our canvas got filled up, the whole thing would just crawl. Not good, right? We knew we had to fix this, and fast!</strong></h2><h2 id></h2><figure class="kg-card kg-gallery-card kg-width-wide"><div class="kg-gallery-container"><div class="kg-gallery-row"><div class="kg-gallery-image"><img src="https://web.archive.org/web/20240228134934im_/https://blog.dashtoon.ai/content/images/2024/01/Screenshot-2024-01-02-at-5.15.47-PM.png" width="360" height="172" loading="lazy" alt="Enhancing Performance in Dashtoon Studio: A Leap from 8FPS to 55FPS"></div><div class="kg-gallery-image"><img src="https://web.archive.org/web/20240228134934im_/https://blog.dashtoon.ai/content/images/2024/01/Screenshot-2024-01-02-at-5.17.01-PM.png" width="366" height="178" loading="lazy" alt="Enhancing Performance in Dashtoon Studio: A Leap from 8FPS to 55FPS"></div></div></div></figure><p>FPS Comparison</p><p></p><h2 id="our-super-busy-week-of-fixes"><strong>Our Super Busy Week of Fixes</strong><br></h2><h3 id="1-making-our-app-smarter-with-redux"><strong>1. 
Making Our App Smarter with Redux</strong></h3><p><strong><em>No More Useless Chatter</em></strong></p><ul><li><em>What Was Wrong</em>: Our app was updating too much and causing a lot of unnecessary work.</li><li><em>What We Did</em>: We optimized the way how we were using Redux to make our app smarter about updates. It stopped bothering every part of the app unless it really needed to.</li></ul><p><strong><em>Remembering Stuff to Save Time</em></strong></p><ul><li><em>What Was Wrong</em><strong>:</strong>&#xA0;Our app parts were redoing stuff they didn&apos;t need to.</li><li><em>What We Did</em>: We used some tricks with&#xA0;<strong><code>React.memo</code></strong>,&#xA0;<strong><code>useMemo</code></strong>, and&#xA0;<strong><code>useCallback</code></strong>&#xA0;to help our app remember things better, so it didn&#x2019;t redo work.</li></ul><h3 id="2-smoothing-out-interactions-on-the-canvas"><strong>2. Smoothing Out Interactions on the Canvas</strong></h3><p><strong><em>No More Jumpy Interactions</em></strong></p><ul><li><em>What Was Wrong</em><strong>:</strong>&#xA0;Drawing or Dragging on our canvas was all jerky and not smooth.</li><li><em>What We Did</em>: We made this debounced custom hook that made every move on the canvas smooth and nice.</li></ul><p><strong><em>Handling Inputs Better</em></strong></p><ul><li><em>What Was Wrong</em><strong>:</strong>&#xA0;Our app was getting overwhelmed with too much input.</li><li><em>What We Did</em>: We used&#xA0;<strong><code>useCallback</code></strong>&#xA0;to make it chill and only deal with what was important.</li></ul><h3 id="3-custom-hooks-streamlining-components"><strong>3. Custom Hooks: Streamlining Components</strong></h3><p><strong><em>Unburdening Components</em></strong></p><ul><li><em>What Was Wrong</em><strong>:</strong>&#xA0;Components were overloaded with too much logic and state.</li><li><em>What We Did</em>: We developed custom hooks for specific component needs. 
This helped isolate and manage state and effects where needed, making components lighter and faster.</li></ul><h3 id="4-canvas-element-positioning"><strong>4. Canvas Element Positioning</strong></h3><p><strong><em>Stabilizing Moving Elements</em></strong></p><ul><li><em>What Was Wrong</em><strong>:</strong>&#xA0;Elements using&#xA0;<strong><code>transform</code></strong>&#xA0;were causing unnecessary re-renders.</li><li><em>What We Did</em>: We switched these elements to fixed positions, cutting down on the re-renders and enhancing performance.</li></ul><h3 id="5-smarter-auto-save"><strong>5. Smarter Auto-Save</strong></h3><p><strong><em>Auto-Save That Actually Makes Sense</em></strong></p><ul><li><em>What Was Wrong</em>: Our app was saving stuff too often, even when it didn&#x2019;t need to.</li><li><em>What We Did</em>: We made our auto-save feature only kick in when it really needed to, by checking for actual changes.</li></ul><h3 id="6-trimming-state-data"><strong>6. Trimming State Data</strong></h3><p><strong><em>Clean Up Our Data</em></strong></p><ul><li><em>What Was Wrong</em><strong>:</strong>&#xA0;We had too much unnecessary stuff in our data.</li><li><em>What We Did</em>: We got rid of the clutter and made sure we only kept what we needed.</li></ul><h3 id="7-updating-the-canvas-only-when-needed"><strong>7. Updating the Canvas Only When Needed</strong></h3><p><strong><em>Smart Canvas Updates</em></strong></p><ul><li><em>What Was Wrong</em><strong>:</strong>&#xA0;Our canvas was updating too often, which was slow.</li><li><em>What We Did</em>: We got smarter about when to update the canvas, which helped speed things up.</li></ul><h2 id="the-results"><strong>The Results</strong></h2><p>After a week of hard work, we didn&#x2019;t just make small changes; we totally transformed Dashtoon Studio! 
It went from being slow and frustrating to fast and fun.</p><h2 id="what-we-learned"><strong>What We Learned</strong></h2><ul><li><strong>Keep an Eye on Things:</strong>&#xA0;It&#x2019;s super important to regularly check how things are running, just like you&#x2019;d check your car.</li><li><strong>Fix the Right Stuff:</strong>&#xA0;Like being a detective, we found what was really causing the problems and fixed those things.</li><li><strong>Keep It Simple:</strong>&#xA0;We made sure not to make our code too complicated while making these fixes.</li></ul><p></p>]]></content:encoded></item><item><title><![CDATA[Kubernetes at Seed Stage]]></title><description><![CDATA[<p>Now I know what you, dear reader, would be thinking just by reading the title, &#x201C;Kubernetes in an early stage startup? This is like bringing a gun a to a knife fight&#x201D; or &#x201C;Kubernetes is an overpowered solution at such a small scale and does not make</p>]]></description><link>http://insiders.dashtoon.com/kubernetes-at-seed-stage/</link><guid isPermaLink="false">66c2e24b8bb9860065517870</guid><category><![CDATA[infra]]></category><category><![CDATA[tech]]></category><dc:creator><![CDATA[Anmol Chawla]]></dc:creator><pubDate>Thu, 25 Apr 2024 11:13:04 GMT</pubDate><media:content url="https://images.unsplash.com/photo-1605745341112-85968b19335b?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDd8fGt1YmVybmV0ZXN8ZW58MHx8fHwxNzEzOTYwNzU1fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" medium="image"/><content:encoded><![CDATA[<img src="https://images.unsplash.com/photo-1605745341112-85968b19335b?crop=entropy&amp;cs=tinysrgb&amp;fit=max&amp;fm=jpg&amp;ixid=M3wxMTc3M3wwfDF8c2VhcmNofDd8fGt1YmVybmV0ZXN8ZW58MHx8fHwxNzEzOTYwNzU1fDA&amp;ixlib=rb-4.0.3&amp;q=80&amp;w=2000" alt="Kubernetes at Seed Stage"><p>Now I know what you, dear reader, would be thinking just by reading the title, &#x201C;Kubernetes in an early stage startup? 
This is like bringing a gun to a knife fight&#x201D; or &#x201C;Kubernetes is an overpowered solution at such a small scale and does not make sense&#x201D; and so on. But do give me a chance to present its benefits for early-stage startups before you decide to move on.</p><p>Before we begin, it would be prudent to introduce Kubernetes, just on the off-chance that you&#x2019;re not familiar with it. Kubernetes, or k8s as it is fondly known to its users, is at its core a container orchestration framework that allows us to automate the deployment, management, and scaling of services. This removes much of the manual effort that would otherwise go into operations such as rolling updates and high availability.</p><p>The traditional stereotype associated with k8s is that it is a massive framework requiring a dedicated team to run the cluster and ensure its continued operation. Another common stereotype is that it is time-consuming and painstakingly difficult to set up. Neither is really accurate anymore.</p><p>While k8s definitely requires dedicated teams and engineers to maintain at scale, the same is not true when it orchestrates the small number of workloads at an early-stage startup. At Dashtoon we are a team of 5 engineers, each working on our own product tasks, who are also responsible for the continued operation of the cluster. Every one of us is comfortable making changes to the cluster, be it deploying a new service or creating a new ingress.</p><p>As for setting up the cluster, it has become ridiculously easy to get a production-scale cluster up and running using the managed solutions provided by cloud providers such as Azure and AWS. 
In fact, we were able to bring up our cluster and serve production traffic within an hour using one of the managed solutions.</p><p>Now that I&#x2019;ve covered the existing stereotypes and their relevance, let&#x2019;s move on to how we use the framework at Dashtoon and how it makes our lives easier compared to the setup we had before.</p><p>First, a bit of an introduction to Dashtoon. We are a generative AI startup aiming to make it easier for creators to tell visual stories and to delight users with visual content. The idea is to reduce the time and effort it takes to create visual stories and allow authors to focus on the storytelling.</p><p>Due to our small team and scale, we made a conscious decision to go with a mono-repo setup for both the backend and frontend. Our language of choice has been Kotlin for the back-end and Dart (via Flutter) for the front-end. Along with this, we&#x2019;re also self-hosting a setup for stable-diffusion-webui to allow quick prototyping of model changes and new technologies. With all this, we have separate dev and prod setups to handle, along with respective ingresses for each service.</p><p>We started out running Docker on a VM, with each service in its own container. 
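</p><p>In compose terms, that original setup looked something like the sketch below. The service and image names here are illustrative stand-ins, not our actual configuration:</p><pre><code># Illustrative docker-compose sketch of the old single-VM setup
version: "3"
services:
  backend:
    image: registry.example.com/backend:latest   # one container per service
  nginx:
    image: nginx:1.25
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx:/etc/nginx/conf.d:ro             # one config file per sub-domain
  watchtower:
    image: containrrr/watchtower                 # redeploys on each image push
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock</code></pre><p>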
At that time, we had the same bias that k8s was an overpowered solution which would not be needed at such a small scale and would be more hassle than it was worth (how wrong we were!).</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://insiders.dashtoon.com/content/images/2024/08/Screenshot_2023-03-08_at_18.45.41.png" class="kg-image" alt="Kubernetes at Seed Stage" loading="lazy" width="2000" height="1011" srcset="http://insiders.dashtoon.com/content/images/size/w600/2024/08/Screenshot_2023-03-08_at_18.45.41.png 600w, http://insiders.dashtoon.com/content/images/size/w1000/2024/08/Screenshot_2023-03-08_at_18.45.41.png 1000w, http://insiders.dashtoon.com/content/images/size/w1600/2024/08/Screenshot_2023-03-08_at_18.45.41.png 1600w, http://insiders.dashtoon.com/content/images/2024/08/Screenshot_2023-03-08_at_18.45.41.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Docker Setup</span></figcaption></figure><p>So, here is how the setup was:</p><ol><li>A VM hosted on Azure</li><li>Running Docker containers</li><li>An nginx container to front incoming requests, with individual configs for each sub-domain</li><li>Watchtower to update services on each image push</li><li>A secondary VM running Docker with basic Prometheus and Grafana for monitoring</li></ol><p>While this was perfectly serviceable, there were several gripes with the setup:</p><ol><li>Scaling up a service meant bringing up another VM with the same set of services</li><li>There were multiple nginx config files with a lot of boilerplate code. To give you an idea, we had 3 active services and 2 sub-domains for each, which meant 6 config files just to start with.</li><li>Separating out dev and prod environments was difficult</li><li>It was almost impossible to do rolling updates. 
For example, doing rolling updates while staying within the Docker setup would have meant moving to a Docker Swarm setup, just to gain the ability to control replicas for each service. To add to the difficulty, Watchtower works at a machine level and hence would not be aware of the different machines in the swarm, which would have meant finding an alternative that was swarm-aware.</li><li>RBAC for access to logs and configuration was difficult</li><li>Managing secrets and environment variables required editing the base docker compose file</li></ol><p>Taking all of this into account, we decided to give k8s a try, thanks to our prior experience with it while handling infrastructure at udaan, and knowing that it provides a solution for each of the above-mentioned points. Let me list out the changes to the setup and the subsequent ease that came with k8s.</p><p>With k8s, the setup now becomes:</p><ol><li>A k8s cluster with two node pools (read: VMs), one for the control plane and another for workloads. You would have noticed that I&#x2019;ve not mentioned this anywhere in the setup diagram, and that is because k8s abstracts away the nodes and takes care of the scheduling and scaling for us.</li><li>Three main namespaces:<ol><li>dev</li><li>prod</li><li>monitoring</li></ol></li><li>Both the dev and prod namespaces host the back-end and the web front-end</li><li>Monitoring now includes OpenTelemetry, which pushes to Prometheus and is then visualised in Grafana</li><li>Rolling updates, with at most 25% unavailability</li><li>RBAC integrated with Azure AD, which allows us to manage access at the AAD group level</li><li>Secrets and config maps mounted as environment variables on the deployments</li><li>Ingress handled by k8s using ingress-nginx, which allows us to just mention the sub-domain, certificates and the host path</li><li>A CI/CD pipeline with the ability to target dev and production deployments</li></ol><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://insiders.dashtoon.com/content/images/2024/08/k8s-setup.png" class="kg-image" alt="Kubernetes at Seed Stage" loading="lazy" width="2000" height="1056" srcset="http://insiders.dashtoon.com/content/images/size/w600/2024/08/k8s-setup.png 600w, http://insiders.dashtoon.com/content/images/size/w1000/2024/08/k8s-setup.png 1000w, http://insiders.dashtoon.com/content/images/size/w1600/2024/08/k8s-setup.png 1600w, http://insiders.dashtoon.com/content/images/2024/08/k8s-setup.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">k8s setup</span></figcaption></figure><p>Looking at this list should give you an idea of how easy k8s makes it to manage infrastructure and the various components involved. Migrating to k8s freed up the time that was earlier spent writing nginx configurations and fiddling with docker compose yamls to deploy a new service. To put that into perspective, we had spent about 3-4 man-hours setting up the Docker workflow and ensuring it worked, compared to the 1 man-hour it took to set up k8s. And that is not counting the reduction in upkeep and maintenance time that k8s automates away for us.</p><p>By this point you must be realising the benefits that k8s brings, but you may also be wondering if all this magic comes at a cost. To assuage your fears: no, it does not increase costs by much. On most platforms there are no additional charges for setting up and running the cluster itself; the costs come purely from the VMs used for the node pools. Then why the increase in costs? That is because we end up provisioning an extra VM to host the system (control plane, if you will) components, in keeping with good practices for a k8s cluster. 
Hence, compared to the previous Docker setup, our k8s cluster comes in at roughly 5-10% additional cost due to the additional VM.</p><p>So to summarise,</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="http://insiders.dashtoon.com/content/images/2024/08/image.png" class="kg-image" alt="Kubernetes at Seed Stage" loading="lazy" width="2000" height="1680" srcset="http://insiders.dashtoon.com/content/images/size/w600/2024/08/image.png 600w, http://insiders.dashtoon.com/content/images/size/w1000/2024/08/image.png 1000w, http://insiders.dashtoon.com/content/images/size/w1600/2024/08/image.png 1600w, http://insiders.dashtoon.com/content/images/2024/08/image.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Summary</span></figcaption></figure>]]></content:encoded></item></channel></rss>