Closing the SOTA ↔ Open-Source Gap - Image (Part 1)

A chronological walk-through of how we moved from Nano Banana Pro to Flux Klein, and from Seedance 2.0 to LTX 2.3, for specific pipelines.

Where this fits

Dashreels is our content-consumption platform. It is supported by Frameo, our platform for content creation at scale. We create three buckets of content:

01 - Originals
New stories, created from scratch.
Example: source episode - Ep 01

02 - Remix / Adaptation
An existing story relocalised to a new language and geography. Different cast, different dialogues, same narrative DNA.
Example: remix output - characters swapped, story relocalised

03 - Dubbing
Visuals untouched, only the audio swapped.
Examples: Chinese (original), Hindi (dubbed)

The three buckets sit at very different points on the automation curve. Originals are still deeply hands-on - artists are in the loop creating assets throughout. Remix produces an automated first version that artists then iterate on and fix. Dubbing is the most automated: the first output and the iterative rephrasing both happen automatically, with humans only reviewing at the end.

Category | How the first cut is made | Human role | Automation
Originals | Built asset-by-asset with heavy artistic direction | Artists in the loop throughout creation | 25%
Remix | First version generated automatically end-to-end | Artists iterate & fix after the first pass | 50%
Dubbing | First output + iterative rephrase, both automatic | Humans review only at the end | 90%

The automation percentages indicate roughly how much of the first deliverable lands without human touch. They are not a hard metric.

// Frameo · north-star
With our Frameo platform, we aim to bring 100% automation across all 3 categories - enabling Creators & Artists to create 1000s of shows every day.

Pipeline 1 - Image: Distilling Nano Banana Pro

This post is about the middle bucket. Remix is the one with the heavy visual lift: the characters have to change, which means every frame they appear in has to change with them.

Defining the problem

Stated precisely: given an input frame and a character reference, replace the person in the frame with that character. Everything that makes the shot that shot - pose, framing, the phone held to the ear, lighting, and street background - has to survive. Only the identity and wardrobe change.

Figure: input frame (red-haired woman in a leopard-print fur coat on a phone call, street background) + character reference (dark curly-haired woman in a tan fur coat over a red dress) = target output (the same shot with the character swapped in).
The swap task, end to end. The input frame's person is replaced by the character - face, hair, and wardrobe swapped - while the pose, the phone, framing, lighting, and street background are preserved.

The base shape of the pipeline has stayed constant. We take the source video, slice it into clips short enough for the video model in play (≤15s, usually ≤8s), swap characters on the keyframes, and use image-to-video to rebuild motion.
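
A minimal sketch of the slicing step, assuming plain fixed-length cuts. The function name `split_into_clips` and the choice of each clip's first frame as its keyframe are illustrative assumptions; only the 8 s and 15 s limits come from the text above, and a production pipeline may well cut on shot boundaries instead.

```python
from typing import List, Tuple


def split_into_clips(duration_s: float, max_clip_s: float = 8.0) -> List[Tuple[float, float]]:
    """Return (start, end) times covering the video, with no clip longer than max_clip_s."""
    if max_clip_s <= 0:
        raise ValueError("max_clip_s must be positive")
    clips = []
    start = 0.0
    while start < duration_s:
        # The last clip is simply shorter if the duration is not a multiple of the limit.
        end = min(start + max_clip_s, duration_s)
        clips.append((start, end))
        start = end
    return clips


clips = split_into_clips(20.0, max_clip_s=8.0)
print(clips)  # [(0.0, 8.0), (8.0, 16.0), (16.0, 20.0)]

# Assumption: the first frame of each clip is the keyframe that gets swapped.
keyframe_times = [start for start, _ in clips]
```

Each keyframe would then go through the character swap, and the swapped frame would seed image-to-video for its clip.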


The journey

v0 - Late 2025

Nano Banana Pro, direct

The naïve setup: hand NB Pro the source frame and the character reference, and prompt it to do the swap in one shot. Identity and hair transferred reasonably well - but the wardrobe leaked. The character came out wearing the original's clothes instead of her own. We call this dress bleed.

Figure: input frame (red-haired woman in a leopard-print fur coat) + character reference (dark curly-haired woman in a tan fur coat over a red dress) = NB Pro direct output (character's face and hair transferred, but wearing the original's leopard-print coat).
Direct NB Pro swap. The face and hair come through, but look at the coat: the character should be in her own tan fur over a red dress - instead she inherits the original's leopard-print coat. The wardrobe has bled through from the source frame.

v1 - Storyboard intermediates

Two-phase swap via line-art

To kill the dress bleed from v0, we broke the swap in two. Phase 1 converts the source frame into an anatomical line-art intermediate - pose and composition preserved, all appearance information (face, hair, wardrobe) stripped. Phase 2 then dresses the sketch with the new character. With no original clothing or face for Phase 2 to copy from, the bleed went away.

Figure: input frame before line-art conversion, and the anatomical line-art intermediate generated from it.
Phase 1: the source frame is reduced to a clean line-art intermediate - pose, framing, and composition preserved, all appearance stripped. With nothing to copy from, Phase 2 can no longer bleed the original's dress.
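
The two-phase structure can be sketched as a pair of function calls. This is an illustration only: `to_line_art` and `dress_sketch` are hypothetical wrappers around one image-model call each, and the stub implementations exist just so the sketch runs. The point it shows is that Phase 2 never receives the source frame.

```python
from typing import Callable

Image = bytes  # placeholder type standing in for encoded image data

# Hypothetical signatures: each phase wraps a single image-model call.
ToLineArt = Callable[[Image], Image]
DressSketch = Callable[[Image, Image], Image]


def two_phase_swap(frame: Image, character_ref: Image,
                   to_line_art: ToLineArt, dress_sketch: DressSketch) -> Image:
    # Phase 1 sees only the source frame and keeps pose and composition as line-art.
    sketch = to_line_art(frame)
    # Phase 2 gets the sketch and the reference, never `frame`,
    # so there is no original wardrobe left for it to copy.
    return dress_sketch(sketch, character_ref)


# Stub models so the example runs without any real model behind it.
def fake_to_line_art(frame: Image) -> Image:
    return b"lineart(" + frame + b")"


def fake_dress_sketch(sketch: Image, character_ref: Image) -> Image:
    return b"swap(" + sketch + b"," + character_ref + b")"


result = two_phase_swap(b"frame", b"char", fake_to_line_art, fake_dress_sketch)
print(result)  # b'swap(lineart(frame),char)'
```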

But the intermediate had its own instability. The same prompt produced wildly different kinds of line-art from frame to frame - some came back as a full skeleton with ribcage and pelvis, others as a plain outline, others with detailed muscle anatomy. That inconsistency forced iterations here too: we couldn't trust Phase 1 to hand Phase 2 a predictable input.

Figure: three input frames and their line-art intermediates, showing inconsistent skeleton styles for the same prompt.
Same prompt, three very different intermediates: a full skeleton (top), a clean outline (middle), and detailed musculature (bottom). The line-art step refused to be deterministic, so each frame still needed manual checking.
Phase 1: NB Pro · Phase 2: NB Pro

Quality was acceptable but the pipeline was now two model calls per keyframe, and Phase 1's quality varied - NB Pro sometimes bled clothing patterns through into the sketch.

v2 - Klein-LoRA on Phase 1

The data flywheel, and a Flux Klein LoRA

Here's the part that's easy to miss: we already had the training data. Frameo's normal workflow is a data flywheel - it's implicit in how the platform works. Artists iterate on shots, reviewers comment, and the good ones get picked. Every approved frame is, quietly, a labelled training example.

Figure: Frameo canvas showing a column of shots with reviewer comment markers.
The flywheel in situ - the Frameo canvas. Artists lay out shots, reviewers drop comments (the purple markers), and approved frames flow downstream. The training set is a by-product of work that was happening anyway.

And the volume is real. A single 1-hour show yields 1,500–2,000 images. In about a week of normal production we accumulate the ~10K input/intermediate pairs a LoRA needs - with zero separate labelling effort.
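
As a rough back-of-the-envelope check on those numbers, the snippet below estimates how many show-hours it takes to reach ~10K pairs. It assumes, purely for illustration, that every image contributes exactly one usable input/intermediate pair; the real yield per image is not stated here.

```python
import math

TARGET_PAIRS = 10_000                   # ~10K pairs, from the text
IMAGES_PER_SHOW_HOUR = (1_500, 2_000)   # range quoted in the text


def show_hours_needed(target_pairs: int, images_per_hour: int) -> int:
    # Assumption: one image yields exactly one training pair.
    return math.ceil(target_pairs / images_per_hour)


low, high = IMAGES_PER_SHOW_HOUR
best_case = show_hours_needed(TARGET_PAIRS, high)
worst_case = show_hours_needed(TARGET_PAIRS, low)
print(best_case, worst_case)  # 5 7
```

Under that assumption, about five to seven hours of finished show is enough to reach the target.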

We trained a Flux Klein LoRA on those pairs, and the v1 instability disappeared: the skeleton became consistent. Same prompt, same clean line-art, every time. Where stock Klein had been worse than NB Pro and NB Pro itself wandered, Klein-on-our-pairs was clean and deterministic.

Figure: original frame (left) → Nano Banana output (middle) → Klein-LoRA output (right). NB Pro still smuggles clothing detail and varies its style; the Klein-LoRA output is clean, consistent line-art in every row.
Phase 1: Klein-LoRA · Phase 2: NB Pro

The flywheel compounds: the more shows we ship, the more training pairs we collect - so every model we distill keeps getting cheaper and better, for free.

v3 - Phases merged + multi-character

One step, up to three characters

With Phase 1 solved, we collapsed the two-phase pipeline into a single step. Instead of frame → line-art → transfer, the LoRA now takes the input frame plus the character references and produces the swapped frame directly - no intermediate, one model call.

We also extended it to multiple characters. Klein caps inputs at four images, so with the input frame taking one slot we can swap up to three characters at once, each from its own reference, in a single pass.
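
The slot arithmetic can be made explicit with a small helper. This is a sketch under assumptions: `build_swap_inputs` is a hypothetical name, file names stand in for images, and the frame-first ordering is an illustrative choice. Only the four-image cap comes from the text.

```python
from typing import List, Sequence

MAX_INPUT_IMAGES = 4  # Klein's input cap, as stated in the text


def build_swap_inputs(frame: str, character_refs: Sequence[str],
                      max_images: int = MAX_INPUT_IMAGES) -> List[str]:
    """Return the image list for one swap call: the frame, then the references."""
    max_characters = max_images - 1  # one slot is reserved for the input frame
    if not character_refs:
        raise ValueError("at least one character reference is required")
    if len(character_refs) > max_characters:
        raise ValueError(
            f"got {len(character_refs)} characters, at most {max_characters} fit in one pass"
        )
    return [frame, *character_refs]


inputs = build_swap_inputs("frame.png", ["char1.png", "char2.png", "char3.png"])
print(len(inputs))  # 4
```

A shot with more than three characters would need more than one pass, or some other strategy that is not covered here.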

Figure: one-step, multi-character swap. Input frame + up to three character references (char1–char3) → output with every character replaced in place. Pose, blocking, framing, and subtitles are preserved.
Single step: input + char refs → output · Up to 3 characters

Composition, clothing, and lighting held up against NB Pro. One gap remained.

v4 - UMO LoRA for character resemblance

Stacking a UMO identity LoRA on top

Klein is excellent at the structural half of the job - translating pose, body, and location faithfully. But on raw character resemblance it trails Nano Banana, and by this point Nano Banana 2 had shipped, raising the bar again. The output looked like the right kind of person, not unmistakably that person - and for a remix product, the cast has to be recognisable.

Rather than start from scratch, we piggy-backed on earlier in-house work on UMO LoRA training. We trained a Flux Klein-based UMO LoRA on 1.8M images, and at inference stacked both LoRAs - the swap LoRA for structure, UMO for identity. Resemblance closed the gap with Nano Banana, and beat it in several categories.
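
How the two LoRAs are combined at inference is not spelled out above. A common way to stack adapters is to add each one's low-rank update to the base weight with its own scale, W' = W + Σ scale_i · (B_i A_i). The toy sketch below shows that arithmetic on a 2x2 matrix in plain Python; all values, scales, and names are made up for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Adapter = Tuple[Matrix, Matrix, float]  # (B, A, scale)


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(base: Matrix, delta: Matrix, scale: float) -> Matrix:
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]


def stack_loras(weight: Matrix, adapters: List[Adapter]) -> Matrix:
    """Apply W' = W + sum_i scale_i * (B_i @ A_i) over all adapters."""
    merged = weight
    for b, a, scale in adapters:
        merged = add_scaled(merged, matmul(b, a), scale)
    return merged


# Toy 2x2 layer with two rank-1 adapters (illustrative values only).
W = [[1.0, 0.0], [0.0, 1.0]]
swap_lora = ([[1.0], [0.0]], [[0.0, 2.0]], 1.0)  # stands in for the structure LoRA
umo_lora = ([[0.0], [1.0]], [[4.0, 0.0]], 0.5)   # stands in for the identity LoRA

merged = stack_loras(W, [swap_lora, umo_lora])
print(merged)  # [[1.0, 2.0], [2.0, 1.0]]
```

Because the updates are additive, the per-adapter scales are the natural knob for trading structure against identity.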

Built on our prior write-up, “No more face mashups: UMO meets Qwen-Edit”, which covers the UMO LoRA approach in depth.
Structure: Klein swap LoRA · Identity: Klein UMO LoRA (1.8M imgs) · Inference: both stacked

A note on data

If one thing made this work, it's data - and the two LoRAs lean on it in opposite ways.

The Flux Klein swap LoRA was trained on roughly 10K image pairs from the flywheel. That's a rounding error next to the billions of images behind a model like Nano Banana Pro - and on this task it still wins. That's the quality argument: 10K perfectly-paired, human-reviewed, in-distribution examples carry more signal for a narrow job than billions of generic ones ever could.

UMO sits at the other end. Character resemblance isn't a narrow, structured task - it's open-ended, so it needs scale. There we trained on 1.8M images. That's the quantity argument: when the task is broad, volume wins. The flywheel is what lets us have it both ways.


Where this goes next - Part 2: video

With the image pipeline effectively solved, we're now moving from image-based pipelines directly to video-based ones.

Part 2 is where we take the same three ideas - the data flywheel that's implicit in how artists already work, the distillation playbook that beat a SOTA model with a focused LoRA, and the instinct to delete every intermediate we can - and carry them from stills into motion. We'll show where they transfer cleanly, and where video breaks them in ways images never did.

// to be continued
Part 1 closed the gap on images.
Part 2 takes it to video - use cases: watermark removal & character swap in videos.
