Aller au contenu principal
Frank Houbre
Tutoriels9 min read

Anatomical errors in AI video: why bodies, hands and gestures distort in motion

Hands in AI video are not corrected the way they are in a still image. Understand why bodies distort in motion, and the concrete techniques to reduce these artifacts in your clips.

Illustration for “Anatomical errors in AI video: why bodies, hands and gestures distort in motion”

You have a perfect image. The hands are clean, the face is believable, the composition is good. You launch the video generation from this image. And after two seconds, the fingers melt, the wrist goes off in an impossible direction, the character suddenly has six phalanges.

This is not the same thing as inpainting on a still image. In video, the distortion happens in time, not in space. And it is corrected differently.

Why bodies distort in AI video, and not only in images

Video generation models predict each frame in continuity from the previous one. This is fundamentally different from generating a still image, where the model optimizes for a single static result.

In video, the model has to maintain anatomical consistency across 24 frames per second for several seconds. The body areas that cause problems are the ones with a lot of joint freedom and fine structures: the hands, of course, but also the wrists, the ankles, the knees in flexion, and the shoulders during complex movements.

The problem gets worse when:

  • The movement is fast. The faster the area moves between two frames, the more latitude the model has to invent incorrect intermediate states.
  • The starting pose is already complex. A hand holding an object, fingers in a pincer shape, an unusual posture. The model starts on fragile foundations.
  • The shot is long. On a 4-second clip, the distortions accumulate. What was acceptable at frame 1 becomes obvious at frame 80.
  • The background changes. When the focus or the context move, the model prioritizes the overall consistency of the scene and drops the anatomical details.

💡 Frank's Cut: The best way to avoid broken hands in video is to hide them. Not because we are giving up, but because that is what good directors do on a real shoot when a situation is constraining: the framing solves the problem the technique cannot.

The most frequent errors by body area

Hands and fingers. This is the best-known artifact. In motion, fingers merge, multiply, or change shape between frames. It happens more often on close-up shots than on wide shots, and more often on dynamic gestures (grabbing, pointing, gesticulating) than on hands at rest.

Wrists and forearms. Less documented, but frequent: the wrist can adopt an anatomically impossible twist during a pose transition. Visible especially on gestures that cross the frame from left to right or on rotation movements.

Feet and ankles. On shots that include feet walking, the ankles sometimes change orientation between two strides. The flat foot that ends up in extension without a transition, the ankle that pivots 90 degrees.

Knees in flexion. Shots of characters sitting down, standing up or climbing. Knee flexion can exceed anatomical limits, especially if the movement is fast.

Shoulders on wide movements. During a broad gesture (throwing something, stretching), the shoulder can adopt a position that matches no real joint. More subtle than the hands, but spottable.

Concrete techniques to reduce these distortions

1. Start from an anatomically neutral pose

The starting image strongly conditions the rest. A neutral pose: hands slightly open at the sides, body straight, no complex twist. The model has fewer "fragile" areas to keep consistent.

If you need a specific pose, do it at the end of the clip rather than at the start. Generate a movement that arrives in the target pose, not one that leaves from it.

2. Frame to avoid the problematic areas

This is a directing decision, not a limitation to endure. If your scene does not absolutely need to see the hands, frame to the chest. If your character walks, cut at the knees. This is not cheating: it is production realism.

Wide shots tolerate anatomical imperfections far better than close-ups. A character walking down a corridor in a wide shot: the foot details are invisible. The same in a tight shot: every step is scrutinized.

3. Break the movement into short clips

A complex gesture over 10 seconds is going to accumulate distortions. The same gesture in two or three clips of 3-4 seconds that you then edit together is much more controllable.

Think like a cinematographer: a short action, a cut, another short action. The edit camouflages the problematic transitions and each individual clip is shorter, so more anatomically stable.

4. Guide the movements with precise prompts

Vague: "a person reaching out their hand" Precise: "slow close-up, hand gently resting on a wooden table, fingers relaxed, minimal movement, shallow depth of field"

The more you guide the type and amplitude of the movement in your prompt, the less the model invents. Terms like "minimal movement", "slow", "gentle" reduce distortions. Terms like "dynamic", "fast", "energetic" amplify them.

On tools like Kling 3.0 or Veo 3.1, the motion intensity or movement amplitude parameters let you cap the amplitude. Use them.

SituationRecommended approach
Hands visible in close-upNeutral pose at the start, minimal movement in the prompt
Walking in a tight shotReframe to the knees or film in a wide shot
Long complex gestureCut into 2-3 short clips + edit
Body rotationSimple starting image, "slow rotation" prompt
Interaction with an objectHeld/placed before generation, minimal movement in the clip

5. Use video inpainting when it is available

Some tools are starting to offer frame-by-frame editing or inpainting on sections of clips. It is not yet standard on all models, but it is available in Flow (Google) on certain types of edits, and in specialized ComfyUI workflows.

The idea: identify the exact frame or frames where the distortion appears, mask the problematic area, and regenerate that area in consistency with the rest. It takes time but it can save an otherwise excellent clip.

6. Select the best take rather than correct

The truth of AI video productions in 2026: generating the same clip 4 to 6 times and choosing the best take is often more efficient than correcting an artifact. With models like Seedance 2.0 Mini (half the cost of Fast), the cost of regeneration has dropped a lot.

This is not laziness, it is economic rationality. If generating 5 versions of a clip costs less than an hour of manual correction, the math is quick.

When these techniques are not enough

There are scenes where anatomical errors are nearly unavoidable with current models: a conversation with expressive gestures in close-up, a dance scene with fast hand movements, a manual work sequence (handling tools, cooking, etc.).

For these cases, two honest options:

Rework the script. Change the staging to avoid the problematic areas. A shot cut at the shoulders during the expressive gestures. A cooking scene filmed in a wide shot on the dishes rather than in close-up on the hands.

Accept a level of stylization. If your project tolerates a more stylized aesthetic (fantasy, anime, symbolic), the anatomical distortions pass much better. What breaks immersion in photo realism goes unnoticed in a more graphic render.

For the correction of associated still images, local inpainting to retouch eyes and hands covers the techniques applicable on a frozen frame. And to structure the whole pipeline from A to Z, the script-to-shot-list guide places these choices in a complete production logic.

FAQ

Foire aux questions

Réponses rapides aux questions les plus fréquentes sur cet article.

Why are hands harder than faces in AI video?

Hands have 27 bones and 29 joints with a very large freedom of movement. The models were trained on a lot of face images (social media, portrait photos) and far fewer images of hands in motion with precise anatomical annotations. The face is also more constrained: facial proportions vary little from one expression to another, which the model can learn easily. Each hand configuration is nearly unique.

Does choosing one model over another solve the problem?

Partially. Kling 3.0 and Veo 3.1 are generally better than older models on anatomical consistency. But the fundamental problem remains: no AI video model in 2026 reliably generates perfect hands on dynamic gestures in close-up. The difference between models is a matter of artifact frequency, not absence.

Is video inpainting different from image inpainting?

Yes. Image inpainting corrects an area on a fixed frame. Video inpainting has to maintain the consistency of the correction over time, across several frames. It is technically much more complex and the available tools are still in development. ComfyUI workflows allow frame-by-frame corrections, but it is time-consuming.

How many times do you have to regenerate a clip to get an acceptable result?

It depends on the difficulty of the scene. For a shot with no visible hands or with simple gestures, 2 or 3 generations are often enough. For a shot with complex gestures in close-up, count 5 to 10 generations, or even more. If after 10 tries the result stays problematic, it is a signal to rework the staging.

Does slow motion always solve the hand problem?

It reduces it significantly but does not eliminate it. Slow motion decreases the gap between consecutive frames, which leaves the model less room for error. But on a long shot even in slow motion, the distortions can accumulate progressively. Combine slow movement + short clip + neutral starting pose to maximize your chances.

Are there specific prompt techniques for hands?

A few formulations that help: "hands relaxed at sides", "fingers gently curled", "hands resting on surface", "minimal hand movement". Avoid prompts that imply a complex gestural motion ("waving", "pointing", "grasping") if you do not need that precise gesture in the narrative. And if you do need hands in motion, be as precise as possible about the type of gesture.

Author

Frank Houbre

AI trainer, AI filmmaker and image & video creator.