This AI Generates Tom and Jerry Cartoon Episodes!

TTT-MLP, a new AI model, can now create one-minute Tom and Jerry-style cartoon videos from simple text prompts.

Developed by researchers from NVIDIA, Stanford University, University of California San Diego, UC Berkeley, and UT Austin, this technology transforms written storyboards into dynamic animations in the style of classic Tom and Jerry episodes.

What Is TTT-MLP?

The new AI model, called TTT-MLP (Test-Time Training-Multilayer Perceptron), enhances pre-trained transformers with specialized TTT layers.

These layers enable the AI to process complex narratives and generate coherent, multi-scene animations that capture the spirit of classic Tom and Jerry cartoons.

Features

  • Creates one-minute cartoon videos from text storyboards.
  • Maintains temporal consistency across scenes.
  • Delivers smooth motion and appealing aesthetics.
  • Outperforms other AI video generation systems in human evaluations.

How Does It Work?

TTT-MLP works by incorporating TTT layers into pre-trained transformer models.

These layers allow the model’s hidden states to function as neural networks, providing more expressive capabilities and longer-term memory.

This enhanced memory is crucial for generating videos with complex narratives that remain coherent throughout the entire minute-long animation.
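
To make this more concrete, here is a minimal PyTorch sketch of a test-time-training layer. It is not the paper's implementation: the class and projection names, chunk size, and the single inner gradient step per chunk are illustrative assumptions, and the inner updates are detached rather than differentiated through during outer training.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TTTLayerSketch(nn.Module):
    """Illustrative TTT layer: the hidden state is a tiny MLP (w1, w2)
    whose weights are updated by a gradient step on each chunk of tokens."""

    def __init__(self, dim: int, hidden: int = 256, lr: float = 0.1):
        super().__init__()
        # Learned projections defining the self-supervised inner task (assumed names).
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.to_query = nn.Linear(dim, dim)
        # Initial inner-MLP weights; fresh copies are adapted at test time.
        self.w1_init = nn.Parameter(torch.randn(dim, hidden) * 0.02)
        self.w2_init = nn.Parameter(torch.randn(hidden, dim) * 0.02)
        self.lr = lr

    def forward(self, x: torch.Tensor, chunk: int = 64) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        w1 = self.w1_init.detach().clone()
        w2 = self.w2_init.detach().clone()
        outputs = []
        for start in range(0, x.shape[1], chunk):
            xc = x[:, start:start + chunk]
            k, v, q = self.to_key(xc), self.to_value(xc), self.to_query(xc)

            # Inner "training": one gradient step on a reconstruction loss,
            # so the MLP memorizes information from this chunk.
            w1.requires_grad_(True)
            w2.requires_grad_(True)
            pred = F.gelu(k @ w1) @ w2
            loss = F.mse_loss(pred, v)
            g1, g2 = torch.autograd.grad(loss, (w1, w2))
            w1 = (w1 - self.lr * g1).detach()
            w2 = (w2 - self.lr * g2).detach()

            # Inner "inference": read from the updated MLP with the queries.
            outputs.append(F.gelu(q @ w1) @ w2)
        return torch.cat(outputs, dim=1)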

The researchers tested their model using a specially created dataset based on Tom and Jerry cartoons. In human evaluations using the Elo rating system, TTT-MLP outperformed other strong baseline models like Mamba 2 and Gated DeltaNet by 34 Elo points in areas including temporal consistency, motion smoothness, and overall visual appeal.
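
For a rough sense of what a 34-point gap means (my own back-of-the-envelope calculation, not a figure from the paper), the standard Elo expected-score formula puts the preferred model at roughly a 55 percent win rate in pairwise comparisons:

def elo_expected_score(rating_gap: float) -> float:
    # Standard Elo formula for the expected score of the higher-rated side.
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))

print(f"{elo_expected_score(34):.1%}")  # about 54.9%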

Examples

One example video shows Tom walking into an office, taking an elevator, and sitting at his desk in a New York City office building.

The situation quickly escalates into classic Tom and Jerry chaos when Jerry cuts a wire, triggering their famous cat-and-mouse chase—but with a modern office setting twist.

Another episode, “Around the World,” places the characters in locations that never appeared in the original Tom and Jerry cartoons.

How to Use the TTT-Video Model

Here’s a simple guide to get started with the TTT-Video model for generating Tom and Jerry style animations:

Step 1: Set Up Your Environment

Using Conda:

conda env create -f environment.yaml
conda activate ttt-video

Step 2: Install the TTT-MLP Kernel

git submodule update --init --recursive
(cd ttt-tk && python setup.py install)

Note: You need the CUDA toolkit (12.3 or newer) and GCC 11 or newer to build the kernel.
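
Before building, it can help to confirm the toolchain. This small check is my own suggestion (not from the repo) and assumes PyTorch is already installed in the environment:

import subprocess
import torch

# The kernel build needs CUDA toolkit 12.3+ and GCC 11+ (see the note above).
print("PyTorch built against CUDA:", torch.version.cuda)
print("GPU visible to PyTorch:", torch.cuda.is_available())
subprocess.run(["nvcc", "--version"], check=False)  # system CUDA toolkit version
subprocess.run(["gcc", "--version"], check=False)   # host compiler version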

Step 3: Download Required Models

  1. Follow instructions to download the VAE and T5 encoder from the CogVideoX documentation.
  2. Download the pretrained weights from HuggingFace (a scripted download example follows this list):
    • diffusion_pytorch_model-00001-of-00002.safetensors
    • diffusion_pytorch_model-00002-of-00002.safetensors
    • Important: Only use the 5B weights, not the 2B weights.
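
If you prefer scripting the download instead of grabbing the files by hand, something like the following works with the huggingface_hub library. The repo ID and file paths shown are assumptions (the shard names match the CogVideoX-5B transformer weights); use the repository linked in the project documentation.

from huggingface_hub import hf_hub_download

REPO_ID = "THUDM/CogVideoX-5b"  # assumed repo ID; substitute the one from the docs

for shard in (
    "transformer/diffusion_pytorch_model-00001-of-00002.safetensors",
    "transformer/diffusion_pytorch_model-00002-of-00002.safetensors",
):
    path = hf_hub_download(repo_id=REPO_ID, filename=shard, local_dir="checkpoints")
    print("downloaded:", path)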

Step 4: Get Text Annotations

Download the text annotations for Tom and Jerry segments from the link provided in the documentation.

Step 5: Learn About Data Preparation

Review the dataset documentation to understand how to prepare your text prompts.

Step 6: Generate Videos

  1. Follow the sampling documentation to create your videos.
  2. Start with simple text storyboards that describe the scenes (see the example after this list).
  3. The model will generate videos up to 63 seconds long.
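
To get a feel for storyboard prompts, you can start from something like the sketch below. The scene-by-scene format and file path are only an illustration of the idea, not the exact annotation schema the repo expects; check the dataset and sampling documentation for the real format.

import os

# Illustrative storyboard text (my own example, loosely based on the office episode above).
storyboard = """\
Scene 1: Tom dozes at a desk in a busy New York office; Jerry peeks out of a drawer.
Scene 2: Jerry cuts a cable under the desk; Tom bolts upright and spots him.
Scene 3: A chase weaves between cubicles, knocking over a cart of coffee cups.
Scene 4: Jerry slips into a mouse hole by the copier; Tom slams into the wall.
"""

os.makedirs("prompts", exist_ok=True)
with open("prompts/office_chase.txt", "w") as f:
    f.write(storyboard)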

Step 7: Experiment with Different Prompts

Try various scenarios featuring Tom and Jerry in different settings, from classic home environments to modern office spaces.

Remember that while the model can create impressive animations, results may contain some visual artifacts due to current limitations.

Limitations

Despite impressive results, the researchers acknowledge several limitations:

  • The generated videos still contain visual artifacts.
  • These issues likely stem from limitations in the pre-trained model.
  • Creating longer videos with more complex stories would require significantly larger hidden states.
  • The current implementation faces performance challenges due to register spills and suboptimal instruction ordering.

Conclusion

For animation studios and content creators, this technology offers unprecedented opportunities to quickly prototype ideas, generate storyboards, or even produce complete short-form content.

For fans of classic cartoons, it means the potential for endless new adventures featuring their favorite characters, faithfully recreated with the slapstick humor and visual style that made them icons of animation.

While challenges remain, the rapid pace of development in this field suggests that even more impressive capabilities are just around the corner.

The cat-and-mouse chase between technological limitations and innovative solutions continues—much like the eternal pursuit between Tom and Jerry themselves.
