Glenn Kimble Jr

Software Engineer


May 14, 2026 · Glenn · 16 min read

Programmatically Modernizing 20-Year-Old Videos

TLDR — 21 anime music videos from 2002–2008, mostly RealMedia at 320×240. Converted to MP4 with ffmpeg, AI-upscaled with Real-ESRGAN, uploaded to YouTube — where they immediately developed audio dropouts every ~45 seconds that weren't in the source files. Three days and thirteen dead-end theories later, the fix was passing the audio through ffmpeg's atempo filter at factor 1.0. That filter silently rewrites sample-level timing through a process called WSOLA windowing, which apparently broke whatever pattern YouTube's transcoder was reacting to. All 21 now upload clean.

How it started

Back in high school, one of my main hobbies was running a Dragon Ball fan site that featured anime music videos — AMVs — made by me and others in that community. I spent a ridiculous amount of time in the early 2000s editing these things together — learning Adobe Premiere 6, RealProducer, VirtualDub, AVISynth, and other tools — and uploading the results to whatever corners of the internet would have them. (Web hosting was its own challenge back then — finding a host that wouldn't pull your videos for blowing past a monthly transfer cap was a recurring side quest.)

The hobby fizzled out after I graduated. Twenty years went by. The files sat on hard drive after hard drive, mostly forgotten. Then my little brother asked if I'd ever put them online so he could see them again. Easy afternoon project, I figured. After all, I'd done this once before — probably ten years ago I'd uploaded all the videos to YouTube in their original .rm format and they played fine. Looked like garbage at 320×240 stretched across a modern monitor, sure, but it worked.

So my first move was to do the same thing again: grab the original .rm files, upload one, see how it looked. It uploaded. YouTube processed it. I hit play — and the audio was full of stutters. Sharp, obvious skips every ~45 seconds, none of which were in the source file. Whatever YouTube does to ingest video had clearly changed in the intervening years, and not in a way that was friendly to my sloppy early-2000s encoding. So I left it alone for a few months — it wasn't worth the trouble. That changed when some online friends started organizing a 2000's AMV Twitch stream — between that and my brother still waiting to see the videos, it finally felt worth the effort to fix properly.

The whole process ended up taking about three days of long hours, late nights, and real effort — and I came out the other side knowing a little more about codecs, frame rates, and YouTube's transcoding pipeline than I ever expected to learn.

The state of the originals

The collection: 21 AMVs from 2002–2008, mostly in RealMedia (.rm) format at a 320×240 resolution. The rest were a mix of .avi and .mpg with one straggler .mp4.

For anyone who didn't live through the early 2000s internet: 320×240 was standard. RealMedia was a popular format for streaming at the time. RealPlayer was the tool you watched this stuff in. If you wanted to share a 200 MB AMV at original quality, you didn't — you encoded it down to ~10 MB and accepted the quality hit, because nobody was waiting hours to download anything bigger.
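The arithmetic behind that trade-off checks out: at the 350 kbps RealProducer preset mentioned later in the post, a roughly four-minute AMV (my assumed runtime, for illustration) lands almost exactly on 10 MB:

```python
# File-size math for RealProducer's "DSL/Cable Modem (350 Kbps)" preset.
# The ~4-minute runtime is an illustrative assumption, not a measurement.
bitrate_bps = 350_000        # 350 kbps total stream bitrate
runtime_s = 4 * 60           # assume a roughly 4-minute AMV

size_mb = bitrate_bps / 8 * runtime_s / 1_000_000
print(f"{size_mb:.1f} MB")   # 10.5 MB, right in the ~10 MB range
```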

To understand how these .rm files ended up in such rough shape, it helps to trace the whole pipeline. The anime footage I was working from was already low quality before I even started editing — VHS captures at best, sketchy internet clips at worst. Cartoon Network rips, fansubs, whatever I could find.

I'd take all of that and import it into Premiere 6 — sometimes after running it through VirtualDub or AVISynth first, just to coerce it into a format Premiere wouldn't crash on — edit the AMV together, and export the whole project as one uncompressed AVI file. That AVI was the file I'd then pipe into RealProducer Plus 8.5, with settings I absolutely did not understand, optimized for "DSL/Cable Modem (350 Kbps)" delivery.

By 2026 standards, these are unwatchable.

The goal

Three things I wanted to get done:

  1. Convert everything to a modern format (MP4, H.264, AAC).
  2. AI-upscale the low-res ones so they look acceptable on a current monitor.
  3. Upload them all to YouTube without the audio dropouts that wrecked an earlier attempt with the raw .rm files.

The first one was a one-liner. The other two are the rest of this post.

Phase 1: A boring ffmpeg loop

Getting all 21 files into MP4 was a single ffmpeg loop (brew install ffmpeg if you need it). H.264 video, AAC audio, and the faststart flag, which moves the MP4 index to the front of the file so playback can start before the download finishes. No drama — ffmpeg reads .rm, .avi, and .mpg without any special flags.

mkdir -p videos_fixed
for f in "videos_original/"*; do
  name=$(basename "$f")
  base="${name%.*}"
  out="videos_fixed/${base}.mp4"
  ffmpeg -y -nostdin -i "$f" \
    -c:v libx264 -profile:v high -preset medium -crf 20 -pix_fmt yuv420p \
    -c:a aac -profile:a aac_low -b:a 192k \
    -movflags +faststart \
    "$out"
done

ffmpeg's wiki is a decent reference if you want to understand what those flags are doing.

The interesting stuff starts once you decide you want to make these look better than 320×240.

Phase 2: AI upscaling

For anime-style content, Real-ESRGAN with the realesr-animevideov3 model is the go-to. It's trained specifically on anime video and runs on an M1's built-in GPU — no external graphics card required. About 11 frames/second on a 320×240 source at 2× scale, which works out to a clip taking roughly 2.4× its own runtime to process.

The pipeline is conceptually simple:

  1. Extract every frame from the source video as a numbered PNG.
  2. Run Real-ESRGAN on the folder of PNGs.
  3. Reassemble the upscaled PNGs back into a video, with the original audio copied over.

Naive first pass — step 1 is ffmpeg, extracting each frame as a numbered PNG:

input="videos_fixed/your-video.mp4"
mkdir -p frames
ffmpeg -i "$input" frames/%06d.png

Step 2 is Real-ESRGAN. Grab the prebuilt realesrgan-ncnn-vulkan binary from the Real-ESRGAN releases page and unzip it somewhere like ~/.local/share/realesrgan/. One detail worth knowing: the binary looks for its models/ folder relative to wherever it's invoked from, so the cleanest way to run it is to cd into its install directory first.

work="$PWD"
mkdir -p frames_up

( cd ~/.local/share/realesrgan \
  && ./realesrgan-ncnn-vulkan \
       -i "$work/frames" \
       -o "$work/frames_up" \
       -n realesr-animevideov3 \
       -s 2 -f png )

Step 3 is ffmpeg again, reassembling the upscaled PNGs into a video and pulling audio straight off the original file. NTSC video is 29.97 fps, so I figured that was a safe default:

ffmpeg -framerate 29.97 -i frames_up/%06d.png \
       -i "$input" -map 0:v:0 -map 1:a:0 \
       -c:v libx264 -profile:v high -preset medium -crf 18 -pix_fmt yuv420p \
       -c:a copy \
       -movflags +faststart \
       output.mp4

I was wrong.

Frame rate trauma

The first time I ran this end-to-end, the audio and video looked fine for the first few seconds and then progressively drifted out of sync. By the end of a one-minute clip, the music was about five seconds ahead of the visuals.

The culprit: variable framerate (VFR) sources. The frame rate ffmpeg reports is just an average — the source might say 29.97 fps but actually hold individual frames for different durations across the timeline. When you decompose video into a sequence of still PNGs, that timing information evaporates. The PNGs are just numbered files; there's nothing in them that says "this one should be held for 33 ms but that one for 50 ms."

If you reassemble at a constant 29.97 fps, the video timeline ends up shorter than the audio timeline by exactly however many frames the source was "holding longer" than the reported rate. Hence the drift.
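A toy model makes the drift concrete (the numbers here are illustrative, not measured from my files): suppose a source reports 29.97 fps but actually holds every tenth frame for 50 ms. Reassembling its extracted frames at a constant 29.97 fps shortens the video timeline by the accumulated extra hold time:

```python
from fractions import Fraction

# Toy VFR source: 1800 frames nominally at 29.97 fps, but every tenth
# frame is actually held for 50 ms. Purely illustrative numbers.
nominal_fps = Fraction(30000, 1001)
nominal_ms = 1000 / float(nominal_fps)     # ~33.37 ms per frame

frame_ms = [50.0 if i % 10 == 0 else nominal_ms for i in range(1800)]
true_duration_s = sum(frame_ms) / 1000     # what the audio lines up with

# Reassembling the numbered PNGs at a constant 29.97 fps ignores the
# longer holds, so the rebuilt video timeline comes out shorter:
cfr_duration_s = 1800 / float(nominal_fps)

drift_s = true_duration_s - cfr_duration_s
print(f"true {true_duration_s:.2f}s vs CFR rebuild {cfr_duration_s:.2f}s "
      f"-> video ends {drift_s:.2f}s early")
```

Even this mild toy case loses almost three seconds over a one-minute stretch, which is exactly the scale of drift I was seeing.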

The fix is to force constant framerate (CFR) at both extraction and reassembly, and to use the source's actual frame rate rather than guessing at 29.97. ffprobe pulls the rate out of the source, then both extract and reassemble use it explicitly:

# Probe the source's frame rate (e.g. "30000/1001" for standard NTSC)
rate=$(ffprobe -v error -select_streams v:0 \
  -show_entries stream=r_frame_rate -of csv=p=0 "$input")

# Extract with that rate, forcing CFR
mkdir -p frames
ffmpeg -i "$input" -r "$rate" -fps_mode cfr frames/%06d.png

# Reassemble at the same rate, also forcing CFR
ffmpeg -framerate "$rate" -i frames_up/%06d.png \
       -i "$input" -map 0:v:0 -map 1:a:0 \
       -c:v libx264 -profile:v high -preset medium -crf 18 -pix_fmt yuv420p \
       -fps_mode cfr -c:a copy \
       -movflags +faststart \
       output.mp4

ffmpeg duplicates frames where the source was holding one longer, padding the timeline out to the true duration. Reassemble at the same rate and everything stays in sync.

Probing the collection turned up an entertaining variety of frame rates: standard 29.97, standard 23.976, exactly 24, exactly 30, a non-NTSC 982057/32768 (≈29.9700, a hair off true NTSC) inherited from RealProducer's internal sample-counting math, and one outlier at 117/4 = exactly 29.25 fps that I can only assume came from some Premiere export setting I no longer remember.
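Those rational rates can be inspected exactly with Python's fractions module, which handles the same numerator/denominator strings that ffprobe's r_frame_rate field reports:

```python
from fractions import Fraction

# Frame-rate strings as ffprobe's r_frame_rate field reports them
rates = ["30000/1001", "24000/1001", "24/1", "30/1",
         "982057/32768",   # RealProducer's sample-counting oddity
         "117/4"]          # the 29.25 fps outlier

for r in rates:
    print(f"{r:>14} = {float(Fraction(r)):.4f} fps")

# The RealProducer rate is within a hair of NTSC but not equal to it:
print(Fraction("982057/32768") == Fraction("30000/1001"))  # False
```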

The 4× mistake

One source was 240×180 — the smallest in the collection. At 2× it became 480×360, which is smaller than every other output. I wanted parity, so I tried 4× expecting 960×720.

The 4× output looked worse than the 2×. The model, given more freedom, hallucinated more aggressively. Edges got sharper but also more synthetic. Fine textures developed an over-processed look.

What actually worked was non-AI: take the 2× upscale, then plain bicubic resize from 480×360 to 640×480. No distortion, no over-processing, just bigger pixels of the AI-upscaled image.

Lesson: AI upscalers aren't "more is better." For heavily compressed, low-bitrate source material, a higher scale factor can introduce artifacts that look worse than a plain resize of a lower-scale output.

Phase 3: Preparing for YouTube

With the MP4s in hand, I had reason to think the YouTube dropout problem from the failed .rm upload was already on its way out. RealMedia's quirky variable-timestamp audio packets had been my prime suspect, and the re-encoded MP4s are clean CFR with well-behaved audio, re-encoded to 48 kHz stereo AAC. A re-upload of those should — by every reasonable theory — sail through YouTube's pipeline without a problem. So I uploaded one.

The audio still had dropouts.

At 0:44, 1:29, 2:14, and 2:58. Roughly every 45 seconds. About 90 ms of true digital silence each — not just music dipping, actual silence dropping the level to -100 dB. Locally the file played perfectly. Only YouTube's transcoded version had the gaps — confirmed by downloading YouTube's processed version with yt-dlp and comparing it directly against the local file.

So began the rabbit hole.

Phase 4: Why does YouTube hate my audio?

I was working through this with Claude Code (Anthropic's Opus 4.7) the whole way. It was hugely useful for brainstorming diagnostics and chewing through ffprobe output, but I had to push back hard at multiple points. Claude kept circling back to two non-answers — either "accept that YouTube does this, there's nothing you can do" or "it's probably Content ID, give up." Both were easy to disqualify with the same argument: if either were true at this level, it'd be a massive, well-documented problem in the creator community. It isn't. We ended up back at those same dead-ends more than once before finally getting past them.

One thing that did help shape the investigation: early on we noticed the gaps landed at consistent ~44.69-second intervals, machine-precise across re-uploads of the same content. That kind of regularity strongly implies a chunk boundary somewhere in YouTube's pipeline. Google even has a patent (US9338467B1) describing a parallel transcoding pipeline that splits audio and video into chunks for distributed processing. So most of the theories that follow are really asking the same question from different angles: what about my files is making YouTube's chunk boundaries misbehave?
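The chunk-boundary reading is easy to sanity-check: if YouTube splits audio into ~44.69-second chunks, the observed gap positions should sit near integer multiples of that interval (the 44.69 s figure is from my measurements; mapping it to chunk boundaries is the hypothesis):

```python
# Observed dropout positions from the test upload (0:44, 1:29, 2:14, 2:58)
# vs. boundaries predicted at integer multiples of the ~44.69 s interval.
observed_s = [44, 89, 134, 178]
chunk_s = 44.69

for k, obs in enumerate(observed_s, start=1):
    predicted = k * chunk_s
    print(f"boundary {k}: predicted {predicted:6.2f}s, observed ~{obs}s")
```

Every observed gap lands within a second of a predicted boundary, which is what pushed the investigation toward the transcoder's chunking rather than the content itself.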

The investigation moved through thirteen distinct theories before getting anywhere useful:

  1. A/V duration mismatch — re-encoded with matched durations, didn't help.
  2. Non-standard frame rate — re-encoded at exactly 30000/1001, didn't help.
  3. Irregular keyframes — regular GOPs at 1 s and 2 s intervals, didn't help.
  4. AAC ingestion bug, allegedly fixed by uploading lossless audio (old internet wisdom) — FLAC source, same gaps.
  5. Container format — MKV, same gaps.
  6. Resolution / aspect handling — 720p, same gaps, plus content got squished.
  7. Fixed-timer chunker in YouTube's pipeline — adding 2 s of black pre-roll shifted the gaps by exactly 2 s along with the content, so they're content-aligned, not output-time-aligned.
  8. Visual scene detection — three of the four moments had scene cuts, but the fourth was the same character close-up before and after.
  9. Sample-level audio splices from old editing software — peak-to-peak jumps at those moments were within normal music range; no detectable splice clicks.
  10. Spectral / loudness / dynamic-range anomalies at the source — RMS, peak, crest factor, stereo correlation, DC offset all statistically indistinguishable from random reference points elsewhere.
  11. My "silence" measurements were actually natural quiet moments — at 5 ms RMS resolution the source had continuous music between -15 and -22 dB while the YouTube output dropped to -100 dB.
  12. YouTube Content ID — the same song uploaded by the artist's official channel has none of these gaps.
  13. AAC priming silence at parallel-chunk boundaries — would affect every upload, doesn't.

A handful of those theories — particularly the ones Claude kept defaulting to (Content ID, "just accept it") — got disqualified by the same reasoning mentioned earlier. Mass-uploaded RealMedia, 4:3 vintage TV captures, low-bitrate sources — these are all common on YouTube. The artifact would be widely reported. It isn't. So whatever was triggering it had to be specific to these files.

The rest got ruled out by testing — re-encode with that one variable changed, re-upload, and check whether the gaps moved or disappeared.

A second-file test (different song, different anime, different length) showed the artifact landed in nearly the same absolute output positions — within 50 ms. That ruled out specific content as a trigger — neither the specific visuals nor audio mattered — and pointed at some shared property of how these files are encoded.

The breakthrough

I produced four test files from the same source, each changing exactly one variable:

Test  Variable changed                             Result
1     Frame rate forced to standard NTSC           still glitched
2     Audio bitrate raised to 384 kbps             still glitched
3a    Audio time-stretched by 0.5% (atempo=1.005)  clean
3b    Audio re-encoded AAC → Opus → AAC            still glitched
3a vs 3b is the diagnostic. Both re-encode the audio waveform. The difference: 3a changes sample-level timing; 3b changes frequency-domain content but preserves timing. 3a is clean, 3b isn't. So the trigger is timing-related, not frequency-related.

Three follow-ups narrowed it further:

  • atempo=1.0 (the filter, but at no speed change) → clean
  • anull (true no-op pass-through) → still glitched
  • aresample=48000 (resampler at the same rate) → still glitched

anull and aresample don't do WSOLA windowing. atempo does. WSOLA — Waveform Similarity Overlap-Add — breaks audio into overlapping windows, finds similar waveform sections across them, and overlap-adds with windowing functions. In plain English: "it chops the audio into overlapping chunks, finds the parts that look most alike where the chunks meet, and stitches them back together with a soft crossfade." Even at factor 1.0 with no actual speed change, the windowing modifies sample-level details just slightly in a way that's perceptually identical but breaks whatever YouTube's transcoder was reacting to.
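The overlap-add half of that description relies on a standard property: windows chosen so that overlapping copies sum to a constant. Here's a minimal demonstration with a periodic Hann window at 50% overlap. This illustrates only the overlap-add principle; ffmpeg's atempo layers the waveform-similarity search (and its own windowing choices) on top, which is where the sample-level changes come from:

```python
import math

N = 512                 # window length (samples)
hop = N // 2            # 50% overlap

# Periodic Hann window: w[n] = 0.5 * (1 - cos(2*pi*n/N))
w = [0.5 * (1 - math.cos(2 * math.pi * n / N)) for n in range(N)]

# Lay overlapping copies of the window along a stretch of timeline
# and sum their contributions at every sample position.
length = 4 * N
coverage = [0.0] * length
for start in range(0, length - N + 1, hop):
    for n in range(N):
        coverage[start + n] += w[n]

# Away from the edges, the overlapping windows sum to 1.0: overlap-adding
# identical chunks reconstructs the signal. It's the similarity-search
# offsets (and flush behavior) that perturb samples in the real filter.
interior = coverage[N : length - N]
print(min(interior), max(interior))   # both ~1.0
```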

I still don't know what YouTube is actually reacting to in the source audio. The best guess is the chunked parallel-transcoding pipeline from earlier, with imperfect handling of certain sample-level patterns at chunk boundaries. The fix works. The reason it works stays a mystery.

The drift problem

There was one catch: atempo=1.0 shortens the audio of long files by ~430 ms, due to a precision artifact in ffmpeg's WSOLA flush behavior. That's enough for audio and video to drift visibly out of sync by the end of a song — a huge problem.

The fix is to overcompensate. Instead of atempo=1.0, use atempo = D / (D + L), where D is the original audio duration and L is the amount of loss for that file. For my original test file, that's 204.776 / 205.206 = 0.99790.
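Plugging the numbers above into that formula, as a quick sanity check:

```python
# D = original audio duration, L = duration lost by atempo=1.0's flush.
# Numbers are from the original test file described above.
D = 204.776          # seconds of source audio
L = 0.430            # seconds lost (the ~430 ms measured earlier)

factor = D / (D + L)
print(f"atempo={factor:.5f}")          # atempo=0.99790

# Sanity check: atempo=f stretches the audio to ~D/f seconds, and the
# WSOLA flush then drops L, landing back on the original duration.
print(f"{D / factor - L:.3f} s")       # 204.776
```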

But L isn't constant. On short files (~25 s) it's only 22 ms. On one file atempo=1.0 actually added 9 ms instead of subtracting. So the fix became: measure each file individually — a two-pass ffmpeg run that measures the loss for each specific file, then applies the compensation in pass two:

SRC="INPUT.mp4"
OUT="INPUT [yt-fix].mp4"

# Pass 1: measure atempo's actual effect on this specific file
AUDIO_DUR=$(ffprobe -v error -select_streams a:0 \
  -show_entries stream=duration -of default=nokey=1:noprint_wrappers=1 "$SRC")
ffmpeg -y -i "$SRC" -vn -af "atempo=1.0" \
  -c:a aac -profile:a aac_low -b:a 192k -ar 48000 -ac 2 \
  /tmp/measure.m4a 2>/dev/null
LOSS_DUR=$(ffprobe -v error -select_streams a:0 \
  -show_entries stream=duration -of default=nokey=1:noprint_wrappers=1 /tmp/measure.m4a)
LOSS=$(awk -v a="$AUDIO_DUR" -v b="$LOSS_DUR" 'BEGIN{printf "%.6f", a - b}')
FACTOR=$(awk -v d="$AUDIO_DUR" -v l="$LOSS" 'BEGIN{printf "%.6f", d / (d + l)}')

# Pass 2: apply atempo with the computed compensation factor
ffmpeg -y -i "$SRC" \
  -c:v copy \
  -af "atempo=$FACTOR" \
  -c:a aac -profile:a aac_low -b:a 192k -ar 48000 -ac 2 \
  -movflags +faststart \
  "$OUT"

Run that against all 21 files and the A/V residuals (the leftover difference between audio and video durations after compensation) come in at ±7 ms — well below the threshold where a viewer would notice.

All 21 upload to YouTube clean.

What I ended up with

  • 21 videos, about 50 minutes of content total
  • Final resolutions ranging from 728×410 to 1366×768, all 16:9
  • ~2 hours of pure GPU time on an M1 for the upscale work
  • Around 1 GB of output

Takeaways

If you find yourself doing this kind of work:

  1. Probe everything per file, and force CFR when decomposing video. Frame rate, sample rate, channel count, aspect ratio — none of it is safe to assume. Image sequences carry no timing information of their own, so whatever rate you reassemble at becomes the truth. ffmpeg's defaults won't save you, and even sources that look uniform can carry weirdness baked in by the original encoder (RealProducer's 982057/32768, for example).
  2. AI upscaling isn't a knob to crank. Use the lowest scale that gives acceptable quality and compose with plain bicubic resize where you just need bigger pixels.
  3. Verify the symptom is gone, not just the cause you assumed. I thought clean CFR MP4s would fix the YouTube dropouts because that's what the obvious cause looked like. It wasn't. The fix has to be checked against the actual symptom, not against your theory of where it came from.

The videos are up on YouTube now. They look better than I had any right to expect.

If you want to see the results — or just dip into some early-2000s AMV nostalgia — check out my AMV YouTube channel.
