
I Built a YouTube Clip Detector in 2 Hours. Here's Where the Hard Parts Actually Are.

I wanted to understand the AI video clipping space — so I built the minimal version of the core loop. Transcript in, highlights out. Then I noticed where my minimal version broke. That's where the real engineering is.


I decided to understand the AI video clipping space by building something in it. This is what I learned in 2 hours.

The problem space

Long-form video is abundant and cheap to produce. Short-form video is what spreads. The bottleneck is the human editor who has to watch an hour of content to find the 60 seconds worth sharing.

This is the problem companies like OpusClip are solving at scale. The approach I see most often: transcript → AI → highlights. Simple pipeline, hard execution. I wanted to understand the execution.

So I built a minimal version of the core loop — Clip Finder — and deployed it. Here's what I learned about what's easy, what's hard, and where the interesting problems live.

The stack

For the happy path, this is genuinely simple:

  • youtube-transcript — fetches CC transcripts with no API key. Pass a video ID, get back an array of {text, offset, duration} objects.
  • Claude Haiku — fast, cheap, excellent at structured text extraction. The transcript goes in; a JSON array of ranked clips comes out.
  • Next.js API route — glue code. The hardest part was deciding what to put in the prompt.
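In code, the happy path is small. Here is a sketch of the transcript-to-prompt step, assuming the {text, offset, duration} shape above with offsets in milliseconds — the function name is illustrative, not the deployed code:

```typescript
// Segment shape as described for the youtube-transcript package.
// Units are an assumption: offset and duration taken as milliseconds.
interface TranscriptSegment {
  text: string;
  offset: number;
  duration: number;
}

// Flatten segments into a timestamped transcript for the prompt, so the
// model can cite [mm:ss] markers when it proposes clips.
function buildPromptTranscript(segments: TranscriptSegment[]): string {
  return segments
    .map((s) => {
      const totalSec = Math.floor(s.offset / 1000);
      const mm = String(Math.floor(totalSec / 60)).padStart(2, "0");
      const ss = String(totalSec % 60).padStart(2, "0");
      return `[${mm}:${ss}] ${s.text}`;
    })
    .join("\n");
}
```

Giving the model explicit timestamps matters: without them it has nothing to anchor its suggested start and end times to.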

The minimal working version took about an hour. The second hour was spent on the prompt.

The prompt is the product

My first prompt asked Claude to find "the most interesting moments." The results were mediocre. "Interesting" is not a useful concept for an AI without more grounding.

The second version added context: "moments that would make someone stop scrolling." Better, but still vague.

The version that worked introduced a taxonomy. Six clip types: insight, funny, quotable, surprising, emotional, actionable. Once Claude had a vocabulary to think with, the quality jumped. It stopped finding "interesting" moments and started finding shareable ones.

The lesson: AI models don't have bad judgment; they have underspecified judgment. Give them the framework for evaluation and they'll apply it correctly. The framework is the product.
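One way to make the taxonomy concrete in code is to validate the model's JSON against it. This is a hedged sketch — the Clip fields (start, end, reason) are my assumptions about the schema, not the actual one:

```typescript
// The six clip types from the taxonomy.
const CLIP_TYPES = [
  "insight", "funny", "quotable", "surprising", "emotional", "actionable",
] as const;
type ClipType = (typeof CLIP_TYPES)[number];

// Assumed output schema; times in seconds.
interface Clip {
  type: ClipType;
  start: number;
  end: number;
  reason: string;
}

// Parse the model's JSON defensively: drop anything that isn't a
// well-formed clip with a known type and a sane time range.
function parseClips(raw: string): Clip[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return [];
  }
  if (!Array.isArray(parsed)) return [];
  return parsed.filter((c): c is Clip =>
    typeof c === "object" && c !== null &&
    CLIP_TYPES.includes((c as Clip).type) &&
    typeof (c as Clip).start === "number" &&
    typeof (c as Clip).end === "number" &&
    (c as Clip).end > (c as Clip).start
  );
}
```

Constraining the model to a closed vocabulary also makes its output machine-checkable: anything outside the six types is rejected rather than silently passed through.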

Where the real hard parts are

My implementation works for the demo. A production system has to solve problems I haven't touched:

Video vs. transcript. I'm analyzing text. Real clip detection uses the video — facial expressions, pacing, background music, laughter, camera cuts. A clip that reads well in transcript might fall flat on screen; a visual joke doesn't survive transcript extraction at all. Getting AI to reason about both modalities simultaneously is the hard version of this problem.

Clip boundaries. My timestamps are estimated from transcript segments. Production systems need frame-accurate cut points — start on a natural pause, end before the next thought begins. That's a different (harder) problem than finding the moment.
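A transcript-level approximation — well short of frame accuracy — is to snap a suggested start time to the nearest gap between segments, on the theory that a gap in the captions often marks a pause. A sketch of that heuristic, with an assumed 400 ms pause threshold:

```typescript
// Segment shape as above; offset and duration assumed in milliseconds.
interface Segment {
  text: string;
  offset: number;
  duration: number;
}

// Snap a model-suggested start time to the nearest segment boundary that
// follows a pause (an inter-segment gap of at least pauseMs). This is a
// transcript heuristic, not the frame-accurate cutting production needs.
function snapToPause(segments: Segment[], startMs: number, pauseMs = 400): number {
  let best = startMs;
  let bestDist = Infinity;
  for (let i = 1; i < segments.length; i++) {
    const prevEnd = segments[i - 1].offset + segments[i - 1].duration;
    const gap = segments[i].offset - prevEnd;
    if (gap >= pauseMs) {
      const dist = Math.abs(segments[i].offset - startMs);
      if (dist < bestDist) {
        bestDist = dist;
        best = segments[i].offset;
      }
    }
  }
  return best;
}
```

Even this crude version improves on raw model timestamps, but it still knows nothing about breaths, music, or camera cuts — that information only exists in the audio and video tracks.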

Platform optimization. A LinkedIn clip and a TikTok clip from the same video are edited differently — pacing, caption style, aspect ratio, hook structure. I'm returning metadata about which clip fits which platform. Actually producing the edited file for each is where the real engineering is.
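The metadata step could look like the sketch below. The specific numbers are placeholders I made up for illustration — real values would come from each platform's current requirements:

```typescript
// Illustrative per-platform constraints; the figures are placeholder
// assumptions, not each platform's actual specs.
interface PlatformSpec {
  aspectRatio: string;
  maxSeconds: number;
  captionStyle: "burned-in" | "none";
}

const PLATFORMS: Record<string, PlatformSpec> = {
  tiktok:         { aspectRatio: "9:16", maxSeconds: 60,  captionStyle: "burned-in" },
  linkedin:       { aspectRatio: "1:1",  maxSeconds: 180, captionStyle: "burned-in" },
  youtube_shorts: { aspectRatio: "9:16", maxSeconds: 60,  captionStyle: "none" },
};

// Return the platforms a clip's duration fits — the easy metadata step.
// Actually rendering a per-platform edit is the hard part.
function fitPlatforms(durationSeconds: number): string[] {
  return Object.entries(PLATFORMS)
    .filter(([, spec]) => durationSeconds <= spec.maxSeconds)
    .map(([name]) => name);
}
```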

Quality at scale. My prompt was tuned on a handful of videos. A system processing thousands of videos per day, across categories (podcasts, lectures, vlogs, news, gaming), needs evaluation infrastructure: ground truth labels, automated quality metrics, feedback loops. The prompt that works for a tech podcast doesn't generalize to a cooking show.
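One building block for that evaluation infrastructure is an automated overlap metric against ground-truth labels — for example, temporal IoU between a predicted clip and a human-labeled one. A minimal sketch:

```typescript
// A time range in seconds.
interface Range {
  start: number;
  end: number;
}

// Temporal intersection-over-union: 1.0 for a perfect match, 0 for
// disjoint ranges. Useful as a per-clip score against ground truth.
function temporalIoU(a: Range, b: Range): number {
  const inter = Math.max(0, Math.min(a.end, b.end) - Math.max(a.start, b.start));
  const union = (a.end - a.start) + (b.end - b.start) - inter;
  return union > 0 ? inter / union : 0;
}
```

With a metric like this, a prompt change stops being a vibe check and becomes a number you can compare across runs and video categories.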

What the exercise taught me

The core loop — transcript in, clip recommendations out — is almost trivially simple to implement. What's not simple is the evaluation infrastructure, the multimodal reasoning, the platform-specific editing logic, and the quality at scale.

This is true of most AI products. The AI part is a few API calls. The hard part is everything around it: what signals to weight, how to measure whether it's working, how to handle the cases where it doesn't work, how to close the loop on quality.

I built this because I wanted to think in the problem space, not just read about it. The best way to understand a product category is to build the minimal version and notice where the minimal version breaks. The breakages tell you where the actual engineering work is.

What's next

Clip Finder is a proof-of-concept. If I were taking it further, the obvious next step is multimodal input — analyzing the actual video frames alongside the transcript. The new generation of vision models makes this more tractable than it was a year ago.

The more interesting step would be to instrument it: collect feedback on which clips actually get used, and use that signal to improve the prompt and, eventually, a fine-tuned model. That feedback loop is what separates a toy from a product.

Source: github.com/matua-agent/clip-finder