AI Video Content Pipeline

Personal Project

Automated video content generation pipeline that transforms a text topic into a complete short-form video — GPT-generated scripts, text-to-speech narration (ElevenLabs/gTTS), AI-generated visuals (DALL-E/Stable Diffusion), and FFmpeg-based video assembly with transitions, subtitles, and background music.

Built an end-to-end automated video production pipeline in Python that takes a single text topic and outputs a finished short-form video (30–90 seconds) suitable for TikTok, YouTube Shorts, or Instagram Reels. The pipeline orchestrates four AI services in sequence: GPT-4 generates a structured video script with timed narration segments, per-scene visual descriptions, and on-screen text callouts; ElevenLabs (with gTTS as a fallback) converts each narration segment to speech audio with configurable voice, speed, and emotion; DALL-E 3 (or Stable Diffusion via diffusers) generates scene-matched visuals from the script's visual descriptions; and FFmpeg (via ffmpeg-python) assembles everything into the final video.
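The four-stage hand-off can be sketched as a small orchestrator that threads a script through TTS, image generation, and assembly. The `Scene` fields and function names below are illustrative assumptions, not the project's actual API; each stage is injected as a callable so the real GPT-4, ElevenLabs/gTTS, DALL-E/Stable Diffusion, and FFmpeg wrappers could be swapped in:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Scene:
    """One scene of the generated script (field names are illustrative)."""
    narration: str        # voiceover text written by the scripting stage
    visual_prompt: str    # description fed to the image-generation stage
    on_screen_text: str = ""
    audio_path: str = ""  # filled in by the TTS stage
    image_path: str = ""  # filled in by the image stage

def run_pipeline(
    topic: str,
    script_fn: Callable[[str], List[Scene]],    # GPT-4 scripting
    tts_fn: Callable[[str], str],               # ElevenLabs / gTTS
    image_fn: Callable[[str], str],             # DALL-E / Stable Diffusion
    assemble_fn: Callable[[List[Scene]], str],  # FFmpeg assembly
) -> str:
    """Run the four stages in sequence and return the output video path."""
    scenes = script_fn(topic)
    for scene in scenes:
        scene.audio_path = tts_fn(scene.narration)
        scene.image_path = image_fn(scene.visual_prompt)
    return assemble_fn(scenes)
```

Injecting the stages as callables also makes the gTTS and Stable Diffusion fallbacks a one-argument swap rather than a code change.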

The video assembly engine uses FFmpeg's complex filter graphs to layer multiple tracks. Background visuals are AI-generated stills animated with a Ken Burns pan/zoom effect via the zoompan and crop filters. Narration audio is synced to scene timestamps. Word-level subtitles are generated automatically: Whisper performs forced alignment of the TTS audio to recover precise word timings, which are then rendered with the drawtext filter using configurable font, size, color, and animation style. Royalty-free background music is mixed at -20 dB under the narration, and scene transitions use crossfades (xfade) with configurable duration and style (fade, dissolve, slideright).
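The filter-graph wiring can be sketched as raw filter_complex strings (ffmpeg-python builds the same graph programmatically). The zoom rate, frame rate, input ordering, and font settings below are assumptions for illustration, not the project's actual parameters:

```python
def build_filtergraph(n_scenes: int, scene_dur: float = 5.0, fade: float = 0.5):
    """Build an FFmpeg filter_complex string: Ken Burns zoompan on each still,
    xfade crossfades between scenes, and music mixed at -20 dB under narration.
    Assumes inputs 0..n-1 are scene images, input n is music, n+1 is narration."""
    parts = []
    frames = int(scene_dur * 25)  # duration of each still at 25 fps
    for i in range(n_scenes):
        # Slow zoom from 1.0x toward 1.2x for a subtle pan/zoom effect
        parts.append(
            f"[{i}:v]zoompan=z='min(zoom+0.0015,1.2)':d={frames}"
            f":s=1080x1920,setsar=1[v{i}]"
        )
    prev = "v0"
    for i in range(1, n_scenes):
        # Each crossfade offset is cumulative scene time minus accumulated fades
        offset = i * scene_dur - i * fade
        parts.append(
            f"[{prev}][v{i}]xfade=transition=fade"
            f":duration={fade}:offset={offset}[x{i}]"
        )
        prev = f"x{i}"
    parts.append(f"[{n_scenes}:a]volume=-20dB[bgm]")  # duck background music
    parts.append(f"[{n_scenes + 1}:a][bgm]amix=inputs=2:duration=first[aout]")
    return ";".join(parts), prev

def subtitle_filters(words) -> str:
    """Render word-level subtitles: one drawtext per (word, start, end) triple
    from forced alignment, each shown only during its own time window."""
    return ",".join(
        f"drawtext=text='{w}':fontsize=64:fontcolor=white"
        f":x=(w-text_w)/2:y=h*0.8:enable='between(t,{s},{e})'"
        for w, s, e in words
    )
```

The cumulative-offset arithmetic is the subtle part of chaining xfade: each crossfade shortens the running total by the fade duration, so later offsets must subtract every fade that came before.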

Designed for batch production: a CSV input with topics and target platforms generates an entire content library overnight. Each video is rendered at platform-specific resolutions (1080×1920 for vertical shorts, 1920×1080 for landscape). The system tracks API costs per video (GPT tokens + ElevenLabs characters + DALL-E generations) and logs them to a JSON cost report. Total cost per 60-second video averages ~$0.50 using DALL-E 3 and ElevenLabs, or ~$0.05 using Stable Diffusion and gTTS.
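The per-video cost accounting can be sketched as a small tally written out as JSON. The unit prices in `PRICES` are assumptions for the sketch, not current vendor rates, and the function name is hypothetical:

```python
import json

# Illustrative unit prices -- assumptions, not current vendor pricing
PRICES = {
    "gpt4_per_1k_tokens": 0.03,
    "elevenlabs_per_char": 0.0003,
    "dalle3_per_image": 0.04,
}

def cost_report(video_id: str, tokens: int, chars: int, images: int,
                path: str = "") -> dict:
    """Tally per-video API spend and optionally write it as a JSON report."""
    costs = {
        "script_gpt4": tokens / 1000 * PRICES["gpt4_per_1k_tokens"],
        "tts_elevenlabs": chars * PRICES["elevenlabs_per_char"],
        "visuals_dalle3": images * PRICES["dalle3_per_image"],
    }
    report = {
        "video_id": video_id,
        "costs_usd": {k: round(v, 4) for k, v in costs.items()},
        "total_usd": round(sum(costs.values()), 4),
    }
    if path:
        with open(path, "w") as f:
            json.dump(report, f, indent=2)
    return report
```

Logging one such record per batch item is what makes the overnight CSV runs auditable: the reports can be summed to compare the DALL-E 3 + ElevenLabs and Stable Diffusion + gTTS configurations.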

Python · GPT-4 · DALL-E · FFmpeg · ElevenLabs · Stable Diffusion · Video