Clipzy processes videos through a multi-stage pipeline that extracts scene, audio, speech, and visual data. Understanding the processing configuration helps you set appropriate limits and scale to meet your throughput requirements.
Concurrency
MAX_CONCURRENT_JOBS controls how many jobs a single worker processes simultaneously.
Setting this too high on an underpowered machine causes all jobs to compete for CPU or GPU, increasing the total time for each individual job. A good starting point is:
- CPU deployment: set MAX_CONCURRENT_JOBS to the number of physical cores divided by 2
- GPU deployment: set MAX_CONCURRENT_JOBS to 1–2 per GPU, depending on VRAM
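For the CPU case, the starting point can be derived programmatically. A minimal sketch, assuming `os.cpu_count()` as a stand-in for your machine's core count (it reports logical cores, so halving it approximates physical cores on hyper-threaded hardware):

```python
import os

# Logical core count; halving approximates physical cores on
# hyper-threaded machines. Fall back to 1 if the count is unknown.
logical_cores = os.cpu_count() or 1
cpu_jobs = max(1, logical_cores // 2)

print(f"MAX_CONCURRENT_JOBS={cpu_jobs}")
```

Treat this as a baseline and adjust after observing real per-job CPU usage.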
Scaling with multiple workers
Running multiple worker processes is the primary way to increase throughput. Each worker picks up jobs independently from the Redis queue. You can run workers on the same machine or across multiple machines, as long as all workers share the same Redis instance and storage backend.
# Start two workers on the same machine (in separate terminals)
python run_worker.py --worker-id worker-1 &
python run_worker.py --worker-id worker-2 &
With two workers, each handling up to MAX_CONCURRENT_JOBS jobs at once, your effective throughput doubles.
Job timeout
PROCESSING_TIMEOUT_SECONDS sets the maximum wall-clock time a job may run before Clipzy marks it as failed.
PROCESSING_TIMEOUT_SECONDS=3600
If a job exceeds this limit, its status becomes failed with error code JOB_TIMEOUT. Jobs do not automatically retry on timeout. You must re-submit them manually.
Setting PROCESSING_TIMEOUT_SECONDS too low will cause legitimate long videos to fail. A 60-minute video processed on CPU may take 20–30 minutes. Ensure your timeout accommodates your longest expected input.
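One way to size the timeout is to work backward from the longest input you expect. The sketch below uses the worst-case CPU figure quoted above (a 60-minute video taking up to roughly 30 minutes, i.e. about 30 seconds of processing per video-minute) plus a 2x safety factor; the constants are assumptions to replace with your own measurements:

```python
MAX_VIDEO_MINUTES = 60        # longest input you expect (assumption)
CPU_SECONDS_PER_MINUTE = 30   # worst-case CPU rate from the estimates above
HEADROOM = 2                  # safety factor for load spikes (assumption)

timeout = MAX_VIDEO_MINUTES * CPU_SECONDS_PER_MINUTE * HEADROOM
print(f"PROCESSING_TIMEOUT_SECONDS={timeout}")  # PROCESSING_TIMEOUT_SECONDS=3600
```

With these numbers the result matches the 3600-second default shown above.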
Processing stages
Clipzy runs these stages sequentially for each video. The table shows approximate durations for a 2-minute video on CPU and GPU hardware:
| Stage | CPU | GPU |
|---|---|---|
| Video ingestion | 2–3s | 2–3s |
| Shot detection | 5–8s | 5–8s |
| Motion analysis | 10–15s | 2–4s |
| Audio analysis | 3–5s | 3–5s |
| Speech recognition | 30–60s | 5–10s |
| Visual embeddings | 20–30s | 5–10s |
| Color analysis | 2–3s | 2–3s |
| Total | 72–124s | 24–43s |
Speech recognition and visual embeddings account for the majority of processing time on CPU. A GPU deployment reduces both stages dramatically and is recommended for any production workload with consistent video volume.
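Because the stages run sequentially, the end-to-end range is the sum of the per-stage ranges. A quick check of the figures above:

```python
# (low, high) durations in seconds for each stage, taken from the table above
cpu = [(2, 3), (5, 8), (10, 15), (3, 5), (30, 60), (20, 30), (2, 3)]
gpu = [(2, 3), (5, 8), (2, 4), (3, 5), (5, 10), (5, 10), (2, 3)]

cpu_low, cpu_high = sum(lo for lo, _ in cpu), sum(hi for _, hi in cpu)
gpu_low, gpu_high = sum(lo for lo, _ in gpu), sum(hi for _, hi in gpu)

print(f"CPU total: {cpu_low}-{cpu_high}s")  # CPU total: 72-124s
print(f"GPU total: {gpu_low}-{gpu_high}s")  # GPU total: 24-43s
```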
Expected processing times by video length
The table below shows rough estimates for end-to-end job duration (not including queue wait time):
| Video length | CPU | GPU |
|---|---|---|
| 1 minute | 35–65s | 12–18s |
| 2 minutes | 72–124s | 24–43s |
| 5 minutes | 3–5 min | 60–90s |
| 15 minutes | 9–15 min | 3–5 min |
| 60 minutes | 35–60 min | 12–20 min |
These are estimates for typical talking-head or interview-style content. Videos with dense motion, multiple speakers, or complex audio may take longer.
Checking queue depth
To understand how many jobs are waiting, use GET /api/v1/jobs with page_size=1 and read the total field. Compare that count against the number of jobs currently in processing status to gauge queue depth.
curl "http://localhost:8000/api/v1/jobs?page=1&page_size=1"
{
  "jobs": [...],
  "total": 14,
  "page": 1,
  "page_size": 1
}
If the total count is consistently growing, you need more workers or a higher MAX_CONCURRENT_JOBS value.
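For scripting this check, the response can be parsed in Python. The snippet below stubs the API response with the shape shown above so it runs standalone; against a live deployment you would fetch it instead (e.g. with `urllib.request`), and the processing count here is a stub — how you obtain it depends on what filters your deployment's /api/v1/jobs endpoint supports:

```python
import json

# Stubbed response matching the shape shown above; with a live server,
# replace with a real fetch of /api/v1/jobs?page=1&page_size=1
response = json.loads('{"jobs": [], "total": 14, "page": 1, "page_size": 1}')

total = response["total"]
processing = 4  # stub: jobs currently in "processing" status

print(f"total: {total}, processing: {processing}, backlog: {total - processing}")
```

Tracking the backlog figure over time (rather than a single reading) is what tells you whether workers are keeping up.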