# DGX Spark LLM Leaderboard
Benchmark results and harness code for running local LLMs on an NVIDIA DGX Spark across Ollama, llama.cpp, and vLLM.
This repository now includes:
- a Phase 1 single-node Ollama benchmark pass across 10 configured models
- a llama.cpp comparison pass for the same model set
- a canonical vLLM benchmark pass for qwen3.5:27b
- a model-centric Speed View that shows the fastest single-request runtime per model plus per-runtime tok/sec columns
- a separate Concurrency View for multi-user throughput probes
## Current Leaderboard

The latest rendered leaderboard lives at `leaderboard/leaderboard.md`.
The main leaderboard remains the reproducible single-request baseline.
The Speed View is model-centric: it renders one row per model, shows the fastest single-request runtime for that model, and keeps the Ollama / vLLM / llama.cpp tok/sec columns side by side.
The Concurrency View is separate on purpose: it renders summaries from `results/concurrency/*.summary.json` so vLLM and Ollama can show multi-user throughput without polluting the single-request leaderboard summary.
The rendered markdown file is the source of truth for current rankings. This README intentionally avoids duplicating a manual ranking table because it drifts as soon as new summaries are rendered.
## Models Included

Configured in `configs/models.yaml`:

- qwen3.5:27b
- qwen3.5:35b
- qwen3.6:35b-a3b
- qwen3.5:122b
- nemotron-3-super:120b
- gemma4:31b
- gpt-oss:20b
- gpt-oss:120b
- mistral-small-4:119b
- minimax-m2.7:229b
Each model entry can now carry runtime-specific llama_cpp metadata:
- `gguf_path`
- `chat_template`
- `n_gpu_layers`
- `enabled`
- `notes`
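The exact schema is not shown in this README; a hypothetical entry illustrating those keys might look like the following (paths and values are placeholders, not the committed config):

```yaml
# Illustrative only — see configs/models.yaml for the real entries.
- name: qwen3.5:27b
  llama_cpp:
    gguf_path: /models/gguf/qwen3.5-27b.gguf   # assumed local path
    chat_template: chatml                       # assumed value
    n_gpu_layers: all
    enabled: true
    notes: "baseline quant"
```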
## Benchmark Protocol
Per model and per runtime:
- 2 warmup runs
- 12 benchmark tests
- 3 measured repeats per test
- 36 measured runs per model/runtime pair
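The loop structure implied by these numbers can be sketched as follows, using hypothetical `run_once()` and `record()` callables (the real logic lives in `scripts/run_model.py`):

```python
# Sketch of the per-model benchmark loop: warmups are discarded,
# then each suite_v1 test is measured REPEATS times.

WARMUPS = 2   # warmup runs, not recorded
TESTS = 12    # prompts in suite_v1
REPEATS = 3   # measured repeats per test

def run_benchmark(run_once, record):
    for _ in range(WARMUPS):
        run_once("warmup")                  # result discarded
    measured = 0
    for test_index in range(1, TESTS + 1):
        for repeat in range(REPEATS):
            result = run_once(f"t{test_index:02d}")
            record(test_index, repeat, result)
            measured += 1
    return measured                         # 12 * 3 = 36 measured runs
```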
Current runtime coverage:
- Ollama runtime on a single DGX Spark
- llama.cpp runtime on a single DGX Spark
- vLLM runtime on a single DGX Spark (currently canonical for qwen3.5:27b)
Fixed Ollama settings:

- Ollama 0.20.2
- Concurrency 1
- Temperature 0.2, top_p 0.95
- Context window 8192
- Streaming enabled
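The request body the harness sends is not reproduced here, but a minimal Ollama `/api/generate` payload pinned to these settings would look roughly like this (assuming the standard Ollama REST API option names):

```python
import json

def ollama_payload(model: str, prompt: str) -> dict:
    """Build a streaming /api/generate request body with the
    benchmark's fixed temperature, top_p, and context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": True,
        "options": {
            "temperature": 0.2,
            "top_p": 0.95,
            "num_ctx": 8192,   # context window
        },
    }

body = json.dumps(ollama_payload("qwen3.5:27b", "Hello"))
```

Concurrency 1 is enforced client-side by issuing requests sequentially, not through a payload field.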
Fixed llama.cpp settings:

- `llama-cli`
- `ctx_size = 8192`
- `n_gpu_layers = all`
- `temperature = 0.2`
- `top_p = 0.95`
- `--single-turn --simple-io --jinja --log-disable --perf`
Large llama.cpp GGUFs can opt into a persistent `llama-server` path via per-model `llama_cpp.execution_mode: server`, which keeps the model loaded across warmups and measured prompts instead of paying a full `llama-cli` startup per request.
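A simplified sketch of how the fixed-settings invocation could be assembled is shown below. The flag spellings and the `999` stand-in for "all GPU layers" are assumptions inferred from the settings above; the real builder lives in `scripts/run_model_llama_cpp.py`:

```python
def llama_cpp_command(gguf_path: str, prompt: str,
                      execution_mode: str = "cli") -> list:
    """Assemble a llama.cpp invocation with the fixed benchmark
    settings. Flag names are assumptions, not the exact harness code."""
    if execution_mode == "server":
        # Persistent llama-server path for large GGUFs: the model stays
        # loaded across warmups and measured prompts.
        return ["llama-server", "--model", gguf_path,
                "--ctx-size", "8192", "--n-gpu-layers", "999"]
    return ["llama-cli", "--model", gguf_path,
            "--ctx-size", "8192", "--n-gpu-layers", "999",
            "--temp", "0.2", "--top-p", "0.95",
            "--single-turn", "--simple-io", "--jinja", "--log-disable",
            "--prompt", prompt]
```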
Fixed vLLM single-request settings:

- `tensor_parallel_size = 1`
- `max_model_len = 8192`
- `max_num_seqs = 1`
- `temperature = 0.2`
- `top_p = 0.95`
- streaming enabled
vLLM concurrency settings:

- `scripts/run_concurrency.py` automatically raises the server `max_num_seqs` to at least the highest requested concurrency level unless `--server-max-num-seqs` is supplied
- concurrency summaries use distinct request-group wall time so repeated probes do not inflate aggregate tok/sec
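The `max_num_seqs` behavior described above reduces to a small rule, sketched here with hypothetical names:

```python
def effective_max_num_seqs(levels, server_max_num_seqs=None, default=1):
    """Mirror the documented behaviour: raise the server max_num_seqs
    to at least the highest requested concurrency level, unless an
    explicit --server-max-num-seqs override was supplied."""
    if server_max_num_seqs is not None:
        return server_max_num_seqs
    return max(default, *levels)
```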
## Test Suite

The repository uses `suite_v1`, a 12-test prompt suite covering:
- short instruction following
- coding
- structured output
- reasoning
- summarization
- long-context retrieval and summarization
Test IDs:

- `t01_short_factual`
- `t02_oom_bullets`
- `t03_python_rolling_average`
- `t04_debug_unique_words`
- `t05_json_extraction`
- `t06_strict_schema_assessment`
- `t07_benchmark_math`
- `t08_model_tradeoff_recommendation`
- `t09_benchmark_summary`
- `t10_operator_note_rewrite`
- `t11_long_context_retrieval`
- `t12_long_context_structured_summary`
## Important Harness Notes

### gpt-oss thinking-token handling

gpt-oss models emit reasoning text through Ollama's `thinking` field rather than normal response chunks. The harness captures both so TTFT and decode throughput are recorded correctly.
### Gemma final-event timing fallback

gemma4:31b can return no streamed text chunks at all while still reporting `eval_count` and `eval_duration` in the final Ollama event. The harness now falls back to those final-event durations to recover:

- `generation_latency_ms`
- `decode_tokens_per_sec`
- `ttft_ms`
Records that use this path are tagged with `metrics_source: ollama_final_event_durations`.
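The fallback arithmetic is straightforward: Ollama reports durations in nanoseconds in the final event. A sketch, where the TTFT approximation from `prompt_eval_duration` is an assumption rather than the exact harness formula:

```python
def metrics_from_final_event(event: dict) -> dict:
    """Recover timing metrics from Ollama's final stream event when no
    text chunks were streamed. Ollama durations are nanoseconds."""
    eval_ns = event["eval_duration"]
    tokens = event["eval_count"]
    prompt_ns = event.get("prompt_eval_duration", 0)
    return {
        "generation_latency_ms": eval_ns / 1e6,
        "decode_tokens_per_sec": tokens / (eval_ns / 1e9),
        "ttft_ms": prompt_ns / 1e6,   # approximation: prompt-eval time
        "metrics_source": "ollama_final_event_durations",
    }
```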
### llama.cpp metric extraction

`run_model_llama_cpp.py` runs `llama-cli` in simple-io mode and parses the perf footer:

- prompt throughput from `Prompt: X t/s`
- decode throughput from `Generation: Y t/s`
For new runs, the harness records:

- `total_latency_ms` from subprocess wall-clock time
- `decode_tokens_per_sec` and `prompt_tokens_per_sec` from the perf footer
The harness no longer derives `output_tokens`, `generation_latency_ms`, or `truncated` from the requested `--n-predict` limit, because that overstated certainty when `llama-cli` did not expose the actual completion token count.
Records that use this path are tagged with `metrics_source: llama_cli_perf_output`.
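A hypothetical parser for the footer lines quoted above might look like this (the exact footer wording can differ between llama.cpp builds, so treat the regex as an assumption):

```python
import re

# Matches footer lines of the form "Prompt: X t/s" / "Generation: Y t/s".
PERF_RE = re.compile(r"^(Prompt|Generation):\s*([\d.]+)\s*t/s", re.MULTILINE)

def parse_perf_footer(output: str) -> dict:
    """Extract throughput figures from llama-cli perf output."""
    rates = {label: float(value) for label, value in PERF_RE.findall(output)}
    return {
        "prompt_tokens_per_sec": rates.get("Prompt"),
        "decode_tokens_per_sec": rates.get("Generation"),
        "metrics_source": "llama_cli_perf_output",
    }
```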
### Clean reruns matter
Before rerunning a model benchmark, archive or reset the corresponding raw JSONL and log file for that runtime. Otherwise a second run appends into the same raw file and contaminates pass-rate and summary aggregation.
Examples:

- Ollama: `results/raw/<model_safe>.jsonl` and `logs/run_model__<model_safe>.log`
- llama.cpp: `results/raw/llama_cpp__<model_safe>.jsonl` and `logs/run_model_llama_cpp__<model_safe>.log`
- vLLM: `results/raw/vllm__<model_safe>.jsonl` and `logs/run_model_vllm__<model_safe>.log`
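One way to archive those files before a rerun is sketched below. The timestamped archive layout is an assumption, not a committed repo convention:

```python
import shutil
import time
from pathlib import Path

def archive_previous_run(paths, archive_root="results/archive"):
    """Move a previous run's raw JSONL and log files aside before a
    rerun, so a second pass cannot append into the same raw file."""
    dest = Path(archive_root) / time.strftime("%Y%m%d-%H%M%S")
    moved = []
    for path in map(Path, paths):
        if path.exists():
            dest.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(dest / path.name))
            moved.append(dest / path.name)
    return moved
```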
## Current Output Artifacts

Primary outputs:

- `results/raw/*.jsonl`: one file per model/runtime, one record per run
- `results/summary/suite_summary.json`: aggregated per-model summary plus `by_model_runtime` pivots used by the leaderboard
- `leaderboard/leaderboard.md`: rendered markdown leaderboard
Current Ollama raw result set:

- `results/raw/gpt-oss__20b.jsonl`
- `results/raw/gpt-oss__120b.jsonl`
- `results/raw/qwen3.5__27b.jsonl`
- `results/raw/qwen3.5__35b.jsonl`
- `results/raw/qwen3.5__122b.jsonl`
- `results/raw/qwen3.6__35b-a3b.jsonl`
- `results/raw/nemotron-3-super__120b.jsonl`
- `results/raw/gemma4__31b.jsonl`
Current llama.cpp raw result set:

- `results/raw/llama_cpp__gpt-oss__20b.jsonl`
- `results/raw/llama_cpp__gpt-oss__120b.jsonl`
- `results/raw/llama_cpp__qwen3.5__27b.jsonl`
- `results/raw/llama_cpp__qwen3.5__35b.jsonl`
- `results/raw/llama_cpp__qwen3.5__122b.jsonl`
- `results/raw/llama_cpp__qwen3.6__35b-a3b.jsonl`
- `results/raw/llama_cpp__nemotron-3-super__120b.jsonl`
- `results/raw/llama_cpp__gemma4__31b.jsonl`
- `results/raw/llama_cpp__mistral-small-4__119b.jsonl` (an archive / ad hoc rerun exists under `results/archive/`)
Configured but not yet benchmarked in committed raw results:

- minimax-m2.7:229b via llama.cpp, pending download of the Unsloth `UD-Q3_K_M` shard set onto the Spark
Current vLLM raw result set:

- `results/raw/vllm__qwen3.5__27b.jsonl`
Important current caveat:

- the existing qwen3.5:27b vLLM artifact on this machine is a multimodal `Qwen3_5ForConditionalGeneration` rebuild with BF16 weights, not a text-only quantized artifact comparable to the Ollama and llama.cpp rows
- the harness now rejects that local artifact for new canonical runs unless you explicitly opt in with `--allow-multimodal-artifact`
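The guard described above can be sketched as a small predicate. The architecture-name heuristic here is an assumption about how the harness detects the multimodal rebuild, not its actual code:

```python
def allow_vllm_artifact(architectures, allow_multimodal=False):
    """Reject a local artifact whose config reports a multimodal
    *ForConditionalGeneration architecture unless the operator
    explicitly opted in (e.g. via --allow-multimodal-artifact)."""
    multimodal = any(
        name.endswith("ForConditionalGeneration") for name in architectures
    )
    return allow_multimodal or not multimodal
```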
## Important Caveat About Quality Scoring

The repository contains scaffolding for quality scoring, but the current committed leaderboard remains primarily a performance/stability pass.

At the moment:

- the practical leaderboard is most useful for speed and stability comparison
- `quality_score` is not yet a finalized rubric-driven score for this benchmark pass
- the rendered leaderboard should not yet be treated as a final human-audited quality ranking
If you want a strict quality leaderboard, the next step is to finalize and validate the per-test scoring rubric in `scripts/run_model.py` and summary generation.
## Hardware
Benchmarks were run on a single NVIDIA DGX Spark with:
- NVIDIA GB10
- 128 GB unified memory
- Ubuntu 24.04.4 LTS
- CUDA 13.0
## Repository Layout

```
dgx-spark-leaderboard/
├── configs/                       # Model list and runtime config
├── docs/                          # Benchmark spec, runbook, scoring rubric
├── prompts/suite_v1/              # Frozen benchmark prompts and manifest
├── scripts/
│   ├── run_model.py               # Run one model through the Ollama benchmark suite
│   ├── run_model_llama_cpp.py     # Run one model through the llama.cpp benchmark suite
│   ├── run_suite.py               # Run configured models for a selected runtime
│   ├── summarize_results.py       # Aggregate raw JSONL into suite summary JSON
│   ├── render_leaderboard.py      # Render markdown leaderboard from summary JSON
│   └── download_llama_cpp_ggufs.sh  # Fetch GGUFs into canonical local paths
├── results/
│   ├── raw/                       # Per-model/per-runtime JSONL run logs
│   └── summary/                   # Aggregated suite summary
└── leaderboard/                   # Rendered leaderboard markdown
```
## Usage

Run one Ollama model:

```bash
python3 scripts/run_model.py qwen3.5:27b
```

Run one llama.cpp model:

```bash
python3 scripts/run_model_llama_cpp.py qwen3.5:27b
```

Run all configured models for a runtime:

```bash
python3 scripts/run_suite.py --runtime ollama
python3 scripts/run_suite.py --runtime llama.cpp
python3 scripts/run_suite.py --runtime vllm
```

Regenerate aggregate outputs:

```bash
python3 scripts/summarize_results.py --raw-dir results/raw --output results/summary/suite_summary.json
python3 scripts/render_leaderboard.py --summary results/summary/suite_summary.json --output leaderboard/leaderboard.md
```

Dry run:

```bash
python3 scripts/run_model.py qwen3.5:27b --dry-run
python3 scripts/run_model_llama_cpp.py qwen3.5:27b --dry-run
python3 scripts/run_model_vllm.py qwen3.5:27b --dry-run
```

Run a focused concurrency probe:

```bash
python3 scripts/run_concurrency.py \
    --runtime vllm \
    --model qwen3.5:27b \
    --only-test t01_short_factual \
    --concurrency-level 1 \
    --concurrency-level 2 \
    --concurrency-level 4 \
    --server-max-num-seqs 4 \
    --warmups 1 \
    --repeats 2
```

Run only selected tests:

```bash
python3 scripts/run_model.py qwen3.5:27b --only-test t05_json_extraction --only-test t06_strict_schema_assessment
python3 scripts/run_model_llama_cpp.py qwen3.5:27b --only-test t05_json_extraction --only-test t06_strict_schema_assessment
```
## Roadmap

- expand canonical vLLM model coverage beyond qwen3.5:27b
- finalize the quality scoring rubric
- extend the concurrency harness (added in Phase 4) to more models
- expand overall model coverage
## License
MIT