DGX Spark LLM Leaderboard

Benchmark harness and results for local LLMs on an NVIDIA DGX Spark across Ollama, llama.cpp, and vLLM.

Leaderboard

The rendered leaderboard lives at:

  • leaderboard/leaderboard.md

That file is the source of truth for current rankings. It includes:

  • Practical Leaderboard: single-request baseline rows by model/runtime.
  • Speed View: one row per model, showing the fastest runtime and side-by-side runtime throughput.
  • Concurrency View: separate multi-user throughput probes from results/concurrency/*.summary.json.
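
To eyeball one of those probe summaries, pretty-printing with the standard json.tool module works; the <probe> stem below is a placeholder for whatever file names actually sit under results/concurrency/:

python3 -m json.tool results/concurrency/<probe>.summary.json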

Regenerate it after adding raw results:

python3 scripts/summarize_results.py --suite suite_v1
python3 scripts/render_leaderboard.py

Current Coverage

Configured models live in configs/models.yaml. Each model entry can carry runtime-specific blocks that control its eligibility for Ollama, llama.cpp, and vLLM runs; a sketch of the shape follows.
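
A hedged sketch of what one entry might look like (every field name here is an illustrative assumption, not the repo's actual schema; consult configs/models.yaml itself):

models:
  - name: qwen3.5:27b                  # illustrative entry, not copied from the repo
    ollama:
      tag: qwen3.5:27b
    llama_cpp:
      gguf: models/qwen3.5-27b-Q8_0.gguf
    vllm:
      enabled: false                   # off until local artifacts are smoke-tested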

Current runtime coverage:

  • Ollama single-request benchmark runs on DGX Spark.
  • llama.cpp single-request benchmark runs on DGX Spark.
  • vLLM single-request coverage is currently canonical for qwen3.5:27b.
  • Concurrency probes are tracked separately from the single-request baseline.

Recent notable entry:

  • nemotron-3-nano-omni:30b
    • Ollama 0.22.0, text-only GGUF Modelfile path: benchmarked successfully.
    • llama.cpp server mode: benchmarked successfully.
    • The direct multimodal Ollama import with projector metadata crashed the local Ollama runner, so the committed Ollama row uses the text-only GGUF model path.

Benchmark Protocol

Per model/runtime:

  • 2 warmup runs
  • 12 prompt tests from suite_v1
  • 3 measured repeats per test
  • 36 measured records per full run (12 tests × 3 repeats)
  • context window: 8192
  • temperature: 0.2
  • top_p: 0.95
  • concurrency: 1 for the single-request leaderboard

The prompt suite covers short instruction following, coding, structured output, reasoning, summarization, and long-context retrieval/summary tasks.
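
The individual test ids (used with --only-test in the commands below) can be read straight off the frozen suite, assuming the suite_v1 prompt files are named by test id as the t05/t06 examples further down suggest:

ls prompts/suite_v1/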

Common Commands

Run one Ollama model:

python3 scripts/run_model.py qwen3.5:27b

Run one llama.cpp model:

python3 scripts/run_model_llama_cpp.py qwen3.5:27b

Run all configured models for a runtime:

python3 scripts/run_suite.py --runtime ollama
python3 scripts/run_suite.py --runtime llama.cpp
python3 scripts/run_suite.py --runtime vllm

Run selected tests:

python3 scripts/run_model.py qwen3.5:27b --only-test t05_json_extraction --only-test t06_strict_schema_assessment
python3 scripts/run_model_llama_cpp.py qwen3.5:27b --only-test t05_json_extraction --only-test t06_strict_schema_assessment

Dry-run a model without executing prompts:

python3 scripts/run_model.py qwen3.5:27b --dry-run
python3 scripts/run_model_llama_cpp.py qwen3.5:27b --dry-run
python3 scripts/run_model_vllm.py qwen3.5:27b --dry-run

Run a focused concurrency probe:

python3 scripts/run_concurrency.py \
  --runtime vllm \
  --model qwen3.5:27b \
  --only-test t01_short_factual \
  --concurrency-level 1 \
  --concurrency-level 2 \
  --concurrency-level 4 \
  --server-max-num-seqs 4 \
  --warmups 1 \
  --repeats 2

Clean Reruns

Before rerunning a model/runtime pair, archive or remove that pair's raw JSONL and log file; appending a second full run to the same raw file contaminates pass rates and medians. A minimal archive sketch follows the list below.

Default artifact names:

  • Ollama: results/raw/<model_safe>.jsonl, logs/run_model__<model_safe>.log
  • llama.cpp: results/raw/llama_cpp__<model_safe>.jsonl, logs/run_model_llama_cpp__<model_safe>.log
  • vLLM: results/raw/vllm__<model_safe>.jsonl, logs/run_model_vllm__<model_safe>.log
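
A minimal archive sketch for the Ollama case, assuming only the artifact names above (the archive directories, timestamp suffix, and sanitized model name are illustrative, not repo conventions):

model_safe=qwen3_5_27b                       # illustrative; check results/raw/ for real names
stamp=$(date +%Y%m%d_%H%M%S)
mkdir -p results/raw/archive logs/archive
mv "results/raw/${model_safe}.jsonl" "results/raw/archive/${model_safe}.${stamp}.jsonl"
mv "logs/run_model__${model_safe}.log" "logs/archive/run_model__${model_safe}.${stamp}.log"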

Outputs

  • results/raw/*.jsonl: per-model/per-runtime records
  • results/summary/suite_summary.json: aggregated summary
  • leaderboard/leaderboard.md: rendered leaderboard
  • logs/*.log: run logs
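
A quick integrity check before trusting old rows, assuming warmup runs are not written to the raw files (adjust the expected count if they are): a clean single-request run is 36 records, one JSON object per line.

for f in results/raw/*.jsonl; do
  n=$(wc -l < "$f")
  [ "$n" -eq 36 ] || echo "suspect: $f has $n records"
done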

Caveats

  • Quality scoring is still a lightweight automated check, not a final human-audited quality rubric.
  • Some historical raw result files contain duplicate measurement warnings; check results/summary/suite_summary.json before treating old rows as clean reruns.
  • vLLM model coverage is intentionally conservative until local artifacts are present and smoke-tested.

Hardware

Benchmarks were run on a single NVIDIA DGX Spark:

  • NVIDIA GB10
  • 128 GB unified memory
  • Ubuntu 24.04.4 LTS
  • CUDA 13.0

Layout

configs/                  model and runtime config
docs/                     benchmark spec, runbook, scoring rubric
prompts/suite_v1/         frozen benchmark prompts and manifest
scripts/                  runners, summarizer, leaderboard renderer
results/raw/              per-model/per-runtime JSONL records
results/summary/          aggregated suite summary
leaderboard/              rendered markdown leaderboard

License

MIT