DGX Spark LLM Leaderboard
Benchmark harness and results for local LLMs on an NVIDIA DGX Spark across Ollama, llama.cpp, and vLLM.
Leaderboard
The rendered leaderboard is:
leaderboard/leaderboard.md
That file is the source of truth for current rankings. It includes:
- Practical Leaderboard: single-request baseline rows by model/runtime.
- Speed View: one row per model, showing the fastest runtime and side-by-side runtime throughput.
- Concurrency View: separate multi-user throughput probes from results/concurrency/*.summary.json.
Regenerate it after adding raw results:
python3 scripts/summarize_results.py --suite suite_v1
python3 scripts/render_leaderboard.py
Current Coverage
Configured models live in configs/models.yaml. Each model entry can carry
runtime-specific blocks that control its eligibility for Ollama, llama.cpp, and vLLM.
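As a rough illustration of how that file can be consumed, the sketch below loads it with PyYAML and lists each model's configured runtimes. The top-level models mapping and the per-runtime block names are assumptions, not the repo's documented schema.

```python
# Sketch: list per-model runtime eligibility from configs/models.yaml.
# The "models" mapping and the ollama/llama_cpp/vllm block names are
# assumptions about the schema, not taken from the repo.
import yaml

with open("configs/models.yaml") as f:
    cfg = yaml.safe_load(f)

for name, entry in (cfg.get("models") or {}).items():
    runtimes = [rt for rt in ("ollama", "llama_cpp", "vllm") if entry.get(rt)]
    print(f"{name}: {', '.join(runtimes) or 'no runtimes configured'}")
```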
Current runtime coverage:
- Ollama single-request benchmark runs on DGX Spark.
- llama.cpp single-request benchmark runs on DGX Spark.
- vLLM single-request coverage is currently canonical for qwen3.5:27b.
- Concurrency probes are tracked separately from the single-request baseline.
Recent notable entry: nemotron-3-nano-omni:30b
- Ollama 0.22.0, text-only GGUF Modelfile path: benchmarked successfully.
- llama.cpp server mode: benchmarked successfully.
- The direct multimodal Ollama import with projector metadata crashed the local Ollama runner, so the committed Ollama row uses the text-only GGUF model path.
Benchmark Protocol
Per model/runtime:
- 2 warmup runs
- 12 prompt tests from suite_v1
- 3 measured repeats per test
- 36 measured records per full run
- context window: 8192
- temperature: 0.2
- top_p: 0.95
- concurrency: 1 for the single-request leaderboard
The prompt suite covers short instruction following, coding, structured output, reasoning, summarization, and long-context retrieval/summary tasks.
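For orientation, the sketch below issues one measured request with the protocol's sampling settings against a local Ollama server, using Ollama's standard /api/generate endpoint, and derives decode throughput from the reported timings. It is a minimal illustration, not the repo's runner.

```python
# Minimal sketch of one measured Ollama request with the protocol's
# sampling settings (ctx 8192, temperature 0.2, top_p 0.95).
# Not the repo's runner; assumes a local Ollama server on the default port.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3.5:27b",
        "prompt": "Name the capital of France.",
        "stream": False,
        "options": {"num_ctx": 8192, "temperature": 0.2, "top_p": 0.95},
    },
    timeout=600,
)
data = resp.json()

# eval_count tokens emitted over eval_duration nanoseconds -> decode tok/s.
decode_tps = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"decode: {decode_tps:.2f} tok/s over {data['eval_count']} tokens")
```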
Common Commands
Run one Ollama model:
python3 scripts/run_model.py qwen3.5:27b
Run one llama.cpp model:
python3 scripts/run_model_llama_cpp.py qwen3.5:27b
Run all configured models for a runtime:
python3 scripts/run_suite.py --runtime ollama
python3 scripts/run_suite.py --runtime llama.cpp
python3 scripts/run_suite.py --runtime vllm
Run selected tests:
python3 scripts/run_model.py qwen3.5:27b --only-test t05_json_extraction --only-test t06_strict_schema_assessment
python3 scripts/run_model_llama_cpp.py qwen3.5:27b --only-test t05_json_extraction --only-test t06_strict_schema_assessment
Dry-run a model without executing prompts:
python3 scripts/run_model.py qwen3.5:27b --dry-run
python3 scripts/run_model_llama_cpp.py qwen3.5:27b --dry-run
python3 scripts/run_model_vllm.py qwen3.5:27b --dry-run
Run a focused concurrency probe:
python3 scripts/run_concurrency.py \
--runtime vllm \
--model qwen3.5:27b \
--only-test t01_short_factual \
--concurrency-level 1 \
--concurrency-level 2 \
--concurrency-level 4 \
--server-max-num-seqs 4 \
--warmups 1 \
--repeats 2
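As a rough picture of what such a probe measures, the sketch below fires a fixed number of simultaneous requests at a vLLM server's OpenAI-compatible completions endpoint and reports aggregate decode throughput. It is an illustration only, not the logic of run_concurrency.py.

```python
# Rough sketch of a multi-user throughput probe against a vLLM
# OpenAI-compatible server (default port 8000). Illustration only;
# not the repo's run_concurrency.py.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/completions"
CONCURRENCY = 4

def one_request(_):
    r = requests.post(URL, json={
        "model": "qwen3.5:27b",
        "prompt": "Name the capital of France.",
        "max_tokens": 128,
        "temperature": 0.2,
        "top_p": 0.95,
    }, timeout=600)
    return r.json()["usage"]["completion_tokens"]

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    completion_tokens = list(pool.map(one_request, range(CONCURRENCY)))
elapsed = time.time() - start
print(f"{CONCURRENCY} concurrent users: "
      f"{sum(completion_tokens) / elapsed:.2f} aggregate tok/s")
```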
Clean Reruns
Before rerunning a model/runtime pair, archive or remove that pair's raw JSONL and log file. Appending a second full run to the same raw file will contaminate pass rates and medians.
Default artifact names:
- Ollama: results/raw/<model_safe>.jsonl, logs/run_model__<model_safe>.log
- llama.cpp: results/raw/llama_cpp__<model_safe>.jsonl, logs/run_model_llama_cpp__<model_safe>.log
- vLLM: results/raw/vllm__<model_safe>.jsonl, logs/run_model_vllm__<model_safe>.log
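For example, the Ollama artifacts for one model could be moved aside before a rerun with something like the sketch below; the archive/ directory and the model_safe value are placeholders, not part of the repo's layout.

```python
# Sketch: archive one model/runtime pair's raw JSONL and log before a
# rerun so the fresh run starts clean. The archive/ directory and the
# model_safe value are placeholders, not defined by the repo.
import shutil
import time
from pathlib import Path

model_safe = "qwen3_5_27b"  # placeholder for the sanitized model name
dest = Path("archive") / time.strftime("%Y%m%d-%H%M%S")
dest.mkdir(parents=True, exist_ok=True)

for path in (Path(f"results/raw/{model_safe}.jsonl"),
             Path(f"logs/run_model__{model_safe}.log")):
    if path.exists():
        shutil.move(str(path), str(dest / path.name))
        print(f"archived {path} -> {dest / path.name}")
```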
Outputs
- results/raw/*.jsonl: per-model/per-runtime records
- results/summary/suite_summary.json: aggregated summary
- leaderboard/leaderboard.md: rendered leaderboard
- logs/*.log: run logs
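A quick sanity check on a raw file is to count records per test before regenerating the summary; the test_id field name below is an assumption about the record schema, and warmup records, if present, would need to be excluded separately.

```python
# Sketch: count records per test in one raw JSONL file. The "test_id"
# field name is an assumption about the record schema, not confirmed.
import json
from collections import Counter

counts = Counter()
with open("results/raw/qwen3_5_27b.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        counts[rec.get("test_id", "unknown")] += 1

for test_id, n in sorted(counts.items()):
    print(f"{test_id}: {n} records")
print(f"total records: {sum(counts.values())}")
```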
Caveats
- Quality scoring is still lightweight benchmark validation, not a final human-audited quality rubric.
- Some historical raw result files contain duplicate measurement warnings; check results/summary/suite_summary.json before treating old rows as clean reruns.
- vLLM model coverage is intentionally conservative until local artifacts are present and smoke-tested.
Hardware
Benchmarks were run on a single NVIDIA DGX Spark:
- NVIDIA GB10
- 128 GB unified memory
- Ubuntu 24.04.4 LTS
- CUDA 13.0
Layout
configs/ model and runtime config
docs/ benchmark spec, runbook, scoring rubric
prompts/suite_v1/ frozen benchmark prompts and manifest
scripts/ runners, summarizer, leaderboard renderer
results/raw/ per-model/per-runtime JSONL records
results/summary/ aggregated suite summary
leaderboard/ rendered markdown leaderboard
License
MIT