# DGX Spark LLM Leaderboard
Benchmark results and harness code for running local LLMs on an NVIDIA DGX Spark across Ollama, llama.cpp, and vLLM.
This repository now includes:
- a Phase 1 single-node Ollama benchmark pass across 10 configured models
- a llama.cpp comparison pass for the same model set
- a canonical vLLM benchmark pass for qwen3.5:27b
- a model-centric Speed View that shows the fastest single-request runtime per model plus per-runtime tok/sec columns
- a separate Concurrency View for multi-user throughput probes
## Current Leaderboard

The latest rendered leaderboard lives at `leaderboard/leaderboard.md`.
The main leaderboard remains the reproducible single-request baseline.
The Speed View is model-centric: it renders one row per model, shows the fastest single-request runtime for that model, and keeps the Ollama / vLLM / llama.cpp tok/sec columns side by side.
The Concurrency View is separate on purpose: it renders summaries from `results/concurrency/*.summary.json` so vLLM and Ollama can show multi-user throughput without polluting the single-request leaderboard summary.
The rendered markdown file is the source of truth for current rankings. This README intentionally avoids duplicating a manual ranking table because it drifts as soon as new summaries are rendered.
## Models Included

Configured in `configs/models.yaml`:

- qwen3.5:27b
- qwen3.5:35b
- qwen3.6:35b-a3b
- qwen3.5:122b
- nemotron-3-super:120b
- gemma4:31b
- gpt-oss:20b
- gpt-oss:120b
- mistral-small-4:119b
- minimax-m2.7:229b
Each model entry can now carry runtime-specific llama_cpp metadata:
- `gguf_path`
- `chat_template`
- `n_gpu_layers`
- `enabled`
- `notes`
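The exact schema is not shown in this README; a hypothetical entry illustrating those keys might look like the following (paths and values are placeholders, not the committed config):

```yaml
# Illustrative only — see configs/models.yaml for the real entries.
- name: qwen3.5:27b
  llama_cpp:
    gguf_path: /models/gguf/qwen3.5-27b.gguf   # assumed local path
    chat_template: chatml                       # assumed value
    n_gpu_layers: all
    enabled: true
    notes: "baseline quant"
```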
## Benchmark Protocol
Per model and per runtime:
- 2 warmup runs
- 12 benchmark tests
- 3 measured repeats per test
- 36 measured runs per model/runtime pair
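The loop structure implied by these numbers can be sketched as follows, using hypothetical `run_once()` and `record()` callables (the real logic lives in `scripts/run_model.py`):

```python
# Sketch of the per-model benchmark loop: warmups are discarded,
# then each suite_v1 test is measured REPEATS times.

WARMUPS = 2   # warmup runs, not recorded
TESTS = 12    # prompts in suite_v1
REPEATS = 3   # measured repeats per test

def run_benchmark(run_once, record):
    for _ in range(WARMUPS):
        run_once("warmup")                  # result discarded
    measured = 0
    for test_index in range(1, TESTS + 1):
        for repeat in range(REPEATS):
            result = run_once(f"t{test_index:02d}")
            record(test_index, repeat, result)
            measured += 1
    return measured                         # 12 * 3 = 36 measured runs
```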
Current runtime coverage:
- Ollama runtime on a single DGX Spark
- llama.cpp runtime on a single DGX Spark
- vLLM runtime on a single DGX Spark (currently canonical for qwen3.5:27b)
Fixed Ollama settings:

- Ollama 0.20.2
- Concurrency 1
- Temperature 0.2, top_p 0.95
- Context window 8192
- Streaming enabled
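The request body the harness sends is not reproduced here, but a minimal Ollama `/api/generate` payload pinned to these settings would look roughly like this (assuming the standard Ollama REST API option names):

```python
import json

def ollama_payload(model: str, prompt: str) -> dict:
    """Build a streaming /api/generate request body with the
    benchmark's fixed temperature, top_p, and context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": True,
        "options": {
            "temperature": 0.2,
            "top_p": 0.95,
            "num_ctx": 8192,   # context window
        },
    }

body = json.dumps(ollama_payload("qwen3.5:27b", "Hello"))
```

Concurrency 1 is enforced client-side by issuing requests sequentially, not through a payload field.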
Fixed llama.cpp settings:

- `llama-cli`
- `ctx_size = 8192`
- `n_gpu_layers = all`
- `temperature = 0.2`
- `top_p = 0.95`
- `--single-turn --simple-io --jinja --log-disable --perf`
Large llama.cpp GGUFs can opt into a persistent `llama-server` path via per-model `llama_cpp.execution_mode: server`, which keeps the model loaded across warmups and measured prompts instead of paying a full `llama-cli` startup per request.
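A simplified sketch of how the fixed-settings invocation could be assembled is shown below. The flag spellings and the `999` stand-in for "all GPU layers" are assumptions inferred from the settings above; the real builder lives in `scripts/run_model_llama_cpp.py`:

```python
def llama_cpp_command(gguf_path: str, prompt: str,
                      execution_mode: str = "cli") -> list:
    """Assemble a llama.cpp invocation with the fixed benchmark
    settings. Flag names are assumptions, not the exact harness code."""
    if execution_mode == "server":
        # Persistent llama-server path for large GGUFs: the model stays
        # loaded across warmups and measured prompts.
        return ["llama-server", "--model", gguf_path,
                "--ctx-size", "8192", "--n-gpu-layers", "999"]
    return ["llama-cli", "--model", gguf_path,
            "--ctx-size", "8192", "--n-gpu-layers", "999",
            "--temp", "0.2", "--top-p", "0.95",
            "--single-turn", "--simple-io", "--jinja", "--log-disable",
            "--prompt", prompt]
```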
Fixed vLLM single-request settings:

- `tensor_parallel_size = 1`
- `max_model_len = 8192`
- `max_num_seqs = 1`
- `temperature = 0.2`
- `top_p = 0.95`
- streaming enabled
vLLM concurrency settings:

- `scripts/run_concurrency.py` automatically raises the server `max_num_seqs` to at least the highest requested concurrency level unless `--server-max-num-seqs` is supplied
- concurrency summaries use distinct request-group wall time so repeated probes do not inflate aggregate tok/sec
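The `max_num_seqs` behavior described above reduces to a small rule, sketched here with hypothetical names:

```python
def effective_max_num_seqs(levels, server_max_num_seqs=None, default=1):
    """Mirror the documented behaviour: raise the server max_num_seqs
    to at least the highest requested concurrency level, unless an
    explicit --server-max-num-seqs override was supplied."""
    if server_max_num_seqs is not None:
        return server_max_num_seqs
    return max(default, *levels)
```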
## Test Suite

The repository uses `suite_v1`, a 12-test prompt suite covering:
- short instruction following
- coding
- structured output
- reasoning
- summarization
- long-context retrieval and summarization
Test IDs:

- `t01_short_factual`
- `t02_oom_bullets`
- `t03_python_rolling_average`
- `t04_debug_unique_words`
- `t05_json_extraction`
- `t06_strict_schema_assessment`
- `t07_benchmark_math`
- `t08_model_tradeoff_recommendation`
- `t09_benchmark_summary`
- `t10_operator_note_rewrite`
- `t11_long_context_retrieval`
- `t12_long_context_structured_summary`
## Important Harness Notes

### gpt-oss thinking-token handling

gpt-oss models emit reasoning text through Ollama's `thinking` field rather than normal response chunks. The harness captures both so TTFT and decode throughput are recorded correctly.
### Gemma final-event timing fallback

gemma4:31b can return no streamed text chunks at all while still reporting `eval_count` and `eval_duration` in the final Ollama event. The harness now falls back to those final-event durations to recover:

- `generation_latency_ms`
- `decode_tokens_per_sec`
- `ttft_ms`
Records that use this path are tagged with `metrics_source: ollama_final_event_durations`.
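The fallback arithmetic is straightforward: Ollama reports durations in nanoseconds in the final event. A sketch, where the TTFT approximation from `prompt_eval_duration` is an assumption rather than the exact harness formula:

```python
def metrics_from_final_event(event: dict) -> dict:
    """Recover timing metrics from Ollama's final stream event when no
    text chunks were streamed. Ollama durations are nanoseconds."""
    eval_ns = event["eval_duration"]
    tokens = event["eval_count"]
    prompt_ns = event.get("prompt_eval_duration", 0)
    return {
        "generation_latency_ms": eval_ns / 1e6,
        "decode_tokens_per_sec": tokens / (eval_ns / 1e9),
        "ttft_ms": prompt_ns / 1e6,   # approximation: prompt-eval time
        "metrics_source": "ollama_final_event_durations",
    }
```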
### llama.cpp metric extraction

`run_model_llama_cpp.py` runs `llama-cli` in simple-io mode and parses the perf footer:

- prompt throughput from `Prompt: X t/s`
- decode throughput from `Generation: Y t/s`
For new runs, the harness records:

- `total_latency_ms` from subprocess wall-clock time
- `decode_tokens_per_sec` and `prompt_tokens_per_sec` from the perf footer
The harness no longer derives `output_tokens`, `generation_latency_ms`, or `truncated` from the requested `--n-predict` limit, because that overstated certainty when `llama-cli` did not expose the actual completion token count.
Records that use this path are tagged with `metrics_source: llama_cli_perf_output`.
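A hypothetical parser for the footer lines quoted above might look like this (the exact footer wording can differ between llama.cpp builds, so treat the regex as an assumption):

```python
import re

# Matches footer lines of the form "Prompt: X t/s" / "Generation: Y t/s".
PERF_RE = re.compile(r"^(Prompt|Generation):\s*([\d.]+)\s*t/s", re.MULTILINE)

def parse_perf_footer(output: str) -> dict:
    """Extract throughput figures from llama-cli perf output."""
    rates = {label: float(value) for label, value in PERF_RE.findall(output)}
    return {
        "prompt_tokens_per_sec": rates.get("Prompt"),
        "decode_tokens_per_sec": rates.get("Generation"),
        "metrics_source": "llama_cli_perf_output",
    }
```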
### Clean reruns matter
Before rerunning a model benchmark, archive or reset the corresponding raw JSONL and log file for that runtime. Otherwise a second run appends into the same raw file and contaminates pass-rate and summary aggregation.
Examples:

- Ollama: `results/raw/<model_safe>.jsonl` and `logs/run_model__<model_safe>.log`
- llama.cpp: `results/raw/llama_cpp__<model_safe>.jsonl` and `logs/run_model_llama_cpp__<model_safe>.log`
- vLLM: `results/raw/vllm__<model_safe>.jsonl` and `logs/run_model_vllm__<model_safe>.log`
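One way to archive those files before a rerun is sketched below. The timestamped archive layout is an assumption, not a committed repo convention:

```python
import shutil
import time
from pathlib import Path

def archive_previous_run(paths, archive_root="results/archive"):
    """Move a previous run's raw JSONL and log files aside before a
    rerun, so a second pass cannot append into the same raw file."""
    dest = Path(archive_root) / time.strftime("%Y%m%d-%H%M%S")
    moved = []
    for path in map(Path, paths):
        if path.exists():
            dest.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(dest / path.name))
            moved.append(dest / path.name)
    return moved
```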
## Current Output Artifacts

Primary outputs:

- `results/raw/*.jsonl`: one file per model/runtime, one record per run
- `results/summary/suite_summary.json`: aggregated per-model summary plus `by_model_runtime` pivots used by the leaderboard
- `leaderboard/leaderboard.md`: rendered markdown leaderboard
Current Ollama raw result set:

- `results/raw/gpt-oss__20b.jsonl`
- `results/raw/gpt-oss__120b.jsonl`
- `results/raw/qwen3.5__27b.jsonl`
- `results/raw/qwen3.5__35b.jsonl`
- `results/raw/qwen3.5__122b.jsonl`
- `results/raw/qwen3.6__35b-a3b.jsonl`
- `results/raw/nemotron-3-super__120b.jsonl`
- `results/raw/gemma4__31b.jsonl`
Current llama.cpp raw result set:

- `results/raw/llama_cpp__gpt-oss__20b.jsonl`
- `results/raw/llama_cpp__gpt-oss__120b.jsonl`
- `results/raw/llama_cpp__qwen3.5__27b.jsonl`
- `results/raw/llama_cpp__qwen3.5__35b.jsonl`
- `results/raw/llama_cpp__qwen3.5__122b.jsonl`
- `results/raw/llama_cpp__qwen3.6__35b-a3b.jsonl`
- `results/raw/llama_cpp__nemotron-3-super__120b.jsonl`
- `results/raw/llama_cpp__gemma4__31b.jsonl`
- `results/raw/llama_cpp__mistral-small-4__119b.jsonl` (an archive / ad hoc rerun exists under `results/archive/`)
Configured but not yet benchmarked in committed raw results:

- minimax-m2.7:229b via llama.cpp, pending download of the Unsloth `UD-Q3_K_M` shard set onto the Spark
Current vLLM raw result set:

- `results/raw/vllm__qwen3.5__27b.jsonl`
Important current caveat:

- the existing qwen3.5:27b vLLM artifact on this machine is a multimodal `Qwen3_5ForConditionalGeneration` rebuild with BF16 weights, not a text-only quantized artifact comparable to the Ollama and llama.cpp rows
- the harness now rejects that local artifact for new canonical runs unless you explicitly opt in with `--allow-multimodal-artifact`
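The guard described above can be sketched as a small predicate. The architecture-name heuristic here is an assumption about how the harness detects the multimodal rebuild, not its actual code:

```python
def allow_vllm_artifact(architectures, allow_multimodal=False):
    """Reject a local artifact whose config reports a multimodal
    *ForConditionalGeneration architecture unless the operator
    explicitly opted in (e.g. via --allow-multimodal-artifact)."""
    multimodal = any(
        name.endswith("ForConditionalGeneration") for name in architectures
    )
    return allow_multimodal or not multimodal
```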
## Important Caveat About Quality Scoring

The repository contains scaffolding for quality scoring, but the current committed leaderboard remains primarily a performance/stability pass.

At the moment:

- the practical leaderboard is most useful for speed and stability comparison
- `quality_score` is not yet a finalized rubric-driven score for this benchmark pass
- the rendered leaderboard should not yet be treated as a final human-audited quality ranking
If you want a strict quality leaderboard, the next step is to finalize and validate the per-test scoring rubric in `scripts/run_model.py` and summary generation.
## Hardware
Benchmarks were run on a single NVIDIA DGX Spark with:
- NVIDIA GB10
- 128 GB unified memory
- Ubuntu 24.04.4 LTS
- CUDA 13.0
## Repository Layout

```
dgx-spark-leaderboard/
├── configs/                       # Model list and runtime config
├── docs/                          # Benchmark spec, runbook, scoring rubric
├── prompts/suite_v1/              # Frozen benchmark prompts and manifest
├── scripts/
│   ├── run_model.py               # Run one model through the Ollama benchmark suite
│   ├── run_model_llama_cpp.py     # Run one model through the llama.cpp benchmark suite
│   ├── run_suite.py               # Run configured models for a selected runtime
│   ├── summarize_results.py       # Aggregate raw JSONL into suite summary JSON
│   ├── render_leaderboard.py      # Render markdown leaderboard from summary JSON
│   └── download_llama_cpp_ggufs.sh  # Fetch GGUFs into canonical local paths
├── results/
│   ├── raw/                       # Per-model/per-runtime JSONL run logs
│   └── summary/                   # Aggregated suite summary
└── leaderboard/                   # Rendered leaderboard markdown
```
## Usage

Run one Ollama model:

```bash
python3 scripts/run_model.py qwen3.5:27b
```

Run one llama.cpp model:

```bash
python3 scripts/run_model_llama_cpp.py qwen3.5:27b
```

Run all configured models for a runtime:

```bash
python3 scripts/run_suite.py --runtime ollama
python3 scripts/run_suite.py --runtime llama.cpp
python3 scripts/run_suite.py --runtime vllm
```

Regenerate aggregate outputs:

```bash
python3 scripts/summarize_results.py --raw-dir results/raw --output results/summary/suite_summary.json
python3 scripts/render_leaderboard.py --summary results/summary/suite_summary.json --output leaderboard/leaderboard.md
```

Dry run:

```bash
python3 scripts/run_model.py qwen3.5:27b --dry-run
python3 scripts/run_model_llama_cpp.py qwen3.5:27b --dry-run
python3 scripts/run_model_vllm.py qwen3.5:27b --dry-run
```

Run a focused concurrency probe:

```bash
python3 scripts/run_concurrency.py \
    --runtime vllm \
    --model qwen3.5:27b \
    --only-test t01_short_factual \
    --concurrency-level 1 \
    --concurrency-level 2 \
    --concurrency-level 4 \
    --server-max-num-seqs 4 \
    --warmups 1 \
    --repeats 2
```

Run only selected tests:

```bash
python3 scripts/run_model.py qwen3.5:27b --only-test t05_json_extraction --only-test t06_strict_schema_assessment
python3 scripts/run_model_llama_cpp.py qwen3.5:27b --only-test t05_json_extraction --only-test t06_strict_schema_assessment
```
## Roadmap

- expand canonical vLLM model coverage beyond qwen3.5:27b
- finalize the quality scoring rubric
- extend the concurrency harness (added in Phase 4) to more models
- expand overall model coverage
## License
MIT