
DGX Spark LLM Leaderboard

Benchmark results and harness code for running local LLMs on an NVIDIA DGX Spark across Ollama, llama.cpp, and vLLM.

This repository now includes:

  • a Phase 1 single-node Ollama benchmark pass across 10 configured models
  • a llama.cpp comparison pass for the same model set
  • a canonical vLLM benchmark pass for qwen3.5:27b
  • a model-centric Speed View that shows the fastest single-request runtime per model plus per-runtime tok/sec columns
  • a separate Concurrency View for multi-user throughput probes

Current Leaderboard

The latest rendered leaderboard lives at:

  • leaderboard/leaderboard.md

The main leaderboard remains the reproducible single-request baseline.

The Speed View is model-centric: it renders one row per model, shows the fastest single-request runtime for that model, and keeps the Ollama / vLLM / llama.cpp tok/sec columns side by side.
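Since the Speed View's selection rule is described only in prose, here is a minimal sketch of it (one row per model, fastest single-request runtime wins). The dictionary shape and field names are illustrative assumptions, not the harness's actual schema.

```python
# Illustrative only: pick the fastest single-request runtime per model.
# Model names and latency values below are made-up examples.
def speed_view_rows(by_model_runtime):
    """by_model_runtime maps model -> {runtime: mean total latency in ms}."""
    rows = []
    for model, runtimes in by_model_runtime.items():
        fastest = min(runtimes, key=runtimes.get)  # lowest latency wins
        rows.append({
            "model": model,
            "fastest_runtime": fastest,
            "latency_ms": runtimes[fastest],
        })
    return rows

rows = speed_view_rows({"qwen3.5:27b": {"ollama": 4200.0, "vllm": 3100.0}})
```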

The Concurrency View is separate on purpose: it renders summaries from results/concurrency/*.summary.json so vLLM and Ollama can show multi-user throughput without polluting the single-request leaderboard summary.

The rendered markdown file is the source of truth for current rankings. This README intentionally avoids duplicating a manual ranking table because it drifts as soon as new summaries are rendered.

Models Included

Configured in configs/models.yaml:

  • qwen3.5:27b
  • qwen3.5:35b
  • qwen3.6:35b-a3b
  • qwen3.5:122b
  • nemotron-3-super:120b
  • gemma4:31b
  • gpt-oss:20b
  • gpt-oss:120b
  • mistral-small-4:119b
  • minimax-m2.7:229b

Each model entry can now carry runtime-specific llama_cpp metadata:

  • gguf_path
  • chat_template
  • n_gpu_layers
  • enabled
  • notes
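As a hedged illustration of how those fields fit together, a configs/models.yaml entry might look like the sketch below. The field names are the ones listed above; every value is a placeholder, not an actual path or template from this repo.

```yaml
models:
  - name: qwen3.5:27b
    llama_cpp:
      gguf_path: /path/to/model.gguf   # placeholder; real paths live in configs/models.yaml
      chat_template: chatml            # placeholder template name
      n_gpu_layers: all                # mirrors the fixed llama.cpp setting below
      enabled: true
      notes: "illustrative entry only"
```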

Benchmark Protocol

Per model and per runtime:

  • 2 warmup runs
  • 12 benchmark tests
  • 3 measured repeats per test
  • 36 measured runs per model/runtime pair
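The measured-run count above is just the product of the protocol parameters, which can be sanity-checked directly (assuming the 2 warmups run once per model/runtime pair rather than per test):

```python
# Sanity check of the per-model/per-runtime run counts implied by the protocol.
warmups = 2
tests = 12
repeats = 3
measured_runs = tests * repeats            # 12 tests x 3 repeats
total_invocations = warmups + measured_runs  # assumes warmups are global, not per test
```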

Current runtime coverage:

  • Ollama runtime on a single DGX Spark
  • llama.cpp runtime on a single DGX Spark
  • vLLM runtime on a single DGX Spark (currently canonical for qwen3.5:27b)

Fixed Ollama settings:

  • Ollama 0.20.2
  • concurrency = 1
  • temperature = 0.2
  • top_p = 0.95
  • context window = 8192
  • streaming enabled

Fixed llama.cpp settings:

  • llama-cli
  • ctx_size = 8192
  • n_gpu_layers = all
  • temperature = 0.2
  • top_p = 0.95
  • --single-turn --simple-io --jinja --log-disable --perf

Large llama.cpp GGUFs can opt into a persistent llama-server path via per-model llama_cpp.execution_mode: server, which keeps the model loaded across warmups and measured prompts instead of paying a full llama-cli startup per request.

Fixed vLLM single-request settings:

  • tensor_parallel_size = 1
  • max_model_len = 8192
  • max_num_seqs = 1
  • temperature = 0.2
  • top_p = 0.95
  • streaming enabled

vLLM concurrency settings:

  • scripts/run_concurrency.py automatically raises the server max_num_seqs to at least the highest requested concurrency level unless --server-max-num-seqs is supplied
  • concurrency summaries use distinct request-group wall time so repeated probes do not inflate aggregate tok/sec
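One way to implement the "distinct request-group wall time" rule is to merge overlapping request-group spans before dividing tokens by time, so repeated probes over the same interval are not double-counted. The sketch below is a hedged illustration of that idea, not the actual summary code:

```python
# Charge each request group its own wall-clock span, merging overlaps,
# instead of summing per-request durations (which inflates tok/sec).
def distinct_wall_seconds(spans):
    """spans: list of (start_s, end_s) tuples, one per request group."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # extend overlapping span
        else:
            merged.append([start, end])
    return sum(end - start for start, end in merged)

# Two fully overlapping probes should count their shared seconds only once.
wall = distinct_wall_seconds([(0.0, 10.0), (0.0, 10.0)])
agg_tok_per_sec = 2000 / wall  # 2000 total tokens over 10 distinct seconds
```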

Test Suite

The repository uses suite_v1, a 12-test prompt suite covering:

  • short instruction following
  • coding
  • structured output
  • reasoning
  • summarization
  • long-context retrieval and summarization

Test IDs:

  • t01_short_factual
  • t02_oom_bullets
  • t03_python_rolling_average
  • t04_debug_unique_words
  • t05_json_extraction
  • t06_strict_schema_assessment
  • t07_benchmark_math
  • t08_model_tradeoff_recommendation
  • t09_benchmark_summary
  • t10_operator_note_rewrite
  • t11_long_context_retrieval
  • t12_long_context_structured_summary

Important Harness Notes

gpt-oss thinking-token handling

gpt-oss models emit reasoning text through Ollama's thinking field rather than normal response chunks. The harness captures both so TTFT and decode throughput are recorded correctly.
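A minimal sketch of that capture logic, assuming each streamed chunk is a dict that may carry either a thinking field or a response field (timestamps are supplied per chunk here for testability; the real harness presumably reads the clock as chunks arrive):

```python
# Illustrative stream consumer: TTFT counts the first token from EITHER the
# "thinking" field or the normal "response" field.
def consume_stream(chunks, start_s):
    ttft_s = None
    thinking, response = [], []
    for chunk in chunks:
        text = chunk.get("thinking") or chunk.get("response") or ""
        if text and ttft_s is None:
            ttft_s = chunk["t"] - start_s  # first visible token, any field
        if chunk.get("thinking"):
            thinking.append(chunk["thinking"])
        elif chunk.get("response"):
            response.append(chunk["response"])
    return ttft_s, "".join(thinking), "".join(response)

ttft, think, resp = consume_stream(
    [{"thinking": "Let me think.", "t": 1.2}, {"response": "Answer.", "t": 1.5}],
    start_s=1.0,
)
```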

Gemma final-event timing fallback

gemma4:31b can return no streamed text chunks at all, while still reporting eval_count and eval_duration in the final Ollama event. The harness now falls back to those final-event durations to recover:

  • generation_latency_ms
  • decode_tokens_per_sec
  • ttft_ms

Records that use this path are tagged with:

  • metrics_source: ollama_final_event_durations
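The fallback arithmetic is straightforward given that Ollama reports eval_duration in nanoseconds; the sketch below shows the shape of it for the generation-side metrics (recovering ttft_ms would additionally use the prompt-side durations, omitted here):

```python
# Hedged sketch of the final-event fallback: recover decode metrics from
# eval_count / eval_duration when no streamed chunks arrived.
def metrics_from_final_event(event):
    eval_count = event["eval_count"]
    eval_ns = event["eval_duration"]  # Ollama reports durations in nanoseconds
    return {
        "generation_latency_ms": eval_ns / 1e6,
        "decode_tokens_per_sec": eval_count / (eval_ns / 1e9) if eval_ns else 0.0,
        "metrics_source": "ollama_final_event_durations",
    }

m = metrics_from_final_event({"eval_count": 300, "eval_duration": 2_000_000_000})
# 300 tokens over 2 s of eval time
```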

llama.cpp metric extraction

run_model_llama_cpp.py runs llama-cli in simple-io mode and parses the perf footer:

  • prompt throughput from Prompt: X t/s
  • decode throughput from Generation: Y t/s

For new runs, the harness records:

  • total_latency_ms from subprocess wall-clock time
  • decode_tokens_per_sec and prompt_tokens_per_sec from the perf footer

The harness no longer derives output_tokens, generation_latency_ms, or truncated from the requested --n-predict limit because that overstated certainty when llama-cli did not expose the actual completion token count.

Records that use this path are tagged with:

  • metrics_source: llama_cli_perf_output
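The footer parsing described above can be sketched with two regexes. The exact footer formatting emitted by llama-cli varies between builds, so treat the patterns as assumptions matching the `Prompt: X t/s` / `Generation: Y t/s` shapes quoted earlier:

```python
import re

# Illustrative perf-footer parser; returns None for a metric whose line is absent.
PROMPT_RE = re.compile(r"Prompt:\s*([\d.]+)\s*t/s")
GEN_RE = re.compile(r"Generation:\s*([\d.]+)\s*t/s")

def parse_perf_footer(footer_text):
    prompt = PROMPT_RE.search(footer_text)
    gen = GEN_RE.search(footer_text)
    return {
        "prompt_tokens_per_sec": float(prompt.group(1)) if prompt else None,
        "decode_tokens_per_sec": float(gen.group(1)) if gen else None,
    }

metrics = parse_perf_footer("Prompt: 512.3 t/s\nGeneration: 41.7 t/s\n")
```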

Clean reruns matter

Before rerunning a model benchmark, archive or reset the corresponding raw JSONL and log file for that runtime. Otherwise a second run appends into the same raw file and contaminates pass-rate and summary aggregation.

Examples:

  • Ollama: results/raw/<model_safe>.jsonl and logs/run_model__<model_safe>.log
  • llama.cpp: results/raw/llama_cpp__<model_safe>.jsonl and logs/run_model_llama_cpp__<model_safe>.log
  • vLLM: results/raw/vllm__<model_safe>.jsonl and logs/run_model_vllm__<model_safe>.log
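A small helper that resolves the per-runtime artifact paths listed above makes the archive step harder to forget. This is a hypothetical convenience, not part of the harness; the actual archiving (e.g. shutil.move into results/archive/) is left to the operator:

```python
from pathlib import Path

# Resolve the raw JSONL and log file for a model/runtime pair, following the
# naming scheme shown above (runtime=None means the plain Ollama paths).
def run_artifacts(model_safe, runtime=None):
    raw_infix = f"{runtime}__" if runtime else ""
    log_infix = f"_{runtime}" if runtime else ""
    return [
        Path(f"results/raw/{raw_infix}{model_safe}.jsonl"),
        Path(f"logs/run_model{log_infix}__{model_safe}.log"),
    ]

paths = run_artifacts("qwen3.5__27b", runtime="vllm")
```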

Current Output Artifacts

Primary outputs:

  • results/raw/*.jsonl — one file per model/runtime, one record per run
  • results/summary/suite_summary.json — aggregated per-model summary plus by_model_runtime pivots used by the leaderboard
  • leaderboard/leaderboard.md — rendered markdown leaderboard
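Because the raw files are JSONL (one JSON record per line), downstream tooling can consume them with a few lines of Python. The record fields shown here are assumptions for illustration, not the harness's confirmed schema:

```python
import json

# Parse a JSONL payload: one JSON object per non-empty line.
def load_records(jsonl_text):
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

records = load_records(
    '{"test_id": "t01_short_factual", "decode_tokens_per_sec": 40.0}\n'
    '{"test_id": "t05_json_extraction", "decode_tokens_per_sec": 38.5}\n'
)
```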

Current Ollama raw result set:

  • results/raw/gpt-oss__20b.jsonl
  • results/raw/gpt-oss__120b.jsonl
  • results/raw/qwen3.5__27b.jsonl
  • results/raw/qwen3.5__35b.jsonl
  • results/raw/qwen3.5__122b.jsonl
  • results/raw/qwen3.6__35b-a3b.jsonl
  • results/raw/nemotron-3-super__120b.jsonl
  • results/raw/gemma4__31b.jsonl

Current llama.cpp raw result set:

  • results/raw/llama_cpp__gpt-oss__20b.jsonl
  • results/raw/llama_cpp__gpt-oss__120b.jsonl
  • results/raw/llama_cpp__qwen3.5__27b.jsonl
  • results/raw/llama_cpp__qwen3.5__35b.jsonl
  • results/raw/llama_cpp__qwen3.5__122b.jsonl
  • results/raw/llama_cpp__qwen3.6__35b-a3b.jsonl
  • results/raw/llama_cpp__nemotron-3-super__120b.jsonl
  • results/raw/llama_cpp__gemma4__31b.jsonl
  • results/raw/llama_cpp__mistral-small-4__119b.jsonl (archive / ad hoc rerun exists under results/archive/)

Configured but not yet benchmarked in committed raw results:

  • minimax-m2.7:229b via llama.cpp after the Unsloth UD-Q3_K_M shard set is downloaded onto the Spark

Current vLLM raw result set:

  • results/raw/vllm__qwen3.5__27b.jsonl

Important current caveat:

  • the existing qwen3.5:27b vLLM artifact on this machine is a multimodal Qwen3_5ForConditionalGeneration rebuild with BF16 weights, not a text-only quantized artifact comparable to the Ollama and llama.cpp rows
  • the harness now rejects that local artifact for new canonical runs unless you explicitly opt in with --allow-multimodal-artifact

Important Caveat About Quality Scoring

The repository contains scaffolding for quality scoring, but the current committed leaderboard remains primarily a performance/stability pass.

At the moment:

  • the practical leaderboard is most useful for speed and stability comparison
  • quality_score is not yet a finalized rubric-driven score for this benchmark pass
  • the rendered leaderboard should not yet be treated as a final human-audited quality ranking

If you want a strict quality leaderboard, the next step is to finalize and validate the per-test scoring rubric in scripts/run_model.py and summary generation.

Hardware

Benchmarks were run on a single NVIDIA DGX Spark with:

  • NVIDIA GB10
  • 128 GB unified memory
  • Ubuntu 24.04.4 LTS
  • CUDA 13.0

Repository Layout

dgx-spark-leaderboard/
├── configs/                  # Model list and runtime config
├── docs/                     # Benchmark spec, runbook, scoring rubric
├── prompts/suite_v1/         # Frozen benchmark prompts and manifest
├── scripts/
│   ├── run_model.py           # Run one model through the Ollama benchmark suite
│   ├── run_model_llama_cpp.py # Run one model through the llama.cpp benchmark suite
│   ├── run_suite.py           # Run configured models for a selected runtime
│   ├── summarize_results.py   # Aggregate raw JSONL into suite summary JSON
│   ├── render_leaderboard.py  # Render markdown leaderboard from summary JSON
│   └── download_llama_cpp_ggufs.sh # Fetch GGUFs into canonical local paths
├── results/
│   ├── raw/                  # Per-model/per-runtime JSONL run logs
│   └── summary/              # Aggregated suite summary
└── leaderboard/              # Rendered leaderboard markdown

Usage

Run one Ollama model:

python3 scripts/run_model.py qwen3.5:27b

Run one llama.cpp model:

python3 scripts/run_model_llama_cpp.py qwen3.5:27b

Run all configured models for a runtime:

python3 scripts/run_suite.py --runtime ollama
python3 scripts/run_suite.py --runtime llama.cpp
python3 scripts/run_suite.py --runtime vllm

Regenerate aggregate outputs:

python3 scripts/summarize_results.py --raw-dir results/raw --output results/summary/suite_summary.json
python3 scripts/render_leaderboard.py --summary results/summary/suite_summary.json --output leaderboard/leaderboard.md

Dry run:

python3 scripts/run_model.py qwen3.5:27b --dry-run
python3 scripts/run_model_llama_cpp.py qwen3.5:27b --dry-run
python3 scripts/run_model_vllm.py qwen3.5:27b --dry-run

Run a focused concurrency probe:

python3 scripts/run_concurrency.py \
  --runtime vllm \
  --model qwen3.5:27b \
  --only-test t01_short_factual \
  --concurrency-level 1 \
  --concurrency-level 2 \
  --concurrency-level 4 \
  --server-max-num-seqs 4 \
  --warmups 1 \
  --repeats 2

Run only selected tests:

python3 scripts/run_model.py qwen3.5:27b --only-test t05_json_extraction --only-test t06_strict_schema_assessment
python3 scripts/run_model_llama_cpp.py qwen3.5:27b --only-test t05_json_extraction --only-test t06_strict_schema_assessment

Roadmap

  • expand canonical vLLM model coverage beyond qwen3.5:27b
  • finalize the quality scoring rubric
  • extend the Phase 4 concurrency harness to more models
  • expand overall model coverage

License

MIT