Field notes from the current B70 inference landscape

The Intel Arc Pro B70 Local LLM Landscape: What Works, What Breaks, and What I’d Run Today

A practical guide to the current B70 software landscape for local LLM serving: what to run first, what is slower but useful, and which paths still hit blockers.

1. Why this B70 test matters

These notes look at what actually happened when several local LLM serving stacks were tried on Intel Arc Pro B70 hardware.

The tested stacks were llama.cpp, vLLM, OpenVINO Model Server, OpenArc, and Intel LLM Scaler. The report keeps the focus on observable outcomes: whether the service started, whether it returned usable text, which model format was used, and which timing numbers were recorded.

Gemma 4 appears heavily in the tests because it was the model family used for personal projects. That choice makes the results useful for this specific Gemma 4/B70 setup, but it also means the OpenVINO Model Server, OpenArc and LLM Scaler findings should not be read as final judgments about every other model family. Other supported model families would need their own tests on these stacks.

2. What to expect before starting

The results vary by model format and runtime. GGUF models were tested through llama.cpp with both SYCL and Vulkan. For vLLM the XPU kernel backend was used. OpenVINO IR serving was tested through OpenVINO Model Server and OpenArc.

3. The B70 setup used here

The test machine exposed B70 GPUs through Linux DRM nodes. Intel’s public B70 specification lists 32 Xe-cores, 256 XMX engines, 32 GB graphics memory, 367 INT8 TOPS, PCIe 5.0 x16, and 230 W board power. The recorded results depend on the full software path as well as the hardware.

4. Models and tool paths used

The same model family behaved differently depending on format. BF16, OpenVINO INT4, AutoRound INT4, AWQ, and GGUF variants were not interchangeable in these tests.

Model / artifact	Used in	Source	Notes from test
OpenVINO/gemma-4-31B-it-int4-ov	OpenVINO Model Server, OpenArc	Hugging Face OpenVINO model	OpenVINO IR existed, but the tested OVMS/OpenArc Gemma 4 serving paths did not complete inference.
Intel/gemma-4-31B-it-int4-AutoRound	vLLM Intel XPU, LLM Scaler	Hugging Face Intel AutoRound model	Worked in the tested vLLM Intel image; failed in tested LLM Scaler image.
cyankiwi/gemma-4-31B-it-AWQ-4bit	vLLM Intel XPU, LLM Scaler	Hugging Face AWQ 4-bit model	Worked in vLLM language-only mode, slightly slower than AutoRound in the recorded Gemma runs; failed in the tested LLM Scaler image.
cyankiwi/gemma-4-31B-it-AWQ-8bit	vLLM Intel XPU	Hugging Face AWQ 8-bit model	Failed because the tested XPU kernels did not support the detected uint8b128/W8A16 path.
cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4	vLLM Intel XPU	Hugging Face model card	Worked with reasoning parsing in the tested vLLM image.
`gemma-4-31B-it-UD-Q4_K_XL.gguf`	llama.cpp SYCL/Vulkan	Local file in test notes	Used for Gemma 4 31B Q4_K_XL llama-bench runs. Exact download page was not recorded in the supplied notes.
`gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf`	llama.cpp SYCL/Vulkan	Local file in test notes	Used for Gemma 4 26B A4B Q4_K_XL llama-bench runs. Exact download page was not recorded in the supplied notes.
`Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf`	llama.cpp SYCL/Vulkan	Local file in test notes	Used for Qwen 3.6 35B A3B Q4_K_XL llama-bench runs. Exact download page was not recorded in the supplied notes.
`gemma-4-E4B-it-Q5_K_M.gguf`	llama.cpp SYCL April baseline	Local file in test notes	Used for the older April Gemma 4 E4B Q5_K_M generation-script baseline. Exact download page was not recorded in the supplied notes.
`Qwen3.5-9B-Q4_K_M.gguf`	llama.cpp SYCL April baseline	Local file in test notes	Used for the older April Qwen 3.5 9B Q4_K_M generation-script baseline. Exact download page was not recorded in the supplied notes.

The tested tool paths were: llama.cpp SYCL/Vulkan for GGUF; vLLM Intel XPU for OpenAI-compatible serving; OpenVINO Model Server and OpenArc for OpenVINO IR serving; and Intel LLM Scaler as an Intel-oriented vLLM-based route.

5. How the runs were measured

The runs used Docker containers, port 8000, OpenAI-compatible chat requests where supported, and server logs for timing/failure details. llama.cpp used --server, --jinja, large context, GPU layer offload, and Docker log capture.

Metric fields
The result tables below focus on token throughput rather than elapsed prompt, generation, or request time. For llama.cpp, May benchmark rows use llama-bench pp/tg throughput where available; older April rows remain generation-script baselines. For vLLM, the tables report generation throughput; comparable average prompt throughput was not available from the recorded result summaries.

The original benchmark harness was not supplied, so this post keeps the recorded metrics instead of inferring hidden implementation details.
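Because the harness itself is unavailable, the relationship between the recorded fields can only be sketched. The snippet below assumes each run record holds completion tokens, final-output characters, and a generation time in seconds. The function name, the field names, and the sample numbers are hypothetical and are not taken from the original harness.

```python
from statistics import mean

def summarize(runs):
    """Average per-run generation throughput over a list of run records.

    Each record is a dict with hypothetical keys:
    completion_tokens, chars, gen_seconds.
    """
    tok_s = [r["completion_tokens"] / r["gen_seconds"] for r in runs]
    chars_s = [r["chars"] / r["gen_seconds"] for r in runs]
    return {
        "avg_completion_tokens": mean(r["completion_tokens"] for r in runs),
        "avg_tok_s": mean(tok_s),
        "avg_chars_s": mean(chars_s),
    }

# Illustrative numbers only, not recorded results.
example_runs = [
    {"completion_tokens": 300, "chars": 90, "gen_seconds": 20.0},
    {"completion_tokens": 330, "chars": 60, "gen_seconds": 22.0},
]
summary = summarize(example_runs)
print(summary)
```

Under this reading, chars/s can sit far below tok/s when a model spends most of its completion tokens on reasoning output that is not part of the final answer text.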

6. llama.cpp results

The largest set of successful runs in the notes came from llama.cpp with GGUF models.

The recorded llama.cpp tests used both Intel SYCL and Vulkan backends. The May SYCL numbers below are from llama-bench, so they report prompt-processing rows (pp) and generation rows (tg) directly. The revised Gemma 4 31B Q4_K_XL single-B70 tg128 result was 20.79 ± 0.03 tokens/s; the two-B70 layer-split tg128 result was 21.53 ± 0.00 tokens/s.

In the May llama.cpp SYCL and Vulkan tables, pp is prompt-processing throughput and tg is generation throughput, both in tokens/s. Tensor split is the ratio used to divide the model across GPUs in the multi-GPU rows. The April rows are older generation-script baselines.

llama.cpp SYCL, May 2026

Gemma 4 31B Q4_K_XL, llama-bench

Run	Model	Size	Params	Backend	ngl	Threads	n_batch	fa	Tensor split	Test	tok/s
1× B70 GPU.1 / card2	gemma4 31B Q4_K - Medium	17.52 GiB	30.70 B	SYCL	99	8	1024			pp512	270.56 ± 0.77
1× B70 GPU.1 / card2	gemma4 31B Q4_K - Medium	17.52 GiB	30.70 B	SYCL	99	8	1024			pp2048	263.34 ± 1.16
1× B70 GPU.1 / card2	gemma4 31B Q4_K - Medium	17.52 GiB	30.70 B	SYCL	99	8	1024			pp4096	259.68 ± 0.20
1× B70 GPU.1 / card2	gemma4 31B Q4_K - Medium	17.52 GiB	30.70 B	SYCL	99	8	1024			pp8192	256.63 ± 0.03
1× B70 GPU.1 / card2	gemma4 31B Q4_K - Medium	17.52 GiB	30.70 B	SYCL	99	8	1024			tg128	20.79 ± 0.03
2× B70 layer split	gemma4 31B Q4_K - Medium	17.52 GiB	30.70 B	SYCL	99	8	1024	1	1.00/1.00	pp512	264.24 ± 0.00
2× B70 layer split	gemma4 31B Q4_K - Medium	17.52 GiB	30.70 B	SYCL	99	8	1024	1	1.00/1.00	pp2048	360.08 ± 0.00
2× B70 layer split	gemma4 31B Q4_K - Medium	17.52 GiB	30.70 B	SYCL	99	8	1024	1	1.00/1.00	pp4096	372.45 ± 0.00
2× B70 layer split	gemma4 31B Q4_K - Medium	17.52 GiB	30.70 B	SYCL	99	8	1024	1	1.00/1.00	pp8192	357.94 ± 0.00
2× B70 layer split	gemma4 31B Q4_K - Medium	17.52 GiB	30.70 B	SYCL	99	8	1024	1	1.00/1.00	tg128	21.53 ± 0.00

Qwen 3.6 35B A3B Q4_K_XL, llama-bench

Run	Model	Size	Params	Backend	ngl	Threads	n_batch	fa	Tensor split	Test	tok/s
1× B70 GPU.1 / card2	qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	SYCL	99	8	1024	1		pp512	408.47 ± 3.06
1× B70 GPU.1 / card2	qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	SYCL	99	8	1024	1		pp2048	400.33 ± 0.98
1× B70 GPU.1 / card2	qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	SYCL	99	8	1024	1		pp4096	390.50 ± 2.03
1× B70 GPU.1 / card2	qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	SYCL	99	8	1024	1		pp8192	382.71 ± 4.60
1× B70 GPU.1 / card2	qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	SYCL	99	8	1024	1		tg128	42.18 ± 0.15
2× B70 layer split	qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	SYCL	99	8	1024	1	1.00/1.00	pp512	557.23 ± 0.00
2× B70 layer split	qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	SYCL	99	8	1024	1	1.00/1.00	pp2048	551.86 ± 0.00
2× B70 layer split	qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	SYCL	99	8	1024	1	1.00/1.00	pp4096	540.30 ± 0.00
2× B70 layer split	qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	SYCL	99	8	1024	1	1.00/1.00	pp8192	526.44 ± 0.00
2× B70 layer split	qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	SYCL	99	8	1024	1	1.00/1.00	tg128	39.02 ± 0.00

Gemma 4 26B A4B Q4_K_XL, llama-bench

Run	Model	Size	Params	Backend	ngl	Threads	n_batch	fa	Test	tok/s
1× B70 GPU.1 / card2	gemma4 26B.A4B Q4_K - Medium	15.83 GiB	25.23 B	SYCL	99	8	1024	1	pp512	843.10 ± 11.15
1× B70 GPU.1 / card2	gemma4 26B.A4B Q4_K - Medium	15.83 GiB	25.23 B	SYCL	99	8	1024	1	pp2048	781.05 ± 3.66
1× B70 GPU.1 / card2	gemma4 26B.A4B Q4_K - Medium	15.83 GiB	25.23 B	SYCL	99	8	1024	1	pp4096	748.17 ± 2.38
1× B70 GPU.1 / card2	gemma4 26B.A4B Q4_K - Medium	15.83 GiB	25.23 B	SYCL	99	8	1024	1	pp8192	696.51 ± 0.90
1× B70 GPU.1 / card2	gemma4 26B.A4B Q4_K - Medium	15.83 GiB	25.23 B	SYCL	99	8	1024	1	tg128	55.77 ± 0.51
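The single-GPU and layer-split rows in the Gemma 4 31B and Qwen 3.6 tables above can be compared with simple ratios. The sketch below copies the tok/s means from those tables. The `recorded` dict and `scaling` helper are illustrative names, and the comparison is not strictly like-for-like because the single-GPU Gemma rows have no recorded fa value.

```python
# tok/s means copied from the llama.cpp SYCL tables above:
# (1x B70, 2x B70 layer split)
recorded = {
    ("Gemma 4 31B", "pp2048"): (263.34, 360.08),
    ("Gemma 4 31B", "tg128"): (20.79, 21.53),
    ("Qwen 3.6 35B A3B", "pp2048"): (400.33, 551.86),
    ("Qwen 3.6 35B A3B", "tg128"): (42.18, 39.02),
}

def scaling(single, dual):
    """Ratio of dual-GPU to single-GPU throughput (1.0 means no change)."""
    return dual / single

for (model, test), (single, dual) in recorded.items():
    print(f"{model:18s} {test:7s} 2x/1x = {scaling(single, dual):.2f}")
```

On these rows, pp2048 gains a little under 1.4× from the second card, while tg128 is roughly 4% higher for Gemma 4 31B and roughly 7 to 8% lower for Qwen 3.6 35B A3B.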

llama.cpp Vulkan, May 2026

Vulkan also ran the tested GGUF models. The Vulkan rows below are from llama-bench, so they include prompt-processing rows (pp) and generation rows (tg) directly.

Gemma 4 31B Q4_K_XL, llama-bench

Run	Model	Size	Params	Backend	ngl	Threads	n_batch	Test	tok/s
1× B70 GPU.1 / card2	gemma4 31B Q4_K - Medium	17.52 GiB	30.70 B	Vulkan	99	8	1024	pp512	501.89 ± 0.11
1× B70 GPU.1 / card2	gemma4 31B Q4_K - Medium	17.52 GiB	30.70 B	Vulkan	99	8	1024	pp2048	477.94 ± 0.26
1× B70 GPU.1 / card2	gemma4 31B Q4_K - Medium	17.52 GiB	30.70 B	Vulkan	99	8	1024	pp4096	464.74 ± 0.66
1× B70 GPU.1 / card2	gemma4 31B Q4_K - Medium	17.52 GiB	30.70 B	Vulkan	99	8	1024	pp8192	447.73 ± 0.15
1× B70 GPU.1 / card2	gemma4 31B Q4_K - Medium	17.52 GiB	30.70 B	Vulkan	99	8	1024	tg128	15.35 ± 0.01

Qwen 3.6 35B A3B Q4_K_XL, llama-bench

Run	Model	Size	Params	Backend	ngl	Threads	n_batch	Test	tok/s
1× B70 GPU.1 / card2	qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	Vulkan	99	8	1024	pp512	1342.28 ± 6.58
1× B70 GPU.1 / card2	qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	Vulkan	99	8	1024	pp2048	1326.57 ± 2.77
1× B70 GPU.1 / card2	qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	Vulkan	99	8	1024	pp4096	1295.72 ± 6.24
1× B70 GPU.1 / card2	qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	Vulkan	99	8	1024	pp8192	1243.41 ± 1.43
1× B70 GPU.1 / card2	qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	Vulkan	99	8	1024	tg128	37.47 ± 0.03

Gemma 4 26B A4B Q4_K_XL, llama-bench

Run	Model	Size	Params	Backend	ngl	Threads	n_batch	Test	tok/s
1× B70 GPU.1 / card2	gemma4 26B.A4B Q4_K - Medium	15.83 GiB	25.23 B	Vulkan	99	8	1024	pp512	1702.59 ± 9.28
1× B70 GPU.1 / card2	gemma4 26B.A4B Q4_K - Medium	15.83 GiB	25.23 B	Vulkan	99	8	1024	pp2048	1642.82 ± 9.39
1× B70 GPU.1 / card2	gemma4 26B.A4B Q4_K - Medium	15.83 GiB	25.23 B	Vulkan	99	8	1024	pp4096	1605.83 ± 6.59
1× B70 GPU.1 / card2	gemma4 26B.A4B Q4_K - Medium	15.83 GiB	25.23 B	Vulkan	99	8	1024	pp8192	1556.38 ± 5.69
1× B70 GPU.1 / card2	gemma4 26B.A4B Q4_K - Medium	15.83 GiB	25.23 B	Vulkan	99	8	1024	tg128	41.25 ± 0.03
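To see how the two backends relate on a single B70, the pp512 and tg128 means from the SYCL and Vulkan tables can be put side by side. This is a minimal sketch over the recorded numbers; `rows` and `faster_backend` are illustrative names. The SYCL and Vulkan images were built on different days and the Vulkan tables record no fa column, so treat the ratios as indicative only.

```python
# Single-B70 tok/s means copied from the SYCL and Vulkan tables above.
rows = [
    # (model, test, sycl_tok_s, vulkan_tok_s)
    ("Gemma 4 31B", "pp512", 270.56, 501.89),
    ("Gemma 4 31B", "tg128", 20.79, 15.35),
    ("Qwen 3.6 35B A3B", "pp512", 408.47, 1342.28),
    ("Qwen 3.6 35B A3B", "tg128", 42.18, 37.47),
    ("Gemma 4 26B A4B", "pp512", 843.10, 1702.59),
    ("Gemma 4 26B A4B", "tg128", 55.77, 41.25),
]

def faster_backend(sycl, vulkan):
    """Return the faster backend name and its speed ratio over the other."""
    if sycl >= vulkan:
        return "SYCL", sycl / vulkan
    return "Vulkan", vulkan / sycl

results = [(m, t, *faster_backend(s, v)) for m, t, s, v in rows]
for model, test, backend, ratio in results:
    print(f"{model:18s} {test:6s} {backend:6s} {ratio:.2f}x")
```

In the recorded rows, Vulkan leads every pp512 comparison and SYCL leads every tg128 comparison, so the better backend depends on whether a workload is prompt-heavy or generation-heavy.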

llama.cpp SYCL, April 2026 baselines

Model	Backend	GPU(s)	ngl	Threads	Batch	UBatch	Test	tok/s	Notes
Gemma 4 E4B Q5_K_M	SYCL	1× B70 GPU.1	99	8	4096	4096	pp	426.36	Derived from 165 total prompt tokens over 3 runs and the recorded average prompt-processing field.
Gemma 4 E4B Q5_K_M	SYCL	1× B70 GPU.1	99	8	4096	4096	tg	38.40	1546.7 completion tokens; 2755.7 final-output chars; 68.24 chars/s.
Qwen 3.5 9B Q4_K_M GGUF	SYCL	1× B70 GPU.1	99	8	4096	4096	pp	124.35	Derived from 144 total prompt tokens over 3 runs and the recorded average prompt-processing field.
Qwen 3.5 9B Q4_K_M GGUF	SYCL	1× B70 GPU.1	99	8	4096	4096	tg	52.11	4781.7 completion tokens; 1677.7 final-output chars; 19.46 chars/s.

llama.cpp SYCL llama-bench command used for the Gemma 4 31B single-GPU path

docker run --rm \
 --name llamacpp-bench \
 --net=none \
 --device=/dev/dri/card2 \
 --device=/dev/dri/renderD129 \
 --group-add="$(stat -c '%g' /dev/dri/renderD129)" \
 -u "$(id -u):$(id -g)" \
 --cap-add=IPC_LOCK \
 -e ONEAPI_DEVICE_SELECTOR="level_zero:0" \
 -e ZES_ENABLE_SYSMAN=1 \
 -e GGML_NO_MMAP=1 \
 -e LLAMA_ARG_FLASH_ATTN=on \
 -v "$MODEL_DIR":/models \
 --entrypoint /app/llama-bench \
 local/llama.cpp:full-intel-sycl-15-05-2026 \
 -m /models/gemma-4-31B-it-UD-Q4_K_XL.gguf \
 -ngl 99 \
 -t 8 \
 -p 512,2048,4096,8192 \
 -n 128 \
 -b 1024 \
 -ub 512 \
 -r 3 \
 -o md
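The command above asks llama-bench for markdown output (`-o md`). A small parser can turn such a table into numbers for later comparison. The sketch below assumes the table has a `test` column and a `t/s` column with values written as `mean ± stddev`. Those column names, the `parse_bench_table` helper, and the shortened sample table are assumptions; adjust them to the actual output.

```python
import re

def parse_bench_table(md_text, test_col="test", rate_col="t/s"):
    """Parse a llama-bench style markdown table into {test: (mean, stddev)}."""
    results = {}
    header = None
    for line in md_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip anything that is not a table row
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first table row holds the column names
            continue
        if all(re.fullmatch(r":?-+:?", c) for c in cells):
            continue  # markdown separator row
        row = dict(zip(header, cells))
        m = re.fullmatch(r"([\d.]+)\s*±\s*([\d.]+)", row[rate_col])
        if m:
            results[row[test_col]] = (float(m.group(1)), float(m.group(2)))
    return results

# Shortened sample built from two recorded rows; the real output has more columns.
SAMPLE = """
| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --: |
| gemma4 31B Q4_K - Medium | 17.52 GiB | 30.70 B | SYCL | 99 | pp512 | 270.56 ± 0.77 |
| gemma4 31B Q4_K - Medium | 17.52 GiB | 30.70 B | SYCL | 99 | tg128 | 20.79 ± 0.03 |
"""
parsed = parse_bench_table(SAMPLE)
print(parsed)
```

If the rate column has a different name in a given llama-bench build, the lookup raises a KeyError, which is a useful signal that the assumed header needs changing.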

April vs May comparison

The April entries above remain generation-script baselines, while the revised May SYCL entries are llama-bench results for Gemma 4 31B, Qwen 3.6 35B A3B, and Gemma 4 26B A4B. The two sets use different models and different measurement methods, so this section no longer presents the April-vs-May percentage comparisons that were based on the earlier generation-script May data.

7. vLLM results

vLLM was the main OpenAI-compatible server path that returned usable results in the notes.

The working Gemma 4 31B INT4 AutoRound runs on 2×B70 were around 14.2–14.3 tokens/s. The revised llama.cpp SYCL Gemma 4 31B GGUF llama-bench result was 20.79 ± 0.03 tokens/s on one B70 and 21.53 ± 0.00 tokens/s with two-B70 layer split. This is not a strict model-for-model comparison because the model formats differed, but it is the practical speed relationship recorded in these tests.

vLLM runs that worked

Model	Hardware	Status	Generation tok/s	Notes
Gemma 4 31B INT4 AutoRound	2× B70 TP=2	Worked	14.21	315.0 completion tokens; 80.7 final-output chars; 3.58 chars/s.
Gemma 4 31B INT4 AutoRound, language-only	2× B70 TP=2	Worked	14.33	439.3 completion tokens; 108.7 final-output chars; 3.66 chars/s.
Gemma 4 31B AWQ INT4, language-only	2× B70 TP=2	Worked	13.11	323.7 completion tokens; 97.0 final-output chars; 4.02 chars/s.
Qwen 3.5 27B AWQ INT4	2× B70 TP=2	Worked	15.65	2005.3 completion tokens; reasoning/parser output worked; 104.0 final-output chars; 0.82 chars/s.
Qwen 3.5 27B AWQ INT4, XPU graph env	2× B70 TP=2	Worked	15.64	4070.3 completion tokens; XPU graph capture was disabled for communication operations; 128.3 final-output chars; 0.51 chars/s.

Gemma 4 31B has 32 attention heads, so two-way tensor parallelism was valid, while three-way tensor parallelism failed before inference because 32 is not divisible by 3. Many model architectures use a head count that is divisible by 2 but not by 3, so this is worth checking before building a 3-GPU setup. Pipeline parallelism is an alternative, but it is often slower than tensor parallelism.
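The divisibility rule can be checked before buying or wiring up GPUs. This is a minimal sketch that only tests the attention-head count. `valid_tp_sizes` is a hypothetical helper, and real serving engines may impose further constraints that it does not model.

```python
def valid_tp_sizes(num_attention_heads, gpu_count):
    """Tensor-parallel sizes up to gpu_count that divide the head count evenly."""
    return [tp for tp in range(1, gpu_count + 1)
            if num_attention_heads % tp == 0]

# Head count for Gemma 4 31B as stated in the text above.
print(valid_tp_sizes(32, 3))
print(valid_tp_sizes(32, 4))
```

With 32 heads and three cards, the only valid sizes are 1 and 2, so the third GPU cannot join a tensor-parallel group. With four cards, size 4 becomes available.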

vLLM runs that did not load or did not reach inference

Model	Hardware	Status	Result / blocker
Gemma 4 31B AWQ INT8	2× B70 TP=2	No kernel	compressed-tensors uint8b128/W8A16 path not supported by tested XPU kernels.

vLLM Intel XPU command used for the Gemma 4 AutoRound path

set -euo pipefail

MODEL_NAME="google-gemma4-31b-it-int4-Intel-Autoround"

docker rm -f vllm >/dev/null 2>&1 || true

docker run -d --restart=always \
  --name vllm \
  --net=bridge \
  -p 8000:8000 \
  --group-add=video \
  --ipc=host \
  --privileged \
  --device /dev/dri:/dev/dri \
  -v /dev/dri/by-path:/dev/dri/by-path \
  -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
  -e VLLM_XPU_ENABLE_XPU_GRAPH=1 \
  -v /home/ejer/llm/local_models/google-gemma4-31b-it-int4-Intel-Autoround:/app/model:ro \
  --entrypoint /bin/bash \
  vllm-intel-12-05-2026 \
  -lc "source /opt/intel/oneapi/setvars.sh --force && \
       vllm serve /app/model \
         --host 0.0.0.0 --port 8000 \
         --served-model-name ${MODEL_NAME} \
         --enable-chunked-prefill \
         --tensor-parallel-size 2 \
         --reasoning-parser gemma4 \
         --language-model-only \
         --max-model-len 4096 \
         --gpu-memory-utilization 0.9 \
         --dtype bfloat16 \
         --default-chat-template-kwargs '{\"enable_thinking\": true}' \
         --trust-remote-code"

8. OpenVINO Model Server

The tested OpenVINO Model Server path did not serve the Gemma 4 OpenVINO model.

OpenVINO Model Server got far enough to create the repository and graph, then failed during LLM node initialization with Unsupported 'gemma4' VLM model type. This result applies to the recorded Gemma 4 31B INT4 OpenVINO setup.

This does not rule out OpenVINO Model Server for other model families. For now, though, running Qwen 3.5/3.6 models requires installing transformers==5.5.0.

Model	Hardware	Status	Result / blocker
OpenVINO/gemma-4-31B-it-int4-ov	2× B70, target_device GPU	Did not start	Unsupported 'gemma4' VLM model type during LLM node initialization.

OpenVINO Model Server command

MODEL_ID="OpenVINO/gemma-4-31B-it-int4-ov"
MODEL_NAME="gemma-4-31B-it-int4-ov"
OVMS_REPO="/home/ejer/llm/ovms_models"

mkdir -p "$OVMS_REPO"
docker rm -f openvino 2>/dev/null || true

docker run --rm -it \
  --name openvino \
  --net=bridge \
  -p 8000:8000 \
  --user "$(id -u):$(id -g)" \
  --device /dev/dri \
  --group-add="$(stat -c '%g' /dev/dri/render* | head -n 1)" \
  -v "$OVMS_REPO:/models:rw" \
  openvino/model_server:weekly \
  --model_repository_path /models \
  --source_model "$MODEL_ID" \
  --model_name "$MODEL_NAME" \
  --rest_port 8000 \
  --target_device GPU \
  --task text_generation \
  --pipeline_type VLM_CB \
  --log_level INFO

9. OpenArc

The tested OpenArc path did not complete Gemma 4 inference.

The final useful OpenArc attempt included transformers==5.5.0, which got past the earlier runtime/tokenizer issue. The remaining blocker was an inference failure around token_type_ids: the expected input port was not found.

There are indications that Qwen 3.5/3.6 may work with OpenArc when using Transformers 5.5, but that was not tested here and is not counted as a result. Since the focus here was on Gemma 4, testing OpenArc with other models may be worthwhile.

Model	Hardware	Status	Result / blocker
gemma-4-31B-it-int4-ov OpenVINO IR	2× B70 via /dev/dri	Inference failed	Transformers 5.5 advanced startup, but inference failed with token_type_ids port mismatch.

OpenArc command

OPENARC_SRC="/home/ejer/OpenArc"
OPENARC_GEMMA4_IMAGE="openarc-gemma4:latest"
MODEL_DIR_HOST="/home/ejer/llm/local_models/google-gemma-4-31b-it-int4-openvino"
MODEL_NAME="gemma-4-31B-it-int4-ov"

# Runtime image included: RUN uv pip install -U "transformers==5.5.0"

docker run -d \
  --name openarc \
  --restart unless-stopped \
  --net=bridge \
  -p 127.0.0.1:8000:8000 \
  --device /dev/dri:/dev/dri \
  --group-add="$(stat -c '%g' /dev/dri/render* | head -n 1)" \
  -e OPENARC_API_KEY_REQUIRED=false \
  -e NEOReadDebugKeys=1 \
  -e OverrideGpuAddressSpace=48 \
  -e EnableImplicitScaling=1 \
  -v "$MODEL_DIR_HOST:/models/gemma4:ro" \
  --entrypoint /bin/bash \
  "$OPENARC_GEMMA4_IMAGE" \
  -lc "openarc add \
         --model-name '$MODEL_NAME' \
         --model-path /models/gemma4 \
         --engine ovgenai \
         --model-type vlm \
         --device GPU \
         --vlm-type gemma4 || true; \
       openarc serve start \
         --host 0.0.0.0 \
         --port 8000 \
         --load-models '$MODEL_NAME'"

10. LLM Scaler

The tested LLM Scaler image did not load the recorded Gemma 4 AutoRound or AWQ variants.

The AutoRound route needed --allow-deprecated-quantization to pass the initial vLLM check, then failed around model.vision_tower.std_bias. The AWQ route failed because the tested image could not find a WNA16 linear-layer kernel for the model.

Model	Hardware	Status	Result / blocker
Gemma 4 31B INT4 AutoRound	2× B70 TP=2	Did not load	AutoRound required --allow-deprecated-quantization, then failed on model.vision_tower.std_bias in TransformersMultiModalForCausalLM.
Gemma 4 31B AWQ INT4	2× B70 TP=2	Did not load	Failed to find an XPU WNA16 linear kernel for the AWQ/W4A16 path.

LLM Scaler command used for the Gemma 4 AutoRound path

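# Same served-model name as in the vLLM command; ${MODEL_NAME} is expanded by the host shell.
MODEL_NAME="google-gemma4-31b-it-int4-Intel-Autoround"

docker run -d --restart=always \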
 --name llm-scaler-vllm \
 --net=bridge \
 -p 8000:8000 \
 --group-add=video \
 --ipc=host \
 --privileged \
 --device /dev/dri:/dev/dri \
 -v /dev/dri/by-path:/dev/dri/by-path \
 -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
 -e VLLM_XPU_ENABLE_XPU_GRAPH=1 \
 -v /home/ejer/llm/local_models/google-gemma4-31b-it-int4-Intel-Autoround:/app/model:ro \
 --entrypoint /bin/bash \
 vllm-intel-llm-scaler-13-05-2026 \
 -lc "source /opt/intel/oneapi/setvars.sh --force && \
 vllm serve /app/model \
 --host 0.0.0.0 --port 8000 \
 --served-model-name ${MODEL_NAME} \
 --enable-chunked-prefill \
 --tensor-parallel-size 2 \
 --reasoning-parser gemma4 \
 --max-model-len 4096 \
 --gpu-memory-utilization 0.9 \
 --dtype bfloat16 \
 --default-chat-template-kwargs '{\"enable_thinking\": true}' \
 --allow-deprecated-quantization \
 --trust-remote-code"

11. How to reproduce the tests

Reproduce one path at a time. Do not change model format, backend, GPU count, and container image together; the notes show that any one of those can decide the outcome.

Use a Linux host where the B70 cards are visible under /dev/dri.
Confirm render nodes with ls -l /dev/dri.
Download the same model variant for the route being tested. Do not substitute BF16, AWQ, AutoRound, OpenVINO IR, and GGUF as if they are equivalent.
Use the same Docker images named in the notes where possible: vllm-intel-12-05-2026, vllm-intel-llm-scaler-13-05-2026, local/llama.cpp:full-intel-sycl-15-05-2026, local/llama.cpp:full-vulkan-16-05-2026, and openarc-gemma4:latest.
Run one framework at a time on port 8000.
Use the same sampling parameters: Gemma tests used temperature 1.0, top_k 64, top_p 0.95; Qwen vLLM tests used temperature 0.6, top_k 20, top_p 0.95, presence_penalty 1.5, repetition_penalty 1.0.
Capture client-side wall time and server logs.
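The two sampling-parameter sets listed above can be kept in one place so that each request uses the values recorded for its model family. The sketch below is illustrative: `SAMPLING` and `build_payload` are hypothetical names, and the 2048-token limit is an assumed default. `top_k` and `repetition_penalty` are not part of the standard OpenAI chat schema, so a server that does not support them may ignore or reject them.

```python
# Sampling parameters as recorded in the reproduction steps above.
SAMPLING = {
    "gemma": {"temperature": 1.0, "top_k": 64, "top_p": 0.95},
    "qwen": {
        "temperature": 0.6,
        "top_k": 20,
        "top_p": 0.95,
        "presence_penalty": 1.5,
        "repetition_penalty": 1.0,
    },
}

def build_payload(model_name, family, prompt, max_completion_tokens=2048):
    """Build a chat-completions request body for the given model family."""
    payload = {
        "model": model_name,
        "messages": [{"role": "user", "content": prompt}],
        "max_completion_tokens": max_completion_tokens,
    }
    payload.update(SAMPLING[family])  # KeyError for an unknown family
    return payload

example = build_payload("replace-with-served-model-name", "qwen", "Hello!")
print(example)
```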

Simple compatible measurement helper

This reproduces the style of the measurements, but it is not claimed to be the original harness.

#!/usr/bin/env python3
"""Send the same chat request three times and record timing and usage."""
import json
import time

import requests

URL = "http://127.0.0.1:8000/v1/chat/completions"
MODEL = "replace-with-served-model-name"
PROMPT = "Hello! Give me a one-sentence fun fact about Denmark."

# Gemma sampling parameters from the list above. top_k is a server-side
# extension to the OpenAI chat schema.
payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": PROMPT}],
    "max_completion_tokens": 2048,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
}

runs = []
for _ in range(3):
    t0 = time.perf_counter()
    r = requests.post(URL, json=payload, timeout=600)
    elapsed = time.perf_counter() - t0
    r.raise_for_status()
    data = r.json()
    # content can be null when a server returns only reasoning output
    text = data["choices"][0]["message"].get("content") or ""
    usage = data.get("usage") or {}
    tokens = usage.get("completion_tokens")
    runs.append({
        "elapsed_s": elapsed,
        "completion_tokens": tokens,
        "prompt_tokens": usage.get("prompt_tokens"),
        # Wall-clock rate: includes prompt processing and HTTP overhead
        "tok_per_s_wall": tokens / elapsed if tokens else None,
        "chars": len(text),
        "text": text,
    })

print(json.dumps(runs, indent=2))

12. The good, the bad, and the gotchas

What worked in the recorded tests

llama.cpp SYCL single-GPU inference recorded 20.79 ± 0.03 tokens/s for Gemma 4 31B Q4_K_XL tg128 on one B70, and 55.77 ± 0.51 tokens/s for Gemma 4 26B A4B Q4_K_XL tg128 on one B70.
vLLM Intel XPU with Gemma 4 INT4 AutoRound worked as an OpenAI-compatible 2×B70 server path at about 14.2–14.3 tokens/s.
Qwen 3.5 27B AWQ INT4 under vLLM worked and produced parsed reasoning/final output.
The May llama.cpp SYCL benchmark rows now use llama-bench pp/tg results. The April rows are retained as earlier generation-script baselines.

What did not work in the recorded tests

OpenVINO Model Server failed on the tested Gemma 4 VLM path.
OpenArc still failed at Gemma 4 inference after the resolved Transformers/runtime issue.
LLM Scaler did not load the tested Gemma 4 AutoRound or AWQ variants.
vLLM BF16 Gemma 4 31B did not fit the tested 2×B70 4096-token configuration, and TP=3 is invalid for this model.
vLLM XPU graph did not speed up the two-GPU run because graph capture was disabled for communication operations.
llama.cpp SYCL dual-GPU layer split is represented here by llama-bench throughput rows. Those rows measure prompt and generation throughput, not output quality.
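A rough weight-size estimate shows why BF16 Gemma 4 31B is tight on two cards. The sketch below takes the 30.70 B parameter count from the llama-bench tables, 2 bytes per BF16 parameter, 32 GB per card, and the 0.9 memory-utilization setting from the vLLM command. It assumes decimal gigabytes and ignores KV cache, activations, and runtime overhead. `weight_headroom_gb` is a hypothetical helper, not a measurement from the tests.

```python
def weight_headroom_gb(params_billion, bytes_per_param, gpu_count,
                       gpu_mem_gb, utilization):
    """Return (weights_gb, budget_gb, headroom_gb) for a weights-only estimate."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9
    budget_gb = gpu_count * gpu_mem_gb * utilization
    return weights_gb, budget_gb, budget_gb - weights_gb

# BF16 weights on 2x B70 with gpu-memory-utilization 0.9
weights, budget, headroom = weight_headroom_gb(30.70, 2, 2, 32, 0.9)
print(f"weights {weights:.1f} GB, budget {budget:.1f} GB, headroom {headroom:.1f} GB")
```

Under these assumptions the weights alone come to about 61.4 GB against a 57.6 GB budget. Reading the 32 GB per card as GiB raises the budget to roughly 61.8 GB, which still leaves under half a gigabyte for everything else.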

13. Conclusion

In these tests, the fastest way to run LLMs on Intel's B70 was llama.cpp, followed by vLLM Intel XPU.

The most direct comparison to keep in mind is that the recorded vLLM Gemma 4 31B INT4 AutoRound runs on 2×B70 were around 14.2–14.3 tokens/s, while the revised llama.cpp SYCL Gemma 4 31B GGUF tg128 result was 20.79 ± 0.03 tokens/s on one B70 and 21.53 ± 0.00 tokens/s with two-B70 layer split.

The OpenVINO Model Server, OpenArc, and LLM Scaler sections should be read narrowly: they describe the tested Gemma 4 configurations, not every possible B70 model/runtime combination. Since Gemma 4 was selected because it was used for personal projects, further tests with Qwen and other supported models could change the picture for OpenArc and OpenVINO Model Server.

Across the whole set, more GPUs did not automatically produce a better result. Tensor-parallel divisibility, available memory, XPU kernel support, graph-capture behavior, and backend-specific multi-GPU behavior all mattered. Future releases of the various frameworks will hopefully improve this.
