Field notes from the current B70 inference landscape

The Intel Arc Pro B70 Local LLM Landscape: What Works, What Breaks, and What I’d Run Today

A practical guide to the current B70 software landscape for local LLM serving: what to run first, what is slower but useful, and which paths still hit blockers.

1. Why this B70 test matters

These notes look at what actually happened when several local LLM serving stacks were tried on Intel Arc Pro B70 hardware.

The tested stacks were llama.cpp, vLLM, OpenVINO Model Server, OpenArc, and Intel LLM Scaler. The report keeps the focus on observable outcomes: whether the service started, whether it returned usable text, which model format was used, and which timing numbers were recorded.

Gemma 4 appears heavily in the tests because it was the model family used for personal projects. That choice makes the results useful for this specific Gemma 4/B70 setup, but it also means the OpenVINO Model Server, OpenArc and LLM Scaler findings should not be read as final judgments about every other model family. Testing other supported models may be sensible.

2. What to expect before starting

The results vary by model format and runtime. GGUF models were tested through llama.cpp with both SYCL and Vulkan. For vLLM the XPU kernel backend was used. OpenVINO IR serving was tested through OpenVINO Model Server and OpenArc.

3. The B70 setup used here

The test machine exposed B70 GPUs through Linux DRM nodes. Intel’s public B70 specification lists 32 Xe-cores, 256 XMX engines, 32 GB graphics memory, 367 INT8 TOPS, PCIe 5.0 x16, and 230 W board power. The recorded results depend on the full software path as well as the hardware.
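
Before starting any container, it can help to confirm which DRM nodes the host exposes and which group owns them, since the Docker commands later in these notes pass render-node group IDs with --group-add. The following is a minimal sketch, not part of the original notes; the function name list_drm_nodes and its dri_dir parameter are hypothetical, and it reads the same group ID that stat -c '%g' reports.

```python
#!/usr/bin/env python3
"""List DRM card/render nodes and their owning group IDs (Linux only)."""
import os
from pathlib import Path


def list_drm_nodes(dri_dir="/dev/dri"):
    """Return (path, gid) pairs for card* and renderD* entries under dri_dir."""
    root = Path(dri_dir)
    if not root.is_dir():
        return []  # no DRM devices visible, or not a Linux host
    nodes = []
    for entry in sorted(root.iterdir()):
        # Skip helper directories such as by-path; keep only device nodes.
        if entry.name.startswith(("card", "renderD")):
            nodes.append((str(entry), os.stat(entry).st_gid))
    return nodes


if __name__ == "__main__":
    for path, gid in list_drm_nodes():
        print(f"{path}  gid={gid}")
```

On a two-card host this should print one card* and one renderD* entry per GPU, plus any integrated GPU; the exact numbering (card2, renderD129 in these notes) depends on the machine.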

4. Models and tool paths used

The same model family behaved differently depending on format. BF16, OpenVINO INT4, AutoRound INT4, AWQ, and GGUF variants were not interchangeable in these tests.

| Model / artifact | Used in | Source | Notes from test |
| --- | --- | --- | --- |
| OpenVINO/gemma-4-31B-it-int4-ov | OpenVINO Model Server, OpenArc | Hugging Face OpenVINO model | OpenVINO IR existed, but the tested OVMS/OpenArc Gemma 4 serving paths did not complete inference. |
| Intel/gemma-4-31B-it-int4-AutoRound | vLLM Intel XPU, LLM Scaler | Hugging Face Intel AutoRound model | Worked in the tested vLLM Intel image; failed in tested LLM Scaler image. |
| cyankiwi/gemma-4-31B-it-AWQ-4bit | vLLM Intel XPU, LLM Scaler | Hugging Face AWQ 4-bit model | Worked in vLLM language-only mode, slightly slower than AutoRound in the recorded Gemma runs; failed in the tested LLM Scaler image. |
| cyankiwi/gemma-4-31B-it-AWQ-8bit | vLLM Intel XPU | Hugging Face AWQ 8-bit model | Failed because the tested XPU kernels did not support the detected uint8b128/W8A16 path. |
| cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4 | vLLM Intel XPU | Hugging Face model card | Worked with reasoning parsing in the tested vLLM image. |
| gemma-4-31B-it-UD-Q4_K_XL.gguf | llama.cpp SYCL/Vulkan | Local file in test notes | Used for Gemma 4 31B Q4_K_XL llama-bench runs. Exact download page was not recorded in the supplied notes. |
| gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf | llama.cpp SYCL/Vulkan | Local file in test notes | Used for Gemma 4 26B A4B Q4_K_XL llama-bench runs. Exact download page was not recorded in the supplied notes. |
| Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf | llama.cpp SYCL/Vulkan | Local file in test notes | Used for Qwen 3.6 35B A3B Q4_K_XL llama-bench runs. Exact download page was not recorded in the supplied notes. |
| gemma-4-E4B-it-Q5_K_M.gguf | llama.cpp SYCL April baseline | Local file in test notes | Used for the older April Gemma 4 E4B Q5_K_M generation-script baseline. Exact download page was not recorded in the supplied notes. |
| Qwen3.5-9B-Q4_K_M.gguf | llama.cpp SYCL April baseline | Local file in test notes | Used for the older April Qwen 3.5 9B Q4_K_M generation-script baseline. Exact download page was not recorded in the supplied notes. |

The tested tool paths were: llama.cpp SYCL/Vulkan for GGUF; vLLM Intel XPU for OpenAI-compatible serving; OpenVINO Model Server and OpenArc for OpenVINO IR serving; and Intel LLM Scaler as an Intel-oriented vLLM-based route.

5. How the runs were measured

The runs used Docker containers, port 8000, OpenAI-compatible chat requests where supported, and server logs for timing/failure details. llama.cpp used --server, --jinja, large context, GPU layer offload, and Docker log capture.

Metric fields
The result tables below focus on token throughput rather than elapsed prompt, generation, or request time. For llama.cpp, May benchmark rows use llama-bench pp/tg throughput where available; older April rows remain generation-script baselines. For vLLM, the tables report generation throughput; comparable average prompt throughput was not available from the recorded result summaries.

The original benchmark harness was not supplied, so this post keeps the recorded metrics instead of inferring hidden implementation details.

6. llama.cpp results

The largest set of successful runs in the notes came from llama.cpp with GGUF models.

The recorded llama.cpp tests used both Intel SYCL and Vulkan backends. The May SYCL numbers below are from llama-bench, so they report prompt-processing rows (pp) and generation rows (tg) directly. The revised Gemma 4 31B Q4_K_XL single-B70 tg128 result was 20.79 ± 0.03 tokens/s; the two-B70 layer-split tg128 result was 21.53 ± 0.00 tokens/s.

In the May llama.cpp SYCL and Vulkan tables, fa is llama-bench's flash-attention column and Tensor split is the split ratio used for the multi-GPU llama.cpp rows; cells that were empty in the recorded output are left empty. The April rows are older generation-script baselines.

llama.cpp SYCL, May 2026

Gemma 4 31B Q4_K_XL, llama-bench

| Run | Model | Size | Params | Backend | ngl | Threads | n_batch | fa | Tensor split | Test | tok/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1× B70 GPU.1 / card2 | gemma4 31B Q4_K - Medium | 17.52 GiB | 30.70 B | SYCL | 99 | 8 | 1024 |  |  | pp512 | 270.56 ± 0.77 |
| 1× B70 GPU.1 / card2 | gemma4 31B Q4_K - Medium | 17.52 GiB | 30.70 B | SYCL | 99 | 8 | 1024 |  |  | pp2048 | 263.34 ± 1.16 |
| 1× B70 GPU.1 / card2 | gemma4 31B Q4_K - Medium | 17.52 GiB | 30.70 B | SYCL | 99 | 8 | 1024 |  |  | pp4096 | 259.68 ± 0.20 |
| 1× B70 GPU.1 / card2 | gemma4 31B Q4_K - Medium | 17.52 GiB | 30.70 B | SYCL | 99 | 8 | 1024 |  |  | pp8192 | 256.63 ± 0.03 |
| 1× B70 GPU.1 / card2 | gemma4 31B Q4_K - Medium | 17.52 GiB | 30.70 B | SYCL | 99 | 8 | 1024 |  |  | tg128 | 20.79 ± 0.03 |
| 2× B70 layer split | gemma4 31B Q4_K - Medium | 17.52 GiB | 30.70 B | SYCL | 99 | 8 | 1024 | 1 | 1.00/1.00 | pp512 | 264.24 ± 0.00 |
| 2× B70 layer split | gemma4 31B Q4_K - Medium | 17.52 GiB | 30.70 B | SYCL | 99 | 8 | 1024 | 1 | 1.00/1.00 | pp2048 | 360.08 ± 0.00 |
| 2× B70 layer split | gemma4 31B Q4_K - Medium | 17.52 GiB | 30.70 B | SYCL | 99 | 8 | 1024 | 1 | 1.00/1.00 | pp4096 | 372.45 ± 0.00 |
| 2× B70 layer split | gemma4 31B Q4_K - Medium | 17.52 GiB | 30.70 B | SYCL | 99 | 8 | 1024 | 1 | 1.00/1.00 | pp8192 | 357.94 ± 0.00 |
| 2× B70 layer split | gemma4 31B Q4_K - Medium | 17.52 GiB | 30.70 B | SYCL | 99 | 8 | 1024 | 1 | 1.00/1.00 | tg128 | 21.53 ± 0.00 |

Qwen 3.6 35B A3B Q4_K_XL, llama-bench

| Run | Model | Size | Params | Backend | ngl | Threads | n_batch | fa | Tensor split | Test | tok/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1× B70 GPU.1 / card2 | qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 8 | 1024 | 1 |  | pp512 | 408.47 ± 3.06 |
| 1× B70 GPU.1 / card2 | qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 8 | 1024 | 1 |  | pp2048 | 400.33 ± 0.98 |
| 1× B70 GPU.1 / card2 | qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 8 | 1024 | 1 |  | pp4096 | 390.50 ± 2.03 |
| 1× B70 GPU.1 / card2 | qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 8 | 1024 | 1 |  | pp8192 | 382.71 ± 4.60 |
| 1× B70 GPU.1 / card2 | qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 8 | 1024 | 1 |  | tg128 | 42.18 ± 0.15 |
| 2× B70 layer split | qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 8 | 1024 | 1 | 1.00/1.00 | pp512 | 557.23 ± 0.00 |
| 2× B70 layer split | qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 8 | 1024 | 1 | 1.00/1.00 | pp2048 | 551.86 ± 0.00 |
| 2× B70 layer split | qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 8 | 1024 | 1 | 1.00/1.00 | pp4096 | 540.30 ± 0.00 |
| 2× B70 layer split | qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 8 | 1024 | 1 | 1.00/1.00 | pp8192 | 526.44 ± 0.00 |
| 2× B70 layer split | qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 8 | 1024 | 1 | 1.00/1.00 | tg128 | 39.02 ± 0.00 |

Gemma 4 26B A4B Q4_K_XL, llama-bench

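| Run | Model | Size | Params | Backend | ngl | Threads | n_batch | fa | Test | tok/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1× B70 GPU.1 / card2 | gemma4 26B.A4B Q4_K - Medium | 15.83 GiB | 25.23 B | SYCL | 99 | 8 | 1024 | 1 | pp512 | 843.10 ± 11.15 |
| 1× B70 GPU.1 / card2 | gemma4 26B.A4B Q4_K - Medium | 15.83 GiB | 25.23 B | SYCL | 99 | 8 | 1024 | 1 | pp2048 | 781.05 ± 3.66 |
| 1× B70 GPU.1 / card2 | gemma4 26B.A4B Q4_K - Medium | 15.83 GiB | 25.23 B | SYCL | 99 | 8 | 1024 | 1 | pp4096 | 748.17 ± 2.38 |
| 1× B70 GPU.1 / card2 | gemma4 26B.A4B Q4_K - Medium | 15.83 GiB | 25.23 B | SYCL | 99 | 8 | 1024 | 1 | pp8192 | 696.51 ± 0.90 |
| 1× B70 GPU.1 / card2 | gemma4 26B.A4B Q4_K - Medium | 15.83 GiB | 25.23 B | SYCL | 99 | 8 | 1024 | 1 | tg128 | 55.77 ± 0.51 |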
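
To see what the second card contributed in the layer-split rows, the SYCL tables above can be reduced to simple 2-GPU / 1-GPU ratios. This is a small sketch rather than part of the original measurements: the tok/s means are copied from the Gemma 4 31B and Qwen 3.6 35B A3B tables, the ± spreads are ignored, and the names SYCL_TOK_S and layer_split_scaling are hypothetical.

```python
# tok/s means copied from the SYCL llama-bench tables above:
# test -> (1x B70, 2x B70 layer split)
SYCL_TOK_S = {
    "Gemma 4 31B Q4_K_XL": {
        "pp512": (270.56, 264.24),
        "pp2048": (263.34, 360.08),
        "pp4096": (259.68, 372.45),
        "pp8192": (256.63, 357.94),
        "tg128": (20.79, 21.53),
    },
    "Qwen 3.6 35B A3B Q4_K_XL": {
        "pp512": (408.47, 557.23),
        "pp2048": (400.33, 551.86),
        "pp4096": (390.50, 540.30),
        "pp8192": (382.71, 526.44),
        "tg128": (42.18, 39.02),
    },
}


def layer_split_scaling(results):
    """Return 2-GPU / 1-GPU throughput ratios per model and test."""
    return {
        model: {test: two / one for test, (one, two) in tests.items()}
        for model, tests in results.items()
    }


if __name__ == "__main__":
    for model, ratios in layer_split_scaling(SYCL_TOK_S).items():
        print(model)
        for test, ratio in ratios.items():
            print(f"  {test:<7} {ratio:.2f}x")
```

With these numbers the tg128 ratio is about 1.04 for Gemma 4 31B and about 0.93 for Qwen 3.6 35B A3B, so the second card changed generation speed by well under 10% in either direction, while most prompt-processing rows gained roughly a third or more.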

llama.cpp Vulkan, May 2026

Vulkan also ran the tested GGUF models. The Vulkan rows below are from llama-bench, so they include prompt-processing rows (pp) and generation rows (tg) directly.

Gemma 4 31B Q4_K_XL, llama-bench

| Run | Model | Size | Params | Backend | ngl | Threads | n_batch | Test | tok/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1× B70 GPU.1 / card2 | gemma4 31B Q4_K - Medium | 17.52 GiB | 30.70 B | Vulkan | 99 | 8 | 1024 | pp512 | 501.89 ± 0.11 |
| 1× B70 GPU.1 / card2 | gemma4 31B Q4_K - Medium | 17.52 GiB | 30.70 B | Vulkan | 99 | 8 | 1024 | pp2048 | 477.94 ± 0.26 |
| 1× B70 GPU.1 / card2 | gemma4 31B Q4_K - Medium | 17.52 GiB | 30.70 B | Vulkan | 99 | 8 | 1024 | pp4096 | 464.74 ± 0.66 |
| 1× B70 GPU.1 / card2 | gemma4 31B Q4_K - Medium | 17.52 GiB | 30.70 B | Vulkan | 99 | 8 | 1024 | pp8192 | 447.73 ± 0.15 |
| 1× B70 GPU.1 / card2 | gemma4 31B Q4_K - Medium | 17.52 GiB | 30.70 B | Vulkan | 99 | 8 | 1024 | tg128 | 15.35 ± 0.01 |

Qwen 3.6 35B A3B Q4_K_XL, llama-bench

| Run | Model | Size | Params | Backend | ngl | Threads | n_batch | Test | tok/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1× B70 GPU.1 / card2 | qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | Vulkan | 99 | 8 | 1024 | pp512 | 1342.28 ± 6.58 |
| 1× B70 GPU.1 / card2 | qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | Vulkan | 99 | 8 | 1024 | pp2048 | 1326.57 ± 2.77 |
| 1× B70 GPU.1 / card2 | qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | Vulkan | 99 | 8 | 1024 | pp4096 | 1295.72 ± 6.24 |
| 1× B70 GPU.1 / card2 | qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | Vulkan | 99 | 8 | 1024 | pp8192 | 1243.41 ± 1.43 |
| 1× B70 GPU.1 / card2 | qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | Vulkan | 99 | 8 | 1024 | tg128 | 37.47 ± 0.03 |

Gemma 4 26B A4B Q4_K_XL, llama-bench

| Run | Model | Size | Params | Backend | ngl | Threads | n_batch | Test | tok/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1× B70 GPU.1 / card2 | gemma4 26B.A4B Q4_K - Medium | 15.83 GiB | 25.23 B | Vulkan | 99 | 8 | 1024 | pp512 | 1702.59 ± 9.28 |
| 1× B70 GPU.1 / card2 | gemma4 26B.A4B Q4_K - Medium | 15.83 GiB | 25.23 B | Vulkan | 99 | 8 | 1024 | pp2048 | 1642.82 ± 9.39 |
| 1× B70 GPU.1 / card2 | gemma4 26B.A4B Q4_K - Medium | 15.83 GiB | 25.23 B | Vulkan | 99 | 8 | 1024 | pp4096 | 1605.83 ± 6.59 |
| 1× B70 GPU.1 / card2 | gemma4 26B.A4B Q4_K - Medium | 15.83 GiB | 25.23 B | Vulkan | 99 | 8 | 1024 | pp8192 | 1556.38 ± 5.69 |
| 1× B70 GPU.1 / card2 | gemma4 26B.A4B Q4_K - Medium | 15.83 GiB | 25.23 B | Vulkan | 99 | 8 | 1024 | tg128 | 41.25 ± 0.03 |
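
Since the same three GGUF files were benchmarked on one B70 under both backends, the single-GPU rows can be compared directly. The sketch below uses only the pp512 and tg128 means copied from the May tables above; the names SINGLE_GPU_TOK_S and faster_backend are hypothetical, and the comparison ignores the ± spreads and any difference in the fa setting between runs.

```python
# Single-B70 tok/s means copied from the May llama-bench tables above.
SINGLE_GPU_TOK_S = {
    "Gemma 4 31B Q4_K_XL": {
        "pp512": {"SYCL": 270.56, "Vulkan": 501.89},
        "tg128": {"SYCL": 20.79, "Vulkan": 15.35},
    },
    "Qwen 3.6 35B A3B Q4_K_XL": {
        "pp512": {"SYCL": 408.47, "Vulkan": 1342.28},
        "tg128": {"SYCL": 42.18, "Vulkan": 37.47},
    },
    "Gemma 4 26B A4B Q4_K_XL": {
        "pp512": {"SYCL": 843.10, "Vulkan": 1702.59},
        "tg128": {"SYCL": 55.77, "Vulkan": 41.25},
    },
}


def faster_backend(results):
    """Map each (model, test) to the faster backend and its speed ratio."""
    out = {}
    for model, tests in results.items():
        for test, by_backend in tests.items():
            best = max(by_backend, key=by_backend.get)
            worst = min(by_backend, key=by_backend.get)
            out[(model, test)] = (best, by_backend[best] / by_backend[worst])
    return out


if __name__ == "__main__":
    for (model, test), (backend, ratio) in faster_backend(SINGLE_GPU_TOK_S).items():
        print(f"{model:<26} {test:<6} {backend:<7} {ratio:.2f}x")
```

In these recorded rows Vulkan is ahead on prompt processing and SYCL is ahead on generation for all three models, so which backend is preferable depends on whether a workload is dominated by long prompts or by long outputs.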

llama.cpp SYCL, April 2026 baselines

| Model | Backend | GPU(s) | ngl | Threads | Batch | UBatch | Test | tok/s | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma 4 E4B Q5_K_M | SYCL | 1× B70 GPU.1 | 99 | 8 | 4096 | 4096 | pp | 426.36 | Derived from 165 total prompt tokens over 3 runs and the recorded average prompt-processing field. |
| Gemma 4 E4B Q5_K_M | SYCL | 1× B70 GPU.1 | 99 | 8 | 4096 | 4096 | tg | 38.40 | 1546.7 completion tokens; 2755.7 final-output chars; 68.24 chars/s. |
| Qwen 3.5 9B Q4_K_M GGUF | SYCL | 1× B70 GPU.1 | 99 | 8 | 4096 | 4096 | pp | 124.35 | Derived from 144 total prompt tokens over 3 runs and the recorded average prompt-processing field. |
| Qwen 3.5 9B Q4_K_M GGUF | SYCL | 1× B70 GPU.1 | 99 | 8 | 4096 | 4096 | tg | 52.11 | 4781.7 completion tokens; 1677.7 final-output chars; 19.46 chars/s. |

llama.cpp SYCL llama-bench command used for the Gemma 4 31B single-GPU path
docker run --rm \
 --name llamacpp-bench \
 --net=none \
 --device=/dev/dri/card2 \
 --device=/dev/dri/renderD129 \
 --group-add="$(stat -c '%g' /dev/dri/renderD129)" \
 -u "$(id -u):$(id -g)" \
 --cap-add=IPC_LOCK \
 -e ONEAPI_DEVICE_SELECTOR="level_zero:0" \
 -e ZES_ENABLE_SYSMAN=1 \
 -e GGML_NO_MMAP=1 \
 -e LLAMA_ARG_FLASH_ATTN=on \
 -v "$MODEL_DIR":/models \
 --entrypoint /app/llama-bench \
 local/llama.cpp:full-intel-sycl-15-05-2026 \
 -m /models/gemma-4-31B-it-UD-Q4_K_XL.gguf \
 -ngl 99 \
 -t 8 \
 -p 512,2048,4096,8192 \
 -n 128 \
 -b 1024 \
 -ub 512 \
 -r 3 \
 -o md

April vs May comparison

The April entries above are generation-script baselines for Gemma 4 E4B and Qwen 3.5 9B. The May SYCL entries are llama-bench results for Gemma 4 31B, Qwen 3.6 35B A3B, and Gemma 4 26B A4B. Because both the measurement method and the models differ, no April-vs-May percentage comparison is given here.

7. vLLM results

vLLM was the main OpenAI-compatible server path that returned usable results in the notes.

The working Gemma 4 31B INT4 AutoRound runs on 2×B70 were around 14.2–14.3 tokens/s. The revised llama.cpp SYCL Gemma 4 31B GGUF llama-bench result was 20.79 ± 0.03 tokens/s on one B70 and 21.53 ± 0.00 tokens/s with two-B70 layer split. This is not a strict model-for-model comparison because the model formats differed, but it is the practical speed relationship recorded in these tests.

vLLM runs that worked

| Model | Hardware | Status | Generation tok/s | Notes |
| --- | --- | --- | --- | --- |
| Gemma 4 31B INT4 AutoRound | 2× B70 TP=2 | Worked | 14.21 | 315.0 completion tokens; 80.7 final-output chars; 3.58 chars/s. |
| Gemma 4 31B INT4 AutoRound, language-only | 2× B70 TP=2 | Worked | 14.33 | 439.3 completion tokens; 108.7 final-output chars; 3.66 chars/s. |
| Gemma 4 31B AWQ INT4, language-only | 2× B70 TP=2 | Worked | 13.11 | 323.7 completion tokens; 97.0 final-output chars; 4.02 chars/s. |
| Qwen 3.5 27B AWQ INT4 | 2× B70 TP=2 | Worked | 15.65 | 2005.3 completion tokens; reasoning/parser output worked; 104.0 final-output chars; 0.82 chars/s. |
| Qwen 3.5 27B AWQ INT4, XPU graph env | 2× B70 TP=2 | Worked | 15.64 | 4070.3 completion tokens; XPU graph capture was disabled for communication operations; 128.3 final-output chars; 0.51 chars/s. |

Gemma 4 31B has 32 attention heads, so two-way tensor parallelism was valid, while three-way tensor parallelism failed before inference because 32 is not divisible by 3. Many model architectures use an attention-head count that is divisible by 2 but not by 3, which causes problems on 3-GPU setups and should be checked beforehand. Pipeline parallelism is an alternative, but it is often slower than tensor parallelism.
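
The divisibility rule is easy to check before buying or assigning GPUs. The following is a minimal sketch of that check only; the function name valid_tp_sizes is hypothetical, the head count of 32 is the figure given above, and real serving frameworks may apply further constraints (for example on key/value heads) that are not modelled here.

```python
def valid_tp_sizes(num_attention_heads, gpu_count):
    """Tensor-parallel sizes up to gpu_count that divide the head count evenly."""
    return [
        tp for tp in range(1, gpu_count + 1)
        if num_attention_heads % tp == 0
    ]


if __name__ == "__main__":
    # 32 heads on a 3-GPU machine: only TP=1 and TP=2 pass the check.
    print(valid_tp_sizes(32, 3))  # [1, 2]
    # The same model on 4 GPUs could also use TP=4.
    print(valid_tp_sizes(32, 4))  # [1, 2, 4]
```

The model's attention-head count is normally listed in its config.json, so the check can be run before any container is started.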

vLLM runs that did not load or did not reach inference

| Model | Hardware | Status | Result / blocker |
| --- | --- | --- | --- |
| Gemma 4 31B AWQ INT8 | 2× B70 TP=2 | No kernel | compressed-tensors uint8b128/W8A16 path not supported by tested XPU kernels. |

vLLM Intel XPU command used for the Gemma 4 AutoRound path
set -euo pipefail

MODEL_NAME="google-gemma4-31b-it-int4-Intel-Autoround"

docker rm -f vllm >/dev/null 2>&1 || true

docker run -d --restart=always \
  --name vllm \
  --net=bridge \
  -p 8000:8000 \
  --group-add=video \
  --ipc=host \
  --privileged \
  --device /dev/dri:/dev/dri \
  -v /dev/dri/by-path:/dev/dri/by-path \
  -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
  -e VLLM_XPU_ENABLE_XPU_GRAPH=1 \
  -v /home/ejer/llm/local_models/google-gemma4-31b-it-int4-Intel-Autoround:/app/model:ro \
  --entrypoint /bin/bash \
  vllm-intel-12-05-2026 \
  -lc "source /opt/intel/oneapi/setvars.sh --force && \
       vllm serve /app/model \
         --host 0.0.0.0 --port 8000 \
         --served-model-name ${MODEL_NAME} \
         --enable-chunked-prefill \
         --tensor-parallel-size 2 \
         --reasoning-parser gemma4 \
         --language-model-only \
         --max-model-len 4096 \
         --gpu-memory-utilization 0.9 \
         --dtype bfloat16 \
         --default-chat-template-kwargs '{\"enable_thinking\": true}' \
         --trust-remote-code"

8. OpenVINO Model Server

The tested OpenVINO Model Server path did not serve the Gemma 4 OpenVINO model.

OpenVINO Model Server got far enough to create the repository and graph, then failed during LLM node initialization with Unsupported 'gemma4' VLM model type. This result applies to the recorded Gemma 4 31B INT4 OpenVINO setup.

This does not rule out OpenVINO Model Server for other model families. For now, however, running Qwen 3.5/3.6 models requires installing transformers==5.5.0.

| Model | Hardware | Status | Result / blocker |
| --- | --- | --- | --- |
| OpenVINO/gemma-4-31B-it-int4-ov | 2× B70, target_device GPU | Did not start | Unsupported 'gemma4' VLM model type during LLM node initialization. |

OpenVINO Model Server command
MODEL_ID="OpenVINO/gemma-4-31B-it-int4-ov"
MODEL_NAME="gemma-4-31B-it-int4-ov"
OVMS_REPO="/home/ejer/llm/ovms_models"

mkdir -p "$OVMS_REPO"
docker rm -f openvino 2>/dev/null || true

docker run --rm -it \
  --name openvino \
  --net=bridge \
  -p 8000:8000 \
  --user "$(id -u):$(id -g)" \
  --device /dev/dri \
  --group-add="$(stat -c '%g' /dev/dri/render* | head -n 1)" \
  -v "$OVMS_REPO:/models:rw" \
  openvino/model_server:weekly \
  --model_repository_path /models \
  --source_model "$MODEL_ID" \
  --model_name "$MODEL_NAME" \
  --rest_port 8000 \
  --target_device GPU \
  --task text_generation \
  --pipeline_type VLM_CB \
  --log_level INFO

9. OpenArc

The tested OpenArc path did not complete Gemma 4 inference.

The final useful OpenArc attempt included transformers==5.5.0, which got past the earlier runtime/tokenizer issue. The remaining blocker was an inference failure around token_type_ids: the expected input port was not found.

There are indications that Qwen 3.5/3.6 may work with OpenArc when using Transformers 5.5, but that was not tested here and is not counted as a result. Since the focus here was on Gemma 4, testing OpenArc with other models may be worthwhile.

| Model | Hardware | Status | Result / blocker |
| --- | --- | --- | --- |
| gemma-4-31B-it-int4-ov OpenVINO IR | 2× B70 via /dev/dri | Inference failed | Transformers 5.5 advanced startup, but inference failed with token_type_ids port mismatch. |

OpenArc command
OPENARC_SRC="/home/ejer/OpenArc"
OPENARC_GEMMA4_IMAGE="openarc-gemma4:latest"
MODEL_DIR_HOST="/home/ejer/llm/local_models/google-gemma-4-31b-it-int4-openvino"
MODEL_NAME="gemma-4-31B-it-int4-ov"

# Runtime image included: RUN uv pip install -U "transformers==5.5.0"

docker run -d \
  --name openarc \
  --restart unless-stopped \
  --net=bridge \
  -p 127.0.0.1:8000:8000 \
  --device /dev/dri:/dev/dri \
  --group-add="$(stat -c '%g' /dev/dri/render* | head -n 1)" \
  -e OPENARC_API_KEY_REQUIRED=false \
  -e NEOReadDebugKeys=1 \
  -e OverrideGpuAddressSpace=48 \
  -e EnableImplicitScaling=1 \
  -v "$MODEL_DIR_HOST:/models/gemma4:ro" \
  --entrypoint /bin/bash \
  "$OPENARC_GEMMA4_IMAGE" \
  -lc "openarc add \
         --model-name '$MODEL_NAME' \
         --model-path /models/gemma4 \
         --engine ovgenai \
         --model-type vlm \
         --device GPU \
         --vlm-type gemma4 || true; \
       openarc serve start \
         --host 0.0.0.0 \
         --port 8000 \
         --load-models '$MODEL_NAME'"

10. LLM Scaler

The tested LLM Scaler image did not load the recorded Gemma 4 AutoRound or AWQ variants.

The AutoRound route needed --allow-deprecated-quantization to pass the initial vLLM check, then failed around model.vision_tower.std_bias. The AWQ route failed because the tested image could not find a WNA16 linear-layer kernel for the model.

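| Model | Hardware | Status | Result / blocker |
| --- | --- | --- | --- |
| Gemma 4 31B INT4 AutoRound | 2× B70 TP=2 | Did not load | AutoRound required --allow-deprecated-quantization, then failed on model.vision_tower.std_bias in TransformersMultiModalForCausalLM. |
| Gemma 4 31B AWQ INT4 | 2× B70 TP=2 | Did not load | Failed to find an XPU WNA16 linear kernel for the AWQ/W4A16 path. |

LLM Scaler command used for the Gemma 4 AutoRound path
# ${MODEL_NAME} below is expanded by the host shell; the value is assumed to match the vLLM command above.
MODEL_NAME="google-gemma4-31b-it-int4-Intel-Autoround"

docker run -d --restart=always \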
 --name llm-scaler-vllm \
 --net=bridge \
 -p 8000:8000 \
 --group-add=video \
 --ipc=host \
 --privileged \
 --device /dev/dri:/dev/dri \
 -v /dev/dri/by-path:/dev/dri/by-path \
 -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
 -e VLLM_XPU_ENABLE_XPU_GRAPH=1 \
 -v /home/ejer/llm/local_models/google-gemma4-31b-it-int4-Intel-Autoround:/app/model:ro \
 --entrypoint /bin/bash \
 vllm-intel-llm-scaler-13-05-2026 \
 -lc "source /opt/intel/oneapi/setvars.sh --force && \
 vllm serve /app/model \
 --host 0.0.0.0 --port 8000 \
 --served-model-name ${MODEL_NAME} \
 --enable-chunked-prefill \
 --tensor-parallel-size 2 \
 --reasoning-parser gemma4 \
 --max-model-len 4096 \
 --gpu-memory-utilization 0.9 \
 --dtype bfloat16 \
 --default-chat-template-kwargs '{\"enable_thinking\": true}' \
 --allow-deprecated-quantization \
 --trust-remote-code"

11. How to reproduce the tests

Reproduce one path at a time. Do not change model format, backend, GPU count, and container image together; the notes show that any one of those can decide the outcome.

  1. Use a Linux host where the B70 cards are visible under /dev/dri.
  2. Confirm render nodes with ls -l /dev/dri.
  3. Download the same model variant for the route being tested. Do not substitute BF16, AWQ, AutoRound, OpenVINO IR, and GGUF as if they are equivalent.
  4. Use the same Docker images named in the notes where possible: vllm-intel-12-05-2026, vllm-intel-llm-scaler-13-05-2026, local/llama.cpp:full-intel-sycl-15-05-2026, local/llama.cpp:full-vulkan-16-05-2026, and openarc-gemma4:latest.
  5. Run one framework at a time on port 8000.
  6. Use the same sampling parameters: Gemma tests used temperature 1.0, top_k 64, top_p 0.95; Qwen vLLM tests used temperature 0.6, top_k 20, top_p 0.95, presence_penalty 1.5, repetition_penalty 1.0.
  7. Capture client-side wall time and server logs.

Simple compatible measurement helper

This reproduces the style of the measurements, but it is not claimed to be the original harness.

#!/usr/bin/env python3
import json, time, requests

URL = "http://127.0.0.1:8000/v1/chat/completions"
MODEL = "replace-with-served-model-name"
PROMPT = "Hello! Give me a one-sentence fun fact about Denmark."

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": PROMPT}],
    "max_completion_tokens": 2048,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
}

runs = []
for i in range(3):
    t0 = time.perf_counter()
    r = requests.post(URL, json=payload, timeout=600)
    elapsed = time.perf_counter() - t0
    r.raise_for_status()
    data = r.json()
    # "content" can be present but null (e.g. reasoning-only replies), so fall back to "".
    text = data["choices"][0]["message"].get("content") or ""
    usage = data.get("usage", {})
    runs.append({
        "elapsed_s": elapsed,
        "completion_tokens": usage.get("completion_tokens"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "chars": len(text),
        "text": text,
    })

print(json.dumps(runs, indent=2))
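
The helper above only prints raw runs. One way to turn them into the kind of tokens/s and chars/s figures shown in the tables is sketched below. It is an illustration, not the original harness: the names SAMPLING and summarize are hypothetical, the presets repeat the values from step 6, and the rates are end-to-end (client wall time includes prompt processing), averaged per run. Whether the original summaries averaged per run or divided total tokens by total time is not recorded.

```python
from statistics import mean

# Sampling presets from step 6. top_k and repetition_penalty are not standard
# OpenAI fields; this assumes the server accepts them as extensions.
SAMPLING = {
    "gemma": {"temperature": 1.0, "top_k": 64, "top_p": 0.95},
    "qwen_vllm": {
        "temperature": 0.6,
        "top_k": 20,
        "top_p": 0.95,
        "presence_penalty": 1.5,
        "repetition_penalty": 1.0,
    },
}


def summarize(runs):
    """Average end-to-end tokens/s and chars/s over runs that report token usage."""
    usable = [
        r for r in runs
        if r.get("completion_tokens") and r.get("elapsed_s", 0) > 0
    ]
    if not usable:
        return None  # server returned no usage data, or no runs were recorded
    return {
        "runs": len(usable),
        "avg_completion_tokens": mean(r["completion_tokens"] for r in usable),
        "avg_chars": mean(r["chars"] for r in usable),
        "tokens_per_s": mean(r["completion_tokens"] / r["elapsed_s"] for r in usable),
        "chars_per_s": mean(r["chars"] / r["elapsed_s"] for r in usable),
    }
```

To use it with the helper, merge a preset into the request with payload.update(SAMPLING["qwen_vllm"]) before the loop and call summarize(runs) after it. A large gap between tokens/s and chars/s, as in the Qwen rows, is expected when most completion tokens are reasoning output that does not appear in the final text.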

12. The good, the bad, and the gotchas

What worked in the recorded tests

  • llama.cpp SYCL single-GPU inference recorded 20.79 ± 0.03 tokens/s for Gemma 4 31B Q4_K_XL tg128 on one B70, and 55.77 ± 0.51 tokens/s for Gemma 4 26B A4B Q4_K_XL tg128 on one B70.
  • vLLM Intel XPU with Gemma 4 INT4 AutoRound worked as an OpenAI-compatible 2×B70 server path at about 14.2–14.3 tokens/s.
  • Qwen 3.5 27B AWQ INT4 under vLLM worked and produced parsed reasoning/final output.
  • The May llama.cpp SYCL benchmark rows now use llama-bench pp/tg results. The April rows are retained as earlier generation-script baselines.

What did not work in the recorded tests

  • OpenVINO Model Server failed on the tested Gemma 4 VLM path.
  • OpenArc still failed at Gemma 4 inference even after the Transformers/runtime issue was resolved.
  • LLM Scaler did not load the tested Gemma 4 AutoRound or AWQ variants.
  • vLLM BF16 Gemma 4 31B did not fit the tested 2×B70 4096-token configuration, and TP=3 is invalid for this model.
  • vLLM XPU graph did not speed up the two-GPU run because graph capture was disabled for communication operations.
  • llama.cpp SYCL dual-GPU layer split is represented here by llama-bench throughput rows. Those rows measure prompt and generation throughput, not output quality.
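
The BF16 result is at least consistent with a rough weight-size estimate. The sketch below is a back-of-envelope calculation, not a diagnosis taken from the logs: it uses the 30.70 B parameter count from the llama-bench tables, 2 bytes per BF16 parameter, and the 0.9 memory-utilization setting from the vLLM command. Treating the listed 32 GB per card as 32 GiB is an assumption, and the function names are hypothetical.

```python
GIB = 1024 ** 3


def weights_gib(params_billion, bytes_per_param):
    """Approximate weight footprint in GiB, ignoring KV cache and activations."""
    return params_billion * 1e9 * bytes_per_param / GIB


def headroom_gib(params_billion, bytes_per_param, gpus, gib_per_gpu, utilization):
    """Memory left for KV cache and activations under a utilization cap."""
    budget = gpus * gib_per_gpu * utilization
    return budget - weights_gib(params_billion, bytes_per_param)


if __name__ == "__main__":
    # 30.70 B params: llama-bench Params column; 0.9: --gpu-memory-utilization.
    # 32 GiB per card is an assumption based on the listed 32 GB.
    print(f"BF16 weights: {weights_gib(30.70, 2):.1f} GiB")
    print(f"Headroom on 2 cards: {headroom_gib(30.70, 2, 2, 32, 0.9):.1f} GiB")
```

Under these assumptions the BF16 weights alone come to about 57.2 GiB against a 57.6 GiB budget, leaving well under 1 GiB for the KV cache and activations, which is little room for a 4096-token configuration.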

13. Conclusion

In these tests, the fastest way to run LLMs on Intel's B70 was llama.cpp, followed by vLLM Intel XPU.

The most direct comparison to keep in mind is that the recorded vLLM Gemma 4 31B INT4 AutoRound runs on 2×B70 were around 14.2–14.3 tokens/s, while the revised llama.cpp SYCL Gemma 4 31B GGUF tg128 result was 20.79 ± 0.03 tokens/s on one B70 and 21.53 ± 0.00 tokens/s with two-B70 layer split.

The OpenVINO Model Server, OpenArc, and LLM Scaler sections should be read narrowly: they describe the tested Gemma 4 configurations, not every possible B70 model/runtime combination. Since Gemma 4 was selected because it was used for personal projects, further tests with Qwen and other supported models could change the picture for OpenArc and OpenVINO Model Server.

Across the whole set, more GPUs did not automatically produce a better result. Tensor-parallel divisibility, available memory, XPU kernel support, graph-capture behavior, and backend-specific multi-GPU behavior all mattered. This will hopefully improve in future releases of the various frameworks.

References