Running DiffusionGemma on Intel Arc Pro B70 with llama.cpp

Executive summary

25.56output tokens/s during generation

76.93 send-to-end for 1,966 output tokens

0.000017 scontext acquisition with warm pool

The experiment successfully served DiffusionGemma through a local OpenAI-compatible HTTP endpoint on Intel Arc Pro B70. The model produced a long response from a chat-completions request, used a warm context from the context pool, and completed with no request queueing.

The headline result is mixed: context pooling solved the previous multi-second per-request setup overhead, but the 2,000-token long-form request still took about 77 seconds. This is not a context-creation problem anymore. The remaining cost is the diffusion denoising loop itself, plus the current backend limitation that sampling is falling back to host-side logic rather than remaining fully on-device.

Important interpretation: diffusion language models can be faster than autoregressive models because they refine a canvas of many tokens in parallel. That does not mean every long output is one pass. DiffusionGemma uses a 256-token canvas. A 2,000-token request requires eight canvas blocks, and each block may use many denoising steps.

What is a diffusion language model?

Traditional autoregressive language models emit text one token at a time: token 1, then token 2, then token 3. Each next-token decision depends on the committed prefix, so generation is inherently sequential.

A discrete diffusion language model works differently. It starts with a block, or canvas, of placeholder or noisy tokens and iteratively denoises the entire block. Positions can be refined in parallel, and the denoiser can use bidirectional attention inside the canvas. Google describes DiffusionGemma as a model that shifts from token-by-token autoregression to block-autoregressive multi-canvas sampling, denoising 256-token canvases in parallel before moving to the next block.¹

For short blocks, this can make much better use of GPU compute. Google’s developer guide describes the architecture as moving the local serving bottleneck away from repeatedly streaming weights for one-token decoding and toward a larger parallel workload across a 256-token canvas.²

For long generations, the system becomes block-autoregressive: finish one 256-token canvas, commit it, then start the next canvas conditioned on the existing text. That means a 2,000-token request is not one 2,000-token parallel decode. It is approximately eight 256-token canvases.

Why this is currently not just normal llama.cpp serving

The normal llama-server path in llama.cpp is designed around autoregressive next-token generation. It expects to compute logits for the current position and sample the next token. DiffusionGemma needs a different generation loop: prepare a canvas, run iterative denoising, commit confident canvas tokens, and repeat for additional blocks.

The GGUF model can load in a branch that knows the diffusion-gemma architecture, but loading is not enough. The server must call the diffusion-specific generation functions. Unsloth’s DiffusionGemma GGUF documentation states that the model needs the DiffusionGemma llama.cpp branch and the dedicated llama-diffusion-cli runner; it also notes that the standard llama-cli and llama-server cannot generate from it yet.³

The custom HTTP stub used here fills that gap. It loads the model once, builds prompts through the chat template, creates or leases a llama_context, and calls diffusion_generate_entropy_bound() for canvas-based diffusion generation. The result is exposed through familiar endpoints such as /v1/chat/completions.

Test setup with automatic context setup

Item	Value
GPU	Intel(R) Arc(TM) Pro B70 Graphics
Backend	llama.cpp SYCL / oneAPI / Level Zero
Model	`Dg_Rc0P1_Patched`, architecture `diffusion-gemma`
Model memory	16,028 MiB reported by the wrapper
Canvas length	256 tokens
Request	1,000-word one-month US attractions trip plan
Requested output	2,000 tokens
Context pool	Enabled; context reused from pool slot 1

The request used this OpenAI-compatible call:

curl -s http://127.0.0.1:8081/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d "{\"model\":\"diffusiongemma\",\"messages\":[{\"role\":\"user\",\"content\":\"Answer directly. Do not show reasoning. Give me 1000 word round trip plan for visiting major attractions in United States for 1 month.\"}],\"max_tokens\":2000,\"temperature\":0.8,\"stream\":false}"

The wrapper automatically calculated the number of diffusion blocks from max_tokens and canvas_length. With a 256-token canvas and a 2,000-token request, it requested eight blocks and processed 2,048 canvas tokens. It allocated n_ctx=4096, n_batch=4096, and n_ubatch=4096.

Mathematical explanation of the run

The simplest way to understand the result is to separate four quantities: requested visible tokens, canvas length, denoising steps, and memory budget. The audit record gives enough data to estimate each one.

Autoregressive decoding versus diffusion decoding

An autoregressive language model factorizes output as a strictly sequential product. Each token depends on all previously emitted tokens, so the loop has one sampling decision per token.

Autoregressive generationp(x_1:T | c) = ∏_t=1^T p_θ(x_t | c, x_<t)

A discrete diffusion language model instead refines a noisy or masked canvas. A denoiser updates many positions in the same block at once, repeating until the block is confident enough to commit.

One denoising update over a canvasx^(s-1)_1:C = D_θ(x^(s)_1:C, c), s = S, S-1, ..., 1

For a long answer, the model is still block-autoregressive across canvases: finish canvas 1, append it to the prefix, then generate canvas 2. This is why a 2,000-token answer is not one single parallel pass.

Canvas blocks requiredB = ⌈ T / C ⌉ = ⌈ 2000 / 256 ⌉ = 8 blocks

Work estimate from denoising steps

The audit record shows eb_max_denoising_steps=48. The worst-case denoising workload is approximately block count multiplied by maximum denoising steps.

Maximum denoising loop countW_max ≈ B × S = 8 × 48 = 384 denoising iterations

That equation explains why the measured 2,000-token run is much slower than a short 256-token demonstration. A one-block response has roughly one eighth of the block sequence length before considering prompt growth, early stopping, and backend overhead.

Context allocation

The minimum token allocation must fit the prompt plus the canvas work. The wrapper then rounds up to the configured context pool size.

Minimum context requirementn_ctx,min ≥ P + B × C = 47 + 8 × 256 = 2095 tokens

Actual pooled allocationn_ctx = n_batch = n_ubatch = 4096 tokens

Throughput equations

The audit contains two useful speed scores: visible output throughput and canvas throughput. Visible throughput uses the actual number of returned tokens. Canvas throughput uses the full diffusion canvas work.

Visible output throughputTPS_out = N_out / t_gen = 1966 / 76.92879 = 25.56 tokens/s

Canvas throughputTPS_canvas = (B × C) / t_gen = 2048 / 76.92879 = 26.62 canvas tokens/s

Memory guard equation

The memory-aware scheduler admits a request only if the estimated model, pool, active request, and safety margin remain below the detected GPU budget. The numbers below use the compact audit record from this run.

Admission conditionM_model + M_pool + M_active + M_margin ≤ M_{GPU budget}

This run16028 + 9216 + 512 + 1024 = 26780 MiB < 31862 MiB

This is also why concurrency must be conservative on a single B70. The model and two warm contexts already occupy a large part of the DXGI memory budget before any request begins generating.

Scores and interpretation

Metric	Measured result	Interpretation
Prompt tokens	47	Short prompt; prompt processing is not the bottleneck.
Requested tokens	2,000	Long-form answer target.
Output tokens	1,966	98.3% of requested token budget used.
Canvas tokens processed	2,048	Eight 256-token diffusion blocks.
Context acquisition	0.000017 s	Context pooling removed the earlier ~3.7 s context setup overhead.
Generation time	76.93 s	Dominated by diffusion denoising, not setup.
Total time	76.93 s	Almost identical to generation time because context was reused.
Output throughput	25.56 tokens/s	Practical end-to-end user-visible text throughput.
Canvas throughput	26.62 canvas tokens/s	Useful diffusion-specific throughput metric.
GPU consumed memory	27,904 MiB	High utilization; only about 3,958 MiB budget remained.
GPU budget	31,862 MiB	DXGI-reported current budget, not merely physical VRAM.

The most important result is that the context pool worked. Earlier runs spent several seconds creating a fresh context for every request; this run leased a warm context in microseconds. That is a clear win for short and medium requests. For this long request, however, the generation loop remains the bottleneck.

The remaining gap versus ideal diffusion-model marketing numbers is most likely explained by a combination of current backend maturity, host-side sampling fallback, denoising step count, and the fact that the 2,000-token output needed eight sequential canvas blocks. Google’s published examples emphasize the speed potential of 256-token parallel canvases on high-end NVIDIA GPUs and integrated serving stacks such as vLLM; this local llama.cpp/SYCL experiment is a different backend and a custom wrapper.²

Compact audit record used for this article

[audit] {"event":"generation_complete","model":"Dg_Rc0P1_Patched","model_arch":"diffusion-gemma","timestamp":"2026-07-05T07:56:52.235Z","request_id":1,"endpoint":"/v1/chat/completions","success":true,"tokens":{"prompt":47,"requested":2000,"output":1966,"canvas_processed":2048},"timing_s":{"queue":0.0,"context":0.0000168,"context_pool_wait":0.000013,"generation":76.92879,"total":76.930306},"speed":{"output_tps_total":25.5556,"output_tps_generation":25.5561,"canvas_tps_generation":26.6220},"allocation":{"n_ctx":4096,"n_batch":4096,"n_ubatch":4096,"blocks":8,"canvas_length":256,"request_est_mib":512.0,"model_mib":16028.22,"context_reused":true,"context_pool_slot":1,"context_pool_mib":9216.0},"memory":{"gpu_adapter":"Intel(R) Arc(TM) Pro B70 Graphics","gpu_available_mib":3957.625,"gpu_consumed_mib":27904.3125,"gpu_budget_mib":31861.9375,"gpu_dedicated_mib":32630.0}}

Supported platforms: NVIDIA, AMD, and Intel

The wrapper is ordinary C++ around llama.cpp, cpp-httplib, nlohmann JSON, and the diffusion example code. Platform support mainly depends on which llama.cpp backend you build.

Vendor	Typical backend	Status for this wrapper	Memory detection
NVIDIA	CUDA	Should build if the DiffusionGemma branch and CUDA backend build successfully.	Linux fallback can use `nvidia-smi`; a stronger production version should use NVML directly.
AMD	HIP/ROCm or Vulkan	Expected to work if the diffusion branch supports the chosen backend.	Linux DRM sysfs can read `mem_info_vram_total` and `mem_info_vram_used` on AMDGPU.
Intel	SYCL / oneAPI / Level Zero	The tested target here. llama.cpp documents SYCL support for Intel Data Center Max, Flex, Arc, built-in GPU and iGPU devices.⁴	Windows uses DXGI. Linux currently uses DRM sysfs or fallback methods; a future Intel-specific enhancement should use Level Zero Sysman memory state.

llama.cpp also supports building multiple GPU backends into one build in many cases, and documents CUDA, Vulkan, SYCL, and other backends in its build guide.⁵

How the source code works

The custom server exists because DiffusionGemma needs a different generation path than normal next-token decoding. The major components are:

Component	Purpose
Model lifetime	The GGUF model is loaded once at process startup and shared read-only by requests.
Request isolation	Each request leases or creates its own `llama_context`. Contexts are not shared concurrently.
Context pool	Warm contexts are preallocated. A request that fits the pool leases a slot and clears it before reuse, avoiding per-request `llama_init_from_model()`.
Automatic allocation	The wrapper calculates canvas blocks from `max_tokens`, then chooses context and batch sizes large enough for prompt tokens plus the requested canvas work.
Memory scheduler	Requests are admitted only when active request memory, model memory, queue limits, and detected GPU budget are safe. Otherwise they wait in FIFO order.
Diffusion generation	For canvas models it calls `diffusion_generate_entropy_bound()`, then detokenizes and cleans channel markers from the final response.
Audit logging	Compact JSONL records summarize tokens, timings, throughput, allocation, queue state, system memory, and GPU memory.
OpenAI-compatible API	The wrapper exposes `/v1/chat/completions`, `/completion`, `/v1/models`, `/health`, and `/memory`.

Context pooling is the key latency improvement. In this run, context reuse reduced the context phase to about 17 microseconds. That does not accelerate the denoising math itself, but it removes the request setup overhead that previously made short requests feel slow.

How to build

Windows + Intel Arc Pro B70 + oneAPI SYCL

cd C:\llama-sycl-build\llama-diffusion

call "C:\Program Files\Microsoft Visual Studio\18\Professional\Common7\Tools\VsDevCmd.bat" -arch=amd64 -host_arch=amd64
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force

cmake -B build -G "Ninja" ^
  -DGGML_SYCL=ON ^
  -DGGML_SYCL_F16=ON ^
  -DCMAKE_C_COMPILER=cl ^
  -DCMAKE_CXX_COMPILER=icx ^
  -DCMAKE_BUILD_TYPE=Release ^
  -DCMAKE_CXX_FLAGS="/EHsc -fexceptions"

cmake --build build --target llama-diffusion-http --parallel 2

Ubuntu + Intel oneAPI SYCL

cd ~/llama-diffusion
source /opt/intel/oneapi/setvars.sh

cmake -B build -G Ninja \
  -DGGML_SYCL=ON \
  -DGGML_SYCL_F16=ON \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=icpx \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_FLAGS="-fexceptions"

cmake --build build --target llama-diffusion-http -j2

NVIDIA and AMD notes

For NVIDIA, build the DiffusionGemma branch with CUDA, for example -DGGML_CUDA=ON. For AMD, use the backend that is supported and stable for your environment, typically HIP/ROCm or Vulkan. The exact backend flags are llama.cpp-specific, so check the llama.cpp build guide for the currently supported options.⁵

How to run

Windows, one B70, context pool enabled

set ONEAPI_DEVICE_SELECTOR=level_zero:0
set SYCL_CACHE_PERSISTENT=1

build\bin\llama-diffusion-http.exe ^
  -m C:\llama\models\diffusiongemma-26B-A4B-it-Q4_K_M.gguf ^
  -ngl 99 ^
  --host 127.0.0.1 ^
  --port 8081 ^
  -n 2048 ^
  -t 20 ^
  --max-concurrent 2 ^
  --context-pool-size 2 ^
  --context-pool-n-ctx 4096 ^
  --context-pool-n-batch 4096 ^
  --context-pool-n-ubatch 4096 ^
  --context-pool-strict ^
  --serialize-context-creation ^
  --parallel-generation ^
  --memory-safety-margin-mb 1024 ^
  --audit-summary ^
  --audit-log C:\llama-sycl-build\diffusion-http-audit.jsonl

Ubuntu

export ONEAPI_DEVICE_SELECTOR=level_zero:0
export SYCL_CACHE_PERSISTENT=1

./build/bin/llama-diffusion-http \
  -m /models/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 \
  --host 127.0.0.1 \
  --port 8081 \
  -n 2048 \
  -t 20 \
  --max-concurrent 2 \
  --context-pool-size 2 \
  --context-pool-n-ctx 4096 \
  --context-pool-n-batch 4096 \
  --context-pool-n-ubatch 4096 \
  --context-pool-strict \
  --audit-summary \
  --audit-log /tmp/diffusion-http-audit.jsonl

Useful health checks

curl http://127.0.0.1:8081/health
curl http://127.0.0.1:8081/v1/models
curl http://127.0.0.1:8081/memory

Benchmarking diffusion models in general

Do not benchmark diffusion language models exactly like autoregressive models. Token throughput is still useful, but it is incomplete. A diffusion run has blocks, denoising steps, commit behavior, canvas length, and often early stopping. A fair report should include at least these fields:

Benchmark field	Why it matters
Prompt tokens	Long prompts add prefill and KV cache cost.
Requested and actual output tokens	Needed for visible tokens/s and quality comparison.
Canvas length	Determines the parallel block size.
Blocks completed	Long outputs become sequential across blocks.
Denoising step limit and actual steps	Critical for diffusion speed; fewer steps can be faster but may reduce quality.
Generation time versus total time	Separates model math from server overhead.
Context setup time	Shows whether context pooling or reuse is working.
Backend and sampling path	Host-side sampling fallback can materially affect performance.
Model quantization	Q4, Q5, Q8, BF16 change memory footprint, accuracy, and kernel behavior.
GPU memory budget and usage	Shows whether concurrency settings are safe.

For comparison with autoregressive models, use identical prompts, output length targets, hardware, quantization class where possible, and repeated warm runs. Report p50 and p95 latency, not only a single tokens/s number. For diffusion models, also report canvas tokens/s and blocks/s so readers can see whether speed comes from true parallel canvas work or from shorter outputs. The equations above make the key normalization explicit: output tokens/s measures user-visible text, while canvas tokens/s measures the amount of diffusion canvas processed.

A common mistake: comparing a one-block 256-token diffusion result against a 2,000-token autoregressive run, or vice versa. The 2,000-token diffusion run still has eight sequential canvas blocks, so it is a different workload from a single 256-token demonstration.

Evaluation

The custom wrapper is already doing several important things correctly: it loads the model once, isolates requests through context leasing, uses automatic allocation from the diffusion canvas size, tracks GPU budget, and emits compact audit records that are suitable for benchmarking.

The result also shows the current limits. Context pooling removed setup overhead, yet throughput for the long request remained about 25.6 visible tokens/s. The next performance work should focus on backend-side sampling and denoising efficiency, exposing entropy-bound tuning flags, and comparing the same prompt on CUDA, ROCm/HIP, Vulkan, and SYCL where possible.

For a dual-B70 system, the most robust production topology is usually two server processes, one pinned to each GPU, each with a small context pool and conservative concurrency. That avoids memory contention and makes per-GPU audit logs easy to interpret.

Source code

The custom wrapper in C++ source code .

The custom wrapper in cmake build file .