Executive summary
The experiment successfully served DiffusionGemma through a local OpenAI-compatible HTTP endpoint on Intel Arc Pro B70. The model produced a long response from a chat-completions request, used a warm context from the context pool, and completed with no request queueing.
The headline result is mixed: context pooling solved the previous multi-second per-request setup overhead, but the 2,000-token long-form request still took about 77 seconds. This is not a context-creation problem anymore. The remaining cost is the diffusion denoising loop itself, plus the current backend limitation that sampling is falling back to host-side logic rather than remaining fully on-device.
What is a diffusion language model?
Traditional autoregressive language models emit text one token at a time: token 1, then token 2, then token 3. Each next-token decision depends on the committed prefix, so generation is inherently sequential.
A discrete diffusion language model works differently. It starts with a block, or canvas, of placeholder or noisy tokens and iteratively denoises the entire block. Positions can be refined in parallel, and the denoiser can use bidirectional attention inside the canvas. Google describes DiffusionGemma as a model that shifts from token-by-token autoregression to block-autoregressive multi-canvas sampling, denoising 256-token canvases in parallel before moving to the next block.1
For short blocks, this can make much better use of GPU compute. Google’s developer guide describes the architecture as moving the local serving bottleneck away from repeatedly streaming weights for one-token decoding and toward a larger parallel workload across a 256-token canvas.2
For long generations, the system becomes block-autoregressive: finish one 256-token canvas, commit it, then start the next canvas conditioned on the existing text. That means a 2,000-token request is not one 2,000-token parallel decode. It is approximately eight 256-token canvases.
Why this is currently not just normal llama.cpp serving
The normal llama-server path in llama.cpp is designed around autoregressive next-token generation. It expects to compute logits for the current position and sample the next token. DiffusionGemma needs a different generation loop: prepare a canvas, run iterative denoising, commit confident canvas tokens, and repeat for additional blocks.
The GGUF model can load in a branch that knows the diffusion-gemma architecture, but loading is not enough. The server must call the diffusion-specific generation functions. Unsloth’s DiffusionGemma GGUF documentation states that the model needs the DiffusionGemma llama.cpp branch and the dedicated llama-diffusion-cli runner; it also notes that the standard llama-cli and llama-server cannot generate from it yet.3
The custom HTTP stub used here fills that gap. It loads the model once, builds prompts through the chat template, creates or leases a llama_context, and calls diffusion_generate_entropy_bound() for canvas-based diffusion generation. The result is exposed through familiar endpoints such as /v1/chat/completions.
Test setup with automatic context setup
| Item | Value |
|---|---|
| GPU | Intel(R) Arc(TM) Pro B70 Graphics |
| Backend | llama.cpp SYCL / oneAPI / Level Zero |
| Model | Dg_Rc0P1_Patched, architecture diffusion-gemma |
| Model memory | 16,028 MiB reported by the wrapper |
| Canvas length | 256 tokens |
| Request | 1,000-word one-month US attractions trip plan |
| Requested output | 2,000 tokens |
| Context pool | Enabled; context reused from pool slot 1 |
The request used this OpenAI-compatible call:
curl -s http://127.0.0.1:8081/v1/chat/completions ^
-H "Content-Type: application/json" ^
-d "{\"model\":\"diffusiongemma\",\"messages\":[{\"role\":\"user\",\"content\":\"Answer directly. Do not show reasoning. Give me 1000 word round trip plan for visiting major attractions in United States for 1 month.\"}],\"max_tokens\":2000,\"temperature\":0.8,\"stream\":false}"
The wrapper automatically calculated the number of diffusion blocks from max_tokens and canvas_length. With a 256-token canvas and a 2,000-token request, it requested eight blocks and processed 2,048 canvas tokens. It allocated n_ctx=4096, n_batch=4096, and n_ubatch=4096.
Mathematical explanation of the run
The simplest way to understand the result is to separate four quantities: requested visible tokens, canvas length, denoising steps, and memory budget. The audit record gives enough data to estimate each one.
Autoregressive decoding versus diffusion decoding
An autoregressive language model factorizes output as a strictly sequential product. Each token depends on all previously emitted tokens, so the loop has one sampling decision per token.
A discrete diffusion language model instead refines a noisy or masked canvas. A denoiser updates many positions in the same block at once, repeating until the block is confident enough to commit.
For a long answer, the model is still block-autoregressive across canvases: finish canvas 1, append it to the prefix, then generate canvas 2. This is why a 2,000-token answer is not one single parallel pass.
Work estimate from denoising steps
The audit record shows eb_max_denoising_steps=48. The worst-case denoising workload is approximately block count multiplied by maximum denoising steps.
That equation explains why the measured 2,000-token run is much slower than a short 256-token demonstration. A one-block response has roughly one eighth of the block sequence length before considering prompt growth, early stopping, and backend overhead.
Context allocation
The minimum token allocation must fit the prompt plus the canvas work. The wrapper then rounds up to the configured context pool size.
Throughput equations
The audit contains two useful speed scores: visible output throughput and canvas throughput. Visible throughput uses the actual number of returned tokens. Canvas throughput uses the full diffusion canvas work.
Memory guard equation
The memory-aware scheduler admits a request only if the estimated model, pool, active request, and safety margin remain below the detected GPU budget. The numbers below use the compact audit record from this run.
This is also why concurrency must be conservative on a single B70. The model and two warm contexts already occupy a large part of the DXGI memory budget before any request begins generating.
Scores and interpretation
| Metric | Measured result | Interpretation |
|---|---|---|
| Prompt tokens | 47 | Short prompt; prompt processing is not the bottleneck. |
| Requested tokens | 2,000 | Long-form answer target. |
| Output tokens | 1,966 | 98.3% of requested token budget used. |
| Canvas tokens processed | 2,048 | Eight 256-token diffusion blocks. |
| Context acquisition | 0.000017 s | Context pooling removed the earlier ~3.7 s context setup overhead. |
| Generation time | 76.93 s | Dominated by diffusion denoising, not setup. |
| Total time | 76.93 s | Almost identical to generation time because context was reused. |
| Output throughput | 25.56 tokens/s | Practical end-to-end user-visible text throughput. |
| Canvas throughput | 26.62 canvas tokens/s | Useful diffusion-specific throughput metric. |
| GPU consumed memory | 27,904 MiB | High utilization; only about 3,958 MiB budget remained. |
| GPU budget | 31,862 MiB | DXGI-reported current budget, not merely physical VRAM. |
The most important result is that the context pool worked. Earlier runs spent several seconds creating a fresh context for every request; this run leased a warm context in microseconds. That is a clear win for short and medium requests. For this long request, however, the generation loop remains the bottleneck.
The remaining gap versus ideal diffusion-model marketing numbers is most likely explained by a combination of current backend maturity, host-side sampling fallback, denoising step count, and the fact that the 2,000-token output needed eight sequential canvas blocks. Google’s published examples emphasize the speed potential of 256-token parallel canvases on high-end NVIDIA GPUs and integrated serving stacks such as vLLM; this local llama.cpp/SYCL experiment is a different backend and a custom wrapper.2
Compact audit record used for this article
[audit] {"event":"generation_complete","model":"Dg_Rc0P1_Patched","model_arch":"diffusion-gemma","timestamp":"2026-07-05T07:56:52.235Z","request_id":1,"endpoint":"/v1/chat/completions","success":true,"tokens":{"prompt":47,"requested":2000,"output":1966,"canvas_processed":2048},"timing_s":{"queue":0.0,"context":0.0000168,"context_pool_wait":0.000013,"generation":76.92879,"total":76.930306},"speed":{"output_tps_total":25.5556,"output_tps_generation":25.5561,"canvas_tps_generation":26.6220},"allocation":{"n_ctx":4096,"n_batch":4096,"n_ubatch":4096,"blocks":8,"canvas_length":256,"request_est_mib":512.0,"model_mib":16028.22,"context_reused":true,"context_pool_slot":1,"context_pool_mib":9216.0},"memory":{"gpu_adapter":"Intel(R) Arc(TM) Pro B70 Graphics","gpu_available_mib":3957.625,"gpu_consumed_mib":27904.3125,"gpu_budget_mib":31861.9375,"gpu_dedicated_mib":32630.0}}
Supported platforms: NVIDIA, AMD, and Intel
The wrapper is ordinary C++ around llama.cpp, cpp-httplib, nlohmann JSON, and the diffusion example code. Platform support mainly depends on which llama.cpp backend you build.
| Vendor | Typical backend | Status for this wrapper | Memory detection |
|---|---|---|---|
| NVIDIA | CUDA | Should build if the DiffusionGemma branch and CUDA backend build successfully. | Linux fallback can use nvidia-smi; a stronger production version should use NVML directly. |
| AMD | HIP/ROCm or Vulkan | Expected to work if the diffusion branch supports the chosen backend. | Linux DRM sysfs can read mem_info_vram_total and mem_info_vram_used on AMDGPU. |
| Intel | SYCL / oneAPI / Level Zero | The tested target here. llama.cpp documents SYCL support for Intel Data Center Max, Flex, Arc, built-in GPU and iGPU devices.4 | Windows uses DXGI. Linux currently uses DRM sysfs or fallback methods; a future Intel-specific enhancement should use Level Zero Sysman memory state. |
llama.cpp also supports building multiple GPU backends into one build in many cases, and documents CUDA, Vulkan, SYCL, and other backends in its build guide.5
How the source code works
The custom server exists because DiffusionGemma needs a different generation path than normal next-token decoding. The major components are:
| Component | Purpose |
|---|---|
| Model lifetime | The GGUF model is loaded once at process startup and shared read-only by requests. |
| Request isolation | Each request leases or creates its own llama_context. Contexts are not shared concurrently. |
| Context pool | Warm contexts are preallocated. A request that fits the pool leases a slot and clears it before reuse, avoiding per-request llama_init_from_model(). |
| Automatic allocation | The wrapper calculates canvas blocks from max_tokens, then chooses context and batch sizes large enough for prompt tokens plus the requested canvas work. |
| Memory scheduler | Requests are admitted only when active request memory, model memory, queue limits, and detected GPU budget are safe. Otherwise they wait in FIFO order. |
| Diffusion generation | For canvas models it calls diffusion_generate_entropy_bound(), then detokenizes and cleans channel markers from the final response. |
| Audit logging | Compact JSONL records summarize tokens, timings, throughput, allocation, queue state, system memory, and GPU memory. |
| OpenAI-compatible API | The wrapper exposes /v1/chat/completions, /completion, /v1/models, /health, and /memory. |
How to build
Windows + Intel Arc Pro B70 + oneAPI SYCL
cd C:\llama-sycl-build\llama-diffusion
call "C:\Program Files\Microsoft Visual Studio\18\Professional\Common7\Tools\VsDevCmd.bat" -arch=amd64 -host_arch=amd64
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
cmake -B build -G "Ninja" ^
-DGGML_SYCL=ON ^
-DGGML_SYCL_F16=ON ^
-DCMAKE_C_COMPILER=cl ^
-DCMAKE_CXX_COMPILER=icx ^
-DCMAKE_BUILD_TYPE=Release ^
-DCMAKE_CXX_FLAGS="/EHsc -fexceptions"
cmake --build build --target llama-diffusion-http --parallel 2
Ubuntu + Intel oneAPI SYCL
cd ~/llama-diffusion
source /opt/intel/oneapi/setvars.sh
cmake -B build -G Ninja \
-DGGML_SYCL=ON \
-DGGML_SYCL_F16=ON \
-DCMAKE_C_COMPILER=clang \
-DCMAKE_CXX_COMPILER=icpx \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_FLAGS="-fexceptions"
cmake --build build --target llama-diffusion-http -j2
NVIDIA and AMD notes
For NVIDIA, build the DiffusionGemma branch with CUDA, for example -DGGML_CUDA=ON. For AMD, use the backend that is supported and stable for your environment, typically HIP/ROCm or Vulkan. The exact backend flags are llama.cpp-specific, so check the llama.cpp build guide for the currently supported options.5
How to run
Windows, one B70, context pool enabled
set ONEAPI_DEVICE_SELECTOR=level_zero:0
set SYCL_CACHE_PERSISTENT=1
build\bin\llama-diffusion-http.exe ^
-m C:\llama\models\diffusiongemma-26B-A4B-it-Q4_K_M.gguf ^
-ngl 99 ^
--host 127.0.0.1 ^
--port 8081 ^
-n 2048 ^
-t 20 ^
--max-concurrent 2 ^
--context-pool-size 2 ^
--context-pool-n-ctx 4096 ^
--context-pool-n-batch 4096 ^
--context-pool-n-ubatch 4096 ^
--context-pool-strict ^
--serialize-context-creation ^
--parallel-generation ^
--memory-safety-margin-mb 1024 ^
--audit-summary ^
--audit-log C:\llama-sycl-build\diffusion-http-audit.jsonl
Ubuntu
export ONEAPI_DEVICE_SELECTOR=level_zero:0
export SYCL_CACHE_PERSISTENT=1
./build/bin/llama-diffusion-http \
-m /models/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
-ngl 99 \
--host 127.0.0.1 \
--port 8081 \
-n 2048 \
-t 20 \
--max-concurrent 2 \
--context-pool-size 2 \
--context-pool-n-ctx 4096 \
--context-pool-n-batch 4096 \
--context-pool-n-ubatch 4096 \
--context-pool-strict \
--audit-summary \
--audit-log /tmp/diffusion-http-audit.jsonl
Useful health checks
curl http://127.0.0.1:8081/health
curl http://127.0.0.1:8081/v1/models
curl http://127.0.0.1:8081/memory
Benchmarking diffusion models in general
Do not benchmark diffusion language models exactly like autoregressive models. Token throughput is still useful, but it is incomplete. A diffusion run has blocks, denoising steps, commit behavior, canvas length, and often early stopping. A fair report should include at least these fields:
| Benchmark field | Why it matters |
|---|---|
| Prompt tokens | Long prompts add prefill and KV cache cost. |
| Requested and actual output tokens | Needed for visible tokens/s and quality comparison. |
| Canvas length | Determines the parallel block size. |
| Blocks completed | Long outputs become sequential across blocks. |
| Denoising step limit and actual steps | Critical for diffusion speed; fewer steps can be faster but may reduce quality. |
| Generation time versus total time | Separates model math from server overhead. |
| Context setup time | Shows whether context pooling or reuse is working. |
| Backend and sampling path | Host-side sampling fallback can materially affect performance. |
| Model quantization | Q4, Q5, Q8, BF16 change memory footprint, accuracy, and kernel behavior. |
| GPU memory budget and usage | Shows whether concurrency settings are safe. |
For comparison with autoregressive models, use identical prompts, output length targets, hardware, quantization class where possible, and repeated warm runs. Report p50 and p95 latency, not only a single tokens/s number. For diffusion models, also report canvas tokens/s and blocks/s so readers can see whether speed comes from true parallel canvas work or from shorter outputs. The equations above make the key normalization explicit: output tokens/s measures user-visible text, while canvas tokens/s measures the amount of diffusion canvas processed.
Evaluation
The custom wrapper is already doing several important things correctly: it loads the model once, isolates requests through context leasing, uses automatic allocation from the diffusion canvas size, tracks GPU budget, and emits compact audit records that are suitable for benchmarking.
The result also shows the current limits. Context pooling removed setup overhead, yet throughput for the long request remained about 25.6 visible tokens/s. The next performance work should focus on backend-side sampling and denoising efficiency, exposing entropy-bound tuning flags, and comparing the same prompt on CUDA, ROCm/HIP, Vulkan, and SYCL where possible.
For a dual-B70 system, the most robust production topology is usually two server processes, one pinned to each GPU, each with a small context pool and conservative concurrency. That avoids memory contention and makes per-GPU audit logs easy to interpret.
Source code
The custom wrapper in C++ source code .
The custom wrapper in cmake build file .
