PR-012 - PBS Lexer Byte-Offset Spans

Briefing

Lexer spans are currently tracked as Java String character indices (UTF-16 code units). The PBS syntax spec requires stable byte offsets. This PR switches token/span attribution to UTF-8 byte offsets and keeps diagnostics deterministic.

Motivation

Without byte offsets, diagnostics and downstream attribution diverge on non-ASCII sources, violating the lexical contract.
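
To illustrate the divergence, here is a minimal sketch that compares a Java String index with the matching UTF-8 byte offset. The class name, helper name, and sample text are hypothetical; the sample is arbitrary text and is not meant to be valid PBS.

```java
import java.nio.charset.StandardCharsets;

final class SpanDivergenceDemo {
    /**
     * UTF-8 byte offset corresponding to a Java String index.
     * Assumes charIndex does not split a surrogate pair.
     */
    static int utf8ByteOffset(String source, int charIndex) {
        return source.substring(0, charIndex)
                .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "\u00e9" is U+00E9 (e with acute): one UTF-16 code unit, two UTF-8 bytes.
        String source = "let \u00e9 = 1;";
        int charIndex = source.indexOf('=');
        System.out.println("char index:  " + charIndex);                         // 6
        System.out.println("byte offset: " + utf8ByteOffset(source, charIndex)); // 7
    }
}
```

On pure ASCII input the two values coincide, which is why existing ASCII tests do not expose the problem.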

Target

prometeu-frontend-pbs lexer and span attribution behavior.
Diagnostics and AST attribution consumers that depend on lexer spans.

Scope

Convert lexer position accounting to UTF-8 byte offsets.
Preserve existing tokenization semantics.
Keep parser/semantics APIs unchanged.

Method

Introduce byte-accurate cursor accounting in lexer scanning.
Emit token start/end using byte offsets.
Validate compatibility with parser and diagnostics sinks.
Add regression fixtures with non-ASCII source content.
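
One possible shape for the byte-accurate cursor is sketched below. It is not the actual prometeu-frontend-pbs implementation: ByteCursor and its methods are hypothetical names. The sketch assumes the lexer keeps scanning a Java String and maintains a second counter for the UTF-8 offset, and that the source was decoded from valid UTF-8, so it contains no unpaired surrogates.

```java
final class ByteCursor {
    private final String source;
    private int charIndex;   // position in the Java String (UTF-16 code units)
    private int byteOffset;  // matching position in the UTF-8 encoding

    ByteCursor(String source) {
        this.source = source;
    }

    boolean atEnd() {
        return charIndex >= source.length();
    }

    /** Current UTF-8 byte offset; record it at token start and token end. */
    int byteOffset() {
        return byteOffset;
    }

    /** Consumes one code point and advances both counters. */
    int advance() {
        int codePoint = source.codePointAt(charIndex);
        charIndex += Character.charCount(codePoint);
        byteOffset += utf8Length(codePoint);
        return codePoint;
    }

    /** Number of bytes UTF-8 uses to encode a code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }
}
```

Advancing by code point rather than by char keeps supplementary characters (surrogate pairs) as a single 4-byte step, and the incremental counter avoids re-encoding the source prefix for every token.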

Acceptance Criteria

All emitted tokens carry start/end positions expressed as UTF-8 byte offsets.
Diagnostics from lexer/parser over non-ASCII sources point to correct byte spans.
Existing ASCII tests remain green.
New non-ASCII span tests are added and deterministic.

Tests

Extend lexer tests with UTF-8 multibyte identifiers/strings.
Add parser span-attribution tests over multibyte source.
Run full prometeu-frontend-pbs test suite.
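
A span test over multibyte source can check that the bytes covered by a span decode back to the expected lexeme. The helper below is a sketch under assumptions: SpanAssertions and sliceByBytes are hypothetical names, the fixture text is illustrative rather than taken from the PBS test suite, and the real lexer and token APIs are not shown.

```java
import java.nio.charset.StandardCharsets;

final class SpanAssertions {
    /** Decodes the UTF-8 bytes in [startByte, endByte) back into text. */
    static String sliceByBytes(String source, int startByte, int endByte) {
        byte[] utf8 = source.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, startByte, endByte - startByte, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Fixture: an identifier and a string literal, each containing U+00E9.
        String source = "\u00e9t\u00e9 = \"caf\u00e9\"";

        // Identifier: 3 chars but 5 bytes, so a char-based span [0, 3) would be wrong.
        System.out.println(sliceByBytes(source, 0, 5).equals("\u00e9t\u00e9"));

        // String literal including its quotes occupies bytes [8, 15).
        System.out.println(sliceByBytes(source, 8, 15).equals("\"caf\u00e9\""));
    }
}
```

In a real test the span bounds would come from the tokens the lexer emits, and the decoded slice would be compared against the token's lexeme.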

Non-Goals

Changing token classes or grammar.
Changing diagnostic message wording policy.
