PR-012 - PBS Lexer Byte-Offset Spans
Briefing
Lexer spans are currently tracked as Java String character indices (UTF-16 code units). The PBS syntax spec requires stable byte offsets. This PR aligns token/span attribution with UTF-8 byte offsets and keeps diagnostics deterministic.
Motivation
Without byte offsets, diagnostics and downstream attribution diverge on non-ASCII sources, violating the lexical contract.
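To make the divergence concrete, here is a small illustrative sketch (not code from the PBS lexer; the class and method names are hypothetical). It compares the Java String index of a substring with the UTF-8 byte offset of the same position.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of prometeu-frontend-pbs.
final class SpanDivergenceDemo {
    // Position of `needle` as a Java String index (UTF-16 code units).
    static int charIndexOf(String source, String needle) {
        return source.indexOf(needle);
    }

    // Position of `needle` as an offset into the UTF-8 encoding of `source`.
    // Assumes `needle` occurs in `source`.
    static int byteOffsetOf(String source, String needle) {
        int idx = source.indexOf(needle);
        return source.substring(0, idx).getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For the source `let naïve = 1`, `charIndexOf(source, "=")` is 10 while `byteOffsetOf(source, "=")` is 11, because `ï` occupies one UTF-16 code unit but two UTF-8 bytes. On pure ASCII input the two values coincide, which is why the mismatch only appears on non-ASCII sources.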
Target
- prometeu-frontend-pbs lexer and span attribution behavior.
- Diagnostics and AST attribution consumers that depend on lexer spans.
Scope
- Convert lexer position accounting to UTF-8 byte offsets.
- Preserve existing tokenization semantics.
- Keep parser/semantics APIs unchanged.
Method
- Introduce byte-accurate cursor accounting in lexer scanning.
- Emit token start/end using byte offsets.
- Validate compatibility with parser and diagnostics sinks.
- Add regression fixtures with non-ASCII source content.
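One possible shape for the byte-accurate cursor is sketched below. This is a minimal illustration under stated assumptions, not the actual lexer code: `ByteOffsetCursor` and its methods are hypothetical names, and the sketch assumes the source is held as a well-formed Java String that is scanned one code point at a time.

```java
// Hypothetical cursor for illustration; the real lexer's types may differ.
final class ByteOffsetCursor {
    private final String source;
    private int charIndex;   // index into the Java String (UTF-16 code units)
    private int byteOffset;  // matching offset into the UTF-8 encoding

    ByteOffsetCursor(String source) {
        this.source = source;
    }

    boolean atEnd() {
        return charIndex >= source.length();
    }

    // UTF-8 byte offset of the next unread code point; use this for token start/end.
    int byteOffset() {
        return byteOffset;
    }

    // Consumes one code point and advances both counters in lockstep.
    int advance() {
        int cp = source.codePointAt(charIndex);
        charIndex += Character.charCount(cp);
        byteOffset += utf8Length(cp);
        return cp;
    }

    // Number of bytes UTF-8 uses to encode a code point.
    // Assumption: input has no unpaired surrogates. String.getBytes(UTF_8)
    // substitutes a 1-byte '?' for those, which this function would count as 3.
    static int utf8Length(int cp) {
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }
}
```

A token's span would then be the value of `byteOffset()` before its first code point and after its last one. Keeping the String index alongside the byte offset lets existing tokenization logic continue to read characters from the String while only the emitted positions change.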
Acceptance Criteria
- All emitted tokens include UTF-8 byte offsets.
- Diagnostics from lexer/parser over non-ASCII sources point to correct byte spans.
- Existing ASCII tests remain green.
- New non-ASCII span tests are added and deterministic.
Tests
- Extend lexer tests with UTF-8 multibyte identifiers/strings.
- Add parser span-attribution tests over multibyte source.
- Run the full prometeu-frontend-pbs test suite.
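A non-ASCII span test could use a round-trip property: slicing the UTF-8 bytes of the source at a token's span should reproduce the token's text. The sketch below illustrates that check with a stand-in whitespace splitter. It is not the PBS lexer, `Token`, `lex`, and `spansRoundTrip` are hypothetical names, and a real test would call the actual lexer from the project's test framework instead.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a stand-in lexer plus a round-trip span check.
final class SpanRoundTripCheck {
    // Hypothetical token shape: text plus a [start, end) UTF-8 byte span.
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Stand-in lexer: splits on whitespace and records byte spans.
    static List<Token> lex(String source) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        int byteOffset = 0;
        while (i < source.length()) {
            int cp = source.codePointAt(i);
            if (Character.isWhitespace(cp)) {
                i += Character.charCount(cp);
                byteOffset += utf8Length(cp);
                continue;
            }
            int startChar = i;
            int startByte = byteOffset;
            while (i < source.length()) {
                cp = source.codePointAt(i);
                if (Character.isWhitespace(cp)) break;
                i += Character.charCount(cp);
                byteOffset += utf8Length(cp);
            }
            tokens.add(new Token(source.substring(startChar, i), startByte, byteOffset));
        }
        return tokens;
    }

    // UTF-8 encoded length of one code point (assumes well-formed input).
    static int utf8Length(int cp) {
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }

    // True when every token's byte span slices back to the token's own text.
    static boolean spansRoundTrip(String source) {
        byte[] bytes = source.getBytes(StandardCharsets.UTF_8);
        for (Token t : lex(source)) {
            String slice = new String(bytes, t.start, t.end - t.start, StandardCharsets.UTF_8);
            if (!slice.equals(t.text)) return false;
        }
        return true;
    }
}
```

Fixtures that mix one-, two-, three-, and four-byte characters exercise every branch of the length calculation. Because the check depends only on the source text, it produces the same result on every run, which fits the determinism requirement in the acceptance criteria.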
Non-Goals
- Changing token classes or grammar.
- Changing message wording policy.