prometeu-studio/docs/pbs/pull-requests/PR-012-pbs-byte-offset-spans.md
2026-03-24 13:42:21 +00:00

# PR-012 - PBS Lexer Byte-Offset Spans
## Briefing
Lexer spans are currently tracked as Java `String` character indices, which count UTF-16 code units. The PBS syntax spec requires stable byte offsets. This PR switches token and span attribution to UTF-8 byte offsets and keeps diagnostics deterministic.
## Motivation
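Java `String` indices and UTF-8 byte offsets agree only on ASCII text. On non-ASCII sources they diverge (for example, `é` occupies one `String` index but two UTF-8 bytes), so diagnostics and downstream attribution point at the wrong positions, violating the lexical contract.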
## Target
- `prometeu-frontend-pbs` lexer and span attribution behavior.
- Diagnostics and AST attribution consumers that depend on lexer spans.
## Scope
- Convert lexer position accounting to UTF-8 byte offsets.
- Preserve existing tokenization semantics.
- Keep parser/semantics APIs unchanged.
## Method
- Introduce byte-accurate cursor accounting in lexer scanning.
- Emit token start/end using byte offsets.
- Validate compatibility with parser and diagnostics sinks.
- Add regression fixtures with non-ASCII source content.
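
One way the byte-accurate cursor accounting could look is sketched below. This is an illustration under assumptions, not the lexer's actual code: the class name `Utf8Cursor` and its methods are hypothetical, and the sketch assumes the source `String` is well formed (no unpaired surrogates). It walks the source by code point and keeps a UTF-8 byte counter next to the `String` index.

```java
/**
 * Hypothetical sketch: a scanning cursor that tracks the UTF-8 byte offset
 * alongside the Java String index. Assumes the source has no unpaired surrogates.
 */
final class Utf8Cursor {
    private final String source;
    private int charIndex;   // index into the Java String (UTF-16 code units)
    private int byteOffset;  // offset into the UTF-8 encoding of the same text

    Utf8Cursor(String source) {
        this.source = source;
    }

    boolean atEnd() {
        return charIndex >= source.length();
    }

    /** UTF-8 byte offset of the next unread code point. */
    int byteOffset() {
        return byteOffset;
    }

    /** Consumes one code point and advances both counters. */
    int advance() {
        int cp = source.codePointAt(charIndex);
        charIndex += Character.charCount(cp); // 1 or 2 UTF-16 code units
        byteOffset += utf8Length(cp);         // 1 to 4 UTF-8 bytes
        return cp;
    }

    /** Number of bytes a code point occupies when encoded as UTF-8. */
    static int utf8Length(int cp) {
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }
}
```

With a cursor like this, a token's start is the value of `byteOffset()` before its first `advance()` and its end is the value after its last, so the scanning logic itself does not need to change.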
## Acceptance Criteria
- Every emitted token reports its start and end as UTF-8 byte offsets.
- Diagnostics from lexer/parser over non-ASCII sources point to correct byte spans.
- Existing ASCII tests remain green.
- New non-ASCII span tests are added and produce deterministic results.
## Tests
- Extend lexer tests with UTF-8 multibyte identifiers/strings.
- Add parser span-attribution tests over multibyte source.
- Run full `prometeu-frontend-pbs` test suite.
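
A regression fixture for the multibyte cases might check spans by slicing the UTF-8 encoding of the source rather than the `String`. The helper below is a minimal sketch under assumptions: `SpanFixture`, `slice`, and `byteOffsetOf` are hypothetical names, and spans are assumed to be half-open (`end` exclusive), which the spec may define differently.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical test helper for checking byte-offset spans against source text. */
final class SpanFixture {

    /** Returns the text covered by the byte range [startByte, endByte) of the UTF-8 source. */
    static String slice(String source, int startByte, int endByte) {
        byte[] utf8 = source.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, startByte, endByte - startByte, StandardCharsets.UTF_8);
    }

    /**
     * Converts a String index to a UTF-8 byte offset by encoding the prefix.
     * The index must not fall inside a surrogate pair.
     */
    static int byteOffsetOf(String source, int charIndex) {
        return source.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For the source `let café = 1`, the identifier `café` covers `String` indices 4 to 8 but bytes 4 to 9, so `SpanFixture.slice(source, 4, 9)` returns `café` while a slice built from the old indices would cut the `é` in half.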
## Non-Goals
- Changing token classes or grammar.
- Changing the diagnostic message wording policy.