# PR-012 - PBS Lexer Byte-Offset Spans

## Briefing
Lexer spans are currently tracked as Java `String` indices, which count UTF-16 code units rather than bytes. The PBS syntax spec requires stable byte offsets. This PR switches token and span attribution to UTF-8 byte offsets and keeps diagnostics deterministic.

## Motivation
`String` indices and byte offsets agree only for ASCII input. On non-ASCII sources, diagnostics and downstream attribution diverge from the byte offsets the spec requires, violating the lexical contract.
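
To make the divergence concrete, here is a minimal Java sketch that compares a `String` index with the matching UTF-8 byte offset. `OffsetDemo`, `utf8OffsetOf`, and the sample text are illustrative assumptions, not code or syntax taken from `prometeu-frontend-pbs`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the real lexer.
final class OffsetDemo {
    /**
     * UTF-8 byte offset that corresponds to a String index.
     * Assumes charIndex sits on a code point boundary.
     */
    static int utf8OffsetOf(String source, int charIndex) {
        return source.substring(0, charIndex)
                .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "let ação = 1;" written with escapes so the file encoding does not matter.
        String source = "let a\u00e7\u00e3o = 1;";
        int charIndex = source.indexOf('=');
        System.out.println("char index:  " + charIndex);
        System.out.println("byte offset: " + utf8OffsetOf(source, charIndex));
    }
}
```

For this sample the `=` sits at char index 9 but at byte offset 11, because `ç` and `ã` each take two bytes in UTF-8.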

## Target
- `prometeu-frontend-pbs` lexer and span attribution behavior.
- Diagnostics and AST attribution consumers that depend on lexer spans.

## Scope
- Convert lexer position accounting to UTF-8 byte offsets.
- Preserve existing tokenization semantics.
- Keep parser/semantics APIs unchanged.

## Method
- Introduce byte-accurate cursor accounting in lexer scanning.
- Emit token start/end using byte offsets.
- Validate compatibility with parser and diagnostics sinks.
- Add regression fixtures with non-ASCII source content.
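
The steps above could be sketched roughly as follows: a scanner that advances by code point and keeps a UTF-8 byte counter next to the `String` index. Everything here (`ByteSpanSketch`, `Token`, `utf8Length`, `scan`, and the whitespace-only tokenization) is a hypothetical stand-in, not the actual PBS lexer, and it assumes well-formed UTF-16 input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and tokenization rules are illustrative only.
final class ByteSpanSketch {
    /** Token text plus its half-open UTF-8 byte span [start, end). */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Number of UTF-8 bytes needed to encode one Unicode code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    /** Splits on whitespace only, tracking char index and byte offset together. */
    static List<Token> scan(String source) {
        List<Token> tokens = new ArrayList<>();
        int charPos = 0; // index into the Java String (UTF-16 code units)
        int bytePos = 0; // offset into the UTF-8 encoding of the same text
        while (charPos < source.length()) {
            int cp = source.codePointAt(charPos);
            if (Character.isWhitespace(cp)) {
                charPos += Character.charCount(cp);
                bytePos += utf8Length(cp);
                continue;
            }
            int startChar = charPos;
            int startByte = bytePos;
            while (charPos < source.length()) {
                cp = source.codePointAt(charPos);
                if (Character.isWhitespace(cp)) break;
                charPos += Character.charCount(cp);
                bytePos += utf8Length(cp);
            }
            // The text is sliced by char index; the span is reported in bytes.
            tokens.add(new Token(source.substring(startChar, charPos), startByte, bytePos));
        }
        return tokens;
    }
}
```

Under these assumptions, scanning `x = éé y` yields byte spans `[0,1)`, `[2,3)`, `[4,8)`, and `[9,10)`, whereas the char-index span of `éé` would be `[4,6)`. Counting bytes incrementally avoids re-encoding the source prefix for every token.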

## Acceptance Criteria
- All emitted tokens carry their start/end positions as UTF-8 byte offsets.
- Diagnostics from lexer/parser over non-ASCII sources point to correct byte spans.
- Existing ASCII tests remain green.
- New non-ASCII span tests are added and deterministic.

## Tests
- Extend lexer tests with UTF-8 multibyte identifiers/strings.
- Add parser span-attribution tests over multibyte source.
- Run full `prometeu-frontend-pbs` test suite.
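
One way to phrase a non-ASCII regression check is a round trip: slicing the UTF-8 encoding of the source by an emitted span should reproduce the token text. The helper below is a sketch with hypothetical names (`SpanFixtureCheck`, `sliceUtf8`); it does not use the real test harness or span types, and the sample source is not guaranteed to be valid PBS.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical test helper; the real suite may structure this differently.
final class SpanFixtureCheck {
    /** Decodes the source bytes covered by the half-open span [start, end). */
    static String sliceUtf8(String source, int start, int end) {
        byte[] bytes = source.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, start, end - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // x = "olá" with the accented letter written as an escape.
        String source = "x = \"ol\u00e1\"";
        // Byte span of the string literal: [4, 10).
        System.out.println(sliceUtf8(source, 4, 10));
        // Char-index span [4, 9) misapplied as bytes drops the closing quote.
        System.out.println(sliceUtf8(source, 4, 9));
    }
}
```

With the byte span `[4,10)` the slice is the full literal `"olá"`. Applying the char-index span `[4,9)` to the same bytes cuts off the closing quote, which is the kind of mismatch a fixture like this is meant to catch.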

## Non-Goals
- Changing token classes or grammar.
- Changing message wording policy.