prometeu-studio/docs/pbs/pull-requests/PR-012-pbs-byte-offset-spans.md

# PR-012 - PBS Lexer Byte-Offset Spans

## Briefing

Lexer spans are currently tracked as Java `String` indices, which count UTF-16 code units rather than bytes. The PBS syntax spec requires stable byte offsets. This PR aligns token/span attribution with UTF-8 byte offsets and keeps diagnostics deterministic.

## Motivation

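Without byte offsets, diagnostics and downstream attribution diverge on non-ASCII sources, violating the lexical contract. Every non-ASCII character occupies two to four bytes in UTF-8 but only one or two `String` indices, so every position after the first such character is shifted.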
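
The divergence can be reproduced in a few lines. This is a minimal sketch, not code from `prometeu-frontend-pbs`: the class `OffsetDivergence`, the helper `byteOffsetOf`, and the sample source line are all hypothetical, and the helper assumes the index falls on a code point boundary.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: compares a String index with the UTF-8 byte offset
// of the same position.
final class OffsetDivergence {

    /**
     * UTF-8 byte offset of the position at the given String (UTF-16) index.
     * Assumes charIndex does not split a surrogate pair.
     */
    static int byteOffsetOf(String source, int charIndex) {
        return source.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "\u00e9" is 'é': one UTF-16 code unit, two UTF-8 bytes.
        String source = "let caf\u00e9 = 1;";
        int charIndex = source.indexOf('=');
        int byteOffset = byteOffsetOf(source, charIndex);
        // prints "char index: 9, byte offset: 10"
        System.out.println("char index: " + charIndex + ", byte offset: " + byteOffset);
    }
}
```

Re-encoding a prefix on every lookup is quadratic over a whole file, so this form is only suitable for demonstrating the mismatch, not for use inside the lexer loop.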

## Target

- `prometeu-frontend-pbs` lexer and span attribution behavior.
- Diagnostics and AST attribution consumers that depend on lexer spans.

## Scope

- Convert lexer position accounting to UTF-8 byte offsets.
- Preserve existing tokenization semantics.
- Keep parser/semantics APIs unchanged.

## Method

- Introduce byte-accurate cursor accounting in lexer scanning.
- Emit token start/end using byte offsets.
- Validate compatibility with parser and diagnostics sinks.
- Add regression fixtures with non-ASCII source content.
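
One way to get byte-accurate cursor accounting is to advance by code point and keep two counters side by side: the `String` index used for reading, and the UTF-8 byte offset used for spans. The sketch below is an assumption about how this could look, not the actual lexer implementation; `ByteCursor` and its methods are hypothetical names, and it assumes well-formed input (no unpaired surrogates).

```java
// Hypothetical cursor: reads from a Java String but reports UTF-8 byte offsets.
final class ByteCursor {
    private final String source;
    private int charIndex;   // UTF-16 index, used only to read from the String
    private int byteOffset;  // UTF-8 offset, the value emitted in token spans

    ByteCursor(String source) {
        this.source = source;
    }

    boolean atEnd() {
        return charIndex >= source.length();
    }

    int byteOffset() {
        return byteOffset;
    }

    /** Consumes one code point and advances both counters. */
    int advance() {
        int codePoint = source.codePointAt(charIndex);
        charIndex += Character.charCount(codePoint);
        byteOffset += utf8Length(codePoint);
        return codePoint;
    }

    /** Number of bytes UTF-8 uses to encode the given code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    public static void main(String[] args) {
        // 'a' (1 byte), 'é' (2), '€' (3), U+1F600 (4), 'b' (1)
        ByteCursor cursor = new ByteCursor("a\u00e9\u20ac\uD83D\uDE00b");
        while (!cursor.atEnd()) {
            cursor.advance();
        }
        System.out.println(cursor.byteOffset()); // prints 11
    }
}
```

A token's start is the value of `byteOffset()` before its first code point is consumed, and its end is the value after the last one. Because the byte counter is updated incrementally, scanning stays linear in the source length and the existing tokenization logic does not need to change.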

## Acceptance Criteria

- Every emitted token carries its start and end as UTF-8 byte offsets.
- Diagnostics from lexer/parser over non-ASCII sources point to correct byte spans.
- Existing ASCII tests remain green.
- New non-ASCII span tests are added and produce identical results on every run.

## Tests

- Extend lexer tests with UTF-8 multibyte identifiers/strings.
- Add parser span-attribution tests over multibyte source.
- Run full `prometeu-frontend-pbs` test suite.
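
A span test over multibyte source can check a byte span by slicing the UTF-8 encoding of the source and decoding the slice back to the expected lexeme. The following is a sketch under stated assumptions: `SpanFixtureCheck` and `lexemeAt` are hypothetical helpers, the spans are hand-computed rather than produced by the real lexer, and the end offset is assumed to be exclusive, which this document does not specify.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical test helper: recovers the text covered by a UTF-8 byte span.
final class SpanFixtureCheck {

    /** Decodes the bytes in [start, end) of the UTF-8 encoded source. */
    static String lexemeAt(String source, int start, int end) {
        byte[] utf8 = source.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, start, end - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café = 1": 'é' takes bytes 3..4, so '=' sits at byte 6, not index 5.
        String source = "caf\u00e9 = 1";
        System.out.println(lexemeAt(source, 0, 5).equals("caf\u00e9")); // prints true
        System.out.println(lexemeAt(source, 6, 7).equals("="));         // prints true
        System.out.println(lexemeAt(source, 8, 9).equals("1"));         // prints true
    }
}
```

In an actual fixture the hand-written offsets would be replaced by the spans the lexer emits, so a span that splits a multibyte character or is shifted by one byte fails the lexeme comparison.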

## Non-Goals

- Changing token classes or grammar.
- Changing message wording policy.