prometeu-studio/docs/pbs/pull-requests/PR-012-pbs-byte-offset-spans.md
2026-03-24 13:42:21 +00:00

PR-012 - PBS Lexer Byte-Offset Spans

Briefing

Lexer spans are currently tracked as Java String indices, which count UTF-16 code units. The PBS syntax spec requires stable byte offsets. This PR switches token/span attribution to UTF-8 byte offsets and keeps diagnostics deterministic.
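To make the mismatch concrete, the sketch below compares a Java String index with the corresponding UTF-8 byte offset. The class name `SpanOffsetDemo`, the helper `byteOffset`, and the sample source line are hypothetical and chosen only for illustration; they are not taken from the PBS lexer or grammar.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of prometeu-frontend-pbs.
class SpanOffsetDemo {

    /**
     * UTF-8 byte offset that corresponds to a Java String index.
     * Assumes charIndex does not fall inside a surrogate pair.
     */
    static int byteOffset(String source, int charIndex) {
        return source.substring(0, charIndex)
                .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Arbitrary sample text; \u00e9 is a single char but two UTF-8 bytes.
        String source = "let caf\u00e9 = 1;";
        int charIndex = source.indexOf('=');

        System.out.println("char index:  " + charIndex);                     // 9
        System.out.println("byte offset: " + byteOffset(source, charIndex)); // 10
    }
}
```

Every token after the first non-ASCII character is shifted in the same way, so the gap grows with each multibyte character in the file.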

Motivation

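String indices and UTF-8 byte offsets coincide only for pure ASCII text. On non-ASCII sources, the current spans diverge from the byte offsets that diagnostics and downstream attribution are expected to carry, which violates the lexical contract.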

Target

  • prometeu-frontend-pbs lexer and span attribution behavior.
  • Diagnostics and AST attribution consumers that depend on lexer spans.

Scope

  • Convert lexer position accounting to UTF-8 byte offsets.
  • Preserve existing tokenization semantics.
  • Keep parser/semantics APIs unchanged.

Method

  • Introduce byte-accurate cursor accounting in lexer scanning.
  • Emit token start/end using byte offsets.
  • Validate compatibility with parser and diagnostics sinks.
  • Add regression fixtures with non-ASCII source content.
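One possible shape for the byte-accurate cursor is sketched below. It is a minimal illustration under stated assumptions, not the lexer's actual implementation: the class `ByteCursor` and its methods are hypothetical names, and the source is assumed to be well-formed UTF-16 (no unpaired surrogates). The cursor walks the String by code point and advances a separate byte counter by the UTF-8 length of each code point, so token start/end can be read from `byteOffset()` without re-encoding the prefix.

```java
// Hypothetical cursor sketch; names are illustrative, not the real lexer API.
final class ByteCursor {
    private final String source;
    private int charIndex;   // position in UTF-16 code units (used to read the String)
    private int byteOffset;  // position in UTF-8 bytes (used for spans)

    ByteCursor(String source) {
        this.source = source;
    }

    boolean atEnd() {
        return charIndex >= source.length();
    }

    int byteOffset() {
        return byteOffset;
    }

    /** Consumes one code point and advances both counters. */
    int advance() {
        int codePoint = source.codePointAt(charIndex);
        charIndex += Character.charCount(codePoint);
        byteOffset += utf8Length(codePoint);
        return codePoint;
    }

    /** Number of bytes UTF-8 uses to encode a Unicode scalar value. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }
}
```

A scanner built on this would record `byteOffset()` before and after consuming a token's code points and use that pair as the token's start/end. Keeping the String index alongside the byte counter leaves the character-reading logic, and therefore tokenization semantics, untouched.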

Acceptance Criteria

  • Every emitted token carries its start/end as UTF-8 byte offsets.
  • Diagnostics from lexer/parser over non-ASCII sources point to correct byte spans.
  • Existing ASCII tests remain green.
  • New non-ASCII span tests are added and deterministic.

Tests

  • Extend lexer tests with UTF-8 multibyte identifiers/strings.
  • Add parser span-attribution tests over multibyte source.
  • Run full prometeu-frontend-pbs test suite.
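A simple invariant the new span tests could check is a round trip: slicing the UTF-8 bytes of the source at a token's [start, end) span should decode back to the token's lexeme. The sketch below shows that check in plain Java. The class `SpanRoundTrip`, the helper `slice`, the fixture text, and the hard-coded spans are all hypothetical; real tests would take spans from the lexer and use the project's existing test framework.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical test helper; not part of the existing test suite.
class SpanRoundTrip {

    /** Decodes the UTF-8 bytes in [start, end) of the source back into text. */
    static String slice(String source, int start, int end) {
        byte[] utf8 = source.getBytes(StandardCharsets.UTF_8);
        return new String(Arrays.copyOfRange(utf8, start, end), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Fixture: a quoted string followed by an identifier-like word,
        // each containing one two-byte character.
        String source = "\"h\u00e9llo\" na\u00efve";

        // Byte spans written out by hand for this fixture.
        System.out.println(slice(source, 0, 8));   // the quoted string
        System.out.println(slice(source, 9, 15));  // the trailing word

        // The String-index span of the trailing word is [8, 13);
        // used as a byte span it selects the wrong text.
        System.out.println(slice(source, 8, 13).equals("na\u00efve")); // false
    }
}
```

Because the expected spans are fixed numbers tied to a fixed fixture, the check is deterministic and fails loudly if the lexer regresses to String indices.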

Non-Goals

  • Changing token classes or grammar.
  • Changing diagnostic message wording policy.