Skip to content

Incorrect ${var%%"$suffix"} Behavior with Repeated Character Suffixes #833

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
sizur opened this issue Mar 17, 2025 · 10 comments
Closed

Incorrect ${var%%"$suffix"} Behavior with Repeated Character Suffixes #833

sizur opened this issue Mar 17, 2025 · 10 comments
Labels
bug Something is not working

Comments

@sizur
Copy link

sizur commented Mar 17, 2025

str=aaaa && echo "${str%%aaa}"

Expected:

a

Observed:

aaaa
@McDutchie McDutchie added the bug Something is not working label Mar 20, 2025
@McDutchie
Copy link

Thanks for the report. Bug also confirmed in the AT&T release versions, so at least we didn't introduce it.

@McDutchie
Copy link

McDutchie commented Mar 30, 2025

It looks like the bug is in the incredibly inscrutable regex code in src/lib/libast/regex. It may take a long time to find a fix. :(

Another manifestation of it:

[[ aaaa =~ aaa$ ]]; echo $?

Output: 1 (false/failure). Expected: 0 (true/success). Of course, that ERE should match.

Experimenting:

$ [[ aa =~ a$ ]]; echo $?    
0
$ [[ aaa =~ aa$ ]]; echo $?
0
$ [[ aaaa =~ aaa$ ]]; echo $?
1
$ [[ aaaaa =~ aaaa$ ]]; echo $?
1
$ [[ aaaaaa =~ aaaaa$ ]]; echo $?
1
$ [[ aaaaaaa =~ aaaaaa$ ]]; echo $?
1
$ [[ aaaaaaaa =~ aaaaaaa$ ]]; echo $?
1
$ [[ aaaaaaaaa =~ aaaaaaaa$ ]]; echo $?
1
$ [[ aaaaaaaaaa =~ aaaaaaaaa$ ]]; echo $?
1

This is really bad. All AT&T versions act the same, as well, so it looks like no one discovered this bug until now.

@jghub
Copy link

jghub commented Mar 30, 2025

just fyi, a construct like

[[ aaaaaaaa =~ aa[a]aaa$ ]]; echo $?

works as seemingly do all other possible variants where (at least!) one of the repeated instances of "a" in the pattern is converted to a bracket expression containing that single character "a".

do not know whether it helps to understand the underlying issue but at least it seems this could serve to work around the problem since the patterns w/ and w/o the [a] bracketing change are equivalent.

the same work around seems to be valid for the original example using standard or extended glob patterns:

str=aaaa && echo "${str%%aa[a]}"   # ==> "a"
str=aaaa && echo "${str%%aa@(a)}"  # ==> "a"

and

str=aaaa && echo "${str%%{3}(a)}"  # ==> "a"

also works just fine.

@McDutchie
Copy link

Ping @JohnoKing @phidebian

I could really use some eyes on this bug, if/when you have the time. It's a defect in basic glob/regex functionality, and I am having trouble figuring out where even to start looking. The bug is somewhere in src/lib/libast/regex.

@phidebian
Copy link

@McDutchie I submitted a PR that address only (I think didn't double check) the trivial case, i.e
$ str=baaaa ; echo "${str%aaa}"

The original code looks broken with a bunch of calls to strngrpmatch() strgrpmatch() and varsub() that make no sens to me... I'm too scared to address the whole case you discovered, only fixed the original request.

@phidebian
Copy link

@sizur @McDutchie I withdrawn my ptevious PR and created a new one.

I was focusing on @sizur test case and overlooked @McDutchie additional test cases

Thoses test leds me to regcomp.c and the obscure
if ((x = env->stats.x) && x->re.string.size < 3)
I see no reasons for this <3 test beside breaking regcomp, so I removed it, and it solved all the cases, I would say magically solved.

Note that the next test in line t->re.trie.min < 3 has the same kind of esoteric <3 limit, since it doesn't break anything I left it...

I added tests to catch the bogus part.

So far so good, but I trust your eagle eyes to catch other test cases.

@phidebian
Copy link

PR bug-833-3 submitted, cleaned leftover, cscope files, and 1 cut/paste line in tests/bracket.sh

Hope this one is ok.

@McDutchie
Copy link

just fyi, a construct like

[[ aaaaaaaa =~ aa[a]aaa$ ]]; echo $?

works as seemingly do all other possible variants where (at least!) one of the repeated instances of "a" in the pattern is converted to a bracket expression containing that single character "a".

Unfortunately not. The variant above works, but prepend one a to the regex and the bug shows up again:

$ [[ aaaaaaaa =~ aaa[a]aaa$ ]]; echo $?
1

@McDutchie
Copy link

Note: further discussion at #844.

McDutchie added a commit that referenced this issue Apr 11, 2025
Yevgeniy Grigoryev (@sizur) reports:
>
>   str=aaaa && echo "${str%%aaa}"
>
> Expected:
>
>   a
>
> Observed:
>
>   aaaa

Other manifestations of it (these should all match):
  [[ aaaa =~ aaa$ ]]			# returns 1 (no match)
  [[ aaaaaaaaab =~ aaaaa[a]aab$ ]]	# returns 1 (no match)
  [[ aaaaaaaab =~ aaaa[a]aab$ ]]	# returns 1 (no match)
  [[ aaaaaaab =~ aaa[a]aab$ ]]		# returns 1 (no match)
...but if we increase the string/regex length difference by one:
  [[ aaaaaab =~ aa[a]aab$ ]]		# returns 0 (match)

@phidebian helpe me out and tracked down a major clue (thanks!):
> @sizur test case and @McDutchie additional test cases [...] leds
> me to regcomp.c and the obscure
>   if ((x = env->stats.x) && x->re.string.size < 3)
> I see no reasons for this <3 test beside breaking regcomp, so I
> removed it, and it solved all the cases, I would say magically
> solved.

This code is in the special() function in regcomp.c (2999-3225).
Going by the comment there, it looks like the special function is
an optimisation for certain special cases. It sets two types for
compiled regular expression nodes: REX_KMP, which enables string
searching using the Knuth-Morris-Pratt algorithm[*1], and REX_BM,
which enables string searching using the Boyer-Moore algorithm[*2].

@phidebian's proposed change amounts to x always being 0, which
simply disables the optimisation in this case.

Which made me think, what would happen if we deleted the whole
special() function and the two calls to it? So I tried that, and I
threw all the tests at it that I have, with none failing, and this
bug is fixed.

Also, when playing with various benchmarks, I detect no significant
performance regression, and in fact a slight performance increase
for common glob patterns.

Removing special() renders all the REX_KMP and REG_BM code
effectively dead, because it is special() that creates nodes of
these types. Most markedly, the entire regrexec() function becomes
dead code, because it is only usable for regex trees of type REG_BM
and simply returns REG_BADPAT (which is 2) for any other tree type.
(This means it is never used without a generic fallback.) So, there
is a lot more code that could be removed.

I conclude that it is not worth spending time on fixing this
optimisation bug. I think the best way to fix this bug is to remove
the KMP and BM optimisations, which according to my tests have a
dubious performance benefit anyway.

src/cmd/ksh93/sh/io.c:
- Fix minor unrelated issue: the return type of pat_line() should
  be size_t, not int.
- Remove regrexec() call from io_patseek(). Simplify the code
  accordingly (act like regrexec returned REG_BADPAT == 2).
- Remove pat_seek() which was a handler for the regrexec() call.

src/lib/libast/regex/{regcomp,regdecomp,regnexec,regstat}.c:
- Remove special() and related code.
- Remove code handling REX_BM and REX_KMP node types.

src/lib/libast/regex/regrecord.c,
src/lib/libast/regex/regrexec.c:
- Removed. These are dead code after removing special().

src/lib/libast/regex/reglib.h:
- Remove REX_BM and REG_KMP type macros.
- Remove now-unused Bm_t type and 'bm' field in Rex_t.

src/lib/libast/include/regex.h:
- Remove now-unused fields, macros and function declarations.
- Define backward compat macros for removed functions: regrecord
  returns 0 (regrexec can't be used) and regrexec returns
  REG_BADPAT (also indicating it can't be used).

src/lib/libast/features/api:
- Bump API version due to removal of regrexec() and regrecord().

src/cmd/ksh93/tests/{bracket,substring}.sh:
- Add regression tests by @phidebian.

Resolves; #833
Resolves: #844

[*1] https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
[*2] https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm
McDutchie added a commit that referenced this issue Apr 11, 2025
Yevgeniy Grigoryev (@sizur) reports:
>
>   str=aaaa && echo "${str%%aaa}"
>
> Expected:
>
>   a
>
> Observed:
>
>   aaaa

Other manifestations of it (these should all match):
  [[ aaaa =~ aaa$ ]]			# returns 1 (no match)
  [[ aaaaaaaaab =~ aaaaa[a]aab$ ]]	# returns 1 (no match)
  [[ aaaaaaaab =~ aaaa[a]aab$ ]]	# returns 1 (no match)
  [[ aaaaaaab =~ aaa[a]aab$ ]]		# returns 1 (no match)
...but if we increase the string/regex length difference by one:
  [[ aaaaaab =~ aa[a]aab$ ]]		# returns 0 (match)

@phidebian helpe me out and tracked down a major clue (thanks!):
> @sizur test case and @McDutchie additional test cases [...] leds
> me to regcomp.c and the obscure
>   if ((x = env->stats.x) && x->re.string.size < 3)
> I see no reasons for this <3 test beside breaking regcomp, so I
> removed it, and it solved all the cases, I would say magically
> solved.

This code is in the special() function in regcomp.c (2999-3225).
Going by the comment there, it looks like the special function is
an optimisation for certain special cases. It sets two types for
compiled regular expression nodes: REX_KMP, which enables string
searching using the Knuth-Morris-Pratt algorithm[*1], and REX_BM,
which enables string searching using the Boyer-Moore algorithm[*2].

@phidebian's proposed change amounts to x always being 0, which
simply disables the optimisation in this case.

Which made me think, what would happen if we deleted the whole
special() function and the two calls to it? So I tried that, and I
threw all the tests at it that I have, with none failing, and this
bug is fixed.

Also, when playing with various benchmarks, I detect no significant
performance regression, and in fact a slight performance increase
for common glob patterns.

Removing special() renders all the REX_KMP and REG_BM code
effectively dead, because it is special() that creates nodes of
these types. Most markedly, the entire regrexec() function becomes
dead code, because it is only usable for regex trees of type REG_BM
and simply returns REG_BADPAT (which is 2) for any other tree type.
(This means it is never used without a generic fallback.) So, there
is a lot more code that could be removed.

I conclude that it is not worth spending time on fixing this
optimisation bug. I think the best way to fix this bug is to remove
the KMP and BM optimisations, which according to my tests have a
dubious performance benefit anyway.

src/cmd/ksh93/sh/io.c:
- Fix minor unrelated issue: the return type of pat_line() should
  be size_t, not int.
- Remove regrexec() call from io_patseek(). Simplify the code
  accordingly (act like regrexec returned REG_BADPAT == 2).
- Remove pat_seek() which was a handler for the regrexec() call.

src/lib/libast/regex/{regcomp,regdecomp,regnexec,regstat}.c:
- Remove special() and related code.
- Remove code handling REX_BM and REX_KMP node types.

src/lib/libast/regex/regrecord.c,
src/lib/libast/regex/regrexec.c:
- Removed. These are dead code after removing special().

src/lib/libast/regex/reglib.h:
- Remove REX_BM and REG_KMP type macros.
- Remove now-unused Bm_t type and 'bm' field in Rex_t.

src/lib/libast/include/regex.h:
- Remove now-unused fields, macros and function declarations.
- Define backward compat macros for removed functions: regrecord
  returns 0 (regrexec can't be used) and regrexec returns
  REG_BADPAT (also indicating it can't be used).

src/lib/libast/features/api:
- Bump API version due to removal of regrexec() and regrecord().

src/cmd/ksh93/tests/{bracket,substring}.sh:
- Add regression tests by @phidebian.

Resolves; #833
Resolves: #844

[*1] https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
[*2] https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm
@McDutchie
Copy link

Fixed in b72d7eb (the bug was not automatically closed because of a typo in the commit message: Resolves; instead of Resolves:).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is not working
Projects
None yet
Development

No branches or pull requests

4 participants