Optimise strings.ToLower() #1170

soujanyanmbri · 2024-10-09T13:45:45Z

Thank you for contributing to Coraza WAF, your effort is greatly appreciated
Before submitting check if what you want to add to coraza list meets quality standards before sending pull request. Thanks!

Make sure that you've checked the boxes below before you submit PR:

My code includes positive and negative tests.
I have an appropriate description with correct grammar.
I have read Contribution guidelines, maintainers note and Quality standard.
My code is properly linted and passes pre-commit tests.

Thanks for your contribution ❤️

soujanyanmbri · 2024-10-09T13:49:29Z

Benchmark:

BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Fully_Lowercase_Sentence-10         	40185970	        29.50 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Fully_Uppercase_Sentence-10         	17557453	        66.52 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Mixed_Case_Sentence-10              	18643068	        64.06 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Non-Alphabetic_Characters-10        	46125682	        26.00 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Greek_Alphabet-10                 	29133343	        41.43 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Cyrillic_Alphabet-10              	26644413	        44.82 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Mixed_Greek_and_ASCII-10          	18659374	        66.00 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Emoji-10                          	19809511	        59.83 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Empty_String-10                           	570688939	         2.225 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Only_Punctuation-10                       	196423014	         6.113 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Only_Whitespace-10                        	233590377	         5.154 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Long_Mixed_Case_String-10                 	 6503064	       183.7 ns/op	     224 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Special_Turkish_Case-10                   	16458063	        72.57 ns/op	      80 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/ASCII_Fully_Lowercase_Sentence-10             	15182882	        80.65 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/ASCII_Fully_Uppercase_Sentence-10             	14452549	        82.19 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/ASCII_Mixed_Case_Sentence-10                  	14902519	        81.55 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/ASCII_Non-Alphabetic_Characters-10            	16164508	        75.16 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Unicode_Greek_Alphabet-10                     	12417586	       103.0 ns/op	     192 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Unicode_Cyrillic_Alphabet-10                  	10973074	       103.6 ns/op	     192 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Unicode_Mixed_Greek_and_ASCII-10              	15417500	        78.68 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Unicode_Emoji-10                              	15793966	        77.65 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Empty_String-10                               	232960081	         5.279 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Only_Punctuation-10                           	99651907	        12.02 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Only_Whitespace-10                            	100000000	        11.05 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Long_Mixed_Case_String-10                     	 6063015	       195.0 ns/op	     448 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Special_Turkish_Case-10                       	13682554	        99.41 ns/op	     160 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Fully_Lowercase_Sentence-10            	20465416	        62.36 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Fully_Uppercase_Sentence-10            	 7061458	       172.2 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Mixed_Case_Sentence-10                 	 9851926	       129.9 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Non-Alphabetic_Characters-10           	24569835	        50.57 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Greek_Alphabet-10                    	 1287247	       929.4 ns/op	      96 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Cyrillic_Alphabet-10                 	 1000000	      1035 ns/op	      96 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Mixed_Greek_and_ASCII-10             	 2851934	       421.8 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Emoji-10                             	 4843082	       246.4 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Empty_String-10                              	568174978	         2.111 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Only_Punctuation-10                          	84454507	        14.13 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Only_Whitespace-10                           	100000000	        11.25 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Long_Mixed_Case_String-10                    	 3768148	       318.0 ns/op	     224 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Special_Turkish_Case-10                      	 3724746	       318.9 ns/op	      80 B/op	       1 allocs/op

codecov · 2024-10-09T13:49:48Z

Codecov Report

Attention: Patch coverage is 97.05882% with 2 lines in your changes missing coverage. Please review.

Project coverage is 83.56%. Comparing base (4ff1f76) to head (45223de).
Report is 98 commits behind head on main.

Files with missing lines	Patch %	Lines
internal/seclang/rule_parser.go	71.42%	0 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1170      +/-   ##
==========================================
+ Coverage   82.72%   83.56%   +0.84%     
==========================================
  Files         162      166       +4     
  Lines        9080     9572     +492     
==========================================
+ Hits         7511     7999     +488     
- Misses       1319     1323       +4     
  Partials      250      250

Flag	Coverage Δ
default	`83.56% <97.05%> (+5.72%)`	⬆️
examples	`83.56% <97.05%> (+57.13%)`	⬆️
ftw	`83.56% <97.05%> (+36.19%)`	⬆️
ftw-multiphase	`83.56% <97.05%> (+34.02%)`	⬆️
tinygo	`83.56% <97.05%> (+8.16%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

jcchavezs · 2024-10-09T18:57:50Z

internal/strings/strings.go

+// asciiToLowerInPlace converts ASCII characters in the string to lowercase
+// starting from the specified index.
+func asciiToLowerInPlace(s string, start int) string {
+	res := []byte(s)


Not sure if you have access to medium but this can be done more efficiently https://josestg.medium.com/140x-faster-string-to-byte-and-byte-to-string-conversions-with-zero-allocation-in-go-200b4d7105fc

"returned byte slice MUST NOT be modified since it shares the same backing array with the given string."
Not sure, if this is the best use here, i tried to copy the byte slice before modifying it, and it is causing unexpected behaviour.

internal/strings/strings.go

M4tteoP

I'm not per se against this change (performance is a key concept), but I would like to get a better understanding of the tradeoff we have in order to gain this extra performance.

Being just meant for ASCII chars, what are the transformations that we would not perform anymore? (First of all, I'm thinking at chars like ÀÈÉÖ).
Should we restrict the usage of the new stringsutil.AsciiToLower for comparisons where we expect to deal with ASCII and keep the standard one to deal with generic inputs? (but I get that it would not really be that beneficial in terms of performance gains)
Otherwise, we should consider it a breaking change, shouldn't we? Running the official ToLower tests with the new function might give more clarity about the behavior's change: https://github.com/golang/go/blob/master/src/strings/strings_test.go#L640-L650.

Thanks!

jcchavezs · 2024-10-10T10:25:33Z

@M4tteoP I think the main assumption here is that variable keys and collection names will always be ascii.

soujanyanmbri · 2024-10-10T10:35:02Z

Hey @M4tteoP ,
I've added a test case for unicode cases by the way:

coraza/internal/strings/strings_test.go

Line 107 in dd8793d

    
           {"Unicode Unchanged", "Привет Мир", "Привет Мир"},                   // Unicode characters remain unchanged

As @jcchavezs mentioned, the main assumption here is that variable keys and collection names will always be ascii, and this should not cause coraza to miss any evasions as well.

Optimise strings.lower()

aaed653

soujanyanmbri requested a review from a team as a code owner October 9, 2024 13:45

jcchavezs reviewed Oct 9, 2024

View reviewed changes

internal/strings/strings.go Outdated Show resolved Hide resolved

M4tteoP reviewed Oct 9, 2024

View reviewed changes

Resolve comment

dd8793d

soujanyanmbri requested review from jcchavezs and M4tteoP October 10, 2024 11:05

soujanyanmbri added 2 commits October 16, 2024 13:16

Merge branch 'main' into main

d01a687

Merge branch 'corazawaf:main' into main

45223de

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimise strings.ToLower() #1170

Optimise strings.ToLower() #1170

soujanyanmbri commented Oct 9, 2024 •

edited

Loading

soujanyanmbri commented Oct 9, 2024

codecov bot commented Oct 9, 2024 •

edited

Loading

jcchavezs Oct 9, 2024

soujanyanmbri Oct 10, 2024

M4tteoP left a comment •

edited

Loading

jcchavezs commented Oct 10, 2024

soujanyanmbri commented Oct 10, 2024

Optimise strings.ToLower() #1170

Are you sure you want to change the base?

Optimise strings.ToLower() #1170

Conversation

soujanyanmbri commented Oct 9, 2024 • edited Loading

soujanyanmbri commented Oct 9, 2024

codecov bot commented Oct 9, 2024 • edited Loading

Codecov Report

jcchavezs Oct 9, 2024

Choose a reason for hiding this comment

soujanyanmbri Oct 10, 2024

Choose a reason for hiding this comment

M4tteoP left a comment • edited Loading

Choose a reason for hiding this comment

jcchavezs commented Oct 10, 2024

soujanyanmbri commented Oct 10, 2024

soujanyanmbri commented Oct 9, 2024 •

edited

Loading

codecov bot commented Oct 9, 2024 •

edited

Loading

M4tteoP left a comment •

edited

Loading