Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise strings.ToLower() #1170

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Conversation

soujanyanmbri
Copy link
Contributor

@soujanyanmbri soujanyanmbri commented Oct 9, 2024

Thank you for contributing to Coraza WAF, your effort is greatly appreciated
Before submitting check if what you want to add to coraza list meets quality standards before sending pull request. Thanks!

Make sure that you've checked the boxes below before you submit PR:

Thanks for your contribution ❤️

@soujanyanmbri soujanyanmbri requested a review from a team as a code owner October 9, 2024 13:45
@soujanyanmbri
Copy link
Contributor Author

Benchmark:

BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Fully_Lowercase_Sentence-10         	40185970	        29.50 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Fully_Uppercase_Sentence-10         	17557453	        66.52 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Mixed_Case_Sentence-10              	18643068	        64.06 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/ASCII_Non-Alphabetic_Characters-10        	46125682	        26.00 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Greek_Alphabet-10                 	29133343	        41.43 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Cyrillic_Alphabet-10              	26644413	        44.82 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Mixed_Greek_and_ASCII-10          	18659374	        66.00 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Unicode_Emoji-10                          	19809511	        59.83 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Empty_String-10                           	570688939	         2.225 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Only_Punctuation-10                       	196423014	         6.113 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Only_Whitespace-10                        	233590377	         5.154 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Long_Mixed_Case_String-10                 	 6503064	       183.7 ns/op	     224 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/AsciiToLower_Implementation/Special_Turkish_Case-10                   	16458063	        72.57 ns/op	      80 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/ASCII_Fully_Lowercase_Sentence-10             	15182882	        80.65 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/ASCII_Fully_Uppercase_Sentence-10             	14452549	        82.19 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/ASCII_Mixed_Case_Sentence-10                  	14902519	        81.55 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/ASCII_Non-Alphabetic_Characters-10            	16164508	        75.16 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Unicode_Greek_Alphabet-10                     	12417586	       103.0 ns/op	     192 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Unicode_Cyrillic_Alphabet-10                  	10973074	       103.6 ns/op	     192 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Unicode_Mixed_Greek_and_ASCII-10              	15417500	        78.68 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Unicode_Emoji-10                              	15793966	        77.65 ns/op	     128 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Empty_String-10                               	232960081	         5.279 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Only_Punctuation-10                           	99651907	        12.02 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Only_Whitespace-10                            	100000000	        11.05 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Long_Mixed_Case_String-10                     	 6063015	       195.0 ns/op	     448 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Manual_ASCII_Conversion/Special_Turkish_Case-10                       	13682554	        99.41 ns/op	     160 B/op	       2 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Fully_Lowercase_Sentence-10            	20465416	        62.36 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Fully_Uppercase_Sentence-10            	 7061458	       172.2 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Mixed_Case_Sentence-10                 	 9851926	       129.9 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/ASCII_Non-Alphabetic_Characters-10           	24569835	        50.57 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Greek_Alphabet-10                    	 1287247	       929.4 ns/op	      96 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Cyrillic_Alphabet-10                 	 1000000	      1035 ns/op	      96 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Mixed_Greek_and_ASCII-10             	 2851934	       421.8 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Unicode_Emoji-10                             	 4843082	       246.4 ns/op	      64 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Empty_String-10                              	568174978	         2.111 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Only_Punctuation-10                          	84454507	        14.13 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Only_Whitespace-10                           	100000000	        11.25 ns/op	       0 B/op	       0 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Long_Mixed_Case_String-10                    	 3768148	       318.0 ns/op	     224 B/op	       1 allocs/op
BenchmarkAsciiVsUnicodeCaseString/Standard_Unicode_ToLower/Special_Turkish_Case-10                      	 3724746	       318.9 ns/op	      80 B/op	       1 allocs/op

Copy link

codecov bot commented Oct 9, 2024

Codecov Report

Attention: Patch coverage is 97.05882% with 2 lines in your changes missing coverage. Please review.

Project coverage is 83.56%. Comparing base (4ff1f76) to head (45223de).
Report is 98 commits behind head on main.

Files with missing lines Patch % Lines
internal/seclang/rule_parser.go 71.42% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1170      +/-   ##
==========================================
+ Coverage   82.72%   83.56%   +0.84%     
==========================================
  Files         162      166       +4     
  Lines        9080     9572     +492     
==========================================
+ Hits         7511     7999     +488     
- Misses       1319     1323       +4     
  Partials      250      250              
Flag Coverage Δ
default 83.56% <97.05%> (+5.72%) ⬆️
examples 83.56% <97.05%> (+57.13%) ⬆️
ftw 83.56% <97.05%> (+36.19%) ⬆️
ftw-multiphase 83.56% <97.05%> (+34.02%) ⬆️
tinygo 83.56% <97.05%> (+8.16%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

// asciiToLowerInPlace converts ASCII characters in the string to lowercase
// starting from the specified index.
func asciiToLowerInPlace(s string, start int) string {
res := []byte(s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"returned byte slice MUST NOT be modified since it shares the same backing array with the given string."
Not sure, if this is the best use here, i tried to copy the byte slice before modifying it, and it is causing unexpected behaviour.

Copy link
Member

@M4tteoP M4tteoP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not per se against this change (performance is a key concept), but I would like to get a better understanding of the tradeoff we have in order to gain this extra performance.

  • Being just meant for ASCII chars, what are the transformations that we would not perform anymore? (First of all, I'm thinking at chars like ÀÈÉÖ).
  • Should we restrict the usage of the new stringsutil.AsciiToLower for comparisons where we expect to deal with ASCII and keep the standard one to deal with generic inputs? (but I get that it would not really be that beneficial in terms of performance gains)
  • Otherwise, we should consider it a breaking change, shouldn't we? Running the official ToLower tests with the new function might give more clarity about the behavior's change: https://github.com/golang/go/blob/master/src/strings/strings_test.go#L640-L650.

Thanks!

@jcchavezs
Copy link
Member

@M4tteoP I think the main assumption here is that variable keys and collection names will always be ascii.

@soujanyanmbri
Copy link
Contributor Author

Hey @M4tteoP ,
I've added a test case for unicode cases by the way:

{"Unicode Unchanged", "Привет Мир", "Привет Мир"}, // Unicode characters remain unchanged

As @jcchavezs mentioned, the main assumption here is that variable keys and collection names will always be ascii, and this should not cause coraza to miss any evasions as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants