Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend allowed characters' set #93

Closed

Conversation

kiriakosv
Copy link

Resolves #92.

This refactoring will allow for an easier addition/deletion of allowed
characters, without directly manipulating the regular expression.
@kiriakosv
Copy link
Author

This refactoring could facilitate adding more characters to the allowed set. @flooose what do you think?

@coveralls
Copy link

Coverage Status

Coverage remained the same at 100.0% when pulling bdd9ee6 on kiriakosv:extend-allowed-characters into 38f0552 on salesking:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage remained the same at 100.0% when pulling bdd9ee6 on kiriakosv:extend-allowed-characters into 38f0552 on salesking:master.

@flooose
Copy link

flooose commented Mar 13, 2021

I seems like it would be fine, but I'd be interested what happens in terms of performance when we start adding more characters to allowed_chars.txt. My understanding of the xls file was that all of the characters marked in green should be allowed. That would be a large regex :)

Any idea of how to test this? Have you tried adding just the Greek alphabet to see if there are any performance hits?

@kiriakosv
Copy link
Author

I ran the following code:

require 'benchmark'

STRING_LENGTH = 100_000

regex_filter = /[^a-zA-Z0-9ÄÖÜäöüß&*$%\ \'\:\?\,\-\(\+\.\)\/]/
string_filter = "^#{Regexp.escape("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789ÄÖÜäöüß&*$% ':?,-(+.)/")}"

regex_filter_with_greek = /[^a-zA-ZΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικλμνξοπρστυφχψωάόίήώύΆΉΏΎ0-9ÄÖÜäöüß&*$%\ \'\:\?\,\-\(\+\.\)\/]/
string_filter_with_greek = "^#{Regexp.escape("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικλμνξοπρστυφχψωάόίήώύΆΉΏΎ0123456789ÄÖÜäöüß&*$% ':?,-(+.)/")}"

charset = Array('A'..'Z') + Array('a'..'z') + Array(0..9) + %w[ÄÖÜäöüß&*$%] + %w[!^@~\\] + %w[ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικλμνξοπρστυφχψωάόίήώύΆΉΏΎ]
my_string = Array.new(STRING_LENGTH) { charset.sample }.join

results = Benchmark.bm do |x|
  x.report('gsub with greek:') { my_string.gsub(regex_filter_with_greek, '') }
  x.report('gsub no greek:') { my_string.gsub(regex_filter, '') }
  x.report('tr with greek:') { my_string.tr(string_filter_with_greek, '') }
  x.report('tr no greek:') { my_string.tr(string_filter, '') }
end

For a STRING_LENGTH of 100 I get:

implementation user system total real
gsub with greek: 0.000020 0.000002 0.000022 ( 0.000018)
gsub no greek: 0.000043 0.000001 0.000044 ( 0.000044)
tr with greek: 0.000020 0.000000 0.000020 ( 0.000020)
tr no greek: 0.000007 0.000001 0.000008 ( 0.000008)

For a STRING_LENGTH of 100.000 I get:

implementation user system total real
gsub with greek: 0.018775 0.000671 0.019446 ( 0.023128)
gsub no greek: 0.040723 0.000317 0.041040 ( 0.053153)
tr with greek: 0.008596 0.000172 0.008768 ( 0.011342)
tr no greek: 0.001963 0.000133 0.002096 ( 0.002097)

What I find somewhat surprising, is the fact that gsub actually improves relatively when the greek characters are added. I ran these multiple times and the improvement was consistent. Having said that, I'm not familiar with the implementation of gsub.

@flooose
Copy link

flooose commented Mar 14, 2021

Interesting. I guess, let's see what the maintainers say.

@flooose
Copy link

flooose commented Mar 22, 2021

Ping. Maybe no one is responding because the pull request is still a "draft"?

@kiriakosv kiriakosv marked this pull request as ready for review March 22, 2021 10:05
@kiriakosv
Copy link
Author

Closed in favour of #105.

@kiriakosv kiriakosv closed this Feb 28, 2022
@kiriakosv kiriakosv deleted the extend-allowed-characters branch March 9, 2022 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extend allowed characters in InstanceMethods.convert_text
3 participants