The documentation says:
--encoding: Encoding to use for reading text files [default: utf-8]
But different files have different encodings. A Chinese PDF is read correctly and its characters display correctly, but a .txt file in the same folder that is encoded in GB2312 is garbled in both the search results and the file display.
It should probably default to detecting the encoding of each file independently and then converting the text internally to whatever the embedding expects (UTF-8?).
Yep, you're absolutely right. This should be granular on a per-file basis. I can look into auto-detecting the encoding, but that might be time-consuming to run on every file, and it might be error-prone. In any case, v0.2 should have better controls for customizing how Semantra works per file.
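For illustration, here is a minimal sketch of what per-file handling could look like. It is not Semantra's code, and it does not do statistical detection: it just tries a short, ordered list of candidate encodings on each file and keeps the first one that decodes cleanly. The function name `read_text_any_encoding` and the candidate list are assumptions made up for this example. GB18030 is used as the fallback because it is a superset of GB2312 and GBK.

```python
from pathlib import Path

# Hypothetical candidate list; order matters because the first clean decode wins.
# UTF-8 goes first since it is strict and rarely accepts other encodings by accident.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030")


def read_text_any_encoding(path, candidates=CANDIDATE_ENCODINGS):
    """Return (text, encoding_used) for one file, trying each candidate in order."""
    raw = Path(path).read_bytes()
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8"
```

The returned `str` can then be passed on as normal Unicode text regardless of how the file was stored on disk. A fallback chain like this is cheap, but it can still pick the wrong encoding when several candidates decode without error, so an explicit per-file override would remain useful.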
chardet might be an option for the detection: https://pypi.org/project/chardet/