Not encoding chars correctly #5

Open
judocode opened this issue Apr 21, 2016 · 4 comments


@judocode

judocode commented Apr 21, 2016

I am using

var doc = Document.FromString("...");
doc.CleanAndRepair();
string output = doc.Save();

and it turns characters such as ’ “ ” into the replacement character �. I tried several different encoding types, but none of them fixed it.

@frandi
Owner

frandi commented Apr 21, 2016

@jrenton Where did you write the output to? The console? A file?

@judocode
Author

Neither. Just inspecting it in the debugger, the output string already contains the �. I was able to avoid the issue by first converting my original string to a stream and using Document.FromStream.
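For anyone who wants to try the same workaround, here is a minimal sketch of the string-to-stream step. The helper name ToUtf8Stream and the class TidyInput are made up for illustration and are not part of the library; the commented usage only relies on the calls already mentioned in this thread (Document.FromStream, CleanAndRepair, Save) and assumes they behave as described here.

```csharp
using System.IO;
using System.Text;

public static class TidyInput
{
    // Hypothetical helper (not part of the library): encode the markup as
    // UTF-8 without a byte order mark and expose the bytes as a stream.
    public static MemoryStream ToUtf8Stream(string html)
    {
        byte[] bytes = new UTF8Encoding(false).GetBytes(html);
        return new MemoryStream(bytes);
    }
}

// Assumed usage, based on the calls named in this thread:
//
//   var doc = Document.FromStream(TidyInput.ToUtf8Stream(html));
//   doc.CleanAndRepair();
//   string output = doc.Save();
```

Passing false to the UTF8Encoding constructor keeps a byte order mark out of the input, so the stream starts directly with the markup.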

@Nyerguds

Nyerguds commented Jul 5, 2016

The actual problem is probably that the [DllImport("tidy.dll")] declaration for tidyParseString needs its CharSet property set explicitly, so that the string encoding is the same on both sides. According to MSDN, it defaults to Ansi in C#.

What encoding does the original DLL normally expect its strings to be in? The only choices available seem to be "Ansi" and "Unicode" (which is UTF-16), while the errors in the result seem to point towards ANSI-to-UTF-8 corruption.
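To illustrate the kind of corruption I mean, here is a small standalone sketch that does not touch Tidy at all. It assumes the string reaches the native side as Windows-1252 bytes, where ’ “ ” are the single bytes 0x92, 0x93 and 0x94, and that those bytes are then read as UTF-8. The class and method names are invented for the example.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Interpret single-byte ANSI (Windows-1252) data as if it were UTF-8.
    public static string DecodeAnsiBytesAsUtf8(byte[] ansiBytes)
    {
        return Encoding.UTF8.GetString(ansiBytes);
    }

    public static void Main()
    {
        // Windows-1252 bytes for ’ “ ” (hard-coded, so no code page lookup is needed).
        byte[] ansi = { 0x92, 0x93, 0x94 };

        string decoded = DecodeAnsiBytesAsUtf8(ansi);

        // Print the code point of every decoded character.
        foreach (char c in decoded)
        {
            Console.WriteLine("U+{0:X4}", (int)c);
        }
    }
}
```

None of those bytes can start a valid UTF-8 sequence, so the decoder substitutes U+FFFD, which is the � reported above.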

The input encoding set in the Document.cs source was UTF-8, though. I haven't experimented with this yet, but if that setting is actually used for String input as well, setting the DllImport CharSet to Unicode and the input encoding to UTF-16-LE should solve it.

@Nyerguds

Nyerguds commented Jul 6, 2016

Well, that didn't work. The input encoding seems to be ignored for String input, so setting the DllImport's CharSet to Unicode just messes it up even more.

In Document.cs, you can fix all this simply by treating the String input as a stream as well. Just replace the String constructor with this:

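// Requires "using System.Text;" for UTF8Encoding.
Document(string htmlString)
    : this()
{
    // Encode the string as UTF-8 without a BOM and hand it over as a stream,
    // so String input takes the same path as Stream input.
    this.stream = new System.IO.MemoryStream(new UTF8Encoding(false).GetBytes(htmlString));
    // Declare the stream contents as UTF-8.
    this.InputCharacterEncoding = EncodingType.Utf8;
}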

This also conveniently reduces CleanAndRepair() to just four lines.

On a related note: this prompted me to remove the htmlString and fromString variables completely, which made me notice something else. Save(Stream stream) forces the output encoding to UTF-8 if fromString is set. That shouldn't happen. It is entirely up to the programmer to interpret the bytes in the Stream, so one would expect to control that by setting OutputCharacterEncoding. The fact that the input originally came from a string has nothing to do with it.
