Encoding Issues Fixed

Posted 9 October 2007 by

I think that I've fixed most of the encoding issues that I am aware of. I had to edit a few lines of MT code and add some new logic to my MT-Dispatcher. Parts of the database are still "corrupted" because of the bug. I can fix most of it, but I won't do it right away.

37 Comments

Reed A. Cartwright is not Torbjörn · 9 October 2007

Checking Torbjörn encoding

Reed A. Cartwright is not Torbjörn · 9 October 2007

Checking

David Fickett-Wilbar · 9 October 2007

Am I the only one bothered by the unnecessary apostrophe in "Panda's Only?"

Reed A. Cartwright is not Torbjörn · 9 October 2007

David Fickett-Wilbar: Am I the only one bothered by the unnecessary apostrophe in "Panda's Only?"
Talk to the Alaska DOT. Personally, I think it refers to "Panda's Only Ole-Fashioned Malt Liqueur"

Ichthyic · 9 October 2007

in my browser (latest firefox), the accented o in Torbjorn looks like a smeary diamond shaped road sign.

Am I the only one bothered by the unnecessary apostrophe in “Panda’s Only?”

that's the way the sign was printed; it's not an encoding issue, it's an actual photograph of a sign, IIRC.

Reed A. Cartwright is not Torbjörn · 9 October 2007

testing

Reed A. Cartwright is not Torbjörn · 9 October 2007

bumpage

Ichthyic · 9 October 2007

still seeing the same "diamond shape" on my end, for whatever that's worth.

Ichthyic · 9 October 2007

...copying and pasting shows it to actually be a black diamond with a question mark in the middle.

is there perhaps something that needs to be enabled within the browser to see foreign characters?

Ichthyic · 9 October 2007

In IE, it shows up as a blank white box.

Häggström is not Reed A. Cartwright · 9 October 2007

No. For some reason, on some pages, the blog software is converting some characters from UTF-8 encoding to ISO-8859-1 encoding. This is a new error.

Ichthyic · 9 October 2007

hmm. ISO-8859-1 is the default character encoding used by firefox when the webpage itself does not define which to use.

if that helps any, it sounds like the script is not putting in a definition for which character encoding package to use when it builds the page?

Ichthyic · 9 October 2007

no. that can't be, since if i force the browser to use UTF-8, it still looks the same

Ichthyic · 9 October 2007

interestingly, it appears correctly over on the "recent comments" section.

Häggström is not Reed A. Cartwright · 9 October 2007

The problem is that they are ISO-8859-1 characters on a UTF-8 page. They show up correctly if in firefox you do "View:Character Encoding:Western".
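
For anyone who wants to reproduce the "black diamond with a question mark" locally, here is a minimal Python sketch of ISO-8859-1 bytes being read as UTF-8. It only illustrates the mismatch described above. It is not the MT code, and the variable names are made up for the example.

```python
# A name containing a non-ASCII character, as in the test comments above.
name = "Torbjörn"

# Bytes as they would look if the text were written out as ISO-8859-1.
latin1_bytes = name.encode("iso-8859-1")

# A page declared as UTF-8 makes the browser decode those bytes as UTF-8.
# The lone 0xF6 byte is not valid UTF-8, so it becomes U+FFFD,
# the replacement character that is usually drawn as a diamond with "?".
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Switching the view to "Western" decodes the same bytes as ISO-8859-1.
as_western = latin1_bytes.decode("iso-8859-1")

print(as_utf8)
print(as_western)
```

Decoding the same bytes two ways mirrors what happens when the view is switched between UTF-8 and Western in the browser.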

Ichthyic · 9 October 2007

sure enough.

that being the case, why is the script for the page defining the character set to use to be UTF-8?

I see that the script builds the UTF definition into the header, can't you just change that to the standard western ISO?

Ichthyic · 9 October 2007

no, better yet, looking at the form code (I assume it needs to have UTF-8 characters defined for it?), it should work to simply strip the reference to the code definition from the portion of the page script that generates the header.

leave the header definition blank, and let the specific references take care of themselves.

IOW, where the script is set to generate this tag:

meta http-equiv="Content-Type" content="text/html; charset=utf-8"

simply strip out the reference to the charset entirely.

that shouldn't fubar any references to using a specific character set in later instances, yes?

Häggström is not Reed A. Cartwright · 9 October 2007

I'm not going to do that.

Ichthyic · 9 October 2007

hmm, I just tried that on a local copy of this page, and it correctly defaults to the ISO standard, but then the foreign characters show up as just plain question marks.

so if the charset isn't defined in the primary header, it does seem to break the resulting form data.

but then, I'm not actually accessing the database to generate the page, so it might still work on your end?

Ichthyic · 9 October 2007

I’m not going to do that.

good luck to you then. you do need to somehow force the main body content NOT to use the UTF8 encoding, however.

Torbjörn Larsson, OM · 9 October 2007

I suspect these two issues (encoding forcing and recoding UTF as ISO) are general problems for many default blog scripts.

Some ScienceBlogs have these issues (more often the former), and I specifically remember GMBM having the current problem fixed (as he is a CS he couldn't leave it alone :-). For some reason it is the name input box, so it may be some industry error that has spread by inheritance.

In any case, the current situation is similar to what happens on quite a few blogs.

Forcing of change in view (in Firefox) often happens when I comment, I think. Dunno how it happens, and I have the habit of reading the next blog while waiting for the comment update, so it may be unrelated.

As we have reached the point of diminishing returns, being a small problem for a few individuals, I can live with the current situation.

Ichthyic · 9 October 2007

don't you have a backup version of the site to play with?

Häggström is not Reed A. Cartwright · 9 October 2007

The current issue has to do with how pages are written from the database. Everything is being entered properly, but is being output improperly.

MT has some code dealing with encoding issues and I am certain that the encoding fixes that I put together earlier today are now conflicting with one of those encoding checks.

Reed A. Cartwright · 10 October 2007

I made another patch to the MT code and it appears to fix the final encoding issues.

David Fickett-Wilbar · 10 October 2007

Am I the only one bothered by the unnecessary apostrophe in “Panda’s Only?” that's the way the sign was printed; it's not an encoding issue, it's an actual photograph of a sign, IIRC.
I'm sure it is. I was just being crotchety, and this seemed as good a thread as any to be it on.

Popper's Ghost · 10 October 2007

why is the script for the page defining the character set to use to be UTF-8?

Presumably so it can display characters from a wide range of languages, not just the European ones covered by ISO-8859-1.
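
A rough Python sketch of that point, checking which strings can be represented in ISO-8859-1 at all. The sample strings and the helper name `fits_latin1` are just assumptions for illustration.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character in text exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

# Western European names fit; Chinese and Cyrillic text does not.
samples = ["Torbjörn", "Häggström", "熊猫", "Панда"]

for s in samples:
    # UTF-8 can encode every sample; ISO-8859-1 only some of them.
    utf8_length = len(s.encode("utf-8"))
    print(s, fits_latin1(s), utf8_length)
```

UTF-8 encodes all four samples, at the cost of using more than one byte for each non-ASCII character.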

Ichthyic · 10 October 2007

appears fixed today.

whee!

nice job, Reed.

Torbjörn Larsson, OM is not Reed · 11 October 2007

Yes, even the crotchety home page shows UTF-8 OK.

Thanks for all the hard work, Reed!

An internationalized web site is all the better. (Well, it needs content posters from at least 3 time zones distributed across the globe to be really editorially international, but you know what I mean.)

And now I know that MT may handle database issues somewhat poorly. (Doesn't seem the likeliest explanation for ScienceBlogs' varying issues, though.)

Bill Gascoyne · 11 October 2007

Speaking of encoding issues, how is one supposed to create a carriage return without leaving a blank line? HTML-style "br between angle brackets" doesn't work.

Popper's Ghost · 11 October 2007

The CR handling does seem to be busted; not even <code> or <pre> protects them.

Reed A. Cartwright · 11 October 2007

Bill Gascoyne: Speaking of encoding issues, how is one supposed to create a carriage return without leaving a blank line? HTML-style "br between angle brackets" doesn't work.
Then try XHTML-style <br/>.
Popper's Ghost: The CR handling does seem to be busted; not even <code> or <pre> protects them.
<pre> is not recognized. <code> now does the same thing as XHTML's code tag. <blockcode> is what you are looking for.

Popper's Ghost · 11 October 2007

Thanks Reed, for the response and all your hard work.

Bill Gascoyne · 11 October 2007

OK, I'll try

that and see if

it works.

Torbjörn Larsson, OM · 15 October 2007

Testing

and

learning.

So,

it

is

evidently

time

to

learn

some

XHTML!

Henry J · 17 October 2007

So, it is evidently time
to learn some XHTML!

Just as long as there's not a quiz on the stuff... :p