Why do some words look weird on RePEc sites?

When you browse through the various RePEc sites, you may come across some strange words or names, like GonzÃ¡lez, suÂ¢ cient or MÃ¶ller. Why do those appear? To get to the bottom of this, one first has to understand how the RePEc sites get their content. All of it comes directly from publishers, about 2000 of them, who make all the relevant information available on their respective sites. To do so, they follow instructions and put files with a particular layout on their FTP or web sites.

These files are supposed to be simple text files, not formatted like they would be with Microsoft Word or LibreOffice. That should make them easy to handle with automated scripts. Unfortunately, this ignores the pesky issue of character encoding. Every operating system or piece of software assumes that a particular character encoding is the standard, which is fine until a file moves from one computer to another. Early on, the files used in RePEc were assumed to be encoded as ISO-LATIN-1 or Windows-1252 by default. Back in 1997, UTF-8 (“Unicode”) was rare. Yet there is still the option to force RePEc scripts to assume UTF-8 by adding a byte-order mark (“BOM”) at the start of the file, which signals that the file does not use the default encoding.
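The decision rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the actual RePEc code: the function name `decode_redif` is made up for this example, and using Windows-1252 (`cp1252`) as the single fallback is an assumption.

```python
import codecs


def decode_redif(raw: bytes) -> str:
    """Hypothetical helper: decode file contents the way the post describes."""
    # A UTF-8 byte-order mark is the three bytes EF BB BF at the very start.
    if raw.startswith(codecs.BOM_UTF8):
        # BOM present: strip it and treat the rest as UTF-8.
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # No BOM: fall back to the legacy default (assumed here to be Windows-1252).
    # errors="replace" avoids a crash on the few bytes cp1252 leaves undefined.
    return raw.decode("cp1252", errors="replace")
```

With this rule, a file that really is Windows-1252 reads correctly without a BOM, and a UTF-8 file reads correctly only if it starts with the BOM.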

Now UTF-8 has become much more prevalent, and publishers sometimes put UTF-8 encoded data in files without the BOM, especially for files created by scripts. RePEc then interprets the data as ISO-LATIN-1 or Windows-1252, and the output can then look strange for any character that is outside the restricted ASCII set (unaccented letters, digits, and basic punctuation). For example, any accented characters like é, ñ, ç, and ü will look odd if wrongly encoded. The same applies to ligatures like æ, ﬃ, and ß, non-Western alphabets, and some punctuation used in Microsoft Word, such as curly quotes.
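To see where the strange characters come from, one can reproduce the mix-up in Python: encode a name as UTF-8, then read the bytes back as Windows-1252. The function name below is invented for this illustration.

```python
def misread_utf8_as_cp1252(text: str) -> str:
    """Show what UTF-8 text looks like when read as Windows-1252."""
    # In UTF-8, each accented letter becomes two bytes; Windows-1252 shows
    # each of those bytes as a separate character.
    return text.encode("utf-8").decode("cp1252", errors="replace")


for name in ("González", "Möller", "Smith"):
    print(name, "->", misread_utf8_as_cp1252(name))
# González -> GonzÃ¡lez
# Möller -> MÃ¶ller
# Smith -> Smith
```

Plain ASCII names survive untouched because ASCII characters have the same single-byte values in both encodings, which is why the problem only shows up on some records.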

As a RePEc publisher, how can you fix your poorly encoded UTF-8 data? There are two solutions. Either add the BOM at the start of the data, or use the new .redif extension which assumes UTF-8. But if you convert from .rdf to .redif, make sure to delete the old .rdf file(s), or your records will come up as duplicated and thus become invalid. And remember: no HTML encoding in your files.
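For the first solution, a small script can add the BOM to existing files. The following is a minimal sketch, assuming the file already contains valid UTF-8; the function name `ensure_utf8_bom` and the example file name are hypothetical.

```python
import codecs
from pathlib import Path


def ensure_utf8_bom(path: Path) -> bool:
    """Prepend a UTF-8 BOM if it is missing. Returns True if the file changed."""
    raw = path.read_bytes()
    if raw.startswith(codecs.BOM_UTF8):
        return False  # BOM already there, nothing to do
    # Safety check: raises UnicodeDecodeError if the content is not valid
    # UTF-8, so a Windows-1252 file is not mislabeled by accident.
    raw.decode("utf-8")
    path.write_bytes(codecs.BOM_UTF8 + raw)
    return True


# Example call (hypothetical file name):
# ensure_utf8_bom(Path("wpaper.rdf"))
```

The validity check matters: adding a BOM to a file that is really ISO-LATIN-1 or Windows-1252 would make things worse rather than better.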

This entry was posted on Thursday, July 27th, 2017 at 8:34 pm and is filed under Workings of RePEc.

The RePEc Blog