Why do some words look weird on RePEc sites?

July 27, 2017

When you browse through the various RePEc sites, you may come across some strange words or names, like GonzÃ¡lez, su¢ cient or MÃ¶ller. Why do those appear? To get to the bottom of this, one has to first understand how the RePEc sites get their content. All of it comes directly from publishers, about 2000 of them, who make all the relevant information available on their respective sites. To do so, they follow instructions and put files with a particular layout on their FTP or web sites.

These files are supposed to be simple text files, not formatted like they would be with Microsoft Word or LibreOffice. That should make them easy to handle with automated scripts. Unfortunately, this ignores the pesky issue of character encoding. Every operating system or piece of software assumes that a particular character encoding is the standard, which is fine until a file moves from one computer to another. Early on, the files used in RePEc were assumed to be encoded as ISO-LATIN-1 or Windows-1252 by default. Back in 1997, UTF-8 (“Unicode”) was rare. Yet, there is still the option to force RePEc scripts to assume UTF-8 by adding at the start of the file a byte-order mark (“BOM”), which signals that the file has a non-standard encoding.

Now UTF-8 has become much more prevalent, and publishers sometimes put UTF-8 encoded data in files without the BOM, especially for files created by scripts. RePEc then interprets the data as ISO-LATIN-1 or Windows-1252, and the output can then look strange for any character that is outside the restricted ASCII set (simple letters and numbers). For example, any accented characters like é, ñ, ç, and ü will look odd if wrongly encoded. The same applies to ligatures like æ, ﬃ, and ß, non-Western alphabets, and some punctuation used in Microsoft Word.
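To see the effect for yourself, here is a minimal Python sketch. The function decode_like_legacy_reader is a hypothetical stand-in, not the actual RePEc script: it simply assumes Windows-1252 unless the data starts with the UTF-8 BOM, which is the behavior described above.

```python
import codecs


def decode_like_legacy_reader(raw: bytes) -> str:
    """Hypothetical reader: trust UTF-8 only when a BOM is present.

    This mimics the behavior described in the post; it is an
    illustration, not the code RePEc actually runs.
    """
    if raw.startswith(codecs.BOM_UTF8):
        # Strip the three BOM bytes, then decode the rest as UTF-8.
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # No BOM: fall back to the legacy default encoding.
    return raw.decode("cp1252", errors="replace")


# "á" becomes the two bytes C3 A1 in UTF-8.
utf8_bytes = "González".encode("utf-8")

# Without a BOM, each of those two bytes is read as its own character.
garbled = decode_like_legacy_reader(utf8_bytes)

# With a BOM, the same bytes are decoded correctly.
correct = decode_like_legacy_reader(codecs.BOM_UTF8 + utf8_bytes)

print(garbled)  # GonzÃ¡lez
print(correct)  # González
```

The two bytes that UTF-8 uses for á are each a valid character in Windows-1252, which is why the result is a pair of odd symbols rather than an error.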

As a RePEc publisher, how can you fix your poorly encoded UTF-8 data? There are two solutions. Either add the BOM at the start of the data, or use the new .redif extension, which assumes UTF-8. But if you convert from .rdf to .redif, make sure to delete the old .rdf file(s), or your records will come up as duplicated and thus become invalid. And remember: no HTML encoding in your files.
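For the first solution, a short Python sketch along these lines could add the BOM to an existing file. The function name add_utf8_bom is made up for this illustration, and the sketch assumes the file really is UTF-8 already; it refuses to touch anything that does not decode cleanly. Try it on a copy first.

```python
import codecs
from pathlib import Path


def add_utf8_bom(path: Path) -> bool:
    """Prepend a UTF-8 BOM to the file unless it already has one.

    Returns True if the file was changed, False if a BOM was
    already there. Raises UnicodeDecodeError if the content is
    not valid UTF-8, so that Latin-1 files are left alone.
    """
    raw = path.read_bytes()
    if raw.startswith(codecs.BOM_UTF8):
        return False
    # Sanity check: only mark the file as UTF-8 if it really is.
    # Note that a pure ASCII file also passes this check.
    raw.decode("utf-8")
    path.write_bytes(codecs.BOM_UTF8 + raw)
    return True


# Example call (hypothetical file name):
# add_utf8_bom(Path("papers.rdf"))
```

Running it twice on the same file is harmless, since the second call sees the BOM and does nothing.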


RePEc in June 2017

July 5, 2017

There are a few novelties on RePEc this month. Three new NEP reports: NEP-BIG (Big Data), NEP-DES (Economic Design) and NEP-FLE (Financial Literacy and Education). A new ranking for institutions: Student records measured on the publications from the last 10 years. We also have a few new participating archives: University of Calgary (II), GRAPE, Centre for Economic History Research, National Association of Forensic Economics, D. A. Tsenov Academy of Economics, Tripal Publishing House, DOBA Faculty. We counted 443,596 file downloads and 1,596,970 abstract views in June 2017.

As for the milestones we reached:

40,000,000 cumulative article downloads
6,000,000 cumulative downloads through NEP
12,000 people listed in the RePEc Genealogy