Challenges with publisher data

All the bibliographic data that RePEc disseminates comes directly from the publishers. The quality of the data on RePEc services thus cannot be better than the quality of the data that publishers provide. RePEc imposes some syntactic constraints to make data easy to handle, but unfortunately publishers do not always adhere to those rules, leading to lower data quality and even data loss. In this blog post, we describe some of the challenges that we face.

Let us start with an example. This is what good data looks like for a working paper:

Template-Type: ReDIF-Paper 1.0
Author-Name:  Daniel Rais
Author-Name-First: Daniel
Author-Name-Last: Rais
Author-Name:  Peter Lawater
Author-Name-First: Peter
Author-Name-Last: Lawater
Author-Email:  p.lawater@grandiose.edu
Author-Workplace-Name: Department of Economics, Grandiose University
Author-Name:  Jonathan Goldman
Author-Name-First: Jonathan
Author-Name-Last: Goldman
Author-Workplace-Name: Department of Finance, Grandiose University
Author-Name:  Zhiwei Chui
Author-Name-First: Zhiwei
Author-Name-Last: Chui
Title:  Phases of Imitation and Innovation in a North-South Endogenous Growth
Model
Abstract:  In this paper, we develop a North-South endogenous growth model to
examine three phases of development in the South: imitation of Northern
products, imitation and innovation and finally, innovation only.
In particular, the model has the features of catching up (and
potentially overtaking) which are of particular relevance to the Pacific Rim
economies.  We show that the possible equilibria
depend on cross-country assimilation effects and the ease of
imitation.  We then apply the model to analyse the impact of R&D
subsidies.  There are some clear global policy implications which emerge
from our analysis.  Firstly, because subsidies to Southern innovation
benefit the North as well, it is beneficial to the North to pay for some of
these subsidies.  Secondly, because the ability of the South to assimilate
Northern knowledge and innovate depends on Southern skills levels, the
consequent spillover benefits on growth make the subsidising
of Southern education by the North particularly attractive.
Length:  26 pages
Creation-Date:  1996-07
Revision-Date: 1998-01
Publication-Status: Published in Review of Economics, March 1999, pages 1-23
File-URL: ftp://ftp.grandiose.edu/pub/econ/WorkingPapers/surrec9602.pdf
File-Format: Application/pdf
File-Function: First version, 1996
File-URL: ftp://ftp.grandiose.edu/pub/econ/WorkingPapers/surrec9602R.pdf
File-Format: Application/pdf
File-Function: Revised version, 1998
Number: 9602
Classification-JEL: E32, R10
Keywords: North-South, growth model, innovation assimilation
Handle: RePEc:aaa:wpaper:9602

What can go wrong here? First, there are some mandatory fields, and if any of them is missing, the template is automatically rejected. Examples are Author-Name, Title, and Handle. Then, some fields need to follow a specific format. This applies to dates, handles, and URLs. An error here also leads to a rejection. In some other cases, a syntactic error will only lead to a warning, with the particular field being ignored.
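To make the difference between a rejected template and a clean one concrete, here is a minimal sketch in Python. It is not the checker RePEc actually runs. The list of mandatory fields contains only the three examples named above, the accepted date shapes (YYYY, YYYY-MM, YYYY-MM-DD) are an assumption, and the function names `parse_template` and `check_template` are made up for illustration.

```python
import re

# Assumption: only the three mandatory fields named in the post are checked.
MANDATORY = ("Author-Name", "Title", "Handle")
# Assumption: dates are written as YYYY, YYYY-MM or YYYY-MM-DD.
DATE_RE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")
DATE_FIELDS = ("Creation-Date", "Revision-Date")
# Simplification: a field starts a line with a capitalized name and a colon.
FIELD_RE = re.compile(r"^([A-Z][A-Za-z0-9-]*):\s*(.*)$")


def parse_template(text):
    """Return (field, value) pairs; other non-empty lines continue the previous value."""
    fields = []
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields.append([match.group(1), match.group(2).strip()])
        elif fields and line.strip():
            fields[-1][1] += " " + line.strip()
    return [(name, value) for name, value in fields]


def check_template(text):
    """Return a list of error messages; an empty list means no error was found."""
    fields = parse_template(text)
    present = {name for name, _ in fields}
    errors = [f"missing mandatory field: {f}" for f in MANDATORY if f not in present]
    for name, value in fields:
        if name in DATE_FIELDS and not DATE_RE.match(value):
            errors.append(f"bad date in {name}: {value!r}")
    return errors
```

Calling `check_template` on the working paper template above should return an empty list, while a template without a Handle line, or with a Creation-Date such as "July 1996", would produce an error message.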

The more subtle issues arise when the provider starts inputting data that is syntactically correct, but not in the intended way. Let us look at what can happen to the Author-Name field:

Author-Name: John Doe
Author-Name: John Doe and Jane Doe
Author-Name: John Doe (j.doe@grandiose.edu)
Author-Name: John Doe, Department of Economics, Grandiose University
Author-Name: "John Doe"
Author-Name: JOHN DOE
Author-Name: Assistant Prof. John Doe
Author-Name: Juan GonzÃ¡lez

All of these will pass the syntax check because they will not lead to wrong information. However, they will be confusing for various uses of this data. The first entry is entirely correct. The second is problematic because two names are listed. Each author should be in a separate Author-Name field. The problem here is that, for example, the RePEc Author Service will have difficulties attributing this entry to John Doe, as a more complex name is listed. The third and fourth entries have information that is not about the name. This should be included in fields like Author-Email and Author-Workplace-Name. The problem here is that when you build a citation record, you only need the name of the author, not all the other “junk.” And speaking of junk, the next entry also features extra characters, the quotation marks, that are not useful for the record. The name in all capitals is annoying, because when works from different origins are mixed (say, in a list of references), records in all caps look awful. Some RePEc services adjust the capitalization (also for titles), but this can lead to mistakes. Next, including titles such as “Assistant Prof.” not only makes a record look awkward, it also confuses name matching in the RePEc Author Service. Finally, the last record has a mangled character. This usually happens because the provider was negligent in tracking the character encoding while transferring the data. There is another blog post to explain this.
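A provider could catch most of these cases before uploading with a few simple heuristics. The following Python sketch is only an illustration, not something RePEc services run. The function name `author_name_warnings` and the exact rules are assumptions, and each rule can misfire (a comma, for instance, does not always signal an affiliation).

```python
import re

# Typical debris left when UTF-8 text is read as Latin-1, as in "GonzÃ¡lez".
MOJIBAKE_RE = re.compile("[\u00c3\u00c2][\u0080-\u00bf]")
# A few honorifics at the start of the value; the list is illustrative only.
TITLE_RE = re.compile(
    r"\s*(assistant |associate )?(professor|prof|dr|mr|mrs|ms)\b\.?", re.I
)


def author_name_warnings(name):
    """Return warnings for one Author-Name value (heuristics, not rules)."""
    warnings = []
    if re.search(r"\s(and|&)\s", name):
        warnings.append("possibly several authors in one field")
    if "@" in name:
        warnings.append("contains an email address")
    if "," in name:
        warnings.append("contains a comma, possibly an affiliation")
    if name.strip().startswith(('"', "'")):
        warnings.append("wrapped in quotation marks")
    letters = [c for c in name if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        warnings.append("all capitals")
    if TITLE_RE.match(name):
        warnings.append("starts with a title")
    if MOJIBAKE_RE.search(name):
        warnings.append("possible character-encoding damage")
    return warnings
```

Under these rules, "John Doe" produces no warning, while each of the other seven entries listed above triggers at least one.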

What other common errors lead to confusion? A surprisingly common one is to put the abstract in the title. Speaking of the title, too often it includes information that does not belong in a title, such as “revised version of paper 2001-10” or “in Japanese.” Other problems are titles in all caps, punctuation added at the end, or the whole title put between quotation marks. Again, this prevents building a good citation. It also prevents automatically matching different versions of the same work if the titles are unnecessarily different. Users can manually match those misfits with this form.
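To see why such cosmetic differences get in the way of matching, consider a comparison key that ignores case, surrounding quotation marks, and trailing punctuation. This Python sketch is an assumption about how one might do it, not the matching procedure RePEc uses, and `normalize_title` is a hypothetical name.

```python
QUOTES = "\"'\u201c\u201d\u2018\u2019"


def normalize_title(title):
    """Build a comparison key that ignores purely cosmetic differences."""
    key = " ".join(title.split())            # collapse line breaks and repeated spaces
    key = key.lstrip(QUOTES + " ")           # drop opening quotation marks
    key = key.rstrip(QUOTES + ".;:, ")       # drop closing quotes and trailing punctuation
    return key.casefold()                    # ignore capitalization
```

Two versions of a paper whose titles differ only in these respects then yield the same key. Extra text such as “in Japanese” or a pasted abstract still defeats the comparison, which is why clean titles at the source matter.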

RePEc provides some syntax analysis to its providers at EconPapers, and providers are alerted about warnings and errors in a monthly email. Yet it often seems that providers correct the data only when users, especially authors, complain. Thus, if you see something amiss, contact the person listed on every page for corrections. This is the person who can do something about the record, not RePEc volunteers.

This entry was posted on Saturday, April 27th, 2019 at 3:49 pm and is filed under Workings of RePEc.

The RePEc Blog