All the bibliographic data that RePEc disseminates comes directly from the publishers. The quality of the data on RePEc services thus cannot be better than the quality of the data that publishers provide. RePEc imposes some syntactic constraints to make data easy to handle, but unfortunately publishers do not always adhere to those rules, which lowers data quality and can even cause data loss. In this blog post, we describe some of the challenges that we face.
Let us start with an example. This is what good data looks like for a working paper:
Template-Type: ReDIF-Paper 1.0
Author-Name: Daniel Rais
Author-Name-First: Daniel
Author-Name-Last: Rais
Author-Name: Peter Lawater
Author-Name-First: Peter
Author-Name-Last: Lawater
Author-Email: firstname.lastname@example.org
Author-Workplace-Name: Department of Economics, Grandiose University
Author-Name: Jonathan Goldman
Author-Name-First: Jonathan
Author-Name-Last: Goldman
Author-Workplace-Name: Department of Finance, Grandiose University
Author-Name: Zhiwei Chui
Author-Name-First: Zhiwei
Author-Name-Last: Chui
Title: Phases of Imitation and Innovation in a North-South Endogenous Growth Model
Abstract: In this paper, we develop a North-South endogenous growth model to examine three phases of development in the South: imitation of Northern products, imitation and innovation and finally, innovation only. In particular, the model has the features of catching up (and potentially overtaking) which are of particular relevance to the Pacific Rim economies. We show that the possible equilibria depend on cross-country assimilation effects and the ease of imitation. We then apply the model to analyse the impact of R&D subsidies. There are some clear global policy implications which emerge from our analysis. Firstly, because subsidies to Southern innovation benefit the North as well, it is beneficial to the North to pay for some of these subsidies. Secondly, because the ability of the South to assimilate Northern knowledge and innovate depends on Southern skills levels, the consequent spillover benefits on growth make the subsidising of Southern education by the North particularly attractive.
Length: 26 pages
Creation-Date: 1996-07
Revision-Date: 1998-01
Publication-Status: Published in Review of Economics, March 1999, pages 1-23
File-URL: ftp://ftp.grandiose.edu/pub/econ/WorkingPapers/surrec9602.pdf
File-Format: Application/pdf
File-Function: First version, 1996
File-URL: ftp://ftp.grandiose.edu/pub/econ/WorkingPapers/surrec9602R.pdf
File-Format: Application/pdf
File-Function: Revised version, 1998
Number: 9602
Classification-JEL: E32, R10
Keywords: North-South, growth model, innovation assimilation
Handle: RePEc:aaa:wpaper:9602
What can go wrong here? First, some fields are mandatory, and if any of them is missing, the template is automatically rejected. Examples are Author-Name, Title, and Handle. Second, some fields must follow a specific format. This applies to dates, handles, and URLs. An error here also leads to a rejection. In other cases, a syntactic error only leads to a warning, and the offending field is ignored.
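To make these two kinds of rejection concrete, here is a minimal Python sketch of such a check. It is not the checker RePEc actually runs: the function names are hypothetical, the list of mandatory fields contains only the three examples named above, and the date and handle patterns are simplified assumptions.

```python
import re

# Only the examples named in this post; the real list may be longer.
MANDATORY = ("Author-Name", "Title", "Handle")
# Assumed formats for illustration: YYYY, YYYY-MM or YYYY-MM-DD dates,
# and handles of the form RePEc:archive:series:id.
DATE_RE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")
HANDLE_RE = re.compile(r"^RePEc:[^:\s]+:[^:\s]+:\S+$")
FIELD_RE = re.compile(r"^([A-Z][A-Za-z-]*):\s*(.*)$")


def parse_template(text):
    """Split 'Field: value' lines into (field, value) pairs.

    Lines that do not start with a field name are treated as a
    continuation of the previous value.
    """
    pairs = []
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            pairs.append([match.group(1), match.group(2).strip()])
        elif pairs and line.strip():
            pairs[-1][1] += " " + line.strip()
    return [tuple(pair) for pair in pairs]


def check(pairs):
    """Return a list of problems that would cause a rejection."""
    errors = []
    present = {field for field, _ in pairs}
    for name in MANDATORY:
        if name not in present:
            errors.append(f"missing mandatory field: {name}")
    for field, value in pairs:
        if field.endswith("-Date") and not DATE_RE.match(value):
            errors.append(f"bad date in {field}: {value}")
        if field == "Handle" and not HANDLE_RE.match(value):
            errors.append(f"bad handle: {value}")
    return errors


bad = """Template-Type: ReDIF-Paper 1.0
Author-Name: John Doe
Creation-Date: July 1996
Handle: aaa:wpaper:9602"""
for problem in check(parse_template(bad)):
    print(problem)
```

The sample template lacks a Title, writes the date in free text, and drops the RePEc: prefix from the handle, so all three checks fire.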
The more subtle issues arise when the provider enters data that is syntactically correct, but not in the intended way. Let us look at what can happen to the Author-Name field:
Author-Name: John Doe
Author-Name: John Doe and Jane Doe
Author-Name: John Doe (email@example.com)
Author-Name: John Doe, Department of Economics, Grandiose University
Author-Name: "John Doe"
Author-Name: JOHN DOE
Author-Name: Assistant Prof. John Doe
Author-Name: Juan GonzÃ¡lez
All of these pass the syntax check because they do not lead to wrong information as such. However, they cause confusion in various uses of the data. The first entry is entirely correct. The second is problematic because two names are listed; each author should be in a separate Author-Name field. The consequence is that, for example, the RePEc Author Service will have difficulty attributing this entry to John Doe, because the field contains a more complex name. The third and fourth entries contain information that is not part of the name. It belongs in fields like Author-Email and Author-Workplace-Name. When you build a citation record, you only need the name of the author, not all the other “junk.” And speaking of junk, the next entry also features extra characters that are of no use to the record. The name in all capitals is annoying because, when works from different origins are mixed (say, in a list of references), records in all caps look awful. Some RePEc services adjust the capitalization (also for titles), but this can lead to mistakes. Next, including academic titles not only makes a record look awkward, it also confuses name matching in the RePEc Author Service. Finally, the last entry has a mangled character. This usually happens because the provider did not keep track of the character encoding while transferring the data. Another blog post explains this.
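A provider could catch most of these cases before uploading with a few simple heuristics. The following Python sketch is only an illustration: the function name, the list of academic titles, and the rules themselves are our assumptions, not what any RePEc service runs, and heuristics like these can produce false alarms (a comma, for instance, is not always followed by an affiliation).

```python
import re

# Hypothetical, incomplete list of titles to look for.
TITLES = re.compile(r"\b(Prof|Professor|Dr|Mr|Mrs|Ms)\b\.?", re.IGNORECASE)
# UTF-8 text misread as Latin-1 typically shows up as one of these two
# letters followed by a character in the range U+0080 to U+00BF.
MOJIBAKE = re.compile(r"[\u00c3\u00c2][\u0080-\u00bf]")


def author_name_issues(name):
    """Return a list of suspicious features found in an Author-Name value."""
    issues = []
    if re.search(r"\s(and|&)\s", name, re.IGNORECASE):
        issues.append("several names in one field")
    if "@" in name or "(" in name:
        issues.append("e-mail or parenthetical")
    if "," in name:
        issues.append("possible affiliation after comma")
    if name.startswith(('"', "'")) or name.endswith(('"', "'")):
        issues.append("quotation marks")
    if name.isupper():
        issues.append("all capitals")
    if TITLES.search(name):
        issues.append("academic or courtesy title")
    if MOJIBAKE.search(name):
        issues.append("mangled character encoding")
    return issues


examples = [
    "John Doe",
    "John Doe and Jane Doe",
    "John Doe (email@example.com)",
    "John Doe, Department of Economics, Grandiose University",
    '"John Doe"',
    "JOHN DOE",
    "Assistant Prof. John Doe",
    "Juan Gonz\u00c3\u00a1lez",
]
for example in examples:
    print(example, "->", author_name_issues(example) or "looks fine")
```

Only the first of the eight example values comes back clean; each of the others triggers exactly one of the rules.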
What other common errors lead to confusion? A surprisingly common one is to put the abstract in the title. Speaking of the title, too often it includes information that does not belong there, such as “revised version of paper 2001-10” or “in Japanese.” Other frequent problems are titles in all caps, punctuation added at the end, or the whole title placed between quotation marks. Again, this prevents building a good citation. It also prevents automatically matching different versions of the same work, because the titles differ unnecessarily. Users can manually match those misfits with this form.
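To see why such differences get in the way of matching, consider a naive comparison key for titles. This Python sketch is an assumption for illustration only, not the matching procedure RePEc services use, and the function name is hypothetical. It can undo capitalization, surrounding quotation marks, and trailing punctuation, but it cannot know that a note such as “in Japanese” is not part of the title.

```python
import re


def title_key(title):
    """Build a crude comparison key for a title (illustrative only)."""
    key = title.strip().strip("\"'\u201c\u201d\u2018\u2019")  # surrounding quotes
    key = key.rstrip(".;:,")                                  # trailing punctuation
    key = re.sub(r"\s+", " ", key)                            # collapse whitespace
    return key.casefold()                                     # ignore capitalization


clean = "Phases of Imitation and Innovation"
print(title_key('"PHASES OF IMITATION AND INNOVATION."') == title_key(clean))
print(title_key(clean + " (in Japanese)") == title_key(clean))
```

The first comparison succeeds and the second fails, which is why extra notes in the title field still leave versions unmatched even after cosmetic differences are normalized away.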
RePEc provides some syntax analysis to its providers at EconPapers, and providers are alerted about warnings and errors in a monthly email. Yet providers often seem to correct the data only when users, especially authors, complain. Thus, if you see something amiss, contact the person listed on every page for corrections. This is the person who can do something about the record, not the RePEc volunteers.