Challenges with publisher data

April 27, 2019

All the bibliographic data that RePEc disseminates comes directly from the publishers. The quality of the data on RePEc services thus cannot be better than the quality of the data that publishers provide. RePEc imposes some syntactic constraints to make the data easy to handle, but unfortunately publishers do not always adhere to those rules, which leads to lower data quality and even data loss. In this blog post, we describe some of the challenges that we face.

Let us start with an example. This is what good data looks like for a working paper:

Template-Type: ReDIF-Paper 1.0
Author-Name:  Daniel Rais
Author-Name-First: Daniel
Author-Name-Last: Rais
Author-Name:  Peter Lawater
Author-Name-First: Peter
Author-Name-Last: Lawater
Author-Email:  p.lawater@grandiose.edu
Author-Workplace-Name: Department of Economics, Grandiose University
Author-Name:  Jonathan Goldman
Author-Name-First: Jonathan
Author-Name-Last: Goldman
Author-Workplace-Name: Department of Finance, Grandiose University
Author-Name:  Zhiwei Chui
Author-Name-First: Zhiwei
Author-Name-Last: Chui
Title:  Phases of Imitation and Innovation in a North-South Endogenous Growth
Model
Abstract:  In this paper, we develop a North-South endogenous growth model to
examine three phases of development in the South: imitation of Northern
products, imitation and innovation and finally, innovation only.
In particular, the model has the features of catching up (and
potentially overtaking) which are of particular relevance to the Pacific Rim
economies.  We show that the possible equilibria
depend on cross-country assimilation effects and the ease of
imitation.  We then apply the model to analyse the impact of R&D
subsidies.  There are some clear global policy implications which emerge
from our analysis.  Firstly, because subsidies to Southern innovation
benefit the North as well, it is beneficial to the North to pay for some of
these subsidies.  Secondly, because the ability of the South to assimilate
Northern knowledge and innovate depends on Southern skills levels, the
consequent spillover benefits on growth make the subsidising
of Southern education by the North particularly attractive.
Length:  26 pages
Creation-Date:  1996-07
Revision-Date: 1998-01
Publication-Status: Published in Review of Economics, March 1999, pages 1-23
File-URL: ftp://ftp.grandiose.edu/pub/econ/WorkingPapers/surrec9602.pdf
File-Format: Application/pdf
File-Function: First version, 1996
File-URL: ftp://ftp.grandiose.edu/pub/econ/WorkingPapers/surrec9602R.pdf
File-Format: Application/pdf
File-Function: Revised version, 1998
Number: 9602
Classification-JEL: E32, R10
Keywords: North-South, growth model, innovation assimilation
Handle: RePEc:aaa:wpaper:9602

What can go wrong here? First, some fields are mandatory, and if any of them is missing, the template is automatically rejected. Examples are Author-Name, Title, and Handle. Second, some fields must follow a specific format. This applies to dates, handles, and URLs. An error here also leads to a rejection. In some other cases, a syntactic error only leads to a warning, and the particular field is ignored.
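The two kinds of rejection described above can be sketched in a few lines of Python. This is only an illustration, not the checker that RePEc actually runs: the function names `parse_template` and `check_template` are made up for this example, and the date and handle patterns are simplified assumptions rather than the official ReDIF rules.

```python
import re
from urllib.parse import urlparse

# A field starts with "Name:"; any other non-empty line continues the previous value.
FIELD_RE = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):\s*(.*)$")
# Assumption: dates look like YYYY, YYYY-MM or YYYY-MM-DD.
DATE_RE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")
# Assumption: handles look like RePEc:aaa:series:number.
HANDLE_RE = re.compile(r"^RePEc:[a-z]{3}:[A-Za-z0-9]+:\S+$")

MANDATORY = ("Author-Name", "Title", "Handle")
DATE_FIELDS = ("Creation-Date", "Revision-Date")


def parse_template(text):
    """Return (field, value) pairs, joining continuation lines with a space."""
    fields = []
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields.append([match.group(1), match.group(2).strip()])
        elif fields and line.strip():
            fields[-1][1] = (fields[-1][1] + " " + line.strip()).strip()
    return [(name, value) for name, value in fields]


def check_template(text):
    """Return a list of error messages; an empty list means no rejection."""
    fields = parse_template(text)
    present = {name for name, value in fields if value}
    errors = [f"missing mandatory field {name}"
              for name in MANDATORY if name not in present]
    for name, value in fields:
        if name in DATE_FIELDS and not DATE_RE.match(value):
            errors.append(f"badly formatted date in {name}: {value}")
        elif name == "Handle" and not HANDLE_RE.match(value):
            errors.append(f"badly formatted handle: {value}")
        elif name == "File-URL":
            parsed = urlparse(value)
            if not (parsed.scheme and parsed.netloc):
                errors.append(f"badly formatted URL: {value}")
    return errors
```

Calling `check_template` on a template that has no Title field and a Creation-Date written as "July 1996" would report both problems, while a template like the one shown earlier would come back with an empty list.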

The more subtle issues arise when the provider enters data that is syntactically correct, but not in the intended way. Let us look at what can happen to the Author-Name field:

Author-Name: John Doe
Author-Name: John Doe and Jane Doe
Author-Name: John Doe (j.doe@grandiose.edu)
Author-Name: John Doe, Department of Economics, Grandiose University
Author-Name: "John Doe"
Author-Name: JOHN DOE
Author-Name: Assistant Prof. John Doe
Author-Name: Juan González

All of these pass the syntax check because they do not, strictly speaking, contain wrong information. However, they are confusing for various uses of the data. The first entry is entirely correct. The second is problematic because two names are listed: each author should be in a separate Author-Name field. The problem here is that, for example, the RePEc Author Service will have difficulty attributing this entry to John Doe, because a more complex name is listed.

The third and fourth entries contain information that is not about the name. It belongs in fields like Author-Email and Author-Workplace-Name. When you build a citation record, you only need the name of the author, not all the other “junk.” And speaking of junk, the next entry also features extra characters, the quotation marks, that are not useful for the record. The name in all capitals is annoying because, when works from different origins are mixed (say, in a list of references), records in all caps look awful. Some RePEc services adjust the capitalization (also for titles), but this can lead to mistakes. Next, including academic titles not only makes a record look awkward, it also confuses name matching in the RePEc Author Service.

Finally, the last record has a mangled character. This usually happens because the provider was negligent in tracking the character encoding while transferring the data. There is another blog post to explain this.
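To make the list of problems concrete, here is a rough Python sketch that flags the issues discussed above in a single Author-Name value. It is not what any RePEc service runs: the function name `author_name_issues`, the labels it returns, and all the patterns are assumptions chosen for this illustration, and heuristics like these will produce both false alarms and misses.

```python
import re

# All patterns below are illustrative heuristics, not RePEc rules.
MULTIPLE_RE = re.compile(r"\band\b|&|;", re.IGNORECASE)
AFFILIATION_RE = re.compile(r"\b(department|university|institute|school)\b",
                            re.IGNORECASE)
TITLE_RE = re.compile(r"^(?:(?:assistant|associate)\s+)?(?:prof|dr|mr|mrs|ms)\b\.?",
                      re.IGNORECASE)
# Typical trace of UTF-8 text that was read as Latin-1, plus the replacement character.
MOJIBAKE_RE = re.compile("[\u00c3\u00c2][\u0080-\u00bf]|\ufffd")
QUOTES = "\"'\u201c\u201d"


def author_name_issues(name):
    """Return a list of labels describing what looks wrong with the name."""
    issues = []
    value = name.strip()
    if MULTIPLE_RE.search(value):
        issues.append("several names")
    if "@" in value:
        issues.append("e-mail address")
    if AFFILIATION_RE.search(value):
        issues.append("affiliation")
    if value[:1] in QUOTES or value[-1:] in QUOTES:
        issues.append("quotation marks")
    if value.isupper():
        issues.append("all capitals")
    if TITLE_RE.match(value):
        issues.append("academic title")
    if MOJIBAKE_RE.search(value):
        issues.append("mangled encoding")
    return issues
```

For instance, `author_name_issues("John Doe and Jane Doe")` reports several names, while `author_name_issues("John Doe")` returns an empty list. A correctly encoded accented name is left alone; only the byte patterns typical of a wrong encoding are flagged.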

What other common errors lead to confusion? A surprisingly common one is to put the abstract in the title. Speaking of the title, too often it includes information that does not belong there, such as “revised version of paper 2001-10” or “in Japanese.” Other frequent problems are titles in all caps, punctuation added at the end, and whole titles put between quotation marks. Again, this prevents building a good citation. It also prevents automatically matching different versions of the same work if the titles are unnecessarily different. Users can manually match those misfits with this form.
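To see why such cosmetic differences matter for matching, consider a naive comparison of titles. The sketch below is an assumption about how one could normalize titles before comparing them, not a description of how RePEc matches versions; the name `normalize_title` is hypothetical.

```python
def normalize_title(title):
    """Collapse whitespace, drop surrounding quotes and punctuation, ignore case."""
    text = " ".join(title.split())
    # Remove quotation marks and punctuation at both ends of the title.
    text = text.strip("\"'\u201c\u201d\u2018\u2019.;:, ")
    return text.casefold()


def same_title(first, second):
    """Return True if two titles are equal after normalization."""
    return normalize_title(first) == normalize_title(second)
```

With this, `same_title('"PHASES OF IMITATION AND INNOVATION."', "Phases of Imitation and Innovation")` is true. A title that carries an extra remark such as “revised version of paper 2001-10” still fails to match, which is why that information should stay out of the Title field.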

RePEc provides some syntax analysis to its providers at EconPapers, and providers are alerted about warnings and errors in a monthly email. Yet providers often seem to correct the data only when users, especially authors, complain. Thus, if you see something amiss, contact the person listed on every page for corrections. This is the person who can do something about the record, not the RePEc volunteers.


RePEc in March 2019

April 4, 2019

What is new at RePEc? First, we welcomed the following new archives: Prizren Social Science Journal, Spanish Securities and Exchange Commission, Aix-Marseille School of Economics, Revista Universitara de Sociologie, Bulgarian Association for Management Development and Entrepreneurship (BAMDE), ISE Research Institute, Istanbul University Press. Second, we counted 538,251 downloads and 2,200,616 abstract views. Finally, here is our traditional milestone report:

15,000 followers on Twitter for the NEP reports
12,000 distinct items mentioned in blog posts on EconAcademics.org


RePEc to take over Google Scholar

April 1, 2019

RePEc is proud to announce that it will soon take over the management of Google Scholar. Indeed, Google is dropping Google Scholar from its portfolio of web services following its yearly spring cleaning exercise. While Google Scholar uses relatively few resources, it does not bring in any revenue and there is no expectation that it ever will. This situation is not much different from that of RePEc, which has no revenue either and has learned to work efficiently with volunteer resources and some sponsored hardware. As companies accountable to shareholders, Google and its parent Alphabet find it more and more difficult to justify giving away resources. However, this is at the core of the mission of RePEc: bringing free bibliographic resources to the academic community.

While RePEc has a lot of experience (after all, it is older than Google), the takeover is not without challenges. Indeed, RePEc has concentrated on Economics while Google Scholar expanded into all sciences. Thus the amount of data is much larger. Initially, services will continue to run on Google hardware before eventually becoming independent of their birth parent. As usual, RePEc will rely on volunteers and is now appealing for them to come forward. Talent is needed in system administration, programming, UX, and brainstorming. Experience in the academic publishing industry or academia is a plus, especially in marine biology. Motivated candidates are asked to make themselves known by email.