Classifying authors

March 16, 2008

A difficult task librarians often face when classifying items is determining whether authors with similar names are the same person. Bibliographic records usually carry very little information to identify an author. Take the case of Adam Smith. He may be listed under his full name, which is by no means unique, or worse, only as A. Smith, which is easily confused with others. Librarians then rely on context and on information gathered outside the bibliographic record to attribute the work to the right person, hopefully without error.

With the large number of works now available, such laborious categorization becomes infeasible, and automatic classification makes numerous errors. Within RePEc, we rely on the authors themselves to perform the classification. When they register in the RePEc Author Service, they have the opportunity to enter all the possible name variations under which they may be listed in a bibliographic record. For John Maynard Keynes (who is not registered), such name variations could be:

John Maynard Keynes
John M. Keynes
John Keynes
J. M. Keynes
J. Keynes
Keynes, John Maynard
Keynes, John M.
Keynes, John
Keynes, J. M.
Keynes, J.
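
The list above follows a regular pattern, which can be sketched in a few lines of Python. This is only an illustration and not the code used by the RePEc Author Service; the function name name_variations and its three-part signature are assumptions made for the example.

```python
def name_variations(first, middle, last):
    """Build the ten variations shown above from a first, middle and last name.

    Hypothetical helper for illustration; it ignores titles, suffixes,
    name changes and accents.
    """
    first_initial = first[0] + "."
    middle_initial = middle[0] + "."
    # Five ways of writing the given names, from most to least complete.
    given_forms = [
        f"{first} {middle}",
        f"{first} {middle_initial}",
        first,
        f"{first_initial} {middle_initial}",
        first_initial,
    ]
    direct = [f"{given} {last}" for given in given_forms]      # "John M. Keynes"
    inverted = [f"{last}, {given}" for given in given_forms]   # "Keynes, John M."
    return direct + inverted


if __name__ == "__main__":
    for variation in name_variations("John", "Maynard", "Keynes"):
        print(variation)
```

Titles, suffixes and name changes would each multiply this list further, which is why authors are asked to enter their own variations rather than relying on a rule like this one.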

In addition, an author may have changed names (through marriage), be listed with a title (Prof., Sir) or with a suffix (Jr, Sr, III). Variations multiply if names have accents, which some publishers do not take into account or encode in the wrong character set. The possibilities are numerous. The registered author is first offered suggestions of works that match the name variations, and then suggestions that closely match them (typographical errors happen). The author can then accept or reject these works.
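
To make the two rounds of suggestions concrete, here is a minimal sketch that first looks for a match after stripping accents and punctuation, then falls back to a similarity score to catch typographical errors. It uses only the Python standard library. The function names and the 0.85 threshold are assumptions chosen for the example, not the matching rules actually used by the RePEc Author Service.

```python
import difflib
import unicodedata


def normalize(name):
    """Lower-case a name and strip accents, periods, commas and extra spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    cleaned = without_accents.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def classify_match(listed_name, variations, threshold=0.85):
    """Return "exact", "close" or "none" for a name found in a record.

    The threshold is an arbitrary illustrative value.
    """
    target = normalize(listed_name)
    candidates = [normalize(v) for v in variations]
    if target in candidates:
        return "exact"
    best = max(
        (difflib.SequenceMatcher(None, target, c).ratio() for c in candidates),
        default=0.0,
    )
    return "close" if best >= threshold else "none"


if __name__ == "__main__":
    print(classify_match("Fethy Mili", ["Féthy Mili"]))
    print(classify_match("John Maynard Keyns", ["John Maynard Keynes"]))
```

Whatever the scoring, the final decision stays with the author, who accepts or rejects each suggested work.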

The RePEc Author Service has so far managed to collect data from close to 16,000 authors who have claimed over 300,000 works as theirs. Such data is used in particular to increase the accuracy of various rankings. And within this set of authors, there is already a large number of homonyms, even when one looks beyond the initial of the first name, which is as far as some other services go.

If you know of other homonyms in the profession, encourage them to register!


The RePEc budget for 2008

March 9, 2008
            Budget 2007   Effective 2007   Budget 2008
Expenses    US$0.00       US$0.00          US$0.00
Revenues    US$0.00       US$0.00          US$0.00

Thanks to all our volunteers!


RePEc in February 2008

March 2, 2008

Every month, a short summary of what happened with RePEc is sent to the RePEc-announce mailing list. I also put that message, slightly adapted, on this blog.

During this month, IDEAS moved to a new server sponsored by the Society for Economic Dynamics. It continues to be hosted by the University of Connecticut and is now located on a faster line to the Internet.

In terms of traffic, 613,984 file downloads and 2,246,241 abstract views were recorded within the month, once more significantly up from a year ago. This leads us to the thresholds we have passed this month:

40,000,000 cumulative article abstract views on all RePEc services
25,000,000 cumulative abstract views on EconPapers
300,000 items claimed by registered authors
100,000 papers with JEL codes
20,000 unique subscribers in NEP
2,800 journals and series


Volunteer recognition: Thomas Krichel

February 21, 2008

Thomas Krichel is not just a RePEc volunteer, he is RePEc. In 1991, as a research assistant at the Economics Department of Loughborough University, he saw the potential that the Internet offered for the dissemination of research in Economics, but could not manage to get hold of good data about new working papers. In February 1993, on a lectureship at the University of Surrey, he was luckier and teamed up with Féthy Mili, Economics librarian at the Université de Montréal, who contributed data on 250 series, and Hans Amman (University of Amsterdam), who let Thomas use his coryfee mailing list. Bob Parks soon joined with his Economics Working Paper Archive at Washington University. Thus the NetEc project was launched. It moved to a gopher server at the Manchester Computing Centre in 1993, and then to the web. That year, Thomas also got help in collecting data from José Manuel Barrueco Cruz, Economics librarian at the University of Valencia. But soon they realized that there was too much information out on the Internet for just the two of them to collect.

This is when Thomas suggested the creation of RePEc, which would completely decentralize the data input: the publishers, who benefit the most from having their papers listed on web indexes, were to index the works themselves. With the collaboration of Sune Karlsson (SWoPEc, Stockholm School of Economics), Bob Parks and Corry Stuyts (DEGREE, Netherlands), José and Thomas then launched RePEc in June 1997. It still works under the same principles, with great success.

Thomas is still the heart and soul of RePEc. He has his hand in almost every project that is undertaken. After completing his Economics PhD at the University of Surrey, he moved to Long Island University to take a position of assistant professor in … Library Studies. Now tenured, he is an eminence grise in the online provision of bibliographic data and is pushing the RePEc concept into other fields. Within RePEc, most of his attention is currently directed towards NEP, the email notification service on new working papers.


World Ranking of Repositories, RePEc is #2

February 14, 2008

The Webometrics Ranking of World Universities is an initiative that tries to establish which universities provide the most content on the web and get visibility from it. The ranking of universities is based on the size of the web domain (20%), the number of rich files available (PDF, RTF, etc., 15%), research on Google Scholar (15%), and link visibility (50%). Not surprisingly, US universities monopolize the first 24 spots, led by MIT.
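
As a rough illustration of how four weighted criteria can be folded into one number, the sketch below computes a weighted sum using the percentages quoted above. It assumes that each criterion is expressed as a rank (lower is better) and that the weights are applied to those ranks; the post does not describe how Webometrics actually combines its indicators, so the method, the function name and the sample ranks are all hypothetical.

```python
# Weights as quoted in the paragraph above.
WEIGHTS = {
    "size": 0.20,        # size of the web domain
    "rich_files": 0.15,  # PDF, RTF, etc.
    "scholar": 0.15,     # Google Scholar
    "visibility": 0.50,  # link visibility
}


def combined_score(ranks):
    """Weighted sum of per-criterion ranks (assumed method; lower is better)."""
    return sum(WEIGHTS[criterion] * ranks[criterion] for criterion in WEIGHTS)


if __name__ == "__main__":
    # Made-up ranks for a hypothetical institution.
    sample = {"size": 1, "rich_files": 2, "scholar": 3, "visibility": 1}
    print(round(combined_score(sample), 2))
```

With half of the weight on link visibility, an institution that is heavily linked to can rank well even if it hosts comparatively few files.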

Webometrics also ranks repositories, the criteria being the same as for universities. The ranking is led by arXiv, the grand-daddy of all repositories, covering much of Physics and Mathematics. RePEc is number 2, followed by E-LIS, a repository in Library Sciences founded by Thomas Krichel, who is also at the origin of RePEc!

Other notables down the list: HAL, a French repository that feeds to RePEc, at number 9; CDLIB, the University of California repository and a RePEc participant, at number 19; SSRN, not in RePEc, at number 37; the Munich Personal RePEc Archive, barely a year old, already at number 56; and AgEconSearch, not in RePEc, at number 126.


Society for Economic Dynamics sponsors new server for IDEAS

February 7, 2008

IDEAS just moved to a new server sponsored by the Society for Economic Dynamics. The old server, which was sponsored by the College of Liberal Arts and Sciences at the University of Connecticut, had been running almost flawlessly since October 2002, but was starting to get overwhelmed by the amount of material now in RePEc and by the heavy traffic and number crunching it entails. While the amount of material more than tripled, the complexity of the data increased much more than that, given the links with authors, references, citations, JEL codes, NEP reports, rankings, institutions, publication compilations, and reading lists.

The new server has more computational power, more memory and especially more disk space. As before, it is hosting IDEAS, EDIRC and QM&RBC. It also hosts the website of the Society for Economic Dynamics, which was willing to sponsor it because it was looking for space to host the datasets and program codes used for articles published in the Review of Economic Dynamics. The server is also set up to provide limited emergency support in case another RePEc service fails. The hosting continues to be provided by the College of Liberal Arts and Sciences at the University of Connecticut. In particular, Tim Ruggieri from the CLAS Computer Support Group helped with the configuration of the server.

RePEc relies entirely on the support of volunteers in its operations. Contact us if you want to help in one way or another.


RePEc in January 2008

February 2, 2008

Every month, a short summary of what happened with RePEc is sent to the RePEc-announce mailing list. I also put that message, slightly adapted, on this blog.

The RePEc Author Service was unfortunately down for 10 days. We hope this was only a temporary problem, and that full functionality will be restored soon. The RePEc Blog was very helpful in keeping users abreast of the situation.

Contentwise, a notable addition has been the complete listing of the Journal of Political Economy, starting in 1893. In terms of traffic, 552,272 file downloads and 1,946,427 abstract views were recorded within the month, significantly up from a year ago. This leads us to the thresholds we have passed this month:

90,000,000 abstract views on IDEAS
450,000 online items
275,000 paper announcements through NEP
175,000 items with citations
170,000 online papers
170,000 papers with abstract
80,000 papers with citations
30,000 articles with references
900 books online


Supplementary Open Access

January 23, 2008

Most economic publications do not provide open access. Yet the articles published there may be accompanied by related Internet material that is openly accessible, in particular pre-print, post-print and other versions of the published articles, including additional material that is unavailable in the published versions. I refer to this as “supplementary open access.”

It is obviously in every author’s best interest to make their works freely available on the Internet, and it is in the interest of all economists to enjoy unhampered access to as many economic research papers as possible. It is also helpful to have open access to other versions of articles in case the original article is not available, too expensive or shortened. This makes it advisable for authors to provide supplementary open access to their published work, in particular access to pre-print versions and post-print versions. (“Pre-print” refers here to pre-refereed and “post-print” to post-refereed versions of a published paper.)

Supplementary open access has several advantages for authors:

Visibility and citations. In his article “Online or Invisible,” Steve Lawrence has analyzed the effect of online availability of published journal articles in physics on citation. He concludes:

“The results are dramatic. There is a clear correlation between the number of times an article is cited, and the probability that the article is online. More highly cited articles, and more recent articles, are significantly more likely to be online. … When considering articles within each year, and averaging across all years from 1990 to 2000, we find that online articles are cited 4.5 times more often than offline articles.”

There is no reason to assume that economics would be any different, although I do not know of a comparable systematic study. My own experience, though, is that the articles I placed online attracted much more attention than comparable articles not available online. The further spread of Internet publishing since 2000, also witnessed by the growth of the RePEc database, may have further strengthened the effect.

Prestige. Theodore Bergstrom and Rosemarie Lavaty have looked at all articles that appeared in 33 economics journals in August 2006 and determined for each paper whether or not an open access version was available on the net. The result:

“… freely available versions of about 90 percent of the articles in the top fifteen listed journals can be found by Google-searching the title and author. […] the self-archiving norm is less strong among those who publish in the less influential journals. Freely available versions of about 50 percent of the articles in the eighteen lower-ranked journals in our sample could be found on the internet.”

Several top-ranked journals (among them QJE, JPE, and Econometrica) scored 100% free accessibility, while lesser journals scored significantly less. The prestige of the journal thus correlates strongly and positively with supplementary open access.

Career concerns. Many hiring decisions are influenced by citation scores. I have mentioned above that open access improves citations and citation scores. These are usually taken from Thomson (Web of Science). The RePEc citation scores are increasingly used for these purposes as well. They refer not only to published work, but also to pre-prints and all other material available through the RePEc services.

For these reasons, authors should be interested in providing supplementary open access to their works. At the same time they help others.

Now to the publishers. Most publishers make supplementary open access easy, but a few still try to restrict public access, in spite of calling themselves “publishers.” (It would be better to call them “concealers.”) You may want to submit only to those publishers that permit some form of supplementary open access. These publishers are “green,” “blue,” and “yellow”:

Green publishers (most publishers) permit you to archive pre-print and post-print versions of your article.

Blue publishers permit you to archive post-print (i.e. final draft post-refereeing) versions of your article.

Yellow publishers permit you to archive pre-print (i.e. pre-refereeing) versions of your article.

All these publishers are fine, as far as supplementary open access is concerned. The publishers that are neither green, blue, nor yellow are “white”:

White publishers (a few) do not support archiving.

As supplementary open access is blocked only by the white publishers, don’t submit to the corresponding journals if possible–and don’t help them with referee reports. They obstruct the dissemination of knowledge.

You can find further details on individual journals in the RoMEO database. The publishers’ websites usually offer additional information.

In order to realize the advantages of supplementary open access it does not suffice, though, to shun white publishers. You also need to make your paper available on the Internet. The usual way is to deposit your pre-refereed manuscript in the working paper series of your institution.

Make sure that all information is supplied to the RePEc database. This will guarantee inclusion in the CitEc citation compilations done by RePEc. At the same time, the paper will become easily available through the RePEc services such as IDEAS or EconPapers, as well as Google Scholar and OAIster. In addition, the paper will be advertised through the NEP mailing lists that target specific subjects. All this will make the paper very easy to find.

If your institution’s working paper series does not supply its data to RePEc, you may suggest that they do (instructions). Otherwise you may consider depositing your paper with MPRA, which provides this service. You can also have it both ways and publish the paper in MPRA in addition to your institution’s series, but this makes sense only if your institution’s series is not covered by RePEc. Otherwise please don’t do it, as it creates confusion.

Once your paper is published, leave your pre-print on the repository; do not remove it. If you remove the open access (pre-print or post-print) versions, you lose all citations to these works, which reduces your citation score. Furthermore, you deny access to readers who have no subscription to the publisher’s data banks. In short, you lose all the advantages mentioned above.

Some further observations. Yes, it is true, readers can buy articles from the publishers’ data banks (such as IngentaConnect, ScienceDirect, or Blackwell Synergy), or from British Library Direct even without a subscription, but prices are excessive: you may easily be required to pay $30 for a book review of just a few pages! Most readers won’t buy a pig in a poke, though. Supplementary open access versions of the original articles help readers form an impression of whether it is worthwhile to take a closer look at the original article. Publishers may increasingly become aware that supplementary open access effectively helps them to sell material through their data banks. Hopefully they will pass some of the earnings on to the authors.

Let me add a quite important additional benefit of supplementary open access: Authors keep the copyright for all material they put on the Internet. If they do not pre-publish, they may lose all rights for their own works! (The current German legislation will transfer all electronic rights to the publishers, for example, unless the author explicitly states by the end of the current year that he or she wants to keep the rights. German university libraries are very worried.) You avoid all these problems to some extent by pre-publishing on the Internet, as all pre-published versions will remain yours, even if the rights for the published version go to the publisher.

A further point: Keep the electronic manuscript of your final version. Green publishers allow you to deposit the final version, but, as a rule, not the publisher’s final PDF. This restriction, if it applies, may be inconvenient, but it does not pertain to the substance of your work. Given this state of affairs it is advisable that you produce your own PDF right after correcting the galleys (by using the free software PDFCreator or the pricey Adobe Acrobat). So you have a version that you can safely use to enhance the impact of your work by providing supplementary open access. But if this seems too cumbersome to you, just leave the pre-print version on the net! The RePEc listings will show these along with the final versions, so that readers are informed about which versions exist, and which one is the latest.


RePEc Author Service is down

January 18, 2008

The RePEc Author Service is down at the time of this writing. IT personnel from the College of Liberal Arts and Sciences at the University of Connecticut are currently looking into the issue.

Update (Friday 18:00 EST): The service is back up, after an interruption of about 18 hours. It should be fully functional. Please do not hesitate to report any issues. We are sorry for the inconvenience.

Update 2 (Saturday 7:00 EST): The server went down again, and will not be back up before Monday.

Update 3 (Monday 19:00 EST): The server is still down. When it is back up, we will keep it offline to investigate the problem.

Update 4 (Thursday 18:00 EST): The machine is running (for the moment…) but we are still keeping it offline to work on it.

Update 5 (Tuesday 10:00 EST): All tests have been passed successfully, and we are progressively reestablishing all services.

Update 6 (Tuesday 14:00 EST): Everything is looking good so far, expect the service to be available tomorrow if tests continue to look positive.

Update 7 (Wednesday 14:00 EST): The RePEc Author Service is back online. Please report anything unusual. It is to be expected that some data is out of date, in particular citation data. Sorry for the inconvenience, and let us hope everything works fine now that the service is live.


The citation extraction process in CitEc

January 16, 2008

CitEc is an experimental autonomous citation index, that is, it is a software system which is able to automatically extract references out of the full texts of documents and create links between citing references and cited papers.

With its last update, the CitEc database has reached almost three million references and more than one million citations between documents available in RePEc. This is an important threshold, but it is still far from being a complete set of citations. There are some limits in the reference extraction process:

First, the system needs to have open access to an electronic version of the document’s full text. Many journals listed in RePEc have restricted access and are therefore excluded from CitEc unless they grant special access or push the citations to RePEc in other ways. We are working with some publishers that kindly provide us with metadata about references. We try to get as many publishers on board as possible, but unfortunately not all of them are willing to collaborate with us at this time. As a result, the data set is still made up mainly of references extracted from working papers. This has the advantage of providing the most up-to-date citation data, since working papers contain the most recent research results.

Second, the URL provided by the RePEc archive maintainer must be correct and must point to the PDF file containing the document’s full text, not to an intermediate abstract page or similar. Some archives provide this kind of link to force researchers to pass through their institutional web pages. The system is unable to follow the links to the hidden papers, and they are missed in the reference extraction process.
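
An archive maintainer can run a simple sanity check of this kind on a downloaded file: a PDF begins with the marker %PDF-, whereas an intermediate abstract page is normally HTML. The sketch below works on bytes that have already been fetched, so it makes no assumption about how the download is done. The function name and the tolerance for leading whitespace are choices made for this example, and this is not a description of the check CitEc itself performs.

```python
def looks_like_pdf(data):
    """Return True if the downloaded bytes start with the PDF header marker.

    Hypothetical helper: leading whitespace is tolerated, and only the
    first kilobyte is inspected.
    """
    return data[:1024].lstrip().startswith(b"%PDF-")


if __name__ == "__main__":
    print(looks_like_pdf(b"%PDF-1.4\n..."))                  # a direct link
    print(looks_like_pdf(b"<html><body>Abstract page</body></html>"))  # an abstract page
```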

The third limit is more technical. In order to extract references, the PDF files need to be converted into plain ASCII text. This step is key to completing the process successfully, since a good-quality text representation of the document makes the identification of references easier. PDF files are created in a wide variety of ways, and not all of them can be converted.

Finally, the system parses the references section, which first needs to be isolated, to identify each reference and split it into its parts: title, author, year, etc. The parsing is done using pattern matching techniques, which in some cases are not able to identify the full list of existing references.
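
To give a flavour of what pattern matching on a reference looks like, and of why it can miss references, here is a minimal sketch with a single regular expression for references of the form "Authors (Year). Title." The pattern and the function name are assumptions for illustration only; they are not the patterns CitEc uses, and a real system needs many such patterns to cover the citation styles found in practice.

```python
import re

# One illustrative pattern: authors, a four-digit year in parentheses
# (optionally followed by a letter, as in 1936a), then a title ending
# at the first period.
REFERENCE_PATTERN = re.compile(
    r"^(?P<authors>.+?)\s*\((?P<year>\d{4})[a-z]?\)[.,:]?\s*(?P<title>[^.]+)\."
)


def parse_reference(line):
    """Split one reference line into authors, year and title, or return None."""
    match = REFERENCE_PATTERN.match(line.strip())
    if match is None:
        return None  # the reference does not fit this pattern and is missed
    return {
        "authors": match.group("authors").strip(),
        "year": int(match.group("year")),
        "title": match.group("title").strip(),
    }


if __name__ == "__main__":
    sample = (
        "Keynes, J. M. (1936). The General Theory of Employment, "
        "Interest and Money. London: Macmillan."
    )
    print(parse_reference(sample))
    print(parse_reference("Keynes 1936 General Theory"))
```

The second call returns None because the year is not in parentheses, which is exactly the kind of reference that pattern matching fails to pick up.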

As of the last update on December 31, 2007, the CitEc numbers are as follows: 527,357 articles and working papers are available in RePEc. Of these, 343,441 cannot be processed by the system due to the limitations mentioned in the first two points above, namely:

101,886 have no electronic version

216,110 have restricted access

19,174 have no direct link to the document’s full text

6,271 have a wrong URL

That leaves 183,916 documents available to be processed by CitEc. Of these, processing was completed successfully for 134,130 papers, that is, 73% of the available documents. The complete list of sources and the number of processed documents for each series or journal is available here.
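
The figures above can be checked with a few lines of arithmetic. The sketch below only restates the numbers given in this post; the variable names are chosen for the example.

```python
total_documents = 527_357

# Documents that cannot be processed, by reason (figures from the list above).
unprocessable = {
    "no electronic version": 101_886,
    "restricted access": 216_110,
    "no direct link to full text": 19_174,
    "wrong URL": 6_271,
}

excluded = sum(unprocessable.values())   # 343,441
available = total_documents - excluded   # 183,916
processed = 134_130
success_rate = processed / available     # about 0.73

if __name__ == "__main__":
    print(f"excluded:  {excluded:,}")
    print(f"available: {available:,}")
    print(f"success:   {success_rate:.0%}")
```

Note that the 73% applies to the documents CitEc can reach; relative to all 527,357 documents in RePEc, the share with extracted references is about a quarter.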

All the previous considerations should be taken into account when CitEc data is used for scientific evaluation purposes. We still consider the data to be experimental.

From the point of view of RePEc archive maintainers there are a few basic steps they can take to improve the situation. For example:

  • provide direct and correct URLs to the document’s full text
  • make use of the X-File-Ref to give the system an ASCII version of the references section of a particular document
  • help us to lobby the publishers and editors of the restricted journals asking them to send us metadata about references.