Supplementary Open Access

January 23, 2008

Most economics publications do not provide open access. Yet the articles published there may be accompanied by related Internet material that is openly accessible, in particular pre-print, post-print, and other versions of the published articles, including additional material that is unavailable in the published versions. I refer to this as “supplementary open access.”

It is obviously in every author’s best interest to make their works freely available on the Internet, and it is in the interest of all economists to enjoy unhampered access to as many economic research papers as possible. It is also helpful to have open access to other versions of articles in case the original article is not available, too expensive, or shortened. This makes it advisable for authors to provide supplementary open access to their published work, in particular access to pre-print and post-print versions. (“Pre-print” refers here to pre-refereed and “post-print” to post-refereed versions of a published paper.)

Supplementary open access has several advantages for authors:

Visibility and citations. In his article “Online or Invisible,” Steve Lawrence has analyzed the effect of online availability of published journal articles in physics on citation. He concludes:

“The results are dramatic. There is a clear correlation between the number of times an article is cited, and the probability that the article is online. More highly cited articles, and more recent articles, are significantly more likely to be online. … When considering articles within each year, and averaging across all years from 1990 to 2000, we find that online articles are cited 4.5 times more often than offline articles.”

There is no reason to assume that economics would be any different, although I do not know of a comparable systematic study. My own experience, though, is that the articles I placed online attracted much more attention than comparable articles not available online. The further spread of Internet publishing since 2000, also witnessed by the growth of the RePEc database, may have further strengthened the effect.

Prestige. Theodore Bergstrom and Rosemarie Lavaty have looked at all articles that appeared in 33 economics journals in August 2006 and determined for each paper whether or not an open access version was available on the net. The result:

“… freely available versions of about 90 percent of the articles in the top fifteen listed journals can be found by Google-searching the title and author. […] the self-archiving norm is less strong among those who publish in the less influential journals. Freely available versions of about 50 percent of the articles in the eighteen lower-ranked journals in our sample could be found on the internet.”

Several top-ranked journals (among them QJE, JPE, and Econometrica) scored 100% free accessibility, while lesser journals scored significantly less. The prestige of the journal thus correlates strongly and positively with supplementary open access.

Career concerns. Many hiring decisions are influenced by citation scores. I have mentioned above that open access improves citations and citation scores. These are usually taken from Thomson (Web of Science). The RePEc citation scores are increasingly used for these purposes as well. They refer not only to published work, but also to pre-prints and all other material available through the RePEc services.

For these reasons, authors should be interested in providing supplementary open access to their works. At the same time they help others.

Now to the publishers. Most publishers make supplementary open access easy, but a few still try to restrict public access, in spite of calling themselves “publishers.” (It would be better to call them “concealers.”) You may want to submit only to those publishers that permit some form of supplementary open access. These publishers are “green,” “blue,” and “yellow”:

Green publishers (most publishers) permit you to archive pre-print and post-print versions of your article.

Blue publishers permit you to archive post-print (i.e. final draft post-refereeing) versions of your article.

Yellow publishers permit you to archive pre-print (i.e. pre-refereeing) versions of your article.

All these publishers are fine, as far as supplementary open access is concerned. The publishers that are neither green, blue, nor yellow are “white”:

White publishers (a few) do not support archiving.

As supplementary open access is blocked only by the white publishers, don’t submit to the corresponding journals if possible, and don’t help them with referee reports. They obstruct the dissemination of knowledge.

You can find further details on individual journals at the RoMEO database. The publishers’ websites usually offer additional information.

In order to realize the advantages of supplementary open access it does not suffice, though, to shun white publishers. You also need to make your paper available on the Internet. The usual way is to deposit your pre-refereed manuscript in the working paper series of your institution.

Make sure that all information is supplied to the RePEc database. This will guarantee inclusion in the CitEc citation compilations done by RePEc. At the same time, the paper will become easily available through the RePEc services such as IDEAS or EconPapers, as well as Google Scholar and OAIster. In addition, the paper will be advertised through the NEP mailing lists that target specific subjects. All this will make the paper very easy to find.

If your institution’s working paper series does not supply its data to RePEc, you may suggest that they do (instructions). Otherwise you may consider depositing your paper with MPRA, which provides this service. Do this only if your institution’s series is not covered by RePEc; if it is covered, please don’t, as a second copy creates confusion.

Once your paper is published, leave your pre-print on the repository; do not remove it. If you remove the open access (pre-print or post-print) versions, you lose all citations to these works, which reduces your citation score. Furthermore, you deny access to readers who have no subscription to the publisher’s data banks. In short, you lose all the advantages mentioned above.

Some further observations. Yes, it is true, readers can buy articles at the publishers’ data banks (such as IngentaConnect, ScienceDirect, or Blackwell Synergy), or at British Library Direct even without subscription, but prices are excessive: You may easily be required to pay $30 for a book review of just a few pages! Most readers won’t buy a pig in a poke, though. Supplementary open access versions of the original articles help readers judge whether it is worthwhile to have a closer look at the original article. Publishers may increasingly become aware that supplementary open access effectively helps them to sell stuff through their data banks. Hopefully they pass some of the earnings to the authors.

Let me add a quite important additional benefit of supplementary open access: Authors keep the copyright for all material they put on the Internet. If they do not pre-publish, they may lose all rights for their own works! (The current German legislation will transfer all electronic rights to the publishers, for example, unless the author explicitly states by the end of the current year that he or she wants to keep the rights. German university libraries are very worried.) You avoid all these problems to some extent by pre-publishing on the Internet, as all pre-published versions will remain yours, even if the rights for the published version go to the publisher.

A further point: Keep the electronic manuscript of your final version. Green publishers allow you to deposit the final version, but, as a rule, not the publisher’s final PDF. This restriction, if it applies, may be inconvenient, but it does not pertain to the substance of your work. Given this state of affairs it is advisable that you produce your own PDF right after correcting the galleys (by using the free software PDFCreator or the pricey Adobe Acrobat). So you have a version that you can safely use to enhance the impact of your work by providing supplementary open access. But if this seems too cumbersome to you, just leave the pre-print version on the net! The RePEc listings will show these along with the final versions, so that readers are informed about which versions exist, and which one is the latest.


RePEc Author Service is down

January 18, 2008

The RePEc Author Service is down at the time of this writing. IT personnel from the College of Liberal Arts and Sciences at the University of Connecticut are currently looking into the issue.

Update (Friday 18:00 EST): The service is back up, after an interruption of about 18 hours. It should be fully functional. Please do not hesitate to report any issues. We are sorry for the inconvenience.

Update 2 (Saturday 7:00 EST): The server went down again, and will not be back up before Monday.

Update 3 (Monday 19:00 EST): The server is still down. When it is back up, we will keep it offline to investigate the problem.

Update 4 (Thursday 18:00 EST): The machine is running (for the moment…) but we are still keeping it offline to work on it.

Update 5 (Tuesday 10:00 EST): All tests have been passed successfully; we are progressively reestablishing all services.

Update 6 (Tuesday 14:00 EST): Everything is looking good so far, expect the service to be available tomorrow if tests continue to look positive.

Update 7 (Wednesday 14:00 EST): The RePEc Author Service is back online. Please report anything unusual. It is to be expected that some data is out of date, in particular citation data. Sorry for the inconvenience, and let us hope everything works fine now that the service is live.


The citation extraction process in CitEc

January 16, 2008

CitEc is an experimental autonomous citation index, that is, it is a software system which is able to automatically extract references out of the full texts of documents and create links between citing references and cited papers.

With its last update, the CitEc database has reached almost three million references and more than one million citations between documents available in RePEc. This is an important threshold, but it is still far from being a complete set of citations. There are some limits in the reference extraction process:

First, the system needs open access to an electronic version of the document’s full text. Many journals listed in RePEc have restricted access and are therefore excluded from CitEc unless they grant special access or push the citations to RePEc in other ways. We are working with some publishers that kindly provide us with metadata about references. We try to get as many publishers on board as possible, but unfortunately not all of them are willing to collaborate with us at this time. As a result, the data set is still made up mainly of references extracted from working papers. This has the advantage of providing the most up-to-date citation data, since working papers contain the most recent research results.

Second, the URL provided by the RePEc archive maintainer must be correct and must point to the PDF file containing the document’s full text, not to an intermediate abstract page or similar. Some archives provide this kind of link to force researchers to pass through their institutional web pages. The system is unable to follow the links to the hidden papers, and they are missed in the reference extraction process.
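For archive maintainers who want to check their own links, the following is a minimal sketch in Python of one possible test. It is not what CitEc does internally; it only relies on the fact that a PDF file begins with the signature "%PDF-", whereas an abstract page normally begins with HTML. The function name and the sample bytes are made up for illustration.

```python
def looks_like_pdf(first_bytes: bytes) -> bool:
    """Return True if the content starts with the PDF signature '%PDF-'."""
    return first_bytes.lstrip().startswith(b"%PDF-")


# Hypothetical first bytes of two downloads: a real PDF and an abstract page.
pdf_response = b"%PDF-1.4\n..."
html_response = b"<!DOCTYPE html><html><head><title>Abstract</title>"

checks = [looks_like_pdf(pdf_response), looks_like_pdf(html_response)]
print(checks)
```

A link that fails such a check most likely points to an intermediate page rather than to the full text.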

The third limit is more technical. In order to extract references, the PDF files need to be converted into plain ASCII text. This step is key to successfully completing the process, since a good-quality text representation of the document makes it easier to identify references. There is a wide variety of PDF files created in different ways, and not all of them can be converted.

Finally, the system parses the references section, which first needs to be isolated, to identify each reference and split it into its parts: title, author, year, etc. The parsing uses pattern-matching techniques, which in some cases are not able to identify the full list of existing references.
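To make this last step more concrete, here is a minimal sketch in Python of what isolating a references section and splitting references by pattern matching can look like. It is not the CitEc code: the sample text, the single pattern, and all names are assumptions chosen for illustration, and a real system would need many more patterns.

```python
import re

# Hypothetical plain-text output of a PDF conversion (all entries are made up).
SAMPLE_TEXT = """\
1 Introduction
Some body text citing earlier work.

References
Doe, J. (2001). A made-up paper on growth. Journal of Examples, 12, 1-20.
Roe, R. and Poe, P. (1999). Another invented title. Example Working Paper 7.
Smith 1987 An entry in a style the pattern does not cover
"""

# One pattern for "Authors (Year). Title. Rest" references. Other styles
# will not match, which is how references get missed.
REFERENCE_PATTERN = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>[^.]+)\.\s*(?P<rest>.*)$"
)


def isolate_references(text):
    """Return the non-empty lines that follow a 'References' heading."""
    heading = re.search(
        r"^[ \t]*(References|Bibliography)[ \t]*$", text, re.MULTILINE
    )
    if heading is None:
        return []
    tail = text[heading.end():]
    return [line.strip() for line in tail.splitlines() if line.strip()]


def parse_reference(line):
    """Split one reference into its parts, or return None if nothing fits."""
    match = REFERENCE_PATTERN.match(line)
    return match.groupdict() if match else None


parsed = [parse_reference(line) for line in isolate_references(SAMPLE_TEXT)]
for entry in parsed:
    print(entry)
```

The third sample entry comes back as None, which illustrates why the extracted list of references can be incomplete.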

As of the last update on December 31, 2007, the CitEc numbers are as follows: 527,357 articles and working papers are available in RePEc. Of them, 343,441 cannot be processed by the system due to the limitations mentioned in the first two points above, namely:

101,886 have no electronic representation

216,110 have restricted access

19,174 have no direct link to the document’s full text

6,271 have a wrong URL

That leaves 183,916 documents available to be processed by CitEc. Of them, the process was successfully completed for 134,130 papers, that is, 73% of the available documents. The complete list of sources and the number of processed documents for each series or journal is available here.
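For readers who want to check the arithmetic, the short Python sketch below recomputes these totals from the figures given in this post. The variable names are mine. The last figure, the share of all RePEc items that were successfully processed, is derived here and is not part of the original statistics.

```python
# Figures reported for the update of December 31, 2007.
total_items = 527_357
not_processable = {
    "no electronic representation": 101_886,
    "restricted access": 216_110,
    "no direct link to the full text": 19_174,
    "wrong URL": 6_271,
}
successfully_processed = 134_130

excluded = sum(not_processable.values())
available = total_items - excluded
success_rate = successfully_processed / available
# Derived figure: successfully processed items as a share of all items.
share_of_all = successfully_processed / total_items

print(f"excluded:     {excluded:,}")
print(f"available:    {available:,}")
print(f"success rate: {success_rate:.1%}")
print(f"share of all: {share_of_all:.1%}")
```

The success rate refers only to documents the system could reach; measured against everything listed in RePEc, the processed share is considerably smaller.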

All the previous considerations should be taken into account when CitEc data is used for scientific evaluation purposes. We still consider the data to be experimental.

From the point of view of RePEc archive maintainers there are a few basic steps they can take to improve the situation. For example:

  • provide direct and correct URLs to the documents’ full text
  • make use of the X-File-Ref to give the system an ASCII version of the references section of a particular document
  • help us to lobby the publishers and editors of the restricted journals asking them to send us metadata about references.

75% of the top 1000 economists are now registered with RePEc

January 8, 2008

The RePEc Author Service recently surpassed 15,000 registered authors, and the post relating this mentions the high coverage among top-ranked economists. To document this, take one popular ranking, the one by Tom Coupé that is based on publications from 1990 to 2000. Tom Coupé has two rankings, one where publications are weighted by the impact factors of the journals, the other where citations are counted. According to the “publications” ranking, 75% of the top 1000 economists are now registered with RePEc; according to the other, 65%. The difference comes from the fact that the latter also includes non-economists (political scientists, statisticians, demographers, law scholars, and sociologists) that are cited in Economics journals.

One particularly interesting aspect of these rankings is how the proportions of registered authors decline with rankings:

Ranks       Registered, publication ranking    Registered, citation ranking
1-100       93                                 77
101-200     81                                 72
201-300     78                                 69
301-400     73                                 76
401-500     77                                 66
501-600     71                                 61
601-700     73                                 54
701-800     77                                 55
801-900     62                                 62
901-1000    65                                 60
Total       750                                652

How can we explain this pattern? Are registered authors more likely to publish well or be cited? This may be true for more recent measures of visibility, but in 1990-2000, the RePEc Author Service was not yet functional. Are better-ranked authors then more likely to care about their visibility and thus more likely to register?


RePEc in December 2007, and what we have done over Year 2007

January 2, 2008

Every month, a short summary of what happened with RePEc is sent to the RePEc-announce mailing list. I will also put that message, slightly adapted, on this blog.

The major event this month is that we passed three important thresholds: 15,000 authors, 80% of the material now online, and 1/8 billion abstract views. For some hints at what 15,000 authors represent in the Economics profession, see elsewhere on the blog. Also, we have now released rankings for the most cited recent papers and articles.

As year 2007 is now over, we can reflect on what RePEc has achieved over that year. 158 archives were added, and the total of currently 844 archives have added 108,000 bibliographic items to RePEc, a 24% growth, with 240 new working paper series and 130 new journals. 105,000 new items are online, a 31% growth. 3,500 authors registered, almost ten a day, a 30% growth. Citation analysis coverage increased by 39%.

In 2007, we also added a few new features:

  • Compilations by institutions of all publications from affiliated and registered authors (find institutions on EDIRC)
  • Customized publication compilations: by defining a list of authors or by creating a reading list
  • Registered authors can now manage citations at the RePEc Author Service: delete erroneous ones and approve citations that were deemed dubious matches.
  • Rankings have been improved with more criteria, with rankings within fields and with citation rankings for recent items only.
  • The RePEc blog was inaugurated.

Finally, RePEc celebrated its 10th year in its current form. I think this was an impressive year, and I am looking forward to an even better year 2008!

In terms of traffic, December is expectedly calmer, but we still managed record numbers for the month: 1,822,061 abstract views and 504,315 downloads. This leads us to the thresholds we have passed this month:

125,000,000 cumulative abstract views
275,000 online articles
130,000 items with references
15,000 registered authors
1,900 working paper series
80% of all items available online


What are the most cited recent papers in Economics?

December 22, 2007

RePEc has been publishing for several years now a list of the most cited papers and articles cataloged in its database according to three criteria, recently expanded to six. By popular demand, we now also publish a list of the most cited recent papers and articles. The selection criterion here is that the last known version was published five or fewer years ago. That may sound like a long period, but considering the publication lags we suffer, I think it is reasonable. Thus, currently, articles (and papers) published in 2002 or thereafter qualify. Within a few days, those from 2002 will be dropped, so enjoy them while you can.

At the same time, the list of the most cited items has been expanded. Previously, only the top 200 were released, now we show the top 1‰. This list thus gets longer as RePEc expands and stands currently at 559. Again, the list is available according to six different criteria. So, check out whether your favorite papers are listed. And remember, all this citation data is still experimental as we try to improve on its quality, but still quite informative.


Citation Accuracy

December 19, 2007

Open Access News pointed out a very interesting article in the Journal of Cell Biology, Show Me the Data. Written by that journal’s executive editor, the executive editor of Journal of Experimental Medicine, and the Executive Director of The Rockefeller University Press, it first reiterates many quality issues with journal impact factors that seem to be well-known among biologists, but I suspect that they are news to many economists. Many of these issues also hold for citation rankings for individuals. Beyond that, there are other issues that make citation data suspect. Fortunately, there are potential solutions to many of these problems.

First, it helps to describe impact factors as they are calculated by Thomson Scientific (previously the Institute for Scientific Information, or ISI). An impact factor in year t is the number of cites in year t to all items published in that journal in years t-1 and t-2, divided by the number of research or review articles published in those two years. Criticisms include:

  • the data in the denominator and numerator are not consistent
  • Thomson is unclear on what exactly defines a research or review article
  • some journals have negotiated with Thomson on exactly what defines the article type
  • retracted papers are not excluded
  • of course, the mean is inflated by a few star papers
  • editors can game the system; apparently some do and some don’t (I’ve even seen this in the Wall Street Journal)
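To illustrate two of these criticisms, here is a minimal sketch in Python with entirely made-up numbers for a hypothetical journal. It shows how a single star paper pulls the mean far above what a typical article receives, and how the ratio shifts if the numerator counts cites to items that the denominator leaves out. The latter is only one way such an inconsistency could arise, and the function and variable names are mine, not Thomson's.

```python
from statistics import median


def impact_factor(cites_in_year_t, citable_items):
    """Cites in year t to items from years t-1 and t-2, per citable item."""
    if citable_items <= 0:
        raise ValueError("need at least one citable item")
    return cites_in_year_t / citable_items


# Hypothetical cites received in year t by each research or review article
# published in years t-1 and t-2; the last entry is one star paper.
cites_per_article = [0, 0, 1, 1, 2, 2, 3, 3, 4, 84]
# Hypothetical cites to other items (editorials, letters) that enter the
# numerator although those items are not counted in the denominator.
cites_to_other_items = 10

consistent = impact_factor(sum(cites_per_article), len(cites_per_article))
inconsistent = impact_factor(
    sum(cites_per_article) + cites_to_other_items, len(cites_per_article)
)
typical = median(cites_per_article)

print(consistent, inconsistent, typical)
```

In this invented example the median article is cited far less often than the impact factor suggests, which is the sense in which the mean is inflated.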

The authors go on to say that they contacted Thomson and received some of their data. They found numerous errors in how articles were categorized. Further, “The total number of citations for each journal was substantially fewer than the number published” as reported by Thomson. When they requested further data from Thomson, the data still didn’t add up. They conclude “It became clear that Thomson Scientific could not or (for some as yet unexplained reason) would not sell us the data used to calculate their published impact factor.”

Their bottom line is even more clear: “If an author is unable to produce original data to verify a figure in one of our papers, we revoke the acceptance of the paper. We hope this account will convince some scientists and funding organizations to revoke their acceptance of impact factors as an accurate representation of the quality—or impact—of a paper published in a given journal. Just as scientists would not accept the findings in a scientific paper without seeing the primary data, so should they not rely on Thomson Scientific’s impact factor, which is based on hidden data.”

Besides the points reiterated and brought up in the Journal of Cell Biology, there are further accuracy issues with Thomson data. For example, to identify authors, they only use initials for their first and middle names. As they pool papers from all fields, this is a more severe error than one might first guess. Thomson reports that Kit Baum (known to Thomson as CF Baum) has publications in the Fordham Law Review (on nuclear waste) and the Sociology of Education (on group leadership).

A further issue is Thomson’s coverage; EconLit lists some 1,240 journals in our field, while the last time I checked Thomson covered but a fraction of these. I don’t have recent data for their coverage, but in total Thomson covers 8,700 journals encompassing all academic fields, so it seems doubtful that Thomson has substantially changed its economics coverage.

A further problem plaguing all citation analysis is simply extracting citation data with software. After all, citations are written for people, not machines. I haven’t seen data for Thomson on this (one wonders if it is public), but I do know that CitEc has faced a very real challenge here.

There would seem to be several solutions to these problems. First, all of us should treat impact factors and citation data with considerable caution. Basing journal rankings, tenure, promotion, and raises on uncritical acceptance of this data is a poor idea. In the extreme, one could imagine legal action in a tenure case.

Second, as the authors of the Journal of Cell Biology article argue, this data should be public, just as research findings should be. One initiative here is a Petition for OA [open access] to bibliographic data. My understanding is that through a “RePEc service” like EconPapers or IDEAS, raw CitEc data can be accessed by the public. Further, CitEc works with the RePEc Author Service to correct citations. Here’s one more reason to join those 15,000 who have registered with it!

Third, we should investigate putting unique identifiers into each reference so that software can easily read it. That is, besides listing the journal, its volume, and so on, it would also include a unique identifier for the cited paper. DOIs are one possibility, but it is prohibitively expensive to get a license to dispense DOIs. However, “RePEc handles,” which identify papers in RePEc, are permanent and also cover working papers. Thus, we might start including them in each reference. This highlights a further issue: there is little incentive for authors to add this to their citations as it aids others. Perhaps one step in this direction would be for sites like IDEAS, which provide references for papers in different formats like BibTeX or EndNote, to include the RePEc handle along with the current author, title, journal, etc.
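As a rough illustration of why such identifiers would help, here is a minimal sketch in Python that pulls handles out of reference strings with a single pattern. Everything in it is an assumption: the references and handles are invented, putting the handle in square brackets at the end of a reference is just one conceivable convention, and the pattern only assumes a handle looks like "RePEc:" followed by archive, series, and item codes separated by colons.

```python
import re

# Assumed handle shape: "RePEc:<archive>:<series>:<item>".
HANDLE_PATTERN = re.compile(r"RePEc:[A-Za-z0-9]+:[A-Za-z0-9]+:[^\s\]\)]+")

# Invented references; the bracketed handle is a hypothetical convention.
references = [
    "Doe, J. (2005). A made-up article. Journal of Examples, 3, 1-10. "
    "[RePEc:xxx:journl:v3y2005p1-10]",
    "Roe, R. (2006). A made-up working paper. Example Working Papers 42. "
    "[RePEc:yyy:wpaper:42]",
    "Poe, P. (2004). A reference without any identifier. Some Journal, 9, 5-8.",
]


def extract_handle(reference):
    """Return the first handle found in a reference, or None if there is none."""
    match = HANDLE_PATTERN.search(reference)
    return match.group(0) if match else None


handles = [extract_handle(reference) for reference in references]
print(handles)
```

With an identifier present, matching a citation to the cited paper becomes a lookup rather than a guess based on author names and titles; the third reference shows that anything without a handle would still need the usual parsing.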


15,000 authors on the RePEc Author Service

December 15, 2007

The 15,000th author registered recently on the RePEc Author Service (which also has another 5,000 registered, but without any works in their profile). See a list of all those registered at EconPapers or IDEAS. This gives us the opportunity to reflect on the coverage of this service: what proportion of academic economists is covered? Let me offer a few suggestions.

Assume that the works listed in RePEc provide a representative sample of all the works written by economists. Then determine how many of these works are listed in the profile of a registered author. By that account, about 40.1% have been claimed, and thus about 40% of the profession would be registered with RePEc. This latter number is in reality higher, due to several biases: a) some authors are not alive and cannot register; b) some registered authors have the unfortunate habit of removing working papers from their profile once they are published; c) some works listed are not written by economists, and these authors are less likely to register with RePEc.

Alternatively, estimate the number of authors in the world from the membership of academic societies. I guess the three largest societies are the American Economic Association (18,000 members), the European Economic Association (2,300 members) and the Econometric Society (5,500 members). Obviously, their memberships overlap, and not every one of their members is an author. But not every economist is a member either. Assume that adding their membership numbers corrects for all mismeasurements; then the RePEc Author Service covers 58% of the profession.
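The 58% figure follows directly from the numbers just given, as this short Python sketch shows. The variable names are mine, and the calculation inherits the strong assumption that the summed memberships approximate the size of the profession.

```python
registered_authors = 15_000
society_members = {
    "American Economic Association": 18_000,
    "European Economic Association": 2_300,
    "Econometric Society": 5_500,
}

# Assumption from the text: the summed memberships stand in for the
# total number of economists who are authors.
estimated_profession = sum(society_members.values())
coverage = registered_authors / estimated_profession

print(f"estimated profession: {estimated_profession:,}")
print(f"coverage: {coverage:.0%}")
```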

One can also observe a specific subsample of economists, those listed among the top 1000 by Tom Coupé. There, the RePEc Author Service covers 75% of the top 1000 by publications and 65% of the top 1000 by citations (which includes quite a few non-economists). But we have good reasons to believe these proportions are higher than for the whole population. Indeed the proportion is significantly higher for the better ranked within this sample, and we can extrapolate that those outside the top 1000 are less represented in the RePEc Author Service.

In summary, the RePEc Author Service covers between 40% and 75% of the profession. Possibly less, possibly more, likely in between.


Blog disruption

December 14, 2007

The RePEc blog was offline for a few days due to a hardware failure, along with a few other websites at Boston College, our host. Everything seems to be running well now, but please contact us if you see any remaining issues.


The end of print journals?

December 8, 2007

The (US) Association of Research Libraries released a few days ago a report entitled “The E-only Tipping Point for Journals: What’s Ahead in the Print-to-Electronic Transition Zone” (pdf). It makes the argument that sooner or later every publisher will turn to an electronic-only format in the face of rising (relative) costs of print formats. Currently, we are in a transition period where most journals have gone from print-only to print and electronic, and it is predicted that within 5 to 10 years, the only remaining printed journals will be the smallest and most specialized ones, which cannot afford the fixed cost of setting up electronic editions. Another feature of the transition is the large proportion of new journals that do not even bother with a print edition.

This discussion largely pertains to university press publishing, but can probably be extended to commercial publishing. Indeed, commercial publishers show signs that they want to discourage print editions, either through their subscription price structure or by making subscriptions electronic-only by default. In Economics, the dissemination of research, in terms of readership, is dominated by pre-prints (working or discussion papers) that have been all-electronic for some time now, with only a few exceptions. As far as I know, nobody misses the era of all-printed working papers: they were difficult to obtain unless you were in the “club”, only a few institutions had a systematic (but costly) way to disseminate them, and only established researchers had any chance of being read through this medium. People would even travel to some libraries to consult their working paper collections. Today, research is much more widely disseminated and researchers from outside the elite institutions have a better chance to follow and contribute to the research frontier. We hope RePEc has contributed to this democratization. Never has the use of electronic pre-prints been as widespread as now, possibly at the cost of reducing journals to historical records of research. Well, journals also act as gatekeepers through peer review, but you sometimes have to wonder about this as well when hearing all the complaints about this process.

A few interesting numbers from the study: 60% of 20,000 peer-reviewed journals are available in electronic format; library-provided electronic editions are read at least ten times more than print ones; only 30% of library subscriptions are print-only.