The citation extraction process in CitEc

CitEc is an experimental autonomous citation index, that is, it is a software system which is able to automatically extract references out of the full texts of documents and create links between citing references and cited papers.

With its last update, the CitEc database has reached almost three million references and more than one million citations between documents available in RePEc. This is an important threshold, but it is still far from being a complete set of citations. There are some limits in the reference extraction process:

First, the system needs open access to an electronic version of the document's full text. Many journals listed in RePEc have restricted access and are therefore excluded from CitEc unless they grant special access or push their citations to RePEc in other ways. We are working with some publishers that kindly provide us with metadata about references. We try to get as many publishers on board as possible, but unfortunately not all of them are willing to collaborate with us at this time. As a result, the data set is still made up mainly of references extracted from working papers. This has the advantage of providing the most up-to-date citation data, since working papers contain the most recent research results.

Second, the URL provided by the RePEc archive maintainer must be correct and must point to the PDF file containing the document's full text, not to an intermediate abstract page or similar. Some archives provide this kind of link to force researchers to pass through their institutional web pages. The system is unable to follow such links to the hidden papers, so they are missed in the reference extraction process.
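To illustrate the distinction between a direct PDF link and an intermediate page, here is a minimal Python sketch. It is not CitEc's actual check: the function name `classify_download` and the 1024-byte window are assumptions made for this example. It relies only on the fact that PDF files begin with the `%PDF-` marker, and it works on bytes that have already been downloaded.

```python
def classify_download(body: bytes, content_type: str = "") -> str:
    """Guess whether a downloaded payload is a PDF or an HTML page.

    Hypothetical helper for illustration; returns "pdf", "html" or "unknown".
    """
    # Only inspect the beginning of the payload (the window size is an assumption).
    head = body[:1024]

    # PDF files start with the "%PDF-" marker.
    if b"%PDF-" in head:
        return "pdf"

    # Strip parameters such as "; charset=utf-8" from the Content-Type header.
    mime = content_type.split(";")[0].strip().lower()
    if mime in ("text/html", "application/xhtml+xml") or b"<html" in head.lower():
        return "html"

    return "unknown"


if __name__ == "__main__":
    print(classify_download(b"%PDF-1.4\n..."))
    print(classify_download(b"<!DOCTYPE html><html>", "text/html; charset=utf-8"))
```

A URL whose payload is classified as "html" is likely an abstract or landing page rather than the full text.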

The third limit is more technical. In order to extract references, the PDF files need to be converted into plain ASCII text. This step is key to completing the process successfully, since a good-quality text representation of the document makes the references easier to identify. There is a wide variety of PDF files created in different ways, and not all of them can be converted.

Finally, the system parses the references section, which first needs to be isolated, to identify each reference and split it into its parts: title, author, year, etc. The parsing is done using pattern matching techniques, which in some cases are not able to identify the full list of existing references.
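As a rough illustration of pattern-matching reference parsing, and of why it can miss references, here is a small Python sketch. It is not the CitEc or CiteSeer parser: the pattern, the function name `parse_reference` and the sample reference are all invented for this example, and the pattern handles only one citation style, "Authors (Year). Title. Rest".

```python
import re

# Hypothetical pattern for a single style: "Authors (Year). Title. Rest".
# Real reference lists mix many styles, so one pattern is never enough.
REFERENCE_PATTERN = re.compile(
    r"^(?P<authors>.+?)\s*"           # author list, up to the year in parentheses
    r"\((?P<year>\d{4})[a-z]?\)"      # four-digit year, optional suffix as in "2005a"
    r"[.,:]?\s*"                      # optional punctuation after the year
    r"(?P<title>[^.]+)\.\s*"          # title, up to the first full stop
    r"(?P<rest>.*)$"                  # journal, volume, pages, etc.
)


def parse_reference(line):
    """Split one reference string into its parts, or return None if it does not match."""
    match = REFERENCE_PATTERN.match(line.strip())
    if match is None:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}


if __name__ == "__main__":
    # Invented reference, used only to exercise the pattern.
    sample = ("Smith, J. and Doe, A. (2005). Citation analysis in economics. "
              "Journal of Examples, 12(3), 45-67.")
    print(parse_reference(sample))
    # A different style is not recognised, so this reference would be missed.
    print(parse_reference("Smith J, Citation analysis in economics, 2005"))
```

The second call returns `None`, which mirrors the limit described above: a reference written in a style the patterns do not cover is simply not identified.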

As of the last update, on December 31, 2007, the CitEc numbers are as follows: 527,357 articles and working papers are available in RePEc. Of these, 343,441 cannot be processed by the system due to the limitations mentioned in the first two points above, namely:

  • 101,886 have no electronic representation
  • 216,110 have restricted access
  • 19,174 have no direct link to the document full text
  • 6,271 have a wrong URL

That leaves 183,916 documents available to be processed by CitEc. Of these, the process was completed successfully for 134,130 papers, that is, 73% of the available documents. The complete list of sources and the number of processed documents for each series or journal is available here.
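The figures above can be checked with a few lines of Python. The numbers are taken from this post; the variable names and category labels are chosen only for this sketch.

```python
# Figures from the December 31, 2007 update, as reported in this post.
total_documents = 527_357
not_processable = {
    "no electronic representation": 101_886,
    "restricted access": 216_110,
    "no direct link to full text": 19_174,
    "wrong URL": 6_271,
}
processed = 134_130

# Documents excluded before extraction can even start.
excluded = sum(not_processable.values())

# Documents the system can actually attempt to process.
available = total_documents - excluded

# Share of available documents that were processed successfully.
success_rate = processed / available

if __name__ == "__main__":
    print(f"excluded:  {excluded:,}")
    print(f"available: {available:,}")
    print(f"processed: {processed:,} ({success_rate:.0%} of available)")
```

Note that the 73% refers to the documents available for processing, not to all documents listed in RePEc.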

All the previous considerations should be taken into account when CitEc data is used for scientific evaluation purposes. We still consider the data to be experimental.

From the point of view of RePEc archive maintainers there are a few basic steps they can take to improve the situation. For example:

  • provide direct and correct URLs to the documents' full text
  • make use of the X-File-Ref to give the system an ASCII version of the references section of a particular document
  • help us lobby the publishers and editors of the restricted journals, asking them to send us metadata about references.

5 Responses to The citation extraction process in CitEc

  1. Richard Tol says:

    at the risk of creating more work for volunteers who work very hard already

    I looked at my working paper series. Although we do better than average, CitEc has processed less than 100%. It would be good if the archive maintainers would get feedback on this. In 2006, 2 of 33 papers were not properly processed. Which 2? Why? How can I solve this? Presumably, the software produces an error message. That can be placed on a website for archivers to look at.

    On journals: Why not write to the editors of the offending journals? Editors have leverage over publishers.

    In general: Google Scholar has mastered the trick that CitEc tries to perform. Anne-Wil Harzing has managed to put a user-friendly interface on top of Google Scholar. See Publish or Perish: http://www.harzing.com/resources.htm Should CitEc switch technology?

  2. JMBC says:

    Dear Richard, thanks for your comments.

    In fact, editors can find out which of their papers have not been processed, and why, by clicking on the series handle that appears after the series name in the table: http://citec.repec.org/topcoverage.html. Furthermore, they may use the section “archive maintainers” at http://citec.repec.org to obtain the same information. Maybe this possibility is not well known among editors and we should publicize it in the monthly mailings.

    Unfortunately we can’t compete with Google Scholar in technology :) At the moment CitEc uses a modified version of the CiteSeer software and we are working to improve it. We are always short of volunteers with programming experience. This is a public call for anyone out there willing to join the CitEc team!!

    Regards, Jose Manuel Barrueco

  3. I find the attitude of some publishers who do not want to collaborate with RePEc scandalous. They are hurting RePEc users. More in blog post.

  4. Presumably they want to set up their own data bank–and charge for it, of course.

    TIME magazine of Dec. 24, 2007, pp. 28f. reports that the program MATRIX of the American company Seisint has been bought by the British-Dutch firm Reed-Elsevier for $24 million. It may be a very effective tool for linking research papers, and it is perhaps not too far-fetched to assume that this is the purpose of the acquisition.

    MATRIX was originally devised to search data banks and combine information in order to detect terrorists. It was promoted by the consulting firm Giuliani Partners, directly related to the presidential candidate of the same name. (The consulting firm earned $6.5 million in commissions for promoting the software.) The Federal Government and several states signed up, but MATRIX was eventually dropped because of privacy concerns. Now it has been sold to Reed-Elsevier. Maybe MATRIX will combine RePEc data with Elsevier data that are kept away from RePEc, and thereby outperform RePEc.

    I have no idea how to react to this. I do not like the idea of preventing commercial data banks from harvesting RePEc data. A weak response would be that economists increasingly refuse to referee for journals of publishers who pursue policies they find questionable. This seems morally sound and effort saving at the same time. It is also easy to withdraw from editorial boards of journals pursuing questionable policies. This would be particularly effective if prominent economists decided that way. After all, they help those journals with their reputation, rather than the other way round. The community should take notice of economists who support journals pursuing questionable policies and should reckon members of editorial boards as being supportive of the policies of the journals they endorse. Another suggestion would be to give preference to citing open access versions, rather than proprietary versions of papers, or to cite all proprietary stuff through the IDEAS link which gives the list of all available versions.

    Ekkehart

  5. netpromoguy says:

    Dear Mr Richard,
    Thank you for your comments. According to the numbers of the last update, December 31, 2007, we see there has been a major improvement, taking into consideration that 72% of the documents available were processed. I feel optimistic.
