RePEc in October 2022

November 8, 2022

We had a first last month: not a single new RePEc archive. Does this mean that every working paper provider and every publisher is now participating in RePEc? We do not believe so and hope to see many more on board. Still, we got good traffic, with 476,515 file downloads and 1,878,675 abstract views in October 2022. And we reached the following major milestone:

400,000,000 cumulative abstract views on IDEAS


How to create a (good) PDF

November 4, 2022

You may read the title of this blog and think, “Elementary.” Before making that assumption based on years of experience creating PDFs for sharing papers, take a moment to consider the notion of a good PDF. Despite being named the Portable Document Format, PDF isn’t fully portable. The look and feel, and sometimes even the meaning, of a document won’t transfer across operating systems unless everything needed is self-contained in the file. For instance, if a font is neither embedded in the file nor native to the system rendering it, a different font will be substituted. A document created with ITC Symbol Medium, a proprietary TrueType font, may therefore render differently between PDF viewers, losing the intended meaning. Let’s avoid this embarrassing mishap and create a good PDF.

Page from Federal Reserve Bank of Dallas 1999 Annual Report, rendered in Firefox v.103.0.1 PDF viewer. Text appears in Latin script and characters are spaced appropriately.

Page from Federal Reserve Bank of Dallas 1999 Annual Report, rendered with PDF.js iframe v.2.9.359. Text appears in Greek rather than Latin script and characters are spaced appropriately.

Page from Federal Reserve Bank of Dallas 1999 Annual Report, rendered with Preview v.11.0 (used on MacOS). Text appears in Latin script and characters are spaced far apart.

A short history of PDF

What is PDF? In 1991, John Warnock, co-founder of Adobe Inc., conceived a universal format to communicate visually meaningful information across operating systems. Adobe realized this technology in 1992 as the Portable Document Format, or PDF. In 2008, Adobe’s proprietary file format was standardized as ISO 32000, based on PDF v. 1.7. The latest version of ISO 32000 was released in 2020 and details PDF v. 2.0. PDFs are electronic documents that are either digitized or born digital, i.e., digital surrogates of a physical document or documents created with digital editing software, respectively.

Key components

  1. Portable look and feel between operating systems—inherent to PDF
  2. Embedded structure and semantics—enabled with Tagged PDF
  3. Fully self-contained—defined by PDF/A

The world of PDF is vast. For the purpose of this post, we’re thinking about PDF as a format for disseminating born digital scholarly papers, and a good PDF is understood as one that is accessible and self-contained.

About the Good PDF

In addition to the standard PDF, there are PDF extensions or subsets, including PDF/E, PDF/VT, PDF/X, PDF/UA, and PDF/A. PDF/E, PDF/VT, and PDF/X specify requirements that optimize publishing and printing and are largely focused on handling complex graphics and layout, whereas PDF/UA and PDF/A are more generalized to any type of content and focus on how that content exists and is presented in the PDF. Standardized in 2012 as ISO 14289, PDF/UA is a “Universally Accessible” PDF variant that requires content blocks to be tagged, making them navigable for screenreaders. Tagged PDF defines structure and semantics so that the content is not only machine readable but also presented in a meaningful order. PDF/A, or PDF-Archival, is standardized in ISO 19005 as a format for long-term preservation of electronic documents. PDF/A is defined in four versions, as well as three levels of conformance to those versions. Together, the versions and conformance levels are flavors of PDF/A. These flavors do not suggest preference; they are simply variants that provide different levels of flexibility as to what can or cannot be contained within the file. In addition to content tagging, PDF/A limits the types of content (or objects) that can be included in a PDF.

As mentioned above, in the world of PDF, there are two types of documents: digitized and born digital. Tagging digitized documents requires manual work on the file, which is time consuming and often not possible due to the nature of how documents are digitized. As such, digitized documents generally forgo the tagged PDF requirement, taking the PDF/A-b (basic) conformance level that does not require tagging. This is an inherent vice of print material. Born digital content, however, can easily be tagged, as meaningful structure is built into word processing software and can be understood by PDF creation software. When possible, born digital documents should conform to PDF/A-a (accessible).

Now that you’ve decided on the conformance level, what version should you use? Subsequent versions of ISO 19005 consider the evolution of documents and standards. For example, PDF/A-1 does not permit embedding of certain image and content objects, including JPEG2000 and 3D images, and CAD drawings. These embedded objects are permitted in later versions of 19005, as standardization, support, and uptake around those previously prohibited types of content increased. Repositories may prefer earlier versions and may prohibit later versions of PDF/A because the later versions are more flexible and, thus, have been considered less preservation-friendly.

How to create a Good PDF

Considering the type of content and your born digital file, PDF/A-1a (version 1, accessible) is the preferred PDF/A flavor for working papers. However, given the landscape of version preference and software support, PDF/A-1b (version 1, basic) may be the only flavor you can achieve: not all software supports tagged PDF, and adding that structure would then need to be done with PDF creation software.

Word processing software, such as LaTeX, Microsoft Word, and LibreOffice, has built-in functions that create PDF derivatives from the native file format (i.e., .tex, .docx, and other word processed formats). Additional steps are needed to create a PDF/A, and listed below are some guides for creating your PDF/A-1a or -1b.

LaTeX

Microsoft Word

LibreOffice

Because software and software uptake change, there is no universal guide for creating a PDF/A. These guides should send you in the right direction. While software may create a seemingly good PDF/A, you can complete manual and automated validation to ensure that your PDF/A is compliant.

Validation software

Manual checklist

  • Is meaningful descriptive metadata embedded?
  • Did fonts embed as expected or are there visual discrepancies?
  • Is the content ordered correctly so that the document can be read by a screenreader?
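As a quick complement to the checklist, the sketch below reads which PDF/A flavor a file claims to be. It assumes the file carries an uncompressed, UTF-8 XMP metadata packet with the `pdfaid:part` and `pdfaid:conformance` properties, which is how PDF/A files usually identify themselves. The function name `claimed_pdfa_flavor` and the sample bytes are illustrative. This only reports the claim; it does not replace validation software, since a file can declare a flavor without complying with it.

```python
import re

# XMP can express the identifier as an element (<pdfaid:part>1</pdfaid:part>)
# or as an attribute (pdfaid:part="1"); both forms are matched.
PART = re.compile(rb"pdfaid:part(?:>|=[\"'])\s*(\d)")
CONFORMANCE = re.compile(rb"pdfaid:conformance(?:>|=[\"'])\s*([A-Za-z])")


def claimed_pdfa_flavor(pdf_bytes):
    """Return the PDF/A flavor a file claims (e.g. 'PDF/A-1b'), or None."""
    part = PART.search(pdf_bytes)
    if part is None:
        return None
    conformance = CONFORMANCE.search(pdf_bytes)
    level = conformance.group(1).decode("ascii").lower() if conformance else ""
    return "PDF/A-" + part.group(1).decode("ascii") + level


# Synthetic stand-in for a real file; in practice one would pass
# pathlib.Path("paper.pdf").read_bytes() (hypothetical file name).
sample = (
    b"%PDF-1.4\n"
    b"<rdf:Description xmlns:pdfaid='http://www.aiim.org/pdfa/ns/id/'>"
    b"<pdfaid:part>1</pdfaid:part>"
    b"<pdfaid:conformance>B</pdfaid:conformance>"
    b"</rdf:Description>"
)
print(claimed_pdfa_flavor(sample))
```

A result of `None` suggests the file makes no PDF/A claim at all, which is a useful first signal before running a full validator.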

Next steps

Create a good PDF and be a steward of accessible and sustainable research dissemination throughout the working paper lifecycle.

Further reading

Oettler, A. (2013). PDF/A in a Nutshell 2.0. PDF Association. https://www.pdfa.org/resource/pdfa-in-a-nutshell-2-0/


RePEc in September 2022

October 11, 2022

25 years ago, in September 1997, IDEAS was launched. That was celebrated last month with a special cake. In other news, we counted 414,537 file downloads and 1,698,145 abstract views for the month, welcomed Società Italiana di Economia dello Sviluppo and University of Bremen as newly participating RePEc archives, and reached the following milestones:

1,200,000 articles with citations
1,800,000 items with citations
1,000,000 new working paper announcements distributed through NEP (a paper may be announced through several reports)
60,000 indexed books


What is RePEc? How does it operate?

September 29, 2022

Many are confused about what RePEc is and how it operates, in particular in relation to the various RePEc services. The core RePEc team gathered and drafted an attempt at high-level explanations, which are found below as well as on the RePEc homepage.

RePEc (Research Papers in Economics) is an initiative that seeks to enhance the dissemination of research in Economics and related areas. We want to make research more accessible both for the authors and the readers. RePEc is a crowd-sourced effort: a) thousands of people and organizations contribute the underlying data, b) a core team of contributors manage the system, and c) sponsor organizations provide the infrastructure. As such, the RePEc initiative has no central expenses, and thus can provide all services for free to all users.

How RePEc operates:

Every publisher or provider puts text files describing their publications on their server. These files follow a simple but rigorous machine-readable syntax. They can then be automatically mirrored and made available to the public on the various RePEc websites. Some RePEc services complement these data with additional information such as citations or author details. RePEc is thus a facilitator that organizes the data for others to use.

How you can use RePEc as a provider or publisher:

Join over 2000 providers and publishers to increase the visibility of your publications. Follow these step-by-step instructions to create your RePEc archive. They show how to quickly set up your RePEc archive on your http, https, or ftp server and describe the syntax of the required metadata for working papers, journal articles, books, chapters, and software. For the complete technical details on the infrastructure and the metadata, you can also read about the Guilford protocol and ReDIF.
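To give a feel for the metadata files, here is a minimal sketch of a ReDIF-style working paper template and a small reader for it. The archive code `aaa`, the series `wpaper`, the author names, and the URL are made up for illustration, and `parse_template` is a simplified reading of the syntax (one `Field: value` per line, indented lines continue the previous field), not the official parser. The step-by-step instructions and the ReDIF documentation remain the reference for the actual rules.

```python
import re
from collections import defaultdict

# Hypothetical record: archive code, series, names, and URL are invented.
SAMPLE = """\
Template-Type: ReDIF-Paper 1.0
Author-Name: Jane Doe
Author-Name: John Smith
Title: An Example Working Paper
Abstract: A short abstract that
  continues on a second line.
Creation-Date: 2022-10
File-URL: https://example.org/papers/wp2022-01.pdf
File-Format: application/pdf
Number: 2022-01
Handle: RePEc:aaa:wpaper:2022-01
"""

# A field starts at the beginning of a line: a name, a colon, then the value.
FIELD = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):\s*(.*)$")


def parse_template(text):
    """Collect field values by lower-cased field name; repeated fields are kept."""
    fields = defaultdict(list)
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        match = FIELD.match(line)
        if match:
            current = match.group(1).lower()
            fields[current].append(match.group(2).strip())
        elif current is not None:
            # Indented line: treat it as a continuation of the previous field.
            fields[current][-1] += " " + line.strip()
    return dict(fields)


record = parse_template(SAMPLE)
print(record["author-name"])
print(record["handle"])
```

Note that each author sits in a separate `Author-Name` field, which is what lets other services attribute the work to registered authors.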

How you can use RePEc as a reader:

You can explore economic literature on two RePEc services. On EconPapers and IDEAS, search and browse, or follow links to author profiles, references, citations, keywords, or classifications. You can get notifications of new material with two other RePEc services, NEP and MyIDEAS.

How you can use RePEc as an author:

With the RePEc Author Service, you can create a profile of your indexed works. This allows the other RePEc services to link your profile to your works and vice versa. You also get notifications about the visibility of your works and citations newly found by CitEc. And if your publisher does not participate in RePEc, you can upload missing items to MPRA, copyright permitting.

How you can use RePEc as an institution:

RePEc can help you make your working papers (pre-prints) more visible, track how your researchers publish, and provide metrics to evaluate impact.

How you can leverage RePEc data as a researcher:

Data assembled by RePEc can be used for many purposes. Examples are academic research, tracking how working papers get published, adding metrics to a website, and evaluating researchers or institutions. We have instructions on how to access the data, including through an API.

There is much more that RePEc can do for you. Explore the RePEc homepage and the various services listed there!


RePEc in August 2022

September 6, 2022

Summer (vacation) is over and RePEc users are getting back to work. Just one new archive last month: New Zealand Productivity Commission. We counted 373,428 file downloads and 1,487,271 abstract views through reporting services. And we reached the following milestones:

240,000,000 cumulative article abstract views
2,500,000 cumulative software component downloads
1,700,000 items with extracted references
65,000 registered authors
5,000 software components available online


IDEAS turns 25

September 1, 2022

25 years ago, IDEAS was launched. A few months after RePEc was created, it built on the data about publications that RePEc was making available. At launch, about 40,000 papers were indexed, with about 4,000 being online. Now the numbers amount to over 4.1 million and 3.7 million. Abstract pages have received a total of about 400 million unique views from every country, with raw totals a large multiple of that thanks to a myriad of bots (hint: an API is available).

IDEAS did not start in a vacuum. At the time, two other sites were already displaying RePEc data, BibEc and WoPEc, part of the NetEc family of websites dedicated to Economics. The first release of IDEAS was in fact based on code used for these sites, contributed by José Manuel Barrueco Cruz. Over the next years other sites were created to display RePEc data for the user, with ultimately only EconPapers and IDEAS surviving the healthy competition. Other sites outside of the repec.org domain also leverage RePEc data.

Initially, IDEAS was just displaying the bibliographic data that is at the core of RePEc. Over time, it gradually integrated data from other RePEc services, such as author profiles, references and citations, the fields that works belong to, and how much they are viewed. Rankings and impact factors are now the most popular single pages after the search form.

IDEAS has also little by little added some custom features for the user, most prominently MyIDEAS that allows economists to build an online bibliography or track new additions to RePEc in many customizable ways. With the recent pandemic, a calendar of online seminars was introduced and proved to be quite popular.

IDEAS never got funding. It has been hosted over time by three sponsors: Université du Québec à Montréal, the University of Connecticut, and the Federal Reserve Bank of St. Louis.


RePEc in July 2022

August 8, 2022

While the Summer months typically put RePEc sites somewhat into a lull in terms of traffic, the CitEc project continues to add newly discovered citations at a furious pace. We welcome three new RePEc archives: Regional Agency for Technology and Innovation (Puglia), CoronaNet Project, INAPP. We counted 364,085 file downloads and 1,490,661 abstract views on the reporting RePEc services. And here are the milestones reached last month:
2,000,000 cumulative book chapter downloads
15,000 authors in the RePEc Genealogy
5,000 indexed software components


RePEc Genealogy, the academic family tree of economists

August 5, 2022

The RePEc Genealogy has now reached 15,000 entries. This site describes where and when economists got their terminal degree, as well as who their advisors were. This makes it possible to build a family tree of the economics profession as well as gather information about graduate programs.

The data is gathered by crowdsourcing, much like a wiki: users registered with the RePEc Author Service can log in to the RePEc Genealogy and add or amend the records for themselves, their students, or their advisors. They can also add former students of the graduate programs they currently work with or graduated from.

Beyond displaying it on the RePEc Genealogy site, the collected data is used in a myriad of ways:

  • Various studies on economists have leveraged this data.
  • Profiles of economists on IDEAS display part of the RePEc Genealogy information where available.
  • EDIRC, the directory of economics institutions, has for each relevant institution a list of alumni and a link to a compilation of their publications.
  • Female representation in economics by cohort.
  • How well graduate students do is a criterion for the rankings of economists and institutions.
  • The year of graduation is also used for rankings of economists by cohort.

Over 4000 people have already contributed to the RePEc Genealogy. Everyone is welcome to make it more complete and useful.


RePEc in June 2022

July 11, 2022

CitEc is continuing a remarkable effort at citation matching, adding several million in a month. We welcomed a few new archives: International Association on Public and NonProfit Marketing, Yildiz Social Science Review, Bulletin of Political Economy, Spanish National Markets and Competition Commission (CNMC), Universidad ORT Uruguay. We counted 428,920 file downloads and 1,735,689 abstract views last month. And we reached the following milestones:
25,000,000 matched citations
1,000,000 journal articles with extracted references
600,000 working papers with extracted references
30,000 books available online
30,000 book chapters with citations
20,000 books with citations


How publishers can ensure their data looks right on RePEc

July 4, 2022

All material indexed in RePEc is provided by the respective publishers. They make this information available using a metadata syntax that RePEc defined in 1997 and that has not changed since, except for a few additions. But adhering to this syntax is important, as errors disqualify items from indexing and other problems may lead to various issues. If something is amiss or missing, every IDEAS or EconPapers page has an email contact listed for alerting the maintainer of the relevant data.

That said, RePEc helps the maintainers in various ways so that they can proactively address any problems. Each month, they receive an email with various statistics and a link to their “problems” page on the EconPapers checker (add the three-letter archive code to the URL to get more details), which shows data download problems, detected syntax issues, and bad URLs to full text. EconPapers and IDEAS also provide FAQs. Re-reading the initial setup instructions or the ones for new maintainers can also prove useful.

The most frequent issues that appear in the EconPapers checker are:

  • RePEc archive has moved from http to https: the maintainer needs to change the URL line in the archive template and alert someone in the RePEc team about the new location to fix the download process.
  • A series or journal is missing the corresponding series template.
  • A handle (identifier) is used multiple times. Handles are supposed to uniquely and permanently define any item in RePEc. Re-using them is a source of major problems.
  • Missing end-of-line that merges two fields.
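Two of the issues listed above, reused handles and a missing end-of-line that merges two fields, lend themselves to a quick local check before uploading files. The sketch below is a rough heuristic, not the EconPapers checker: the helper names, the short list of field names, and the sample records (archive code `aaa`, series `wpaper`) are all assumptions made for illustration.

```python
import re
from collections import Counter

# Small, illustrative subset of field names; a real file may use many more.
KNOWN_FIELDS = ("Template-Type", "Author-Name", "Title", "Abstract",
                "Creation-Date", "File-URL", "Number", "Handle")

HANDLE_LINE = re.compile(r"^Handle:\s*(\S+)", re.MULTILINE)
# A known field name glued directly onto preceding text hints at a lost newline.
MERGED = re.compile(r"\S(" + "|".join(map(re.escape, KNOWN_FIELDS)) + r"):")


def duplicate_handles(text):
    """Return handles that appear on more than one 'Handle:' line."""
    counts = Counter(HANDLE_LINE.findall(text))
    return sorted(handle for handle, n in counts.items() if n > 1)


def merged_field_lines(text):
    """Return 1-based line numbers where two fields seem to share a line."""
    return [number for number, line in enumerate(text.splitlines(), start=1)
            if MERGED.search(line)]


SAMPLE = """\
Template-Type: ReDIF-Paper 1.0
Title: First paper
Handle: RePEc:aaa:wpaper:2022-01

Template-Type: ReDIF-Paper 1.0
Title: Second paper
Handle: RePEc:aaa:wpaper:2022-01

Template-Type: ReDIF-Paper 1.0
Title: Third paper
Number: 2022-03Handle: RePEc:aaa:wpaper:2022-03
"""

print(duplicate_handles(SAMPLE))
print(merged_field_lines(SAMPLE))
```

In the sample, the first two records share a handle and the last record has lost the line break before `Handle:`, so that handle would not be read as its own field.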

Other problems cannot be detected through an automated process. Here, maintainers need to follow appropriate conventions or check that the visuals on the RePEc sites look right. Examples are:

  • Inappropriate use of a data field. Examples are putting a working paper number in a title, adding affiliations to an author name, putting an abstract in a title, or putting keywords and JEL classifications in the abstract. Each piece of information has its own field so that appropriate bibliographic records can be created.
  • Each author needs to be in their own author name field. Lumping them together in one field makes it impossible to attribute the work to registered authors.
  • When some work is available in multiple languages or is translated, each title goes into its own title field instead of being merged into one. Also, the mention of the language goes into the language field, not in the title.
  • Errors in character encoding lead to records with funny-looking characters. This happens when cutting and pasting strings from a file in one encoding to a file with a different encoding. Characters with accents (é, ñ, ü, ç, å), ligatures (ff, fi, ffl, æ, ß), non-Latin character sets (Cyrillic, Arabic), and other special characters (long hyphens, Windows quotation marks and apostrophes) are especially problematic. They also make author or citation matching more difficult. The solutions are to fix these individually in the RePEc files and, if those are encoded as UTF-8, to use the .redif extension instead of .rdf (be careful not to have both files in the RePEc archive, leading to duplicated handles).
  • No HTML markup should be present. The result in RePEc services and sites is unpredictable. The only exception is the paragraph tag (<p>), to be used to separate paragraphs in an abstract. The same applies to LaTeX or TeX markup.
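The character encoding problem described above is easy to reproduce, which also helps in recognizing it. The sketch below shows how UTF-8 text turns into "funny looking characters" when read as Latin-1, a best-effort reversal of that specific mix-up, and how ligature characters can be folded for matching purposes. The function names are illustrative, and the repair only covers the UTF-8-read-as-Latin-1 case; other encoding mix-ups need to be fixed by hand in the RePEc files.

```python
import unicodedata


def simulate_mojibake(text):
    """Show what UTF-8 text looks like when its bytes are misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def try_repair(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return the input unchanged otherwise."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def fold_ligatures(text):
    """Replace compatibility characters such as the 'ffi' ligature with plain letters."""
    return unicodedata.normalize("NFKC", text)


broken = simulate_mojibake("Jos\u00e9")   # 'José' as it appears after the mix-up
print(broken)
print(try_repair(broken))
print(fold_ligatures("e\ufb03cient"))     # word typed with the single 'ffi' ligature
```

Running a check like this on author names before publishing the files can catch problems that would otherwise make author or citation matching harder.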