You may read the title of this blog and think, “Elementary.” Before making that assumption based on years of experience creating PDFs for sharing papers, take a moment to consider the notion of a good PDF. Despite its name, Portable Document Format, PDF isn’t fully portable. The look and feel, and sometimes even the meaning, of a document won’t transfer across operating systems unless everything the document needs is contained in the file. For instance, if a font is neither embedded in the file nor installed on the system rendering it, the viewer substitutes a different font. A document created with ITC Symbol Medium, a proprietary TrueType font, may render differently between PDF viewers, losing the intended meaning. Let’s avoid this embarrassing mishap and create a good PDF.
Page from Federal Reserve Bank of Dallas 1999 Annual Report, rendered in Firefox v.103.0.1 PDF viewer. Text appears in Latin script and characters are spaced appropriately.
Page from Federal Reserve Bank of Dallas 1999 Annual Report, rendered with PDF.js iframe v.2.9.359. Text appears in Greek rather than Latin script and characters are spaced appropriately.
Page from Federal Reserve Bank of Dallas 1999 Annual Report, rendered with Preview v.11.0 (used on MacOS). Text appears in Latin script and characters are spaced far apart.
A short history of PDF
What is PDF? In 1991, John Warnock, co-founder of Adobe Inc., conceived a universal format to communicate visually meaningful information across operating systems. Adobe realized this technology in 1992 as the Portable Document Format, or PDF. In 2008, Adobe’s proprietary file format was standardized as ISO 32000, which is based on PDF v. 1.7. The latest version of ISO 32000 was released in 2020 and details PDF v. 2.0. PDFs are electronic documents that are either digitized (digital surrogates of a physical document) or born digital (documents created with digital editing software). Three characteristics of PDF matter here:
- Portable look and feel between operating systems—inherent to PDF
- Embedded structure and semantics—enabled with Tagged PDF
- Fully self-contained—defined by PDF/A
The world of PDF is vast. For the purpose of this post, we’re thinking about PDF as a format for disseminating born digital scholarly papers, and a good PDF is understood as one that is accessible and self-contained.
About the Good PDF
In addition to the standard PDF, there are PDF extensions or subsets, including PDF/E, PDF/VT, PDF/X, PDF/UA, and PDF/A. PDF/E, PDF/VT, and PDF/X specify requirements that optimize publishing and printing and are largely focused on handling complex graphics and layout, whereas PDF/UA and PDF/A apply to any type of content and focus on how that content exists and is presented in the PDF. Standardized in 2012 as ISO 14289, PDF/UA (“Universal Accessibility”) is a PDF variant that requires content blocks to be tagged, making them navigable for screen readers. Tagged PDF defines structure and semantics so that the content is not only machine readable but also meaningfully ordered. PDF/A, or PDF-Archival, is standardized in ISO 19005 as a format for long-term preservation of electronic documents. PDF/A is defined in four versions, as well as three levels of conformance to those versions. Together, the versions and conformance levels are flavors of PDF/A. These flavors do not suggest preference; they are simply variants that provide different levels of flexibility as to what can or cannot be contained within the file. In addition to content tagging, PDF/A limits the types of content, or objects, that can be included in a PDF.
As mentioned above, in the world of PDF, there are two types of documents: digitized and born digital. Digitized documents must be tagged manually, which is time consuming and often not possible due to the nature of how documents are digitized. As such, digitized documents generally forgo the tagged PDF requirement, taking the PDF/A-b (basic) conformance level, which does not require tagging. This is an inherent vice of print material. Born digital content, however, can easily be tagged, as meaningful structure is built into word processing software and can be understood by PDF creation software. When possible, born digital documents should conform to PDF/A-a (accessible).
Now that you’ve decided on the conformance level, what version should you use? Subsequent versions of ISO 19005 consider the evolution of documents and standards. For example, PDF/A-1 does not permit embedding of certain image and content objects, including JPEG2000 and 3D images, and CAD drawings. These embedded objects are permitted in later versions of ISO 19005, as standardization, support, and uptake around those previously prohibited types of content increased. Repositories may prefer earlier versions, and may even prohibit later versions of PDF/A, because the later versions are more flexible and, thus, have been considered less preservation-friendly.
How to create a Good PDF
Considering the type of content and your born digital file, PDF/A-1a (version 1, accessible) is the preferred PDF/A flavor for working papers. However, given the landscape of version preference and software support, PDF/A-1b (version 1, basic) may be the only version you can achieve: not all software supports tagged PDF, and adding that structure afterward would need to be done with PDF creation software.
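The flavor names used above combine a version number with a conformance letter. As a minimal sketch of how those names decode, the hypothetical helper below (`parse_flavor` and `Flavor` are illustrative names, not part of any PDF library) splits a flavor string and reports whether that flavor requires tagging. It assumes versions 1 through 3 and the levels a and b described in this post, plus a third level, u (Unicode), which later versions define; treat it as an illustration rather than a statement of the standard.

```python
import re
from typing import NamedTuple

# Labels for conformance levels; "a" and "b" follow this post's wording,
# "u" is an assumption added for versions 2 and 3.
LEVELS = {"a": "accessible", "b": "basic", "u": "unicode"}


class Flavor(NamedTuple):
    version: int
    level: str
    requires_tagging: bool


_FLAVOR_RE = re.compile(r"^PDF/A-([1-3])([abu])$", re.IGNORECASE)


def parse_flavor(name: str) -> Flavor:
    """Split a flavor name such as 'PDF/A-1a' into version and level."""
    match = _FLAVOR_RE.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognized PDF/A flavor: {name!r}")
    version = int(match.group(1))
    level = match.group(2).lower()
    if version == 1 and level == "u":
        raise ValueError("PDF/A-1 defines only levels a and b")
    # Only level a requires a tagged (structured) PDF.
    return Flavor(version, LEVELS[level], requires_tagging=(level == "a"))
```

For example, `parse_flavor("PDF/A-1a").requires_tagging` is `True`, while `parse_flavor("PDF/A-1b").requires_tagging` is `False`, which mirrors the trade-off described above.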
Word processing software, such as LaTeX, Microsoft Word, and LibreOffice, has built-in functions that create PDF derivatives from the native file format, i.e., .docx, .tex, and other word-processed formats. Additional steps are needed to create a PDF/A, and listed are some guides for creating your PDF/A-1a or -1b.
Because software and software uptake change, there is no universal guide for creating a PDF/A. These guides should send you in the right direction. While software may create a seemingly good PDF/A, you can complete manual and automated validation to ensure that your PDF/A is compliant. For a manual check, ask:
- Is meaningful descriptive metadata embedded?
- Did fonts embed as expected or are there visual discrepancies?
- Is the content ordered correctly so that the document can be read by a screenreader?
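As a small first step toward automated checking, the sketch below reads the PDF/A flavor that a file declares about itself in its embedded XMP metadata (the `pdfaid:part` and `pdfaid:conformance` properties). The function name `declared_pdfa_flavor` is hypothetical, and the approach is a heuristic: it assumes the XMP packet is stored uncompressed in the file, and it only reports what the file claims. It does not check fonts, tagging, or reading order, so it is no substitute for a dedicated PDF/A validator.

```python
import re
from typing import Optional

# XMP may carry the values as elements (<pdfaid:part>1</pdfaid:part>)
# or as attributes (pdfaid:part="1"); \W{1,2} covers both '>' and '="'.
_PART_RE = re.compile(rb"pdfaid:part\W{1,2}(\d)")
_CONF_RE = re.compile(rb"pdfaid:conformance\W{1,2}([A-Za-z])")


def declared_pdfa_flavor(pdf_bytes: bytes) -> Optional[str]:
    """Return the declared flavor, e.g. 'PDF/A-1a', or None if undeclared."""
    part = _PART_RE.search(pdf_bytes)
    if part is None:
        return None
    conformance = _CONF_RE.search(pdf_bytes)
    level = conformance.group(1).decode("ascii").lower() if conformance else ""
    return f"PDF/A-{part.group(1).decode('ascii')}{level}"
```

To try it on a file, read the bytes first, for instance `declared_pdfa_flavor(open("paper.pdf", "rb").read())`, where `paper.pdf` stands in for your own document. A result of `None` suggests the export step did not produce a PDF/A at all.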
Create a good PDF and be a steward of accessible and sustainable research dissemination throughout the working paper lifecycle.
Oettler, A. (2013). PDF/A in a Nutshell 2.0. PDF Association. https://www.pdfa.org/resource/pdfa-in-a-nutshell-2-0/