PDF/A-3 documents from LaTeX

Information about generating PDF/A-3 documents using LaTeX

Preliminaries

PDF/A-3 files can deliver human-readable metrological documents (i.e., PDF reader software can display them) with digital files embedded. Calibration laboratories could therefore use PDF/A-3 documents to tailor reports to the needs of their customers.

The suggestion to use PDF/A-3 formats for reporting metrological data was first made by METAS. They made a proof-of-concept package available on GitHub, which creates PDF/A files using open-source tools, in particular the LaTeX typesetting system. However, recent developments have improved LaTeX's support for generating PDF files.

In 2020, the group that maintains LaTeX embarked on a multi-year development project to produce tagged and accessible PDF from existing LaTeX source files with no or only minimal configuration adjustments. This project has simplified the generation of PDF/A-3 documents. In the longer term, PDF files produced using LaTeX will contain rich, machine-readable semantic content, which is an exciting prospect for digital transformation in metrology.

In these pages, the DXFG provides information and examples to help its members develop PDF/A reporting capabilities. An associated GitHub repository provides copies of the files referred to here.

Resources

This resource uses the open-source software tools Python and LaTeX. Some information about installing Python and LaTeX is given here.

Independent software is needed to validate PDF/A-3 documents against the ISO standard for PDF/A-3 (ISO 19005-3). We use the veraPDF Implementation Checker tool to validate documents.

Files embedded in a PDF/A-3 document can be extracted by hand using any of the popular PDF reader applications or automatically using software. We use the Python package pypdf to extract files in the examples below, but there are other Python packages that may also be considered, such as pikepdf.
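
As a minimal sketch of the automated route, the following helper writes every embedded file to a folder. The function names save_attachments and extract_embedded_files are hypothetical (they are not part of pypdf or of the repository), and the sketch assumes that the attachments property of pypdf's PdfReader behaves like a mapping from file name to a list of byte strings, as it is used in the examples on this page.

```python
from pathlib import Path
from typing import Mapping, Sequence


def save_attachments(attachments: Mapping[str, Sequence[bytes]], out_dir: str) -> list[Path]:
    """Write each embedded file to out_dir and return the paths written."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for file_name, content_list in attachments.items():
        for index, content in enumerate(content_list):
            # Keep only the base name so an embedded name cannot escape out_dir.
            name = Path(file_name).name
            # One name may map to several files; later ones get a numeric prefix.
            if index > 0:
                name = f"{index}_{name}"
            path = target / name
            path.write_bytes(content)
            written.append(path)
    return written


def extract_embedded_files(pdf_path: str, out_dir: str) -> list[Path]:
    """Read the attachments of a PDF with pypdf and save them to out_dir."""
    # Imported here so that save_attachments needs only the standard library.
    from pypdf import PdfReader

    return save_attachments(PdfReader(pdf_path).attachments, out_dir)
```

Separating the file-writing step from the PDF-reading step means the same helper could be reused with another package, such as pikepdf, provided its attachments are first converted to a name-to-bytes mapping.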

Examples

Embedding a text file

Here is an example that produces a PDF/A-3 file with an embedded text file (the files shown are available in the minimal folder of the GitHub repository).

The embedded text file contains just one line:

This file will be attached to a PDF document

A LaTeX source file (ex.tex) is used to create the output, ex.pdf:

\DocumentMetadata{
    pdfversion=1.7,
    pdfstandard=A-3b,
}
\documentclass{article}

\usepackage{embedfile}
\embedfilesetup{     
      filesystem=URL,
      mimetype=application/octet-stream,
      afrelationship={/Data},
      stringmethod=escape
}

\begin{document}

You can check that this PDF document complies with the PDF/A-3 standard by using the veraPDF tool. 
A text file, \texttt{attachme.txt}, is embedded in the PDF document. 

\embedfile{attachme.txt}

\end{document}

The output ex.pdf is generated by running pdflatex on the command line, like this:

> pdflatex ex.tex

The \DocumentMetadata command must come before \documentclass in the LaTeX source file. It activates the new PDF features of LaTeX and, here, declares the PDF version and the PDF/A-3b standard.
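
The ordering rule can be checked automatically before a file is compiled. The following is only a rough sketch: the function name metadata_precedes_class is hypothetical, and the comment stripping is deliberately simple (it ignores everything after an unescaped % on a line and does not handle verbatim text).

```python
import re


def metadata_precedes_class(tex_source: str) -> bool:
    """Return True if \\DocumentMetadata appears before \\documentclass."""
    # Drop comments (an unescaped % to the end of the line) so that
    # commented-out commands are not counted.
    stripped = re.sub(r"(?<!\\)%.*", "", tex_source)
    meta = stripped.find(r"\DocumentMetadata")
    cls = stripped.find(r"\documentclass")
    return meta != -1 and cls != -1 and meta < cls
```

A build script could call this on the text of ex.tex and stop with a clear message when it returns False, rather than leaving the problem to be discovered at validation time.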

Files are embedded using the command \embedfile from the embedfile package.

The veraPDF checker can be used to verify that the PDF file produced complies with the PDF/A-3b standard.
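
Validation can also be scripted. The sketch below assumes that the veraPDF command-line tool is installed and available on the PATH as verapdf, and that it accepts a --flavour option with the value 3b; both assumptions should be checked against verapdf --help for the installed version. The function names verapdf_command and run_verapdf are hypothetical.

```python
import shutil
import subprocess


def verapdf_command(pdf_path: str, flavour: str = "3b", executable: str = "verapdf") -> list[str]:
    """Build the command line for validating one PDF against a PDF/A flavour."""
    # Assumption: the installed veraPDF CLI accepts '--flavour 3b'.
    return [executable, "--flavour", flavour, pdf_path]


def run_verapdf(pdf_path: str, flavour: str = "3b") -> str:
    """Run veraPDF and return whatever report it prints to standard output."""
    if shutil.which("verapdf") is None:
        raise FileNotFoundError("verapdf was not found on PATH")
    result = subprocess.run(
        verapdf_command(pdf_path, flavour),
        capture_output=True,
        text=True,
        check=False,  # inspect the report rather than relying on the exit code
    )
    return result.stdout
```

The report text is returned unparsed on purpose: its format depends on the veraPDF version and options, so any interpretation is left to the caller.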

The Python package pypdf can extract the contents of the embedded file attachme.txt:

from pypdf import PdfReader

attached = PdfReader("ex.pdf").attachments  

if len(attached) != 1:
    raise RuntimeError("Expect a single attachment")

for file_name, content_list in attached.items():

    # Each element of content_list is a Python bytes object.
    # Here we decode the bytes to a string.
    content_as_str = content_list[0].decode('utf-8')
    
    print(f"\nThe {file_name} file contents are:\t{content_as_str!r}")

Embedding a spreadsheet

This example produces a PDF/A file with an embedded spreadsheet. The LaTeX file for this example is ex_xlsx.tex (in the minimal folder of the GitHub repository).

Here is the LaTeX source code:

\DocumentMetadata{
    pdfversion=1.7,
    pdfstandard=A-3b,
}
\documentclass{article}

\usepackage{embedfile}
\embedfilesetup{     
      filesystem=URL,
      mimetype=application/vnd.openxmlformats-officedocument.spreadsheetml.sheet,
      afrelationship={/Data},
      stringmethod=escape
}

\begin{document}

You can check that this PDF document complies with the PDF/A-3 standard by using the veraPDF tool. A spreadsheet, \texttt{xl.xlsx}, is embedded in the PDF document. 

\embedfile{xl.xlsx}

\end{document}

Note that we now specify mimetype=application/vnd.openxmlformats-officedocument.spreadsheetml.sheet in the \embedfilesetup command.
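
When reports are generated by a script, the MIME type can be chosen from the file extension. The following sketch builds an \embedfile line with the mimetype given as an optional argument (the form used in the calibration report example later on this page). The lookup table and the function name embedfile_line are hypothetical, and the table is deliberately small; unknown extensions fall back to application/octet-stream, as used for the text file in the first example.

```python
from pathlib import Path

# A deliberately small, illustrative lookup table; extend it as needed.
MIME_TYPES = {
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".csv": "text/csv",
    ".json": "application/json",
    ".xml": "application/xml",
}
DEFAULT_MIME_TYPE = "application/octet-stream"


def embedfile_line(file_name: str) -> str:
    """Return an \\embedfile command with a MIME type chosen from the extension."""
    suffix = Path(file_name).suffix.lower()
    mimetype = MIME_TYPES.get(suffix, DEFAULT_MIME_TYPE)
    # Doubled braces in the f-string produce literal LaTeX braces.
    return f"\\embedfile[mimetype={mimetype}]{{{file_name}}}"


print(embedfile_line("xl.xlsx"))
```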

The embedded spreadsheet is accessible using the spreadsheet package openpyxl:

import io
import openpyxl

from pypdf import PdfReader

attached = PdfReader("ex_xlsx.pdf").attachments  

if len(attached) != 1:
    raise RuntimeError("Expect a single attachment")

for file_name, content_list in attached.items():

    # Each element of content_list is a Python bytes object.
    # Wrap the bytes of the XLSX file in a file-like object.
    xlsx_data = io.BytesIO( content_list[0] )
    
    # Load the Excel workbook from the byte data
    workbook = openpyxl.load_workbook(xlsx_data)
    
    # Access the sheets in the workbook
    sheets = workbook.sheetnames

    # Iterate over each sheet and extract data
    for sheet_name in sheets:
        sheet = workbook[sheet_name]
        print(f"Sheet: {sheet_name}")
        for row in sheet.iter_rows(values_only=True):
            print(row)

XMP metadata

XML metadata can be included in a PDF file using XMP (Extensible Metadata Platform), a format originally developed by Adobe. This is known as XMP metadata.

Here is an example where the title and author are recorded as metadata (a complete list of supported metadata is given in the hyperref package documentation). The LaTeX file for this example is ex_xmp.tex (in the minimal-xmp folder of the GitHub repository).

\DocumentMetadata{
    pdfversion=1.7,
    pdfstandard=A-3b,
}
\documentclass{article}

\usepackage{hyperref}
\hypersetup{
    pdftitle={On a heuristic viewpoint concerning the production and transformation of light},
    pdfauthor={Albert Einstein}
}

\begin{document}
You can check that this PDF document complies with the PDF/A-3b standard by using the \href{https://verapdf.org/}{veraPDF} tool. 

\end{document}

XMP metadata is intended to be read by machines. PDF reader software is often able to display it, but its primary purpose is to allow files to be shared and transferred across products, vendors, and platforms without metadata getting lost. There are Python packages that can access the XMP data. The pypdf package used earlier makes a few elements accessible as Python attributes, as shown below, but access to all of the XMP data is more complicated.

from pypdf import PdfReader

meta = PdfReader("ex_xmp.pdf").xmp_metadata 

print( meta.dc_title['x-default'] )
print( ",".join( meta.dc_creator ) )

Another package that can access XMP data is pikepdf, as shown in this code snippet:

from pikepdf import Pdf

with Pdf.open('ex_xmp.pdf') as pdf:
    with pdf.open_metadata() as meta:
        print( meta['dc:title'] )
        print( ",".join( meta['dc:creator'] ) )
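
Because an XMP packet is XML, its contents can also be read with Python's standard library once the raw packet has been obtained by whatever means. The sketch below collects the Dublin Core (dc:) properties from a packet. The sample packet is hand-written for illustration and was not extracted from ex_xmp.pdf, the function name dublin_core_items is hypothetical, and properties written in attribute shorthand on rdf:Description are not handled.

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Hand-written sample packet for illustration only (not taken from ex_xmp.pdf).
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title>
    <rdf:Alt><rdf:li xml:lang="x-default">On a heuristic viewpoint</rdf:li></rdf:Alt>
   </dc:title>
   <dc:creator>
    <rdf:Seq><rdf:li>Albert Einstein</rdf:li></rdf:Seq>
   </dc:creator>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def dublin_core_items(xmp_xml: str) -> dict[str, list[str]]:
    """Collect the values of every dc: property found in an XMP packet."""
    root = ET.fromstring(xmp_xml)
    items: dict[str, list[str]] = {}
    for description in root.iter(RDF + "Description"):
        for prop in description:
            if not prop.tag.startswith(DC):
                continue
            name = "dc:" + prop.tag[len(DC):]
            # Array-valued properties keep their values in rdf:li elements.
            values = [li.text or "" for li in prop.iter(RDF + "li")]
            # Simple properties carry their value directly as text.
            if not values and prop.text and prop.text.strip():
                values = [prop.text.strip()]
            items.setdefault(name, []).extend(values)
    return items


for name, values in dublin_core_items(SAMPLE_XMP).items():
    print(name, values)
```

The same loop could be extended to other XMP namespaces by adding their URIs, which is one way to reach elements that the convenience attributes do not expose.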

Calibration Reports in LaTeX

The LaTeX approach to creating documents separates layout from content: the style and typesetting of a document are defined separately from its text. This section presents an example of a LaTeX style specification for calibration reports.

It will help to begin by looking at a complete report, so we have an idea of what we are working towards. A PDF/A file created using our style is available here. This file also has an XLSX spreadsheet embedded.

The LaTeX source file for this calibration report is shown below (from the LMI-report-cls folder of the GitHub repository). It contains ‘mark-up’ commands, text, and other content. The appearance is specified in a LaTeX file called LMIReport.cls; the mark-up relates to the structure of the document.

\DocumentMetadata{
    pdfversion=1.7,
    pdfstandard=A-3b,
}
\documentclass[11pt,a4paper]{LMIReport}

\Metrologist{Luke Skywalker}
\ChiefMetrologist{Princess Leia Organa}
\ReportNumber{12345} 
\ReportTitle{A report on the calibration of a type-N male open}
\date{17 May 2035}

\begin{document}	
\section{Description}
The components are from a USC vector network analyser calibration kit model 8599. 

\section{Identification}
The component serial number is 2221X.

\section{Client}
United Spacecraft Corporation, 51 Mare Tranquillitatis, The Moon.

\section{Date of Calibration}
The measurements were performed on February 7\textsuperscript{th}, 2035.

\section{Conditions}
Ambient temperature was maintained within \SI{\pm 1}{\celsius} of \SI{-123}{\celsius}.

\section{Method}
Measurements of the voltage reflection coefficient were made according to procedure LMIT.E.063.005. 

\clearpage    % Anticipate the page break
\section{Results}
 
\subsection{Open (male), SN 54673}

\begin{center} 
    \small	% smaller font size for the table entries

    % Increases the vertical spacing between rows slightly  
    \setlength{\extrarowheight}{3pt}
    %
    % the 'S' array column type will align numbers on the decimal marker.
    % Note 'S[group-minimum-digits=3]' or '\sisetup{group-minimum-digits=3}'
    % would be used to force a space separator every 3 digits (this
    % does not happen by default until there are more than 4 digits)
    \begin{tabular}{SSSSS}
    
        \multicolumn{1}{c}{ frequency } & 
        \multicolumn{2}{c}{ magnitude } &
        \multicolumn{2}{c}{ phase } 
        \\
        % 2nd line 
        \multicolumn{1}{c}{ (/\si{\mega\hertz}) } &  
        \multicolumn{2}{c}{  } &
        \multicolumn{2}{c}{ (/\si{\degree}) } 
        \\
        % 3rd line 
        & $\rho$ & U($\rho$) & $\phi$ & U($\phi$) 
        \\ \hline % Underline the headings

        %%-----------------------------------------------
        45 &   0.9998 &   0.0023$^\dagger$ &    -1.46 &     0.13     \\
        50 &   0.9998 &   0.0023$^\dagger$ &    -1.62 &     0.13     \\
        100 &   0.9999 &   0.0023$^\dagger$ &    -3.27 &     0.13    \\
        300 &   0.9998 &   0.0025 &    -9.80 &     0.14    \\
        500 &   0.9997 &   0.0026 &   -16.34 &     0.15    \\
        1000 &   1.0000 &   0.0032 &   -32.72 &     0.18   \\
        2000 &   0.9994 &   0.0054 &   -65.67 &     0.31  \\
        3000 &    1.000 &    0.011 &   -98.66 &     0.62   \\
        4000 &    0.999 &    0.013 &  -131.74 &     0.78   \\
        5000 &    0.999 &    0.016 &  -164.77 &     0.90   \\
        6000 &    0.998 &    0.017 &  +162.15 &     0.99   \\
        7000 &    0.997 &    0.018 &   +129.0 &      1.1   \\
        8000 &    0.997 &    0.018 &    +95.9 &      1.1   \\
        9000 &    0.996 &    0.018 &    +62.7 &      1.1  \\
        %%-----------------------------------------------
		
    \end{tabular}
        
    \LMICaption{table}{tab1}{%
    Magnitude and phase data, using a linear scale for magnitude and units of degrees for phase. 
    Expanded uncertainties decorated by a $\dagger$ fall outside the scope of accreditation (see Uncertainty section).
    }
	
\end{center}

\section{Uncertainty}
A coverage factor $k=1.96$ was used to calculate the expanded uncertainties $U(\cdot)$ at a level of confidence of approximately \SI{95}{\percent}. 
The number of degrees of freedom associated with each measurement result was large enough to justify this coverage factor.  

Some of the expanded uncertainty values reported fall outside LMI's current scope of accreditation. 
These values are decorated by a $\dagger$ in Table~\ref{tab1}. 
The least expanded uncertainty for a measured magnitude close to unity in the LMI scope of accreditation is currently 0.0024. 

\paragraph{Note:} \referenceGUM	% Standard reference to the GUM

\embedfile[mimetype=application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]{ex_data.xlsx}

\end{document}

LaTeX markup

The first few lines of the report file are just like those used in earlier examples. The main difference is that \documentclass loads the LMIReport class instead of article. In fact, LMIReport specialises article, so many regular LaTeX commands are available, as well as a few special commands for the LMI reporting style.

\DocumentMetadata{
    pdfversion=1.7,
    pdfstandard=A-3b,
}
\documentclass[11pt,a4paper]{LMIReport}

The following few lines are macro commands defined in our style file. These capture information about the report: its title, who did the work, etc. These macros must appear in the file before the main body of the document begins (with \begin{document}).

\Metrologist{Luke Skywalker}
\ChiefMetrologist{Princess Leia Organa}
\ReportNumber{12345} 
\ReportTitle{A report on the calibration of a type-N male open}
\date{17 May 2035}

\begin{document}
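
Because these macros hold all of the report-specific details, the opening lines of a report can be generated by a script, for example from a laboratory database. The sketch below is one hedged way to do this: the function names latex_escape and report_preamble are hypothetical, and only the macro names are taken from the report source shown above.

```python
def latex_escape(text: str) -> str:
    """Escape the characters that are special in ordinary LaTeX text."""
    replacements = {
        "\\": r"\textbackslash{}",
        "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
        "_": r"\_", "{": r"\{", "}": r"\}",
        "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
    }
    # Work character by character so that no replacement is escaped twice.
    return "".join(replacements.get(ch, ch) for ch in text)


def report_preamble(metrologist: str, chief_metrologist: str,
                    report_number: str, title: str, date: str) -> str:
    """Return the lines that precede \\begin{document} in a report file."""
    lines = [
        r"\DocumentMetadata{",
        r"    pdfversion=1.7,",
        r"    pdfstandard=A-3b,",
        r"}",
        r"\documentclass[11pt,a4paper]{LMIReport}",
        "",
        rf"\Metrologist{{{latex_escape(metrologist)}}}",
        rf"\ChiefMetrologist{{{latex_escape(chief_metrologist)}}}",
        rf"\ReportNumber{{{latex_escape(report_number)}}}",
        rf"\ReportTitle{{{latex_escape(title)}}}",
        rf"\date{{{latex_escape(date)}}}",
    ]
    return "\n".join(lines)


print(report_preamble("Luke Skywalker", "Princess Leia Organa", "12345",
                      "A report on the calibration of a type-N male open",
                      "17 May 2035"))
```

Escaping the supplied text matters because values such as client names may contain characters like & or %, which would otherwise break compilation.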

A LaTeX document is usually structured in terms of sections, subsections, etc. Section titles are specified as parameters to the sectioning commands (e.g., \section{Identification}).

Other commands in the file (such as the \SI macro used to display units) are provided by standard LaTeX packages (e.g., the siunitx package), which are loaded by LMIReport.

The report has one table of measurement results. Tables are relatively complicated to construct in LaTeX, so we will describe this one in detail later. Note, however, that this table is not produced using LaTeX’s \begin{table} ... \end{table} environment. That environment would allow LaTeX to place the table in the document automatically. We prefer to retain control, so we created the table in place. The \LMICaption command is needed to add a table caption. This command also allows references to the table to be made elsewhere in the file, by \ref{tab1}.

\LMICaption{table}{tab1}{%
Magnitude and phase data, using a linear scale for magnitude and units of degrees for phase. 
Expanded uncertainties decorated by a $\dagger$ fall outside the scope of accreditation (see Uncertainty section).
}
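
The rows of the results table lend themselves to being generated from measurement data rather than typed by hand. The following sketch formats one row of the five-column table and adds the dagger to an expanded uncertainty of the magnitude that is below the least accredited value of 0.0024 quoted in the report. The function name table_row is hypothetical, and a single threshold is a simplification: a real scope of accreditation will generally depend on frequency and magnitude. Values are passed as strings so that the reported number of digits is preserved.

```python
# Least accredited expanded uncertainty for a magnitude close to unity,
# as quoted in the Uncertainty section of the example report.
LEAST_ACCREDITED_U = 0.0024


def table_row(frequency_mhz: str, rho: str, u_rho: str, phi: str, u_phi: str) -> str:
    """Format one measurement result as a row of the five-column tabular."""
    # Flag magnitude uncertainties that are smaller than the accredited minimum.
    marker = r"$^\dagger$" if float(u_rho) < LEAST_ACCREDITED_U else ""
    cells = [frequency_mhz, rho, u_rho + marker, phi, u_phi]
    return " & ".join(cells) + r" \\"


rows = [
    ("45", "0.9998", "0.0023", "-1.46", "0.13"),
    ("300", "0.9998", "0.0025", "-9.80", "0.14"),
    ("3000", "1.000", "0.011", "-98.66", "0.62"),
]
for row in rows:
    print(table_row(*row))
```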

Near the end of the document we see a short section defined by

\paragraph{Note:} \referenceGUM	% Standard reference to the GUM

\paragraph is one of LaTeX’s low-level sectioning commands. Here, it starts a paragraph headed Note:. The macro \referenceGUM that follows expands, in the final PDF, to text that refers to the GUM and includes a hyperlink.

The report will be produced with a copy of the data embedded as an XLSX file in the final PDF, as shown in an earlier example.

\embedfile[mimetype=application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]{ex_data.xlsx}

\end{document}

LaTeX style development

To explain how the LMIReport style works and how it was developed, we consider the title page and the report body separately.

While refining the style, we work in ordinary LaTeX source files that load the required packages and define the style commands directly. This is convenient when developing or modifying a style, because changes are easily made in the source file being processed. Once we are satisfied, the commands can be transferred to a LaTeX class for general use.

Title style

The title page often has background artwork (logos) produced by graphics professionals. Our title page combines a background page with LaTeX-generated text.

Title style

Body style

Pages in the body of a report have a particular style. Typically, there are page headers and footers, which display information about the report. Page margins are set and the style of section headings is defined. Our body style specifies these sorts of things. It also imports various standard packages that support scientific language and mathematics.

Body style

LaTeX class file

The specifications for the title page and report body can be placed in a LaTeX class file. Doing so allows the style to be kept apart from the report source files in a LaTeX system and reused many times. This effectively separates the content of reports from the way that they are rendered.

Conversion of LaTeX commands in source files to commands for inclusion in a LaTeX class file is straightforward.

LMIReport.cls