Preliminaries
PDF/A-3 file formats can deliver human-readable metrological documents (i.e., PDF reader software can display the document) with digital files embedded. So, calibration laboratories could tailor reports to the needs of their customers using PDF/A documents.
The suggestion to use PDF/A-3 formats for reporting metrological data was first made by METAS. They made a proof-of-concept package available on GitHub, which uses open-source tools, in particular the LaTeX typesetting system, to create PDF/A files. Recent developments, however, have improved LaTeX's own support for generating PDF files.
Since 2020, the team that maintains LaTeX has been working on a multi-year development project to produce tagged and accessible PDF from existing LaTeX source files with no, or only minimal, configuration adjustments. This project has simplified the generation of PDF/A-3 documents. In the longer term, PDF files produced using LaTeX will contain rich, machine-readable semantic content, which is an exciting prospect for digital transformation in metrology.
In these pages, the DXFG is making information and examples available to help DXFG members develop PDF/A reporting capabilities. An associated GitHub repository provides copies of the files referred to here.
Resources
This resource uses the open-source software tools Python and LaTeX. Some information about installing Python and LaTeX is given here.
Independent software is needed to validate PDF/A-3 documents against the official PDF ISO standard. We use the veraPDF Implementation Checker tool to validate documents.
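Validation can also be scripted. The sketch below is not part of the DXFG repository. It assumes that the veraPDF command-line launcher is installed and available on the PATH as verapdf, and that its --flavour option accepts 3b to select the PDF/A-3b profile; check the veraPDF documentation for the version you have installed. The helper names are hypothetical.

```python
import shutil
import subprocess


def verapdf_command(pdf_path, flavour="3b", executable="verapdf"):
    """Build the command line for a veraPDF check (assumed CLI layout)."""
    return [executable, "--flavour", flavour, str(pdf_path)]


def run_verapdf(pdf_path, flavour="3b"):
    """Run veraPDF and return its report text, or None if it is not installed."""
    executable = shutil.which("verapdf")
    if executable is None:
        return None
    result = subprocess.run(
        verapdf_command(pdf_path, flavour, executable),
        capture_output=True,
        text=True,
        check=False,  # read the report rather than relying on the exit code
    )
    return result.stdout
```

Calling run_verapdf("ex.pdf") would return the validation report as text, which can then be inspected or stored alongside the document.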
Files embedded in a PDF/A-3 document can be extracted by hand using any of the popular PDF reader applications or automatically using software. We use the Python package pypdf to extract files in the examples below, but there are other Python packages that may also be considered, such as pikepdf.
Examples
Embedding a text file
Here is an example that produces a PDF/A-3 file with an embedded text file (the files shown are available in the minimal folder of the GitHub repository).
The embedded text file contains just one line:
This file will be attached to a PDF document
A LaTeX source file (ex.tex) is used to create the output, ex.pdf:
\DocumentMetadata{
pdfversion=1.7,
pdfstandard=A-3b,
}
\documentclass{article}
\usepackage{embedfile}
\embedfilesetup{
filesystem=URL,
mimetype=application/octet-stream,
afrelationship={/Data},
stringmethod=escape
}
\begin{document}
You can check that this PDF document complies with the PDF/A-3 standard by using the veraPDF tool.
A text file, \texttt{attachme.txt}, is embedded in the PDF document.
\embedfile{attachme.txt}
\end{document}
The output ex.pdf is generated by pdflatex on the command line, like this:
> pdflatex ex.tex
The \DocumentMetadata command must come before \documentclass in the LaTeX source file.
It activates LaTeX's new PDF management features, which handle the pdfversion and pdfstandard settings.
Files are embedded using the command \embedfile from the embedfile package.
The veraPDF checker can be used to verify that the PDF file produced complies with the PDF/A-3b standard.
The Python package pypdf can extract the contents of the embedded file attachme.txt:
from pypdf import PdfReader

attached = PdfReader("ex.pdf").attachments
if len(attached) != 1:
    raise RuntimeError("Expect a single attachment")
for file_name, content_list in attached.items():
    # Each element of content_list is a bytes object.
    # Here we decode the bytes to a string.
    content_as_str = content_list[0].decode('utf-8')
    print(f"\nThe {file_name} file contents are:\t{content_as_str!r}")
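Often the goal is to save the embedded files rather than print them. The following sketch shows one way this might be done with the standard library, given a mapping from file names to lists of bytes objects like the one used above. The function name is hypothetical. Only the final component of each embedded name is used, as a precaution against names that contain directory parts.

```python
from pathlib import Path


def save_attachments(attachments, out_dir):
    """Write each attachment to out_dir and return the paths written.

    `attachments` maps a file name to a list of bytes objects.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, contents in attachments.items():
        # Keep only the final path component so that an embedded name such
        # as "../x.txt" cannot place a file outside out_dir.
        safe_name = Path(name).name
        for index, data in enumerate(contents):
            # Several attachments may share a name; number the extras.
            target = out_dir / (safe_name if index == 0 else f"{index}_{safe_name}")
            target.write_bytes(data)
            written.append(target)
    return written
```

With pypdf installed, save_attachments(PdfReader("ex.pdf").attachments, "extracted") would write attachme.txt into a folder called extracted.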
Embedding a spreadsheet
This example produces a PDF/A file with an embedded spreadsheet.
The LaTeX file for this example is ex_xlsx.tex (in the minimal folder of the GitHub repository).
Here is the LaTeX source code:
\DocumentMetadata{
pdfversion=1.7,
pdfstandard=A-3b,
}
\documentclass{article}
\usepackage{embedfile}
\embedfilesetup{
filesystem=URL,
mimetype=application/vnd.openxmlformats-officedocument.spreadsheetml.sheet,
afrelationship={/Data},
stringmethod=escape
}
\begin{document}
You can check that this PDF document complies with the PDF/A-3 standard by using the veraPDF tool. A spreadsheet file, \texttt{xl.xlsx}, is embedded in the PDF document.
\embedfile{xl.xlsx}
\end{document}
Note that we now specify mimetype=application/vnd.openxmlformats-officedocument.spreadsheetml.sheet in the \embedfilesetup command.
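Because the mimetype value has to match the kind of file being embedded, it can be convenient to look it up from the file extension. The sketch below is an illustration only: the helper name and the lookup table are hypothetical. It simply builds the \embedfilesetup text used in the two examples, falling back to application/octet-stream for extensions it does not know.

```python
from pathlib import PurePath

# Illustrative lookup table; extend it for other formats as needed.
MIME_TYPES = {
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".csv": "text/csv",
    ".json": "application/json",
    ".xml": "application/xml",
}
DEFAULT_MIME_TYPE = "application/octet-stream"


def embedfile_setup(filename):
    """Return the embedfilesetup block for embedding `filename`."""
    suffix = PurePath(filename).suffix.lower()
    mimetype = MIME_TYPES.get(suffix, DEFAULT_MIME_TYPE)
    lines = [
        r"\embedfilesetup{",
        "filesystem=URL,",
        f"mimetype={mimetype},",
        "afrelationship={/Data},",
        "stringmethod=escape",
        "}",
    ]
    return "\n".join(lines)
```

For instance, print(embedfile_setup("xl.xlsx")) produces the setup block shown in ex_xlsx.tex, which could be pasted into a report source file or written out by a script that generates it.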
The embedded spreadsheet is accessible using the spreadsheet package openpyxl:
import io

import openpyxl
from pypdf import PdfReader

attached = PdfReader("ex_xlsx.pdf").attachments
if len(attached) != 1:
    raise RuntimeError("Expect a single attachment")
for file_name, content_list in attached.items():
    # Each element of content_list is a bytes object;
    # the first one holds the byte data of the XLSX file.
    xlsx_data = io.BytesIO(content_list[0])
    # Load the Excel workbook from the byte data
    workbook = openpyxl.load_workbook(xlsx_data)
    # Iterate over each sheet in the workbook and print its rows
    for sheet_name in workbook.sheetnames:
        sheet = workbook[sheet_name]
        print(f"Sheet: {sheet_name}")
        for row in sheet.iter_rows(values_only=True):
            print(row)
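The example above assumes that the attachment really is a spreadsheet. An XLSX file is a ZIP archive that normally contains a part named xl/workbook.xml, so a cheap check can be made with the standard library before handing the bytes to openpyxl. This is a sketch of one possible safeguard, not part of the original example, and the function name is hypothetical.

```python
import io
import zipfile


def looks_like_xlsx(data):
    """Return True if `data` (bytes) is a ZIP archive holding a workbook part."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        return "xl/workbook.xml" in archive.namelist()
```

In the loop above, a call such as looks_like_xlsx(content_list[0]) could be used to skip, or report, any attachment that is not a workbook.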
XMP metadata
PDF files can carry XML metadata in a format known as XMP (Extensible Metadata Platform), which was developed by Adobe.
Here is an example where the title and author are recorded as metadata (a complete list of supported metadata is given in the hyperref package documentation).
The LaTeX file for this example is ex_xmp.tex (in the minimal-xmp folder of the GitHub repository).
\DocumentMetadata{
pdfversion=1.7,
pdfstandard=A-3b,
}
\documentclass{article}
\usepackage{hyperref}
\hypersetup{
pdftitle={On a heuristic viewpoint concerning the production and transformation of light},
pdfauthor={Albert Einstein}
}
\begin{document}
You can check that this PDF document complies with the PDF/A-3b standard by using the \href{https://verapdf.org/}{veraPDF} tool.
\end{document}
XMP metadata is intended to be read by machines.
PDF reader software is often able to display the metadata, but its primary purpose is to facilitate the sharing and transfer of files across products, vendors, and platforms without the metadata getting lost.
There are Python packages that can access the XMP data.
The pypdf package used earlier makes a few elements accessible as Python attributes, as shown below, but access to all of the XMP data is more complicated.
from pypdf import PdfReader
meta = PdfReader("ex_xmp.pdf").xmp_metadata
print( meta.dc_title['x-default'] )
print( ",".join( meta.dc_creator ) )
Another package that can access XMP data is pikepdf, as shown in this code snippet:
from pikepdf import Pdf

with Pdf.open('ex_xmp.pdf') as pdf:
    with pdf.open_metadata() as meta:
        print( meta['dc:title'] )
        print( ",".join( meta['dc:creator'] ) )
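To give an idea of what lies behind these accessors, an XMP packet is an XML document in which Dublin Core properties such as dc:title and dc:creator are stored as RDF containers. The sketch below parses a hand-written sample packet using only the Python standard library. The sample is an assumption made for illustration and is much shorter than the packet LaTeX actually writes; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written sample packet, for illustration only.
SAMPLE_XMP = """
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">On a heuristic viewpoint</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:creator>
        <rdf:Seq>
          <rdf:li>Albert Einstein</rdf:li>
        </rdf:Seq>
      </dc:creator>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
"""


def read_title_and_creators(xmp_xml):
    """Return ({language: title}, [creator, ...]) from an XMP packet string."""
    root = ET.fromstring(xmp_xml)
    # dc:title is a language-alternative container (rdf:Alt).
    titles = {
        li.get(XML_LANG): li.text
        for li in root.findall(".//dc:title/rdf:Alt/rdf:li", NS)
    }
    # dc:creator is an ordered list (rdf:Seq).
    creators = [li.text for li in root.findall(".//dc:creator/rdf:Seq/rdf:li", NS)]
    return titles, creators
```

The same pattern extends to other properties once their namespace and container type are known, which is where most of the extra complication lies. For a real file, the packet is held in the PDF's metadata stream, and the way to retrieve it depends on the package used.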
Calibration Reports in LaTeX
The LaTeX approach to creating documents separates layout from content: the style and typesetting of a document are defined separately from its text. This section presents an example of a LaTeX style specification for calibration reports.
It will help to begin by looking at a complete report, so we have an idea of what we are working towards.
A PDF/A file created using our style is available here.
This file also has an XLSX spreadsheet embedded.
The LaTeX source file for this calibration report is shown below (from the LMI-report-cls folder of the GitHub repository).
It contains ‘mark-up’ commands, text, and other content.
The appearance is specified in a LaTeX file called LMIReport.cls; the mark-up relates to the structure of the document.
LaTeX markup
The first few lines of the report file are just like those used in earlier examples.
The main difference is that the style file LMIReport is loaded by \documentclass instead of article.
In fact, LMIReport specialises article, so many regular LaTeX commands are available, as well as a few special commands for the LMI reporting style.
The following few lines are macro commands defined in our style file. These capture information about the report: its title, who did the work, etc.
These macros must appear in the file before the main body of the document begins (with \begin{document}).
A LaTeX document is usually structured in terms of sections, subsections, etc. Section titles are specified as parameters to the sectioning commands (e.g., \section{Identification}).
Other commands in the file (such as the \SI macro used to display units) are provided by standard LaTeX packages (e.g., the siunitx package), which are loaded by LMIReport.
The report has one table of measurement results.
Tables are relatively complicated to construct in LaTeX, so we will describe this one in detail later.
Note, however, that this table is not produced using LaTeX’s \begin{table} ... \end{table} environment.
That environment would allow LaTeX to place the table in the document automatically.
We prefer to retain control, so we created the table in place.
The \LMICaption command is needed to add a table caption.
This command also allows references to the table to be made elsewhere in the file, by \ref{fig1}.
Near the end of the document we see a short section defined by \paragraph, which is one of LaTeX’s low-level sectioning commands. Here, it starts a section beginning Note:.
The macro \referenceGUM that follows expands, in the final PDF, to text that refers to the GUM, including a hyperlink.
The report will be produced with a copy of the data embedded as an XLSX file in the final PDF, as shown in an earlier example.
LaTeX style development
To explain how the LMIReport style works and how it was developed, we consider the title page and the report body separately.
While refining the style commands, we work in ordinary source files, in which the LaTeX packages are loaded and the commands are defined. This is convenient when developing or modifying a style, because changes are made easily in the source file being processed. Once we are satisfied, the commands can be transferred to a LaTeX class for general use.
Title style
The title page often has background artwork (logos) produced by graphics professionals. Our title page combines a background page with LaTeX-generated text.
Body style
Pages in the body of a report have a particular style. Typically, there are page headers and footers, which display information about the report. Page margins are set and the style of section headings is defined. Our body style specifies these sorts of things. It also imports various standard packages that support scientific language and mathematics.
LaTeX class file
The specifications for the title page and report body can be placed in a LaTeX class file. Doing so allows the style to be kept apart from the report source files in a LaTeX system and reused many times. This effectively separates the content of reports from the way that they are rendered.
Conversion of LaTeX commands in source files to commands for inclusion in a LaTeX class file is straightforward.