Improved translation po file handling by ditching gettext autotools integration

Posted: November 29th, 2018 | Filed under: Coding Tips, Fedora, libvirt, Virt Tools

The libvirt library has long provided translations of its end-user-facing strings, which largely means error messages and console output from command line tools / daemons. Since libvirt uses autotools for its build system, it naturally used the standard automake integration provided by gettext for handling .po files. The libvirt.pot file with master strings is exported to Zanata, where the actual translation work is outsourced to the Fedora translation team, who support up to ~100 languages. At the time of writing, libvirt has some level of translation in ~45 languages.

With use of Zanata, libvirt must periodically create an updated libvirt.pot file and push it to Zanata, and then just before release it must pull the latest translated .po files back into GIT for release.

There have been a number of problems with this approach which have been annoying us pretty much since the start, and earlier this year it finally became too much to bear any longer.

  • The per-language translation files stored in git contain source file name and line number annotations to indicate where each translatable string originates. Since the translation files are not re-generated on every single source file change, the file location annotations become increasingly out of date after every commit. When the translation files are updated, 98% of the diff is simply changing source file locations, leading to a very poor signal/noise ratio.
  • The strings in the per-language translation files are sorted according to source filename. Thus when code is moved between files, or when files are renamed, the strings in the updated translation files all get needlessly reordered, again leading to a poor signal/noise ratio in diffs.
  • Each language translation file contains every translatable string even those which do not have any translation yet. This makes sense if translators are working directly against the .po files, but in libvirt everything is done via the Zanata UI which already knows the list of untranslated strings.
  • The per-language translation files grow in size over time with previously used message strings appended to the end of the file, never discarded by the gettext tools. This again makes sense if translators are working directly against .po files, but Zanata already provides a full translation memory containing historically used strings.
  • Whenever ‘make dist’ is run the gettext autotools integration will regenerate the per-language translation files. As a result of the three previous points, every time a release is made there’s a giant commit more than 100MB in size that contains diffs for translated files which are entirely noise and no signal.

One suggested approach to deal with this is to stop storing translations in GIT at all and simply export them from Zanata only at time of ‘make dist’. The concern with this approach is that the GIT repository would no longer contain the full source for the project in a self-contained manner, and ‘make dist’ would need a live network connection to the Zanata servers. If we were to replace Zanata with a new tool in the future (Zanata is already a replacement for the previously used Transifex), we would potentially lose access to translations for old releases.

With this in mind we decided to optimize the way translations are managed in GIT.

The first easy win was to simply remove the master libvirt.pot file from GIT entirely. This file is auto-generated from the source files and is out of date the moment any source file changes, so no one would ever want to use the stored copy.

The second, more complex step was to minimize and canonicalize the per-language translation files. msgmerge is used to take the full .po file, strip out the source file locations, and sort the strings alphabetically. A perl script then processes the content further, dropping any translations marked as “fuzzy” and any strings for which there is no translated text available. The resulting output still uses the normal .po file format, but we call these ‘.mini.po’ files to indicate that they are stripped down compared to what you’d normally expect to see.
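To make the minimization concrete, here is a minimal Python sketch of the same idea. It is not the msgmerge and perl pipeline that libvirt actually uses; the function names are hypothetical, and it assumes that entries are separated by blank lines (as in msgmerge output) and that sorting on the raw, still-escaped msgid is good enough for illustration.

```python
import re


def _msgid(lines):
    """Return the raw (still escaped) msgid, joining continuation lines."""
    parts, active = [], False
    for line in lines:
        if line.startswith("msgid "):
            active = True
            parts.append(line[len("msgid "):].strip()[1:-1])
        elif active and line.startswith('"'):
            parts.append(line.strip()[1:-1])
        else:
            active = False
    return "".join(parts)


def _is_fuzzy(lines):
    """True if a '#,' flag line lists the 'fuzzy' flag."""
    for line in lines:
        if line.startswith("#,"):
            if "fuzzy" in [flag.strip() for flag in line[2:].split(",")]:
                return True
    return False


def _has_translation(lines):
    """True if any msgstr (or msgstr[N]) carries non-empty text."""
    in_msgstr = False
    for line in lines:
        if line.startswith("msgstr"):
            in_msgstr = True
            chunk = line.split(" ", 1)[1] if " " in line else '""'
        elif in_msgstr and line.startswith('"'):
            chunk = line
        else:
            in_msgstr = False
            continue
        if chunk.strip() != '""':
            return True
    return False


def minimize_po(text):
    """Hypothetical helper: reduce full .po text to a '.mini.po' style form."""
    header, entries = None, []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        # Blocks made only of comments (including obsolete "#~" entries) go.
        if not [l for l in lines if not l.startswith("#")]:
            continue
        # Source locations ("#:") are the main source of diff noise.
        cleaned = [l for l in lines if not l.startswith("#:")]
        msgid = _msgid(lines)
        if msgid == "":
            header = cleaned  # the header entry is always kept
            continue
        if _is_fuzzy(lines) or not _has_translation(lines):
            continue
        entries.append((msgid, cleaned))
    entries.sort(key=lambda item: item[0])  # canonical, location-free order
    blocks = ([header] if header else []) + [e for _, e in entries]
    return "\n\n".join("\n".join(b) for b in blocks) + "\n"
```

Because the output no longer depends on where a string lives in the source tree, moving code between files leaves the minimized file byte-for-byte unchanged.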

The final step was to remove the gettext / autotools integration and write a custom Makefile.am to handle the key tasks.

  • A target ‘update-mini-po’ to automate the process of converting full .po files into .mini.po files. This is used when pulling down new translations from Zanata to be stored in git before release.
  • A target ‘update-po’ to automate the inverse process of converting .mini.po files back into full .po files. This is to be used by anyone who might need to look at full language translations outside of Zanata.
  • An install hook to generate the binary .gmo files from the .mini.po files and install them into /usr/share/locale for use at runtime. This avoids the need to ship the full .po files in release tarballs.
  • A target ‘zanata-push’ to automate the process of re-generating the libvirt.pot file and uploading it to Zanata.
  • A target ‘zanata-pull’ to automate the process of pulling new translations down from Zanata and then triggering ‘update-mini-po’.
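As a rough illustration of what the install hook has to do, the Python sketch below plans one msgfmt invocation per .mini.po file, along with the path where the compiled catalog would be installed. This is not libvirt's Makefile.am logic: the function name, the ‘po/’ directory, and the ‘libvirt’ domain name are assumptions made for the example, and the sketch only builds the commands rather than running them.

```python
from pathlib import PurePosixPath

SUFFIX = ".mini.po"


def plan_gmo_install(po_files, domain="libvirt", localedir="/usr/share/locale"):
    """Hypothetical helper: return (msgfmt argv, install path) per .mini.po file."""
    plan = []
    for name in po_files:
        po = PurePosixPath(name)
        if not po.name.endswith(SUFFIX):
            raise ValueError(f"not a {SUFFIX} file: {name}")
        lang = po.name[: -len(SUFFIX)]
        # msgfmt compiles a textual .po catalog into the binary format.
        gmo = po.with_name(lang + ".gmo")
        argv = ["msgfmt", "-o", str(gmo), str(po)]
        # Standard gettext layout: <localedir>/<lang>/LC_MESSAGES/<domain>.mo
        dest = PurePosixPath(localedir) / lang / "LC_MESSAGES" / f"{domain}.mo"
        plan.append((argv, str(dest)))
    return plan
```

Each argv list could be passed to subprocess.run() and the resulting .gmo file copied to its install path. Since msgfmt reads ordinary .po syntax, the stripped-down files work as input without first being expanded back into full .po files.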

After all this work was completed the key benefits are

  • The size of content stored in GIT was reduced from ~100MB to ~18MB.
  • Updates to the translations in GIT now produce small diffstats with a high signal/noise ratio
  • Files stored in GIT are never changed as a side effect of build system commands like ‘make dist’
  • The autotools integration is easier to understand

while not having any visible effect on the translators using Zanata. In the event anyone does need to see full language translations outside of Zanata, there is an extra step to generate the full .po files from the .mini.po files, but this is countered by the fact that the result will be fully up to date with respect to translatable strings and source file locations.

I’d encourage any project which is using gettext autotools integration, while also outsourcing to a system like Zanata, to consider whether they’d benefit from taking similar steps to libvirt. Not all projects will get the same degree of space saving but diffstats with good signal/noise ratios and removing side effects from ‘make dist’ are wins that are likely desirable for any project.