Introduction

This document is work-in-progress.

I never reviewed anything under the .github directory and those files are skipped here as well.

Changes to translations were checked separately, not per commit basis.

While the commits aren’t signed, I didn’t spot any signs of committer fraud.

v5.2, v5.4, and v5.6

Cherry-picks to v5.2, v5.4, and v5.6 are OK. This means that (apart from translations) it’s enough to review the master branch. If master is fine then the stable branches are too.

Translations

In short, the translation files are OK.

Translations in xz.git

This section applies to the xz.git tags v5.6.1, v5.4.6, and v5.2.12. The master branch had the same .po files as v5.6.1.

The .po files for both program (in the po directory) and man page translations (po4a) are OK. There are differences to the files from the Translation Project but the reasons for the differences are clear.

For example, it has been common that white space changes are needed to fix alignment or line length problems. In most cases these have been discussed with the translators. However, such white space changes might not be in the Translation Project for a few reasons:

  • The translator replied via email that the proposed changes are OK but never uploaded the new file to the Translation Project.

  • The translator or the translation team couldn’t be reached. For simple white space changes it seemed better to include a modified file instead of omitting the the translation completely.

Translations in tarballs

This section applies to XZ Utils 5.2.12, 5.4.6, and 5.6.1 tarballs and the matching xz.git tags.

When creating a distribution tarball with make dist (in a generic Autotools-based project) or the make mydist wrapper target in XZ Utils:

  1. The .po files get updated to match the current source code. (This can be disabled by setting DIST_DEPENDS_ON_UPDATE_PO = no in po/Makevars.)

  2. From the updated .po files, matching .gmo binary files are generated and included in the tarball. This way the msgfmt tool isn’t needed when the package is built.

I updated the .po files in the matching xz.git tags using make -C po update-po and compared them against the .po files in the tarballs. There are no differences except POT-Creation-Date. Thus the .po files are OK.

Recreating the .gmo files from the .po files produced identical files as in the tarballs. To reproduce the .gmo files in 5.2.12 and 5.4.6 tarballs using GNU gettext 0.22 or later, the option --no-redundancy needs to be passed on the msgfmt command line.

Timezones

There has been discussion about commit timezones and if those could be giving clues about something. There are mundane explanations for at least most of these though:

  • Putting a commit in other person’s name with permission.

  • Use of git rebase --reset-author-date.

  • Possibly use of git rebase --committer-date-is-author-date.

Of these I never did the last one but I don’t know if he did.

Commits

Trivial commits like updates to THANKS or typo fixes in comments aren’t mentioned here.

The months refer to commit date, not author date.

2023-01

OK with my style fixes (Tests: Adjust style in test_compress.sh.).


OK. Also, the VS project files aren’t present anymore.


OK.


OK.


I made a bunch of cleanups to this a few days later, including fixing a memory leak (Tests: test_index_hash: Fix a memory leak.). It’s OK enough now.


OK.


He wanted to use liblzma’s internal headers in some tests and eventually I gave in. This required making those particular internal headers standalone in sense that they don’t include other internal headers.


OK enough.

The second commit was combined from my edits squashed into a single commit. The commit message was clearly edited by him, collecting the messages from my separate commits. For example, the way the bullet points are formatted doesn’t match the way I would have done them.


OK.


These are OK together.

When all filters are disabled, the array initialization becomes { } which isn’t valid C99. This was fixed in Tests: Fix C99/C11 compatibility when features are disabled.


OK.


OK.


OK.


I didn’t like this and reverted it a few days later. See Revert "tuklib_common: Define __has_warning if it is not defined." for the reasons.



OK.


It has quoting typos in the messages in configure.ac. These were fixed in Build: Avoid different quoting style in --enable-doxygen doc.. That code isn’t present anymore anyway.


Likely OK. The Doxyfile was updated later again anyway.


OK.


The intent is good, fixing a very minor issue in a rarely-seen corner case. It had waited some time on a branch until I gave the permission to merge it. This introduced a regression that was fixed in 2023-11.


The fix is good but commit message goes beyond overthinking. The doc simply was incorrect and that’s it.


2023-02

OK. These are the first commits that update the API docs.



OK.


OK.


OK.


OK (documentation updates).

Some cleanups were made in liblzma: Very minor API doc tweaks..


OK (more documentation updates).


OK.


OK (even more documentation updates).


OK (documentation updates continue still).

The return value updates were good to do:

  • lzma_properties_encode should only return LZMA_OK or LZMA_PROG_ERROR. LZMA_OPTIONS_ERROR shouldn’t be possible although it was listed before as a possible value too.

  • lzma_filter_flags_decode can return also LZMA_DATA_ERROR.


OK.


2023-03

OK.


This series has intermediate steps that I didn’t like. The end result is good though (the last two commits are by me).

This series is identical to the two following commits in v5.4:


With the tweaks that were made later, this is OK enough.


I found and still find these useless additions to the API. He kept insisting on them so eventually I gave in. There’s nothing technically wrong, I just think they don’t improve readability.


tuktest.h had made these obsolete.


This is one of the supposedly suspicous commits due to the timezone. In reality he put it in my name by common agreement because I had done a significant portion of it.


OK. These will look different after the 5.6.1 release because pre-generated Doxygen output won’t be included in distribution tarballs anymore (not only to reduce the number of generated files but also to simplify the license questions of the source tarball).


OK.


OK.


OK.


OK.


OK.


OK.


OK.


OK.


OK.


2023-05

The error got to the master branch because my feedback in the chat came slightly too late. Thus the second commit reverted it. The commit message in the second one likely was supposed to say “if the integer size was not 32 bits”.


OK. Needed trivial white space cleanup: liblzma: Clean up white space.


OK.


OK.


OK.


2023-06

This is from PR #51. The commit message seems slightly wrong as it speaks about ZLIB::ZLIB but otherwise everything is fine.


Some genuine fixes were needed later but because ifunc support is now gone from the package, these commits aren’t relevant anymore.


These are my commits that he merged. Since ifunc support is now gone from the master branch, these commits aren’t relevant anymore.



These together are good. In v5.4 these were squashed into liblzma: Prevent uninitialzed warning in mt stream encoder..


OK. An inline function could have been better than a macro.


OK.


2023-07

OK. I think the CI setup with various build configs helped to spot this.


OK.


OK.


OK.

--powerpc had been abbreviated to --power by accident in the old version but it still worked correctly because getopt_long accepts unambiguous abbreviations.


OK. str_to_filter was renamed to more correct plural str_to_filters two commits later.

This commit begins the series of commits that add the --filters1 …​ --filters9 options.


OK.


It’s OK as an intermediate step. The next commits improve things.


OK.


OK.


OK.

I restored the { } to one if statement. I like to keep the two if-statements in the nested form instead of using short-circuiting (&&) to make the function call conditional.


This does what it is supposed to do but it needed cleaning up:

  • filters_memusage_max() sets the static array filter_memusages but it’s better to make that array a local variable.

  • The second half is more complex than it needs to be. Splitting the dictionary size scaling into three steps makes it hard to read.

    • orig_dict_size should be uint32_t, not uint64_t.

    • The while (count > 0) loop and the need for the count variable are an odd way to do this. It adjusts the chains in parallel instead of handling one chain at a time.

    • if (filt_mem_usage < memory_limit) should have had <=. In practice it didn’t matter though as the values are in bytes.

I edited this code quite a bit in 2024-05, including the fixes and cleanups for the things listed above. It might still be a bit heavy to read but it’s not due to the 2023 commits.


OK. Using multiple filter chains with --flush-timeout isn’t a likely use case but this isn’t much extra code either.


This had some issues which were fixed in xz: Minor clean up for coder.c.

The last comment was odd, maybe a leftover from an earlier version. It was fixed in xz: Omit an incorrect comment.

However, it’s better to collect the required information in parse_block_list() because it avoids the need to loop through the opt_block_list array. I did that change in 2024-05, thus not a lot remains from this commit.

This is the last commit to the --filters1 …​ --filters9 series.


OK


OK.


OK.


OK.


OK.


OK. As the commit message says, it only reorders lines. The following can be helpful:

git log -p f5dc172a402f^! | sed 's/^-/+/' | grep ^+ \
    | sort | uniq -u

OK.


OK.


OK


OK.


OK.

Tip
Never tell how many things you are going to list.

OK. We had been asked how to cross-compile on one machine but then run the tests on the target machine, so it was good to document it.


OK.


OK.


OK but it’s gone from the master now.


OK.


OK. It makes test_files.sh use exit status 77 (skipped test) when the feature isn’t configured instead of exit status 0 (passed test).


OK.


The special cases on Windows were researched and commented well but the end result is more complex than it needs to be. Instead of creating the symlinks in the build directory and then installing (copying) them, it’s simpler to create the symlinks at install time directly in the target directory. This was simplified in CMake: Simplify symlink creation and install translated man pages. which itself is a messy commit though as it did multiple things.


OK.


2023-08

OK.


OK.


OK.


This was pondered for a while and determined that in practice the only problem was an assertion failure (but assertions rarely are enabled for production code). A proper fix was good to do still.


The diff might appear misleading. The intent is to move the comment but diff shows it as a move of the lzma_index_end call.


2023-09

OK.


OK.


These are OK, including getopt.m4 which was simplified a lot. xz doesn’t need support for all corner cases of getopt_long so checking for those corner cases isn’t needed either.

The GNU getopt_long isn’t used on GNU/Linux, *BSDs, Solaris, or MinGW-w64.


This was an overdue change. It’s also done properly (I verified it back then as well). It missed one change around in CLOCK_MONOTONIC but that turned out to be mildly convenient as it avoided a simple merge conflict with Build: Check for clock_gettime() even if not using POSIX threads..

To verify these commits, the following can be helpful:

git diff ce162db07f03^..9fb5de41f2fb \
    | sed "/^-/{s/\`/'/g;s/^-/+/}" \
    | grep ^+ | sort | uniq -u

2023-10

It matches the file in Gnulib.


The x86 CRC32 CLMUL commits begin from liblzma: Rename crc_macros.h to crc_common.h. and there is a lot of code churn as the code was moved around more than once.

I made further changes that moved the code. He didn’t like them at first but he changed his mind in January 2024. I merged my changes on 2024-01-11, the big one being liblzma: Avoid extern lzma_crc32_clmul() and lzma_crc64_clmul()..

The final version is fine.


This commit is a continuation to my commit liblzma: Mark crc64_clmul() with __attribute__((__no_sanitize_address__)). where I added no_sanitize_address the first time. It just was needed in more places with some compiler versions when the code was rearranged.

This attribute is obviously scary but it is unfortunately required with this version of the x86 SIMD code. The code makes aligned 16-byte reads which may read up to 15 bytes before the beginning or past the end of the buffer if the buffer is misaligned. The unneeded bytes are then ignored. It cannot cross page boundaries and thus cannot cause access violations.

The correctness is easy to review because memory is read only in a few places. If you do this, I suggest looking at the latest code in the master branch as that is the code that is actually in use now.

However, it also trips memory sanitizer which is a different thing than address sanitizer. Instead of adding another attribute to disable it, this code will be changed so that such attributes aren’t needed. Such a change won’t be backported to 5.4.x though because the current code does work correctly.


OK.


OK.


OK.


2023-11

OK.


In v5.4 these were squashed into these commits:
xz: Fix suffix check.
and
Tests: Create test_suffix.sh.
and
Tests: Fix typo in a comment.

It’s simplest to review this as a whole. It was a bit hilarious series of commits in the master branch as I had spotted a regression that was introduced in xz: Refactor duplicated check for custom suffix when using --format=raw. We discussed it and gave ideas about the fix and then it turned out to be a bit wrong and needing another fix until he thought it’s better to finish it another day. Then the test script was added as well.


These are good although I clarified a comment in the next commit. It fixed a very old bug in Windows build of xz.


Ifunc detection was causing issues with musl. This seems fine but ifunc support was removed in liblzma: Remove ifunc support. so this doesn’t matter anymore anyway.


It really is just space change, no dots.


2023-12

These are from PR #73 and have been merged correctly with cleanups to commit messages and white space usage.


OK. Needed one minor comment fix.


OK.


The URL is correct.


OK. options definitely must not be NULL, and in practice it never is because it would be a bug in the application too.


OK, just like the commit message says.


This together with the next commit liblzma: Initialize lzma_lz_encoder pointers with NULL. were squashed into liblzma: Set all values in lzma_lz_encoder to NULL after allocation. in the v5.4 branch for the 5.4.6 release. Together these commits only add two new lines to add missing initializers.

The bug isn’t as serious as the commit message makes it sound as no one has a reason to call lzma_filters_update on a LZMA1 encoder, a function that is very rarely used anyway.


The order of the macros in the #if directive is different from src/xz/sandbox.h but the directive is correct still.

The Capsicum code is slightly simpler and stricter than in xz because xzdec needs to do less. Silently running without sandbox when kernel support is missing (ENOSYS) is an intentional feature and commonly accepted practice: without this, on such kernels xz wouldn’t run at all.


OK.


This moves code around and edits it a little. Comparing the moved sections shows it’s fine including the small edits. No suspicious typos (like mispelling SANDBOX_COMPILE_DEFINITION) are apparent.


2024-01

The riscv.c file in this commit was almost solely written by me and matched the file I had outside of the Git tree. Jia had some minor work on it like adding macros to avoid repeated code and a few comment improvements. Those I reviewed and edited further.

The rest of the commit was by Jia. Those changes are good.


OK.


This is the first commit preparing for the backdoor.


It’s very similar to the ARM64 code below the new code.


I did author these.


This is fine although now that the backdoor has been removed, this commit alone doesn’t do anything. But I left it there so that it’s ready if proper files along with a generator program are added.


I had mentioned the dictionary size and that gave a good excuse to update something in the backdoor code.


OK.


There is some churn in these commits so it’s simplest to review only the final outcome.

FreeBSD case lacked error checking which was fixed in liblzma: ARM64 CRC32: Add error checking to FreeBSD-specific code.

The check for the CRC32 table omission macro had an odd typo which was fixed in liblzma: ARM64 CRC: Fix omission of CRC32 table.


2024-02

OK.


OK.


These are good and were created at my request. They are big but it’s hard to split them into smaller pieces. The original versions of these are from 2023-04-24. I made small edits but it was agreed that I would commit these in his name.

Since these commits are complex, it’s understandable if they look scary. The new code does input buffer bounds checking only once per loop iteration, checking that there is at least 21 bytes available. At first this may look insufficient because, in the worst case, rc_normalize may be executed up to 48 times (22 bit model + 26 direct) per loop iteration. Each rc_normalize macro will read one input byte if needed. However, the condition to read a new byte cannot be true everytime. This has been verified both with math and experimentally. Comments in the code about this could be improved.

The small edits I made included fixing an off-by-one error with the loop condition. The commit originally had 20 but I changed it to 21 because the final normalization after the end of payload marker (EOPM) was included in the non-resumable path. The next commit by me (liblzma: LZMA decoder improvements.) made the non-resumable variant jump to the safe code (goto eopm) instead and thus 21 was changed to 20.

XZ Embedded has had somewhat comparable code with the per-loop 20-byte check since the beginning (2009) and it’s been in the Linux kernel since 2010.

As most of the core compression code in XZ Utils, the original idea for this detail comes from LZMA SDK. It’s just fairly different-looking code.


Modified build-to-host.m4 had the backdoor trigger. Ignoring the file here is correct because it is added among several other .m4 files when running ./autogen.sh or autoreconf -fi.


OK.


This is correct.


I didn’t like the extra macro that had been added solely for this test. The last commit does more than reverting a single commit but the change is OK.


This test assumes that the encoder output doesn’t change, that is, the test will fail if the encoder is modified too much. If the encoder is modified, updating the test to match isn’t much extra to do.


Backdoor files.

Note that tests/test_files.sh uses globs to pick the files. So just adding files means that a decompression test will be done with them.


We discussed this and I agreed to it. This way man page translations didn’t need to go via translators in the last days of the release rush to fix a typo in the English source text.


The symbol name was supposed to be XZ_5.6 not XZ_5.6.0. It makes no difference in practice as it is merely a string.


OK. He liked to run codespell while I didn’t.


He fixed it because I noticed it. Clearly the test file had been written some time ago already.


OK.


This has a . at the beginning of a line in CMakeLists.txt that sabotages the Landlock detection. This was fixed in CMake: Fix sabotaged Landlock sandbox check..


The need to add this fix had been discussed.


This was discussed in length with multiple people. The commit matches what was decided.


2024-03

This is a real bug but the bug number in the commit message lacks one digit.


It’s one of the backdoor commits.


More backdoor commits.


This is a continuation for xz: Add missing RISC-V on the filter list in the man page. Almost always it’s not good to touch the translations without upstream approval. In this case I felt mixed as I wondered if the date translations would be correct but didn’t want to stop it so that the translation could be in the 5.6.1 release that was going to be made the same day.

When translations are sent to the translators, they will remake these changes anyway (they only pick the original English text from the tarball).


I got the impression that a lot of speculation happened around this commit.

I had asked Jia to simplify the GitHub PR and Issue templates already, and simplifying SECURITY.md seemed reasonable too. People reporting such bugs don’t need detailed handholding.

I felt 21–30 days would be appropriate but Jia wanted to keep 90. With backdoor like “bugs”, fast full disclosure is the only correct way though.