Class SeekableXZInputStream

All Implemented Interfaces:
Closeable, AutoCloseable

public class SeekableXZInputStream
extends SeekableInputStream
Decompresses a .xz file in random access mode. This supports decompressing concatenated .xz files.

Each .xz file consists of one or more Streams. Each Stream consists of zero or more Blocks. Each Stream contains an Index of the Stream's Blocks. The Indexes from all Streams are loaded in RAM by a constructor of this class. A typical .xz file has only one Stream, and parsing its Index will need only three or four seeks.

To make random access possible, the data in a .xz file must be split into multiple Blocks of reasonable size. Decompression can only start at a Block boundary. When seeking to an uncompressed position that is not at a Block boundary, decompression starts at the beginning of the Block and throws away data until the target position is reached. Thus, smaller Blocks mean faster seeks to arbitrary uncompressed positions. On the other hand, smaller Blocks mean worse compression. So one has to make a compromise between random access speed and compression ratio.
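The seek cost described above can be illustrated with plain arithmetic. The following is a minimal sketch that assumes, purely for illustration, that every Block holds the same number of uncompressed bytes; the class and method names are hypothetical and are not part of this library.

```java
final class SeekCostSketch {
    private SeekCostSketch() {}

    /**
     * Returns how many uncompressed bytes have to be decompressed and
     * thrown away before the byte at pos can be returned, assuming
     * (hypothetically) that every Block holds blockSize uncompressed bytes.
     */
    static long discardedBytes(long pos, long blockSize) {
        if (pos < 0 || blockSize <= 0)
            throw new IllegalArgumentException("Invalid arguments");

        // Decompression starts at the Block boundary at or before pos.
        long blockStart = (pos / blockSize) * blockSize;
        return pos - blockStart;
    }

    public static void main(String[] args) {
        long pos = 10_000_000L;
        System.out.println(discardedBytes(pos, 1L << 20)); // 1 MiB Blocks
        System.out.println(discardedBytes(pos, 8L << 20)); // 8 MiB Blocks
    }
}
```

With these assumed sizes, the same target position costs roughly three times as much discarded data with 8 MiB Blocks as with 1 MiB Blocks, which is the trade-off against compression ratio.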

Implementation note: This class uses linear search to locate the correct Stream from the data structures in RAM. It was the simplest to implement and should be fine as long as there aren't too many Streams. The correct Block inside a Stream is located using binary search and thus is fast even with a huge number of Blocks.
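The binary search mentioned in the note can be sketched as follows. This is not the library's internal code: the class name, the blockStarts array (uncompressed start offset of each Block, ascending, first entry 0), and totalSize are assumptions made for the example.

```java
final class BlockLookupSketch {
    private BlockLookupSketch() {}

    /**
     * Returns the index of the Block that contains the byte at the
     * uncompressed position pos.
     *
     * blockStarts[i] is the (hypothetical) uncompressed start offset of
     * Block i, sorted in ascending order with blockStarts[0] == 0.
     * totalSize is the total uncompressed size of all Blocks.
     */
    static int findBlock(long[] blockStarts, long totalSize, long pos) {
        if (pos < 0 || pos >= totalSize)
            throw new IndexOutOfBoundsException("Invalid position: " + pos);

        int left = 0;
        int right = blockStarts.length - 1;

        // Invariant: blockStarts[left] <= pos, and every Block after
        // index right starts beyond pos.
        while (left < right) {
            // Round up so that "left = mid" always makes progress.
            int mid = left + (right - left + 1) / 2;

            if (blockStarts[mid] <= pos)
                left = mid;
            else
                right = mid - 1;
        }

        return left;
    }
}
```

Each lookup needs about log2(n) comparisons for n Blocks, which is why a large Block count does not slow down locating a Block.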

Memory usage

The amount of memory needed for the Indexes is taken into account when checking the memory usage limit. Each Stream is calculated to need at least 1 KiB of memory and each Block 16 bytes of memory, rounded up to the next kibibyte. So unless the file has a huge number of Streams or Blocks, these don't take a significant amount of memory.
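As a rough illustration, the rule above can be turned into arithmetic. The sketch below follows one reading of that rule (1 KiB base per Stream, plus 16 bytes per Block with the Block part rounded up to whole KiB per Stream); the library's exact accounting may differ, so getIndexMemoryUsage() remains the authoritative figure. The class and method names are hypothetical.

```java
final class IndexMemorySketch {
    private IndexMemorySketch() {}

    /**
     * Estimates the Index memory usage in KiB.
     *
     * blocksPerStream[i] is the number of Blocks in Stream i.
     * Assumption: each Stream costs 1 KiB plus 16 bytes per Block,
     * with the per-Block part rounded up to the next full KiB.
     */
    static int estimateKiB(int[] blocksPerStream) {
        long total = 0;

        for (int blocks : blocksPerStream) {
            if (blocks < 0)
                throw new IllegalArgumentException("Negative Block count");

            // Adding 1023 before dividing rounds up to whole KiB.
            total += 1 + (16L * blocks + 1023) / 1024;
        }

        return Math.toIntExact(total);
    }
}
```

Under this reading, a single Stream with 1000 Blocks is estimated at 17 KiB, which shows why the Indexes are rarely a significant part of the memory budget.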

Creating random-accessible .xz files

When using XZOutputStream, a new Block can be started by calling its endBlock method. If you know that the decompressor will only need to seek to certain uncompressed positions, it can be a good idea to start a new Block at (some of) these positions (and only at these positions, to get a better compression ratio).

liblzma in XZ Utils supports starting a new Block with LZMA_FULL_FLUSH. XZ Utils 5.1.1alpha added threaded compression which creates multi-Block .xz files. XZ Utils 5.1.1alpha also added the option --block-size=SIZE to the xz command line tool. XZ Utils 5.1.2alpha added a partial implementation of --block-list=SIZES which allows specifying sizes of individual Blocks.

Example: getting the uncompressed size of a .xz file

 String filename = "foo.xz";
 SeekableFileInputStream seekableFile
         = new SeekableFileInputStream(filename);

 try {
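     // The constructor loads the Indexes, so the uncompressed size
     // is known without decompressing any Blocks.
     SeekableXZInputStream seekableXZ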
             = new SeekableXZInputStream(seekableFile);
     System.out.println("Uncompressed size: " + seekableXZ.length());
 } finally {
     seekableFile.close();
 }
 
See Also:
SeekableFileInputStream, XZInputStream, XZOutputStream
  • Constructor Details

    • SeekableXZInputStream

      public SeekableXZInputStream​(SeekableInputStream in) throws IOException
      Creates a new seekable XZ decompressor without a memory usage limit.
      Parameters:
      in - seekable input stream containing one or more XZ Streams; the whole input stream is used
      Throws:
      XZFormatException - input is not in the XZ format
      CorruptedInputException - XZ data is corrupt or truncated
      UnsupportedOptionsException - XZ headers seem valid but they specify options not supported by this implementation
      EOFException - less than 6 bytes of input was available from in, or (unlikely) the size of the underlying stream got smaller while this was reading from it
      IOException - may be thrown by in
    • SeekableXZInputStream

      public SeekableXZInputStream​(SeekableInputStream in, ArrayCache arrayCache) throws IOException
      Creates a new seekable XZ decompressor without a memory usage limit.

      This is identical to SeekableXZInputStream(SeekableInputStream) except that this also takes the arrayCache argument.

      Parameters:
      in - seekable input stream containing one or more XZ Streams; the whole input stream is used
      arrayCache - cache to be used for allocating large arrays
      Throws:
      XZFormatException - input is not in the XZ format
      CorruptedInputException - XZ data is corrupt or truncated
      UnsupportedOptionsException - XZ headers seem valid but they specify options not supported by this implementation
      EOFException - less than 6 bytes of input was available from in, or (unlikely) the size of the underlying stream got smaller while this was reading from it
      IOException - may be thrown by in
      Since:
      1.7
    • SeekableXZInputStream

      public SeekableXZInputStream​(SeekableInputStream in, int memoryLimit) throws IOException
      Creates a new seekable XZ decompressor with an optional memory usage limit.
      Parameters:
      in - seekable input stream containing one or more XZ Streams; the whole input stream is used
      memoryLimit - memory usage limit in kibibytes (KiB) or -1 to impose no memory usage limit
      Throws:
      XZFormatException - input is not in the XZ format
      CorruptedInputException - XZ data is corrupt or truncated
      UnsupportedOptionsException - XZ headers seem valid but they specify options not supported by this implementation
      MemoryLimitException - decoded XZ Indexes would need more memory than allowed by the memory usage limit
      EOFException - less than 6 bytes of input was available from in, or (unlikely) the size of the underlying stream got smaller while this was reading from it
      IOException - may be thrown by in
    • SeekableXZInputStream

      public SeekableXZInputStream​(SeekableInputStream in, int memoryLimit, ArrayCache arrayCache) throws IOException
      Creates a new seekable XZ decompressor with an optional memory usage limit.

      This is identical to SeekableXZInputStream(SeekableInputStream,int) except that this also takes the arrayCache argument.

      Parameters:
      in - seekable input stream containing one or more XZ Streams; the whole input stream is used
      memoryLimit - memory usage limit in kibibytes (KiB) or -1 to impose no memory usage limit
      arrayCache - cache to be used for allocating large arrays
      Throws:
      XZFormatException - input is not in the XZ format
      CorruptedInputException - XZ data is corrupt or truncated
      UnsupportedOptionsException - XZ headers seem valid but they specify options not supported by this implementation
      MemoryLimitException - decoded XZ Indexes would need more memory than allowed by the memory usage limit
      EOFException - less than 6 bytes of input was available from in, or (unlikely) the size of the underlying stream got smaller while this was reading from it
      IOException - may be thrown by in
      Since:
      1.7
    • SeekableXZInputStream

      public SeekableXZInputStream​(SeekableInputStream in, int memoryLimit, boolean verifyCheck) throws IOException
      Creates a new seekable XZ decompressor with an optional memory usage limit and ability to disable verification of integrity checks.

      Note that integrity check verification should almost never be disabled. Possible reasons to disable integrity check verification:

      • Trying to recover data from a corrupt .xz file.
      • Speeding up decompression. This matters mostly with SHA-256 or with files that have compressed extremely well. It's recommended that integrity checking isn't disabled for performance reasons unless the file integrity is verified externally in some other way.

      verifyCheck only affects the integrity check of the actual compressed data. The CRC32 fields in the headers are always verified.

      Parameters:
      in - seekable input stream containing one or more XZ Streams; the whole input stream is used
      memoryLimit - memory usage limit in kibibytes (KiB) or -1 to impose no memory usage limit
      verifyCheck - if true, the integrity checks will be verified; this should almost never be set to false
      Throws:
      XZFormatException - input is not in the XZ format
      CorruptedInputException - XZ data is corrupt or truncated
      UnsupportedOptionsException - XZ headers seem valid but they specify options not supported by this implementation
      MemoryLimitException - decoded XZ Indexes would need more memory than allowed by the memory usage limit
      EOFException - less than 6 bytes of input was available from in, or (unlikely) the size of the underlying stream got smaller while this was reading from it
      IOException - may be thrown by in
      Since:
      1.6
    • SeekableXZInputStream

      public SeekableXZInputStream​(SeekableInputStream in, int memoryLimit, boolean verifyCheck, ArrayCache arrayCache) throws IOException
      Creates a new seekable XZ decompressor with an optional memory usage limit and ability to disable verification of integrity checks.

      This is identical to SeekableXZInputStream(SeekableInputStream,int,boolean) except that this also takes the arrayCache argument.

      Parameters:
      in - seekable input stream containing one or more XZ Streams; the whole input stream is used
      memoryLimit - memory usage limit in kibibytes (KiB) or -1 to impose no memory usage limit
      verifyCheck - if true, the integrity checks will be verified; this should almost never be set to false
      arrayCache - cache to be used for allocating large arrays
      Throws:
      XZFormatException - input is not in the XZ format
      CorruptedInputException - XZ data is corrupt or truncated
      UnsupportedOptionsException - XZ headers seem valid but they specify options not supported by this implementation
      MemoryLimitException - decoded XZ Indexes would need more memory than allowed by the memory usage limit
      EOFException - less than 6 bytes of input was available from in, or (unlikely) the size of the underlying stream got smaller while this was reading from it
      IOException - may be thrown by in
      Since:
      1.7
  • Method Details

    • getCheckTypes

      public int getCheckTypes()
      Gets the types of integrity checks used in the .xz file. Multiple checks are possible only if there are multiple concatenated XZ Streams.

      The returned value has a bit set for every check type that is present. For example, if CRC64 and SHA-256 were used, the return value is (1 << XZ.CHECK_CRC64) | (1 << XZ.CHECK_SHA256).
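Decoding such a bit mask can be sketched as below. So that the example compiles without the library, the Check IDs are hard-coded; they are assumed to match XZ.CHECK_NONE, XZ.CHECK_CRC32, XZ.CHECK_CRC64, and XZ.CHECK_SHA256 (0, 1, 4, and 10 in the .xz file format). Real code should use the XZ constants instead. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CheckTypesSketch {
    // Assumed to mirror the XZ.CHECK_* constants; use those in real code.
    static final int CHECK_NONE = 0;
    static final int CHECK_CRC32 = 1;
    static final int CHECK_CRC64 = 4;
    static final int CHECK_SHA256 = 10;

    private CheckTypesSketch() {}

    /** Lists the check types whose bits are set in a getCheckTypes() value. */
    static List<String> describe(int checkTypes) {
        List<String> names = new ArrayList<>();

        // Check IDs in the .xz format are 4-bit values, so only
        // bits 0 to 15 can be set.
        for (int id = 0; id < 16; ++id) {
            if ((checkTypes & (1 << id)) == 0)
                continue;

            switch (id) {
                case CHECK_NONE:   names.add("None"); break;
                case CHECK_CRC32:  names.add("CRC32"); break;
                case CHECK_CRC64:  names.add("CRC64"); break;
                case CHECK_SHA256: names.add("SHA-256"); break;
                default:           names.add("Unknown (ID " + id + ")");
            }
        }

        return names;
    }
}
```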

    • getIndexMemoryUsage

      public int getIndexMemoryUsage()
      Gets the amount of memory in kibibytes (KiB) used by the data structures needed to locate the XZ Blocks. This is usually useless information, but since it is calculated for the memory usage limit anyway, it is made available to applications too.
    • getLargestBlockSize

      public long getLargestBlockSize()
      Gets the uncompressed size of the largest XZ Block in bytes. This can be useful if you want to check that the file doesn't have huge XZ Blocks which could make seeking to arbitrary offsets very slow. Note that huge Blocks don't automatically mean that seeking would be slow; for example, seeking to the beginning of any Block is always fast.
    • getStreamCount

      public int getStreamCount()
      Gets the number of Streams in the .xz file.
      Since:
      1.3
    • getBlockCount

      public int getBlockCount()
      Gets the number of Blocks in the .xz file.
      Since:
      1.3
    • getBlockPos

      public long getBlockPos​(int blockNumber)
      Gets the uncompressed start position of the given Block.
      Throws:
      IndexOutOfBoundsException - if blockNumber < 0 or blockNumber >= getBlockCount().
      Since:
      1.3
    • getBlockSize

      public long getBlockSize​(int blockNumber)
      Gets the uncompressed size of the given Block.
      Throws:
      IndexOutOfBoundsException - if blockNumber < 0 or blockNumber >= getBlockCount().
      Since:
      1.3
    • getBlockCompPos

      public long getBlockCompPos​(int blockNumber)
      Gets the position where the given compressed Block starts in the underlying .xz file. This information is rarely useful to the users of this class.
      Throws:
      IndexOutOfBoundsException - if blockNumber < 0 or blockNumber >= getBlockCount().
      Since:
      1.3
    • getBlockCompSize

      public long getBlockCompSize​(int blockNumber)
      Gets the compressed size of the given Block. This together with the uncompressed size can be used to calculate the compression ratio of the specific Block.
      Throws:
      IndexOutOfBoundsException - if blockNumber < 0 or blockNumber >= getBlockCount().
      Since:
      1.3
    • getBlockCheckType

      public int getBlockCheckType​(int blockNumber)
      Gets the integrity check type (Check ID) of the given Block.
      Throws:
      IndexOutOfBoundsException - if blockNumber < 0 or blockNumber >= getBlockCount().
      Since:
      1.3
      See Also:
      getCheckTypes()
    • getBlockNumber

      public int getBlockNumber​(long pos)
      Gets the number of the Block that contains the byte at the given uncompressed position.
      Throws:
      IndexOutOfBoundsException - if pos < 0 or pos >= length().
      Since:
      1.3
    • read

      public int read() throws IOException
      Decompresses the next byte from this input stream.
      Specified by:
      read in class InputStream
      Returns:
      the next decompressed byte, or -1 to indicate the end of the compressed stream
      Throws:
      CorruptedInputException
      UnsupportedOptionsException
      MemoryLimitException
      XZIOException - if the stream has been closed
      IOException - may be thrown by in
    • read

      public int read​(byte[] buf, int off, int len) throws IOException
      Decompresses into an array of bytes.

      If len is zero, no bytes are read and 0 is returned. Otherwise this will try to decompress len bytes of uncompressed data. Less than len bytes may be read only in the following situations:

      • The end of the compressed data was reached successfully.
      • An error is detected after at least one but less than len bytes have already been successfully decompressed. The next call with non-zero len will immediately throw the pending exception.
      • An exception is thrown.
      Overrides:
      read in class InputStream
      Parameters:
      buf - target buffer for uncompressed data
      off - start offset in buf
      len - maximum number of uncompressed bytes to read
      Returns:
      number of bytes read, or -1 to indicate the end of the compressed stream
      Throws:
      CorruptedInputException
      UnsupportedOptionsException
      MemoryLimitException
      XZIOException - if the stream has been closed
      IOException - may be thrown by in
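Because a single call may return fewer than len bytes, a caller that wants as many bytes as possible can loop until the end of the data. The helper below is a minimal sketch written against plain java.io.InputStream so that it compiles without the library; the class and method names are hypothetical. A pending error, as described in the list above, would surface as an exception from the next read call inside the loop.

```java
import java.io.IOException;
import java.io.InputStream;

final class ReadLoopSketch {
    private ReadLoopSketch() {}

    /**
     * Reads until len bytes have been stored in buf or the end of the
     * stream is reached. Returns the number of bytes actually read,
     * which is less than len only if the stream ended first
     * (0 if the stream was already at its end).
     */
    static int readUpTo(InputStream in, byte[] buf, int off, int len)
            throws IOException {
        int total = 0;

        while (total < len) {
            int n = in.read(buf, off + total, len - total);

            // -1 indicates the end of the stream.
            if (n == -1)
                break;

            total += n;
        }

        return total;
    }
}
```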
    • available

      public int available() throws IOException
      Returns the number of uncompressed bytes that can be read without blocking. The value is returned with an assumption that the compressed input data will be valid. If the compressed data is corrupt, CorruptedInputException may get thrown before the number of bytes claimed to be available have been read from this input stream.
      Overrides:
      available in class InputStream
      Returns:
      the number of uncompressed bytes that can be read without blocking
      Throws:
      IOException
    • close

      public void close() throws IOException
      Closes the stream and calls in.close(). If the stream was already closed, this does nothing.

      This is equivalent to close(true).

      Specified by:
      close in interface AutoCloseable
      Specified by:
      close in interface Closeable
      Overrides:
      close in class InputStream
      Throws:
      IOException - if thrown by in.close()
    • close

      public void close​(boolean closeInput) throws IOException
      Closes the stream and optionally calls in.close(). If the stream was already closed, this does nothing. If close(false) has been called, a further call of close(true) does nothing (it doesn't call in.close()).

      If you don't want to close the underlying InputStream, there is usually no need to worry about closing this stream either; it's fine to do nothing and let the garbage collector handle it. However, if you are using ArrayCache, close(false) can be useful to put the allocated arrays back to the cache without closing the underlying InputStream.

      Note that if you successfully reach the end of the stream (read returns -1), the arrays are automatically put back to the cache by that read call. In this situation close(false) is redundant (but harmless).

      Throws:
      IOException - if thrown by in.close()
      Since:
      1.7
    • length

      public long length()
      Gets the uncompressed size of this input stream. If there are multiple XZ Streams, the total uncompressed size of all XZ Streams is returned.
      Specified by:
      length in class SeekableInputStream
    • position

      public long position() throws IOException
      Gets the current uncompressed position in this input stream.
      Specified by:
      position in class SeekableInputStream
      Throws:
      XZIOException - if the stream has been closed
      IOException
    • seek

      public void seek​(long pos) throws IOException
      Seeks to the specified absolute uncompressed position in the stream. This only stores the new position, so this function itself is always very fast. The actual seek is done when read is called to read at least one byte.

      Seeking past the end of the stream is possible. In that case read will return -1 to indicate the end of the stream.

      Specified by:
      seek in class SeekableInputStream
      Parameters:
      pos - new uncompressed read position
      Throws:
      XZIOException - if pos is negative, or if the stream has been closed
      IOException - if pos is negative or if a stream-specific I/O error occurs
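The deferred-seek contract described above can be modelled with a toy class. This is only an illustration of the behavior over an in-memory byte array, not the library's implementation; the class, its fields, and the repositionCount() accessor are hypothetical.

```java
final class LazySeekSketch {
    private final byte[] data;
    private long curPos = 0;         // position of the next byte to return
    private long seekPos = -1;       // pending seek target, or -1 if none
    private int repositionCount = 0; // how many times a seek was carried out

    LazySeekSketch(byte[] data) {
        this.data = data;
    }

    /** Only stores the target position; no other work is done here. */
    void seek(long pos) {
        if (pos < 0)
            throw new IllegalArgumentException("Negative seek position: " + pos);

        seekPos = pos;
    }

    /** Carries out a pending seek, then returns the next byte or -1. */
    int read() {
        if (seekPos != -1) {
            // In a real decompressor the expensive part would happen here.
            curPos = seekPos;
            seekPos = -1;
            ++repositionCount;
        }

        // Seeking past the end is allowed; reading there returns -1.
        if (curPos >= data.length)
            return -1;

        return data[(int) curPos++] & 0xFF;
    }

    int repositionCount() {
        return repositionCount;
    }
}
```

In this model, several seek calls in a row cost nothing until the next read, and only the last target position is ever acted on.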
    • seekToBlock

      public void seekToBlock​(int blockNumber) throws IOException
      Seeks to the beginning of the given XZ Block.
      Throws:
      XZIOException - if blockNumber < 0 or blockNumber >= getBlockCount(), or if the stream has been closed
      IOException
      Since:
      1.3