First steps towards zipfile.ZipFile integration #109
Comments
(Just a comment that ZIP might use zlib, aka DEFLATE, but can also store members uncompressed, with bzip2, and maybe others. DEFLATE should be the most common.)
Yes, true. We would want to raise an exception if the zip file didn't use zlib compression.
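A minimal sketch of that check, using only the standard-library `zipfile` module. The helper name `require_deflate` and the in-memory test archive are made up for illustration; they are not part of `indexed_gzip` or any existing package.

```python
import io
import zipfile


def require_deflate(zf, member):
    """Return the ZipInfo for `member`, raising if it is not DEFLATE-compressed.

    `require_deflate` is a hypothetical helper name used only in this sketch.
    """
    info = zf.getinfo(member)
    if info.compress_type != zipfile.ZIP_DEFLATED:
        raise ValueError(
            f"{member!r} uses compression method {info.compress_type}, "
            "expected ZIP_DEFLATED"
        )
    return info


# Build a small in-memory archive with one stored and one deflated member.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("stored.txt", b"hello", compress_type=zipfile.ZIP_STORED)
    zf.writestr("deflated.txt", b"hello" * 100, compress_type=zipfile.ZIP_DEFLATED)

archive = zipfile.ZipFile(buf)
deflated_info = require_deflate(archive, "deflated.txt")  # passes the check
```

Calling `require_deflate(archive, "stored.txt")` raises `ValueError`, which is the behaviour suggested above for members that do not use zlib compression.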
Hi @forrestfwilliams @martindurant, I'm not at all familiar with the zip format, so I'm afraid I can't really comment on how complex it would be to support zip. I don't have the time to spend on this project (apart from bugs), but am happy to consult/review PRs.
Actually, I think it would be trivial, if the code can apply to zlib/inflate streams, not only gzip. I believe this is already the case.
The big hurdle to overcome is allowing The If we could expose the capability of
I believe you are correct: they share the same compression algorithm, but add headers (or none for deflate). If we extract an index, we just skip the header to the compressed stream, and all three should look the same.
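To illustrate the point about the three framings, here is a small sketch using the standard-library `zlib` module, where the `wbits` argument selects raw DEFLATE (`-15`), zlib (`15`), or gzip (`31`) framing. The payload and the `compress` helper are made up for the example; the header and trailer sizes apply to the minimal headers that `zlib` itself writes (a gzip file produced by other tools may carry a longer header).

```python
import zlib

payload = b"the same bytes, three containers " * 50


def compress(data, wbits):
    """Compress `data` with the framing selected by `wbits`."""
    c = zlib.compressobj(wbits=wbits)
    return c.compress(data) + c.flush()


raw = compress(payload, -15)         # raw DEFLATE: no header, no trailer
zlib_stream = compress(payload, 15)  # 2-byte zlib header, 4-byte Adler-32 trailer
gzip_stream = compress(payload, 31)  # 10-byte gzip header, 8-byte CRC-32 + size trailer

# Skip each container's header and trailer, then decode what is left as raw DEFLATE.
from_raw = zlib.decompress(raw, wbits=-15)
from_zlib = zlib.decompress(zlib_stream[2:-4], wbits=-15)
from_gzip = zlib.decompress(gzip_stream[10:-8], wbits=-15)
```

All three results equal `payload`, which is why an index built over the compressed stream should not need to care which container it came from once the header has been skipped.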
@forrestwilliams as you note, the The code should be agnostic to whether a stream has a
But There is also the matter of the footer - currently Again, optimistically, I don't think it would be too difficult to add support for By the way, the
Here is how to get a gzip version of a DEFLATE file in a ZIP:

```python
import zipfile
import gzip
import io

# zip_path: path to the ZIP archive (any other seekable binary file-like object also works)
f = open(zip_path, "rb")
z = zipfile.ZipFile(file=f)
zinf = z.infolist()[0]  # or whichever member file we want
assert zinf.compress_type == zipfile.ZIP_DEFLATED
# Skip the member's local file header to reach the raw DEFLATE stream
f.seek(zinf.header_offset + len(zinf.FileHeader()))
data = f.read(zinf.compress_size)
# Minimal gzip header and trailer, see
# https://www.rfc-editor.org/rfc/rfc1952#section-2.3.1
ghead = b"\x1f\x8b\x08\x00" + b"\x00\x00\x00\x00" + b"\x00\xff"
gfoot = zinf.CRC.to_bytes(4, "little") + (zinf.file_size % 2**32).to_bytes(4, "little")
b = io.BytesIO(ghead + data + gfoot)
b.seek(0)
gzip.GzipFile(fileobj=b).read()  # OK
```

This could be modified to work with a streaming remote file (s3, etc.) without needing to hold the whole data in memory at once. Instead of calling GzipFile, we would pass this custom file-like object to indexed_gzip for both indexing and random-access reading. This is well within the kinds of tricks we routinely play with kerchunk.
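A rough sketch of what such a custom file-like object could look like. The class name `GzipFramedSlice` is invented for this example, and the underlying file is an in-memory ZIP rather than a remote one; it presents the gzip header, a byte range of the underlying file, and the gzip trailer as one seekable stream, reading the compressed bytes on demand. It is checked here against `gzip.GzipFile` only; passing it to indexed_gzip is the idea proposed above and is not exercised in this sketch.

```python
import gzip
import io
import zipfile

GZIP_HEAD = b"\x1f\x8b\x08\x00" + b"\x00\x00\x00\x00" + b"\x00\xff"


class GzipFramedSlice(io.RawIOBase):
    """Read-only view: `head` + bytes [start, start + size) of `raw` + `foot`."""

    def __init__(self, raw, start, size, head, foot):
        self._raw, self._start, self._size = raw, start, size
        self._head, self._foot = head, foot
        self._len = len(head) + size + len(foot)
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._len}[whence]
        self._pos = max(0, min(self._len, base + offset))
        return self._pos

    def readinto(self, out):
        # Serve at most one region (header, body, or trailer) per call;
        # short reads are allowed for raw streams.
        n = min(len(out), self._len - self._pos)
        head_len = len(self._head)
        if n <= 0:
            chunk = b""
        elif self._pos < head_len:
            chunk = self._head[self._pos:self._pos + n]
        elif self._pos < head_len + self._size:
            body_pos = self._pos - head_len
            self._raw.seek(self._start + body_pos)
            chunk = self._raw.read(min(n, self._size - body_pos))
        else:
            foot_pos = self._pos - head_len - self._size
            chunk = self._foot[foot_pos:foot_pos + n]
        out[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


# Demonstration on an in-memory archive written by zipfile itself.
payload = bytes(range(256)) * 400
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("member.bin", payload)

zinf = zipfile.ZipFile(archive).infolist()[0]
start = zinf.header_offset + len(zinf.FileHeader())  # same offset rule as the snippet above
foot = zinf.CRC.to_bytes(4, "little") + (zinf.file_size % 2**32).to_bytes(4, "little")
view = GzipFramedSlice(archive, start, zinf.compress_size, GZIP_HEAD, foot)
recovered = gzip.GzipFile(fileobj=view).read()
```

Only the range currently being read is pulled from the underlying file, so the same wrapper should in principle work over any seekable file-like object, including one backed by remote storage.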
@martindurant this looks great! I've taken this code/info and created a script that works for a specific use case of mine. This script follows the code you wrote above, but I ran into an issue when finding the location to start reading the raw DEFLATE stream from the zip. Specifically, using a While this seems to work in a few cases, I'm betting this won't work in the majority of cases and I would like to figure out what is happening. I think it has something to do with the varying length of the
Thanks @martindurant! Should the last line of the above be:
You are right, of course :)
@martindurant this solution worked, thanks! Next week I'll try to re-work the scripts in the repo I linked to above so that the parts relevant to
Yes, it would be great for kerchunk to be able to directly access chunks within files in ZIP or tar.gz archives. It will, I think, take some work to get this right; the index files need to be stored somewhere and loaded on demand, for instance.
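For reference, one way to locate the start of a member's compressed data without relying on the length of a regenerated header is to read the member's local file header directly. In the ZIP format that header has a 30-byte fixed part, with the file name length and extra field length stored as little-endian 16-bit integers at bytes 26 to 29, and those lengths need not match what the central directory records. This is offered as a sketch under that assumption, not as the fix that was posted in the thread; `member_data_offset` is a hypothetical helper name.

```python
import io
import struct
import zipfile
import zlib

LOCAL_HEADER_SIZE = 30                    # fixed part of a ZIP local file header
LOCAL_HEADER_SIGNATURE = b"PK\x03\x04"


def member_data_offset(fileobj, info):
    """Return the absolute offset of the compressed data for `info` in `fileobj`."""
    fileobj.seek(info.header_offset)
    fixed = fileobj.read(LOCAL_HEADER_SIZE)
    if len(fixed) != LOCAL_HEADER_SIZE or fixed[:4] != LOCAL_HEADER_SIGNATURE:
        raise ValueError("no local file header at the recorded offset")
    # Lengths of the variable-size file name and extra field that follow.
    name_len, extra_len = struct.unpack("<HH", fixed[26:30])
    return info.header_offset + LOCAL_HEADER_SIZE + name_len + extra_len


# Demonstration on an in-memory archive.
payload = b"satellite imagery stand-in " * 200
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("scene/data.bin", payload)

info = zipfile.ZipFile(archive).getinfo("scene/data.bin")
offset = member_data_offset(archive, info)
archive.seek(offset)
raw_deflate = archive.read(info.compress_size)
restored = zlib.decompress(raw_deflate, wbits=-15)  # decode as raw DEFLATE
```

Here `restored` equals `payload`, confirming that the computed offset points at the first byte of the raw DEFLATE stream.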
Hello @pauldmccarthy this package is really fantastic! I work at the Alaska Satellite Facility where we distribute satellite imagery data at no cost to scientists and we've run into a similar issue to the one `indexed_gzip` was designed to solve. Unfortunately, instead of `gzip`, the files we're trying to index use `zip` compression. We would love to have an `indexed_zip` package that has the same functionality as `indexed_gzip`, but is compatible with Python's `zipfile.ZipFile`. Since `zipfile.ZipFile` also uses `zlib` under the hood, how big of a lift do you think this would be?