My brother is a wonderful photographer, and took 14 gigabytes of photos at my recent graduation from Columbia, some of which I hope to post on PhotoFloat, my web 2.0 photo gallery done right via static JSON & dynamic javascript. He was kind enough to upload a ZIP of the RAW (Canon Raw 2 – CR2) photos to my FTP server overnight from his killer 50 Mbps pipe. The next day, he left for a long period of traveling.

I downloaded the ZIP archive, eager to start playing with the photographs, learning about RAW processing with tools like dcraw, lensfun, and ufraw, and seeing if I could forge Canon’s “Original Decision Data” tags. To my dismay, the ZIP file was corrupted. I couldn’t ask my brother to re-upload it or rsync the changes or anything like that, because he was traveling and it had already been a great burden for him to upload the photos in the nick of time. I tried zip -F and zip -FF and even a few Windows shareware tools. Nothing worked. So I decided to write my own tool, using nothing more than the official PKZIP spec and the man pages.

First, a bit about how ZIP files are structured. Everything here is based on the famous official spec, APPNOTE.TXT. A ZIP archive is laid out like this:

    [local file header 1]
    [file data 1]
    [data descriptor 1]
    . 
    .
    .
    [local file header n]
    [file data n]
    [data descriptor n]
    [archive decryption header] 
    [archive extra data record] 
    [central directory]
    [zip64 end of central directory record]
    [zip64 end of central directory locator] 
    [end of central directory record]

Generally, an unzipper seeks to the central directory at the end of the file, which holds the locations of all the files in the archive, along with their sizes and names. It reads this in, then seeks back toward the top to read the files off one by one.
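
For reference, the end of central directory record that an intact unzipper hunts for looks roughly like this as a packed C struct (field layout from APPNOTE.TXT; the struct and field names here are just illustrative):

    #include <stdint.h>

    /* End of central directory record, per APPNOTE.TXT; packed so the
       struct matches the on-disk layout byte for byte. */
    struct end_of_central_directory {
        uint32_t signature;                /* 0x06054b50 */
        uint16_t disk_number;
        uint16_t central_directory_disk;
        uint16_t entries_on_this_disk;
        uint16_t total_entries;
        uint32_t central_directory_size;   /* in bytes */
        uint32_t central_directory_offset; /* from the start of the archive */
        uint16_t comment_length;
        /* ...followed by a variable-length archive comment */
    } __attribute__((packed));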

The strange thing about my brother’s broken file was that the files at the beginning would extract and the files at the end would extract, but the middle 11 gigabytes were broken, with Info-ZIP complaining about wrong offsets and failed lseeks. I figured that some data had been duplicated or re-uploaded at random spots in the middle, so the offsets recorded in the central directory no longer pointed at the right places.

For each file, however, there is a local file header and an optional data descriptor. Each local file header starts with the same signature (0x04034b50) and contains the file name and the size of the file data that follows the header. But sometimes the size of a file is not known until it has already been written into the ZIP, in which case the local file header reports “0” for the file size and sets bit 3 of its flags. This indicates that after the file data, of unknown length, there will be a data descriptor giving the file size. But how do we know where the file ends, if we don’t know the length beforehand? Usually this information is duplicated in the central directory at the end of the ZIP file, but I wanted to avoid parsing that altogether. Instead, it turns out there is a de facto marker: APPNOTE.TXT states, “Although not originally assigned a signature, the value 0x08074b50 has commonly been adopted as a signature value for the data descriptor record. Implementers should be aware that ZIP files may be encountered with or without this signature marking data descriptors and should account for either case when reading ZIP files to ensure compatibility. When writing ZIP files, it is recommended to include the signature value marking the data descriptor record.” Bingo.
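
Sketched as packed C structs, those two records look roughly like this (field layout straight from APPNOTE.TXT; the names and constants are illustrative rather than lifted from my code):

    #include <stdint.h>

    #define LOCAL_FILE_HEADER_SIGNATURE 0x04034b50
    #define DATA_DESCRIPTOR_SIGNATURE   0x08074b50
    #define FLAG_HAS_DATA_DESCRIPTOR    (1 << 3)   /* bit 3: sizes follow the data */

    struct local_file_header {
        uint32_t signature;          /* LOCAL_FILE_HEADER_SIGNATURE */
        uint16_t version_needed;
        uint16_t flags;
        uint16_t compression_method; /* 0 = stored, 8 = deflated */
        uint16_t mod_time;
        uint16_t mod_date;
        uint32_t crc32;
        uint32_t compressed_size;    /* 0 when FLAG_HAS_DATA_DESCRIPTOR is set */
        uint32_t uncompressed_size;
        uint16_t file_name_length;
        uint16_t extra_field_length;
        /* ...followed by the file name, the extra field, then the file data */
    } __attribute__((packed));

    struct data_descriptor {
        uint32_t signature;          /* DATA_DESCRIPTOR_SIGNATURE, when present */
        uint32_t crc32;
        uint32_t compressed_size;
        uint32_t uncompressed_size;
    } __attribute__((packed));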

So the recovery algorithm works like this:

  • Look for the local file header signature, reading 4 bytes at a time and rewinding 3 bytes each time the match fails.
  • Once found, see if the header reports a size. If it does, read that many bytes of data out to the file path.
  • If the size isn’t there, search for the data descriptor signature, again reading 4 bytes at a time and rewinding 3 bytes on each failed match.
  • When found, read the size from the data descriptor, rewind to the start of the data segment, and read that many bytes.
  • Rewind to 4 bytes past the local file header signature and repeat the process (a sketch of this loop follows below).
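
Roughly, in code, building on the structs above (a simplified little-endian-only sketch rather than the actual tool; scan_to_signature(), recover(), and extract() are illustrative names, and the extract() stand-in here just reports each entry instead of writing it out):

    #define _FILE_OFFSET_BITS 64 /* the archive is bigger than 4 GB */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>
    #include <sys/types.h>

    /* Scan forward for a 4-byte little-endian signature: read 4 bytes and,
       on a mismatch, rewind 3 so the search advances one byte at a time. */
    static bool scan_to_signature(FILE *f, uint32_t signature)
    {
        uint32_t candidate;

        while (fread(&candidate, sizeof(candidate), 1, f) == 1) {
            if (candidate == signature)
                return true;
            if (fseeko(f, -3, SEEK_CUR))
                break;
        }
        return false;
    }

    /* Stand-in for the real extraction, which reads compressed_size bytes
       from the current position and writes them out, inflating deflated
       entries with zlib; here it just reports what it found. */
    static void extract(FILE *f, const char *name, uint32_t compressed_size)
    {
        (void)f;
        printf("Found %s (%u compressed bytes)\n", name, (unsigned)compressed_size);
    }

    static void recover(FILE *f)
    {
        struct local_file_header header;
        char name[UINT16_MAX + 1];

        while (scan_to_signature(f, LOCAL_FILE_HEADER_SIGNATURE)) {
            off_t header_start = ftello(f) - 4, data_start;

            /* Step back over the signature, read the whole fixed-size
               header and the file name, then skip the extra field. */
            fseeko(f, header_start, SEEK_SET);
            if (fread(&header, sizeof(header), 1, f) != 1)
                break;
            if (fread(name, 1, header.file_name_length, f) != header.file_name_length)
                break;
            name[header.file_name_length] = '\0';
            fseeko(f, header.extra_field_length, SEEK_CUR);
            data_start = ftello(f);

            if (!(header.flags & FLAG_HAS_DATA_DESCRIPTOR) && header.compressed_size) {
                /* The size is right in the header: read the data directly. */
                extract(f, name, header.compressed_size);
            } else if (scan_to_signature(f, DATA_DESCRIPTOR_SIGNATURE)) {
                /* The size is only in the data descriptor: find it, read
                   the size out of it, rewind to the start of the data
                   segment, then read. */
                struct data_descriptor descriptor;

                fseeko(f, -4, SEEK_CUR);
                if (fread(&descriptor, sizeof(descriptor), 1, f) != 1)
                    break;
                fseeko(f, data_start, SEEK_SET);
                extract(f, name, descriptor.compressed_size);
            }

            /* Rewind to 4 bytes past this header's signature and keep
               scanning from there. */
            fseeko(f, header_start + 4, SEEK_SET);
        }
    }

    int main(int argc, char *argv[])
    {
        FILE *f = argc > 1 ? fopen(argv[1], "rb") : NULL;

        if (!f) {
            fprintf(stderr, "usage: %s broken.zip\n", argv[0]);
            return 1;
        }
        recover(f);
        fclose(f);
        return 0;
    }

The final fseeko() back to header_start + 4 is what the breadth comment below refers to: even if the data just read was bogus, any real header hiding inside it still gets found on a later pass.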

The files may optionally be deflated, so I inflate them inline with zlib, which has the added advantage that zlib detects broken streams on its own, so I don’t need to check ZIP’s CRC-32 (though I should).
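
That inflation step looks something like this (a minimal sketch modeled on zlib’s standard inflate loop rather than copied from my code; inflate_entry() and the 64 KB buffers are arbitrary choices, and it links with -lz):

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    /* Inflate one raw-deflate stream, as stored in a ZIP entry, from `in`
       to `out`, consuming at most compressed_size bytes of input. Returns
       0 on success, -1 if zlib rejects the stream. */
    static int inflate_entry(FILE *in, FILE *out, unsigned long compressed_size)
    {
        unsigned char inbuf[65536], outbuf[65536];
        z_stream strm;
        int ret = Z_OK;

        memset(&strm, 0, sizeof(strm));
        /* Negative windowBits: the data has no zlib header, which is how
           deflated entries sit inside a ZIP file. */
        if (inflateInit2(&strm, -MAX_WBITS) != Z_OK)
            return -1;

        while (ret != Z_STREAM_END && compressed_size) {
            size_t want = compressed_size < sizeof(inbuf) ? compressed_size : sizeof(inbuf);

            strm.avail_in = fread(inbuf, 1, want, in);
            if (!strm.avail_in)
                break;
            compressed_size -= strm.avail_in;
            strm.next_in = inbuf;

            do {
                strm.avail_out = sizeof(outbuf);
                strm.next_out = outbuf;
                ret = inflate(&strm, Z_NO_FLUSH);
                if (ret == Z_NEED_DICT || ret == Z_DATA_ERROR ||
                    ret == Z_MEM_ERROR || ret == Z_STREAM_ERROR) {
                    inflateEnd(&strm); /* zlib noticed the stream is broken */
                    return -1;
                }
                fwrite(outbuf, 1, sizeof(outbuf) - strm.avail_out, out);
            } while (strm.avail_out == 0);
        }

        inflateEnd(&strm);
        return ret == Z_STREAM_END ? 0 : -1;
    }

A bogus “file” turned up by the scanner tends to die inside inflate() with Z_DATA_ERROR rather than silently producing garbage, which is the built-in verification mentioned above.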

Along the way there is some additional tricky logic for making sure we’re always searching with maximum breadth.

The end result of all this was… 100% recovery of the files in the archive, complete with their full file names. Win.

You can check out the code here. Suggestions are welcome. It’s definitely a quick hack, but it did the job. It took a lot of fiddling to make it work, especially figuring out __attribute__((packed)) to turn off gcc’s alignment padding.
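
For the curious, here is the padding problem in miniature (a hypothetical struct with the same fields as the local file header above, minus the attribute):

    #include <stdint.h>
    #include <stdio.h>

    /* The local file header again, but without __attribute__((packed)):
       gcc aligns crc32 to a 4-byte boundary, inserting 2 bytes of padding
       after mod_date, so fread()ing into this would misparse everything
       after the dates. */
    struct local_file_header_padded {
        uint32_t signature;
        uint16_t version_needed, flags, compression_method, mod_time, mod_date;
        uint32_t crc32, compressed_size, uncompressed_size;
        uint16_t file_name_length, extra_field_length;
    };

    int main(void)
    {
        printf("sizeof without packed: %zu (on disk: 30 bytes)\n",
               sizeof(struct local_file_header_padded));
        return 0;
    }

On a typical x86-64 gcc build this prints 32, while the header on disk is exactly 30 bytes; the packed attribute removes the padding so the struct can be fread() directly off the archive.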

May 21, 2011

15 Comments to “Repairing Corrupted ZIP Files by Brute Force Scanning in C”

  1. Jochen Goerdts says:

    kudos, you are the man :-) . now write a nice windows gui version.

  2. MOHAN says:

    WHAT AN IDEA SIR JI……………

  3. Samir says:

    This is why the release scene is using RAR and not ZIP.

  4. Dmitry says:

I once rolled up a huge 5 GB zip file, which unzip refused to unpack. I rolled the file myself, no transfer was ever involved, so I assumed there was some bug in unzip’s code for files greater than 4 GB… So I tried 7zip instead and lo, it unpacked the file flawlessly. Would be interesting to know if you’ve hit the same problem.

  5. human says:

Dude! This is pretty old stuff, at least 20-year-old technology! But full marks for reinventing the wheel. PkZipFix was released with the original zip utilities package back in the ’90s, and it did exactly this.

    • Jason says:

      I didn’t find this one… Every other tool I tried didn’t work at all. PkZipFix, eh?

      It’s not a very complicated solution here, but I guess it’s just fallen out of favor.

  6. caf says:

    You should add an extra step at step 4 – verify the size makes sense (is neither longer nor significantly shorter than the distance from the start of the data segment to the putative data descriptor). This will make it robust in the face of files that just happen to include the magic bytes in their compressed form (and with 11G of archive, you’re likely to see that value come up 2 or 3 times just through random chance).

    • Jason says:

That is a good idea. Perhaps I should prefer the dd – lfh value over the reported file size.

  7. JamesP says:

    “Look for a local file header signature integer, reading 4 bytes, and rewinding 3 each time it fails.”

This is inefficient for several reasons (like using several syscalls)

    There are faster ways, but you could do it like this:
read 4 bytes and do 3 word compares – that is, compare *(short int *)p, *(short int *)(p + 1), and *(short int *)(p + 2) to the first 2 bytes of the tag. Then, if any matches, compare the last 2 bytes

    Or just do a plain 32bit comparison. But rewinding is going to slow you down.

There are other algorithms for searching efficiently; I’m sure Google has a lot of examples

    • Jason says:

      Yea yea yea yea yea bla bla I knowwwwww. I pretty much just pounded this out, but indeed, this is absolutely what I should be doing. Rewinding is horrible.

  8. Mark Stosberg says:

    Thanks for this. It could potentially be a big help. However, when I try to compile it, I get this result:

read.c:(.text+0x533): undefined reference to `inflateInit2_'
    read.c:(.text+0x63a): undefined reference to `inflate'
    read.c:(.text+0x7a6): undefined reference to `inflateEnd'

  9. Mark Stosberg says:

    Oh, this fixes that:

    gcc read.c -lz

  10. Mark Stosberg says:

    Ok, I got it to run, but it quickly segfaults with no diagnostic message. I guess it handles some kinds of corruption more gracefully than others. :)

    # ./a.out ./00131608-copy.zip
    Writing [Content_Types].xml of length 0.000470 MB
    Deflating.
    All is going according to plan.
    Writing _rels/.rels of length 0.000232 MB
    Deflating.
    Segmentation fault

  11. Massa says:

    When I try to compile with:

    gcc zip-fixer.c -lz

    I get these warnings:

zip-fixer.c:118:19: warning: assigning to 'Bytef *' (aka 'unsigned char *') from 'char [65536]' converts between pointers to integer types with different sign [-Wpointer-sign]
        strm.next_in = buffer;
                     ^ ~~~~~~
    zip-fixer.c:121:21: warning: assigning to 'Bytef *' (aka 'unsigned char *') from 'char [65536]' converts between pointers to integer types with different sign [-Wpointer-sign]
        strm.next_out = zbuffer;
                      ^ ~~~~~~~
    2 warnings generated.

    Sure they’re “just” warnings, but is there any way to fix these?

  12. Jane says:

Coollll! Have you also tried: jar xvf abc.zip
