Saving Duplicate Images #359

@mlh758

Contributor

Description

When adding an image, it would be useful to detect whether it is a duplicate and, instead of adding it again, return a reference to the existing media.

We have a header that gets added in some large workbooks to each sheet, and this substantially increases the file size. If we open the file in Excel and save it again, Excel detects the duplicate images and consolidates them on save, reducing the file size.

I think it would be possible to change addMedia in picture.go to handle this behavior. AddPictureFromBytes would change slightly as well:

  1. addMedia steps through all the saved media and looks for byte slices that have the same length as the file we are trying to save. For each candidate, compute the hash of both (the hash of the file being saved can be reused for later comparisons).
  2. If the hash matches, return the media path for the existing image. If no matches are found, save the new media and return its path
  3. AddPictureFromBytes calls addMedia before calling addDrawingRelationships and uses the media path provided by addMedia in the call to addDrawingRelationships.
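The steps above can be sketched roughly as follows. This is only an illustration, not the excelize implementation: `mediaStore`, `newMediaStore`, the path scheme, and the choice of SHA-256 are all assumptions made for the example.

```go
package main

import (
	"crypto/sha256"
	"strconv"
)

// mediaStore is a hypothetical stand-in for the workbook's saved media.
type mediaStore struct {
	files map[string][]byte // media path -> image bytes
	order []string          // insertion order, so lookups are deterministic
}

func newMediaStore() *mediaStore {
	return &mediaStore{files: map[string][]byte{}}
}

// addMedia returns the path of an already stored identical image, or
// stores data under a new path and returns that path.
func (s *mediaStore) addMedia(data []byte, ext string) string {
	var sum [sha256.Size]byte
	hashed := false
	for _, path := range s.order {
		existing := s.files[path]
		// Cheap constant-time filter: different lengths cannot match.
		if len(existing) != len(data) {
			continue
		}
		// Hash the incoming image once and reuse it for every candidate.
		if !hashed {
			sum = sha256.Sum256(data)
			hashed = true
		}
		if sha256.Sum256(existing) == sum {
			return path
		}
	}
	// No match: save the new media (path scheme is an assumption).
	path := "xl/media/image" + strconv.Itoa(len(s.order)+1) + ext
	s.files[path] = data
	s.order = append(s.order, path)
	return path
}
```

A caller in the role of AddPictureFromBytes would then pass the returned path on to the relationship step, whether the image was newly stored or already present.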

Checking the length of a slice is a constant-time operation, and very few slices should have exactly the same length without actually being the same media, so the hashing shouldn't have much of a performance impact. However, it would be worth adding a benchmark to ensure this doesn't cause a regression for #274. It would also be useful to add a benchmark for actually saving the xlsx file, since the cost of this check is likely to be offset by not having to write as many files.
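A micro-benchmark along these lines could be a starting point for measuring the lookup cost. It is a sketch under assumptions: `findDuplicate` and `benchmarkLookup` are hypothetical helpers, the stored images are synthetic zero-filled slices with distinct lengths, and it does not cover the file-saving benchmark mentioned above.

```go
package main

import (
	"crypto/sha256"
	"testing"
)

// findDuplicate returns the index of a stored image identical to data,
// or -1 if none matches. Only same-length candidates are hashed.
func findDuplicate(stored [][]byte, data []byte) int {
	var sum [sha256.Size]byte
	hashed := false
	for i, existing := range stored {
		if len(existing) != len(data) {
			continue
		}
		if !hashed {
			sum = sha256.Sum256(data)
			hashed = true
		}
		if sha256.Sum256(existing) == sum {
			return i
		}
	}
	return -1
}

// benchmarkLookup times one lookup against n stored images whose lengths
// all differ from the incoming image, so no hashing should be triggered.
func benchmarkLookup(n int) testing.BenchmarkResult {
	stored := make([][]byte, n)
	for i := range stored {
		stored[i] = make([]byte, 1024+i)
	}
	data := make([]byte, 512)
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			if findDuplicate(stored, data) != -1 {
				b.Fatal("unexpected match")
			}
		}
	})
}
```

Printing the result of `benchmarkLookup(100)` gives the iteration count and time per lookup; in the repository itself the same loop would more naturally live in a `Benchmark...` function run with `go test -bench`.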

I will wait for feedback on this one: I think it may be useful, but I understand not wanting to risk the performance impact.

Activity

xuri commented on Mar 20, 2019

Member

Thanks for the insight @mlh758. That's a useful feature, and I'll certainly accept a patch if somebody implements it. We can store the hash of each image in the File object to reduce the impact on performance.
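One way to read that suggestion is to keep a hash-to-path index next to the media, so stored images are never hashed again. The sketch below assumes this; `workbook`, `mediaByHash`, and the path scheme are hypothetical names, not fields of the real File type.

```go
package main

import (
	"crypto/sha256"
	"strconv"
)

// workbook is a hypothetical stand-in for the File object.
type workbook struct {
	media       map[string][]byte              // media path -> image bytes
	mediaByHash map[[sha256.Size]byte]string   // image hash -> media path
}

func newWorkbook() *workbook {
	return &workbook{
		media:       map[string][]byte{},
		mediaByHash: map[[sha256.Size]byte]string{},
	}
}

// addMedia hashes the incoming image once and looks it up in the index.
// A hit returns the existing path; a miss stores the image and its hash.
func (w *workbook) addMedia(data []byte, ext string) string {
	sum := sha256.Sum256(data)
	if path, ok := w.mediaByHash[sum]; ok {
		return path
	}
	path := "xl/media/image" + strconv.Itoa(len(w.media)+1) + ext
	w.media[path] = data
	w.mediaByHash[sum] = path
	return path
}
```

The trade-off is that every added image is hashed exactly once, even when nothing of the same length exists, in exchange for a constant-time lookup regardless of how many images are stored.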

mlh758 commented on Mar 20, 2019

Contributor, Author

A hash may actually be overkill. The bytes package has an Equal function that already handles equality and would probably be more efficient.
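A minimal sketch of that alternative, using `bytes.Equal` from the standard library; `findEqual` and the map of stored media are assumptions for illustration only.

```go
package main

import "bytes"

// findEqual returns the path of a stored image whose bytes equal data.
// The second result is false when no stored image matches.
func findEqual(stored map[string][]byte, data []byte) (string, bool) {
	for path, existing := range stored {
		// bytes.Equal reports whether the two slices have the same
		// length and contain the same bytes.
		if bytes.Equal(existing, data) {
			return path, true
		}
	}
	return "", false
}
```

Compared with hashing, this needs no extra state and stops at the first differing byte, though it still walks every stored image on each call.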

added a commit that references this issue on Mar 26, 2019
eca6618
added 2 commits that reference this issue on Oct 23, 2020
dfd03b0
4971d0e
          Saving Duplicate Images · Issue #359 · qax-os/excelize