feature: Paragraph.delete() #33

Open
scanny opened this issue Apr 3, 2014 · 18 comments

@scanny
Contributor

scanny commented Apr 3, 2014

In order to modify an existing document
As a developer using python-pptx
I need a way to delete a paragraph

Need to account for the possibility that the paragraph contains the last reference to a relationship, as a hyperlink or inline picture might.

@scanny scanny modified the milestones: v0.6.0, 0.6.2 May 1, 2014
@scanny scanny modified the milestones: v0.6.0 Cursors, 0.6.2 May 13, 2014
@scanny scanny added the text label Jun 17, 2014
@scanny scanny changed the title feature: delete_paragraph() feature: Paragraph.delete() Feb 13, 2015
@jeffreinhart

Would like to see this available for python-docx. It would be very useful in populating a document full of placeholders given that it would allow the placeholder paragraph to be deleted if the value to populate the placeholder is None.

@scanny
Contributor Author

scanny commented Mar 7, 2015

You should be able to do this for the simple case with this code:

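def delete_paragraph(paragraph):
    # Detach the paragraph's XML element from its parent.
    p = paragraph._element
    p.getparent().remove(p)
    # Corrected per the Nov 8, 2019 follow-up (originally `p._p = p._element = None`):
    # clear the proxy's references so later access raises AttributeError.
    paragraph._p = paragraph._element = None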

Any subsequent access to the "deleted" paragraph object will raise AttributeError, so you should be careful not to keep the reference hanging around, including as a member of a stored value of Document.paragraphs.

The reason it's not in the library yet is that the general case is much trickier, in particular needing to detect and handle the variety of linked items that can be present in a paragraph, such as a picture, a hyperlink, or a chart.

But if you know for sure none of those are present, these few lines should get the job done.
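For the placeholder use case mentioned above, a minimal sketch might look like the following. It assumes each placeholder occupies a whole paragraph and is written as `{{ key }}`; that syntax and the name `fill_placeholders` are hypothetical, not part of python-docx. It uses the last line of `delete_paragraph` as corrected later in this thread, and only suits paragraphs without pictures, hyperlinks, or other linked items.

```python
def delete_paragraph(paragraph):
    # Simple-case removal: detach the element, then clear the proxy's references.
    p = paragraph._element
    p.getparent().remove(p)
    paragraph._p = paragraph._element = None


def fill_placeholders(paragraphs, values):
    """Fill whole-paragraph '{{ key }}' placeholders (hypothetical syntax).

    A paragraph whose value is None, or whose key is missing from
    `values`, is deleted instead of being filled.
    """
    # Iterate over a copy so deletions cannot disturb the loop.
    for paragraph in list(paragraphs):
        stripped = paragraph.text.strip()
        if not (stripped.startswith("{{") and stripped.endswith("}}")):
            continue  # not a placeholder paragraph; leave it alone
        key = stripped[2:-2].strip()
        value = values.get(key)
        if value is None:
            delete_paragraph(paragraph)
        else:
            paragraph.text = value
```

With python-docx this would be called as something like `fill_placeholders(document.paragraphs, {"name": "Ada", "phone": None})`.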

@jeffreinhart

That works! Thank you!!

@scanny
Contributor Author

scanny commented Mar 9, 2015

Glad it worked out Jeff :)

@waynerth

Steve, thanks so much. I was having trouble after merging cells in a table which left extra empty paragraphs. Used your function and worked great, which let the cells shrink back by getting rid of empty space. Used it in a nested loop as follows:

    delete_paragraph(table.rows[rx].cells[cx].paragraphs[-1])

thanks - wayne (retired HW designer, having fun with python while hopefully helping out the non-profit I volunteer for)

@scanny scanny removed this from the Cursors / Insert items milestone Apr 9, 2016
@zooyf

zooyf commented Nov 8, 2019

Hi @scanny
Why not implement the feature and close the issue?

@zooyf

zooyf commented Nov 8, 2019

> You should be able to do this for the simple case with this code:
>
> def delete_paragraph(paragraph):
>     p = paragraph._element
>     p.getparent().remove(p)
>     p._p = p._element = None
>
> Any subsequent access to the "deleted" paragraph object will raise AttributeError, so you should be careful not to keep the reference hanging around, including as a member of a stored value of Document.paragraphs.
>
> The reason it's not in the library yet is because the general case is much trickier, in particular needing to detect and handle the variety of linked items that can be present in a paragraph; things like a picture, a hyperlink, or chart etc.
>
> But if you know for sure none of those are present, these few lines should get the job done.

What's the difference compared to this solution?

def delete_element(el):
    el._element.getparent().remove(el._element)

@scanny
Contributor Author

scanny commented Nov 8, 2019

Well, in fact, on review, there is an error in that code. The last line should be:

paragraph._p = paragraph._element = None

But as for the rest of it:

  1. delete_element and el are misleading name choices in my view. A Paragraph object is an element-proxy object which composes an element object; it is not itself an element. So in general we reserve the name element and its derivatives for the XML element objects themselves.

  2. The core code is essentially the first two lines combined into one, so that's a matter of taste; the operation is the same. I would personally probably choose something like yours in my own code, but for someone learning, sometimes breaking things down more step-by-step eases figuring out what the underlying process is, like first get the element from the proxy, then do this thing with the element, etc.

  3. The (previously incorrect) last line is setting the _p and _element attributes of the "host" Paragraph proxy object to None so the now-deleted (or actually only orphaned) element is not accidentally accessed in later code and also is freed up for garbage collection. Removing an element in lxml does not delete it, it only breaks its relationship with its parent. So the original Paragraph object could still make changes to it and the user might puzzle for quite a while to figure out why their code wasn't working but wasn't raising an error. So you can think of it as preventative medicine.
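The third point can be seen in a small sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, which python-docx actually uses; the removal semantics shown here are the same. `ParagraphProxy` is a hypothetical miniature of a proxy object, not python-docx's `Paragraph` class.

```python
import xml.etree.ElementTree as ET


class ParagraphProxy:
    """Hypothetical stand-in for a proxy object that wraps an XML element."""

    def __init__(self, element, parent):
        self._p = self._element = element
        # ElementTree has no getparent(), so the parent is stored explicitly.
        self._parent = parent

    @property
    def text(self):
        return self._element.text

    @text.setter
    def text(self, value):
        self._element.text = value


def delete_paragraph(paragraph):
    paragraph._parent.remove(paragraph._element)
    # Clear the references so stale use fails loudly instead of silently.
    paragraph._p = paragraph._element = None


body = ET.fromstring("<body><p>first</p><p>second</p></body>")

# Removing an element only orphans it; the object is still alive and writable.
orphan = body[0]
body.remove(orphan)
orphan.text = "edited"  # succeeds, but the document never sees this change
serialized = ET.tostring(body, encoding="unicode")

# With the proxy's references cleared, the same mistake raises immediately.
proxy = ParagraphProxy(body[0], body)
delete_paragraph(proxy)
try:
    proxy.text = "oops"
    raised = False
except AttributeError:
    raised = True
```

Here `serialized` contains only the second paragraph even though `orphan.text` was changed without complaint, which is the silent failure that clearing `_p` and `_element` is meant to prevent.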

@abubelinha

Thanks for this @scanny
I suggest you edit the original, previously incorrect last line, because that's the answer you still link to from Stack Overflow.

@mrufsvold

mrufsvold commented Nov 17, 2021

> Steve, thanks so much. I was having trouble after merging cells in a table which left extra empty paragraphs. Used your function and worked great, which let the cells shrink back by getting rid of empty space. Used it in a nested loop as follows:
>
>     delete_paragraph(table.rows[rx].cells[cx].paragraphs[-1])
>
> thanks - wayne (retired HW designer, having fun with python while hopefully helping out the non-profit I volunteer for)

I have this same problem. However, when I use the delete_paragraph function with the corrected last line, the resulting document throws an error when opened that reads "Word found unreadable content in document_name.docx. Do you want to recover the contents of this document?" Clicking yes works to open the document, but I'm trying to figure out why deleting the paragraphs is causing this problem.

I think it might be related to the fact that this paragraph exists in a merged cell, but it sounds like @waynerth didn't experience this problem.

Any thoughts?

Thanks for your work on this @scanny!

@scanny
Contributor Author

scanny commented Nov 17, 2021

@mrufsvold each cell must contain at least one block item, so a paragraph or a table. If you get rid of all the paragraphs, that leaves the cell in an invalid state. You might want to delete paragraphs[1:] or something like that, just be sure there's at least one left.
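A minimal sketch of that suggestion, assuming the cell's first paragraph is the one to keep and that none of the extra paragraphs contain linked items. The helper name `trim_cell_paragraphs` is hypothetical; `delete_paragraph` is the simple-case function from earlier in the thread, with the corrected last line.

```python
def delete_paragraph(paragraph):
    # Simple-case removal: detach the element, then clear the proxy's references.
    p = paragraph._element
    p.getparent().remove(p)
    paragraph._p = paragraph._element = None


def trim_cell_paragraphs(cell):
    """Delete every paragraph in `cell` except the first.

    Keeping one paragraph leaves the cell with at least one block item.
    Returns the number of paragraphs removed.
    """
    # Slicing gives a separate list, so deleting while looping is safe.
    extras = cell.paragraphs[1:]
    for paragraph in extras:
        delete_paragraph(paragraph)
    return len(extras)
```

In a nested loop over a table this could be called as `trim_cell_paragraphs(table.rows[rx].cells[cx])`.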

@mrufsvold

@scanny That makes complete sense! Thanks for your quick reply. I'll give that a shot when I get back to that project!

@mrufsvold

It worked!

@scanny
Contributor Author

scanny commented Nov 18, 2021

Glad you got it working @mrufsvold :)

@abubelinha

abubelinha commented Dec 8, 2021

> The reason it's not in the library yet is because the general case is much trickier, in particular needing to detect and handle the variety of linked items that can be present in a paragraph; things like a picture, a hyperlink, or chart etc.

@scanny Does that mean that if I delete a paragraph containing a link, my document will/might crash because the linked stuff is still kept/referenced somewhere else in the document, or something like that?

@scanny
Contributor Author

scanny commented Dec 8, 2021

It depends a little on what you mean by link, but deleting is not so much a problem in practice as copying is.

If you have a hyperlink, for example, in a paragraph, that hyperlink element in the XML contains a relationship reference (like "rId7") to a Relationship element in the .rels "file" associated with the part containing the paragraph (most commonly the document part). That Relationship element contains the URL of the hyperlink and that's the extent of the relationship (a so-called "external" relationship). If you delete the paragraph but don't delete the Relationship element in the .rels collection, that Relationship element will hang around and be saved with the document. This actually shouldn't cause a problem and I don't believe it by itself represents a file "corruption" that might give rise to a so-called "repair error" when opening the file.

If you have something "bigger", like say an image embedded in the paragraph (a so-called inline-shape), and you delete the paragraph without attending to the now-dangling relationship, then both the Relationship element in the .rels as well as the Image-part it refers to will be retained in the document. That bloats the file a little but again, shouldn't cause a problem and may or may not give rise to a "repair-error" on opening the document. You'd have to experiment and behavior might vary by client, like maybe PowerPoint doesn't complain but LibreOffice does or vice-versa.

So deleting a paragraph is worth trying if you don't mind a little wasted space.
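To see which relationships were left dangling, one rough approach is to compare the rIds referenced in a part's XML against the hyperlink and image entries in its .rels XML. The sketch below uses only the standard library and works on XML strings, for example ones read from the saved .docx with `zipfile`. The function name is hypothetical, and limiting the check to hyperlink and image relationship types is an assumption made here, because other relationships (styles, settings, and so on) are not referenced by rId from the document body.

```python
import xml.etree.ElementTree as ET

# Namespace of r:id / r:embed attributes inside a part's XML.
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
# Namespace of the <Relationship> elements in a .rels item.
PKG_RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Assumption: only these types are expected to be referenced by rId in the body.
CHECKED_TYPES = {R_NS + "/hyperlink", R_NS + "/image"}


def dangling_relationship_ids(part_xml, rels_xml):
    """Return sorted rIds of hyperlink/image relationships no element refers to."""
    part_root = ET.fromstring(part_xml)
    used = {
        value
        for element in part_root.iter()
        for name, value in element.attrib.items()
        if name.startswith("{" + R_NS + "}")
    }
    rels_root = ET.fromstring(rels_xml)
    return sorted(
        rel.get("Id")
        for rel in rels_root.iter("{" + PKG_RELS_NS + "}Relationship")
        if rel.get("Type") in CHECKED_TYPES and rel.get("Id") not in used
    )
```

In a .docx package the two inputs would typically be `word/document.xml` and `word/_rels/document.xml.rels`.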

But if you copy a paragraph and don't re-establish the relationships (which may need to change "name", e.g. "rId7" -> "rId9") and also copy over target part(s) (e.g. the image in the example above) then that will definitely trigger a repair error on loading the document because Word can't find the image to render in that paragraph.
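The rId renaming step described above might be sketched like this, again with the standard library's ElementTree standing in for lxml. The function name and the `rid_map` argument are hypothetical. The sketch only rewrites the references in the copy; creating the matching Relationship entries and copying any target parts into the destination still has to happen separately.

```python
import copy
import xml.etree.ElementTree as ET

# Namespace of r:id / r:embed attributes inside a part's XML.
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def copy_with_remapped_rids(element, rid_map):
    """Deep-copy `element`, rewriting every r:* attribute via `rid_map`.

    `rid_map` maps source rIds to rIds already created in the destination,
    e.g. {"rId7": "rId9"}. An unmapped rId raises KeyError rather than
    silently producing a reference with no matching relationship.
    """
    clone = copy.deepcopy(element)
    for el in clone.iter():
        for name, value in list(el.attrib.items()):
            if name.startswith("{" + R_NS + "}"):
                el.set(name, rid_map[value])
    return clone
```

Raising on an unmapped rId is a deliberate choice here, since a copied reference with no relationship behind it is the case that leads to a repair error.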

@abubelinha

abubelinha commented Apr 23, 2023

I think deleting is working for me, at least for the tests I made with many small controlled documents.

Now with a big document (where I do lots of things, not just deleting paragraphs) I am getting errors when opening it.
Word offers to repair them and save the document, but I wonder if I have any chance of finding the source of the error:

  • Do you know of any way to make Word report where the "unreadable content" is?
    I tried opc-diag but the output is so huge I can't really see anything there (BTW, no diff colours, just a black-and-white interface: probably not designed for my Windows 7 machine?)
  • Reading your last comment again, I wonder what exactly you mean by copying a paragraph. Could you post a simple code example? (Maybe I am unconsciously doing it, since I reuse quite a few functions written by other people.)

Thanks @scanny

@star-starry-sea

Wow, thank you. It works!!!
