feature: BlockItemContainer.iter_block_items() #40

Closed
danmilon opened this issue Apr 25, 2014 · 43 comments
Labels
inner-content Methods to access all content inside doc, para, run, etc.

Comments

@danmilon

I'd like to iterate over the elements of the document in the order they appear in it. For example, if there is a paragraph, a table, and then a paragraph again, I want to get them in that order. AFAIK there are currently two properties on Document, paragraphs and tables, but there is no notion of ordering between them.
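
To illustrate the ordering problem, here is a minimal sketch that uses only the standard library's ElementTree on a hand-written, simplified body. It is not python-docx code; the XML content and variable names are assumptions made for the example. Two separate lists lose the interleaving, while iterating the direct children of the body keeps it.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, simplified body: a paragraph, a table, then a paragraph.
BODY_XML = f"""<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>before</w:t></w:r></w:p>
  <w:tbl><w:tr><w:tc><w:p><w:r><w:t>cell</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
  <w:p><w:r><w:t>after</w:t></w:r></w:p>
</w:body>"""

body = ET.fromstring(BODY_XML)

# Two separate lists (direct children only), analogous to keeping
# paragraphs and tables apart: the relative order is no longer recoverable.
paragraphs = body.findall(f"{{{W}}}p")
tables = body.findall(f"{{{W}}}tbl")

# Iterating the direct children keeps paragraphs and tables interleaved.
order = [child.tag.split("}")[1] for child in body]
print(order)
```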

@scanny scanny modified the milestones: v0.6.0, 0.6.1 May 1, 2014
@pmagsino

pmagsino commented May 2, 2014

This is a feature that would be useful for data mining. In my use case, the primary data to be extracted is in tables. The related secondary data comes from paragraphs that either precede or straddle the table.

@scanny
Contributor

scanny commented May 3, 2014

This workaround should work for anyone who can't wait for the Document.iter_block_items() feature to be implemented. I haven't tested it, so please provide feedback if it gives any trouble or you get it to work.

It can accept either a document or a table cell for its parent argument.

from docx.api import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text import CT_P
from docx.table import _Cell, Table
from docx.text import Paragraph

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph.
    """
    if isinstance(parent, Document):
        parent_elm = parent._document_part.body._body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child)
        elif isinstance(child, CT_Tbl):
            yield Table(child)

@pmagsino

pmagsino commented May 6, 2014

Awesome. The workaround works great for my use case. Thanks again.

@scanny
Contributor

scanny commented May 6, 2014

Glad to hear it Paul :)

I'll leave this issue open as the feature request.

@scanny scanny changed the title iterate paragraphs/tables as they appear in the document feature: BlockItemContainer.iter_block_items() May 6, 2014
@scanny scanny modified the milestones: v0.6.0 Cursors, 0.6.1 May 13, 2014
@ghost

ghost commented Aug 26, 2014

I had to make these changes to the code to get the function to work:

if isinstance(child, CT_P):
    yield Paragraph(child,parent_elm)
elif isinstance(child, CT_Tbl):
    yield Table(child,parent_elm)

@scanny
Contributor

scanny commented Aug 26, 2014

I think None would probably be better than parent_elm. The parent parameter, which was added to the Paragraph and Table constructors after this issue was opened, expects the parent proxy object like _Body or (table) _Cell, not the lxml parent element (e.g. <w:body>).

These are only used when an upward reference is required, such as when inserting a picture, so depending on the use case, using None might work well enough to get the job done. Using parent and making sure it was a reference to _Body or _Cell would be better.

In any case, this hack is due for a proper solution once I can get back to it. Been very busy on python-pptx just lately getting chart functionality going there :)

UPDATE:
On later reflection, it became clear the new parameter should simply be parent, the original argument passed to iter_block_items. An updated full version is a couple of comments down.

@ghost

ghost commented Sep 2, 2014

Thanks for the help. I am also interested to know how you would go about this function. More specifically, how would you want to handle inline images, charts, and mathematical equations when they appear in the text? I am thinking of just returning the XML in the case of charts or equations, and returning the image when there is an image in the run.

@scanny
Contributor

scanny commented Sep 4, 2014

Well, a solution for the general case would yield a proxy object (e.g. Paragraph, Table) for each element encountered so the developer could operate on the object without having to go down to the XML level. This gets a little tricky because there is a surprisingly large array of types that can appear within a block context or inline context, and not nearly all of them have proxy objects yet. The <w:del> and <w:ins> elements, which have to do with revision tracking, are examples.

One solution would be to return a proxy object when you could and then a generic NotImplementedObject or something when no suitable proxy class existed for the item.
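
A minimal sketch of that fallback idea, using only the standard library so it runs without python-docx. The classes ParagraphProxy, TableProxy, and UnsupportedItem and the function iter_items_with_fallback are hypothetical stand-ins invented for this illustration; none of them is part of python-docx.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


@dataclass
class ParagraphProxy:  # hypothetical stand-in for a Paragraph proxy
    element: ET.Element


@dataclass
class TableProxy:  # hypothetical stand-in for a Table proxy
    element: ET.Element


@dataclass
class UnsupportedItem:  # hypothetical generic placeholder
    element: ET.Element

    @property
    def local_name(self) -> str:
        # Tag without its "{namespace}" prefix, e.g. "ins" or "sectPr".
        return self.element.tag.rsplit("}", 1)[-1]


# Elements that have a dedicated proxy; everything else gets the placeholder.
PROXY_BY_TAG = {f"{{{W}}}p": ParagraphProxy, f"{{{W}}}tbl": TableProxy}


def iter_items_with_fallback(parent_elm):
    """Yield one object per child element, in document order."""
    for child in parent_elm:
        proxy_cls = PROXY_BY_TAG.get(child.tag, UnsupportedItem)
        yield proxy_cls(child)


body = ET.fromstring(
    f'<w:body xmlns:w="{W}"><w:p/><w:ins/><w:tbl/><w:sectPr/></w:body>'
)
items = list(iter_items_with_fallback(body))
kinds = [type(item).__name__ for item in items]
print(kinds)
```

The caller can then decide whether to skip placeholder items or inspect their XML directly.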

Note also that there are two main contexts one might want to iterate over, a block context and an inline context. An element like the <w:body> element of a document part contains block-level objects like Paragraph and Table. A Paragraph itself is an inline context and contains things like Run, and inline pictures, hyperlinks, etc.

This issue was originally about block-level items, but a corresponding method for iterating over inline objects would also be handy.
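
The two contexts can be seen directly in the XML. The sketch below uses only the standard library's ElementTree on a hand-written, simplified body; the XML content and the helper name local_name are assumptions for illustration, and this is not python-docx code.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def local_name(element):
    """Return the tag without its '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


# Hypothetical body: one paragraph (run, hyperlink, run) followed by a table.
BODY_XML = f"""<w:body xmlns:w="{W}">
  <w:p>
    <w:r><w:t>See </w:t></w:r>
    <w:hyperlink><w:r><w:t>the docs</w:t></w:r></w:hyperlink>
    <w:r><w:t> for details.</w:t></w:r>
  </w:p>
  <w:tbl><w:tr><w:tc><w:p/></w:tc></w:tr></w:tbl>
</w:body>"""

body = ET.fromstring(BODY_XML)

# Block context: the direct children of <w:body>.
block_items = [local_name(child) for child in body]

# Inline context: the direct children of a single <w:p>.
inline_items = [local_name(child) for child in body[0]]

print(block_items, inline_items)
```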

@scanny
Contributor

scanny commented Oct 24, 2014

An updated snippet that should do the trick and is consistent with the latest internals would look like this. I haven't had time to test it, so if it gives you trouble let me know and I'll help fix :)

from docx.document import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent._document_part.body._body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

@danmilon
Author

danmilon commented Nov 3, 2014

@scanny, yes, this works perfectly. Do you want me to wrap this up, and send in a PR?

@scanny
Contributor

scanny commented Nov 5, 2014

The tests will be the key outstanding components for this one. If you want to take a crack at it, by all means :)

@cez81

cez81 commented Mar 9, 2015

Is there a way of doing this after the recent changes to docx.Document?

@scanny
Contributor

scanny commented Mar 10, 2015

Not yet; this feature is still in the backlog. The last release focused on styles support.

@scanny
Contributor

scanny commented Mar 27, 2015

Oh, I think I misinterpreted your question. Some of the imports have to change due to recent refactoring:

from docx.document import Document
from docx.oxml.text.paragraph import CT_P
from docx.text.paragraph import Paragraph

Is that what you were asking @cez81 ?

I've updated the example.

@cez81

cez81 commented Apr 3, 2015

Yes, I think it was. Unfortunately I can't get it to work though... I get an error creating the Document instance:
"TypeError: __init__() missing 1 required positional argument: 'part'". I'm guessing it has to do with the first line importing the wrong Document class?

from docx.document import Document
from docx.oxml.text.paragraph import CT_P
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent._document_part.body._body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


doc = Document('test.docx')
for block in iter_block_items(doc):
    print(block.text)

@scanny
Contributor

scanny commented Apr 7, 2015

Ah, right. The following should fix your case, where both names are needed in the same module: the docx.document.Document class (for the isinstance check) and the docx.Document factory function (for opening the file):

import docx

doc = docx.Document('test.docx')
for block in iter_block_items(doc):
    print(block.text)

@pdelsante

Hi, I think @cez81 is right: there seems to be something more that changed in your code lately. To make your example work again with 0.8.5 I had to change it like this:

from docx.document import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent.element
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

In particular, I had to change the following line:

        parent_elm = parent._document_part.body._body

to this:

        parent_elm = parent.element

@scanny
Contributor

scanny commented Apr 8, 2015

Ah, yes, I see what you're saying. The way to get a reference to the <w:body> element changed as well. I think what you want is this though:

if isinstance(parent, Document):
    parent_elm = parent.element._body

... because parent.element is the <w:document> element if I'm reading the code correctly.

Apologies I don't have time to test this right now, but hope that helps. Such are the wages of workaround functions because they rely on internals that aren't guaranteed to be stable between releases.

@cez81

cez81 commented Apr 8, 2015

Ok got it working now! Changed it to:

if isinstance(parent, Document):
    parent_elm = parent.element.body

Thanks for the help both of you!

@igorsavinkin

igorsavinkin commented Nov 7, 2018 via email

@aistellar

nested tables should be easy to handle with recursion

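from docx.document import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            # Descend into each cell instead of yielding the table itself.
            table = Table(child, parent)
            for row in table.rows:
                for cell in row.cells:
                    yield from iter_block_items(cell)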

@romran

romran commented Dec 10, 2018

Hello everyone,
It helped me a lot when parsing docx files.
Could you tell me whether it is possible to refactor this function so that it also finds InlineShapes?

@Slowhalfframe

Hello everyone.
I have a question: how can I use Python to read the pictures in a Word document in the order they appear?

@lxj0276

lxj0276 commented Aug 15, 2019

I want to know how to read pictures or charts with a function like "iter_block_items".

@Slowhalfframe

I want to know how to read pictures or charts with a function like "iter_block_items".

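from xml.dom.minidom import parseString

# Import aliases assumed from the names used below.
from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import Table, _Cell, _Row
from docx.text.paragraph import Paragraph

def read_item_block(parent):
    '''
    Read the Word content in document order.
    :param parent: the document (or a cell or row)
    :return: (paragraph, flag) for paragraphs, (table,) for tables
    '''
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    elif isinstance(parent, _Row):
        parent_elm = parent._tr
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            count = 1        # flag: paragraph contains a picture
            count_false = 0  # flag: no picture found
            res = Paragraph(child, parent)
            if res.text != '':
                yield (res, count_false)
            else:
                try:
                    # Try to find inline elements (pictures) in the paragraph XML
                    DOMTree = parseString(child.xml)
                    data = DOMTree.documentElement
                    nodelist = data.getElementsByTagName('pic:blipFill')
                    print('*nodelist' * 9, nodelist)
                    if len(nodelist) < 1:
                        yield (res, count_false)
                    else:
                        yield (res, count)
                except Exception as e:
                    print('\n' * 9, e)
                    yield (res, count_false)
        elif isinstance(child, CT_Tbl):
            yield (Table(child, parent),)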

This is how I read pictures.
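
The same check can be sketched in a namespace-aware way, which avoids relying on the literal "pic:" prefix that minidom's tag-name matching depends on. This is only an illustration on hand-written, simplified XML using the standard library; the function name paragraph_has_picture is hypothetical, and the sample drawing omits the wrapper elements a real document would contain.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
PIC = "http://schemas.openxmlformats.org/drawingml/2006/picture"
NS = {"w": W, "a": A, "pic": PIC}


def paragraph_has_picture(p_elm):
    """Return True if a <pic:blipFill> appears anywhere under the paragraph."""
    return p_elm.find(".//pic:blipFill", NS) is not None


# Hypothetical, heavily simplified paragraph containing an inline picture.
PICTURE_P = f"""<w:p xmlns:w="{W}" xmlns:a="{A}" xmlns:pic="{PIC}">
  <w:r><w:drawing><a:graphic><a:graphicData>
    <pic:pic><pic:blipFill><a:blip/></pic:blipFill></pic:pic>
  </a:graphicData></a:graphic></w:drawing></w:r>
</w:p>"""

TEXT_P = f'<w:p xmlns:w="{W}"><w:r><w:t>just text</w:t></w:r></w:p>'

with_picture = paragraph_has_picture(ET.fromstring(PICTURE_P))
without_picture = paragraph_has_picture(ET.fromstring(TEXT_P))
print(with_picture, without_picture)
```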

@devanshugupta

I'm getting this error with your code:

Traceback (most recent call last):
  File "C:/Users/home/PycharmProjects/Sentiment_analysis/yup.py", line 46, in
    for cell in row.cells:
  File "C:\Users\home\Desktop\devu\venv\lib\site-packages\docx\table.py", line 401, in cells
    return tuple(self.table.row_cells(self._index))
  File "C:\Users\home\Desktop\devu\venv\lib\site-packages\docx\table.py", line 106, in row_cells
    return self._cells[start:end]
  File "C:\Users\home\Desktop\devu\venv\lib\site-packages\docx\table.py", line 173, in _cells
    cells.append(cells[-col_count])
IndexError: list index out of range

@Labyrins

Labyrins commented Sep 6, 2020

Ok got it working now! Changed it to:

if isinstance(parent, Document):
    parent_elm = parent.element.body

Thanks for the help both of you!

Thank you, this works for me!

@div1996

div1996 commented Sep 7, 2020

How can I read paragraphs, tables, and shapes all in one place? Kindly help ASAP.

@ejaca

ejaca commented Mar 17, 2023

Hi admin,

Do you have any idea what I can change in the code?

This is the code I currently have for iterating tables and paragraphs:

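from typing import Iterator, Union

# Import aliases assumed from the names used below.
from docx.document import Document as DocxDocument
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import Table as DocxTable, _Cell
from docx.text.paragraph import Paragraph as DocxParagraph

def iterate_tables_and_paragraphs(
    parent: Union[DocxDocument, _Cell]
) -> Iterator[Union[DocxParagraph, DocxTable]]:
    if isinstance(parent, DocxDocument):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("Invalid type parameter, expected DocxDocument or _Cell")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield DocxParagraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield DocxTable(child, parent)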

There were no problems in running the code but the output was incorrect.

The problem is that my document has two tables, one directly above the other, with different numbers of columns. Table 1 has 18 columns and table 2 has 20, but this code sets the number of columns to the maximum, 20. When I read the data, table 1 produced incorrect results because it looped 20 times, so some data from the next row was included as table headers.

Please help. Thanks.

@cyrillkuettel

The imports are tricky to get right, so here you go. This should work for the latest version.

from docx.text.paragraph import Paragraph
from docx.document import Document
from docx.table import _Cell, Table
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
import docx

def iter_block_items(parent):
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            table = Table(child, parent)
            for row in table.rows:
                for cell in row.cells:
                    yield from iter_block_items(cell)
                    
doc = docx.Document('word.docx')
for block in iter_block_items(doc):
    print(block.text)

@abubelinha

abubelinha commented Apr 12, 2023

@cyrillkuettel Thanks a lot for sharing!

@abubelinha

@scanny scanny added inner-content Methods to access all content inside doc, para, run, etc. and removed text table navigation labels Sep 24, 2023
@scanny
Contributor

scanny commented Nov 3, 2023

Added BlockItemContainer.iter_inner_content() in v1.0.2. Document, Header, Footer, and (table) _Cell are all block-item containers. It generates each Paragraph | Table within that container, in document order. Contrast with Section.iter_inner_content(), which does the same but only within a single section.

It is not recursive, so you'll need to take care of that aspect if you want it (not everyone will, which is why it's not implemented here).

Maybe something like:

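from __future__ import annotations

from typing import Iterator

from docx.blkcntnr import BlockItemContainer
from docx.table import Table
from docx.text.paragraph import Paragraph

def recursively_iter_block_items(blkcntnr: BlockItemContainer) -> Iterator[Paragraph | Table]:
    for item in blkcntnr.iter_inner_content():
        if isinstance(item, Paragraph):
            yield item
        elif isinstance(item, Table):
            for row in item.rows:
                for cell in row.cells:
                    yield from recursively_iter_block_items(cell)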
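
The shape of that recursion can also be seen at the XML level. The sketch below is not python-docx code: it uses only the standard library's ElementTree on a hand-written, simplified body with a nested table, and the names iter_paragraphs_recursive and paragraph_text are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterator

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
P, TBL, TR, TC, T = (f"{{{W}}}{name}" for name in ("p", "tbl", "tr", "tc", "t"))


def iter_paragraphs_recursive(container: ET.Element) -> Iterator[ET.Element]:
    """Yield every <w:p> under a body or cell, descending into tables."""
    for child in container:
        if child.tag == P:
            yield child
        elif child.tag == TBL:
            for tr in child.findall(TR):
                for tc in tr.findall(TC):
                    # A cell is itself a block container, so recurse into it.
                    yield from iter_paragraphs_recursive(tc)


def paragraph_text(p: ET.Element) -> str:
    """Concatenate the text of all <w:t> descendants of a paragraph."""
    return "".join(t.text or "" for t in p.iter(T))


# Hypothetical, simplified body with a table nested inside a table cell.
BODY_XML = f"""<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>intro</w:t></w:r></w:p>
  <w:tbl>
    <w:tr>
      <w:tc>
        <w:p><w:r><w:t>outer cell</w:t></w:r></w:p>
        <w:tbl><w:tr><w:tc><w:p><w:r><w:t>nested cell</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
      </w:tc>
      <w:tc><w:p><w:r><w:t>second cell</w:t></w:r></w:p></w:tc>
    </w:tr>
  </w:tbl>
  <w:p><w:r><w:t>outro</w:t></w:r></w:p>
</w:body>"""

texts = [paragraph_text(p) for p in iter_paragraphs_recursive(ET.fromstring(BODY_XML))]
print(texts)
```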

@scanny scanny closed this as completed Nov 3, 2023
@LucianoMan

The imports are tricky to get right, so here you go. This should work for the latest version.

from docx.text.paragraph import Paragraph
from docx.document import Document
from docx.table import _Cell, Table
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
import docx

def iter_block_items(parent):
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            table = Table(child, parent)
            for row in table.rows:
                for cell in row.cells:
                    yield from iter_block_items(cell)
                    
doc = docx.Document('word.docx')
for block in iter_block_items(doc):
    print(block.text)

This works wonderfully; however, it seems to repeat text that is in tables. Does anyone know how to stop the code from doing this? I attempted to use sets, but that got rid of repeated text that I need.

@scanny
Contributor

scanny commented Jan 11, 2024

@LucianoMan Something like this should do the trick:

from typing import Iterator

from docx.table import _Cell, _Row

def iter_visible_row_cells(row: _Row) -> Iterator[_Cell]:
    """Generate only "concrete" cells, those with a `tc` element.

    Vertically spanned cells have a `tc` element but are skipped.
    """
    yield from (_Cell(tc, row) for tc in row._tr.tc_lst if tc.vMerge != "continue")
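
For anyone curious what that filter corresponds to in the XML, here is a rough standard-library sketch on a hand-written table. It assumes the usual WordprocessingML convention that a <w:vMerge> element without a w:val attribute means "continue"; the function names are hypothetical and this is not python-docx code. The repeats arise because row.cells returns the same cell once for each grid position it covers, whereas iterating the actual tc elements visits each one only once.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TR, TC, T = (f"{{{W}}}{name}" for name in ("tr", "tc", "t"))


def is_vmerge_continuation(tc):
    """True if this cell continues a vertical merge started in a row above."""
    v_merge = tc.find(f"{{{W}}}tcPr/{{{W}}}vMerge")
    if v_merge is None:
        return False
    # Assumption: a missing w:val attribute is treated as "continue".
    return v_merge.get(f"{{{W}}}val", "continue") == "continue"


def iter_visible_cells(tr):
    """Yield the <w:tc> elements of a row, skipping merge continuations."""
    for tc in tr.findall(TC):
        if not is_vmerge_continuation(tc):
            yield tc


def cell_text(tc):
    return "".join(t.text or "" for t in tc.iter(T))


# Hypothetical 2x2 table whose first column is merged vertically.
TABLE_XML = f"""<w:tbl xmlns:w="{W}">
  <w:tr>
    <w:tc><w:tcPr><w:vMerge w:val="restart"/></w:tcPr><w:p><w:r><w:t>A</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>B</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:tcPr><w:vMerge/></w:tcPr><w:p/></w:tc>
    <w:tc><w:p><w:r><w:t>C</w:t></w:r></w:p></w:tc>
  </w:tr>
</w:tbl>"""

table = ET.fromstring(TABLE_XML)
rows = [[cell_text(tc) for tc in iter_visible_cells(tr)] for tr in table.findall(TR)]
print(rows)
```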

@LucianoMan

@scanny would you happen to know how to append one word document to another?

@scanny
Contributor

scanny commented Jan 13, 2024

Not related so not a good use of this thread.

That topic comes up from time to time; search should be your first stop. Google knows many things about python-docx :)

@LucianoMan

LucianoMan commented Jan 13, 2024

I apologize for the off-topic question, but my friend and I looked everywhere and could not find anything useful. If you know of a link, it would be appreciated :'(.

@cyrillkuettel

If you have pandoc you can run

pandoc -s document1.docx document2.docx  -o merged.docx

This can work for simple cases.
