feature: BlockItemContainer.iter_block_items() #40

danmilon · 2014-04-25T11:49:25Z

I'd like to iterate over the elements of the document in the order they appear. For example, if there is a paragraph, a table, and then another paragraph, I want to get them in that order. AFAIK there are currently two properties on Document, paragraphs and tables, but there is no notion of ordering between them.


pmagsino · 2014-05-02T15:59:14Z

This feature would be useful for data mining. In my use case, the primary data to be extracted is within tables, and the related secondary data comes from paragraphs that either precede or straddle the table.

scanny · 2014-05-03T05:36:34Z

This workaround should work for anyone who can't wait for the Document.iter_block_items() feature to be implemented. I haven't tested it, so please provide feedback if it gives any trouble or you get it to work.

It can accept either a document or a table cell for its parent argument.

from docx.api import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text import CT_P
from docx.table import _Cell, Table
from docx.text import Paragraph

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph.
    """
    if isinstance(parent, Document):
        parent_elm = parent._document_part.body._body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child)
        elif isinstance(child, CT_Tbl):
            yield Table(child)
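
For readers who want to see the underlying idea without relying on python-docx internals, here is a minimal standard-library sketch. It is not the library's implementation. It assumes the main document part is stored at word/document.xml inside the .docx zip (conventional, but strictly the package relationships decide), and the helper name iter_block_kinds is hypothetical. It reports only the kind of each block rather than returning proxy objects.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def iter_block_kinds(docx_file):
    """Yield "paragraph" or "table" for each block child of <w:body>, in order.

    *docx_file* is a path or a binary file-like object. Assumes the main part
    lives at word/document.xml, which is conventional but not guaranteed.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find(f"{{{W}}}body")
    for child in body:
        if child.tag == f"{{{W}}}p":
            yield "paragraph"
        elif child.tag == f"{{{W}}}tbl":
            yield "table"
        # Anything else (e.g. <w:sectPr>) is silently skipped in this sketch.


# Build a tiny in-memory package so the sketch runs without a real file.
_xml = (
    f'<w:document xmlns:w="{W}"><w:body>'
    "<w:p/><w:tbl/><w:p/><w:sectPr/>"
    "</w:body></w:document>"
)
_buf = io.BytesIO()
with zipfile.ZipFile(_buf, "w") as zf:
    zf.writestr("word/document.xml", _xml)
_buf.seek(0)

kinds = list(iter_block_kinds(_buf))
print(kinds)  # ['paragraph', 'table', 'paragraph']
```

The point is only that walking the children of the body element preserves the interleaving that separate paragraphs and tables collections lose.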

pmagsino · 2014-05-06T16:04:00Z

Awesome. The workaround works great for my use case. Thanks again.

scanny · 2014-05-06T18:02:44Z

Glad to hear it Paul :)

I'll leave this issue open as the feature request.

ghost · 2014-08-26T06:09:20Z

Had to make these changes to the code to get the function to work

if isinstance(child, CT_P):
    yield Paragraph(child,parent_elm)
elif isinstance(child, CT_Tbl):
    yield Table(child,parent_elm)

scanny · 2014-08-26T06:30:35Z

I think None would probably be better than parent_elm. The parent parameter which was added to the Paragraph and Table constructors since this issue opened expects the parent proxy object like _Body or (table)_Cell, not the lxml parent element (e.g. <w:body>).

These are only used when an upward reference is required, such as when inserting a picture, so depending on the use case, using None might work well enough to get the job done. Using parent and making sure it was a reference to _Body or _Cell would be better.

In any case, this hack is due for a proper solution once I can get back to it. Been very busy on python-pptx just lately getting chart functionality going there :)

UPDATE:
On later reflection, it became clear the new parameter should simply be parent as provided as an original call argument to iter_block_items. An updated full version is a couple comments down.

ghost · 2014-09-02T09:25:49Z

Thanks for the help. I am also interested to know how you would go about this function, more specifically how you would want to handle inline images, charts, and mathematical equations when they appear in the text. I am thinking of just returning the XML in the case of charts or equations, and returning the image when there is an image in the run.

scanny · 2014-09-04T06:31:55Z

Well, a solution for the general case would yield a proxy object (e.g. Paragraph, Table) for each element encountered so the developer could operate on the object without having to go down to the XML level. This gets a little tricky because there is a surprisingly large array of types that can appear within a block context or inline context, and far from all of them have proxy objects yet. Things like the <w:del> and <w:ins> elements that have to do with revision tracking, for example.

One solution would be to return a proxy object when you could and then a generic NotImplementedObject or something when no suitable proxy class existed for the item.
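
A minimal sketch of that fallback idea, using only the standard library on a hand-written XML fragment. The class names Block and UnsupportedBlock and the function iter_blocks_with_fallback are hypothetical, chosen for illustration; none of them exist in python-docx.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


@dataclass
class Block:
    kind: str  # "paragraph" or "table"
    element: ET.Element


@dataclass
class UnsupportedBlock:
    tag: str  # qualified tag of the element that has no proxy class
    element: ET.Element


_KINDS = {f"{{{W}}}p": "paragraph", f"{{{W}}}tbl": "table"}


def iter_blocks_with_fallback(body):
    """Yield a Block for known children and an UnsupportedBlock for the rest."""
    for child in body:
        kind = _KINDS.get(child.tag)
        if kind is None:
            yield UnsupportedBlock(child.tag, child)
        else:
            yield Block(kind, child)


# <w:sdt> (a content control) stands in for "an element with no proxy yet".
body = ET.fromstring(f'<w:body xmlns:w="{W}"><w:p/><w:sdt/><w:tbl/></w:body>')
items = list(iter_blocks_with_fallback(body))
print([type(i).__name__ for i in items])  # ['Block', 'UnsupportedBlock', 'Block']
```

Yielding a placeholder instead of skipping the element keeps positions intact, so a caller can at least tell that something sits between two paragraphs.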

Note also that there are two main contexts one might want to iterate over, a block context and an inline context. An element like the <w:body> element of a document part contains block-level objects like Paragraph and Table. A Paragraph itself is an inline context and contains things like Run, and inline pictures, hyperlinks, etc.

This issue was originally about block-level items, but a corresponding method for iterating over inline objects would also be handy.
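
To make the inline context concrete, here is a hedged standard-library sketch that walks the direct children of a single <w:p> element. The function name iter_inline_items is hypothetical, and the fragment is hand-written rather than taken from a real document.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def iter_inline_items(p):
    """Yield (kind, text) for each inline child of a <w:p> element, in order.

    Paragraph properties (<w:pPr>) are skipped because they are not content.
    """
    for child in p:
        name = child.tag.split("}", 1)[-1]  # strip the "{namespace}" prefix
        if name == "pPr":
            continue
        # Collect every <w:t> below this child (runs nest inside hyperlinks).
        text = "".join(t.text or "" for t in child.iter(f"{{{W}}}t"))
        yield name, text


p = ET.fromstring(
    f'<w:p xmlns:w="{W}">'
    "<w:pPr/>"
    "<w:r><w:t>See </w:t></w:r>"
    "<w:hyperlink><w:r><w:t>the docs</w:t></w:r></w:hyperlink>"
    "<w:r><w:t>.</w:t></w:r>"
    "</w:p>"
)
inline = list(iter_inline_items(p))
print(inline)  # [('r', 'See '), ('hyperlink', 'the docs'), ('r', '.')]
```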

scanny · 2014-10-24T02:26:46Z

An updated snippet that should do the trick and is consistent with the latest internals would look like this. I haven't had time to test it, so if it gives you trouble let me know and I'll help fix :)

from docx.document import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent._document_part.body._body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

danmilon · 2014-11-03T15:36:19Z

@scanny, yes, this works perfectly. Do you want me to wrap this up, and send in a PR?

scanny · 2014-11-05T11:29:20Z

The tests will be the key outstanding components for this one. If you want to take a crack at it, by all means :)

cez81 · 2015-03-09T11:10:39Z

Is there a way of doing this after the recent changes to docx.Document?

scanny · 2015-03-10T16:06:04Z

Not yet; this feature is still in the backlog. The last release focused on styles support.

scanny · 2015-03-27T22:12:40Z

Oh, I think I misinterpreted your question. Some of the imports have to change due to recent refactoring:

from docx.document import Document
from docx.oxml.text.paragraph import CT_P
from docx.text.paragraph import Paragraph

Is that what you were asking @cez81 ?

I've updated the example.

cez81 · 2015-04-03T09:39:55Z

Yes, I think it was. Unfortunately I can't get it to work though. I get an error when creating the Document instance:
"TypeError: __init__() missing 1 required positional argument: 'part'". I'm guessing it has to do with the first line importing the wrong Document class?

from docx.document import Document
from docx.oxml.text.paragraph import CT_P
from docx.text.paragraph import Paragraph

def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent._document_part.body._body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


doc = Document('test.docx')
for block in iter_block_items(doc):
    print(block.text)

scanny · 2015-04-07T07:57:00Z

Ah, right. That happens when both are in the same module and you need both the docx.document.Document class and the docx.Document factory function. Calling the factory through the package should fix your case:

import docx

doc = docx.Document('test.docx')
for block in iter_block_items(doc):
    print(block.text)

pdelsante · 2015-04-07T19:45:24Z

Hi, I think @cez81 is right: there seems to be something more that changed in your code lately. To make your example work again with 0.8.5 I had to change it like this:

from docx.document import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent.element
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

In particular, I had to change the following line:

        parent_elm = parent._document_part.body._body

to this:

        parent_elm = parent.element

scanny · 2015-04-08T04:40:25Z

Ah, yes, I see what you're saying. The way to get a reference to the <w:body> element changed as well. I think what you want is this though:

if isinstance(parent, Document):
    parent_elm = parent.element._body

... because parent.element is the <w:document> element if I'm reading the code correctly.

Apologies I don't have time to test this right now, but hope that helps. Such are the wages of workaround functions because they rely on internals that aren't guaranteed to be stable between releases.
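
The difference between the <w:document> and <w:body> elements can be seen in a small standard-library sketch on a hand-written fragment (real documents may carry more children than shown here). The helper local_names is hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

root = ET.fromstring(
    f'<w:document xmlns:w="{W}"><w:body><w:p/><w:tbl/><w:p/></w:body></w:document>'
)


def local_names(element):
    """Return the namespace-stripped tag names of an element's direct children."""
    return [child.tag.split("}", 1)[-1] for child in element]


# Children of <w:document>: only the body, so a paragraph/table check matches nothing.
document_children = local_names(root)
# Children of <w:body>: the block-level items the workaround is after.
body_children = local_names(root.find(f"{{{W}}}body"))
print(document_children, body_children)  # ['body'] ['p', 'tbl', 'p']
```

Iterating one level too high fails silently: the generator simply yields nothing, which is easy to mistake for an empty document.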

cez81 · 2015-04-08T07:29:29Z

Ok got it working now! Changed it to:

if isinstance(parent, Document):
    parent_elm = parent.element.body

Thanks for the help both of you!

igorsavinkin · 2018-11-07T07:22:16Z

This works for me:

def iter_block_items(parent):
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            # Yield paragraphs from table cells.
            # Note: this works for single-level tables (not nested tables).
            table = Table(child, parent)
            for row in table.rows:
                for cell in row.cells:
                    for paragraph in cell.paragraphs:
                        yield paragraph

On Wed, Nov 7, 2018 at 6:29 AM alfiyafaisy wrote:

"How to get text then from yielded Table object?"

Hey @igorsavinkin, this worked for me.

for block in iter_block_items(doc):
    if isinstance(block, Table):
        for row in block.rows:
            row_data = []
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    row_data.append(paragraph.text)
            print("\t".join(row_data))


aistellar · 2018-11-20T17:06:31Z

nested tables should be easy to handle with recursion

def iter_block_items(parent):
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            table = Table(child, parent)
            for row in table.rows:
                for cell in row.cells:
                    yield from iter_block_items(cell)
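
The same recursion can be illustrated on raw XML with only the standard library. This is a sketch on a hand-built fragment, not a replacement for the function above; the names iter_paragraph_text and _p are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
P = f"{{{W}}}p"
TBL = f"{{{W}}}tbl"
TR = f"{{{W}}}tr"
TC = f"{{{W}}}tc"
T = f"{{{W}}}t"


def iter_paragraph_text(container):
    """Yield the text of each paragraph under *container*, descending into tables.

    *container* is a <w:body> or <w:tc> element; both hold block-level children.
    """
    for child in container:
        if child.tag == P:
            yield "".join(t.text or "" for t in child.iter(T))
        elif child.tag == TBL:
            for row in child.findall(TR):
                for cell in row.findall(TC):
                    # A cell is itself a block container, so recurse into it.
                    yield from iter_paragraph_text(cell)


def _p(text):
    """Build the XML for a one-run paragraph (test helper only)."""
    return f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p>"


inner = f"<w:tbl><w:tr><w:tc>{_p('nested')}</w:tc></w:tr></w:tbl>"
outer = (
    f"<w:tbl><w:tr><w:tc>{_p('cell')}{inner}{_p('after nested')}</w:tc></w:tr></w:tbl>"
)
body = ET.fromstring(
    f'<w:body xmlns:w="{W}">{_p("before")}{outer}{_p("after")}</w:body>'
)
texts = list(iter_paragraph_text(body))
print(texts)  # ['before', 'cell', 'nested', 'after nested', 'after']
```

Because this walks the <w:tc> elements directly, each cell is visited once, even where a proxy-level row.cells would report a merged cell more than once.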

romran · 2018-12-10T14:33:01Z

Hello everyone,
It helped me a lot when parsing docx files.
Could you tell me whether it is possible to refactor this function so that it also finds InlineShapes?

Slowhalfframe · 2019-04-01T03:29:38Z

Hello everyone.
I have a question: how can I use Python to read the pictures in a Word document in order?

lxj0276 · 2019-08-15T02:44:46Z

I want to know how to read pictures or charts like function "iter_block_items"

Slowhalfframe · 2019-08-15T02:49:46Z

I want to know how to read pictures or charts like function "iter_block_items"

from xml.dom.minidom import parseString

# The original imports were not shown; these are assumed from the names used.
from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import Table, _Cell, _Row
from docx.text.paragraph import Paragraph


def read_item_block(parent):
    '''
    Read the content of a Word document in order.
    :param parent: the document
    :return: p/t
    '''
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    elif isinstance(parent, _Row):
        parent_elm = parent._tr
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            count = 1        # flag: paragraph contains a picture
            count_flase = 0  # flag: paragraph contains no picture
            res = Paragraph(child, parent)
            if res.text != '':
                yield (res, count_flase)
            else:
                try:
                    # Try to get the inline elements
                    DOMTree = parseString(child.xml)
                    data = DOMTree.documentElement
                    nodelist = data.getElementsByTagName('pic:blipFill')
                    print('*' * 9, nodelist)
                    if len(nodelist) < 1:
                        yield (res, count_flase)
                    else:
                        yield (res, count)
                except Exception as e:
                    print('*' * 9, e)
                    yield (res, count_flase)
        elif isinstance(child, CT_Tbl):
            yield (Table(child, parent),)

This is how I read pictures.
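
One caveat with searching for the literal tag name 'pic:blipFill' is that it depends on the namespace prefix the file happens to use. A namespace-aware variant might look like the sketch below, which uses only the standard library. The function name picture_rel_ids is hypothetical, and the sample XML is simplified (a real drawing has more wrapper elements around <a:blip>).

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
NS = {"w": W, "a": A, "r": R}


def picture_rel_ids(p):
    """Return the r:embed relationship ids of every <a:blip> inside paragraph *p*."""
    embed = f"{{{R}}}embed"
    return [blip.get(embed) for blip in p.iterfind(".//a:blip", NS) if blip.get(embed)]


# Simplified paragraph holding one inline picture.
with_picture = ET.fromstring(
    f'<w:p xmlns:w="{W}" xmlns:a="{A}" xmlns:r="{R}">'
    "<w:r><w:drawing><a:graphic><a:graphicData>"
    '<a:blip r:embed="rId5"/>'
    "</a:graphicData></a:graphic></w:drawing></w:r>"
    "</w:p>"
)
text_only = ET.fromstring(f'<w:p xmlns:w="{W}"><w:r><w:t>text only</w:t></w:r></w:p>')

picture_ids = picture_rel_ids(with_picture)
no_ids = picture_rel_ids(text_only)
print(picture_ids, no_ids)  # ['rId5'] []
```

The returned id identifies a relationship of the document part, which is where the actual image file can then be looked up.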

devanshugupta · 2020-02-27T09:04:00Z

Having this error in your code:

Traceback (most recent call last):
  File "C:/Users/home/PycharmProjects/Sentiment_analysis/yup.py", line 46, in
    for cell in row.cells:
  File "C:\Users\home\Desktop\devu\venv\lib\site-packages\docx\table.py", line 401, in cells
    return tuple(self.table.row_cells(self._index))
  File "C:\Users\home\Desktop\devu\venv\lib\site-packages\docx\table.py", line 106, in row_cells
    return self._cells[start:end]
  File "C:\Users\home\Desktop\devu\venv\lib\site-packages\docx\table.py", line 173, in _cells
    cells.append(cells[-col_count])
IndexError: list index out of range

Labyrins · 2020-09-06T00:31:23Z

Ok got it working now! Changed it to:
if isinstance(parent, Document):
    parent_elm = parent.element.body
Thanks for the help both of you!

Thank you, this works for me!

div1996 · 2020-09-07T05:58:58Z

How can I read paragraphs, tables, and shapes all in one place? Kindly help ASAP.

ejaca · 2023-03-17T05:15:58Z

Hi admin,

Do you have any idea what I can change in the code?

This is the current code I have in iterating tables and paragraphs:

def iterate_tables_and_paragraphs(
    parent: Union[DocxDocument, _Cell]
) -> Iterator[Union[DocxParagraph, DocxTable]]:  # a generator, so annotate with typing.Iterator
    if isinstance(parent, DocxDocument):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("Invalid type parameter, expected DocxDocument or _Cell")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield DocxParagraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield DocxTable(child, parent)

There were no problems in running the code but the output was incorrect.

The problem is that my document has 2 tables, one directly above the other, with different numbers of columns: table 1 has 18 columns and table 2 has 20. This code sets the number of columns to the maximum, 20, so when I read the data, table 1 produced incorrect results: it looped 20 times, and some data from the next row was included as table headers.

Please help. Thanks.

cyrillkuettel · 2023-04-11T13:18:28Z

The imports are tricky to get right, so here you go. This should work for the latest version.

from docx.text.paragraph import Paragraph
from docx.document import Document
from docx.table import _Cell, Table
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
import docx

def iter_block_items(parent):
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            table = Table(child, parent)
            for row in table.rows:
                for cell in row.cells:
                    yield from iter_block_items(cell)
                    
doc = docx.Document('word.docx')
for block in iter_block_items(doc):
    print(block.text)

abubelinha · 2023-04-12T14:48:08Z

@cyrillkuettel Thanks a lot for sharing!


scanny · 2023-11-03T23:10:14Z

Added BlockItemContainer.iter_inner_content() in v1.0.2. Document, Header, Footer, and (table) _Cell are all block-item containers. The behavior is to generate Paragraph | Table objects in document order from within that container. Contrast with Section.iter_inner_content(), which does the same but only within a single section.

It is not recursive, so you'll need to take care of that aspect yourself if you want it (not everyone will, which is why it's not implemented here).

Maybe something like:

def recursively_iter_block_items(blkcntnr: BlockItemContainer) -> Iterator[Paragraph | Table]:
    for item in blkcntnr.iter_inner_content():
        if isinstance(item, Paragraph):
            yield item
        elif isinstance(item, Table):
            for row in item.rows:
                for cell in row.cells:
                    yield from recursively_iter_block_items(cell)

LucianoMan · 2024-01-11T15:24:58Z

The imports are tricky to get right, so here you go. This should work for the latest version.

from docx.text.paragraph import Paragraph
from docx.document import Document
from docx.table import _Cell, Table
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
import docx

def iter_block_items(parent):
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            table = Table(child, parent)
            for row in table.rows:
                for cell in row.cells:
                    yield from iter_block_items(cell)
                    
doc = docx.Document('word.docx')
for block in iter_block_items(doc):
    print(block.text)

This works wonderfully; however, it seems to repeat text that is in tables. Does anyone know how to stop the code from doing this? I attempted to use sets, but that also got rid of repeated text that I need.

scanny · 2024-01-11T22:17:47Z

@LucianoMan Something like this should do the trick:

def iter_visible_row_cells(row: Row) -> Iterator[_Cell]:
    """Generate only "concrete" cells, those with a `tc` element.

    Vertically spanned cells have a `tc` element but are skipped.
    """
    yield from (_Cell(tc, row) for tc in row._tr.tc_lst if tc.vMerge != "continue")
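
For background on why the repetition happens: the traceback earlier in this thread shows python-docx building a row's cell list with cells.append(cells[-col_count]), i.e. reusing the cell from the row above for a vertical-merge continuation, so the same text is reported again. The XML-level test can be sketched with only the standard library as below. The helper name is_vmerge_continuation is hypothetical and the table is hand-written.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def is_vmerge_continuation(tc):
    """Return True if <w:tc> continues a vertical merge started in a row above.

    A <w:vMerge> with no w:val attribute defaults to "continue"; only
    w:val="restart" marks the top cell of a merged range.
    """
    vmerge = tc.find(f"{{{W}}}tcPr/{{{W}}}vMerge")
    if vmerge is None:
        return False
    return vmerge.get(f"{{{W}}}val", "continue") == "continue"


# Two rows, two columns; the first column is merged vertically.
tbl = ET.fromstring(
    f'<w:tbl xmlns:w="{W}">'
    "<w:tr>"
    '<w:tc><w:tcPr><w:vMerge w:val="restart"/></w:tcPr>'
    "<w:p><w:r><w:t>A</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>B</w:t></w:r></w:p></w:tc>"
    "</w:tr>"
    "<w:tr>"
    "<w:tc><w:tcPr><w:vMerge/></w:tcPr><w:p/></w:tc>"
    "<w:tc><w:p><w:r><w:t>C</w:t></w:r></w:p></w:tc>"
    "</w:tr>"
    "</w:tbl>"
)

visible = []
for tr in tbl.findall(f"{{{W}}}tr"):
    for tc in tr.findall(f"{{{W}}}tc"):
        if is_vmerge_continuation(tc):
            continue  # skip the lower part of a vertically merged cell
        visible.append("".join(t.text or "" for t in tc.iter(f"{{{W}}}t")))
print(visible)  # ['A', 'B', 'C']
```

Horizontally merged cells are a separate matter: they are a single <w:tc> with a grid span, so iterating the <w:tc> elements directly already visits them once.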

LucianoMan · 2024-01-12T22:48:53Z

@scanny would you happen to know how to append one word document to another?

scanny · 2024-01-13T00:35:54Z

Not related so not a good use of this thread.

That topic comes up from time to time, search should be your first stop. Google knows many things about python-docx :)

LucianoMan · 2024-01-13T17:51:47Z

I apologize for the off-topic question, but my friend and I looked everywhere and could not find anything useful. If you know of a link, it would be appreciated :'(.

cyrillkuettel · 2024-01-13T19:13:00Z

If you have pandoc you can run

pandoc -s document1.docx document2.docx  -o merged.docx

This can work for simple cases.

scanny modified the milestones: v0.6.0, 0.6.1 May 1, 2014

scanny mentioned this issue May 3, 2014

feature: _Cell.add_table() #45

Closed

scanny changed the title ~~iterate paragraphs/tables as they appear in the document~~ feature: BlockItemContainer.iter_block_items() May 6, 2014

scanny modified the milestones: v0.6.0 Cursors, 0.6.1 May 13, 2014

scanny added the navigation label Jun 17, 2014

scanny added table labels Jul 18, 2014

scanny mentioned this issue Aug 31, 2015

Extract table contents from docx deanmalmgren/textract#92

Merged

This was referenced Apr 5, 2020

[help] get a table after a paragraph #802

Closed

How to read docx in turn ( eg: paragraph, table, table; paragraph, table ) #779

Closed

alexmosc mentioned this issue May 27, 2022

Finding table(s) in between two paragraphs which contain key words in docx #1085

Closed

This was referenced Mar 26, 2023

How to delete table #663

Open

stile from other document import doesnt' work #88

Closed

scanny mentioned this issue Sep 6, 2023

Loop through all elements of a doc and determine their type? #1239

Closed

scanny added inner-content and removed text table navigation labels Sep 24, 2023

scanny closed this as completed Nov 3, 2023
