how to read paragraphs AND tables? #276

Closed
bit111 opened this issue Mar 14, 2016 · 23 comments

@bit111

bit111 commented Mar 14, 2016

Hi,
I have read the entire discussion on issue #40, but the solutions there do not work with the 0.8.5 release (I think because they are outdated).
This is my problem: I have a large docx to read, with more than 400 pages. In this document I have some data in rows of text and some data in tables.
How can I read paragraphs and tables in the order they appear in the doc?
Thanks

@scanny
Contributor

scanny commented Mar 14, 2016

What code are you using for iter_block_items() and what error are you getting?

@bit111
Author

bit111 commented Mar 14, 2016

I am using this code

from docx import *
#from docx.document import Document
from docx.oxml.text.paragraph import CT_P
from docx.text.paragraph import Paragraph

def main():
    doc = Document('file1.docx')
    #output = open("output_"+cur_date+"_.txt", "w")
    parent_elem = doc.element
    for block in iter_block_items(doc) :
        print(block.text)


def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent.element
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    parent_elm = parent.element

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

And the error is:

Traceback (most recent call last):
  File "beta-iterblock.py", line 47, in <module>
    main()
  File "beta-iterblock.py", line 21, in main
    for block in iter_block_items(doc) :
  File "beta-iterblock.py", line 32, in iter_block_items
    if isinstance(parent, Document):
TypeError: isinstance() arg 2 must be a class, type, or tuple of classes and types

@scanny
Contributor

scanny commented Mar 14, 2016

Why do you have the Document import commented out? What does the 'Document' name resolve to?

@bit111
Author

bit111 commented Mar 15, 2016

Because if I use these import lines:

from docx.document import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph

I have this error:

Traceback (most recent call last):
  File "beta-iterblock.py", line 43, in <module>
    main()
  File "beta-iterblock.py", line 15, in main
    doc = Document('file1.docx')
TypeError: __init__() takes exactly 3 arguments (2 given)

For this reason I wrote from docx import * instead of from docx.document import Document, although that is probably not correct.

@scanny
Contributor

scanny commented Mar 15, 2016

Ah, got it. If you change that line to:

from docx.document import Document as _Document

and the lower one in iter_block_items() to:

if isinstance(parent, _Document):

That should do the trick. The problem in this case is a namespace collision between two items named 'Document'.

@bit111
Author

bit111 commented Mar 15, 2016

If I follow your suggestion, I have another error:

Traceback (most recent call last):
  File "beta-iterblock.py", line 43, in <module>
    main()
  File "beta-iterblock.py", line 15, in main
    doc = Document('file1.docx')
NameError: global name 'Document' is not defined

I also tried to use
doc = _Document('file1.docx')
but I get the same error as in my previous post.

@scanny
Contributor

scanny commented Mar 15, 2016

In this case, you need to add this to the top of the file:

from docx import Document

There are two distinct Document classes in python-docx. Generally the only one used by an end-user is docx.Document. The docx.document.Document class is used internally and is required by the workaround function.
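
Both tracebacks earlier in the thread are consistent with this. The following is a minimal sketch using stand-in names only: the `_Document` class and the `Document()` function below are hypothetical and defined in the snippet itself, not imported from python-docx. It assumes that the user-facing `docx.Document` behaves like a factory function that returns an instance of the internal class, which would explain why `isinstance()` rejected it and why calling the internal class with just a path failed.

```python
# Hypothetical stand-ins; nothing here is imported from python-docx.
class _Document:
    """Plays the role of the internal docx.document.Document class."""

    def __init__(self, element, part):
        self.element = element
        self.part = part


def Document(path):
    """Plays the role of the user-facing docx.Document (assumed to be a factory)."""
    return _Document(element="<w:document>", part=path)


doc = Document("file1.docx")

# Passing a function as the second argument of isinstance() fails,
# which matches the first traceback in this thread.
try:
    isinstance(doc, Document)
    first_error = None
except TypeError as exc:
    first_error = str(exc)

# Calling the internal class with only a path fails too, because its
# constructor expects two arguments; this matches the second traceback.
try:
    _Document("file1.docx")
    second_error = None
except TypeError as exc:
    second_error = str(exc)

# The aliased class is the right thing to test against.
is_document = isinstance(doc, _Document)
print(first_error)
print(second_error)
print(is_document)
```

Importing the internal class under an alias, as suggested above, keeps both names usable in the same module: one to open the file and one for the isinstance() check.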

@bit111
Author

bit111 commented Mar 16, 2016

I followed your suggestion but if I use:

    doc = Document('file1.docx')
    for block in iter_block_items(doc) :
        print(block)

It prints nothing.
If I print only iter_block_items(doc), it prints
<generator object iter_block_items at 0x02909080>
How can I iterate over this object?

@scanny
Contributor

scanny commented Mar 16, 2016

@bit111 - Sounds like you need to brush up on your Python basics.

I recommend you study the code in the iter_block_items() function and perhaps review iteration and yield in Python, as well as constructors like Paragraph() and Table() until you can explain what that function is doing. I expect once you've accomplished that you'll be able to figure this out for yourself.

I can't write your code for you, and even if I did it would leave you still unable to do so for yourself.

You can also post questions like this on Stack Overflow http://stackoverflow.com/ where someone might be willing to help you with the basics. Make sure you vote up their answer and accept the answer that works. Doing so is your "payment" for the help you've received. You should use the "python-docx" tag where appropriate, along with any other tags that suit the question.

@scanny scanny closed this as completed Mar 16, 2016
@bit111
Author

bit111 commented Mar 18, 2016

Why did you delete my posts? I find it very unfair!

@scanny
Contributor

scanny commented Mar 18, 2016

@bit111 - Posts that are not constructive are deleted.

Here's the thing. This project, like many open source projects, is all-volunteer. It's important to maintain a high level of courtesy for things to operate smoothly. In large part this is because the emotional bandwidth of short text communications like this is very narrow. Add this to the fact that most of us don't know each other and have precious little way of getting to know each other, and it makes for a bit of a delicate situation. We deal with that by keeping the courtesy level high because the risk of offense is so much higher.

I was interpreting your sequence of questions as you being a beginner, which is not a problem, but also not holding up your end as regards your responsibility to find answers for yourself. It seemed to me that rather than puzzling over each error you received and doing what you could to learn how to interpret and resolve it, you were simply asking for me or someone else to solve it for you. This sort of thing is not uncommon, but also not welcome. We welcome learners, but expect them to be active in the learning process. Otherwise it reduces to asking someone else to do your work for you, and that someone doesn't know you and has no good reason to do you a favor and has a lot of other work they're doing with their day job and the volunteer work to make a package like this available.

So I hope that gives you an idea about the reaction you produced. I apologize for any overstatement I might have made regarding writing your code for you. But I stand by the structure of the sentence, which is "I can't ___ your code for you", where the blank might be any one of (debug, design, write, ...). You need to learn Python basics somewhere else and be mindful of the respect due a community that has worked hard to produce something you are deriving benefit from at no cost.

Now, just to show there are no hard feelings here, I think this is your situation:

I followed your suggestion but if I use:

doc = Document('file1.docx')
for block in iter_block_items(doc) :
    print(block)

It prints nothing.
If I print only iter_block_items(doc), it prints
<generator object iter_block_items at 0x02909080>
How can I iterate over this object?

iter_block_items() is what is commonly referred to as an iterator function, but more precisely known as a generator.
http://stackoverflow.com/questions/2776829/difference-between-pythons-generators-and-iterators

This explains why you get this when you inspect the return value:
<generator object iter_block_items at 0x02909080>

A generator, or any iterator, can be used in a for block, as you did:

doc = Document('file1.docx')
for block in iter_block_items(doc) :
    print(block)
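
To make that concrete without python-docx, here is a small sketch in which a hypothetical generator, `iter_items()`, stands in for iter_block_items(). It is only an illustration under that assumption. It shows that calling a generator function merely creates a generator object, that iterating is what runs the function body, and that a generator which yields nothing makes the loop body never execute, so nothing is printed and no error is raised.

```python
def iter_items(children):
    """Hypothetical stand-in for iter_block_items(): yield only known kinds."""
    for child in children:
        if child in ("p", "tbl"):
            yield child.upper()


gen = iter_items(["p", "tbl", "sectPr"])
print(type(gen).__name__)  # the call itself runs none of the function body

found = [item for item in gen]  # iterating is what drives the generator
print(found)

# If no child matches, the loop body simply never executes.
silent = []
for item in iter_items(["body"]):
    silent.append(item)
print(silent)
```

A loop that prints nothing therefore does not by itself mean the loop is broken; it can also mean the generator found nothing to yield.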

Now it's possible you are actually getting a block item back from each iteration, but print() is not displaying anything useful. So the first thing would be to reliably inspect the return value somehow:

for block in iter_block_items(doc) :
    print("found a block")
    print(block.__class__.__name__)

Which might lead to something like this:

found a block
Paragraph
found a block
Paragraph
found a block
Table
found a block
Paragraph

However, if you are actually not receiving any items back from iter_block_items(), it could be because your document is empty. Otherwise something is not working right in the iter_block_items() function you used and you should start debugging up there to make sure you're getting a sequence of w:p elements and/or w:tbl elements as it traverses the XML.

Note that all of this is digging into python-docx internals. None of this is supported at the API level and none of it should be expected to be documented other than in the docstrings of the code. So you're inherently taking on a somewhat advanced job here that requires decoding how python-docx works. Most folks don't want to do that of course :)

I hope that gives you a place to start :)

@bit111
Author

bit111 commented Mar 21, 2016

Hi,
I thank you for your support and for your long explanation.
However, I tried to use
for block in iter_block_items(doc): print("find a block")
but it prints nothing.
Then I tried a simple docx, which you can see in this post's attachment, but in this case too it prints nothing.
Maybe I'm doing something wrong. I will do other tests.

Let me make just one clarification: I do not usually ask others to do my job.
Normally, before asking for help, I spend a lot of time reading documentation and other similar issues, and only when I cannot find anything do I ask for support.

[test_par.docx](url)

@scanny
Contributor

scanny commented Mar 21, 2016

Can you send me the latest code you're using for iter_block_items() that doesn't work? I'll have a look. Include the imports etc. like above so I can recreate.

@scanny
Contributor

scanny commented Mar 21, 2016

Okay, the code below works for me.

Note the changed line parent_elm = parent.element.body. The paragraph elements are not direct children of the <w:document> element; there is an intervening <w:body> element.

I encourage you to uncomment the print(parent_elm.xml) line and inspect the XML it produces. That may shed light on just what's happening underneath the covers here :)
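
For readers who want to see that nesting without opening a real file, here is a rough sketch using only the standard library. The XML string is a hand-written, heavily simplified stand-in for word/document.xml, and `local_name()` is a helper made up for this example; it is not what python-docx does internally. It shows that the root element has a single child, the body, and that paragraphs and tables sit inside the body in document order.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# A hand-written, heavily simplified stand-in for word/document.xml.
DOCUMENT_XML = f"""
<w:document xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>First paragraph</w:t></w:r></w:p>
    <w:tbl><w:tr><w:tc><w:p><w:r><w:t>Cell</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
    <w:p><w:r><w:t>Last paragraph</w:t></w:r></w:p>
    <w:sectPr/>
  </w:body>
</w:document>
""".strip()


def local_name(element):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return element.tag.split("}", 1)[1]


root = ET.fromstring(DOCUMENT_XML)
root_children = [local_name(child) for child in root]

body = root.find(f"{{{W}}}body")
body_children = [local_name(child) for child in body]

print(root_children)
print(body_children)
```

Iterating the children of the root finds no paragraph or table at all, which is consistent with the earlier version of the function yielding nothing.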


#!/usr/bin/env python
# encoding: utf-8

"""
Testing iter_block_items()
"""

from __future__ import (
    absolute_import, division, print_function, unicode_literals
)

from docx import Document
from docx.document import Document as _Document
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Generate a reference to each paragraph and table child within *parent*,
    in document order. Each returned value is an instance of either Table or
    Paragraph. *parent* would most commonly be a reference to a main
    Document object, but also works for a _Cell object, which itself can
    contain paragraphs and tables.
    """
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
        # print(parent_elm.xml)
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


document = Document('test.docx')
for block in iter_block_items(document):
    print('found one')
    print(block.text if isinstance(block, Paragraph) else '<table>')

@bit111
Author

bit111 commented Mar 22, 2016

Ok, now it seems to work.
Thank you very very much for your patience and your time.

@kaushikkumar356

Is it possible to print the content of the tables when we come across them?

@scanny
Contributor

scanny commented Sep 11, 2017

Yes.

@kaushikkumar356

Hi scanny, can you help me by letting me know how to print the table content when a document contains both text and tables?

@kaushikkumar356

kaushikkumar356 commented Sep 12, 2017

Hi, thank you all for your suggestions, which helped me a lot. As of now I am done with the requirement I had.

def table_print(table):
    # Print each row on one line, with the cell paragraphs separated by spaces.
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                print(paragraph.text, '  ', end='')
        print("\n")  # end the row and leave a blank line after it


# table_print() has to be defined before this loop runs.
for block in iter_block_items(document):
    if isinstance(block, Paragraph):
        print(block.text)
    elif isinstance(block, Table):
        table_print(block)

Above I have shared the modified code I built.

@SyntaxShade

Hello, any ideas about nested tables? Meaning table B is inside a cell of table A.
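
Nobody in the thread answers this, so the following is only one possible approach, sketched on raw XML with the standard library rather than with python-docx. The XML is a simplified hand-written stand-in and `walk_blocks()` is a hypothetical helper. The idea is to recurse into each table cell, so that a table found inside a cell is visited one level deeper. In python-docx terms the analogous move would presumably be calling iter_block_items() on each _Cell, which its docstring says is supported, but that is untested here.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Simplified stand-in XML: table A has one cell that contains table B.
BODY_XML = f"""
<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>Intro</w:t></w:r></w:p>
  <w:tbl>
    <w:tr><w:tc>
      <w:p><w:r><w:t>A1</w:t></w:r></w:p>
      <w:tbl>
        <w:tr><w:tc><w:p><w:r><w:t>B1</w:t></w:r></w:p></w:tc></w:tr>
      </w:tbl>
    </w:tc></w:tr>
  </w:tbl>
</w:body>
""".strip()


def walk_blocks(container, depth=0):
    """Yield (depth, kind, text) for paragraphs and tables, in document order.

    *container* is a w:body or w:tc element; tables are descended into
    cell by cell, so a table inside a cell is reported at depth + 1.
    """
    for child in container:
        if child.tag == f"{{{W}}}p":
            text = "".join(t.text or "" for t in child.iterfind(".//w:t", NS))
            yield depth, "paragraph", text
        elif child.tag == f"{{{W}}}tbl":
            yield depth, "table", ""
            for cell in child.iterfind("w:tr/w:tc", NS):
                yield from walk_blocks(cell, depth + 1)


blocks = list(walk_blocks(ET.fromstring(BODY_XML)))
for depth, kind, text in blocks:
    print("  " * depth + kind, text)
```

The depth value is just a convenience for indentation; the important part is that each cell is treated as a container in its own right.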

@alex123321123321

Worked like a charm together with the code of scanny.

Big Big Thank you for posting it

@kmrambo

kmrambo commented Oct 22, 2019

You can read paragraphs, tables and images in document order in the following github repo:

https://github.com/kmrambo/Python-docx-Reading-paragraphs-tables-and-images-in-document-order-

@jfthuong

jfthuong commented Sep 5, 2022

You can read paragraphs, tables and images in document order in the following github repo:

https://github.com/kmrambo/Python-docx-Reading-paragraphs-tables-and-images-in-document-order-

The hyperlink is not working but the URL is correct. Maybe you meant:
https://github.com/kmrambo/Python-docx-Reading-paragraphs-tables-and-images-in-document-order-
