how to read paragraphs AND tables? #276
Comments
What code are you using for iter_block_items() and what error are you getting? |
I am using this code
And the error is:
|
Why do you have the Document import commented out? What does the 'Document' name resolve to? |
Because if I use these lines of import
I have this error:
For this reason I wrote from doc import * instead of from docx.document import Document although it probably is not correct |
Ah, got it. If you change that line to: from docx.document import Document as _Document and the lower one in iter_block_items() to: if isinstance(parent, _Document): That should do the trick. The problem in this case is a namespace collision between two items named 'Document'. |
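A minimal sketch of why the alias helps, using stand-in names rather than python-docx itself. The class and `open_document()` below are hypothetical; the assumption is that one `Document` is a class and the other is a factory function, so `isinstance()` fails once the function shadows the class.

```python
def check(parent, candidate):
    """Return the isinstance() result, or "TypeError" if candidate is not a type."""
    try:
        return isinstance(parent, candidate)
    except TypeError:
        return "TypeError"


class Document:
    """Stand-in for a document class (hypothetical, not the python-docx class)."""


def open_document():
    """Stand-in for a factory function that returns a Document instance."""
    return Document()


# Keep a second name for the class, the way `import ... as _Document` would.
_Document = Document

# Rebinding `Document` simulates a second import with the same name
# replacing the first one in the module namespace.
Document = open_document

doc = Document()

collision_result = check(doc, Document)   # Document is now a function
alias_result = check(doc, _Document)      # _Document is still the class

print(collision_result, alias_result)
```

With the alias in place, the factory can keep the plain `Document` name for opening files while the type check uses `_Document`.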
If I follow your suggestion, I have another error:
I also tried to use |
In this case, you need to add this to the top of the file: from docx import Document There are two distinct |
I followed your suggestion but if I use:
It prints nothing. |
@bit111 - Sounds like you need to brush up on your Python basics. I recommend you study the code in the iter_block_items() function and perhaps review iteration and yield in Python, as well as constructors like Paragraph() and Table() until you can explain what that function is doing. I expect once you've accomplished that you'll be able to figure this out for yourself. I can't write your code for you, and even if I did it would leave you still unable to do so for yourself. You can also post questions like this on Stack Overflow http://stackoverflow.com/ where someone might be willing to help you with the basics. Make sure you vote up their answer and accept the answer that works. Doing so is your "payment" for the help you've received. You should use the "python-docx" tag where appropriate, along with any other tags that suit the question. |
Why did you delete my posts? I find it very fair! |
@bit111 - Posts that are not constructive are deleted.

Here's the thing. This project, like many open source projects, is all volunteer. It's important to maintain a high level of courtesy for things to operate smoothly. In large part this is because the emotional bandwidth of short text communications like this is very narrow. Add this to the fact that most of us don't know each other and have precious little way of getting to know each other, and it makes for a bit of a delicate situation. We deal with that by keeping the courtesy level high because the risk of offense is so much higher.

I was interpreting your sequence of questions as you being a beginner, which is not a problem, but also as not holding up your end as regards your responsibility to find answers for yourself. It seemed to me that rather than puzzling over each error you received and doing what you could to learn how to interpret and resolve it, you were simply asking for me or someone else to solve it for you. This sort of thing is not uncommon, but also not welcome. We welcome learners, but expect them to be active in the learning process. Otherwise it reduces to asking someone else to do your work for you, and that someone doesn't know you, has no good reason to do you a favor, and has a lot of other work to do between their day job and the volunteer work that makes a package like this available.

So I hope that gives you an idea about the reaction you produced. I apologize for any overstatement I might have made regarding writing your code for you. But I stand by the structure of the sentence, which is "I can't ___ your code for you", where the blank might be one of (debug, design, write, ...) or whatever. You need to learn Python basics somewhere else and be mindful of the respect due a community that has worked hard to produce something you are deriving benefit from at no cost.

Now, just to show there are no hard feelings here, I think this is your situation:
This explains why you get this when you inspect the return value:

A generator, or any iterator, can be used in a for loop:

doc = Document('file1.docx')
for block in iter_block_items(doc):
    print(block)

Now it's possible you are actually getting a block item back from each iteration, but print() is not displaying anything useful. So the first thing would be to reliably inspect the return value somehow:

for block in iter_block_items(doc):
    print("found a block")
    print(block.__class__.__name__)

Which might lead to something like this:
However, if you are actually not receiving any items back from iter_block_items(), it could be because your document is empty. Otherwise something is not working right in the iter_block_items() function you used and you should start debugging up there to make sure you're getting a sequence of w:p elements and/or w:tbl elements as it traverses the XML. Note that all of this is digging into python-docx internals. None of this is supported at the API level and none of it should be expected to be documented other than in the docstrings of the code. So you're inherently taking on a somewhat advanced job here that requires decoding how python-docx works. Most folks don't want to do that of course :) I hope that gives you a place to start :) |
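As a small illustration of the generator behavior described above, here is a sketch with a hypothetical `iter_items()` standing in for `iter_block_items()`. It assumes nothing about python-docx; it only shows that calling a generator function returns a generator object and that items appear only on iteration.

```python
def iter_items(values):
    """Yield each value in turn; the body does not run until iteration starts."""
    for value in values:
        yield value


gen = iter_items(["paragraph", "table"])

# Calling the function only builds a generator object, so inspecting the
# return value shows the generator, not the items.
kind = type(gen).__name__

# Iterating (here through list()) is what actually produces the items.
items = list(gen)

# A generator is exhausted after one pass, so a second pass yields nothing.
second_pass = list(gen)

print(kind, items, second_pass)
```

The last line is worth noting when debugging: looping over the same generator object twice gives an empty second loop, so call the function again for each pass.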
Hi, let me make just one clarification: I do not usually ask others to do my job. [ |
Can you send me the latest code you're using for iter_block_items() that doesn't work? I'll have a look. Include the imports etc. like above so I can recreate. |
Okay, the code below works for me. Note the changed line I encourage you to uncomment the

#!/usr/bin/env python
# encoding: utf-8

"""
Testing iter_block_items()
"""

from __future__ import (
    absolute_import, division, print_function, unicode_literals
)

from docx import Document
from docx.document import Document as _Document
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Generate a reference to each paragraph and table child within *parent*,
    in document order. Each returned value is an instance of either Table or
    Paragraph. *parent* would most commonly be a reference to a main
    Document object, but also works for a _Cell object, which itself can
    contain paragraphs and tables.
    """
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
        # print(parent_elm.xml)
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


document = Document('test.docx')
for block in iter_block_items(document):
    print('found one')
    print(block.text if isinstance(block, Paragraph) else '<table>')
|
Ok, now it seems to work. |
Is it possible to print the content of the tables when we come across them? |
Yes. |
Hi scanny, can you tell me how to print the table content when the document contains both text and tables? |
Hi, thank you all for your suggestions, which helped me a lot. As of now I am done with what I needed.

def table_print(table):
    # Print every cell's paragraphs on one line per row.
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                print(paragraph.text, ' ', end='')
                # y.write(paragraph.text)
                # y.write(' ')
        print("\n")
        # y.write("\n")


# table_print() is defined first so the loop below can call it.
for block in iter_block_items(document):
    if isinstance(block, Paragraph):
        print(block.text)
    elif isinstance(block, Table):
        table_print(block)

Above is the modified code I built. |
Hello, any ideas about nested tables, meaning table B is inside table A? |
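One possible approach to nested tables, sketched here with the standard library's `xml.etree.ElementTree` instead of python-docx: walk the `w:p` and `w:tbl` children in order and recurse into each `w:tc` cell. The names `iter_blocks`, `paragraph_text`, and `SAMPLE` are hypothetical, and the XML is a simplified fragment, not a full `document.xml` (where `w:body` sits inside `w:document`). With python-docx itself, the analogous route would presumably be calling `iter_block_items()` on each cell, since its docstring says it accepts a `_Cell`.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in ElementTree's {uri}tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_text(p):
    """Join the text of every w:t element inside a w:p element."""
    return "".join(t.text or "" for t in p.iter(W + "t"))


def iter_blocks(container, depth=0):
    """Yield (depth, kind, text) for w:p and w:tbl children in document order.

    Table cells are visited recursively, so a table nested inside a cell
    is reported with a larger depth. Other child elements are skipped.
    """
    for child in container:
        if child.tag == W + "p":
            yield depth, "paragraph", paragraph_text(child)
        elif child.tag == W + "tbl":
            yield depth, "table", ""
            for row in child.findall(W + "tr"):
                for cell in row.findall(W + "tc"):
                    yield from iter_blocks(cell, depth + 1)


# Simplified, hand-written fragment: table B sits inside a cell of table A.
SAMPLE = (
    '<w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:p><w:r><w:t>intro</w:t></w:r></w:p>'
    '<w:tbl><w:tr><w:tc>'
    '<w:p><w:r><w:t>A1</w:t></w:r></w:p>'
    '<w:tbl><w:tr><w:tc><w:p><w:r><w:t>B1</w:t></w:r></w:p></w:tc></w:tr></w:tbl>'
    '</w:tc></w:tr></w:tbl>'
    '</w:body>'
)

blocks = list(iter_blocks(ET.fromstring(SAMPLE)))
for depth, kind, text in blocks:
    print("  " * depth + kind, text)
```

The depth value is only there to make the nesting visible when printing; a real script might build a nested list per table instead.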
Worked like a charm together with the code of scanny. Big Big Thank you for posting it |
You can read paragraphs, tables and images in document order in the following github repo: https://github.com/kmrambo/Python-docx-Reading-paragraphs-tables-and-images-in-document-order- |
The hyperlink is not working but the URL is correct. Maybe you meant: |
Hi,
I have read the entire discussion on issue #40, but the solutions do not work with the 0.8.5 release (I think because they are dated).
This is my problem: I have a large docx to read, with more than 400 pages. In this document I have some data in rows and some data in tables.
How can I read paragraphs and tables in the order they appear in the document?
Thanks