🎉 Initial PR #5
it is still without function
and find the important AHB tables
and iterate through all paragraphs and tables
and cover some special cases for empty middle cells
Co-authored-by: Annika <[email protected]>
Thanks a lot for the feedback so far.
I'd like to handle the larger changes, such as using classes, in a separate PR.

## PDF Dokumente

The following sections give a short overview where to find the start and end for the Formate.
Yep.
    table: Table,
    dataframe: pd.DataFrame,
    current_df_row_index: int,
    last_two_row_types: List[RowType],
    edifact_struktur_cell_left_indent_position: int,
    middle_cell_left_indent_position: int,
    tabstop_positions: List[int],
) -> Tuple[List[RowType], int]:
I'll tackle that in a follow-up PR. I've already opened an issue for it so I don't forget.
#6
    Args:
        table (Table): Current table in the docx
        dataframe (pd.DataFrame): Contains all infos of the Prüfidentifikators
        current_df_row_index (int): Current row of the dataframe
I'd handle this in a separate PR as well.
Issue created: #7
    if table._column_count == 4:
        index_for_middle_column = 2
    else:
        index_for_middle_column = 1
I still need it for now, but I think we could get rid of the index as well.
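One possible way to drop the passed-around index is to derive it where it is needed. The following is only a sketch: the helper name `get_middle_column_index` is hypothetical and not part of the PR, and it takes the column count as a plain `int` instead of a python-docx `Table` so that the example stays self-contained.

```python
def get_middle_column_index(column_count: int) -> int:
    """Return the index of the middle column for a table with `column_count` columns.

    Mirrors the branch in the diff above: tables with four columns carry the
    middle cell at index 2, all other tables at index 1.
    """
    return 2 if column_count == 4 else 1
```

Functions that currently receive `index_for_middle_column` could call such a helper themselves, which would remove one parameter from each signature.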
        list of the last two RowTypes,
        the current row index for the DataFrame
    """
    header_cells = [cell.text for cell in item.row_cells(0)]
I'm already taking the Prüfidentifikatoren from the header?
            indicator edifact struktur cell
        middle_cell_left_indent_position (int): Position of the left indent in the indicator middle cell
        tabstop_positions (List[int]): All tabstop positions of the indicator middle cell
I'd add that to issue #6 as well.
Phew, what a PR... 😅
It seems to work, and if I understand you correctly, you want to make the larger structural changes in separate PRs. The source material alone is confusing enough and requires quite a few special cases, but I think we can still make the program structure a bit clearer, and then perhaps find one or two optimizations in program flow and readability.
But the result already speaks for itself 😉
        # write actual row into dataframe
        if not (current_row_type is RowType.EMPTY and last_two_row_types[0] is RowType.HEADER):
            current_df_row_index = write_new_row_in_dataframe(
                row_type=current_row_type,
                table=table,
                row=row,
                index_for_middle_column=index_for_middle_column,
                dataframe=dataframe,
                dataframe_row_index=current_df_row_index,
                edifact_struktur_cell_left_indent_position=edifact_struktur_cell_left_indent_position,
                middle_cell_left_indent_position=middle_cell_left_indent_position,
                tabstop_positions=tabstop_positions,
            )

        else:
            current_df_row_index = write_new_row_in_dataframe(
                row_type=last_two_row_types[1],
                table=table,
                row=row,
                index_for_middle_column=index_for_middle_column,
                dataframe=dataframe,
                dataframe_row_index=current_df_row_index,
                edifact_struktur_cell_left_indent_position=edifact_struktur_cell_left_indent_position,
                middle_cell_left_indent_position=middle_cell_left_indent_position,
                tabstop_positions=tabstop_positions,
            )
There is basically no particular reason to call the write functions inside the read functions. You could simply call them one after the other in get_ahb_extract.
That would also solve the 'problem' that the read functions write as well.
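A rough sketch of what that separation could look like. Everything here is a simplified stand-in, not the PR's code: `read_row_type` and `write_row` are hypothetical names, the `RowType` enum is reduced to three members, rows are plain lists of cell texts, a list replaces the DataFrame, and both the header check on "EDIFACT Struktur" and the way `last_two_row_types` is updated are assumptions. Only the EMPTY-after-HEADER special case is taken from the diff above.

```python
from enum import Enum
from typing import List, Tuple


class RowType(Enum):
    # Reduced stand-in for the RowType enum used in the PR (assumption).
    HEADER = 1
    EMPTY = 2
    SEGMENT = 3


def read_row_type(cells: List[str]) -> RowType:
    """Hypothetical read step: classify a row without touching the output."""
    if all(cell == "" for cell in cells):
        return RowType.EMPTY
    if cells[0] == "EDIFACT Struktur":  # assumed header marker
        return RowType.HEADER
    return RowType.SEGMENT


def write_row(output: List[Tuple[RowType, List[str]]], row_type: RowType, cells: List[str]) -> None:
    """Hypothetical write step: append the row; a list stands in for the DataFrame."""
    output.append((row_type, cells))


def get_ahb_extract(rows: List[List[str]]) -> List[Tuple[RowType, List[str]]]:
    """Orchestrate both steps: read first, then write, one after the other."""
    output: List[Tuple[RowType, List[str]]] = []
    last_two_row_types = [RowType.EMPTY, RowType.EMPTY]
    for cells in rows:
        current_row_type = read_row_type(cells)
        if current_row_type is RowType.EMPTY and last_two_row_types[0] is RowType.HEADER:
            # Same special case as in the diff: fall back to the previous row type.
            row_type_to_write = last_two_row_types[1]
        else:
            row_type_to_write = current_row_type
        write_row(output, row_type_to_write, cells)
        # Assumed update rule: keep a sliding window of the last two row types.
        last_two_row_types = [last_two_row_types[1], current_row_type]
    return output
```

With this layout the read function has no side effects, and the duplicated call to the write function collapses into a single call whose `row_type` argument is chosen beforehand.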
    del row_cell_texts_as_list[1]
    row_cell_texts_as_list[2] = ""
Why do you delete one element and replace the other with an empty string?
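For context, the two statements behave differently on a list. The snippet below uses made-up cell texts purely to illustrate the effect; which cell is superfluous in the real tables is not stated in the PR.

```python
# Illustrative cell texts (made up, not taken from a real AHB table).
row_cell_texts_as_list = ["SG2", "", "NAD", "Muss"]

# del removes the element and shifts all later elements one position left.
del row_cell_texts_as_list[1]

# Assignment overwrites in place; the list length stays the same.
row_cell_texts_as_list[2] = ""

print(row_cell_texts_as_list)  # ['SG2', 'NAD', '']
```

Note that after the `del`, index 2 refers to what was originally the element at index 3.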
; isolated_build = True
; skipsdist = True
Why commented out rather than deleted?
    # def test_not_implemented_middle_cell_paragraph(self):
    #     # insert text
    #     self.test_cell.text = ""
    #     test_paragraph = self.test_cell.paragraphs[0]

    #     # set left indent position
    #     test_paragraph.paragraph_format.left_indent = None

    #     df = pd.DataFrame(dtype="str")
    #     row_index = 0

    #     with pytest.raises(NotImplementedError) as excinfo:
    #         parse_paragraph_in_middle_column_to_dataframe(
    #             paragraph=test_paragraph,
    #             dataframe=df,
    #             row_index=row_index,
    #             left_indent_position=self.left_indent_position_of_indicator_paragraph,
    #             tabstop_positions=self.tabstop_positions_of_indicator_paragraph,
    #         )

    #     assert "Could not parse paragraphe in middle cell with " in str(excinfo.value)
What's going on with this test?
Bumps [lxml](https://github.com/lxml/lxml) from 4.6.3 to 4.6.5. - [Release notes](https://github.com/lxml/lxml/releases) - [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt) - [Commits](lxml/lxml@lxml-4.6.3...lxml-4.6.5) --- updated-dependencies: - dependency-name: lxml dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [xlsxwriter](https://github.com/jmcnamara/XlsxWriter) from 1.4.3 to 3.0.2. - [Release notes](https://github.com/jmcnamara/XlsxWriter/releases) - [Changelog](https://github.com/jmcnamara/XlsxWriter/blob/main/Changes) - [Commits](jmcnamara/XlsxWriter@RELEASE_1.4.3...RELEASE_3.0.2) --- updated-dependencies: - dependency-name: xlsxwriter dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [pandas](https://github.com/pandas-dev/pandas) from 1.2.4 to 1.3.5. - [Release notes](https://github.com/pandas-dev/pandas/releases) - [Changelog](https://github.com/pandas-dev/pandas/blob/master/RELEASE.md) - [Commits](pandas-dev/pandas@v1.2.4...v1.3.5) --- updated-dependencies: - dependency-name: pandas dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [openpyxl](https://openpyxl.readthedocs.io) from 3.0.7 to 3.0.9. --- updated-dependencies: - dependency-name: openpyxl dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Annika <[email protected]>
Because of the chaotic start of this project, we have to open a PR from the main branch to the develop branch.
Features of the AHB Extractor