🎉 Initial PR #5

Merged
merged 131 commits into from
Aug 31, 2022

Conversation

hf-krechan
Collaborator

Because of the wild start of this project, we have to make a PR from the main branch to the develop branch.

Features of the AHB Extractor

  • Read AHB docx files and extract information from the tables of Prüfidentifikatoren
  • Export the information in machine-readable formats
    • json
    • csv
    • excel
  • Unit tests run without docx files
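
As a rough illustration of the export feature listed above, here is a minimal sketch using only the standard library. The row layout, the column names, and the helper names `to_json` and `to_csv` are assumptions made for this example; the project itself works with a pandas DataFrame, whose `to_json`, `to_csv`, and `to_excel` methods would cover the same three formats.

```python
import csv
import io
import json
from typing import Dict, List

# Hypothetical extracted rows: one dict per AHB table row.
# The keys ("segment", "11042") are illustrative, not the project's real schema.
rows: List[Dict[str, str]] = [
    {"segment": "UNH", "11042": "Muss"},
    {"segment": "BGM", "11042": "Muss"},
]


def to_json(rows: List[Dict[str, str]]) -> str:
    """Serialize the rows as a JSON array, keeping umlauts readable."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize the rows as CSV with a header taken from the first row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Because both helpers return strings, they can be exercised in unit tests without touching the file system or any docx file.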

commit before clean up
delete all unused lines
it is still without function
and find the important AHB tables
and iterate through all paragraphs and tables
and cover some special cases for empty middle cells
Collaborator Author

@hf-krechan hf-krechan left a comment

Thank you for the feedback so far.
I would like to handle the larger changes, such as using classes, in a separate PR.

README.md (outdated, resolved)

## PDF Dokumente

The following sections give a short overview where to find the start and end for the Formate.
Collaborator Author

Yep.

ahbextractor/helper/check_row_type.py (3 resolved threads)
Comment on lines +62 to +69
table: Table,
dataframe: pd.DataFrame,
current_df_row_index: int,
last_two_row_types: List[RowType],
edifact_struktur_cell_left_indent_position: int,
middle_cell_left_indent_position: int,
tabstop_positions: List[int],
) -> Tuple[List[RowType], int]:
Collaborator Author

I would tackle this in a follow-up PR. I have already opened an issue for it so that it does not get forgotten:
#6

Args:
table (Table): Current table in the docx
dataframe (pd.DataFrame): Contains all infos of the Prüfidentifikators
current_df_row_index (int): Current row of the dataframe
Collaborator Author

I would handle this in a separate PR as well.
An issue has been created: #7

Comment on lines +87 to +90
if table._column_count == 4:
    index_for_middle_column = 2
else:
    index_for_middle_column = 1
Collaborator Author

I still need it for now, but I think the index could be dropped as well.
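
One possible way to drop the explicit index, sketched under the assumption that the `else` branch only ever sees three-column tables: index 2 of four columns and index 1 of three columns are both the second-to-last position, so a negative index covers both cases. The helper name `get_middle_cell` is hypothetical and not part of the PR.

```python
from typing import List, TypeVar

Cell = TypeVar("Cell")


def get_middle_cell(row_cells: List[Cell]) -> Cell:
    """Return the middle cell of a table row.

    Assumption: rows have either 3 or 4 cells, so the middle
    column is always the second-to-last one.
    """
    if len(row_cells) not in (3, 4):
        raise ValueError(f"Expected 3 or 4 cells, got {len(row_cells)}")
    return row_cells[-2]
```

If tables with other column counts can occur, the explicit check makes that visible instead of silently picking the wrong cell.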

list of the last two RowTypes,
the current row index for the DataFrame
"""
header_cells = [cell.text for cell in item.row_cells(0)]
Collaborator Author

Am I not already taking the Prüfidentifikatoren from the header?

indicator edifact struktur cell
middle_cell_left_indent_position (int): Position of the left indent in the indicator middle cell
tabstop_positions (List[int]): All tabstop positions of the indicator middle cell

Collaborator Author

I would add this to issue #6 as well.

Contributor

@hf-aschloegl hf-aschloegl left a comment

Phew, what a PR... 😅
It seems to work, and if I understand you correctly, you want to make the larger structural changes in separate PRs. The source material alone is confusing enough and requires quite a few special cases, but I think we can still make the program structure a bit clearer, and we may then find one or two further optimizations in control flow and readability.
But the result already speaks for itself 😉

Comment on lines +137 to +162
# write actual row into dataframe
if not (current_row_type is RowType.EMPTY and last_two_row_types[0] is RowType.HEADER):
    current_df_row_index = write_new_row_in_dataframe(
        row_type=current_row_type,
        table=table,
        row=row,
        index_for_middle_column=index_for_middle_column,
        dataframe=dataframe,
        dataframe_row_index=current_df_row_index,
        edifact_struktur_cell_left_indent_position=edifact_struktur_cell_left_indent_position,
        middle_cell_left_indent_position=middle_cell_left_indent_position,
        tabstop_positions=tabstop_positions,
    )

else:
    current_df_row_index = write_new_row_in_dataframe(
        row_type=last_two_row_types[1],
        table=table,
        row=row,
        index_for_middle_column=index_for_middle_column,
        dataframe=dataframe,
        dataframe_row_index=current_df_row_index,
        edifact_struktur_cell_left_indent_position=edifact_struktur_cell_left_indent_position,
        middle_cell_left_indent_position=middle_cell_left_indent_position,
        tabstop_positions=tabstop_positions,
    )
Contributor

There is basically no particular reason to call the write functions inside the read functions. You could simply call them one after the other in get_ahb_extract.
That would also solve the 'problem' that the read functions write as well.
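
A minimal sketch of what this separation could look like. The two branches in the diff differ only in the `row_type` argument, so the decision can live in a small pure function, and reading and writing can become two steps called one after the other. `resolve_row_type`, `read_row`, `write_row`, and the `SEGMENT` member are hypothetical names for illustration; only `RowType.EMPTY` and `RowType.HEADER` appear in the reviewed code.

```python
from enum import Enum, auto
from typing import Dict, List


class RowType(Enum):
    # EMPTY and HEADER appear in the diff; SEGMENT is a hypothetical stand-in.
    HEADER = auto()
    SEGMENT = auto()
    EMPTY = auto()


def resolve_row_type(current_row_type: RowType, last_two_row_types: List[RowType]) -> RowType:
    """Mirror the if/else from the diff: an empty row whose first remembered
    type is HEADER is written with the second remembered type instead."""
    if current_row_type is RowType.EMPTY and last_two_row_types[0] is RowType.HEADER:
        return last_two_row_types[1]
    return current_row_type


def read_row(current_row_type: RowType, last_two_row_types: List[RowType], cell_texts: List[str]) -> Dict[str, object]:
    """Read step: returns parsed data and writes nothing."""
    return {
        "row_type": resolve_row_type(current_row_type, last_two_row_types),
        "cells": cell_texts,
    }


def write_row(output_rows: List[Dict[str, object]], parsed_row: Dict[str, object]) -> int:
    """Write step: the only place that mutates the output; returns the next row index."""
    output_rows.append(parsed_row)
    return len(output_rows)
```

A caller such as get_ahb_extract would then call `read_row` followed by `write_row` for each table row, which keeps the read functions free of side effects and easier to test.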

Comment on lines +116 to +117
del row_cell_texts_as_list[1]
row_cell_texts_as_list[2] = ""
Contributor

Why do you delete one element but replace the other with an empty string?
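
To make the question concrete, here is a small sketch of what the two statements do to a list. The cell texts are made-up placeholders, and `drop_and_blank` is a hypothetical wrapper: `del` shortens the list and shifts every later element one position to the left, while the assignment keeps the length and only blanks one slot.

```python
from typing import List


def drop_and_blank(row_cell_texts_as_list: List[str]) -> List[str]:
    """Apply the two statements from the diff to a copy of the list."""
    result = list(row_cell_texts_as_list)
    del result[1]    # removes the second element; later elements shift left
    result[2] = ""   # overwrites what is now the third element
    return result


# Placeholder cell texts, not real AHB content.
print(drop_and_blank(["left", "extra", "middle", "right"]))
# prints ['left', 'middle', '']
```

Note that after the deletion, index 2 refers to the element that was originally at index 3, so the second statement blanks the former last cell of a four-element list.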

ahbextractor/helper/write_functions.py (5 outdated, resolved threads)
Comment on lines +7 to +8
; isolated_build = True
; skipsdist = True
Contributor

Why commented out rather than deleted?

unittests/test_check_row_type.py (outdated, resolved)
Comment on lines +198 to +218
# def test_not_implemented_middle_cell_paragraph(self):
#     # insert text
#     self.test_cell.text = ""
#     test_paragraph = self.test_cell.paragraphs[0]

#     # set left indent positon
#     test_paragraph.paragraph_format.left_indent = None

#     df = pd.DataFrame(dtype="str")
#     row_index = 0

#     with pytest.raises(NotImplementedError) as excinfo:
#         parse_paragraph_in_middle_column_to_dataframe(
#             paragraph=test_paragraph,
#             dataframe=df,
#             row_index=row_index,
#             left_indent_position=self.left_indent_position_of_indicator_paragraph,
#             tabstop_positions=self.tabstop_positions_of_indicator_paragraph,
#         )

#     assert "Could not parse paragraphe in middle cell with " in str(excinfo.value)
Contributor

What about this test?
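
One option, offered only as a sketch, is to keep such a test in the code base but mark it as skipped with a stated reason, so it shows up in the test report instead of disappearing into comments. The example uses `unittest.skip` from the standard library, which pytest also honours; the class name, the skip reason, and the `run_suite` helper are assumptions for illustration.

```python
import unittest


class TestMiddleCellParagraph(unittest.TestCase):
    @unittest.skip("Hypothetical reason: NotImplementedError path is being reworked")
    def test_not_implemented_middle_cell_paragraph(self):
        # The real assertions would go here once the test is re-enabled.
        self.fail("should not run while skipped")


def run_suite() -> unittest.TestResult:
    """Run the test case programmatically and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMiddleCellParagraph)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

The skipped test is then listed together with its reason in the runner output, which answers the question of its status without reading the source.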

hf-kklein and others added 18 commits December 7, 2021 07:39
Bumps [lxml](https://github.com/lxml/lxml) from 4.6.3 to 4.6.5.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](lxml/lxml@lxml-4.6.3...lxml-4.6.5)

---
updated-dependencies:
- dependency-name: lxml
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [xlsxwriter](https://github.com/jmcnamara/XlsxWriter) from 1.4.3 to 3.0.2.
- [Release notes](https://github.com/jmcnamara/XlsxWriter/releases)
- [Changelog](https://github.com/jmcnamara/XlsxWriter/blob/main/Changes)
- [Commits](jmcnamara/XlsxWriter@RELEASE_1.4.3...RELEASE_3.0.2)

---
updated-dependencies:
- dependency-name: xlsxwriter
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [pandas](https://github.com/pandas-dev/pandas) from 1.2.4 to 1.3.5.
- [Release notes](https://github.com/pandas-dev/pandas/releases)
- [Changelog](https://github.com/pandas-dev/pandas/blob/master/RELEASE.md)
- [Commits](pandas-dev/pandas@v1.2.4...v1.3.5)

---
updated-dependencies:
- dependency-name: pandas
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [openpyxl](https://openpyxl.readthedocs.io) from 3.0.7 to 3.0.9.

---
updated-dependencies:
- dependency-name: openpyxl
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
3 participants