🎉 Initial PR #5

Merged

hf-kklein merged 131 commits into develop from main

Aug 31, 2022

Collaborator

hf-krechan commented Jun 17, 2021

Because of the wild start of this project, we have to open a PR from the main branch to the develop branch.

Features of the AHB Extractor

Read AHB docx files and extract information from the tables of Prüfidentifikatoren
Export the information in machine-readable formats
- json
- csv
- excel
Unit tests run without docx files
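
As a rough illustration of the export feature, the sketch below serializes a few extracted rows to JSON and CSV using only the standard library. The `export_rows` helper, the column names, and the sample values are assumptions made for this example; the project itself works with pandas DataFrames, and the Excel export is left out here.

```python
import csv
import io
import json
from typing import Dict, List


def export_rows(rows: List[Dict[str, str]]) -> Dict[str, str]:
    """Serialize rows to JSON and CSV text (hypothetical helper, not the project's API)."""
    json_text = json.dumps(rows, ensure_ascii=False, indent=2)

    csv_buffer = io.StringIO()
    # Use the keys of the first row as the CSV header.
    fieldnames = list(rows[0].keys()) if rows else []
    writer = csv.DictWriter(csv_buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return {"json": json_text, "csv": csv_buffer.getvalue()}


# Example rows; the column names and values are invented for illustration.
example_rows = [
    {"segment": "UNH", "datenelement": "0062", "11042": "X"},
    {"segment": "BGM", "datenelement": "1001", "11042": "X"},
]
exported = export_rows(example_rows)
```

Returning strings instead of writing files keeps the sketch testable without touching the file system, in the same spirit as the unit tests that run without docx files.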

hf-krechan added 30 commits

June 1, 2021 18:39


          🚧 WIP

a6b7f5c


          ➕ Add dependencies

d8b9602


          🙈 Add document folders to gitignore

99c608d


          🎨 Rename path for documents

73740fe


          🚧 Add three column tables

157ea29


          ♻️ Big code refactoring part 01

7e19d19


          🚧 WIP

feaa3ab

commit before clean up
delete all unused lines


          🔥 remove old and unstructured code

98d4951


          ✨ Parse multi line codelists into separate dataframe rows

c6b48f0


          🔥 Remove unused import of numpy

c80e4cf


          ♻ Put check row type functions into separate file

1555d73


          🔥 Remove unused functions

3f76626


          🔧 Add black profile to isort config

c167240


          ♻ Put write functions into separate file

fe34b3a


          🐛 Small bug fixes for the list row_cell_texts_as_list

07ff01d


          🎨Add additional attribute middle_cell to write SEGMENT and DATENELEMENT

1f0fd74


          ✨ Add case for EMPTY cell

2bee88f

it is still without function


          🎨 Add type hint for create_list_of_column_indices

47c3abc


          🎨 Rename parser function for middle column cells

969408b


          🔥 Remove unused lines

c45ca28


          🎨 Change start of tables list to loop through all tables

fcab3c8


          🔥 Remove unused line

3728ff2


          🎨 Use CamelCase for enum class

0810ea1


          🎨 Define one function for writing row into dataframe

bb1a7a6


          ✨ Iterate through all paragraphs and tables

da60cce

and find the important AHB tables


          ✨ Get Prüfis from Header and tabstop positons from third row

17afb54


          🎨 Create new function to read tables

496feb8

and iterate through all paragraphs and tables


          🙈 Add xlsx and csv files to gitignore

145de0e


          🔥 Delete deprecated code

e91b1e9


          🎨 Add dynamic tabstop positions as argument

741047a

and cover some special cases for empty middle cells

hf-krechan and others added 8 commits

June 22, 2021 17:09


          💡 Improve docstring of RowType enum

78f40cb


          💡 Improve docstring of define_row_type

ebdb196

Co-authored-by: Annika <[email protected]>


          🔥 Remove deprecated code


          🎨 Improve function name to export *single* pruefidentifikator

a806d54


          💡 Add comment to explain list(df.columns)[:5]

a3e3e72


          👩‍💻 Improve saving feedback

9e56a10


          💡 Precise error message

39e1d5a


          💡 Remove unnecessary text

6f2e529

hf-krechan commented

Collaborator Author

hf-krechan left a comment

Thanks a lot for the feedback so far.
I would like to handle the larger changes, such as using classes, in a separate PR.

README.md (outdated, resolved)


		## PDF Dokumente

		The following sections give a short overview where to find the start and end for the Formate.

Collaborator Author

hf-krechan Jun 22, 2021

Yep.

ahbextractor/helper/check_row_type.py (three resolved threads)

ahbextractor/helper/read_functions.py

Comment on lines +62 to +69

+                  table: Table,
+                  dataframe: pd.DataFrame,
+                  current_df_row_index: int,
+                  last_two_row_types: List[RowType],
+                  edifact_struktur_cell_left_indent_position: int,
+                  middle_cell_left_indent_position: int,
+                  tabstop_positions: List[int],
+              ) -> Tuple[List[RowType], int]:

Collaborator Author

hf-krechan Jun 22, 2021

I would tackle that in a follow-up PR. I have already opened an issue for it so that it does not get forgotten.
#6

ahbextractor/helper/read_functions.py

+                  Args:
+                      table (Table): Current table in the docx
+                      dataframe (pd.DataFrame): Contains all infos of the Prüfidentifikators
+                      current_df_row_index (int): Current row of the dataframe

Collaborator Author

hf-krechan Jun 22, 2021

I would also handle that in a separate PR.
An issue has been created: #7

ahbextractor/helper/read_functions.py

Comment on lines +87 to +90

+                  if table._column_count == 4:
+                      index_for_middle_column = 2
+                  else:
+                      index_for_middle_column = 1

Collaborator Author

hf-krechan Jun 22, 2021

I still need it for now, but I think the index could be dropped as well.
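
For reference, the branch above can be written as a single conditional expression. This is only a sketch: `get_index_for_middle_column` is a hypothetical helper name, and the mapping simply follows the diff.

```python
def get_index_for_middle_column(column_count: int) -> int:
    """Return the index of the middle column (hypothetical helper mirroring the diff)."""
    # In the diff, four-column tables use index 2; all other tables use index 1.
    return 2 if column_count == 4 else 1
```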

ahbextractor/helper/read_functions.py

+                          list of the last two RowTypes,
+                          the current row index for the DataFrame
+                  """
+                  header_cells = [cell.text for cell in item.row_cells(0)]

Collaborator Author

hf-krechan Jun 22, 2021

Am I not already taking the Prüfidentifikatoren from the header?

ahbextractor/helper/read_functions.py

+                          indicator edifact struktur cell
+                      middle_cell_left_indent_position (int): Position of the left indent in the indicator middle cell
+                      tabstop_positions (List[int]): All tabstop positions of the indicator middle cell

Collaborator Author

hf-krechan Jun 22, 2021

I would add this to issue #6 as well.

hf-aschloegl approved these changes

Contributor

hf-aschloegl left a comment

Phew, what a PR... 😅
It seems to work, and if I understand you correctly, you want to make the larger structural changes in separate PRs. The source material alone is confusing enough and requires quite a few special cases, but I think we can still make the program structure a bit clearer, and we may then find one or two optimizations in the program flow and in readability.
But the result already speaks for itself 😉

ahbextractor/helper/read_functions.py

Comment on lines +137 to +162

+                      # write actual row into dataframe
+                      if not (current_row_type is RowType.EMPTY and last_two_row_types[0] is RowType.HEADER):
+                          current_df_row_index = write_new_row_in_dataframe(
+                              row_type=current_row_type,
+                              table=table,
+                              row=row,
+                              index_for_middle_column=index_for_middle_column,
+                              dataframe=dataframe,
+                              dataframe_row_index=current_df_row_index,
+                              edifact_struktur_cell_left_indent_position=edifact_struktur_cell_left_indent_position,
+                              middle_cell_left_indent_position=middle_cell_left_indent_position,
+                              tabstop_positions=tabstop_positions,
+                          )
+                      else:
+                          current_df_row_index = write_new_row_in_dataframe(
+                              row_type=last_two_row_types[1],
+                              table=table,
+                              row=row,
+                              index_for_middle_column=index_for_middle_column,
+                              dataframe=dataframe,
+                              dataframe_row_index=current_df_row_index,
+                              edifact_struktur_cell_left_indent_position=edifact_struktur_cell_left_indent_position,
+                              middle_cell_left_indent_position=middle_cell_left_indent_position,
+                              tabstop_positions=tabstop_positions,
+                          )

Contributor

hf-aschloegl Jun 23, 2021

Basically, there is no particular reason to call the write functions inside the read functions. You could simply call them one after the other in get_ahb_extract.
That would also solve the 'problem' that the read functions also write.
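
One possible shape of that separation is sketched below. Everything here is an assumption for illustration: the reduced `RowType` enum, the sliding two-item history, the plain list standing in for the DataFrame, and the helper names `resolve_row_type`, `read_rows`, `write_rows`, and `get_ahb_extract_sketch`. Only the condition on `RowType.EMPTY` and `RowType.HEADER` is taken from the diff; resolving the row type first also removes the duplicated call.

```python
from enum import Enum
from typing import Dict, List, Tuple


class RowType(Enum):
    # Reduced to the members this sketch needs; the project's enum may define others.
    HEADER = 1
    SEGMENT = 2
    EMPTY = 3


def resolve_row_type(current: RowType, last_two: List[RowType]) -> RowType:
    """Pick the row type to write, mirroring the condition in the diff."""
    if current is RowType.EMPTY and last_two[0] is RowType.HEADER:
        return last_two[1]
    return current


def read_rows(raw_rows: List[Tuple[RowType, str]]) -> List[Tuple[RowType, str]]:
    """Read step: classify rows only, without writing anything."""
    # Assumption: the history is a sliding window over the last two row types.
    last_two = [RowType.EMPTY, RowType.EMPTY]
    resolved = []
    for current, text in raw_rows:
        resolved.append((resolve_row_type(current, last_two), text))
        last_two = [last_two[1], current]
    return resolved


def write_rows(rows: List[Tuple[RowType, str]], target: List[Dict[str, str]]) -> None:
    """Write step: append the already classified rows to the target."""
    for row_type, text in rows:
        target.append({"row_type": row_type.name, "text": text})


def get_ahb_extract_sketch(raw_rows: List[Tuple[RowType, str]]) -> List[Dict[str, str]]:
    """Call the read step and the write step one after the other."""
    output: List[Dict[str, str]] = []
    write_rows(read_rows(raw_rows), output)
    return output
```

With this split, the read step can be tested on its own by comparing the returned list, without inspecting any written output.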

ahbextractor/helper/read_functions.py

Comment on lines +116 to +117

		del row_cell_texts_as_list[1]
		row_cell_texts_as_list[2] = ""

Contributor

hf-aschloegl Jun 23, 2021

Why do you delete one element and replace the other with an empty string?
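
To make the question concrete, the snippet below applies the two statements to a sample list. The list contents are made up; only the two operations come from the diff.

```python
# Sample cell texts; the values are invented for illustration.
row_cell_texts_as_list = ["a", "b", "c", "d"]

# Deleting index 1 shortens the list and shifts the later items one position left.
del row_cell_texts_as_list[1]

# Assigning "" keeps the list length and blanks the item that is now at index 2.
row_cell_texts_as_list[2] = ""
```

The first statement changes the length of the list, the second only changes a value, so the two are not interchangeable.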

ahbextractor/helper/write_functions.py (five outdated, resolved threads)

tox.ini

Comment on lines +7 to +8

		; isolated_build = True
		; skipsdist = True

Contributor

hf-aschloegl Jun 23, 2021

Why is this commented out instead of deleted?

unittests/test_check_row_type.py (outdated, resolved)

unittests/test_write_functions.py

Comment on lines +198 to +218

+                  # def test_not_implemented_middle_cell_paragraph(self):
+                  #     # insert text
+                  #     self.test_cell.text = ""
+                  #     test_paragraph = self.test_cell.paragraphs[0]
+                  #     # set left indent positon
+                  #     test_paragraph.paragraph_format.left_indent = None
+                  #     df = pd.DataFrame(dtype="str")
+                  #     row_index = 0
+                  #     with pytest.raises(NotImplementedError) as excinfo:
+                  #         parse_paragraph_in_middle_column_to_dataframe(
+                  #             paragraph=test_paragraph,
+                  #             dataframe=df,
+                  #             row_index=row_index,
+                  #             left_indent_position=self.left_indent_position_of_indicator_paragraph,
+                  #             tabstop_positions=self.tabstop_positions_of_indicator_paragraph,
+                  #         )
+                  #     assert "Could not parse paragraphe in middle cell with " in str(excinfo.value)

Contributor

hf-aschloegl Jun 23, 2021

What about this test?

hf-kklein and others added 18 commits

December 7, 2021 07:39


          Install Dependabot

8154b64


          Bump lxml from 4.6.3 to 4.6.5 (#12)

b55d2fb

Bumps [lxml](https://github.com/lxml/lxml) from 4.6.3 to 4.6.5.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](lxml/lxml@lxml-4.6.3...lxml-4.6.5)

---
updated-dependencies:
- dependency-name: lxml
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>


          Bump xlsxwriter from 1.4.3 to 3.0.2 (#8)

3da3dc0

Bumps [xlsxwriter](https://github.com/jmcnamara/XlsxWriter) from 1.4.3 to 3.0.2.
- [Release notes](https://github.com/jmcnamara/XlsxWriter/releases)
- [Changelog](https://github.com/jmcnamara/XlsxWriter/blob/main/Changes)
- [Commits](jmcnamara/XlsxWriter@RELEASE_1.4.3...RELEASE_3.0.2)

---
updated-dependencies:
- dependency-name: xlsxwriter
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>


          Bump pandas from 1.2.4 to 1.3.5 (#11)

af8250b

Bumps [pandas](https://github.com/pandas-dev/pandas) from 1.2.4 to 1.3.5.
- [Release notes](https://github.com/pandas-dev/pandas/releases)
- [Changelog](https://github.com/pandas-dev/pandas/blob/master/RELEASE.md)
- [Commits](pandas-dev/pandas@v1.2.4...v1.3.5)

---
updated-dependencies:
- dependency-name: pandas
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>


          Bump openpyxl from 3.0.7 to 3.0.9 (#10)

d30b252

Bumps [openpyxl](https://openpyxl.readthedocs.io) from 3.0.7 to 3.0.9.

---
updated-dependencies:
- dependency-name: openpyxl
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>


          Request Dependabot Reviews from 🐍Review Team (#15)

7a12783


          Bump lxml from 4.6.5 to 4.9.1 (#21)

766d869


          Bump xlsxwriter from 3.0.2 to 3.0.3 (#16)

f8e407f


          Update ahbextractor/helper/write_functions.py

4789e5b

Co-authored-by: Annika <[email protected]>


          Update ahbextractor/helper/write_functions.py

f3ae77d

Co-authored-by: Annika <[email protected]>


          Update unittests/test_check_row_type.py

16e7d7e

Co-authored-by: Annika <[email protected]>


          Update ahbextractor/helper/write_functions.py

3a55f9e

Co-authored-by: Annika <[email protected]>


          Update ahbextractor/helper/write_functions.py

9255cdd

Co-authored-by: Annika <[email protected]>


          Update ahbextractor/helper/write_functions.py

66348e6

Co-authored-by: Annika <[email protected]>


          Update ahbextractor/helper/write_functions.py

52a89f2

Co-authored-by: Annika <[email protected]>


          Bump pandas from 1.3.5 to 1.4.3 (#20)

fe04463


          Bump openpyxl from 3.0.9 to 3.0.10 (#18)

77318a8


          Bump numpy from 1.21.5 to 1.22.0 (#19)

d316084

hf-kklein mentioned this pull request

This is 1:1 the same code that is also in the export function beautify_bedingungen. I assume you can remove it in one place or call the function. #22

Closed

hf-kklein merged commit 561c091 into develop
