-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some tools for working with spreadsheets (C4-1088) #275
Conversation
netsettler
commented
Aug 14, 2023
Pull Request Test Coverage Report for Build 5958607573
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some very minor stylistic questions, maybe add a few docstings here and there, but nothing blocking on my end.
I would make sure that Doug has a chance to compare this to his spec before merging just so we are aligned on that front.
dcicutils/sheet_utils.py
Outdated
@classmethod | ||
def _all_rows(cls, sheet: Worksheet): | ||
row_max = sheet.max_row | ||
for row in range(2, row_max + 1): | ||
yield row | ||
|
||
@classmethod | ||
def _all_cols(cls, sheet: Worksheet): | ||
col_max = sheet.max_column | ||
for col in range(1, col_max + 1): | ||
yield col |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not to belabor the point as we've discussed this before, but I don't really understand the use of @classmethod
here when there are no references to cls
. I looked this up and as it turns out most styling standards do not really mind that you do this, so I guess it's ok, but generally I find it confusing as when I see a @classmethod
I am implicitly assuming there is a dependency for that method elsewhere in the class ie: it cannot be ported as a standalone function outside of the class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not like @staticmethod because it's not a concept in many languages. A compiler can figure out what's a static method and there should be no need to declare it. If I want to call other class methods, I like having cls
already available as a variable so I don't have to distract myself by adding data flow to receive the class name. If something is truly promised to be a static method, I just make it a function. There is no need to go through the syntax of calling it indirect through a class. It could be called by any class and it's weird to hide it somewhere that makes it syntactically hard for other things to get to it. If it's part of a class, then it should have access to the class methods.
Also, as a practical matter, Python fusses if something is an instance method but could be a static method but it does not fuss if it's a class method and could be a static method, probably because it wants to accommodate exactly the programming style that I use.
dcicutils/sheet_utils.py
Outdated
def _load_row(self, *, sheet: Worksheet, row: int): | ||
headers = self.sheet_headers(sheet.title) | ||
row_dict: Dict[str, Any] = {headers[col-1]: sheet.cell(row=row, column=col).value | ||
for col in self._all_cols(sheet)} | ||
return row_dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit-pick: I'd move this definition above the usage in the prior method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I want to rearrange some other stuff, too, so I'll do that when I do. I didn't want to move them out from under you once you were reviewing things. I guess now is a good time.
cls.set_path_value(datum[key], more_path, value) | ||
|
||
@classmethod | ||
def parse_value(cls, value: SheetCellValue) -> AnyJsonData: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would describe semantics of this processing in docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm adding some doc strings in parallel with your review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving a couple style comments that may be worth discussing, as well as asking one important question as to the scope of what this module is designed to do with respect to the casting of values performed here.
dcicutils/sheet_utils.py
Outdated
self.headers_by_sheetname: Dict[List[str]] = {} | ||
self.content_by_sheetname: Dict[List[Any]] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For dictionary-like type annotations, consider annotating the key type as well as the value type.
self.headers_by_sheetname: Dict[List[str]] = {} | |
self.content_by_sheetname: Dict[List[Any]] = {} | |
self.headers_by_sheetname: Dict[str, List[str]] = {} | |
self.content_by_sheetname: Dict[str, List[Any]] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that got fixed later. I don't even think the way I annotated it was legit. I was just asleep while typing.
But this is a really crappy way to notate dictionaries. Newer python (I think in 3.8) has better ways to declare this. Dictionaries are often not uniform in output type and this notation has trouble handling that.
def parse_value(cls, value: SheetCellValue) -> AnyJsonData: | ||
if isinstance(value, str): | ||
lvalue = value.lower() | ||
# TODO: We could consult a schema to make this less heuristic, but this may do for now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the purpose of this module is parsing spreadsheets purely for generating items to validate against schema, I think the schema parsing is required here. Otherwise, we will likely be casting to one type here only to check the schema later and cast back, for example if the field is a string per the schema and alphanumeric (including only numeric) inputs are accepted.
If we want this module for parsing spreadsheets more generically, then no issues with this at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just talking to Will about that today. It's not just validating. I think of it (well, Excel, really, or some other spreadsheet editor like OpenOffice or LibreOffice) as a kind of user interface layer that accepts more flexible syntax and offers more interesting editing capabilities. Your original ask was to tolerate differences in case, etc. That can't be done without schemas, I agree. That part is coming downstream. Stay tuned. Just trying to keep a separation of concerns.
self.filename: str = filename | ||
self.workbook: Optional[Workbook] = None | ||
self.headers_by_sheetname: Dict[str, List[str]] = {} | ||
self.content_by_sheetname: Dict[str, List[Any]] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interested to hear your perspective on whether it's preferable to update these variables (as is done here) or just assign them more explicitly, possibly caching the call. I've been gravitating more and more towards using frozen dataclasses with explicit properties or functions for getting particular pieces of information, such as headers_by_sheetname
here, and possibly caching the result if it's expensive or may be called multiple times. Given that the input here fixes everything that follows (assuming the filename and its contents are fixed when parsing), it seems like a natural fit here for me as well, with the benefit that one does not have to see where the variable is updated throughout the class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not expensive to get, though. It's all O(1). And these are really only set in one place. The only thing that's done a lot is accesses, and mainly this is just making those references a little prettier. Dataclasses look like a lot more computational and textual overhead than I need here, but I don't use them, so you're welcome to send me an example or I'd be happy to take time to discuss them.
def sheet_headers(self, sheetname: str) -> List[str]: | ||
return self.headers_by_sheetname[sheetname] | ||
|
||
def sheet_content(self, sheetname: str) -> List[Any]: | ||
return self.content_by_sheetname[sheetname] | ||
|
||
@classmethod | ||
def _all_rows(cls, sheet: Worksheet): | ||
row_max = sheet.max_row | ||
for row in range(2, row_max + 1): | ||
yield row | ||
|
||
@classmethod | ||
def _all_cols(cls, sheet: Worksheet): | ||
col_max = sheet.max_column | ||
for col in range(1, col_max + 1): | ||
yield col |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another stylistic thing you're welcome to ignore but for which I'm interested to hear your input given your background: I tend to utilize noun-only names for class properties and variables, whereas I would start these methods with verbs since deliver output dependent on input (e.g. _get_sheet_headers
). I know we don't have a style guide, but sometimes I wish we did.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These need to be verbs so that it's clear to someone that a new one is made on each call and that it's not accessing a cache. (Otherwise if two people looped at the same time, they'd reuse the same info. There were bugs in the early collection classes in python around this.) I mostly use the style rule you say, but sometimes the system is overconstrained and it matters more that it looks right in context, which is:
for x in foo._all_cols(...):
and you don't want to say
for x in foo._get_all_cols():
or _compute_all_cols()
or whatever. But I also don't want:
for x in foo._all_cols:
if key0 == n: | ||
parent.append(placeholder) | ||
elif key0 > n: | ||
raise Exception("Numeric items must occur sequentially.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if key0 < n
? If someone submits something along the lines of friend#-1.age
as a header, we should either handle it appropriately or report some kind of error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, < n
is not that big a deal. but < 0
would be a problem. I could add that. The reason it wants to worry about > n
is to make someone not use a huge number, because that would allocate storage beyond what is reasonable. I'm less worried about negative storage allocation, but it's still reasonable to have an error check for nonsensical info. And it's possible someone might think -1
means "last", so that's worth heading off.
elif key0 > n: | ||
raise Exception("Numeric items must occur sequentially.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When parsing these spreadsheets, we generally won't want to raise exceptions outright that won't be caught quickly because we want to accumulate all errors across the spreadsheet before reporting them back to the user for fixing. This could certainly be caught downstream, but may be worth considering here from the get-go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there might be some finessing to do on this. I had certainly been trying to minimize errors because indeed raising errors at all means you only get one bit of feedback instead of many and there's no obvious handler anyway, but some of these things I didn't work all the way through and figured I can come back to.
self.patch_prototypes_by_sheetname: Dict[str, Dict] = {} | ||
self.parsed_headers_by_sheetname: Dict[str, List[List[Union[int, str]]]] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the comment above, consider more explicitly building/returning these rather than updating them in separate class methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I agree with this. We could talk about what the advantage is. The primary reason I think that's a bad idea is it impedes or slows unit testing or possibly other operations where you don't need to construct these because they're not used, or where there might be unused sheets you're never going to touch. I think demand-creating them makes sense, and a dictionary is a good way to do that.