Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support derived predicates between static predicates and plain predicates #145

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

justin13601
Copy link
Owner

@justin13601 justin13601 commented Oct 25, 2024

For #144

Summary by CodeRabbit

  • New Features

    • Enhanced handling of static variables in derived predicates for improved output accuracy.
    • Improved error messaging for unsupported data standards.
    • Support for defining predicates in a separate configuration file, allowing for greater flexibility.
    • Expanded predicates configuration language to include regular expressions and multi-column constraints.
    • Added validation for predicate definitions to prevent misconfigurations.
    • New logging feature to check the uniqueness of labels extracted from cohorts.
  • Documentation

    • Added example usages in the documentation for better user guidance on new functionalities.
    • Enhanced clarity and detail in the README and configuration documents regarding task setup and usage.
    • Updated FAQs section to address common queries about static data handling.
    • Clarified installation instructions and refined usage guidance for task configuration files.
  • Chores

    • Updated code quality checks and test workflows for improved reliability and performance.
    • Updated pre-commit hooks to the latest version for enhanced functionality.
    • Improved error handling and validation in configuration loading processes.

Copy link
Contributor

coderabbitai bot commented Oct 25, 2024

Walkthrough

The changes in this pull request involve enhancements to the get_predicates_df function in src/aces/predicates.py, improving the handling of static variables in derived predicates and refining error messaging. Additionally, several GitHub Actions workflow files have been updated to use newer versions of actions and to include additional steps for installing development packages. The .pre-commit-config.yaml has been updated to a newer version of pre-commit-hooks, while the pyproject.toml file restricts the pre-commit dependency to versions less than 4. The README.md and docs/source/configuration.md files have also been updated to improve documentation clarity and usability.

Changes

File Change Summary
src/aces/predicates.py Enhanced get_predicates_df to handle static variables in derived predicates; improved error handling; added example usages in docstring.
.github/workflows/code-quality-main.yaml Updated actions/checkout and actions/setup-python to versions 4 and 5; added step to install dev packages; updated pre-commit/action.
.github/workflows/code-quality-pr.yaml Similar updates as code-quality-main.yaml, including action version upgrades and new dev package installation step.
.github/workflows/tests.yml Upgraded actions/checkout and actions/setup-python; changed package installation command.
.pre-commit-config.yaml Updated pre-commit-hooks repository version from v4.4.0 to v5.0.0.
pyproject.toml Updated pre-commit dependency constraint from "pre-commit" to "pre-commit<4".
README.md Enhanced documentation for ACES, added details on new features, predicates configuration, and installation instructions.
docs/source/configuration.md Clarified configuration language specification, added notes on memory optimization, and detailed field explanations.
src/aces/config.py Improved error handling in TaskExtractorConfig for predicate definitions; updated docstring examples.
src/aces/query.py Enhanced logging and error handling in query function for label extraction.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Predicates
    participant DataFrame

    User->>Predicates: Call get_predicates_df()
    Predicates->>DataFrame: Retrieve static variables
    DataFrame-->>Predicates: Return static variable values
    Predicates->>DataFrame: Process derived predicates
    DataFrame-->>User: Return updated DataFrame
Loading

🐇 "In code we hop and play,
With predicates bright as day,
Static values now align,
In our DataFrame, they shine!
With workflows fresh and neat,
Our coding journey's sweet!" 🐇


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between f5b0dbc and 8902365.

📒 Files selected for processing (1)
  • src/aces/config.py (6 hunks)
🧰 Additional context used
🪛 GitHub Check: codecov/patch
src/aces/config.py

[warning] 1378-1378: src/aces/config.py#L1378
Added line #L1378 was not covered by tests

🔇 Additional comments (5)
src/aces/config.py (5)

1077-1077: LGTM! Improved error message clarity.

The error message now clearly indicates the number of missing relationships and provides better context.


1169-1169: Excellent documentation improvements!

The added examples effectively demonstrate:

  1. Configuration loading with predicates
  2. Static and plain predicate joining
  3. Complex nested derived predicates

This significantly improves the understanding of the configuration system.

Also applies to: 1199-1258


1399-1404: LGTM! Enhanced input validation.

The added validation properly catches incorrect string-based predicate definitions and provides clear guidance on the correct format.


1454-1454: LGTM! Consistent error message formatting.

The error message formatting maintains consistency with the codebase style and improves readability.


1339-1339: LGTM! Robust predicate handling implementation.

The changes effectively:

  1. Merge predicates from multiple sources
  2. Recursively validate nested derived predicates
  3. Build a comprehensive set of referenced predicates

However, there's an uncovered code path. Let's verify the test coverage:

Also applies to: 1353-1391


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
src/aces/predicates.py (1)

702-711: LGTM! Efficient implementation of static predicate propagation.

The implementation correctly handles static variables in derived predicates by propagating their values within subject_id groups using polars' optimized window functions.

Consider adding a brief comment explaining the propagation logic:

 if any(x in code.static for x in code.input_predicates):
     data = data.with_columns(
         [
+            # Propagate static values by taking the first value for each subject_id
             pl.col(static_var)
             .first()
             .over("subject_id")  # take the first value in each subject_id group and propagate it
             .alias(static_var)
             for static_var in code.static
         ]
     )
src/aces/config.py (1)

1296-1296: Add validation for demographic keys and document assumptions.

The code assumes all demographics are static predicates. Consider:

  1. Adding validation to ensure demographic keys are valid predicate names
  2. Documenting this assumption in the code
  3. Making the static nature of demographics configurable if needed

Example implementation:

            if "expr" in p:
-                predicate_objs[n] = DerivedPredicateConfig(**p, static=list(final_demographics.keys()))
+                # Validate demographic keys
+                invalid_keys = [k for k in final_demographics.keys() if not re.match(r"^\w+$", k)]
+                if invalid_keys:
+                    raise ValueError(f"Invalid demographic keys: {invalid_keys}")
+                
+                # All demographics are treated as static predicates
+                predicate_objs[n] = DerivedPredicateConfig(
+                    **p,
+                    # Demographics are assumed to be static (constant within same subject_id)
+                    static=list(final_demographics.keys())
+                )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 89291a0 and 6b2ec02.

📒 Files selected for processing (2)
  • src/aces/config.py (3 hunks)
  • src/aces/predicates.py (2 hunks)
🔇 Additional comments (1)
src/aces/predicates.py (1)

613-668: LGTM! Well-documented example demonstrating static predicates in derived predicates.

The example effectively illustrates how to combine static and regular predicates in derived predicates, with clear test data and expected output that validates the functionality.

src/aces/config.py Outdated Show resolved Hide resolved
Copy link

codecov bot commented Oct 25, 2024

Codecov Report

Attention: Patch coverage is 95.00000% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/aces/config.py 93.75% 1 Missing ⚠️
Files with missing lines Coverage Δ
src/aces/predicates.py 83.59% <100.00%> (-0.41%) ⬇️
src/aces/query.py 72.09% <ø> (-23.66%) ⬇️
src/aces/config.py 97.22% <93.75%> (-0.41%) ⬇️

... and 4 files with indirect coverage changes

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 6b2ec02 and 76cbe91.

📒 Files selected for processing (6)
  • .github/workflows/code-quality-main.yaml (1 hunks)
  • .github/workflows/code-quality-pr.yaml (2 hunks)
  • .github/workflows/tests.yml (1 hunks)
  • .pre-commit-config.yaml (1 hunks)
  • pyproject.toml (1 hunks)
  • src/aces/predicates.py (2 hunks)
🔇 Additional comments (13)
.github/workflows/code-quality-main.yaml (4)

16-16: LGTM: Checkout action upgrade is safe.

The upgrade to actions/checkout@v4 is a stable release that includes improved performance and better support for newer Git features.


23-25: LGTM: Development dependencies installation step is appropriate.

Adding pip install .[dev] ensures all development dependencies are available for code quality checks.

Let's verify the dev dependencies are properly defined:

#!/bin/bash
# Description: Check if dev dependencies are defined in project configuration
# Look for [dev] or dev-dependencies section in project config files

rg -A 5 '\[dev\]|\[tool\.poetry\.dev-dependencies\]' pyproject.toml setup.cfg

28-28: LGTM: Minor version upgrade for pre-commit action.

The upgrade to pre-commit/[email protected] is a patch release that maintains compatibility.

Let's verify pre-commit configuration compatibility:

#!/bin/bash
# Description: Check pre-commit configuration for any version constraints
# Look for version specifications in pre-commit config

cat .pre-commit-config.yaml

18-21: LGTM: Python setup configuration is well-defined.

The upgrade to actions/setup-python@v5 with explicit Python version 3.10 follows best practices for reproducible workflows.

Let's verify Python version compatibility with the project:

✅ Verification successful

Python 3.10 is compatible with the project setup

The verification confirms that Python 3.10 is appropriate for this project based on:

  • The pre-commit configuration explicitly uses --py310-plus in pyupgrade hook, indicating Python 3.10 as the minimum version
  • No conflicting version requirements found in requirements.txt or other configuration files
  • The workflow's Python 3.10 setup aligns with the project's development tooling
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if Python 3.10 is within the supported versions range
# Look for Python version specifications in pyproject.toml or setup.cfg

rg -l 'python_requires|python_version' pyproject.toml setup.cfg

Length of output: 3517


Script:

#!/bin/bash
# Check for Python version requirements in multiple potential locations
echo "=== Checking requirements.txt ==="
fd requirements.txt --exec cat {}

echo -e "\n=== Checking setup.py ==="
fd setup.py --exec cat {}

echo -e "\n=== Checking Python files for version checks ==="
rg "python_version|sys\.version|requires" --type py -C 2

echo -e "\n=== Checking pyupgrade configuration ==="
rg "py\d+" .pre-commit-config.yaml

Length of output: 2033

.github/workflows/tests.yml (2)

20-20: LGTM! Good updates to the GitHub Actions.

The upgrades to actions/checkout@v4 and actions/setup-python@v5 bring performance improvements and security enhancements to the workflow.

Also applies to: 23-23


29-29: Verify if removing editable install mode is intentional.

Removing the -e flag changes from an editable install to a regular package install. While this is generally fine for CI environments, please verify this won't impact any development workflows that might depend on immediate reflection of code changes.

.github/workflows/code-quality-pr.yaml (3)

19-19: LGTM! Good practice to use latest stable checkout action.

The upgrade to actions/checkout@v4 is a good security practice and aligns with other workflow files.


26-28: LGTM! Standard practice for installing development dependencies.

Using .[dev] is the recommended way to install development dependencies.


40-42: Verify impact of running pre-commit on all files.

Changing from checking only modified files to checking all files (--all-files) will provide more thorough validation but may increase PR check duration. This could impact developer productivity if the codebase is large.

✅ Verification successful

Running pre-commit on all files is acceptable given the repository size and hooks

The repository has only 82 files in total, and while there are numerous pre-commit hooks (24 in total), most of them are lightweight checks like trailing whitespace, YAML validation, and code formatting. The hooks are focused on Python code quality, documentation, and general file hygiene. Given the small codebase size, running these checks on all files should not significantly impact PR check duration.

Key observations:

  • Small repository size (82 files)
  • Standard code quality hooks that are typically fast to execute
  • No heavy computational hooks (e.g., no test runners or complex static analyzers)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check the potential impact of running pre-commit on all files

# Count total number of files that would be checked
echo "Number of files that would be checked:"
fd -t f | wc -l

# List all pre-commit hooks to understand the scope
if [ -f .pre-commit-config.yaml ]; then
    echo -e "\nPre-commit hooks that would run:"
    grep "id:" .pre-commit-config.yaml
fi

Length of output: 878

pyproject.toml (1)

40-40: LGTM! Version constraint helps maintain build stability.

The addition of <4 constraint for pre-commit is a good practice to prevent potential breaking changes from future major versions.

Let's verify the compatibility with the current pre-commit setup:

✅ Verification successful

Version constraint is appropriate and compatible with current setup

The pre-commit configuration uses hooks from pre-commit-hooks v5.0.0 and other dependencies that are compatible with pre-commit v3.x. The constraint pre-commit<4 in pyproject.toml is appropriate as it:

  • Ensures stability with the current pre-commit hooks ecosystem
  • Prevents potential breaking changes from pre-commit v4
  • Aligns with the project's existing pre-commit configuration
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify pre-commit configuration compatibility
# Expected: Should find pre-commit-config.yaml and confirm hooks versions

# Check pre-commit config exists and show its content
if [ -f ".pre-commit-config.yaml" ]; then
    echo "Found .pre-commit-config.yaml:"
    cat .pre-commit-config.yaml
else
    echo "Warning: .pre-commit-config.yaml not found"
fi

Length of output: 3531

src/aces/predicates.py (3)

613-668: LGTM! Well-documented test cases.

The added test cases effectively demonstrate the usage of static variables in derived predicates, providing clear examples of configuration and expected behavior.


701-701: LGTM! Clear identification of static variables.

The implementation efficiently extracts static variables from plain predicates.


703-712: Verify handling of inconsistent static values.

The implementation propagates the first value of static variables within each subject_id group. Please verify if this is the intended behavior when a static variable has different values within the same subject_id group.

Run the following script to check for inconsistent static values:

✅ Verification successful

Based on the gathered information, I can now provide a final response:

Taking first value for static variables is the intended behavior

The implementation is correct because:

  • Static variables in this context represent patient demographics that should remain constant per subject
  • The code is part of a larger system where static variables are explicitly marked in configuration (static=True)
  • The README confirms that static variables are handled specially, either stored with null timestamps in MEDS or in a separate subjects_df table in ESGPT
  • If inconsistencies exist in static variables for a subject, taking the first value is a reasonable default as these should be constant by definition
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for subjects with inconsistent static variable values.

# Test: Search for test cases or examples that verify this behavior.
rg -A 5 "static.*=.*True" | rg -B 5 "subject_id.*=.*\["

Length of output: 57


Script:

#!/bin/bash
# Let's try a different approach to find relevant test cases and usage patterns

# Search for test files related to static variables and predicates
fd "test.*\.py" | xargs rg -l "static.*variable"

# Search for actual usage of static variables in the codebase
rg -B 3 -A 3 "static_variables" 

# Look for test data or fixtures that might contain static variable examples
fd "fixtures|test_data" | xargs rg -l "static"

# Search for documentation or comments about static variables
rg -B 2 -A 2 "static.*variable" -g "*.md" -g "*.py"

Length of output: 7186


Script:

#!/bin/bash
# Let's look for the implementation of static variable handling and constraints

# Check the implementation of check_static_variables function
rg -A 15 "^def check_static_variables" src/aces/constraints.py

# Look for documentation about static variables in the configuration
rg -A 5 "static.*=.*True" src/aces/config.py

# Search for test cases with static variables
fd "test_.*\.py" | xargs rg -B 3 -A 3 "check_static_variables"

Length of output: 1725

.github/workflows/code-quality-pr.yaml Show resolved Hide resolved
.pre-commit-config.yaml Show resolved Hide resolved
@justin13601
Copy link
Owner Author

@mmcdermott I tried to implement a way to allow for derived predicates joining static and plain predicates - basically I just copy over the static predicate values from the null timestamp rows throughout the same subject_id if they exist since these are static and do not change. Then the logic for the evaluation of the derived predicates shouldn't need to change. Does this seem reasonable?

* #141 note about memory in README

* #141 warning about memory in the docs

* #142 add warning messages if labels are all the same

* Add error message when predicates are specified using only strings (includes ??? case)

Closes #141, #142, and #146
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (6)
docs/source/configuration.md (2)

52-56: LGTM! Consider adding an example for clarity.

The memory optimization note is a valuable addition that will help users avoid potential performance issues.

Consider adding a brief example to illustrate the recommended approach:

  **Note**: Each individual definition of `PlainPredicateConfig` and `code` will generate a separate predicate
  column. Thus, for memory optimization, it is strongly recommended to match multiple values using either the
  List of Values or Regular Expression formats whenever possible.
+ 
+ For example, instead of:
+ ```yaml
+ - code: "CODE1"
+ - code: "CODE2"
+ - code: "CODE3"
+ ```
+ 
+ Use:
+ ```yaml
+ - code:
+     any: ["CODE1", "CODE2", "CODE3"]
+ # or
+ - code:
+     regex: "CODE[1-3]"
+ ```

63-74: Fix formatting and consider simplifying the explanation.

The clarification of value constraints is helpful, but there are some formatting inconsistencies and opportunities for more concise wording.

Apply these changes to improve readability:

- `value_max` (with these options being decided on the basis of `value_max_inclusive`, where
- `value_max_inclusive=True` indicating that an observation satisfies this predicate if its value is less
- than or equal to `value_max`, and `value_max_inclusive=False` indicating a less than but not equal to
- will be used).
+ `value_max`: If specified, requires the observation's value to be:
+ - Less than or equal to `value_max` when `value_max_inclusive=True`
+ - Less than `value_max` when `value_max_inclusive=False`

- `value_min_inclusive`: See `value_min`
+ `value_min_inclusive`: Controls whether `value_min` is inclusive (True) or exclusive (False)

- `value_max_inclusive`: See `value_max`
+ `value_max_inclusive`: Controls whether `value_max` is inclusive (True) or exclusive (False)

Also, fix the loose punctuation marks by removing extra spaces before list items.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~64-~64: Loose punctuation mark.
Context: ...equal to will be used). - value_max: If specified, an observation will only ...

(UNLIKELY_OPENING_PUNCTUATION)


[style] ~66-~66: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ..._max(with these options being decided on the basis ofvalue_max_inclusive, where value_m...

(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)


[uncategorized] ~71-~71: Loose punctuation mark.
Context: ... will be used). - value_min_inclusive: See value_min - `value_max_inclusive...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~73-~73: Loose punctuation mark.
Context: ...See value_min - value_max_inclusive: See value_max - other_cols: This o...

(UNLIKELY_OPENING_PUNCTUATION)

README.md (2)

221-222: LGTM! Consider adding examples for the new predicate formats.

The documentation clearly explains the new list-based and regex-based predicate formats. To further enhance user understanding, consider adding concrete examples:

  - a list of strings as above in the form of `{any: [???, ???, ...]}`, which will match any of the listed codes.
  - a regex in the form of `{regex: "???"}`, which will match any code that matches that regular expression.
+ - a list of strings as above in the form of `{any: ["lab//SpO2", "lab//O2_sat"]}`, which will match any of the listed codes.
+ - a regex in the form of `{regex: "lab//.*O2.*"}`, which will match any code that matches that regular expression.

229-230: LGTM! Consider adding performance metrics.

The memory optimization note provides valuable guidance. To make it more compelling, consider adding quantitative metrics:

- **Note**: For memory optimization, we strongly recommend using either the List of Values or Regular Expression formats whenever possible, especially when needing to match multiple values. Defining each code as an individual string will increase memory usage significantly, as each code generates a separate predicate column. Using a list or regex consolidates multiple matching codes under a single column, reducing the overall memory footprint.
+ **Note**: For memory optimization, we strongly recommend using either the List of Values or Regular Expression formats whenever possible, especially when needing to match multiple values. Defining each code as an individual string will increase memory usage significantly, as each code generates a separate predicate column. Using a list or regex consolidates multiple matching codes under a single column, reducing the overall memory footprint. For example, consolidating 10 related predicates into a single list-based predicate can reduce memory usage by up to 90% in typical scenarios.
src/aces/predicates.py (1)

699-714: LGTM! Clean implementation of static variable handling.

The implementation efficiently handles static variables in derived predicates by propagating the first value within each subject_id group. This aligns well with the PR objectives.

Consider adding a brief comment explaining why we take the first value for static variables:

 if any(x in static_variables for x in code.input_predicates):
+    # For static variables, take the first value in each subject_id group as these values
+    # should remain constant throughout the subject's timeline
     data = data.with_columns(
         [
             pl.col(static_var)
src/aces/config.py (1)

Line range hint 1198-1320: Enhance error message with format examples.

The error handling for string predicates is good, but the error message could be more helpful by including examples of the correct format.

Consider updating the error message to include format examples:

-                        f"Predicate '{n}' is not defined correctly in the configuration file. "
-                        f"Currently defined as the string: {p}. "
-                        "Please refer to the documentation for the supported formats."
+                        f"Predicate '{n}' is not defined correctly in the configuration file. "
+                        f"Currently defined as the string: {p}. "
+                        "Expected a dictionary with 'code' field, for example:\n"
+                        "  admission:\n"
+                        "    code: 'admission'\n"
+                        "Please refer to the documentation for all supported formats."
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 76cbe91 and 2c57b85.

📒 Files selected for processing (5)
  • README.md (1 hunks)
  • docs/source/configuration.md (1 hunks)
  • src/aces/config.py (2 hunks)
  • src/aces/predicates.py (2 hunks)
  • src/aces/query.py (2 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/source/configuration.md

[uncategorized] ~57-~57: Loose punctuation mark.
Context: ...ormats whenever possible. - value_min: If specified, an observation will only ...

(UNLIKELY_OPENING_PUNCTUATION)


[style] ~59-~59: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ..._min(with these options being decided on the basis ofvalue_min_inclusive, where value_m...

(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)


[uncategorized] ~64-~64: Loose punctuation mark.
Context: ...equal to will be used). - value_max: If specified, an observation will only ...

(UNLIKELY_OPENING_PUNCTUATION)


[style] ~66-~66: ‘on the basis of’ might be wordy. Consider a shorter alternative.
Context: ..._max(with these options being decided on the basis ofvalue_max_inclusive, where value_m...

(EN_WORDINESS_PREMIUM_ON_THE_BASIS_OF)


[uncategorized] ~71-~71: Loose punctuation mark.
Context: ... will be used). - value_min_inclusive: See value_min - `value_max_inclusive...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~73-~73: Loose punctuation mark.
Context: ...See value_min - value_max_inclusive: See value_max - other_cols: This o...

(UNLIKELY_OPENING_PUNCTUATION)

🪛 GitHub Check: codecov/patch
src/aces/query.py

[warning] 142-143: src/aces/query.py#L142-L143
Added lines #L142 - L143 were not covered by tests


[warning] 149-151: src/aces/query.py#L149-L151
Added lines #L149 - L151 were not covered by tests

🔇 Additional comments (3)
src/aces/query.py (2)

7-8: LGTM: Clean import addition

The Counter import is well-placed and appropriately used for analyzing label distributions.


142-147: Add test coverage for label validation logic

The label validation is a critical diagnostic feature that helps users identify potential configuration issues. However, this code segment currently lacks test coverage.

Would you like me to help create test cases for:

  1. Single-value label scenario
  2. Multi-value label scenario
  3. Edge cases (empty labels, null values)?

Here's a script to verify the current test coverage:

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 142-143: src/aces/query.py#L142-L143
Added lines #L142 - L143 were not covered by tests

src/aces/predicates.py (1)

613-668: LGTM! Well-documented example usage.

The new example effectively demonstrates how to use static variables in derived predicates, showing both the configuration and expected behavior.

src/aces/query.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 2c57b85 and 495f2c0.

📒 Files selected for processing (1)
  • src/aces/query.py (2 hunks)
🧰 Additional context used
🪛 GitHub Check: codecov/patch
src/aces/query.py

[warning] 142-143: src/aces/query.py#L142-L143
Added lines #L142 - L143 were not covered by tests


[warning] 149-152: src/aces/query.py#L149-L152
Added lines #L149 - L152 were not covered by tests


[warning] 155-155: src/aces/query.py#L155
Added line #L155 was not covered by tests

🔇 Additional comments (2)
src/aces/query.py (2)

7-7: LGTM!

The Counter import is appropriately placed and well-suited for calculating label distributions.


142-157: Add test coverage for label validation logic.

The new label validation and distribution logging code lacks test coverage. Please add test cases to cover:

  1. Single-label scenario triggering the warning
  2. Multi-label scenario with distribution logging

Would you like help creating the test cases for these scenarios?

✅ Verification successful

Based on the shell script results, I can now provide a final response. We can see that there's an established testing infrastructure with test utilities (tests/utils.py) that includes DataFrame assertions and test data handling. We also see examples of test patterns in test_e2e.py and other test files.

The review comment is valid, but needs refinement based on the codebase context

The label validation logic should indeed be tested, but we should leverage the existing test infrastructure:

  1. Add test cases to test_e2e.py since it already handles similar DataFrame-based scenarios and contains label-related test data
  2. Use the assert_df_equal utility from tests/utils.py for DataFrame comparisons
  3. Follow the existing pattern of using polars DataFrame construction as seen in the test files

Key test locations to add:

  • tests/test_e2e.py: Add new test cases for both single-label and multi-label scenarios
  • Utilize the existing test fixtures and utilities shown in tests/utils.py
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for existing test files and patterns
# Test: Look for existing test patterns we can follow
rg -A 5 "def test.*label" tests/

Length of output: 34


Script:

#!/bin/bash
# Let's check for existing test files related to query.py
fd "test.*query.*\.py$" tests/

# Let's also look for any existing test patterns in the test files
rg -l "test.*cohort" tests/

# Check the actual implementation to understand the test context better
cat src/aces/query.py

Length of output: 7585


Script:

#!/bin/bash
# Let's check for the test directory structure and existing test files
fd "test_.*\.py$" tests/

# Let's also check for any existing test fixtures or utilities
rg -l "fixture" tests/

# Let's specifically look for any DataFrame-related test patterns
rg -A 5 "DataFrame" tests/

Length of output: 6947

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 142-143: src/aces/query.py#L142-L143
Added lines #L142 - L143 were not covered by tests


[warning] 149-152: src/aces/query.py#L149-L152
Added lines #L149 - L152 were not covered by tests


[warning] 155-155: src/aces/query.py#L155
Added line #L155 was not covered by tests

src/aces/query.py Outdated Show resolved Hide resolved
@justin13601
Copy link
Owner Author

Might be a bit difficult to properly test the new error/warning messages in query.py...

@mmcdermott
Copy link
Collaborator

@justin13601, can you add tests to get the codecov up to the desired target?

.github/workflows/code-quality-pr.yaml Show resolved Hide resolved
src/aces/predicates.py Show resolved Hide resolved
src/aces/predicates.py Show resolved Hide resolved
src/aces/query.py Outdated Show resolved Hide resolved
src/aces/query.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
src/aces/query.py (1)

140-146: Add pragma comment to exclude logging from coverage.

The warning logic is well-implemented, but needs to be excluded from coverage requirements.

Add the following pragma comment to line 140:

-        if result["label"].n_unique() == 1:
+        if result["label"].n_unique() == 1:  # pragma: no cover
🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 140-141: src/aces/query.py#L140-L141
Added lines #L140 - L141 were not covered by tests

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 495f2c0 and ca85310.

📒 Files selected for processing (2)
  • .github/workflows/code-quality-pr.yaml (2 hunks)
  • src/aces/query.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/code-quality-pr.yaml
🧰 Additional context used
🪛 GitHub Check: codecov/patch
src/aces/query.py

[warning] 140-141: src/aces/query.py#L140-L141
Added lines #L140 - L141 were not covered by tests

src/aces/query.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
README.md (1)

220-223: LGTM! Consider varying sentence structure.

The documentation clearly explains the different formats for the code field. However, there's a minor style suggestion to improve readability.

Consider varying the sentence beginnings to improve flow:

-  - a string matching values in a column named `code` (for `MEDS` only).
-  - a string with a `//` sequence separating the column name and the matching column value (for `ESGPT` only).
-  - a list of strings as above in the form of `{any: \[???, ???, ...\]}` (or the corresponding expanded indented `YAML` format), which will match any of the listed codes.
-  - a regex in the form of `{regex: "???"}` (or the corresponding expanded indented `YAML` format), which will match any code that matches that regular expression.
+  - For MEDS: a string matching values in a column named `code`.
+  - For ESGPT: a string with a `//` sequence separating the column name and the matching column value.
+  - Multiple values: a list of strings in the form of `{any: \[???, ???, ...\]}` (or the corresponding expanded indented `YAML` format), which will match any of the listed codes.
+  - Pattern matching: a regex in the form of `{regex: "???"}` (or the corresponding expanded indented `YAML` format), which will match any code that matches that regular expression.
🧰 Tools
🪛 LanguageTool

[style] ~223-~223: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...will match any of the listed codes. - a regex in the form of {regex: "???"} (...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between ca85310 and 7121127.

📒 Files selected for processing (2)
  • README.md (1 hunks)
  • src/aces/query.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/aces/query.py
🧰 Additional context used
🪛 LanguageTool
README.md

[style] ~223-~223: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...will match any of the listed codes. - a regex in the form of {regex: "???"} (...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🔇 Additional comments (1)
README.md (1)

230-231: LGTM! Excellent performance optimization guidance.

The added note about memory optimization is valuable, providing clear guidance on using List of Values or Regular Expression formats to reduce memory usage. This is particularly important for scalability.

@justin13601
Copy link
Owner Author

Also closes #94

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
src/aces/config.py (1)

1198-1214: Enhance error message clarity for invalid predicate definitions.

The error message could be more helpful by providing an example of the correct format.

Consider updating the error message to include an example:

-            raise ValueError(
-                f"Predicate '{n}' is not defined correctly in the configuration file. "
-                f"Currently defined as the string: {p}. "
-                "Please refer to the documentation for the supported formats."
-            )
+            raise ValueError(
+                f"Predicate '{n}' is not defined correctly in the configuration file. "
+                f"Currently defined as the string: {p}. "
+                "Predicates must be defined as dictionaries with a 'code' field, for example: "
+                "{'code': 'admission', 'static': false}"
+            )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 7121127 and f5b0dbc.

📒 Files selected for processing (2)
  • src/aces/config.py (5 hunks)
  • src/aces/query.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/aces/query.py
🧰 Additional context used
🪛 GitHub Check: codecov/patch
src/aces/config.py

[warning] 1311-1311: src/aces/config.py#L1311
Added line #L1311 was not covered by tests


[warning] 1313-1314: src/aces/config.py#L1313-L1314
Added lines #L1313 - L1314 were not covered by tests


[warning] 1316-1317: src/aces/config.py#L1316-L1317
Added lines #L1316 - L1317 were not covered by tests


[warning] 1323-1326: src/aces/config.py#L1323-L1326
Added lines #L1323 - L1326 were not covered by tests


[warning] 1329-1329: src/aces/config.py#L1329
Added line #L1329 was not covered by tests

🔇 Additional comments (4)
src/aces/config.py (4)

1338-1343: Improved validation for string predicates.

The added validation prevents silent failures by explicitly checking for and rejecting string-based predicate definitions. This is a good defensive programming practice.


1392-1394: Enhanced error message readability.

The error message formatting has been improved to show the count of missing relationships, making it easier to understand the scope of configuration issues.


1292-1305: Improved referenced predicates validation.

The code now properly collects and validates all referenced predicates, including those from windows and triggers. This helps catch configuration errors early.


1310-1330: Verify recursive predicate validation implementation.

The new recursive validation logic for derived predicates is not covered by tests. While the logic appears sound, we should ensure it handles all edge cases correctly.

Consider adding test cases that cover:

  1. Deeply nested derived predicates
  2. Circular references
  3. Missing nested predicates
🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 1311-1311: src/aces/config.py#L1311
Added line #L1311 was not covered by tests


[warning] 1313-1314: src/aces/config.py#L1313-L1314
Added lines #L1313 - L1314 were not covered by tests


[warning] 1316-1317: src/aces/config.py#L1316-L1317
Added lines #L1316 - L1317 were not covered by tests


[warning] 1323-1326: src/aces/config.py#L1323-L1326
Added lines #L1323 - L1326 were not covered by tests


[warning] 1329-1329: src/aces/config.py#L1329
Added line #L1329 was not covered by tests

@justin13601
Copy link
Owner Author

@mmcdermott I realized nested derived predicates will most likely be needed for creating these different reference ranges so I tried to enable that. Let me know if the following seems reasonable:

To create different reference ranges (for illustration purposes, not following actual clinical guidelines):

Different normal ranges:

  normal_spo2_male_range:
    code: lab_name//O2 saturation pulseoxymetry (%)
    value_min: 90
    value_max: 120
    value_min_inclusive: True
    value_max_inclusive: True
  normal_spo2_female_range:
    code: lab_name//O2 saturation pulseoxymetry (%)
    value_min: 80
    value_max: 110
    value_min_inclusive: True
    value_max_inclusive: True
  normal_spo2_male:
    expr: and(normal_spo2_male_range, male)
  normal_spo2_female:
    expr: and(normal_spo2_female_range, female)

Different abnormal ranges (may be worth looking into not() logic...:

  abnormally_low_spo2_male:
    code: lab_name//O2 saturation pulseoxymetry (%)
    value_max: 90
    value_max_inclusive: False
  abnormally_low_spo2_female:
    code: lab_name//O2 saturation pulseoxymetry (%)
    value_max: 80
    value_max_inclusive: False
  abnormally_high_spo2:
    code: lab_name//O2 saturation pulseoxymetry (%)
    value_min: 120
    value_min_inclusive: False
  abnormal_spo2_male_range:
    expr: or(abnormally_low_spo2_male, abnormally_high_spo2)
  abnormal_spo2_male:
    expr: and(abnormal_spo2_male_range, male)
  abnormal_spo2_female_range:
    expr: or(abnormally_low_spo2_female, abnormally_high_spo2)
  abnormal_spo2_female:
    expr: and(abnormal_spo2_female_range, female)
  abnormal_spo2:
    expr: or(abnormal_spo2_male, abnormal_spo2_female)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow derived predicates between static predicates and plain predicates
2 participants