Problem Description
The original `boxplot` function (`experiments/RegInf/utils.py`) was designed to create a box plot with marginal distributions. However, it contained a bug where the input DataFrame, `data`, was being overwritten after a `groupby` operation. This resulted in a loss of the original DataFrame structure, leading to a `KeyError` when attempting to group by the `hue` column later in the code. The function was trying to access a column that no longer existed in the modified DataFrame.
Proposed Changes
To resolve this issue, I have made the following changes to the function:
- Separation of Data Manipulation and Plotting: I introduced a new variable, `plot_data`, as a copy of the input DataFrame. This ensures that the original data is not altered during the plotting process. All manipulations are performed on `plot_data` instead.
- Categorical Conversion and Type Assertion: The function now checks and converts the `x` and `hue` columns to categorical data types if they are not already. This ensures that the grouping operations work as expected.
- Stacked Bar Plot Calculation: The calculation for the fractions in the stacked bar plot has been corrected. Instead of modifying the original DataFrame, a new `grouped_data` variable is created. It stores the normalized value counts necessary for the stacked bar plot, preserving the original data.
- Legend Handling: The function has been updated to include a legend only when there is more than one level in the `hue` column, enhancing the clarity of the plot when hue distinctions are present.
These changes correct the error and improve the function's robustness by preventing unintended side effects on the input data.
Additional Notes
The revised function has been thoroughly tested to ensure that it handles the input data correctly and that the resulting plots are generated as expected.
I believe these improvements will enhance the functionality and user experience for others utilizing this box plot function.