Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-factor ipc check function #72

Merged
merged 2 commits into from
Nov 4, 2024
Merged

Re-factor ipc check function #72

merged 2 commits into from
Nov 4, 2024

Conversation

tomaszaba
Copy link
Collaborator

@tomaszaba tomaszaba commented Nov 3, 2024

Hi @ernestguevarra , please review this PR.
I re-factored the IPC check function.

In this branch I also attempted to address the issue #71 but there was an error when I deleted them from data file: devtools::check() was not passing because it was not finding those datasets. I use those datasets in some examples and tests about prevalence, but I wanted to keep them as internal. I reviewed this chapter: https://r-pkgs.org/data.html#sec-data-sysdata and the approach is used to create the sysdata.rda is correct. I wonder if the error stem from the fact that I documented these datasets?

You will not see this error in the PR that I just raised because I cancelled changes on this.

If OK with the changes on the main objective of this branch/PR, please approve and we will address the data issue in the coming branch. Alternatively, please feel free to fix the issue locally and push your changes.

Thanks.

@tomaszaba tomaszaba added documentation Improvements or additions to documentation refactor code improvement/refactoring testing code testing labels Nov 3, 2024
@tomaszaba tomaszaba added this to the 3. Re-factor functions milestone Nov 3, 2024
@ernestguevarra
Copy link
Member

ernestguevarra commented Nov 4, 2024

In this branch I also attempted to address the issue #71 but there was an error when I deleted them from data file: devtools::check() was not passing because it was not finding those datasets. I use those datasets in some examples and tests about prevalence, but I wanted to keep them as internal. I reviewed this chapter: https://r-pkgs.org/data.html#sec-data-sysdata and the approach is used to create the sysdata.rda is correct. I wonder if the error stem from the fact that I documented these datasets?

When you make included dataset/datasets in your package as internal, that dataset becomes available to you for use within a function. So, for example, in zscorer, the LMS table for calculating z-score is an internal dataset that is used as a reference table within addWGSR() for calculating z-scores using the LMS method. If you look at the examples stated in Hadley's book for packages that uses an internal dataset (munsell package and googledrive and googlesheets4 package), you will note also that they use the internal dataset as something they refer to within the functions themselves. Internal datasets are also available/loaded lazily for use for tests. So, the main reason for using internal datasets is to have a reference data for use internally within functions (primary usage) and for testing and that the dataset/datasets are not useful for users as is hence you keep them internal. For these purposes, you can just simply make a call to the dataset name within a function or in a test.

But if you want to use an internal dataset in an example in documentation, you cannot just invoke the dataset by its name because they are not lazily loaded during the running of examples. This is why when you removed the exported datasets, the examples that use them are erroring (but the test are running ok). If you want to use an internal dataset for examples in your function documentation, you need to use the triple-colon accessor (:::). So for you, you need to use mwana:::wfhz.01 to access the whfz.01 internal dataset after you remove the exported datasets.

When you make included dataset/datasets in your package as external, on the other hand, that dataset becomes available to you for use in examples and tests just by invoking the name of the dataset. However, when you want to use it within one of your functions, you need to use the double colon accessor (::) to access that dataset. So, if you wanted to use your anthro.01 dataset within a function, you will need to use mwana::anthro.01.

In your package, you don't use any of the datasets within a function. You use the included dataset for examples (mainly in the outliers utility functions) and mostly for tests. So, unless you have a really good reason to use internal instead (other than for the reasons stated in Hadley's book), for the usage that you want them for, what you really want are external datasets to use for examples and testing.

I am very confident with what I am saying above is appropriate and best practice because my very first package that I learned to create (and is my first package on CRAN) is a data package so I learned early on the difference between an internal dataset and an external dataset. Then immediately, my second package is zscorer that needed a reference LMS table that is of no immediate use for users and can be very complex and confusing for general users if exported and made available to them. Hence, this was made internal while data for testing and examples where exported (made external).

I also tested what I said above on your package.

Test 1. I removed the exported datasets that are also internal and then removed documentation and then used the triple colon accesssor (:::) in your examples and devtools::check() ran without errors.

Tests 2. I removed sysdata.rda and devtools::check() ran without errors.

For your package and how you use the accompanying dataset, it will be simpler for you to make all your datasets external (remove sysdata.rda).

When trying to understand why you want some internal and some external, the only possible logic that I can think of is that you worry that those 3 datasets that you made internal are SMART datasets that you are not allowed to distribute or don't have permission to share widely hence you think by making them internal you are keeping others from getting access to them.

I can tell you now that if that is your logic, then making them internal will not make it OK because the data is still accessible by anyone who knows what an internal R dataset is and will be able to access this dataset programmatically once they install the package or from the GitHub repository. Internal dataset doesn't mean private or no access dataset. These datasets can be accessed by other people than you. But even if it were the case that internal datasets were private, just the fact that you still exported them means that the whole logic of you making them internal in the first place is contradicted by you then making them external.

My final point here is that even if we suppose that you really wanted those datasets to be both internal and external, it should have been obvious to you that because the datasets had the same name internally and externally, they would have overwritten each other in the common environments that they are loaded into. And because you were not experiencing errors when you both had them internal and external, this tells us that only the external datasets are actually being used in your package (it is overwriting the lazily loaded internal dataset).

I'd like to say that this comment is a suggestion but I can't live with myself to allow both internal and external dataset to exist the way you have implemented it in this package. This has to change and you just need to choose which approach to take as described above. In my opinion, keeping all of your datasets external and removing the sysdata.rda is the most clean and coherent approach.

Copy link
Member

@ernestguevarra ernestguevarra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomaszaba see my comment above about your internal and external dataset approach and my explanation of what internal and external datasets are for and my recommendations of what the way forward should be. I am approving this pull request as I know you don't like hearing my feedback, but please listen to what I am saying about the datasets. You are not doing this correctly and I give my reasons behind this above.

@tomaszaba
Copy link
Collaborator Author

In this branch I also attempted to address the issue #71 but there was an error when I deleted them from data file: devtools::check() was not passing because it was not finding those datasets. I use those datasets in some examples and tests about prevalence, but I wanted to keep them as internal. I reviewed this chapter: https://r-pkgs.org/data.html#sec-data-sysdata and the approach is used to create the sysdata.rda is correct. I wonder if the error stem from the fact that I documented these datasets?

When you make included dataset/datasets in your package as internal, that dataset becomes available to you for use within a function. So, for example, in zscorer, the LMS table for calculating z-score is an internal dataset that is used as a reference table within addWGSR() for calculating z-scores using the LMS method. If you look at the examples stated in Hadley's book for packages that uses an internal dataset (munsell package and googledrive and googlesheets4 package), you will note also that they use the internal dataset as something they refer to within the functions themselves. Internal datasets are also available/loaded lazily for use for tests. So, the main reason for using internal datasets is to have a reference data for use internally within functions (primary usage) and for testing and that the dataset/datasets are not useful for users as is hence you keep them internal. For these purposes, you can just simply make a call to the dataset name within a function or in a test.

But if you want to use an internal dataset in an example in documentation, you cannot just invoke the dataset by its name because they are not lazily loaded during the running of examples. This is why when you removed the exported datasets, the examples that use them are erroring (but the test are running ok). If you want to use an internal dataset for examples in your function documentation, you need to use the triple-colon accessor (:::). So for you, you need to use mwana:::wfhz.01 to access the whfz.01 internal dataset after you remove the exported datasets.

When you make included dataset/datasets in your package as external, on the other hand, that dataset becomes available to you for use in examples and tests just by invoking the name of the dataset. However, when you want to use it within one of your functions, you need to use the double colon accessor (::) to access that dataset. So, if you wanted to use your anthro.01 dataset within a function, you will need to use mwana::anthro.01.

In your package, you don't use any of the datasets within a function. You use the included dataset for examples (mainly in the outliers utility functions) and mostly for tests. So, unless you have a really good reason to use internal instead (other than for the reasons stated in Hadley's book), for the usage that you want them for, what you really want are external datasets to use for examples and testing.

I am very confident with what I am saying above is appropriate and best practice because my very first package that I learned to create (and is my first package on CRAN) is a data package so I learned early on the difference between an internal dataset and an external dataset. Then immediately, my second package is zscorer that needed a reference LMS table that is of no immediate use for users and can be very complex and confusing for general users if exported and made available to them. Hence, this was made internal while data for testing and examples where exported (made external).

I also tested what I said above on your package.

Test 1. I removed the exported datasets that are also internal and then removed documentation and then used the triple colon accesssor (:::) in your examples and devtools::check() ran without errors.

Tests 2. I removed sysdata.rda and devtools::check() ran without errors.

For your package and how you use the accompanying dataset, it will be simpler for you to make all your datasets external (remove sysdata.rda).

When trying to understand why you want some internal and some external, the only possible logic that I can think of is that you worry that those 3 datasets that you made internal are SMART datasets that you are not allowed to distribute or don't have permission to share widely hence you think by making them internal you are keeping others from getting access to them.

No, it is not about that. I just thought I already had enough example datasets to export.

I can tell you now that if that is your logic, then making them internal will not make it OK because the data is still accessible by anyone who knows what an internal R dataset is and will be able to access this dataset programmatically once they install the package or from the GitHub repository. Internal dataset doesn't mean private or no access dataset. These datasets can be accessed by other people than you. But even if it were the case that internal datasets were private, just the fact that you still exported them means that the whole logic of you making them internal in the first place is contradicted by you then making them external.

My final point here is that even if we suppose that you really wanted those datasets to be both internal and external, it should have been obvious to you that because the datasets had the same name internally and externally, they would have overwritten each other in the common environments that they are loaded into. And because you were not experiencing errors when you both had them internal and external, this tells us that only the external datasets are actually being used in your package (it is overwriting the lazily loaded internal dataset).

I'd like to say that this comment is a suggestion but I can't live with myself to allow both internal and external dataset to exist the way you have implemented it in this package. This has to change and you just need to choose which approach to take as described above. In my opinion, keeping all of your datasets external and removing the sysdata.rda is the most clean and coherent approach.

Hi @ernestguevarra , I really appreciate your comprehensive explanation about the difference between external and internal datasets. I really found it very helpful to me and to my future package development. Based on this and on your recommendation, I have deleted the sysdata.rda file and pushed the changes.

@tomaszaba
Copy link
Collaborator Author

@tomaszaba see my comment above about your internal and external dataset approach and my explanation of what internal and external datasets are for and my recommendations of what the way forward should be. I am approving this pull request as I know you don't like hearing my feedback, but please listen to what I am saying about the datasets. You are not doing this correctly and I give my reasons behind this above.

Thank you, Ernest. To have read your comment was absolutely useful to clear my understanding about the differences between an internal and external dataset. I appreciate. I have removed the sysdata.rda file.

Copy link
Member

@ernestguevarra ernestguevarra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomaszaba all checks pass. the main issue I raised with this PR has now been addressed. Please feel free to merge to dev and/or main as you deem necessary.

@tomaszaba
Copy link
Collaborator Author

@tomaszaba all checks pass. the main issue I raised with this PR has now been addressed. Please feel free to merge to dev and/or main as you deem necessary.

Great! Thanks. I will merge it to dev.

@tomaszaba tomaszaba merged commit 4f86ea3 into dev Nov 4, 2024
5 checks passed
@tomaszaba tomaszaba deleted the ipc-check branch November 4, 2024 09:04
@ernestguevarra ernestguevarra linked an issue Nov 4, 2024 that may be closed by this pull request
@ernestguevarra ernestguevarra linked an issue Nov 4, 2024 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation refactor code improvement/refactoring testing code testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

find most suitable hex sticker
2 participants