Release version 1.0.0 #227

Merged
merged 33 commits into from
May 6, 2024

dfalster
Member

@dfalster dfalster commented May 6, 2024

First major release of APCalign. A preprint is available at
https://www.biorxiv.org/content/10.1101/2024.02.02.578715v1.
Article has been accepted for publication at Australian Journal of Botany.

Following review, a number of changes have been implemented. These have sped up
and streamlined the package.

  • Update function documentation
  • Speed up extract_genus
  • Write a replacement function for stringr::word that is much faster.
  • Improve the speed and accuracy of the fuzzy_match function by
    • Restricting the reference list to names with the same first letter as the input string.
    • Switching from utils::adist to stringdist::stringdist(method = "dl")
  • Rework standardise_names to remove punctuation from the start of the string
  • Rework strip_names_extra (previously strip_names_2) to just perform
    additional functions to strip_names, rather than repeating those performed by strip_names.
  • Avoid importing entire packages by using package::function format throughout
    and removing functions from @import
  • Add fuzzy match arguments to create_taxonomic_update_lookup
  • Add 3 additional family-level APC matches to match_taxa.
  • Refine tests
  • Make messages to console optional
  • Fix failures that occurred when GitHub is down (fix CRAN issue before release #205)
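
The last two items (optional console messages and failing gracefully when GitHub is unreachable) can be pictured with a small sketch. This is written in Python purely for illustration and is not the package's R code: `load_resource`, `fetch`, `quiet` and `simulated_outage` are hypothetical names, and the choice to return `None` on failure is an assumption.

```python
def load_resource(fetch, quiet=False):
    """Call `fetch` and fail gracefully if the network is unavailable.

    `fetch` is any zero-argument callable that downloads and returns data.
    Returning None on failure (instead of raising) is an assumption made
    for this sketch.
    """
    try:
        return fetch()
    except OSError as err:  # covers ConnectionError and urllib's URLError
        if not quiet:
            print(f"Could not download taxonomic resources: {err}")
        return None


def simulated_outage():
    """Stand-in for a download attempt while the host is down."""
    raise ConnectionError("host unreachable")


# With quiet=True nothing is printed and the caller just receives None.
result = load_resource(simulated_outage, quiet=True)
```

Callers then check for `None` rather than wrapping every call in their own error handling.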

fontikar and others added 30 commits April 18, 2024 10:36
* Added redevp to gitignore
* Bumped version and refined graceful failing
*  minor syntax fixes
*  corrections to matches that can't match to genus (these were still assigning taxon_rank = genus)
*  remove test checks for alignment codes (creating unnecessary errors)
---------

Co-authored-by: Fonti Kar <[email protected]>
Co-authored-by: Daniel Falster <[email protected]>
* This PR refactors a few functions to increase speed. The time to run load_taxonomic_resources has dropped from 15.0s to 2.2s (on Daniel's MacBook Pro M2)

* Faster version of extract_genus (#187)
* Faster version of stringr::word
* Function to standardise taxon rank
* Speed up strip_name
* update tests
* First commit updated DESCRIPTION and NEWS

* Updated installation instructions

* Added reproducibility article and exported default_version

* Added citation

* Added reproducibility article

* Update vignettes/articles/reproducibility.Rmd

---------

Co-authored-by: Daniel Falster <[email protected]>
* adding progress bar for loading

* trying to get caching/output option to work

* passing output through

* reviving caching

* fixing counting

* roxygen update

* adding quiet option

* checking cached file

* documenting caching functionality

* getting message working

* removing cutting edge arrow

* reverting change back to cran, too soon

* nope arrow github not working yet
Changes to `standardise_names` to handle corner cases that were previously being missed. These mainly involve removing stray punctuation at the beginning and end of name strings.

There were also minor required tweaks to `extract_genus` to ensure genera were split on "\" and that names were standardised to remove stray characters at the beginning of strings before genus names were extracted.
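
As a rough illustration of the idea (not the package's R implementation), the sketch below strips stray punctuation from both ends of a name and then takes the first word as the genus. The function names, the regular expression and the whitespace split are all assumptions made for this example; note that this simple version would also drop a meaningful trailing full stop such as the one in "sp.".

```python
import re


def strip_edge_punctuation(name: str) -> str:
    """Remove whitespace, punctuation and underscores from both ends."""
    # [\W_] matches any character that is not a letter or digit.
    return re.sub(r"^[\W_]+|[\W_]+$", "", name)


def first_word(name: str) -> str:
    """Clean the string first, then return its first whitespace-separated word."""
    parts = strip_edge_punctuation(name).split()
    return parts[0] if parts else ""


cleaned = strip_edge_punctuation("  ?Acacia aneura. ")
genus = first_word("-- Banksia serrata")
```

Cleaning before splitting matters: without it, the "genus" of the second string would be `--`.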

As a final step, the expected changes were made to the tests for standardise_names, strip_names, strip_names_extra, and extract_genus. The outputs for a list of 42 unusual names are now all correct.

Closes #197
…provements (#203)

* removing hard cap on file size of current downloads.  this is slower, but safer going forward

* better wording in documentation

* other place there was a hard cap
Add message that indicates how many taxa have perfect matches to APC.
* trying to update actions to best practices

* further updating

* more updates

* adding develop back

* changing release hash

* how do commit hashes work?

* another try

* giving up

* really giving up
- fix spelling in name
- remove duplicate set of tests
Updates to the family-level matching algorithms to allow:

- fuzzy matches to APC-accepted and APC-synonymous families
- updates from APC-synonymous family names to accepted APC family names
---------

Co-authored-by: Will Cornwell <[email protected]>
Co-authored-by: Daniel Falster <[email protected]>
-    amend fuzzy matching algorithm to only compare to subset of accepted_list with the same first letter
-   greatly speeds up fuzzy matching
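
A minimal sketch of why this helps, in Python rather than the package's R: grouping the reference names by first letter once means each query is only compared against a small bucket. The names `index_by_first_letter` and `candidates_for` are hypothetical, and the case-insensitive comparison is an assumption.

```python
from collections import defaultdict


def index_by_first_letter(reference_names):
    """Group reference names by their (lower-cased) first letter."""
    index = defaultdict(list)
    for name in reference_names:
        if name:
            index[name[0].lower()].append(name)
    return index


def candidates_for(query, index):
    """Return only the reference names that share the query's first letter."""
    if not query:
        return []
    return index.get(query[0].lower(), [])


reference = ["Acacia aneura", "Acacia dealbata", "Banksia serrata", "Eucalyptus saligna"]
index = index_by_first_letter(reference)
subset = candidates_for("Acaca aneura", index)  # 2 candidates instead of 4
```

The trade-off is that a name with a typo in its first letter can never be fuzzy-matched under this restriction.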

---------

Co-authored-by: Daniel Falster <[email protected]>
The fuzzy_match function had not previously worked if n_allowed > 1 (the number of shortest-distance matches), even though `n_allowed` was included as an argument in the function. The actual APCalign functions still do not have `n_allowed` included as an argument (they use n_allowed = 1), but fixing fuzzy_match is the first step toward eventually implementing this.

Also added simple tests to confirm 1 vs 2 outputs, as expected.
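
To make the 1 vs 2 outputs concrete, here is a small Python sketch of returning the `n_allowed` closest reference names. It is not the package's code: `closest_matches` is a hypothetical helper, and plain Levenshtein distance is used only to keep the example short.

```python
import heapq


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def closest_matches(query, reference_names, n_allowed=1):
    """Return up to `n_allowed` reference names, closest first."""
    scored = ((levenshtein(query, name), name) for name in reference_names)
    return [name for _, name in heapq.nsmallest(n_allowed, scored)]


reference = ["Acacia aneura", "Acacia aneurra", "Banksia serrata"]
one = closest_matches("Acacia aneur", reference, n_allowed=1)
two = closest_matches("Acacia aneur", reference, n_allowed=2)
```

Here `one` holds a single name and `two` holds the two nearest names in order of distance.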
---------

Co-authored-by: Fonti Kar <[email protected]>
Co-authored-by: Daniel Falster <[email protected]>
Fixes a known issue when reading in identifiers from a column - if there were two rows with distinct identifiers but the same original_name, the code broke.

Identifier has now been added to lines of code in `align_taxa.R` that were determining how many distinct rows to retain for matching.

There will now, occasionally, be repeat original names run through the match algorithms, but this is necessary to attach the correct identifier to each instance of the original name.
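
The effect of adding the identifier to the de-duplication step can be sketched as follows. This is a Python illustration under assumptions, not the code in `align_taxa.R`: `distinct_rows` is a hypothetical helper and the example rows are invented.

```python
def distinct_rows(rows, keys):
    """Keep the first row for each distinct combination of `keys`."""
    seen = set()
    kept = []
    for row in rows:
        combo = tuple(row[key] for key in keys)
        if combo not in seen:
            seen.add(combo)
            kept.append(row)
    return kept


rows = [
    {"original_name": "Acacia sp.", "identifier": "site_1"},
    {"original_name": "Acacia sp.", "identifier": "site_2"},
    {"original_name": "Acacia sp.", "identifier": "site_1"},
]

# De-duplicating on the name alone collapses rows with different identifiers.
by_name = distinct_rows(rows, ["original_name"])
# Including the identifier keeps one row per (name, identifier) pair.
by_name_and_id = distinct_rows(rows, ["original_name", "identifier"])
```

The second call keeps two rows, so the same name is matched twice, which is the repeat work described above.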

I've also added a new test.

Closes issue #177
Switching from utils::adist to stringdist::stringdist for matching. This is both much faster and allows us to use a more nuanced matching algorithm: the Damerau–Levenshtein distance method, which prioritises types of string changes (based on their algorithm).
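
For readers unfamiliar with the distance, the Python sketch below shows the key difference from plain Levenshtein: swapping two adjacent characters counts as one edit rather than two. Note the assumption: this is the restricted "optimal string alignment" variant, chosen because it is short, whereas `method = "dl"` in stringdist is the full Damerau–Levenshtein distance, so results can differ in some cases.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "tp" <-> "pt", costs one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


swapped = osa_distance("Eucalyptus", "Eucalytpus")  # one transposition
```

Under plain Levenshtein the same pair would cost two substitutions, so transposed-letter typos sit further from the correct name.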

I've run all 47,000 AusTraits names through this and there were 33 that were different - it seems they are all instances of names that were passed over during fuzzy matching (match 5's) previously and now are being caught. So some additional matching power, but nothing being misaligned.

(Also fixes a minor bug: `distinct()` was being run on the entire row rather than on original_name, which led to the humorous output of more perfect matches than total taxa being checked.)
Needed to add an `if`/`else` branch to `fuzzy_match.R` so that fuzzy matches are only searched for if the subset of the accepted list (names with the same first letter) is non-empty. Previously, if no strings on the accepted list shared a first letter with the input text, warnings were generated.

Test added to check this functionality.
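
The guard can be sketched in Python as below. This illustrates the control flow only and is not the code in `fuzzy_match.R`: `fuzzy_match_or_none` is a hypothetical name, returning `None` for "no match" is an assumption, and the standard-library `difflib.get_close_matches` stands in for the real distance calculation.

```python
import difflib


def fuzzy_match_or_none(query, reference_names):
    """Search for a fuzzy match only if some candidate shares the first letter."""
    subset = [
        name for name in reference_names
        if name[:1].lower() == query[:1].lower()
    ]
    if not subset:
        # Nothing to compare against, so skip the search entirely.
        return None
    matches = difflib.get_close_matches(query, subset, n=1)
    return matches[0] if matches else None


reference = ["Acacia aneura", "Banksia serrata"]
hit = fuzzy_match_or_none("Acaca aneura", reference)
miss = fuzzy_match_or_none("Zieria smithii", reference)
```

Returning early keeps the distance routine from ever being called on an empty candidate list.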
Add fuzzy match arguments to  `create_taxonomic_update_lookup`

We'd omitted the fuzzy match arguments from `create_taxonomic_update_lookup`, which meant users who wanted to change the fuzzy match sliders would need to separately align and update taxonomy.

Closes issue #212
As part of #196, we found that stringr::word was quite slow, and so implemented a faster version. This PR makes the new word function a private function accessible via APCalign:::word, adds tests for the new function, and extends use of this new function throughout.
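
For context, `stringr::word` extracts one or more words from a string by position. A minimal Python sketch of that behaviour is below. It is an illustration only, not the package's replacement function: the signature mirrors the R defaults (1-based `start`, `end` defaulting to `start`, a single-space separator) but omits features such as negative indices.

```python
def word(string, start=1, end=None, sep=" "):
    """Return words `start` through `end` (1-based, inclusive) of `string`."""
    if end is None:
        end = start
    parts = string.split(sep)
    return sep.join(parts[start - 1:end])


name = "Acacia aneura var. major"
genus = word(name)          # first word only
binomial = word(name, 1, 2)  # first two words
```

A plain split-and-slice like this avoids the overhead of a general-purpose pattern engine, which is the kind of saving a specialised replacement can offer.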

Co-authored-by: ehwenk <[email protected]>
* removing unused function option and updating readme

* more readme updates

* more work on readme
* better description of `imprecise_fuzzy_matches`

closes issue #155
* cleaning up the namespace
* Remove importing of dplyr, stringr, remove tibble
* add explicit namespace to calls of relevant functions
* Add the pipe

---------

Co-authored-by: Daniel Falster <[email protected]>
Greatly reduced the number of lines longer than 80 characters in all R files except match_taxa.R, which we will likely refactor, as it is the one file with many longer lines within the code itself.

Closes #188
ehwenk and others added 2 commits May 5, 2024 17:48
* update roxygen documentation for all functions

---------

Co-authored-by: Will Cornwell <[email protected]>
Co-authored-by: Daniel Falster <[email protected]>
@dfalster dfalster requested review from ehwenk and fontikar May 6, 2024 03:42
ehwenk
ehwenk previously approved these changes May 6, 2024
Collaborator

@ehwenk ehwenk left a comment


It is excellent to see how small refinements can keep improving the package.

I corrected a few typos in your message, but otherwise all good.

R/reexports.R Show resolved Hide resolved
fontikar
fontikar previously approved these changes May 6, 2024
Collaborator

@fontikar fontikar left a comment


Minor change for importing pipe but not crucial!

@dfalster dfalster dismissed stale reviews from fontikar and ehwenk via 809b11c May 6, 2024 06:41
@dfalster dfalster merged commit 6bc7e1f into master May 6, 2024
9 checks passed
dfalster added a commit that referenced this pull request May 6, 2024
First major release of APCalign. A preprint is available at
https://www.biorxiv.org/content/10.1101/2024.02.02.578715v1.
Article has been accepted for publication at Australian Journal of Botany.

Following review, a number of changes have been implemented. These have sped up
and streamlined the package.

* Update function documentation
* Speed up `extract_genus`
* Write a replacement function for `stringr::word` that is much faster.
* Improve the speed and accuracy of the `fuzzy_match` function by
  - Restricting the reference list to names with the same first letter as the input string.
  - Switching from `utils::adist` to `stringdist::stringdist(method = "dl")`
* Rework `standardise_names` to remove punctuation from the start of the string
* Rework `strip_names_extra` (previously `strip_names_2`) to just perform
additional functions to `strip_names`, rather than repeating those performed by `strip_names`.
* Avoid importing entire packages by using package::function format throughout
and removing functions from @import
* Add fuzzy match arguments to `create_taxonomic_update_lookup`
* Add 3 additional family-level APC matches to `match_taxa`.
* Refine tests
* Make messages to console optional
* Fix failures that occurred when GitHub is down (#205)

---------

Co-authored-by: Elizabeth Wenk <[email protected]>
Co-authored-by: Fonti Kar <[email protected]>
Co-authored-by: Daniel Falster <[email protected]>
Co-authored-by: Will Cornwell <[email protected]>
4 participants