Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: match_name gains join_id col, allowing for an initial matching override based on some unique ID column #460

Merged
merged 28 commits into from
Mar 14, 2024

Conversation

jdhoffa
Copy link
Member

@jdhoffa jdhoffa commented Mar 7, 2024

This ID column could be, for example, lei or isin.

Some open questions I have before marking this as ready for review:

  • Currently the ID column must be identical between each dataset, but I think perhaps it might be better to allow for a named argument, with one name indicating the loanbook column name (e.g. lei_direct_loantaker), and another column indicating the abcd column name (e.g. lei)
  • Currently the functionality only allows for a SINGLE join column. e.g. there is no way to give a handful of columns, and join by priority (ISIN -> LEI -> etc.)
  • Definitely needs way more tests. Especially, there should be some input checks that assess that there aren't multiple joins per company. I can imagine a loanbook with multiple identical LEIs, and an abcd with several name_company values that have the same LEI. Need to consider what to do in that case

See reprex:

library(tibble)
library(r2dii.match)

loanbook <- tibble(
  sector_classification_system = "NACE",
  sector_classification_direct_loantaker = "100", # this generally shouldn't match to anything
  id_ultimate_parent = c("UP15", "UP16"),
  name_ultimate_parent = c("Foo", "Bar"),
  id_direct_loantaker = c("C294", "C295"),
  name_direct_loantaker = "Yuamen Xinneng Thermal Power Co Ltd",
  lei = c("LEI123", NA_character_)
)

abcd <- tibble(
  name_company = "alpine knits india pvt. limited",
  sector = "power",
  lei = "LEI123"
)

# no match found, as expected
# outputs "message" that no match was found by fuzzy matching, and warning that no match was found overall
match_name(loanbook, abcd, join_id = NULL)
#> Found no match via fuzzy matching.
#> Warning: Found no match.
#> # A tibble: 0 × 16
#> # ℹ 16 variables: sector_classification_system <chr>,
#> #   sector_classification_direct_loantaker <chr>, id_ultimate_parent <chr>,
#> #   name_ultimate_parent <chr>, id_direct_loantaker <chr>,
#> #   name_direct_loantaker <chr>, lei <chr>, id_2dii <lgl>, level <lgl>,
#> #   sector <lgl>, sector_abcd <lgl>, name <lgl>, name_abcd <lgl>, score <lgl>,
#> #   source <lgl>, borderline <lgl>

# no match found via fuzzy matching, as expected
# outputs "message" that no match was found by fuzzy matching, but successfully outputs the joined match
match_name(loanbook, abcd, join_id = "lei")
#> Found no match via fuzzy matching.
#> # A tibble: 1 × 10
#>   sector_classification_system sector_classification_direct…¹ id_ultimate_parent
#>   <chr>                        <chr>                          <chr>             
#> 1 NACE                         100                            UP15              
#> # ℹ abbreviated name: ¹​sector_classification_direct_loantaker
#> # ℹ 7 more variables: name_ultimate_parent <chr>, id_direct_loantaker <chr>,
#> #   name_direct_loantaker <chr>, name <chr>, sector <chr>, alias <chr>,
#> #   score <dbl>

Created on 2024-03-07 with reprex v2.1.0

Closes #135

@jdhoffa jdhoffa requested a review from jacobvjk March 7, 2024 08:10
@jdhoffa
Copy link
Member Author

jdhoffa commented Mar 7, 2024

@jacobvjk requesting your review here for some input already

@jdhoffa
Copy link
Member Author

jdhoffa commented Mar 7, 2024

and cc @cjyetman and @AlexAxthelm for visibility

@cjyetman
Copy link
Member

cjyetman commented Mar 7, 2024

Allowing a named character vector, e.g. join_id = c(lei_direct_loantaker = "lei"), might be an easy way to facilitate naming different columns in the loanbook and abcd datasets.

@jdhoffa
Copy link
Member Author

jdhoffa commented Mar 7, 2024

Yup, that's what I was thinking too!

@jdhoffa
Copy link
Member Author

jdhoffa commented Mar 7, 2024

I think may make sense to only allow a single join column for now.
Adding multiple requires some join priority logic, and I'd rather not introduce that complexity yet.
Could be a future feature once this one is in the wild, and proven to not be buggy

Copy link
Member

@jacobvjk jacobvjk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

particularly unclear about removing the join_id after the inner_join and before the filter.
the rest seems quite reasonable

R/restructure_abcd.R Outdated Show resolved Hide resolved
R/match_name.R Outdated Show resolved Hide resolved
@jdhoffa
Copy link
Member Author

jdhoffa commented Mar 7, 2024

So there's a bunch of failing checks that I will look into later (I have a feeling it's because of changes in r2dii.data messing up our vignettes etc.)

But the core functionality seems to be there :-)

@jdhoffa
Copy link
Member Author

jdhoffa commented Mar 12, 2024

  • Make sure this works well with prioritize()

@jdhoffa
Copy link
Member Author

jdhoffa commented Mar 12, 2024

This seems to function now as expected, with dplyr::*_join-like syntax.
See reprex below:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(r2dii.data)
library(r2dii.match)

# 441 points to the company Russo, Russo e Russo Group
abcd_demo <- dplyr::mutate(
  abcd_demo, 
  lei = dplyr::case_when(
    company_id == "441" ~ "LEI123",
    TRUE ~ lei
  )
)

# C267 points to the company Russo s.r.l.
loanbook_demo <- dplyr::mutate(
  loanbook_demo, 
  lei_direct_loantaker = dplyr::case_when(
    id_direct_loantaker == "C267" ~ "LEI123",
    TRUE ~ lei_direct_loantaker
  )
)

out <- match_name(
  loanbook_demo, 
  abcd_demo, 
  join_id = c(lei_direct_loantaker = "lei")
)

prioritized <- prioritize(out)

prioritized |> 
  filter(id_direct_loantaker == "C267") |> 
  select(id_direct_loantaker, lei_direct_loantaker, level, source, score, name, name_abcd)
#> # A tibble: 1 × 7
#>   id_direct_loantaker lei_direct_loantaker level    source score name  name_abcd
#>   <chr>               <chr>                <chr>    <chr>  <dbl> <chr> <chr>    
#> 1 C267                LEI123               lei_dir… id jo…     1 Russ… Russo, R…

Created on 2024-03-12 with reprex v2.1.0

@jdhoffa jdhoffa marked this pull request as ready for review March 12, 2024 10:07
@jdhoffa jdhoffa requested a review from jacobvjk March 12, 2024 10:07
@jdhoffa jdhoffa requested a review from cjyetman March 12, 2024 15:53
@jdhoffa
Copy link
Member Author

jdhoffa commented Mar 12, 2024

@cjyetman I've tagged you as a reviewer, understanding you may not have enough context to get exactly what is going on here (that's what @jacobvjk is there for), but I would appreciate a second pair of eyes on this 😊

R/match_name.R Outdated Show resolved Hide resolved
cjyetman
cjyetman previously approved these changes Mar 12, 2024
Copy link
Member

@cjyetman cjyetman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense to me

Copy link
Member

@jacobvjk jacobvjk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of the comments are show stoppers, but I would like to hear your thoughts before I approve. I think the error messages are a bit confusing.

tests/testthat/test-match_name.R Show resolved Hide resolved
tests/testthat/test-match_name.R Outdated Show resolved Hide resolved
R/match_name.R Outdated Show resolved Hide resolved
R/match_name.R Outdated Show resolved Hide resolved
tests/testthat/test-match_name.R Show resolved Hide resolved
jdhoffa and others added 2 commits March 14, 2024 11:20
@jdhoffa jdhoffa requested a review from jacobvjk March 14, 2024 10:25
Copy link
Member

@jacobvjk jacobvjk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still two instances mentioning list inputs

R/match_name.R Outdated Show resolved Hide resolved
tests/testthat/test-match_name.R Outdated Show resolved Hide resolved
@jdhoffa jdhoffa requested a review from jacobvjk March 14, 2024 11:16
Copy link
Member

@jacobvjk jacobvjk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: match_name gains functionality to automatically join by LEI/ISIN (or other specified identifer)
3 participants