toBiblatex can't handle CJK author names #106

Open
kijinosu opened this issue Sep 18, 2024 · 16 comments · May be fixed by #107

Comments

@kijinosu

I am trying to use RefManageR for biblatex bibliographies that
include CJK text. While most of the bibliography is handled
splendidly, toBiblatex replaces author names with question marks.

My understanding is that this is caused by the use of the old
utils::person object.

Are there any workarounds or other ways to avoid this?

library(rlang)
library(RefManageR)
library(stringi)

b <- new_environment()
ls(b)

b$bib <- BibEntry(bibtype = "article", 
		key = "shiotsuki2011kasai", 
		title = "葛西賢太著,『現代瞑想論-変性意識がひらく世界-』",
		author = "塩月亮子", 
		journal = "宗教と社会",
		volume = 17,
		pages = "67--69",
		year = 2011, 
		publisher = "「宗教と社会」学会")

b$bib

toBiblatex(b$bib)

## @Article{shiotsuki2011kasai,
##   title = {葛西賢太著,『現代瞑想論-変性意識がひらく世界-』},
##   author = {{????}},
##   journal = {宗教と社会},
##   volume = {17},
##   pages = {67--69},
##   year = {2011},
##   publisher = {「宗教と社会」学会},
## }
@kijinosu
Author

Partial workaround that uses R package stringi:

b <- new_environment()
ls(b)

b$bib <- c(BibEntry(bibtype = "article", 
		key = "shiotsuki2011kasai", 
		title = "葛西賢太著,『現代瞑想論-変性意識がひらく世界-』",
		author = "塩,亮子 and 葛西,賢太", 
		journal = "宗教と社会",
		volume = 17,
		pages = "67--69",
		year = 2011, 
		publisher = "「宗教と社会」学会"),
		BibEntry(bibtype = "article", 
		key = "hiromitsu2022altered", 
		title = "意識状態の変容と脳内ネットワーク",
		author = "弘光健太郎 and ヒロミツケンタロウ", 
		journal = "鶴見大学仏教文化研究所紀要",
		volume = 27,
		pages = "53--66",
		year = 2022, 
		publisher = "鶴見大学")
		)
b$bib

b$biblatex <- toBiblatex(b$bib, escape=TRUE)
writeLines(b$biblatex)

## @Article{shiotsuki2011kasai,
##   title = {葛西賢太著,『現代瞑想論-変性意識がひらく世界-』},
##   author = {?? ? and ?? ??},
##   journal = {宗教と社会},
##   volume = {17},
##   pages = {67--69},
##   year = {2011},
##   publisher = {「宗教と社会」学会},
## }
## 
## @Article{hiromitsu2022altered,
##   title = {意識状態の変容と脳内ネットワーク},
##   author = {{?????} and {?????????}},
##   journal = {鶴見大学仏教文化研究所紀要},
##   volume = {27},
##   pages = {53--66},
##   year = {2022},
##   publisher = {鶴見大学},
## }

lapply(b$bib, function(v) {
	austr <- unlist(stri_split_boundaries(stri_flatten(unlist(v$author), collapse=""), type='character') )
	biblatex <- toBiblatex(v, escape=TRUE)
	auform <- as.character(biblatex['author'] )
	places <- stri_locate_all_regex(auform,"(?=\\?)", get_length=TRUE)[[1]][,1]
	replaced <- stri_sub_replace_all(auform,places,places,replacement=austr)
	biblatex['author'] <- replaced 
	writeLines(biblatex)
})

## @Article{shiotsuki2011kasai,
##   title = {葛西賢太著,『現代瞑想論-変性意識がひらく世界-』},
##   author = {亮子 塩 and 賢太 葛西},
##   journal = {宗教と社会},
##   volume = {17},
##   pages = {67--69},
##   year = {2011},
##   publisher = {「宗教と社会」学会},
## }
## @Article{hiromitsu2022altered,
##   title = {意識状態の変容と脳内ネットワーク},
##   author = {{弘光健太郎} and {ヒロミツケンタロウ}},
##   journal = {鶴見大学仏教文化研究所紀要},
##   volume = {27},
##   pages = {53--66},
##   year = {2022},
##   publisher = {鶴見大学},
## }
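
The partial workaround above rests on one idea: each `?` in the converted author field stands for exactly one original character, so the originals can be written back in order. A minimal base R sketch of just that step, without RefManageR or stringi, might look like the following. The function name `restore_placeholders` is hypothetical, and the sketch assumes that the names contain no literal question marks and that every character is a single code point.

```r
# Hypothetical helper: put the original characters back where a converter
# left "?" placeholders. Assumes one "?" per lost character, in order.
restore_placeholders <- function(mangled, original) {
  original_chars <- strsplit(original, "", fixed = TRUE)[[1]]
  out_chars <- strsplit(mangled, "", fixed = TRUE)[[1]]
  slots <- which(out_chars == "?")
  # If the counts disagree the mapping is ambiguous, so leave the text alone.
  if (length(slots) != length(original_chars)) {
    warning("placeholder count does not match the number of original characters")
    return(mangled)
  }
  out_chars[slots] <- original_chars
  paste(out_chars, collapse = "")
}

# "\u5869\u6708\u4eae\u5b50" is the author name from the first report,
# written with escapes so the example does not depend on the encoding
# of the source file.
restored <- restore_placeholders("{????}", "\u5869\u6708\u4eae\u5b50")
```

When the counts do not line up, the sketch keeps the mangled text and warns instead of guessing which character belongs where.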

@kijinosu
Author

A more complete workaround:

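library(stringi)
library(RefManageR)

# Write a BibEntry object as biblatex, putting back the characters of person
# names (author, editor, translator) that toBiblatex() replaced with "?".
# `file` may be a path or an already open connection such as stdout().
write_bib <- function(bib, file = stdout(), overwrite = FALSE) {
    if (!length(bib))
        return(invisible(NULL))
    if (!inherits(bib, "BibEntry")) {
        message("bib object is not a BibEntry object")
        return(invisible(NULL))
    }
    pAFields <- c("author", "editor", "translator")
    if (inherits(file, "connection")) {
        zz <- file
    } else {
        if (file.exists(file) && !overwrite)
            stop("file exists; set overwrite = TRUE to replace it")
        zz <- base::file(file, "w")
        on.exit(close(zz), add = TRUE)
    }

    invisible(lapply(bib, function(v) {
        biblatex <- unlist(toBiblatex(v, escape = TRUE))
        flds <- names(biblatex)
        for (pf in intersect(pAFields, flds)) {
            # Names for this particular field, split into single characters.
            people <- switch(pf, author = v$author, editor = v$editor,
                             translator = v$translator)
            austr <- unlist(stri_split_boundaries(
                stri_flatten(unlist(people), collapse = ""), type = "character"))
            auform <- as.character(biblatex[pf])
            # Positions of the "?" placeholders in the converted field.
            places <- stri_locate_all_regex(auform, "(?=\\?)", get_length = TRUE)[[1]][, 1]
            if (isTRUE(places[1] > 0)) {
                replaced <- tryCatch(
                    stri_sub_replace_all(auform, places, places, replacement = austr),
                    warning = function(cond) {
                        # Counts did not line up: log the details, keep the field as is.
                        writeLines(conditionMessage(cond), con = zz)
                        writeLines(paste("auform: ", auform), con = zz)
                        writeLines(paste("places: ", places), con = zz)
                        writeLines(paste("austr: ", austr), con = zz)
                        NULL
                    }
                )
                if (!is.null(replaced) && length(replaced) > 0)
                    biblatex[pf] <- replaced
            }
        }
        writeLines(biblatex, con = zz)
    }))
}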

@mwmclean
Collaborator

mwmclean commented Sep 23, 2024

@kijinosu thanks for your report. I'm not able to reproduce the behaviour you describe on my machine/locale; I get an error just creating your BibEntry objects with CJK characters for journal and title. Can you share your sessionInfo() please?

> My understanding is that this is caused by the use of the old
> utils::person object.

Can you elaborate on this?

Are you able to submit a pull request?

mwmclean added a commit that referenced this issue Sep 23, 2024
* tools::encoded_text_to_latex(text, UTF-8) can fail when text contains
Japanese characters, replacing valid text with question marks
* Add escape hatch to return original/unformatted/unescaped text if this
occurs
* Closes #106
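
The escape hatch described in this commit message can be sketched roughly as follows. This is not the code from the commit: `safe_to_latex`, `count_qmarks` and the `convert` argument are hypothetical names, and the check used here (more question marks in the output than in the input) is an assumed heuristic for detecting lost characters.

```r
# Count literal question marks in each element of a character vector.
count_qmarks <- function(s) {
  lengths(regmatches(s, gregexpr("?", s, fixed = TRUE)))
}

# Hypothetical wrapper: run a text-to-LaTeX converter, but fall back to the
# original text for any element where the conversion failed or appears to
# have replaced characters with "?".
safe_to_latex <- function(text,
                          convert = function(x) tools::encoded_text_to_latex(x, "UTF-8")) {
  out <- tryCatch(convert(text), error = function(e) text)
  if (length(out) != length(text)) return(text)
  lost <- is.na(out) | count_qmarks(out) > count_qmarks(text)
  out[lost] <- text[lost]
  out
}
```

Passing the converter as an argument keeps the fallback logic testable with a stand-in function, independent of how `tools::encoded_text_to_latex` behaves in a given locale.
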
@mwmclean
Collaborator

@kijinosu It turned out to be an issue with my IDE. I've opened #107; could you please install and test it and/or review it? 🙏

@kijinosu
Author

This worked partially:

b <- new_environment()
ls(b)

b$bib <- c(BibEntry(bibtype = "article", 
		key = "shiotsuki2011kasai", 
		title = "葛西賢太著,『現代瞑想論-変性意識がひらく世界-』",
		author = "塩,亮子 and 葛西,賢太", 
		journal = "宗教と社会",
		volume = 17,
		pages = "67--69",
		year = 2011, 
		publisher = "「宗教と社会」学会"),
		BibEntry(bibtype = "article", 
		key = "hiromitsu2022altered", 
		title = "意識状態の変容と脳内ネットワーク",
		author = "弘光健太郎 and ヒロミツケンタロウ", 
		journal = "鶴見大学仏教文化研究所紀要",
		volume = 27,
		pages = "53--66",
		year = 2022, 
		publisher = "鶴見大学")
		)

b$biblatex <- toBiblatex(b$bib)

writeLines(b$biblatex)
## @Article{shiotsuki2011kasai,
##   title = {葛西賢太著,『現代瞑想論-変性意識がひらく世界-』},
##   author = {亮子 塩 and 賢太 葛西},
##   journal = {宗教と社会},
##   volume = {17},
##   pages = {67--69},
##   year = {2011},
##   publisher = {「宗教と社会」学会},
## }
## 
## @Article{hiromitsu2022altered,
##   title = {意識状態の変容と脳内ネットワーク},
##   author = {{?????} and {?????????}},
##   journal = {鶴見大学仏教文化研究所紀要},
##   volume = {27},
##   pages = {53--66},
##   year = {2022},
##   publisher = {鶴見大学},
## }

@mwmclean
Collaborator

Did you install the branch I mentioned with e.g. remotes::install_github("ROpenSci/RefManageR#107")?

@kijinosu
Author

> @kijinosu thanks for your report. I'm not able to reproduce the behaviour you describe on my machine/locale; I get an error just creating your BibEntry objects with CJK characters for journal and title. Can you share your sessionInfo() please?

sessionInfo()
R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default

locale:
[1] LC_COLLATE=Japanese_Japan.utf8 LC_CTYPE=Japanese_Japan.utf8 LC_MONETARY=Japanese_Japan.utf8 LC_NUMERIC=C LC_TIME=Japanese_Japan.utf8

time zone: Asia/Tokyo
tzcode source: internal

> My understanding is that this is caused by the use of the old
> utils::person object.
>
> Can you elaborate on this?

I seem to be mistaken about utils::person.

> Are you able to submit a pull request?

Sorry, I am not familiar enough with github.
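
Because the behaviour could not be reproduced on another machine at first, it can help to report how R sees the strings themselves in addition to `sessionInfo()`. A small base R sketch follows; the function name `encoding_report` is hypothetical, and the sample string is the author name from the first report written with escapes.

```r
# Hypothetical helper: summarise the locale and how R has encoded a string.
encoding_report <- function(x) {
  list(
    ctype       = Sys.getlocale("LC_CTYPE"),           # active character-type locale
    utf8_locale = isTRUE(l10n_info()[["UTF-8"]]),      # is the session locale UTF-8?
    declared    = Encoding(x),                         # encoding mark on each string
    valid_utf8  = validUTF8(x),                        # are the bytes valid UTF-8?
    n_chars     = nchar(x, type = "chars", allowNA = TRUE),
    n_bytes     = nchar(x, type = "bytes")
  )
}

str(encoding_report("\u5869\u6708\u4eae\u5b50"))
```

A mismatch between the character and byte counts is expected for CJK text; an `NA` character count or `valid_utf8 = FALSE` would suggest the string was already damaged before `toBiblatex` saw it.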

@mwmclean
Collaborator

Did you see my previous message? #106 (comment) Are you able to install R packages from GitHub?

@kijinosu
Author

> Did you install the branch I mentioned with e.g. remotes::install_github("ROpenSci/RefManageR#107")?

Yes

@mwmclean
Collaborator

With that branch, I'm no longer able to produce output with ???? for the author names. This is tested with the unit tests here, and the tests pass on r-universe CI on macOS, Windows, and Ubuntu.

mwmclean added a commit that referenced this issue Sep 25, 2024
* See toBibtex.person, braces seem necessary if no family name is detected
* Addresses #106
@kijinosu
Author

How about replacing tools::encoded_text_to_latex with dplR::latexify?

@mwmclean
Collaborator

> How about replacing tools::encoded_text_to_latex with dplR::latexify?

What are the benefits?

@kijinosu
Author

kijinosu commented Sep 25, 2024 via email

@mwmclean
Collaborator

By handles CJK properly, you mean leaves it as is? That's what the PR currently does without adding an extra package as a dependency.

mwmclean added a commit that referenced this issue Sep 25, 2024
* Switch from tools::encoded_text_to_latex to dplR::latexify in
toBiblatex and toBibtex to fix translation of accented i characters
to latex
* Fixes #102, #106
@mwmclean
Collaborator

@kijinosu Looks like latexify fixes #102. You can test it out by installing #109. Thanks for the suggestion.
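
To illustrate the two behaviours discussed in this thread side by side (accented Latin characters become LaTeX macros, CJK text is left as is), here is a deliberately tiny base R sketch. It is not how `dplR::latexify` or RefManageR is implemented; the lookup table `accent_map` and the function `to_latex_sketch` are hypothetical and cover only four characters.

```r
# Hypothetical lookup table: a few accented characters and their LaTeX macros.
accent_map <- setNames(
  c("\\'{e}", "\\'{\\i}", "\\~{n}", "\\\"{u}"),
  c("\u00e9", "\u00ed", "\u00f1", "\u00fc")   # e-acute, i-acute, n-tilde, u-umlaut
)

# Replace characters found in the table; pass everything else (including
# CJK text) through unchanged.
to_latex_sketch <- function(text) {
  vapply(text, function(s) {
    chars <- strsplit(s, "", fixed = TRUE)[[1]]
    hit <- chars %in% names(accent_map)
    chars[hit] <- accent_map[chars[hit]]
    paste(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# "Mart\u00edn" has an accented i; the second string is a CJK author name.
converted <- to_latex_sketch(c("Mart\u00edn", "\u5869\u6708\u4eae\u5b50"))
```

A real converter needs a far larger table and must also escape LaTeX special characters, which this sketch does not attempt.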

@kijinosu
Author

> By handles CJK properly, you mean leaves it as is? That's what the PR currently does without adding an extra package as a dependency.

I mean that it uses stringi, which is a wrapper for ICU4C, the C/C++ edition of International Components for Unicode (https://icu.unicode.org/), which is maintained by the Unicode Consortium.

mwmclean added a commit that referenced this issue Oct 7, 2024
* Add latexify() from dplR instead of tools::encoded_text_to_latex
to improve conversion of non-ASCII characters to valid latex
* Fixes #102, #105, #106

Signed-off-by: Mathew W. McLean <[email protected]>
mwmclean added a commit that referenced this issue Oct 15, 2024
* Add latexify() from dplR instead of tools::encoded_text_to_latex
to improve conversion of non-ASCII characters to valid latex
* Fixes #102, #105, #106

Signed-off-by: Mathew W. McLean <[email protected]>