From af2ff3434997efb5e884d0f98c225a80a5c88153 Mon Sep 17 00:00:00 2001
From: sammo3182
Date: Mon, 21 Aug 2023 09:54:42 +0800
Subject: [PATCH] vignette edited

---
 vignettes/regioncode-vignette.Rmd | 280 +++++++++++++++---------------
 vignettes/s_regioncode.bib        |  12 ++
 2 files changed, 153 insertions(+), 139 deletions(-)

diff --git a/vignettes/regioncode-vignette.Rmd b/vignettes/regioncode-vignette.Rmd
index 8552a92..baeea4b 100644
--- a/vignettes/regioncode-vignette.Rmd
+++ b/vignettes/regioncode-vignette.Rmd
@@ -1,14 +1,14 @@
 ---
-title: "regioncode: Convert Region Names and Division Codes of China Over Years"
+title: "regioncode: One-Step Solution for Chinese Region Conversions"
 author:
-  - "Yue Hu, Xinyi Ye, Wenquan Wu, Yufei Sun"
+  - "HU Yue, YE Xinyi, WU Wenquan, SUN Yufei"
 date: "`r Sys.Date()`"
 output:
   rmarkdown::html_vignette:
     keep_md: false
 vignette: >
   %\VignetteEncoding{UTF-8}
-  %\VignetteIndexEntry{regioncode: Convert Region Names and Division Codes of China Over Years}
+  %\VignetteIndexEntry{regioncode: One-Step Solution for Chinese Region Conversions}
   %\VignetteEngine{knitr::rmarkdown}
 bibliography: s_regioncode.bib
@@ -26,92 +26,90 @@ library(regioncode)
 library(tidyverse)
 ```
-"City" is a complex concept in China.
-It may refer to a county-level, prefectural, or provincial administrative unit.
-Scholars of China often suffer from convert these names or corresponding geocodes, especially when dealing with data over years, since, for every a while, some unit's name may be modified or cancelled by the central government [@GuoJiaTongJiJu2022].
+The term "city" in China encompasses a multifaceted concept.
+It may denote a county-level, prefectural, or provincial administrative unit.
+Scholars focusing on China often encounter frustration in converting these names or corresponding geocodes, particularly when handling data spanning multiple years.
+This complexity further arises due to periodic modifications or cancellations of some unit's name by the central government [@GuoJiaTongJiJu2022].
-Inspired by Vincent Arel-Bundock's [`countrycode`](https://joss.theoj.org/papers/10.21105/joss.00848) package, we created `regioncode`, a package to achieve similar functions but specifically for region name/code conversions within China.
-`regioncode` aims to enable seamlessly converting regions' formal names, common-used names, and administrative division codes between each other in modern China (1986--2019 in the current version).
+Inspired by Vincent Arel-Bundock's [`countrycode`](https://joss.theoj.org/papers/10.21105/joss.00848) package, we developed `regioncode`.
+This package aims to perform similar functions but is tailored specifically for region name/code conversions within China for the period 1986--2019.
 # Why `regioncode`?
-The Chinese government gives unique geocodes for each county, city (prefecture), and provincial-level administrative unit.
-These "administrative division codes" are consistently [adjusted and updated](http://www.mca.gov.cn/article/sj/xzqh/1980/) to matched national and regional plans of development [@MinZhengBu2022].
-The adjustments however may disturb researchers when they conduct studies over time or merge geo-based data from different years.
-Especially, when researchers render statistical data on a Chinese map, different geocodes between map data and statistical data can cause mess-up outputs.
+The Chinese government assigns unique geocodes to each county, city (prefecture), and provincial-level administrative unit.
+These "administrative division codes" are consistently [adjusted and updated](http://www.mca.gov.cn/article/sj/xzqh/1980/) to align with national and regional development plans [@MinZhengBu2022].
+However, these adjustments may pose challenges for researchers conducting longitudinal studies or merging geo-based data from different years.
+For instance, inconsistencies between map data and statistical data can result in erroneous outputs when rendering statistical data on a Chinese map.
-This package aims to conquer such difficulties by a one-step solution.
-In the current version, `regioncode` enables seamlessly converting formal names, common-used names, language zone, and division codes of Chinese provinces and prefectures between each other and across thirty-four years from 1986 to 2019.
+*A One-Step Solution: `regioncode`*
-# Installation
+`regioncode` offers a one-step solution to these challenges.
+In its current version, it enables seamless conversion of formal names, commonly used names, and administrative division codes of Chinese provinces and prefectures between each other, covering a span of thirty-four years from 1986 to 2019.
-To install:
-- the latest released version: `install.packages("regioncode")`.
-- the latest developing version: `remotes::install_github("sammo3182/regioncode")`.
+# Installation
-# Basic Usage
+To install:
-We uses a randomly sample from the [`China's Corruption Investigations Dataset`](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/9QZRAD) to illustrate how the functions work.
+- The latest released version: `install.packages("regioncode")`.
+- The latest developing version: `remotes::install_github("sammo3182/regioncode")`.
-In `regioncode` package, we named administrative division codes as `code`, regions' formal names as `name`, and their commonly used abbreviation ("short name") as `sname`.
-The current version enables mutual conversion between any pair of them.
-To do so, users just need to pass a character vector of names or a numeric vector of geocodes into the function and specify which type of the output they want in the argument `convert_to` to gain the type of output of interest.
-The current version includes three basic types of output (together with three types of extensions, `pinyin`, `dia_group`, and `rank`, elaborated in the next section):
+# Basic Usage
-1. Geocodes (`code`)
-1. Names of the given cities/provinces (`name`)
+We demonstrate the basic application of `regioncode` with a toy data randomly sampled from @Wang2020c's China's Corruption Investigations Dataset.
+In the `regioncode` field, administrative division codes are denoted as `code`, and the formal names of regions are referred to as `name`.
+The current version facilitates the mutual conversion between any pair of these elements.
+Users merely need to input a character vector of names or a numeric vector of geocodes into the function, specifying the desired output type with the `convert_to` argument.
-In the following example, the 2019 geocodes in the toy data to their 1989 version.
-Users need to correctly set the `year_from` argument to point to the proper reference.
-Then they can use `year_to` and `convert_to` to set which year's projection should be output and in what type of format (e.g., geocodes or names in the following example).
+The following example illustrates the conversion of 2019 geocodes in the sample data to their 1989 version.
+It is essential for users to correctly set the `year_from` argument to reference the appropriate year.
+Subsequently, the `year_to` and `convert_to` arguments can be used to determine the desired year's projection and the format type.
 ```{r code2code}
 library(regioncode)
 data("corruption")
-# Convert to the 1989 version
+# Conversion to the 1989 version
 regioncode(data_input = corruption$prefecture_id,
-           convert_to = "code", # default set
+           convert_to = "code", # default setting
            year_from = 2019,
            year_to = 1989)
-# Comparision
-
+# Comparison
 tibble(
   code2019 = corruption$prefecture_id,
   code1989 = regioncode(data_input = corruption$prefecture_id,
-           convert_to = "code", # default set
+           convert_to = "code", # default setting
            year_from = 2019,
            year_to = 1989),
   name2019 = regioncode(data_input = corruption$prefecture_id,
-           convert_to = "name", # default set
+           convert_to = "name", # default setting
            year_from = 2019,
            year_to = 2019),
   name1989 = regioncode(data_input = corruption$prefecture_id,
-           convert_to = "name", # default set
+           convert_to = "name", # default setting
            year_from = 2019,
            year_to = 1989)
 )
 ```
-Note that if a region was initially geocoded in e.g., 1989 and included in a new region, in 2019, the new region geocode will be used hereafter.
-If a big place was broken into several regions, the later-year codes will be aligned with the first region according to the ascendant order of the regions' numeric geocodes.
+Note that if a region was initially geocoded in, for example, 1989 and later included in a new region in 2019, the new region geocode will be subsequently used.
+If a large area was divided into several regions, the later-year codes will align with the first region according to the ascending order of the regions' numeric geocodes.
-In the current version, `regioncode` automatically detects the input format: numerics for geocodes and characters for names.
-The following example illustrate the conversions from different types of input to alternative formats of outputs:
+In the current version, `regioncode` automatically identifies the input format: numerics for geocodes and characters for names.
+The following example demonstrates the conversions from various types of input to alternative formats of outputs:
 ```{r code2name}
-# The original name
+# Original name
 tibble(
   id = corruption$prefecture_id,
   name = corruption$prefecture
 )
 # Codes to name
-
 regioncode(data_input = corruption$prefecture_id,
            convert_to = "name",
            year_from = 2019,
@@ -124,7 +122,6 @@ regioncode(data_input = corruption$prefecture,
            year_to = 2019)
 # Name to name of a different year
-
 regioncode(data_input = corruption$prefecture,
            convert_to = "name",
            year_from = 2019,
@@ -133,34 +130,32 @@ regioncode(data_input = corruption$prefecture,
 # Advanced Applications
-To further help uses with more "messier" data and more diverse demands, `regioncode` provides several special converting functions.
-They usually work for very specific data, but you will feel blessed that `regioncode` have these functions when you encounter those types of data.
-Such functions include:
+The `regioncode` package also offers specialized conversion functions to assist users with more complex data and diverse requirements, including:
-1. Convert from and to names without administrative levels;
-1. Convert from data regarding municipalities as cities;
-1. Convert from provincial data;
-1. Return pinyin spellings for the name;
-1. Return sociopolitical areas;
-1. Return city ranks;
-1. Return linguistic zones.
+1. Conversion from/to incomplete names (without administrative levels).
+2. Different handling of municipalities.
+3. Return of population-based city ranks.
+4. Return of pinyin format of outputs.
+5. Conversion of provincial data.
+6. Return of administrative areas.
+7. Return of linguistic zones.
-## Incomplete naming prefectures.
+## Incomplete Naming of Prefectures
-More than often, data codes may omit the administrative levela when recording geo-information, for instnace, "北京" instead of "北京市", denoted as "incomplete names."
-To accomplish conversions for such data, one needs to specify the `incomplete_name` argument:
+Frequently, data codes may exclude the administrative level when recording geographical information, such as "北京" instead of "北京市," referred to as "incomplete names."
+To execute conversions for such data, one must specify the `incomplete_name` argument:
-- If the administrative level in the input data is incomplete, users should set `incomplete_name = "from"`;
-- If the users need the output to be in the incomplete form (only working when `convert_to = "name"`), for instance, for later merging with other data, users should set `incomplete_name = "to"`;
-- If both the input and output need to be in the in the incomplete form `incomplete_name = "both"`.
+- If the administrative level in the input data is incomplete, users should set `incomplete_name = "from"`.
+- If the output needs to be in the incomplete form (only applicable when `convert_to = "name"`), users should set `incomplete_name = "to"`.
+- If both the input and output need to be in the incomplete form, `incomplete_name = "both"`.
-The above setting are illustrated in the example below, in which we convert the full names in the 2019 version to incomplete names in the 1989 version, and change it full names in 2008 in a next step:
+The settings above are exemplified in the subsequent example, where we convert the full names in the 2019 version to incomplete names in the 1989 version, and then to full names in 2008:
 ```{r incomplete_name}
 # Original full names
 corruption$prefecture
-# Convert to incomplete names in 1989
+# Conversion to incomplete names in 1989
 fake_incomplete <- regioncode(data_input = corruption$prefecture,
            convert_to = "name",
            year_from = 2019,
@@ -168,7 +163,7 @@ fake_incomplete <- regioncode(data_input = corruption$prefecture,
            incomplete_name = "to")
 fake_incomplete
-# Convert to full names in 2008
+# Conversion to full names in 2008
 fake_full <- regioncode(data_input = fake_incomplete,
            convert_to = "name",
            year_from = 1989,
@@ -179,16 +174,15 @@ fake_full
 ## Municipalities
-Municipalities ("直辖市") are geographically cities but administratively provincial.
-When they are recognized as the provincial units, the districts within these municipalities are thus considered prefectural, according to the Chinese geocode system.
-Nevertheless, different geographic data may treat them differently.
-Some data (especially some socioeconomic statistics) may treat the municipalities equivalently as prefectures.
+Municipalities ("直辖市") in China are geographically cities but administratively provincial.
+Different geographic data may categorize them differently.
+Some data may treat municipalities as equivalent to prefectures.
-To convert this type of data, `regioncode` sets a specific argument `zhixiashi`.
-The default value of the argument is "FALSE," by which the municipalities are treated as provinces.
-When it is set "TRUE," the municipalities are treated as prefectures, and their provincial codes are used as the geocodes.
+To convert this type of data, `regioncode` introduces a specific argument `zhixiashi`.
+The default value is "FALSE," treating municipalities as provinces.
+When set to "TRUE," municipalities are considered as prefectures, and their provincial codes are utilized as geocodes.
-In the following example, we illustrate the municipalities identifier with a mixed string of names of municipalities, their districts, and a prefecture:
+The following example illustrates the municipalities identifier with a mixed string of names of municipalities, their districts, and a prefecture:
 ```{r municipality}
 names_municipality <- c("北京", # Beijing, a municipality
@@ -198,15 +192,13 @@ names_municipality <- c("北京", # Beijing, a municipality
                         "济南市") # A prefecture of Shandong
 # When `zhixiashi` is FALSE, only the districts are recognized
-
 regioncode(data_input = names_municipality,
            year_from = 2019,
            year_to = 2019,
            convert_to = "code",
            zhixiashi = FALSE)
-# When `zhixiashi` is TRUE, muncipalities are
-
+# When `zhixiashi` is TRUE, municipalities are recognized
 # regioncode(data_input = names_municipality,
 #            year_from = 2019,
 #            year_to = 2019,
@@ -214,33 +206,32 @@ regioncode(data_input = names_municipality,
 #            zhixiashi = TRUE)
 ```
-## City ranking
+## City Ranking
-The *Statistical Yearbook of Urban and Rural Construction* divides Chinese cities into different levels from small cities to super cities, largely according to their populations [@GuoJiaTongJiJu2022a].
-From 1989 to 2014, there were four levels of cities, and the system extend to a 7-level scale after 2014, as shown in the following table:
+The *Statistical Yearbook of Urban and Rural Construction* classifies Chinese cities into different levels, largely based on their populations [@GuoJiaTongJiJu2022a].
+From 1989 to 2014, there were four levels of cities, and the system expanded to a 7-level scale after 2014, as detailed in the following table:
-| criteria | population rank
+| Criterion | Population | Rank
 |:-------------------|-----------------------|------------------|
-| old criteria(1989) | >1 million | 超大城市 |
+| Old (1989)| > 1 million | 超大城市 |
 | | 500,000 ~ 1 million | 大城市 |
 | | 200,000 ~ 500,000 | 中等城市 |
-| | <200,000 | 小城市 |
+| | < 200,000 | 小城市 |
 | | | |
-|new criteria(2014) | >10 million | 超大城市 |
+| New (2014)| > 10 million | 超大城市 |
 | | 5 million ~ 10 million| 特大城市 |
 | | 3 million ~ 5 million | I型大城市 |
 | | 1 million ~ 3 million | II型大城市 |
 | | 500,000 ~ 1 million | 中等城市 |
 | | 200,000 ~ 500,000 | I型小城市 |
-| | <200,000 | II型小城市 |
+| | < 200,000 | II型小城市 |
-`regioncode` provides a function to return the rank of the cities according to their populations of the given year.
-The population data were collected from the official statistics.
-If the population is not traceable, the rank will be marked as `NA`.
-Users just need to set `convert_to = "rank"` to conduct the conversion.
-For the regions in and before 1989, the old ranking system is applied.
-For the rest region-year, the function will return the new ranks.
-In the following example, we compare the ranks from the same input in different years.
+The `regioncode` function can return the rank of cities according to their populations for a given year.
+If the population is untraceable, the rank will be marked as `NA`.
+Users simply need to set `convert_to = "rank"` to perform the conversion.
+For regions in and before 1989, the old ranking system is applied.
+For other region-years, the function will return the new ranks.
+The following example compares the ranks from the same input in different years:
 ```{r rank}
 tibble(
@@ -256,46 +247,59 @@ tibble(
 )
 ```
-## Pinyin^[Thanks Liu Xueyan's contribution to this function.]
+## Pinyin
-Pinyin is a Chinese phonetic romanization.
-Some data stores the region names with pinyin instead of Chinese characters.
-The default name output of `regioncode` uses Chinese characters, but one can gain pinyin output by setting the argument `to_pinyin = TRUE`.
-The effect can be applied to either official name, incomplete name, or sociopolitical area outputs.
+Pinyin is a phonetic romanization of Chinese characters.
+Some data may store region names in pinyin instead of Chinese characters.
+The default name output of `regioncode` is in Chinese characters.
+However, thanks to Peng Zhao and Qu Cheng's [pinyin](https://github.com/pzhaonet/pinyin) package, users can now obtain pinyin format output from the `regioncode` function by setting the argument `to_pinyin = TRUE`.
+This function also corrects the romanization output for areas with special spellings, such as Shanxi vs. Shaanxi, Inner Mongolia, and special administrative regions.
+It works for official names, incomplete names, and administrative area outputs.
+The following example demonstrates how this function operates on various demands:
 ```{r pinyin}
-regioncode(data_input = corruption$prefecture,
+tibble(
+  city = corruption$prefecture,
+  cityPY = regioncode(data_input = corruption$prefecture,
            year_from = 2019,
            year_to = 1989,
-           convert_to="name",
-           to_pinyin=TRUE
-           )
-
-regioncode(data_input = corruption$prefecture,
+           convert_to = "name",
+           to_pinyin = TRUE
+           ),
+  cityIncomplete = regioncode(data_input = corruption$prefecture,
            year_from = 2019,
            year_to = 1989,
-           convert_to="name",
+           convert_to = "name",
            incomplete_name = "to",
-           to_pinyin=TRUE
+           to_pinyin = TRUE
+           ),
+  areaPY = regioncode(data_input = corruption$prefecture,
+           year_from = 2019,
+           year_to = 1989,
+           convert_to = "area",
+           to_pinyin = TRUE
            )
+)
-regioncode(data_input = corruption$prefecture,
+# Regions with special spelling
+regioncode(data_input = c("山西", "陕西", "内蒙古", "香港", "澳门"),
            year_from = 2019,
-           year_to = 1989,
-           convert_to="area",
-           to_pinyin=TRUE
+           year_to = 2008,
+           convert_to = "name",
+           incomplete_name = "both",
+           province = TRUE,
+           to_pinyin = TRUE
            )
 ```
 ## Provinces
-`regioncode` enables conversions at not only the prefectural but also the provincial level.
-By setting the argument `province = TRUE`, users can convert all the geocodes and names at the provincial level.
-Chinese provinces have abbreviations.
-When the converted data only have abbreviations, users can set the `convert_to` argument to `abbreTocode`, `abbreToname`, or `abbreToarea` to gain the data types they want.
-When they want abbreviation outputs, just set `convert_to = "abbre"`.
+The `regioncode` function also supports conversions at the provincial level.
+By setting the argument `province = TRUE`, users can convert all geocodes and names at this level.
+Chinese provinces have abbreviations, and when the converted data only contain abbreviations, users can set the `convert_to` argument to `abbreTocode`, `abbreToname`, or `abbreToarea` to obtain the desired data types.
+To receive abbreviation outputs, simply set `convert_to = "abbre"`.
-In the following example, we convert a vector of province geocodes to their official names and abbreviations.
+The following example demonstrates the conversion of a vector of province geocodes to their official names and abbreviations:
 ```{r provinces}
 tibble(
@@ -311,18 +315,17 @@ tibble(
            year_to = 1989,
            province = TRUE)
 )
-
 ```
 ## Geographic Units Beyond Provinces
-The current version of `regioncode` includes two types of region conversion beyond the provincial level: administrative area and linguistic zones.
+The current version of `regioncode` encompasses two types of region conversion beyond the provincial level: administrative area and linguistic zones.
 ### Administrative Area
-Due to social, political, and martial reasons, Chinese regions are divided into seven areas [@SunPing2020]:
+Chinese regions are divided into seven areas for social, political, and martial reasons [@SunPing2020]:
-| region | provincial-level administrative unit |
+| Region | Provincial-level Administrative Unit |
 |:-------|----------------------------------------------------------------|
 | 华北 | 北京市, 天津市, 山西省, 河北省, 内蒙古自治区 |
 | 东北 | 黑龙江省, 吉林省, 辽宁省 |
@@ -332,8 +335,7 @@ Due to social, political, and martial reasons, Chinese regions are divided into
 | 西南 | 重庆市, 四川省, 贵州省, 云南省, 西藏自治区 |
 | 西北 | 陕西省, 甘肃省, 青海省, 宁夏回族自治区, 新疆维吾尔自治区 |
-
-In some cases, users may want to know which areas a prefecture or province belongs.
+In certain cases, users may wish to identify the area to which a prefecture or province belongs.
 `regioncode` offers a function to convert codes and names of the region (both prefectures and provinces) into areas by setting the output format as "area":
 ```{r 2area}
@@ -343,17 +345,17 @@ regioncode(data_input = corruption$prefecture,
            convert_to = "area")
 ```
-### Linguistic Zone^[Thanks ZHU Meng's contribution to this function.]
+### Linguistic Zone
-China is a multilingual country with a variety of dialects.
-These dialects may be used by several prefectures in a province or province.
-Prefectures from different provinces may also share the same dialect.
-For the convenience of political and sociolinguistic studies, `regioncode` includes a function to return approximate linguistic zones of the given geocodes or prefectural names.
-In the current version, `regioncode` offers two levels of lignuistic zone identification, i.e., the dialect groups (`dia_group`, "方言大类") and dialect sub-groups (`dia_sub_group`, "分区片"), according to the 1987 language atlas of China [@LiEtAl1987].^[Adding the 2012 version is a project on the list [@LanguageInstitutionEtAl2012].]
-(When `province = TRUE`, the linguistic conversion can be only to the dialect group level.)
-In the following example, we convert the toy data to dialect groups and sub-groups:
+China is a multilingual country with various dialects.
+These dialects may be used across several prefectures in a province or even across different provinces.
+For political and sociolinguistic studies, `regioncode` includes a function to return approximate linguistic zones of given geocodes or prefectural names.
+In the current version, `regioncode` offers two levels of linguistic zone identification: dialect groups (`dia_group`, "方言大类") and dialect sub-groups (`dia_sub_group`, "分区片"), according to the 1987 language atlas of China [@LiEtAl1987].
+(When `province = TRUE`, the linguistic conversion can only be to the dialect group level.)
+
+The following example converts the toy data to dialect groups and sub-groups:
 ```{r language_zone}
 tibble(
@@ -369,26 +371,26 @@ tibble(
 )
 ```
-Note that, the linguistic distribution in China is too complex for precisely gauging at the prefectural level, not saying that they continually change along with the population dynamic.
-The linguistic zone output from `regioncode` is thus at most for reference rather than rigorous linguistic research.
+Note that the linguistic distribution in China is too complex for precise gauging at the prefectural level, and it continually changes with population dynamics.
+The linguistic zone output from `regioncode` is thus for reference rather than rigorous linguistic research.
+
+# Conclusion
+
+`regioncode` offers a convenient method for converting Chinese administrative division codes, official names, and facilitating various specific conversions.
+The development of the package is ongoing, with future versions aiming to add more administrative level choices and enriching data.
+Collaboration is welcome, and questions, comments, or bug reports can be directed to [Github Issues](https://github.com/sammo3182/regioncode/issues).
-## Conclusion
+We extend our appreciation to SHI Yuyang, XU Yujia, TIAN Haiting, SHAO Weihang, CHEN Yuanqian, and LI Ruizhe for their contributions to data collection and function editing of this package.
-`regioncode` provides a convenient way to convert Chinese administrative division codes, official names, sociopolitical and linguistic areas, abbreviations, and so on between each other.
-This vignette offers a quick view of package features and a short tutorial for users.
+# Reference
-The development of the package is ongoing.
-Future versions aim to add more administrative level choices, from province level to county level.
-Data are also enriching.
-Welcome to join us if you are also interested (see the affiliations below).
-Please contact us with any questions or comments.
-Bug reports can be conducted by [Github Issues](https://github.com/sammo3182/regioncode/issues).
+::: {#refs}
+:::
-We appreciate the efforts of SHI Yuyang, XU Yujia, TIAN Haiting, SHAO Weihang, CHEN Yuanqian, and LI Ruizhe's on data collection and function editing of this package.
-## Affiliation
+# Affiliation
-Dr. Yue Hu
+Yue Hu
 Department of Political Science,\
 Tsinghua University,\
diff --git a/vignettes/s_regioncode.bib b/vignettes/s_regioncode.bib
index 87370ad..6c569b4 100644
--- a/vignettes/s_regioncode.bib
+++ b/vignettes/s_regioncode.bib
@@ -64,3 +64,15 @@ @article{SunPing2020
   langid = {chinese},
   file = {D\:\\zotero_system\\storage\\7Y2BQQQB\\content.html}
 }
+
+@data{Wang2020c,
+  title = {China's Corruption Investigations Dataset},
+  author = {Wang, Yuhua},
+  date = {2020},
+  number = {UNF:6:pt1h9LKzO0aD6F30y7KQGg==},
+  publisher = {{Harvard Dataverse}},
+  doi = {10.7910/DVN/9QZRAD},
+  langid = {english},
+  unf = {UNF:6:pt1h9LKzO0aD6F30y7KQGg==},
+  version = {DRAFT VERSION}
+}