Commit

Merge pull request #20 from lukes/6.0
6.0
lukes authored Apr 10, 2018
2 parents 12e876b + b8b0c1b commit fc02d73
Showing 13 changed files with 604 additions and 608 deletions.
35 changes: 35 additions & 0 deletions Gemfile.lock
@@ -0,0 +1,35 @@
GEM
remote: https://rubygems.org/
specs:
activesupport (4.2.4)
i18n (~> 0.7)
json (~> 1.7, >= 1.7.7)
minitest (~> 5.1)
thread_safe (~> 0.3, >= 0.3.4)
tzinfo (~> 1.1)
builder (3.2.2)
hpricot (0.8.6)
htmlentities (4.3.4)
i18n (0.7.0)
json (1.8.3)
mini_portile (0.6.2)
minitest (5.8.0)
nokogiri (1.6.6.2)
mini_portile (~> 0.6.0)
thread_safe (0.3.5)
tzinfo (1.2.2)
thread_safe (~> 0.1)

PLATFORMS
ruby

DEPENDENCIES
activesupport
builder
hpricot
htmlentities
json
nokogiri

BUNDLED WITH
1.10.6
2 changes: 1 addition & 1 deletion LAST_UPDATED.txt
@@ -1 +1 @@
2016-08-26 15:45:42 +1200
2018-04-10 21:14:21 +1200
31 changes: 17 additions & 14 deletions README.md
@@ -1,9 +1,9 @@

### ISO-3166 Country and Dependent Territories Lists with UN Regional Codes

These lists are the result of merging data from two sources, the Wikipedia [ISO 3166-1 article](http://en.wikipedia.org/wiki/ISO_3166-1#Officially_assigned_code_elements) for alpha and numeric country codes, and the [UN Statistics](http://unstats.un.org/unsd/methods/m49/m49regin.htm) site for countries' regional, and sub-regional codes. In addition to countries, it includes dependent territories.
These lists are the result of merging data from two sources, the Wikipedia [ISO 3166-1 article](http://en.wikipedia.org/wiki/ISO_3166-1#Officially_assigned_code_elements) for alpha and numeric country codes, and the [UN Statistics](https://unstats.un.org/unsd/methodology/m49) site for countries' regional, and sub-regional codes. In addition to countries, it includes dependent territories.

The [International Organization for Standardization (ISO)](http://www.iso.org/iso/english_country_names_and_code_elements) site provides partial data (capitalised and sometimes stripped of non-latin ornamentation), but sells the complete data set as a Microsoft Access 2003 database. Other sites give you the numeric and character codes, but there appeared to be no sites that included the associated UN-maintained regional codes in their data sets. I scraped data from the above two websites that is all publicly available already to produce some ready-to-use complete data sets that will hopefully save someone some time who had similar needs.
The [International Organization for Standardization (ISO)](https://www.iso.org/iso-3166-country-codes.html) site provides partial data (capitalised and sometimes stripped of non-latin ornamentation), but sells the complete data set as a Microsoft Access 2003 database. Other sites give you the numeric and character codes, but there appeared to be no sites that included the associated UN-maintained regional codes in their data sets. I scraped data from the above two websites that is all publicly available already to produce some ready-to-use complete data sets that will hopefully save someone some time who had similar needs.

### What's available?

@@ -29,15 +29,17 @@ Using JSON as an example:

[
{
"name":"New Zealand",
"alpha-2":"NZ",
"alpha-3":"NZL",
"country-code":"554",
"sub-region-code":"053",
"region-code":"009",
"iso_3166-2":"ISO 3166-2:NZ",
"region":"Oceania",
"sub-region":"Australia and New Zealand"
"name":"Nigeria",
"alpha-2":"NG",
"alpha-3":"NGA",
"country-code":"566",
"iso_3166-2":"ISO 3166-2:NG",
"region":"Africa",
"sub-region":"Sub-Saharan Africa",
"intermediate-region":"Western Africa",
"region-code":"002",
"sub-region-code":"202",
"intermediate-region-code":"011"
},
// ...
]
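
As a hedged illustration of consuming a record shaped like the 6.0 example above, the following Ruby sketch parses an inline JSON sample and looks up an entry by its `alpha-2` code. The inline sample, the `find_by_alpha2` helper, and the variable names are assumptions made for illustration and are not part of this repository; a real script would read `all/all.json` instead.

```ruby
require "json"

# Inline sample mirroring the 6.0 record layout shown in the README.
# A real script would read all/all.json instead of this string.
SAMPLE_JSON = <<~JSON
  [
    {
      "name": "Nigeria",
      "alpha-2": "NG",
      "alpha-3": "NGA",
      "country-code": "566",
      "iso_3166-2": "ISO 3166-2:NG",
      "region": "Africa",
      "sub-region": "Sub-Saharan Africa",
      "intermediate-region": "Western Africa",
      "region-code": "002",
      "sub-region-code": "202",
      "intermediate-region-code": "011"
    }
  ]
JSON

# Hypothetical helper: return the first record whose alpha-2 code
# matches (case-insensitively), or nil when there is no match.
def find_by_alpha2(records, code)
  records.find { |record| record["alpha-2"] == code.upcase }
end

records = JSON.parse(SAMPLE_JSON)
nigeria = find_by_alpha2(records, "ng")

puts "#{nigeria["name"]}: #{nigeria["sub-region"]} / #{nigeria["intermediate-region"]}"
```

Codes are kept as strings in the data (for example `"002"`), so leading zeros survive parsing; converting them to integers would lose that padding.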
@@ -75,7 +77,7 @@ Using JSON as an example:

To install the gems in the Gemfile:

bundle install
bundle

To run:

@@ -85,11 +87,12 @@ Note, due to file encoding issues the script should only be run using Ruby 1.9 o

### Timestamp

* UN Statistical data retrieved 26 August 2016, from a document last revised 31 October 2013
* Wikipedia data retrieved 26 August 2016, from a document last revised 13 August 2016
* UN Statistical data retrieved 10 April 2018
* Wikipedia data retrieved 10 April 2018, from a document last revised 2 April 2018

### Revisions

* 10 April 2018 - `tag 6.0`
* 26 August 2016 - `tag 5.0`
* 28 August 2015 - `tag 4.0`
* 20 April 2014 - `tag 3.0`
500 changes: 250 additions & 250 deletions all/all.csv

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion all/all.json

Large diffs are not rendered by default.

498 changes: 249 additions & 249 deletions all/all.xml

Large diffs are not rendered by default.

124 changes: 41 additions & 83 deletions scrubber.rb
@@ -16,12 +16,12 @@

entities = HTMLEntities.new

wikipedia_page = "https://en.wikipedia.org/wiki/ISO_3166-1"
un_page = "http://unstats.un.org/unsd/methods/m49/m49regin.htm"
WIKIPEDIA_URI = "https://en.wikipedia.org/wiki/ISO_3166-1"
UN_URI = "https://unstats.un.org/unsd/methodology/m49/overview/"

puts "Fetching data from Wikipedia table #{wikipedia_page}..."
puts "Fetching data from Wikipedia table #{WIKIPEDIA_URI}..."

doc = Hpricot(open(wikipedia_page).read)
doc = Hpricot(open(WIKIPEDIA_URI).read)

# array that will hold the iso 3166 data
codes = []
@@ -39,7 +39,7 @@
country["alpha-3"] = entities.decode(tds[2].search("span").inner_html.strip) rescue nil
country["country-code"] = entities.decode(tds[3].search("span").inner_html.strip) rescue nil
country["iso_3166-2"] = entities.decode(tds[4].search("a").inner_html.strip) rescue nil
codes << country unless country.values.any?{ |v| v.nil? }
codes << country unless country.values.any?(&:blank?)
end

puts " Data for #{codes.size} countries found\n"
@@ -49,92 +49,46 @@

# note ISO doesn't give away all information for free - so refer to the Wikipedia table (assuming it is mostly kept up to date)

puts "Fetching data from UN table #{un_page}..."
puts "Fetching data from UN table #{UN_URI}..."

#doc = Hpricot(open(un_page).read.force_encoding("UTF-8"))
#doc = Nokogiri::XML(open(un_page).read)
doc = Nokogiri::HTML(open(un_page))
doc = Nokogiri::HTML(open(UN_URI))

region_name = nil
region_code = nil
sub_region_name = nil
sub_region_code = nil
found_table = false
doc.css("table#downloadTableEN tbody").css("tr").each do |row|

#doc.css("table [text()*=Numerical code]")[0].css("tr").each do |row|
doc.css("table[width='100%']")[2].css("tr").each do |row|
_, _, region_code, region_name, sub_region_code, sub_region_name, intermediate_region_code, intermediate_region_name, country, _, iso_alpha_3 = row.css("td").map{|td| td.inner_html.strip }

# table has more sections than we want, like row
# of "Developed and developing regions" code. look for
# the next instance of td.header2 after we've started
# finding results, and end the loop when found
if !row.css("td.cheader2").blank? && found_table
break
end

tds = row.css("td")

next if tds[0].blank?

# get the code number
code = tds[0].css("p")[0].try(:inner_html).try(:strip)
code = tds[0].css("p span").inner_html.strip if code.blank?
code = tds[0].inner_html.strip if code.blank? # certain codes aren't wrapped in a <p>
next unless code.match(/^\d+\Z/)

# determine what kind of row this is
# is this a region row?
region = tds[1].css("h3 b")
unless region.blank?
region.css("a").remove # remove the empty <a>
unless region.css("span").blank?
region = region.css("span") # remove wayward <span> (appearing on first Africa result)
end
region = region.inner_html.strip
region = entities.decode(region)
unless region.nil? || region.blank?
found_table = true
region_code = code
region_name = region
puts "#{region_name}: #{region_code}"
next
# find this country in our array and modify in place
codes.each_with_index do |element, i|
if element["alpha-3"] == iso_alpha_3
codes[i]["region"] = entities.decode(region_name)
codes[i]["sub-region"] = entities.decode(sub_region_name)
codes[i]["intermediate-region"] = entities.decode(intermediate_region_name)
codes[i]["region-code"] = region_code
codes[i]["sub-region-code"] = sub_region_code
codes[i]["intermediate-region-code"] = intermediate_region_code
break
end
end
# is this a subregion row?
sub_region = tds[1].css("b").inner_html.strip
unless sub_region.blank?
sub_region = entities.decode(sub_region)
sub_region_code = code
sub_region_name = sub_region
puts " #{sub_region_name}: #{sub_region_code}"
next
end
# is this a country row?
country = tds[1].css("p").inner_html.strip
country = tds[1].css("p span").inner_html.strip if country.blank?
country = tds[1].inner_html.strip if country.blank?

unless country.blank? || !country.match(/^[A-Z]/)
# find this country in our array and modify in place
codes.each_with_index do |element, i|
if element["country-code"] == code
codes[i]["region"] = region_name
codes[i]["sub-region"] = sub_region_name
codes[i]["region-code"] = region_code
codes[i]["sub-region-code"] = sub_region_code
break
end
end
country = entities.decode(country)
puts " #{country}: #{code}"
end

# puts "\t#{entities.decode(country)}: #{iso_alpha_3}"
end

# For ISO data from the Wikipedia page that we couldn't correlate with
# regional codes from the UN, give them blank data.
codes.select{ |c| c["region-code"].nil? }.map do |c|
c["region"] = c["sub-region"] = c["region-code"] = c["sub-region-code"] = nil
blanks = codes.select do |c|
# (note, don't consider intermediate region data as required)
c.slice("alpha-3", "region", "sub-region", "region-code", "sub-region-code").values.any?(&:blank?) ||
!["alpha-3", "region", "sub-region", "region-code", "sub-region-code"].all? {|k| c.key?(k) }
end

# Ensure they have all the keys (we'll write them to the data files with blank values)
blanks.map! do |c|
c["region"] ||= nil
c["sub-region"] ||= nil
c["intermediate-region"] ||= nil
c["region-code"] ||= nil
c["sub-region-code"] ||= nil
c["intermediate-region-code"] ||= nil
c
end

puts "Writing files..."
@@ -179,8 +133,12 @@ def json_to_xml(json)
File.open("slim-3/slim-3.csv", "w:UTF-8") { |f| f.write(json_to_csv(json)) }
File.open("slim-3/slim-3.xml", "w:UTF-8") { |f| f.write(json_to_xml(json)) }

puts "\nCouldn't find regional table data to save in all.csv, all.json and all.xml for the following countries (you may want to manually check #{un_page}) -- sorry!:\n\n"
puts "Done."

codes.select{ |c| c["region-code"].nil? }.each {|c| puts c.inspect }
if blanks.present?
puts "\nThere was some missing data for #{blanks.size} countries"
puts "(you may want to manually check #{UN_URI}):\n"
puts blanks.map(&:inspect)
end

File.open("LAST_UPDATED.txt", "w:UTF-8") { |f| f.write(Time.now.to_s) }
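
The reworked `scrubber.rb` matches each UN table row to an ISO record by its alpha-3 code and then flags records that are still missing regional data. The sketch below shows that idea in isolation, under stated assumptions: the sample rows, the `XXX` placeholder record, and the constant names are hypothetical, and it swaps the commit's `each_with_index` scan for a hash index and uses plain Ruby in place of ActiveSupport's `blank?`.

```ruby
# ISO records as they might look after the Wikipedia scrape (sample data).
# The second record is a hypothetical placeholder with no matching UN row.
codes = [
  { "name" => "Nigeria", "alpha-3" => "NGA" },
  { "name" => "Example Territory", "alpha-3" => "XXX" }
]

# One UN row, with cells in the same order the commit destructures them.
# The first two and the tenth cell are ignored, as in the script.
un_rows = [
  ["001", "World", "002", "Africa", "202", "Sub-Saharan Africa",
   "011", "Western Africa", "Nigeria", "566", "NGA"]
]

REGION_KEYS = %w[
  region sub-region intermediate-region
  region-code sub-region-code intermediate-region-code
].freeze

# Intermediate region data is optional, so it is not required here.
REQUIRED_KEYS = %w[region sub-region region-code sub-region-code].freeze

# Index the ISO records by alpha-3 so each UN row is a single hash lookup.
by_alpha3 = codes.each_with_object({}) { |c, index| index[c["alpha-3"]] = c }

un_rows.each do |cells|
  _, _, region_code, region_name, sub_code, sub_name,
    inter_code, inter_name, _, _, alpha3 = cells

  record = by_alpha3[alpha3]
  next unless record # UN row with no matching ISO record

  record.merge!(
    "region" => region_name,
    "sub-region" => sub_name,
    "intermediate-region" => inter_name,
    "region-code" => region_code,
    "sub-region-code" => sub_code,
    "intermediate-region-code" => inter_code
  )
end

# Give every record the full key set so output files have uniform columns.
codes.each { |c| REGION_KEYS.each { |key| c[key] ||= nil } }

# Records still missing any required regional value.
blanks = codes.select do |c|
  REQUIRED_KEYS.any? { |key| c[key].to_s.strip.empty? }
end

puts "Missing regional data for #{blanks.size} record(s):"
puts blanks.map(&:inspect)
```

Because the hash index holds references to the same hashes stored in `codes`, `merge!` updates the original records in place, which is the same effect the commit achieves by assigning through `codes[i]`.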
4 changes: 2 additions & 2 deletions slim-2/slim-2.csv
@@ -36,10 +36,10 @@ Brunei Darussalam,BN,096
Bulgaria,BG,100
Burkina Faso,BF,854
Burundi,BI,108
Cabo Verde,CV,132
Cambodia,KH,116
Cameroon,CM,120
Canada,CA,124
Cabo Verde,CV,132
Cayman Islands,KY,136
Central African Republic,CF,140
Chad,TD,148
@@ -58,7 +58,7 @@ Croatia,HR,191
Cuba,CU,192
Curaçao,CW,531
Cyprus,CY,196
Czech Republic,CZ,203
Czechia,CZ,203
Denmark,DK,208
Djibouti,DJ,262
Dominica,DM,212