Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Annotate airline data with start/end dates #1439

Open
jpatokal opened this issue Sep 4, 2023 · 9 comments
Open

Annotate airline data with start/end dates #1439

jpatokal opened this issue Sep 4, 2023 · 9 comments
Labels
airline Airline data update request. enhancement New feature requests or improvements to existing functionality.

Comments

@jpatokal
Copy link
Owner

jpatokal commented Sep 4, 2023

Airline data on the live site is badly out of date because IATA codes are frequently reused and it is not possible to match airlines to flight dates.

This issue tracks extracting start (creation) and end (dissolution) dates from Wikidata and uploading this to the live DB.

@jpatokal jpatokal added enhancement New feature requests or improvements to existing functionality. airline Airline data update request. labels Sep 4, 2023
@jpatokal
Copy link
Owner Author

jpatokal commented Sep 4, 2023

SPARQL, how do I hate thee? Let me count the ways.

# Name, Alias, IATA, ICAO, Callsign, Country
SELECT ?airline ?airlineLabel ?iata ?icao ?callsign ?countryLabel ?countryIso
  (xsd:date(?start) AS ?startDate) (xsd:date(?end) AS ?endDate) # format as ISO dates
WHERE 
{
  ?airline wdt:P31 wd:Q46970 .  # instance of airline
  ?airline wdt:P229 ?iata .  # IATA code not optional
  OPTIONAL{?airline wdt:P230 ?icao .}
  OPTIONAL{?airline wdt:P571 ?start .}
  OPTIONAL{?airline wdt:P576 ?end .}
  OPTIONAL { ?airline wdt:P17 ?country .
             ?country wdt:P297 ?countryIso }
  OPTIONAL{?airline wdt:P432 ?callsign .}
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

https://w.wiki/7PhS

@jpatokal
Copy link
Owner Author

jpatokal commented Sep 7, 2023

Here are dry runs against prod, both the full output and a grep of rename operations only -- please find bugs! If there's no major/systemic problems I'll pull the trigger in prod tomorrow, which will update 1296 airlines.

update_airlines.log

renamed_airlines.log

@reedy
Copy link
Collaborator

reedy commented Sep 7, 2023

Woo; wikidata. I've pinged some people at Wikimedia Deutschland to let them know about the move to using wikidata :)

@reedy
Copy link
Collaborator

reedy commented Sep 7, 2023

Here are dry runs against prod, both the full output and a grep of rename operations only -- please find bugs! If there's no major/systemic problems I'll pull the trigger in prod tomorrow, which will update 1296 airlines.

update_airlines.log

renamed_airlines.log

from the rename log... Not reviewed every line, but this stuck out to me:

>> MATCH WITH RENAME: from Myanma Airways to Myanmar National Airlines
>> MATCH WITH RENAME: from Myanma Airways to Union of Burma Airways

Not sure we can rename it twice? Or is there some hidden/simplified characters in there not being displayed? Or more likely, two different IATA airline codes for the same "display" name, at least, in the OpenFlights DB?

If it is that, we should probably output the IATA and/or the ICAO codes for the airlines in the logs

Then we rename something to the old name; which is fine itself, I think.

>> MATCH WITH RENAME: from Burma Airways to Myanma Airways

@reedy
Copy link
Collaborator

reedy commented Sep 7, 2023

Screenshot 2023-09-07 at 17 34 02

@reedy
Copy link
Collaborator

reedy commented Sep 7, 2023

The updated log looks more reasonable though... I guess the grep/rename shows the condensed version.

BUT.

> MATCH Myanma Airways (UB/UBA, Wikidata) == Myanma Airways (UB/UBA, 3569)
UPDATE airlines SET source='Wikidata', start_date='1989-04-01', end_date='2014-12-01', active='N' WHERE alid=3569
...
>> MATCH WITH RENAME: from Myanma Airways to Union of Burma Airways
> MATCH Union of Burma Airways (UB/UBA, Wikidata) == Myanma Airways (UB/UBA, 3569)
>> UPDATED: field name from Myanma Airways to Union of Burma Airways
UPDATE airlines SET name='Union of Burma Airways', alias='Myanma Airways', start_date='1948-12-15', end_date='1972-12-01' WHERE alid=3569
>> MATCH WITH RENAME: from Union of Burma Airways to Burma Airways
> MATCH Burma Airways (UB/UBA, Wikidata) == Union of Burma Airways (UB/UBA, 3569)
>> UPDATED: field name from Union of Burma Airways to Burma Airways
UPDATE airlines SET name='Burma Airways', alias='Union of Burma Airways', start_date='1972-12-01', end_date='1989-04-01' WHERE alid=3569
>> MATCH WITH RENAME: from Burma Airways to Myanma Airways
> MATCH Myanma Airways (UB/UBA, Wikidata) == Burma Airways (UB/UBA, 3569)
>> UPDATED: field name from Burma Airways to Myanma Airways
UPDATE airlines SET name='Myanma Airways', alias='Burma Airways', start_date='1989-04-01', end_date='2014-12-01' WHERE alid=3569
>> MATCH WITH RENAME: from Myanma Airways to Myanmar National Airlines
> MATCH Myanmar National Airlines (UB/UBA, Wikidata) == Myanma Airways (UB/UBA, 3569)
>> UPDATED: field name from Myanma Airways to Myanmar National Airlines
UPDATE airlines SET name='Myanmar National Airlines', alias='Myanma Airways', start_date='2014-12-01' WHERE alid=3569

And trim out the comments...

UPDATE airlines SET source='Wikidata', start_date='1989-04-01', end_date='2014-12-01', active='N' WHERE alid=3569
UPDATE airlines SET name='Union of Burma Airways', alias='Myanma Airways', start_date='1948-12-15', end_date='1972-12-01' WHERE alid=3569
UPDATE airlines SET name='Burma Airways', alias='Union of Burma Airways', start_date='1972-12-01', end_date='1989-04-01' WHERE alid=3569
UPDATE airlines SET name='Myanma Airways', alias='Burma Airways', start_date='1989-04-01', end_date='2014-12-01' WHERE alid=3569
UPDATE airlines SET name='Myanmar National Airlines', alias='Myanma Airways', start_date='2014-12-01' WHERE alid=3569

As they all have the same alid...

We can see it makes a few different updates (at least 3 of which are no-ops by the end; not the end of the world, but not the most useful/performant), and will result in a row that (incorrectly) has an end date, for the current name.

It will also incorrectly be marked as active='N' from the initial SQL query, which isn't what we want, I imagine

It kinda looks like we need multiple aliases for an airline (which is probably best served via a separate table, for ease of searchability, rather than storing multiple in one column and the fun that might bring) and can store start/end dates there too...

@jpatokal
Copy link
Owner Author

jpatokal commented Sep 8, 2023

I don't think we need aliases: the ideal end state would be a chain of airlines with start/end dates that smoothly cover the entire date range. Unfortunately Wikidata is quite inconsistent here, there's at least three different relations for inactive/followed by/replaced by and particularly for obscure stuff like Myanmar the data quality is poor anyway.

And the other problem is that the updater script mostly matches and renames the existing entry, which isn't consistent with this. So I think it's back to the drawing board here.

@reedy
Copy link
Collaborator

reedy commented Sep 8, 2023

Unfortunately Wikidata is quite inconsistent here, there's at least three different relations for inactive/followed by/replaced by and particularly for obscure stuff like Myanmar the data quality is poor anyway.

The joys of crowd sourcing...

Some of this could be unintentional; different people adding different things for different reasons. But they could be different succession types too, to be really unhelpful.

Depending on how wide spread the issue is, I could bring it up on the wiki, and see if we can get any normalisation etc. Or just fix them myself.

I don't think we need aliases: the ideal end state would be a chain of airlines with start/end dates that smoothly cover the entire date range.

Aha, so if that was the intention, we don't want to be always updating using the same airline id, but would potentially also mean updating airline ids in existing flights depending on the date...

@jpatokal
Copy link
Owner Author

jpatokal commented Sep 8, 2023

Yup, updating the airlines to match dates would have to be retrofitted later. This seems like it should be quite straightforward, although there's still some nuance to eg. picking the passenger airline instead of the cargo operation...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
airline Airline data update request. enhancement New feature requests or improvements to existing functionality.
Projects
None yet
Development

No branches or pull requests

2 participants