In phaser-example, the Seattle data for bicycle counters separates each location into a different file, whereas the data format we decided to output from the pipeline has a column for the location description.
Since the Seattle data puts the location name in the file name, one place we could get the location name from is the file name itself:

- "Burke Gilman Trail NE 70th Bicycle Pedestrian Counter 20240705.csv"
- "Thomas St Overpass Bike Ped Counter 20240526.csv"
In fact, this illustrates another common pattern: putting the date of a data file in the name of the file.
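As an aside, recovering that date is easy once code can see the filename. A minimal sketch, assuming the trailing 8-digit token is always a YYYYMMDD date:

```python
import re
from datetime import datetime

def date_from_filename(filename):
    # Match a trailing 8-digit date token like "20240705" just before ".csv".
    match = re.search(r'(\d{8})\.csv$', filename)
    return datetime.strptime(match.group(1), '%Y%m%d').date() if match else None

date_from_filename("Thomas St Overpass Bike Ped Counter 20240526.csv")
# -> datetime.date(2024, 5, 26)
```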
For pipelines that need to move information out of the filename into a field, how should we give access to the source file name?
Option 1: It's possible to do this today by overriding the init_source method in the Pipeline to learn the name of the source, add that to the Context as a variable, then call super().init_source and proceed. Later on, a step can pull the location out of the source filename and add it as a column value. That's pretty complicated, but we could document it.
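Roughly what that would look like, with the caveat that the exact init_source signature and the Context method names (add_variable, get) used here are assumptions based on the description above, not confirmed API:

```python
import re
from pathlib import Path
from phaser import Pipeline, row_step

class SeattlePipeline(Pipeline):
    def init_source(self, source, *args, **kwargs):
        # Record the external filename as a context variable before normal
        # loading proceeds. 'add_variable' is an assumed method name.
        self.context.add_variable('source_filename', str(source))
        return super().init_source(source, *args, **kwargs)

@row_step
def add_location(row, context):
    stem = Path(context.get('source_filename')).stem
    # Strip the trailing "... Counter <YYYYMMDD>" suffix to recover the location,
    # e.g. "Thomas St Overpass Bike Ped Counter 20240526" -> "Thomas St Overpass".
    row['location'] = re.sub(r'\s+(Bicycle Pedestrian|Bike Ped) Counter \d{8}$', '', stem)
    return row
```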
Option 2: Another approach would be to allow the command line invocation to pass in a variable, so the person typing at the command line would type `python3 -m phaser run seattle output "sources/Burke Gilman Trail NE 70th Bicycle Pedestrian Counter 20240705.csv" --var location="Burke Gilman Trail NE 70th"`. Passing variables on the command line is a good idea anyway, for all kinds of variables, so I'll create a separate ticket for it.
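If that ticket goes anywhere, the flag could be parsed as repeated NAME=VALUE pairs, along these lines (a sketch only; phaser's actual CLI wiring will differ, and the positional arguments here just mirror the example invocation):

```python
import argparse

parser = argparse.ArgumentParser(prog='python3 -m phaser')
parser.add_argument('command', choices=['run'])
parser.add_argument('pipeline')       # e.g. "seattle"
parser.add_argument('working_dir')    # e.g. "output"
parser.add_argument('source')         # e.g. "sources/... .csv"
parser.add_argument('--var', action='append', default=[], metavar='NAME=VALUE',
                    help='set a context variable; may be repeated')

args = parser.parse_args()
# Turn the collected NAME=VALUE strings into a dict to seed the Context with.
variables = dict(pair.split('=', 1) for pair in args.var)
```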
Option 3: The phaser library could provide the source file names by adding them to the context for all steps and phases to access. For extra sources, this would happen in the pipeline in init_source. Currently the context saves source names and data, but the names are internal names like "temp_data", not external file names like "temps Seattle 20240606.csv", so the data structure does not currently have room for the filename. The main source would have to be handled differently.
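To make "room for the filename" concrete, the context's source bookkeeping could map each internal name to a record carrying both the data and the external filename (a hypothetical structure, not phaser's current one):

```python
from dataclasses import dataclass

@dataclass
class SourceRecord:
    data: list                    # rows loaded from the source
    filename: str | None = None   # external name, e.g. "temps Seattle 20240606.csv"

class Context:
    def __init__(self):
        # Keyed by internal source name, e.g. "temp_data". Today the value is
        # effectively just the data, which is why there's no slot for the filename.
        self.sources: dict[str, SourceRecord] = {}

    def set_source(self, name, data, filename=None):
        self.sources[name] = SourceRecord(data=data, filename=filename)
```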
I think option 3 is a good idea, but it's not trivial and may involve some refactoring of how sources are loaded.