Make file name accessible (in batch? in context?) in phases while running #152

Open
lisad opened this issue Jul 6, 2024 · 1 comment

@lisad
Owner

lisad commented Jul 6, 2024

In phaser-example, the Seattle data for bicycle counters separates each location into a different file, whereas the data format we decided to output from the pipeline has a column for location description.

Since the Seattle data puts the location name in the file name, the file name is one place we could get the location from:
"Burke Gilman Trail NE 70th Bicycle Pedestrian Counter 20240705.csv"
"Thomas St Overpass Bike Ped Counter 20240526.csv"

In fact, this illustrates another common pattern: putting the date of a data file in the name of the file.
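
As a rough sketch of what "pulling the location out of the file name" could mean for these two examples, here is a small standalone parser. The pattern and the function name parse_counter_filename are assumptions made for illustration; it only covers the two naming variants shown above ("Bicycle Pedestrian Counter" and "Bike Ped Counter") and is not part of phaser.

```python
import re
from datetime import datetime
from pathlib import Path

# Assumed layout: "<location> <counter description> <YYYYMMDD>.csv".
# Only the two counter descriptions seen in the example files are handled.
FILENAME_PATTERN = re.compile(
    r"^(?P<location>.+?)\s+(?:Bicycle Pedestrian|Bike Ped) Counter\s+(?P<date>\d{8})$"
)


def parse_counter_filename(path):
    """Return (location, file_date) taken from a counter file name."""
    stem = Path(path).stem  # drops the directory and the ".csv" suffix
    match = FILENAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"Unrecognized counter file name: {path!r}")
    file_date = datetime.strptime(match.group("date"), "%Y%m%d").date()
    return match.group("location"), file_date


location, file_date = parse_counter_filename(
    "sources/Burke Gilman Trail NE 70th Bicycle Pedestrian Counter 20240705.csv"
)
print(location, file_date)
```

A file name that does not fit the assumed layout raises ValueError rather than silently producing an empty location.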

For pipelines that need to move information out of the filename into a field, how should we give access to the source file name?

  1. It's possible today: override the init_source method in the Pipeline to learn the name of 'source', add it to the Context as a variable, then call super().init_source and proceed. Later on, a step can pull the location out of the source filename and add it as a column value. This is pretty complicated, but we could document it.

  2. Another approach would be to let the command-line invocation pass in a variable, so the person running the pipeline would type: python3 -m phaser run seattle output "sources/Burke Gilman Trail NE 70th Bicycle Pedestrian Counter 20240705.csv" --var location="Burke Gilman Trail NE 70th". Passing variables on the command line is a good idea anyway for all kinds of variables, so I'll create a separate ticket for it.

  3. The phaser library could provide the source file names by adding them to the context for all steps and phases to access:

  • For extra sources, this would happen in the pipeline's init_source. Currently the context saves source names and data, but the names are internal names like "temp_data", not external file names like "temps Seattle 20240606.csv", so the data structure has no place to store the filename.
  • The main source would have to be handled differently.
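
For option 2, a minimal sketch of how repeated --var NAME=VALUE arguments could be collected with argparse. This is a standalone parser written for illustration, not phaser's actual command line; parse_var, build_parser, and parse_variables are hypothetical names.

```python
import argparse


def parse_var(text):
    """Split one NAME=VALUE argument into a (name, value) pair."""
    name, sep, value = text.partition("=")
    if not sep or not name:
        raise argparse.ArgumentTypeError(f"expected NAME=VALUE, got {text!r}")
    return name, value


def build_parser():
    # Hypothetical stand-in parser; the real phaser CLI has its own arguments.
    parser = argparse.ArgumentParser(prog="sketch")
    parser.add_argument("source", help="path to the source file")
    parser.add_argument(
        "--var",
        action="append",  # allow --var to be given more than once
        type=parse_var,
        default=[],
        metavar="NAME=VALUE",
    )
    return parser


def parse_variables(argv):
    """Return (source, variables) where variables is a dict of --var values."""
    args = build_parser().parse_args(argv)
    return args.source, dict(args.var)
```

The shell strips the quotes around the value, so --var location="Burke Gilman Trail NE 70th" arrives as a single argument and the spaces in the location survive.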

I think option 3 is a good idea, but it's not trivial and may involve some refactoring of how sources are loaded.
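
To make option 3 a bit more concrete, here is one possible shape for the data structure, assuming the context keeps the external file path next to the internal source name. SourceRecord, SketchContext, and add_source_file_column are hypothetical stand-ins, not phaser's real Context or step API.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SourceRecord:
    """Hypothetical record pairing an internal source name with its file."""
    name: str
    path: Path
    data: list = field(default_factory=list)


class SketchContext:
    """Stand-in for a pipeline context; not phaser's actual Context class."""

    def __init__(self):
        self.sources = {}

    def add_source(self, name, path, data=None):
        # Keep the external path alongside the internal name and the data.
        self.sources[name] = SourceRecord(name, Path(path), data or [])

    def source_filename(self, name):
        return self.sources[name].path.name


def add_source_file_column(rows, context, source_name, column="source_file"):
    """Example step: copy the external file name into every row."""
    filename = context.source_filename(source_name)
    return [{**row, column: filename} for row in rows]


context = SketchContext()
context.add_source("temp_data", "sources/temps Seattle 20240606.csv")
rows = add_source_file_column([{"temp": 61}], context, "temp_data")
```

With something like this, a later step could parse the location or date out of the file name without the pipeline author having to override init_source.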

@jeffkole
Collaborator

jeffkole commented Jul 9, 2024

Update phaser-example to use this new feature
