Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestions for discussing a more organized code structure #162

Closed
ArneNx opened this issue Mar 28, 2019 · 56 comments
Closed

Suggestions for discussing a more organized code structure #162

ArneNx opened this issue Mar 28, 2019 · 56 comments

Comments

@ArneNx
Copy link
Contributor

ArneNx commented Mar 28, 2019

I understand that removing/ deactivating the Theano backend is not an option right now.
I don't have enough insight into these parts to know whether moving the corresponding files into a separate folder (instead of adding a prefix to every file) would break something.

However, there appears to be some work on support for a pytorch backend (at least there is a new branch with work on this).
Would it be possible to move this stuff to a separate folder directly instead of adding more files to the root folder?

In general, I think it might be a good idea to collect some ideas on how we could improve the structure of the code.
I would suggest moving the layers into a folder structure similar to the organization in the docu by @JackTemaki. (see: https://returnn.readthedocs.io/en/latest/layer_reference/index.html#layer-types )

@albertz
Copy link
Member

albertz commented Mar 28, 2019

Example projects (for examples of structuring):

  • TensorFlow. src in /tensorflow/, tests alongside with code in *_test.py files
  • HuggingFace NLP. src in /src/nlp/, tests in /tests/
  • AllenNLP. src in /allennlp/, tests in /tests/

I was planning this since a while. But this is a bit non-trivial. There are multiple cases you have to think about:

  • When RASR imports RETURNN, it does so by importing the SprintInterface.py module (or some of the others). I am not sure if it will correctly work if this is now part of a Python package (submodule). Esp think about relative imports or so.

  • The main entry point (rnn.py) cannot be part of the package itself. So it has to be separate. Not sure if so important.

  • Python 2 and Python 3 behave slightly different when it comes to packages (as far as I remember).

  • All scripts/configs which use the old-style imports like TFUtil.safe_log or TFNetworkLayer.DotLayer should still work in any case (otherwise the wrapper would be wrong).

  • All scripts/configs which gradually updated the code, e.g. to returnn.tf.layers.DotLayer or so, should also still work, even if we move it then to returnn.tf.layers.base.DotLayer or whatever. The base package returnn.tf.layers would import all others.


Draft about the structure:

  • Use PEP8 snake case for all file package/module names.
  • Use only clear abbreviations, but which ones? (e.g. recurrent -> rec)
  • Move all code files to /returnn/ (or further sub directories, see below).
    • Exceptions: rnn.py, setup.py, tools, demos, tests, docs (leave in /)
    • Include: extern, extern_private, c_support_code, cuda_implementation
    • File name conventions: lower case with _ between words
  • Move most code from rnn.py to /returnn/__main__.py.
  • Keep /rnn.py, but mostly just call to /returnn/__main__.py.
  • Move TF*.py to /returnn/tf/*.py with new name conventions . (Example: import TFEngine becomes import returnn.tf.engine)
  • Move TFNetwork.py to /returnn/tf/network.py.
  • Move TFNetworkLayer.py to /returnn/tf/layers/basic.py and also create /returnn/tf/layers/__init__.py.
  • Move TFNetworkRecLayer.py to /returnn/tf/layers/rec.py.
  • Move TFNetworkSegModLayer.py to /returnn/tf/layers/segmental_model.py.
  • Move TFNetworkSigProcLayer.py to /returnn/tf/layers/signal_processing.py.
  • Move TFNetworkNeuralTransducer.py to /returnn/tf/layers/neural_transducer.py.
  • Move TFUtil.py to /returnn/tf/util/basic.py.
  • Move Theano files to /returnn/theano/.
  • Moving Theano layer files to /returnn/theano/layers/.
  • Move Op*.py, TwoState*, CTC* and BestPathDecoder* to /returnn/theano/ops/.
  • Move Util.py to /returnn/util/basic.py.
  • Move TaskSystem*.py to /returnn/util/task_system_*.py .
  • Move Debug*.py to /returnn/util/debug_*.py.
  • Move Sprint*.py (except SprintDataset.py) to /returnn/sprint/*.py.
  • Move Engine(Base|Batch|Util).py to /returnn/engine/
  • Move *Dataset.py to /returnn/datasets/
  • Move Fsa.py to /returnn/util/fsa.py

Code structure:

  • Use mostly absolute imports (from returnn.tf.layers.base import LayerBase) except maybe a few exceptions for relative imports (from .base import LayerBase).

Other tasks:
(Just an overview. Let's not go into too much details for now, unless there are real problems.)

  • Fix setup.py for the new structure.
  • Fix tests.
  • Create /__init__.py to handle a direct import of the repo directory itself.
    This should just wrap all to /returnn/__init__.py. Probably needs some path trickery.
  • Setup wrappers that old code/configs don't break. TFUtil, TFNetworkLayers, etc...
    This could be handled via the lazy loader in /__init__.py or /returnn/__init__.py.
    The lazy loader would otherwise not be needed anymore.
  • Test it.

Practical organization:

  • Finish planning.
  • Notify relevant people about the upcoming changes, and ask for help on testing.
  • Create new branch and freeze master (via GitHub branch protection rule).
  • Rename & move files without changing them (just git mv, such that the Git history stays clean, such that Git blame and co can correctly identify the rename/move of files; see here for a related question).
  • Apply other needed changes (fixes etc). Iterate by testing.
  • Merge into master. Ideally shortly after creating the branch, so that master is not frozen too long. Then we can unfreeze it.

@ArneNx

This comment has been minimized.

@albertz

This comment has been minimized.

@JackTemaki
Copy link
Collaborator

JackTemaki commented Jul 2, 2020

To revive this issue, I would like to add a new draft to tackle this:

Reasons for Restructuring

  • With the current file structure it is difficult to navigate trough the code without using the help of an IDE as e.g. PyCharm
  • Very long files cause lags or interruptions in PyCharm during coding
  • With the current file structure it is difficult to understand the code, especially it is not visible which files are used for both backends, and which are Theano only.
  • Many files contain arbitrary code, such as GeneratingDataset.py, which contains Datasets made for tests, but also for specific ASR corpora and also vocabulary functions/classes and feature extraction pipelines.
  • Large files make it difficult to see which parts exactly are changed in a commit, and when working on different problems --patch needs to be used often to select the relevant parts.

Known Issues

  • Sprint/RASR Search:

There are multiple approaches used to access the Sprint. Common examples are:

pymod-name                = crnn.SprintInterface
pymod-path                = ./path/to/crnn/..

or

pymod-name                = SprintInterface
pymod-path                = ./path/to/crnn

where crnn is the old repository name of RETURNN. Those variants should continue to work. If the SprintInterface would now be part of a subpackage (e.g. returnn.sprint.SprintInterface), absolute imports may break as the python path will not be set to the correct root directory. [1]

  • Standalone Execution:

When running RETURNN just from the rnn.py, the old imports need to be still valid, common is e.g.:
from Pretrain import WrapEpochValue [2]. Also, gobal variables such as config inside the config need to be present. [3]

  • RETURNN Scritps:

All scipts that import from RETURNN will have broken dependencies when files are moved to subdirectories, so a root-level __init__.py file needs to make sure that all old imports are still valid when the path is set to /path/to/returnn-git [4], but also when calling import returnn when the path is only /path/to/ and the repo name itself is returnn. [5]

  • Python 2 Compatibility:

All modifications need to be compatible to Python2, but no issues appeared yet. [6]

Proposal:
I have a current test environment which was able to solve the Issues [1-3][5] and [6], but fails on [4].

This changes included:

  • moving all .py code files except rnn.py to a returnn folder.
  • creating subpackages for theano and tensorflow related files
  • creating subpackages for all datasets and sprint related files
  • update all imports as absolute imports from the root level
  • keeping the __init__.py containg the lazy loader at the root-level, and adjusting the lazy loader to recursively add all modules, but store them under their name only to ensure that [1] and [5] are working
  • moving all content except the main function to returnn/main.py, and adding from returnn.main import * in the rnn.py to ensure all content is still accessible.

The test cases used were:

  • RASR search with python2.7 with three different RETURNN import variants.
  • Training with SprintExternDataset
  • Standalone-search with Attention-ASR

Todo:

  • Finding a solution for issue [4]
  • Further restructuring of the code
  • Finding more test setups and checking for those
  • Ensuring all current test-cases run, as this proposal was made with an older RETURNN version
  • fixing remaining cyclic dependencies
  • Fixing the setup.py and pip installation

@albertz

This comment has been minimized.

@albertz

This comment has been minimized.

@JackTemaki

This comment has been minimized.

@JackTemaki

This comment has been minimized.

@albertz

This comment has been minimized.

@albertz

This comment has been minimized.

@JackTemaki

This comment has been minimized.

@JackTemaki

This comment has been minimized.

@albertz

This comment has been minimized.

@JackTemaki

This comment has been minimized.

@albertz

This comment has been minimized.

@albertz

This comment has been minimized.

@albertz

This comment has been minimized.

@albertz

This comment has been minimized.

@JackTemaki

This comment has been minimized.

@albertz

This comment has been minimized.

@albertz
Copy link
Member

albertz commented Jul 14, 2020

Well, the objective should be how easy it is to read. Things like "Op" and "CTC" are clear of course (or at least writing those out wouldn't help), but without having looked into them I honestly don't know what "Inv.py" and "TFNetworkSegModLayer.py" are about. Currently there are not many of those cases anyway. But seg_mod.py? Nah ;P

But the problem here is not so much the abbreviation. Having "inverted" instead of "inv", or "segmental" instead of "seg", or "module" instead of "mod", that doesn't really change much, or does it? I think the names itself might be better chosen, but I still would keep using very standard abbreviations ("inv", "seg", "mod", etc). Actually your examples are I think mostly for outdated/unused code anyway.

@albertz
Copy link
Member

albertz commented Jul 14, 2020

/returnn/util/util.py

That does not look too nice.... would we generally accept to have modules in an __init__.py file, or do we want to avoid that? Otherwise Util.py -> /returnn/util/__init__.py would be ok

No, I think we should not put too much code into __init__.py.
Maybe /returnn/util/basic.py instead?

... I would rather use the full words instead of unclear abbreviations.

Yea, sure. I was anyway speaking mostly about clear and very common abbreviations.

@JackTemaki
Copy link
Collaborator

or "module" instead of "mod"

This is exactly why we should avoid it. This abbreviation was intended to be "segment model" not "segment module".

And even inverse.py would be a bad name... inverse of what? Inverse model? But then inverse model with respect to? To be honest, even when looking shortly over the code (e.g. class names) I have no idea what it does, as there is also AlignOp and InvAlignOp. Of course this is because of bad naming and non-existing docstrings, but the file name can be better as well.

@albertz
Copy link
Member

albertz commented Jul 14, 2020

This is exactly why we should avoid it. This abbreviation was intended to be "segment model" not "segment module".

Well, "mod" is definitely not common for "model", but it is very common for "module", esp in the context of Python. I would say, for Python code, having "mod" for "module" is always totally ok. I.e., it is not ok for "model", so "model" should be written out. I think this is a pretty straight-forward case.

inverse.py would be a bad name... inverse of what? Inverse model?

You can extend it, but this is not about the abbreviation here. Call it inv_model.py or so.
(I don't really know what this is about. It's anyway too specific, and does not belong to RETURNN at all, just as well as the segmental model, so these are anyway very bad examples.)

I would still say, for common abbreviations, we should definitely keep using the abbreviations. util is also such an example. Even returnn itself. Or tf. Or config. It does not make sense to write out these things.
In the same manner, I would say that rec is just as clear. Or sig. Or proc. Or param.

This is really pretty standard in Python code, and I think we should stick to Python conventions. E.g. Python uses def, init, str, int, repr, etc.

So, this effectively becomes more a discussion now about which are the cases which are ok. Let's also not put too much priority on these outdated code files which are anyway very bad examples.

Also, this is a question about consistency. Should we use both "rec" and "recurrent". Or only one of these? RecurrentLayer but _SubnetworkRecWrappedLoss?
(Or maybe to a certain degree, it's not too much a problem... I guess you will find both "param" and "parameter" somewhere in the code...)

@doetsch-apptek
Copy link
Contributor

It would be really nice if we could get access to at least one Theano setup. @doetsch @doetsch-apptek maybe?

What do you mean with theano setup? A training setup? I can provide you with that

there appears to be some work on support for a pytorch backend

Yes, but does it bother you in this branch? If it does then we can delete it, I still have a local copy of this backend and nobody is using it anyway (I just wouldn't want to delete it completely).

In general I think is a good thing to do this code restructuring. We could also make the theano RETURNN a standalone version which people have to checkout explicitly, while the actual RETURNN version only contains the TF backend.

@pavelgolik
Copy link
Contributor

pavelgolik commented Jul 17, 2020

This is also a good opportunity to unify the use of underscores and dashes under tools/.

@albertz
Copy link
Member

albertz commented Jul 17, 2020

This is also a good opportunity to unify the use of underscores and dashes under tools/.

Not sure about that. We are very careful here to not break old configs and scripts. This would likely break some scripts. We could add symlinks for old name -> new name for the tools. But not sure if this is really nicer than just leaving them as they are.

@JackTemaki
Copy link
Collaborator

I just saw that these two points are still open in the draft:

Move TFNetworkLayer.py to /returnn/tf/layers.py? Or /returnn/tf/layers/base.py and also create /returnn/tf/layers/__init__.py?.
Move TFNetworkRecLayer.py to /returnn/tf/rec_layers.py? Or /returnn/tf/layers/rec.py.

I would definitely favour to use /returnn/tf/layers/base.py and /returnn/tf/layers/rec.py, as we want to support future splitting of the layer files.

@JackTemaki
Copy link
Collaborator

There are still some open questions:

  • Move TFNetworkRecLayer.py to /returnn/tf/rec_layers.py? Or /returnn/tf/layers/rec.py?
  • Moving Theano layer files to /returnn/theano/layers?
  • Unify tool names, and keep compatibility with symlinks?
  • Move Util.py to /returnn/util/basic.py?
  • Move TaskSystem*.py to /returnn/util/task_system_*.py ?
  • Move Debug*.py to /returnn/util/debug_*.py?
  • Use only clear abbreviations, but which ones? (e.g. recurrent -> rec)

About the last point, there are currently inconsistencies (NetworkRecurrentLayer vs TFNetworkRecLayer), so this needs to be resolved. The layer modules will have shorter names anyways, as they are moved to returnn/backend/layers, so writing the files in question without abbreviations (segment_model.py, signal_processing.py and recurrent.py) would not be an issue.

I would also vote for introducing op packages /returnn/tf/ops/ and /returnn/theano/ops/.

@albertz
Copy link
Member

albertz commented Jul 28, 2020

So, I guess we agreed on TFNetworkLayer.py to /returnn/tf/layers/base.py.
Or maybe /returnn/tf/layers/basic.py to be consistent with Util.py below?
What about TFNetwork.py?

  • Move TFNetworkRecLayer.py to /returnn/tf/rec_layers.py? Or /returnn/tf/layers/rec.py?

I vote for the latter.

  • Moving Theano layer files to /returnn/theano/layers?

Yes. In any case consistent to the TF structure.

  • Unify tool names, and keep compatibility with symlinks?

I would vote for no. This would just make it even more chaotic.
But in any case, this can also be done (and discussed) independently from this issue, at some later point.

  • Move Util.py to /returnn/util/basic.py?

Ok.

  • Move TaskSystem*.py to /returnn/util/task_system_*.py ?

Ok.

  • Move Debug*.py to /returnn/util/debug_*.py?

Ok.

  • Use only clear abbreviations, but which ones? (e.g. recurrent -> rec)

Let's just be specific and discuss for the cases we currently have here. I guess "rec" is fine. What else needs to be discussed?

About the last point, there are currently inconsistencies (NetworkRecurrentLayer vs TFNetworkRecLayer), so this needs to be resolved.

There is never a must. Maybe it would be nice to be consistent.

But this will anyway not be possible 100%. E.g. you will find sometimes "net", sometimes "network", used for variable names, class names, comments, whatever... But I think that's ok. Let's not be too pedantic.

In general, I prefer short names.

The layer modules will have shorter names anyways, as they are moved to returnn/backend/layers, so writing the files in question without abbreviations (segment_model.py, signal_processing.py and recurrent.py) would not be an issue.

They are not shorter, the complete names are in fact longer now (esp when we mostly use absolute imports).

I would also vote for introducing op packages /returnn/tf/ops/ and /returnn/theano/ops/.

Which files are moved there?

@JackTemaki
Copy link
Collaborator

So, I guess we agreed on TFNetworkLayer.py to /returnn/tf/layers/base.py.
Or maybe /returnn/tf/layers/basic.py to be consistent with Util.py below?

Yes, I would say basic.py

I would also vote for introducing op packages /returnn/tf/ops/ and /returnn/theano/ops/.

Which files are moved there?

For Theano we have all Op* files, for Tensorflow there is only TFNativeOp so far, but maybe we want to have a package for future additions and keep the Theano/TF structure consistent. But of course this could be postponed to a later point.

Do we put SprintDataset.py to /returnn/sprint/ or to /returnn/datasets/?

@albertz
Copy link
Member

albertz commented Jul 29, 2020

Ok to basic.py.

ops is a bit more complicated.
The file NativeOp.py, as well as all ops implemented for it, are independent from both Theano and TF. (This is actually a very nice feature. I abstracted this away when I ported the code from Theano to TF. Even for Theano, it was already generic in the way that it auto-created the ops both for CPU and GPU.) (The file actually included Theano code up until recently, but that was just because I did not clean it up properly and moved it to a separate file. I did that now. All Theano related code is in TheanoNativeOp.py.)

Do we put SprintDataset.py to /returnn/sprint/ or to /returnn/datasets/?

This is a question of dependencies also. Code becomes much easier to maintain if dependencies between each other (e.g. at the level of packages or modules) are kept minimal, and also only in one direction if possible.

E.g. Util.py (and thus everything in /returnn/util/) is supposed to be independent from anything else in RETURNN.
Config.py and Log.py as well (although they might depend on util).

In some cases, I also like to keep the code even independent from Config and Log, if possible, such that it is also useful outside and independent of RETURNN. E.g. TFUtil.py is written in such a way.

datasets is also very independent, except that it might depend on Config, Log and Util.
And depending on the dataset, on further optional external dependencies, like h5py or librosa.

I think it is most consistent if SprintDataset.py would be moved to /returnn/datasets/sprint.py, and this particular dataset has then additional dependencies on /returnn/sprint.

The stuff in /returnn/sprint is also very independent, except maybe Config and Log.

@JackTemaki

This comment has been minimized.

@albertz

This comment has been minimized.

@JackTemaki

This comment has been minimized.

@albertz

This comment has been minimized.

@JackTemaki

This comment has been minimized.

@albertz

This comment has been minimized.

@JackTemaki

This comment has been minimized.

@albertz
Copy link
Member

albertz commented Jul 29, 2020

Maybe with some hacks or tricks.

We could e.g. add the returnn package to the path with the sys module when it is run as standalone.

This is not enough. You would also need sth like this:

__path__ = [os.getcwd()]
__name__ = "returnn.%s" % os.path.splitext(os.path.basename(__file__))[0]  # not sure...
__package__ = "returnn.%s" % os.path.splitext(os.path.basename(__file__))[0]
sys.modules[__name__] = sys.modules["__main__"]

Not sure if this is enough then, and works like this... (See e.g. here.)
But not so important now. We can make that work somehow.

Is there anything missing now with respect to the structure?

There are still some question marks in the proposal.
And besides them, not sure if we forgot about sth.

@JackTemaki

This comment has been minimized.

@albertz

This comment has been minimized.

@JackTemaki

This comment has been minimized.

@albertz

This comment has been minimized.

@albertz

This comment has been minimized.

@JackTemaki
Copy link
Collaborator

JackTemaki commented Jul 30, 2020

That effectively means: Either (almost) no-one will use this, and this code will rotten and then also no-one should use it anymore

Ok. So then this can be deleted/moved at some later point.

I think all things should be covered now.

EDIT: I also went over the pull-requests again, and 3 out of 4 are about tests and tools, so for those there should be no problem merging them in after restructuring. The last one is really old, and I am not sure what will happen to it.

@albertz
Copy link
Member

albertz commented Aug 5, 2020

Note: This is ongoing, in branch restructure, and mostly done. We will merge it soon to master. Please check.

@albertz
Copy link
Member

albertz commented Aug 5, 2020

We have performed the changes, and pushed them into master now.
In case you have any problems with some of your older setups, please report. Open a new GitHub issue for any new issue please.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment