Move Vocab & Embedding setup code into readers #337

undwie · 2017-11-12T12:51:42Z

I wrote some code that uses a jack reader (its model, pre and post-processing), but an own training loop---see below. This should be useful for others as a starting point for, say, using multi-task learning objectives etc. Personally, I think it's a good idea to think of this as the core way of interfacing jack. Everything else (a CLI training script etc) should be light wrapper around this.

    jack_train = ...some jack QASetting, Answer pairs

    emb_file = '../bloomsbury_data/external/embeddings/GloVe/glove.6B.50d.txt'
    embeddings = load_embeddings(emb_file, 'glove')

    config = load_config("jtr/conf/jack.yaml")
    config['name'] = "blurb"
    config['repr_dim_input'] = 50

    vocab = Vocab(emb=embeddings, init_from_embeddings=True)
    shared_resources = SharedResources(vocab=vocab, config=config)
    input_module = XQAInputModule(shared_resources)
    model_module = FastQAModule(shared_resources)
    output_module = XQAOutputModule(shared_resources)
    reader = TFReader(shared_resources, input_module, model_module, output_module)

    tf.reset_default_graph()
    reader.setup_from_data(jack_train, is_training=True)

    loss = reader.model_module.tensors[Ports.loss]
    optimizer = tf.train.AdagradOptimizer(learning_rate=0.01)
    min_op = optimizer.minimize(loss)

    sess = model_module.tf_session
    sess.run(tf.global_variables_initializer())

    for epoch in range(0, 10):
        for batch in reader.input_module.batch_generator(jack_train, 1, False):
            feed_dict = reader.model_module.convert_to_feed_dict(batch)
            loss_value, _ = sess.run((loss, min_op), feed_dict=feed_dict)
        if epoch % 5 == 1:
            print(loss_value)

I think we should work on streamlining this. What I dislike about the above is the handling of vocabs and embeddings. This is reader-specific, and should be handled internally. So I'd rather it looked like:

    config = load_config("jtr/conf/jack.yaml")
    config['name'] = "blurb"
    config['repr_dim_input'] = 50
    config['emb_file'] = '../bloomsbury_data/external/embeddings/GloVe/glove.6B.50d.txt'
 
    shared_resources = SharedResources(config=config)
    input_module = XQAInputModule(shared_resources)
    model_module = FastQAModule(shared_resources)
    output_module = XQAOutputModule(shared_resources)
    reader = TFReader(shared_resources, input_module, model_module, output_module)

and then the input module (or model module) takes care of setting up vocabs and embeddings. Because don't have this separation, right now, vocab and embedding code creeps into learning code, even though it's specific to readers. The above would fix this.

To fix this, the reader-owners need to move embedding/vocab setup code into their modules.

Any thoughts or objections?

pminervini · 2017-11-12T17:50:16Z

This is also useful for writing more fine-grained tests that check things during training

I quickly turned your code here into a test: https://github.com/uclmr/jack/blob/master/tests/jack/readers/test_fastqa_loop.py

dirkweissenborn · 2017-11-13T10:00:27Z

I think vocabs and emebddings could simply be created and loaded in the shared resources and not the modules. In the modules it would introduce unnecessary overhead for most implementations, because embedding handling is quite reader agnostic.

riedelcastro · 2017-11-13T10:36:20Z

I'd disagree. I have already seen readers that a) need or not need character embeddings, b) several vocabs for different parts (questions, answers, support), c) only use pre-trained embeddings etc. All of these seem like reasonable decisions that are reader-specific (or input module specific)

This doesn't stop us from factoring out a lot of functionality, such that readers/input modules only use a single call to some create_vocab etc. function in case they really share functionality.

dirkweissenborn · 2017-11-13T10:42:22Z

I agree with the vocab part, but the embeddings can be provided separately already through the shared resources.

riedelcastro · 2017-11-13T11:05:32Z

I'd prefer if the reader was in charge of that too. This makes setting up readers more straightforward for the user. Right now I have to think about what I need to prepare (the embeddings), and what the reader is doing to populate the shared resources. And I need to look into the reader to decide whether it needs embeddings or not (if it doesn't, I shouldn't load embeddings). I would like the API usage to be

Setup configuration dictionary
Load the reader with this configuration

I also feel that even embedding loading can get convoluted (for example, if you decide to preload the embeddings for the training set, or if you want to load other types of pre-trained models etc).

dirkweissenborn · 2017-11-13T11:10:15Z

I don't have a strong opinion here and understand your point. It makes it also clear to the user where these resources are coming from when they are not auto-magically created somewhere else first. We had exactly this problem before when people did not understand where embeddings and vocabs actually come from and how they are created. Anyway, this is a major change so we should push that rather sooner than later.

dirkweissenborn · 2017-11-22T15:42:16Z

@riedelcastro @pminervini , anyone working on this?

dirkweissenborn · 2018-07-17T14:41:49Z

vocab and embeddings are now decoupled. The setup code should still move into the readers themselves though.

dirkweissenborn added this to the v0.2 milestone Nov 13, 2017

dirkweissenborn mentioned this issue Nov 21, 2017

Embedding paths do not affect load directory? #340

Closed

dirkweissenborn assigned riedelcastro Nov 24, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move Vocab & Embedding setup code into readers #337

Move Vocab & Embedding setup code into readers #337

undwie commented Nov 12, 2017

pminervini commented Nov 12, 2017

dirkweissenborn commented Nov 13, 2017

riedelcastro commented Nov 13, 2017

dirkweissenborn commented Nov 13, 2017

riedelcastro commented Nov 13, 2017

dirkweissenborn commented Nov 13, 2017 •

edited

Loading

dirkweissenborn commented Nov 22, 2017

dirkweissenborn commented Jul 17, 2018

Move Vocab & Embedding setup code into readers #337

Move Vocab & Embedding setup code into readers #337

Comments

undwie commented Nov 12, 2017

pminervini commented Nov 12, 2017

dirkweissenborn commented Nov 13, 2017

riedelcastro commented Nov 13, 2017

dirkweissenborn commented Nov 13, 2017

riedelcastro commented Nov 13, 2017

dirkweissenborn commented Nov 13, 2017 • edited Loading

dirkweissenborn commented Nov 22, 2017

dirkweissenborn commented Jul 17, 2018

dirkweissenborn commented Nov 13, 2017 •

edited

Loading