Ideas


NEXT VENUES...


DONE

  • Reinforcement Learning demo (Python)

    • DeeR - theano-based
    • AgentNet - theano + lasagne-based
      • Examples include Atari space-invaders using OpenAI Gym
      • In iPython notebooks!
    • Need to understand whether OpenAI gym, ALE, or PLE (PyGame Learning Environment) can be seen from within non-X container
    • Potential to make Javascript renderer of Bubble Breaker written in Python
      • Host within Jupyter notebook (to display game-state, and potentially play interactively)
      • Game mechanics driven by Python backend
      • Interface similar (i.e. identical) to ALE or PLE
        • Idea for 'longer term' : Add this as an OpenAI Gym environment
      • Learn to play using one-step look-ahead and deep-learned value function for boards
        • Possible to add Monte-Carlo depth search too
      • Difficulty : How to deal with random additional columns
        • Would prefer to limit time-horizon of game
          • Perhaps have a 'grey column' added with fixed (high) value as a reward
        • May need to customize reward function, since it is (in principle) unbounded
          • Whereas what's important is the relative value of the different actions (rather than their individual accuracy)
      • Optimisation : Game symmetry under permutation of the colours (see the sketch after this list)
        • WLOG, can assume the colour in the bottom right is colour '1'
          • But colouring the remainder still gives us 3·2·1 = 6 choices
          • So there are 6x as many training examples available as without re-labelling
          • Perhaps enumerate the other colours in bottom-to-top, right-to-left order for definiteness
            • Cuts down redundancy in the search space, but may open up 'strange holes' in knowledge
      • Should consider what a 'minibatch' would look like
        • Training of batches of samples looks like experience replay
        • Selection of next move requires 'a bunch' of feed-forward evaluations - number unknown
          • Find average # of moves available during a game
          • Find average # of steps played during a game
      • Simple rules to follow:
        • Select next move at random from list of available areas, equally weighted
        • Select next move at random from list of available areas, weighted by score (or simply cell-count)
  • Reinforcement Learning demos (Karpathy, mostly in Javascript)

  • Anomaly detection
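
A minimal sketch of the colour-relabelling augmentation from the Bubble Breaker notes above; the board encoding (a 2-D array of colour ids 1..n, with 0 = empty) and the 4-colour count are illustrative assumptions:

```python
import numpy as np
from itertools import permutations

def colour_relabellings(board, n_colours=4):
    """All boards equivalent to `board` under permutation of the colour labels 1..n_colours."""
    boards = []
    for perm in permutations(range(1, n_colours + 1)):
        relabelled = np.zeros_like(board)            # empties (0) stay 0
        for old, new in zip(range(1, n_colours + 1), perm):
            relabelled[board == old] = new
        boards.append(relabelled)
    return boards

example = np.array([[1, 2, 2, 4],
                    [3, 1, 0, 2]])
print(len(colour_relabellings(example)))  # 4! = 24 ; pinning the bottom-right colour WLOG leaves 3! = 6
```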

TensorFlow and Deep Learning Singapore - (2017-02-16) DONE

https://github.com/tensorflow/models/tree/master/slim

Next workshop venue : FOSSASIA - (2017-03-18 @16:55) 1hr

I gave a Deep Learning talk last year at FOSSASIA. This was followed by more talks on the same subject at PyConSG and FifthElephant (India).

Since the last FOSSASIA, the Deep Learning Workshop repo (on mdda's GitHub) has been extended substantially.
Depending on the time allotted, we'll be able to tackle 1 or 2 'cutting edge' topics in Deep Learning.
Participants will be able to install the working examples on their own machines, and tweak and extend them for themselves.

Like last year, the Virtual Box Appliances will be distributed on USB drives : The set-up has been proven to work well.
Since this is hands-on, some familiarity with installing, running and playing with software will be assumed.
Depending on demand, I can also do a quick intro about Deep Learning itself, though that would be pretty well-trodden ground that people who are interested would have seen several times before.

  • 1hr <--- This is what they've asked for

This looks interesting :: https://aiexperiments.withgoogle.com/ai-duet Also :: Drawing from edges (cats?)

Also :: seq2seq ? https://research.fb.com/downloads/babi/

http://cs.mcgill.ca/~rlowe1/interspeech_2016_final.pdf
https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
wget https://people.mpi-sws.org/~cristian/data/cornell_movie_dialogs_corpus.zip

TensorFlow and Deep Learning MeetUp talk - (2017-03-21) 30mins

Intro to CNNs :

  • Re-work the FOSSASIA presentation to integrate more demo per minute...

See:

Do soon :

  • Description of difference between old and new style TensorFlow Evaluator() calling
  • CNN for MNIST
  • Adversarial images
  • Auto-encoders

TensorFlow and Deep Learning MeetUp talk - (2017-04-13) 30mins

Focus is on GANs. Essentially, 'my turn' to do advanced topic.

  • Would be cool to link in Tacomatic paper to Stamp speech thing somehow.
  • Using Keras

DONE : Implement googlenet in Keras for model zoo

Good post, but requires new BN layer def, etc http://joelouismarino.github.io/blog_posts/blog_googlenet_keras.html

Use the GoogLeNet slim saved model in the pure Keras version:
  wget http://download.tensorflow.org/models/inception_v1_2016_08_28.tar.gz
  tar -xzf inception_v1_2016_08_28.tar.gz
How to read weights saved in a TensorFlow checkpoint file : http://stackoverflow.com/questions/40118062/how-to-read-weights-saved-in-tensorflow-checkpoint-file
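
A minimal sketch (assuming TF 1.x) of reading the slim checkpoint weights, along the lines of the StackOverflow link above; the variable name shown follows the slim InceptionV1 naming scheme and is illustrative:

```python
import tensorflow as tf

reader = tf.train.NewCheckpointReader('inception_v1.ckpt')

# List everything stored in the checkpoint, to map slim names onto Keras layer names
for name, shape in sorted(reader.get_variable_to_shape_map().items()):
    print(name, shape)

# Pull a single weight tensor out as a numpy array (HWIO layout for conv kernels)
kernel = reader.get_tensor('InceptionV1/Conv2d_1a_7x7/weights')
print(kernel.shape)
```

The resulting numpy arrays can then be pushed into the corresponding Keras layers with `layer.set_weights(...)`.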

PR : fchollet/deep-learning-models#59

TODO : Implement DenseNet in Keras for model zoo

https://github.com/liuzhuang13/DenseNet

Find papers for : Shape of 1-d profile from start to minimum and beyond
  _LIVE/Backprop/GoodFellow-2015_Optimisation-is-StraightLine_1412.6544v5.pdf

Egg-carton minima
  The Loss Surfaces of Multilayer Networks
    _INBOX/_LossSurfaceOfDeepNets_1412.0233.pdf
    "We show that for large-size decoupled networks the lowest critical values of the random loss function form a layered structure and they are located in a well-defined band lower-bounded by the global minimum."

Qualitatively characterizing neural network optimization problems  (Goodfellow+ 2014)
  https://arxiv.org/abs/1412.6544

Information of Layers
  Opening the Black Box of Deep Neural Networks via Information
    _INBOX/_TwoPhasesOfSGD_1703.00810.pdf
    _INBOX/_NN-InformationTheory-SGD_1703.00810.pdf

Rethinking generalisation
  Understanding Deep Learning Requires Re-Thinking Generalization
    _INBOX/_RethinkingGeneralisation_a667dbd533e9f018c023e21d1e3efd86cd61c365.pdf
    Hmm : https://www.reddit.com/r/MachineLearning/comments/5utu1p/d_a_paper_from_bengios_lab_contradicts_rethinking/
          https://openreview.net/pdf?id=rJv6ZgHYg

  An empirical analysis of the optimization of deep network loss surfaces
    https://arxiv.org/abs/1612.04010
  
  Qualitatively characterizing neural network optimization problems by Ian J. Goodfellow, Oriol Vinyals, Andrew M. Saxe
    https://arxiv.org/abs/1412.6544

  Flat Minima (1997)
    http://www.bioinf.jku.at/publications/older/3304.pdf

Spatial pyramid pooling

For PyTorch (2017-07-06) :

DeepMind Relation-Networks ("RN") : https://arxiv.org/abs/1706.01427
  PyTorch implementation : Implementation of "Sort-of-CLEVR"
    https://github.com/kimhc6028/relational-networks
    https://github.com/mdda/relational-networks
  bAbI :
    Keras   : http://smerity.com/articles/2015/keras_qa.html
    PyTorch : https://github.com/thomlake/pytorch-notebooks/blob/master/mann-babi.ipynb

TODO : Keras introductory example

https://medium.com/towards-data-science/stupid-tensorflow-tricks-3a837194b7a0 https://github.com/thoppe/tf_thomson_charges

For PyTorch (2017-10-19) "TTS" ??

Voice Synthesis for in-the-Wild Speakers via a Phonological Loop
  Website : https://ytaigman.github.io/loop/
  Paper   : https://arxiv.org/abs/1707.06588
  Code?   : https://github.com/ytaigman/loop (follow author to get an alert)
  Code    : https://github.com/facebookresearch/loop

WORLD Vocoder
  Base : Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. World: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE TRANSACTIONS on Information and Systems, 99(7):1877–1884, 2016
    Paper   : https://www.jstage.jst.go.jp/article/transinf/E99.D/7/E99.D_2015EDP7457/_pdf
    Website : http://ml.cs.yamanashi.ac.jp/world/english/introductions.html
  D4c  : Masanori Morise. D4c, a band-aperiodicity estimator for high-quality speech synthesis. Speech Communication, 84:57–65, 2016.
    Paper: https://ecantorix.moe/synthesis/mbrola/mmorise_d4c.pdf
    Code : https://github.com/mmorise/World (BSD : See : http://ml.cs.yamanashi.ac.jp/world/english/faq.html)
    Code : https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder
  TEST:  Does round-trip of parameters work?
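
A minimal sketch of that round-trip test, assuming the pyworld package from the Python-Wrapper-for-World-Vocoder repo above (the wav filename is a placeholder):

```python
import numpy as np
import soundfile as sf
import pyworld as pw

x, fs = sf.read('speech.wav')         # pyworld expects a float64 mono signal
f0, sp, ap = pw.wav2world(x, fs)      # analysis: F0 contour, spectral envelope, aperiodicity
y = pw.synthesize(f0, sp, ap, fs)     # re-synthesis from the WORLD parameters

n = min(len(x), len(y))
print('mean abs error over round-trip:', np.abs(y[:n] - x[:n]).mean())
```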

Merlin Toolkit
  Zhizheng Wu, Oliver Watts, and Simon King.  Merlin: An Open Source Neural Network Speech Synthesis System, pages 218–223. 9 2016.
    Paper   : http://ssw9.net/papers/ssw9_PS2-13_Wu.pdf
    Code    : https://github.com/CSTR-Edinburgh/merlin  (Apache 2.0)
    Project : http://www.cstr.ed.ac.uk/projects/merlin/
    Sample  : (output via ?WORLD) https://cstr-edinburgh.github.io/merlin/demo.html
    
SampleRNN
  Website : http://www.josesotelo.com/speechsynthesis/  (though actual samples seem to be missing)
  Paper   : https://openreview.net/forum?id=B1VWyySKx
  Code    : https://github.com/soroushmehr/sampleRNN_ICLR2017  (MIT)  
  Code    : https://github.com/sotelo/parrot - RNN-based generative models for speech. 

Graves attention model 
  Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013
    https://arxiv.org/abs/1308.0850  (43 pages)
      Handwriting example - has incremental window, defined on p26/corner26
      Effects of 'priming' for handwriting synthesis, shown p37/corner37 onwards

Restructuring REPO

Individual projects should go into their own repos to increase discoverability, eg:
  • BubbleBreaker << mostly DONE
  • CNN for Voice Stamps << mostly DONE
  • Transfer Learning with SVM
  • Keras GoogLeNet (including slim-to-keras)
  • Captioning (including AIAYN)
  • ChooseGPU
and then pull in these repos suitably for the main VM creation.
Also : Add RPM proxy mechanism /or/ think about doing it via docker and a wrapper...

Now :
  • Migrate off Theano fully
  • Contemplate installing PyTorch
    • Or, instead, making use of tf.eager

hinton_says_we_should_scrap_back_propagation

https://www.reddit.com/r/MachineLearning/comments/70e4ex/n_hinton_says_we_should_scrap_back_propagation/

[–]Optrode 136 points 2 days ago

So, I'm gonna offer a sort of outside perspective, which is the perspective of a neuroscience researcher who has only a basic understanding of ML.

I can see differences between how information is processed in the brain and in ANNs, but of course the caveat is that I have no clue which (if any) of those differences represent opportunities for improvement via biomimicry.

That said, the notable differences I see between brains and deep learning models are:

  • Sensory systems in the brain usually have a great deal of top down modulation (think early layers receiving recurrent input from later layers). There aren't really any sensory or motor systems in the brain that AREN'T recurrent.

  • Sensory systems in the brain also tend to have a lot of lateral inhibition (i.e. neurons inhibiting other neurons in the same layer).

  • Brain sensory systems tend to separate information into channels. e.g. at all levels of the visual system, there are separate pathways for high and low spatial frequency content (outline & movement vs. texture), and color information.

  • Particularly with regard to the visual system, inputs are always scanned in a dynamic fashion. When a person views a picture, only a very small subsection of the image (see: fovea, saccade) is seen at high detail at any instant. The "high detail zone" skips around the image, lingering on salient points.

  • Obviously, there's STDP. STDP essentially pushes neurons to predict the future, and I think that unsupervised training methods that focus on predicting the future (this came up in the recent AMA, as I recall) obtain some of the same benefits as STDP.

  • I've seen several comments in this thread on how reducing the number of weights per node (e.g. CNN, QRNN) is beneficial, and this resembles the state of affairs in the brain. There is no such thing as a fully connected layer in the brain, connectivity is usually sparse (though not random). This usually is related to the segregation of different channels of information.

  • Lastly, most information processing / discrimination in the brain is assisted by semantic information. If you see a person in a hospital gown, you are primed to see a nurse or doctor. This remains true for a while afterwards, since we rarely use our sensory facilities to view collections of random, unrelated photos.

I read the wiki for STDP but didn't quite get a full understanding. Would you be able to talk a bit about it?

Sure! It's actually pretty simple.

Suppose we have two neurons, A and B. A synapses onto B ( A->B ). The STDP rule states that if A fires and B fires after a short delay, the synapse will be potentiated (i.e. B will increase the 'weight' assigned to inputs from A in the future).

The magnitude of the weight increase is inversely proportional to the delay between A firing and B firing. So, if A fires and then B fires ten seconds later, the weight change will be essentially zero. But if A fires and B fires ten milliseconds later, the weight update will be more substantial.

The reverse also applies. If B fires first, then A, then the synapse will weaken, and the size of the change is again inversely proportional to the delay.

ELI5 version: STDP is a rule that encourages neurons to 'pay more attention' to inputs that predict excitation. Suppose you usually only bring an umbrella if you have reason to think it will rain (weather report, you see rain outside, etc.). Then you notice that when your neighbor is carrying an umbrella, even though you haven't seen any rain in the forecast, sure enough, a few minutes later you see an updated forecast (or it starts raining). This happens a few times, and you get the idea: your neighbor seems to be getting this information (whether it is going to rain) before your current sources do. So in the future, you pay more attention to what your neighbor is doing.
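
A toy numerical sketch of the pair-based STDP rule described above; the exponential window, time constant and amplitudes are assumptions for illustration (the comment only says the change shrinks with delay):

```python
import numpy as np

def stdp_dw(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau_ms=20.0):
    """Weight change for a single pre/post spike pair (times in milliseconds)."""
    dt = t_post - t_pre
    if dt >= 0:
        return  a_plus  * np.exp(-dt / tau_ms)   # pre before post : potentiate (less for longer delays)
    else:
        return -a_minus * np.exp( dt / tau_ms)   # post before pre : depress

print(stdp_dw(0.0, 10.0))     # short delay -> noticeable strengthening
print(stdp_dw(0.0, 10000.0))  # ten-second delay -> essentially zero change
print(stdp_dw(10.0, 0.0))     # reversed order -> weakening
```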

  [–]cbeak 2 points 16 hours ago 

  I think when taking the brain as inspiration, the main question is which kinds of neural computations are necessary 
  and which are merely biological artifacts/spandrels. 
  
  Superficially, short-term plasticity strikes me as an artifact because it results from neurotransmitter depletion. 
  
  Spikes are necessary to avoid noise build-up, and depletion seems to be basically an artifact of this adaptation. 
  And even if depletion is evolved to be more pronounced and useful for computations such as gain-control 
  (down-regulating high-frequency inputs and up-regulating low-frequency input) and high- or low-band-filtering (which it appears to be), 
  it remains a question whether one would lose important computations if one leaves out such details. 
  
  I could imagine that each additional kind of computation will make a suitable learning rule more complicated 
  because each kind comes with its own set of hyper-parameters (e.g. gain, band specifications and kernel sizes), 
  each of which must probably be balanced in just the right way to avoid positive feedback-loops and catastrophic forgetting. 
  
  I could also imagine that several different priors expressed in the kernel sizes are necessary such that 
  different neurons can efficiently extract temporal information that is interesting in real-world data 
  (basically from the milliseconds to the seconds scale). 
  
  It generally seems like we need 
  (1) plenty of different kinds of computations, 
  (2) a connectome with a stochastic but a fairly simple connection scheme, 
  (3) a free energy minimizing learning rule where the energy is measured by sparse and delayed rewards and prediction errors. 
  
  Among those computations will likely be multiple kinds of non-linearities, 
  spatial and temporal clustering, modulation, normalization, lateral inhibition, 
  plenty of modulatory feedback connections. 
  
  The last step might be a massive hyper-parameter search by evolution of embodied agents in a resource-constrained sim. 
  That's what I would bet my money on.

More explicit description of comms

[–]deathofamorty 1 point 1 day ago

How is it that neurons A and B know when each other fires? Is there a special type of synapse or something?

[–]Optrode 6 points 23 hours ago

Well, assuming that A synapses onto B but there is no reciprocal connection, A does not know when B fires. B, the post-synaptic neuron, knows A fired because it receives synaptic input from that synapse when A fires. Altering that synaptic weight is (in the most common cases) something that B does. A does not have to actively participate, beyond simply having fired at the appropriate time (which B detects).

The exact mechanism for the synaptic potentiation is not clear...
We know what some of the mechanisms in some cases are. There is a type of glutamate receptor, the NMDA receptor, that is well known for its role in long term synaptic potentiation (LTP). The NMDA receptor acts as a coincidence detector: it will only allow calcium ions into the postsynaptic neuron if a synaptic signal is received when the postsynaptic neuron is already depolarized to a positive voltage (i.e. activated).

Mind you, that's extremely ELI5. There's a lot more to it, such as the fact that what actually matters is whether the DENDRITE (input structure of the neuron) is depolarized, not the whole cell, and those don't necessarily go hand in hand. Exactly how strongly the depolarization of the neuron's cell body depolarizes any particular dendrite branch will depend on the structure of the branch, and this can make it so that certain other synaptic inputs (a neuron has an average of 7000) may have a greater effect on whether synapses on a particular dendrite are in a state to be strengthened by LTP.

Dendrites also have other cool properties, like how it's possible for a certain type of inhibitory input (Cl- channel mediated inhibition, as opposed to K+ channel mediated inhibition) to be capable of canceling out only certain excitatory inputs, but not others, as well as controlling how readily the neuron can be excited by repeated excitatory inputs (vs. requiring all the excitatory input to arrive at once).

Which kind of demonstrates another important difference between artificial neural networks and real neurons... The "neurons" in an ANN are mostly linear, they just have a nonlinear activation function. Inputs are linearly summed. Real neurons do not linearly sum their inputs; the whole process of receiving input is nonlinear as fuck.

[–]timtom85 2 points 9 hours ago

I also checked out STDP after reading this and what caught my attention was the weakening of the weight if an input fires slightly after the neuron does. It seems very important to me because it can effectively get rid of spurious correlations, it can suppress feedback loops, and it can weed out unnecessary connections.

[–]CireNeikual 5 points 2 days ago

What about TargetProp? It works without differentiable functions, it can be used with STDP/Hebbian learning (with appropriate discrete timesteps Hebbian and STDP can be equivalent).

I personally like revisiting old methods and seeing how they fare with some new upgrades. Adaptive Resonance Theory, Self-Organizing Maps, or any other kind of vector quantizer. When in an appropriate architecture, they can do some interesting things. Interestingly, as soon as one abandons the need for differentiable functions and embraces sparsity, online/lifelong/incremental learning becomes much easier. This also leads to a performance boost, as one doesn't need many decorrelated replay samples in order to update. Further, with sparsity, sparse updates are possible, giving a further performance boost.

The human brain is quite sparse (it's the function of inhibitory neurons), so I feel like this is the right direction to take. Sparsity leads to low processing power use, something I feel this field desperately needs, with all the big projects taking fat GPU-filled server racks.


[–]nobackprop 1 point 1 day ago

I'll repeat here what I wrote elsewhere in this thread.

There is only one viable solution to unsupervised learning, the one used by the brain. It is based on spike timing. The cortex tries to find order in sensory discrete signals or spikes. The only type of order that can be found in spikes is temporal order. Here is the clincher: spikes are either concurrent or sequential. I and others have been saying this for years. Here's a link, if you are interested:

Why Deep Learning Is a Hindrance to Progress Toward True AI

It's all about timing.


[–]mindbleach 0 points 1 day ago

Train another network to guess future coefficients. So basically, still backprop at heart, but faster and more chaotic. Leap blindly downhill on gradient descent.

Early on, maybe keep the shitty random values, but change the connections.

Papers

FOSSASIA 2018 :

  • Meta-Learning
    • Reptile (OpenAI)
      • Has nice webpage/blog-post : https://blog.openai.com/reptile/

        • Minimal effort required to get going - includes known-good model
          • Now have working example in Jupyter (with better comments/variable-names)
      • But : This builds a network that is as-retrainable-as-possible, rather than solving a single problem as well as possible. ie : less 'delightful' than a single good result.

        • OTOH, the 3 boxes classification is fun as JS, and the Sine-wave example makes it pretty clear
      • Interesting that Reptile doesn't do descent using proper derivatives, but seems very effective

        • Where does the 'approximately in the right direction' SGD proof come from?
        • Does this allow for optimisation of other non-differentiable steps/ops
          • Even though the gradients don't exist, there's a reasonable proxy that would have the right sign in expectation?
      • Possible new idea :

        • Instead of trying for best weight initialisation to enable one-shot learning
          • Learn best structure, with standardised initialisation (say +1 / -1 for weights)
          • Or have a structure consisting of two types of layers, one wholly positive, the other negative
          • Or have the structure scheme somehow define the initialisation spatially
        • Highlight that DNA could determine structure conducive to easy training
      • Plan for talk at Thai+Singapore AI Days:

        • Rewrite PyTorch-sines example using TensorFlow Eager mode
          • First-hand experience with Eager Mode
          • Allows wearing of TF polo with pride to Thai event...
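
A minimal sketch of the Reptile outer loop on the sine-wave task mentioned above, in PyTorch; hyper-parameters (inner_steps, meta_step_size, network size) are illustrative, not taken from the OpenAI code:

```python
import numpy as np
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))

def sample_task(n_points=10):
    """A random sine-wave regression task: random amplitude and phase."""
    amp, phase = np.random.uniform(0.1, 5.0), np.random.uniform(0, np.pi)
    x = np.random.uniform(-5, 5, (n_points, 1)).astype(np.float32)
    y = (amp * np.sin(x + phase)).astype(np.float32)
    return torch.from_numpy(x), torch.from_numpy(y)

meta_step_size, inner_steps = 0.1, 5
for it in range(1000):
    x, y = sample_task()
    old = [p.detach().clone() for p in net.parameters()]      # remember starting weights
    opt = torch.optim.SGD(net.parameters(), lr=0.02)
    for _ in range(inner_steps):                               # ordinary SGD on the sampled task
        opt.zero_grad()
        nn.functional.mse_loss(net(x), y).backward()
        opt.step()
    with torch.no_grad():                                      # Reptile step: nudge towards adapted weights
        for p, p_old in zip(net.parameters(), old):
            p.copy_(p_old + meta_step_size * (p - p_old))
```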

Intrinsic Dimension (Uber) = DONE

float16 outline :: DONE

Re-Think modularisation for deep-learning-workshop

Now that Google's Colab can run notebooks directly, should rethink modularity :

  • More benefit (in visibility, for instance) from making individual Repos
    • But should try to avoid losing benefit of centrally installed datasets
    • Perhaps the individual downloader (which would have to exist) should download into ./data/... after checking
    • Within the VM-builder, add softlinks for ./data to a central data storage area
      • Datasets would be deduped naturally, and available off-line
  • Need to have some kind of sub-repo manifest :
    • Unneeded repos + data wouldn't be included
    • OTOH, if the VGG weights are excluded, there's more 'room' in the VM for other stuff
  • Keras (for instance), with its pre-computed weights, breaks having a clean 'build into tree' idea
  • For own repos, should have a 'create artifact' makefile(?)
    • Enable loading into Colab via Google Drive (or via DropBox)
  • Need to clean out existing 'dual source' modules
    • 'speech' specifically needs tidying
    • ReinforcementLearning/BubbleBreak is already separated
      • Improve its README
  • Break out Transfer Learning
    • Including data (and data loader) = Possible to have a standardised data loader in a module - or simply be explicit
    • Advantages of being self-contained : Run on Colab outside of VM
  • Break out MetaLearning now
    • What to do about 3-boxes JS example?
  • Respond to PR acceptance + layer connectivity query on GoogLeNet in Keras
  • Actually, would make sense to have 'an intern' who would re-do the notebooks:
    • All use DataSets API for Keras
      • Maybe build a special 'ingester' bible, with lots of examples
    • Same for PyTorch(?) - except that the training loop thing may get smartened up in v1.0

Also, figure out a good 'private code+data' workflow too:

  • For persistence (and data?), good to have Google Drive mounted
  • Local code repo (more likely : folder, or assembled .tar)
    • Upload to Colab after changes - automation?
    • Possible to store Google Drive key locally?
  • Or have a google cloud bucket that can be easily copied
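
For the 'Google Drive mounted' point above, the standard Colab mount call (only works from inside a Colab runtime):

```python
# Mount Google Drive for persistence inside a Colab notebook
from google.colab import drive
drive.mount('/content/drive')   # interactive auth; Drive then appears under /content/drive
```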

Jack and PulseAudio

Next TF&DL ideas :

  • ENAS is also a tempting idea

    • Overall number of back-prop steps would be similar
      • But need to have structure updates
      • Possible to do in PyTorch or TensorFlow Eager Mode... = Maybe create a simpler-to-understand version :
      • Have a fixed number of 'slots' that parameterised modules can operate on
      • Have a parameter/op budget, after which chain of ops is truncated
      • Ops are additive to the slots (so naturally 'residual-like')
      • Need to think how to 'impedance match' different sized layers
        • Images and RNN hidden states are two different cases
        • Images : Some kind of transpose operation (flipping depth for area) ?
          • i.e. not information-destructive
        • RNN hidden state vector : Some kind of structured map (sparse-ish) between layers ?
          • But that would be information-changing
          • Possibly use random-projection (fixed seed) to avoid the memory-access bandwidth issue
  • Auto-Encoders for Fraud detection (Keras)
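
A minimal Keras sketch of the auto-encoder fraud-detection idea above; the feature width, layer sizes, stand-in data and 95th-percentile threshold are all illustrative assumptions:

```python
import numpy as np
from tensorflow import keras

feature_dim = 30
inp = keras.Input(shape=(feature_dim,))
h = keras.layers.Dense(14, activation='relu')(inp)
h = keras.layers.Dense(7, activation='relu')(h)     # bottleneck
h = keras.layers.Dense(14, activation='relu')(h)
out = keras.layers.Dense(feature_dim, activation=None)(h)
ae = keras.Model(inp, out)
ae.compile(optimizer='adam', loss='mse')

x_normal = np.random.randn(1000, feature_dim).astype('float32')   # stand-in for non-fraud rows
ae.fit(x_normal, x_normal, epochs=5, batch_size=32, verbose=0)     # learn to reconstruct 'normal'

x_test = np.random.randn(10, feature_dim).astype('float32')
err = np.mean((ae.predict(x_test) - x_test) ** 2, axis=1)          # reconstruction error per row
flagged = err > np.percentile(err, 95)                              # high error => potential fraud
```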

Next PyTorch&DL ideas :

Next big conference ideas :

  • WaveRNN : Explore the large, but sparse model ideas

    • Not clear what a toy problem should look like
      • Would be great to do something with attention, or RNN
      • One issue is how to keep track of the derivatives
        • either do masking on a large matrix; or explicitly construct everything on-the-fly
    • Not clear whether sparseness can be 'discovered from below' or requires a large model, and discovered redundancy
    • Interesting papers :
  • QandA

  • IQ test (DeepMind)

    • This seems like it's written in an overhyped style
      • Demonstrate that it can be done in a dumb way?
    • Alternatively, try to beat their scores using some kind of learning over a 'meta' latent space
  • Latent space predictions (DeepMind)

  • Pay Attention to the Training Data (now in own repo/folder)

    • Classification as Q&A against training examples
      • Positive/Negative Contrast learning to get retrieval ~= attention
      • Problem is : training set is learned pretty efficiently
      • So there's not enough variation from which to build a good meta-learner
  • Learning with Few Labels

    • UMAP to aid understanding of data
      • Random projection vs CNN projection
      • Train on even numbers, predict odd numbers
        • Tests transfer learning ability / applicability of UMAP, say
      • UMAP to provide additional training objectives to CNN?
        • Not sure of overall scheme
      • Overall, though, seems like a lot of steps to show something interesting
    • Has advantage of being of interest to DARPA too
    • Fits with MetaLearning Workshop (probably)
  • Correlation-Norm

    • BatchNorm fixes neural outputs (pre-activation) to be N(0,1)

    • Usually done by tracking mini-batch (mean, stddev), and learning (mu, sigma) parameters to adjust

      • Could also adjust (W_(all), b) to have same effect
        • That would be local learning / adjustment only
        • Adjust b for direct mean shift, or all of W via local gradient calculation
    • Idea : VAEs are learning IID N(0,1) : HOWEVER : correlations should exist between same classes

      • Note that IID of features just means that they are independent for a given example
      • But this doesn't address intra-batch correlations
    • Locally encourage distribution of corr(yi,yj) to be bimodal : Either ~0 or ~1

      • Sometimes a pair of examples will be in the same class, other times not
      • If the correlation between yi,yj is 'sufficiently' strong, then strengthen it, otherwise suppress it
        • Possible to do this locally, but missing overall signal for which option to choose
      • Better idea : correlation between two samples (at same layer) should be 1 or (mostly) 0
        • Possible to do this locally too (won't need to use class labels to do this, hopefully)
        • Maybe have some hyperparameter that suggests the proportion of examples that should be correlation-1 ?
          • This could be a moving similarity threshold that gets adjusted to impose some pct of correlation-1 samples
          • Could also have an effect to decorrelate low-similarity samples, and leave high-similarity ones unadjusted
      • Can do this on Dense Layers :
        • Do Batch mean to find current means
        • Do 8x8 blocks across batch of (vec-mean)*(vec-mean).T to get l2 and cov stats within each vector in batch
          • These matrices should be ~I, since we want to make the elements independent if possible
          • The independence is normally given to us by enforcing a small bottleneck, causing the representation elements to fight to be differentiated
          • Local loss is mean^2 + (l^2-1)^2 + cov^2
        • Across the Batch, work out vector cosine similarity (in groups of 'batch_sample', which could be whole batch)
          • There is some hurdle similarity (probably related to batch_sample size, but also vector length)
          • Either keep batch_sample small, or just use a small number of (random?) offsets between samples in batch to bound computation required
          • Local loss is sum_across_pairs( if cosine>hurdle : (1-cosine)^2, else cosine^2 )  (see the sketch at the end of this section)
      • Can do this (less?) on Conv2D Layers (channels correspond to vector-elements in Dense, but... ):
        • Batch mean produces a mean->0 for each channel
        • l2 and cov seem to be same idea (channel image vs itself, and channel images vs each other, respectively)
        • Across the Batch, this is still the same principle, except that 'hurdle' will probably tune to a really low value
          • Probably possible to work out idealised hurdle values for a given proportion of cosine~1 results
      • Should have a .detach() option after each layer to enforce locality (or not)
      • Check performance of latent layer
        • This will be (implicitly) trained with network structure as a ~prior
        • Adjust existing MNIST.pytorch with UMAP to have a batch_size and next_to_final_hidden_layer_size of multiples of 8
      • Puzzle : Won't this have been explored during the old autoencoder days?
    • Related papers :

      • https://arxiv.org/abs/1809.07023 : Removing the Feature Correlation Effect of Multiplicative Noise (Calgary, NIPS 2018)
        • Talks about applying changes to layers/activations/weights etc ... Also, even closer :
        • https://arxiv.org/abs/1511.06068 : Reducing Overfitting in Deep Networks by Decorrelating Representations
          • substantial computational overhead (according to Calgary)
        • https://arxiv.org/abs/1611.01967 : Regularizing CNNs with Locally Constrained Decorrelations
          • yield marginal improvements (according to Calgary)
          • MNIST as a proof of concept, secondly we regularize wide residual networks on CIFAR-10, CIFAR-100, and SVHN
        • TODO : Yoshua Bengio and James S Bergstra. Slow, decorrelated features for pretraining complex cell-like networks. In NIPS, pp. 99–107, 2009
      • CorrelationExplanation : http://github.com/gregversteeg/CorEx
      • Also : _READ/Papers/_LIVE/Backprop/*.pdf ...
        • Bengio-2015_Biologically-Plausible_1502.04156v1.pdf
        • Bengio-2015_EnergyInferenceApproximatesBackProp_1510.02777v2.pdf (inconclusive, though)
    • Potentially related papers (though probably not):

  • Learn VAE from trained teacher

    • No need to train image-sized Decoder
    • Fits with which Workshop?
      • no longer relevant
  • Reasoning over Fact DB and Knowledge Axioms?

    • V. interesting
    • Potentially v. time consuming to 'get into it'
    • Not clear which workshop would be interested
      • no longer relevant
  • NOPE : Demonstrations@NIPS (due 16-Sept-2018) : https://nips.cc/Demonstrations/demonstrationapplication

    • Dials for latent space changes for voices
      • How about dial positions that reflect prominent conference speakers?
      • no longer relevant
  • Learning to match sentences with graphs

    • Better groundtruth annotator for "Scene Graph Parsing as Dependency Parsing"
      • Current Oracle gets only 70% F1 on the actual data
      • Current best dependency-tree sentence parser idea -> graph gets 50%
      • So bigger win may be in refining the annotator
      • Seems to have merit... => ViGIL workshop paper
  • InfoMax / MinimumDescriptionLength -> Maximise information density
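
A minimal PyTorch sketch of the two local losses from the Correlation-Norm item above; the 'hurdle' value and the batch/vector sizes are illustrative, and a real version would tune the hurdle as described:

```python
import torch
import torch.nn.functional as F

def correlation_norm_losses(y, hurdle=0.5):
    """y : (batch, dim) pre-activation outputs of a Dense layer."""
    mean = y.mean(dim=0)                              # per-feature batch mean
    yc = y - mean                                     # centred activations
    cov = (yc.t() @ yc) / y.shape[0]                  # (dim, dim) feature covariance across the batch
    eye = torch.eye(y.shape[1], device=y.device)
    # Push per-feature means to 0 and the covariance towards I (unit l2, zero cross-covariance)
    loss_whiten = (mean ** 2).mean() + ((cov - eye) ** 2).mean()

    # Pairwise cosine similarity between samples in the batch : encourage it to be ~1 or ~0
    yn = F.normalize(yc, dim=1)
    cos = yn @ yn.t()                                 # (batch, batch)
    mask = ~torch.eye(y.shape[0], dtype=torch.bool, device=y.device)
    cos = cos[mask]                                   # drop self-similarities on the diagonal
    loss_bimodal = torch.where(cos > hurdle, (1 - cos) ** 2, cos ** 2).mean()

    return loss_whiten, loss_bimodal

y = torch.randn(32, 64, requires_grad=True)           # stand-in Dense-layer activations
lw, lb = correlation_norm_losses(y)
(lw + lb).backward()                                   # both losses are differentiable and local to the layer
```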