python3 encodingSeq_train.py train.data 10   (10 = flanking length)
python3 encodingSeq_test.py test.data 10     (10 = flanking length)
The objective of this task is to predict whether the CTCF protein binds to a given DNA sequence. I approached it with a convolutional neural network (CNN), taking the MNIST hand-written digit classification tutorial as a starting point.
- Data source: ChIP-seq data from https://www.encodeproject.org/experiments/ENCSR000AQU/
- Each column is separated by a single space.
- Training data
sample size: 77531
data format example (abbrev.): >chr3:13238050-13238150 CTGGCTGTCA...AGAAGAACAC 1
- Testing data
sample size: 19383
data format example (abbrev.): CAGTTGGCCT...CACAAGTAGA
- Testing data with label
sample size: 19383 (9709 positive, 9674 negative)
data format example (abbrev.): >chr20:42901189-42901289 CAGTTGGCCT...CACAAGTAGA 1
file name | chromosome number | loci | sequence length | label |
---|---|---|---|---|
train.data | chr # | loci | 101 | 0 = negative, 1 = positive |
test.data | N/A | N/A | 101 | N/A |
test_ans.data.txt | chr # | loci | 101 | 0 = negative, 1 = positive |
I use encodingSeq_train.py and encodingSeq_test.py to convert train.data and test.data into pickle files, turning each sequence into a one-hot encoding. In addition, the flanking length of the sequence can be specified by the user.
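A minimal sketch of that encoding step, assuming the space-separated record layout described above; the helper names (`parse_line`, `one_hot_encode`) and my interpretation of the flanking length (bases kept on each side of the window center) are illustrative assumptions, not the actual code in encodingSeq_train.py:

```python
import numpy as np

BASES = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def parse_line(line):
    """Split one training record: '>chr3:13238050-13238150 CTGG...ACAC 1'."""
    coord, seq, label = line.strip().split(' ')
    return coord, seq.upper(), int(label)

def one_hot_encode(seq, flank=10):
    """Encode the window center plus `flank` bases on each side as a
    (length, 4) one-hot matrix; unknown bases (e.g. N) stay all-zero."""
    center = len(seq) // 2
    window = seq[center - flank : center + flank + 1]
    mat = np.zeros((len(window), 4), dtype=np.float32)
    for i, base in enumerate(window):
        if base in BASES:
            mat[i, BASES[base]] = 1.0
    return mat

# Example on a synthetic 101-bp record.
_, seq, label = parse_line('>chr3:13238050-13238150 ' + 'ACGT' * 25 + 'A' + ' 1')
print(one_hot_encode(seq, flank=10).shape, label)   # (21, 4) 1
```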
# Training graph: two convolutional layers and three fully connected layers,
# with dropout (keep probability 0.75) applied after each hidden layer.
def model_train(data):
    conv = tf.nn.conv2d(data, layer1_weights, [1, stride_1, stride_1, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer1_biases)
    drop = tf.nn.dropout(hidden, 0.75)
    conv = tf.nn.conv2d(drop, layer2_weights, [1, stride_2, stride_2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer2_biases)
    drop = tf.nn.dropout(hidden, 0.75)
    # Flatten the convolutional feature maps for the fully connected layers.
    shape = drop.get_shape().as_list()
    reshape = tf.reshape(drop, [shape[0], shape[1] * shape[2] * shape[3]])
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    drop = tf.nn.dropout(hidden, 0.75)
    hidden = tf.nn.relu(tf.matmul(drop, layer4_weights) + layer4_biases)
    return tf.matmul(hidden, layer5_weights) + layer5_biases
# Inference graph: same architecture as model_train but without dropout,
# used for the validation and test predictions.
def model(data):
    conv = tf.nn.conv2d(data, layer1_weights, [1, stride_1, stride_1, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer1_biases)
    conv = tf.nn.conv2d(hidden, layer2_weights, [1, stride_2, stride_2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer2_biases)
    shape = hidden.get_shape().as_list()
    reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    hidden = tf.nn.relu(tf.matmul(hidden, layer4_weights) + layer4_biases)
    return tf.matmul(hidden, layer5_weights) + layer5_biases
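Both functions refer to weight and bias variables defined elsewhere in the script. As a rough sketch of how they could be declared, assuming the input is shaped [batch, window length, 4, 1]; the patch size, depths, and hidden sizes below are placeholders, since the real values are not given in this README:

```python
import tensorflow as tf  # TensorFlow 1.x API, matching the graph code above

# Illustrative shapes only: these numbers are NOT the actual configuration.
patch_size, num_channels, depth1, depth2 = 4, 1, 32, 64
seq_len, num_hidden, num_labels = 21, 128, 2
stride_1, stride_2 = 1, 1

layer1_weights = tf.Variable(tf.truncated_normal(
    [patch_size, 4, num_channels, depth1], stddev=0.1))
layer1_biases = tf.Variable(tf.zeros([depth1]))
layer2_weights = tf.Variable(tf.truncated_normal(
    [patch_size, 4, depth1, depth2], stddev=0.1))
layer2_biases = tf.Variable(tf.constant(0.1, shape=[depth2]))
# Flattened size after two 'SAME' convolutions with stride 1.
flat_size = seq_len * 4 * depth2
layer3_weights = tf.Variable(tf.truncated_normal([flat_size, num_hidden], stddev=0.1))
layer3_biases = tf.Variable(tf.constant(0.1, shape=[num_hidden]))
layer4_weights = tf.Variable(tf.truncated_normal([num_hidden, num_hidden], stddev=0.1))
layer4_biases = tf.Variable(tf.constant(0.1, shape=[num_hidden]))
layer5_weights = tf.Variable(tf.truncated_normal([num_hidden, num_labels], stddev=0.1))
layer5_biases = tf.Variable(tf.constant(0.1, shape=[num_labels]))
```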
- batch size = 290 (train_dataset.shape[0] // 200)
- training steps = 15000
- learning rate = 0.25 (starting rate) with exponential decay after 5000 steps, decay rate = 0.96
- optimizer: GradientDescentOptimizer
- regularization is applied to reduce overfitting (see the sketch after this list)
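A minimal sketch of how these settings could be wired together in TensorFlow 1.x. The placeholders, the L2 penalty on the fully connected weights (beta = 1e-3), and the reading of "decay after 5000 steps" as a staircase schedule every 5000 steps are my own assumptions; the README only lists the values above:

```python
import tensorflow as tf  # TensorFlow 1.x API, matching the graph code above

# Placeholder shapes are assumptions, matching the earlier sketch.
batch_size, seq_len, num_labels = 290, 21, 2
tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, seq_len, 4, 1))
tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
global_step = tf.Variable(0, trainable=False)

logits = model_train(tf_train_dataset)   # training graph defined above
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))
# Assumed L2 regularization term; the exact form is not stated in the README.
loss += 1e-3 * (tf.nn.l2_loss(layer3_weights) +
                tf.nn.l2_loss(layer4_weights) +
                tf.nn.l2_loss(layer5_weights))

# Start at 0.25 and decay by 0.96 every 5000 steps.
learning_rate = tf.train.exponential_decay(
    0.25, global_step, decay_steps=5000, decay_rate=0.96, staircase=True)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss, global_step=global_step)
```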
- Training process
Steps | minibatch loss | minibatch accuracy | validation accuracy |
---|---|---|---|
0 | 1.036431 | 52.759 % | 50.663 % |
500 | 0.683988 | 62.069 % | 60.486 % |
1000 | 0.624465 | 65.862 % | 71.129 % |
1500 | 0.633703 | 68.276 % | 78.105 % |
2000 | 0.470493 | 78.621 % | 84.543 % |
2500 | 0.467052 | 81.724 % | 80.426 % |
3000 | 0.340815 | 85.517 % | 87.535 % |
3500 | 0.343168 | 87.241 % | 88.155 % |
4000 | 0.281665 | 90.000 % | 88.222 % |
4500 | 0.316459 | 89.310 % | 89.223 % |
5000 | 0.306602 | 88.621 % | 89.651 % |
5500 | 0.323350 | 85.517 % | 89.656 % |
6000 | 0.370277 | 84.828 % | 88.278 % |
6500 | 0.326758 | 86.897 % | 89.764 % |
7000 | 0.319943 | 87.586 % | 90.069 % |
7500 | 0.260177 | 90.345 % | 90.058 % |
8000 | 0.330911 | 87.931 % | 89.795 % |
8500 | 0.266430 | 91.034 % | 90.378 % |
9000 | 0.265763 | 90.000 % | 90.528 % |
9500 | 0.285275 | 91.724 % | 90.358 % |
10000 | 0.260739 | 90.690 % | 90.487 % |
10500 | 0.273746 | 89.655 % | 89.919 % |
11000 | 0.281265 | 91.034 % | 90.590 % |
11500 | 0.278727 | 91.034 % | 90.512 % |
12000 | 0.310407 | 88.621 % | 90.595 % |
12500 | 0.282559 | 87.586 % | 90.523 % |
13000 | 0.295398 | 89.310 % | 90.812 % |
13500 | 0.254486 | 90.000 % | 90.636 % |
14000 | 0.266505 | 89.655 % | 90.672 % |
14500 | 0.257149 | 90.690 % | 90.579 % |
15000 | 0.259550 | 88.966 % | 90.559 % |
- Test accuracy: 90.517 %
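The accuracies above are the fraction of sequences whose predicted class matches the label. A small sketch of how such a metric can be computed, assuming softmax predictions and one-hot labels (the helper name is illustrative):

```python
import numpy as np

def accuracy(predictions, labels):
    """Percentage of rows where the arg-max of the prediction
    matches the arg-max of the one-hot label."""
    correct = np.sum(np.argmax(predictions, axis=1) == np.argmax(labels, axis=1))
    return 100.0 * correct / predictions.shape[0]

# Example: 3 of 4 predictions correct -> 75.0
preds = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
labels = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])
print(accuracy(preds, labels))
```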
The model used in this task is not particularly deep, so a network with more layers might give better results. An RNN could also be another way to build or improve the model.
- MNIST CNN tutorial: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/udacity
- DeepBind: http://www.nature.com/nbt/journal/v33/n8/full/nbt.3300.html (supplementary: http://www.nature.com/nbt/journal/v33/n8/extref/nbt.3300-S2.pdf)