A decision tree is a graph that uses a branching method to illustrate every possible outcome of a decision. Programmatically, decision trees can assign monetary, time, or other values to possible outcomes so that decisions can be automated.
The following is an example decision tree.
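To make the branching idea concrete, here is a minimal sketch of a decision tree for spam filtering written as nested rules. The features (`contains_link`, `num_recipients`) and the thresholds are invented purely for illustration; they are not from the actual trained model.

```python
# A toy decision tree for spam filtering, written as nested attribute tests.
# Each `if` is one internal node; each `return` is a leaf with a class label.

def classify(email):
    """Walk the tree from the root; each branch tests one attribute."""
    if email["contains_link"]:
        if email["num_recipients"] > 20:
            return "spam"   # mass mail containing a link
        return "ham"        # personal mail that happens to contain a link
    return "ham"            # no link at all

examples = [
    {"contains_link": True,  "num_recipients": 50},   # -> spam
    {"contains_link": True,  "num_recipients": 2},    # -> ham
    {"contains_link": False, "num_recipients": 100},  # -> ham
]
print([classify(e) for e in examples])  # ['spam', 'ham', 'ham']
```

A real learner builds this structure automatically by choosing, at each node, the attribute test that best separates the classes.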
We also compare our results with those obtained from Naive Bayes, a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

Downloading and installing Weka

The dataset we received consisted of 5,180 text files, as shown in the following screenshots. (Screenshot: three folders inside one folder) (Screenshot: inside one folder) (Screenshot: a single file)

As the screenshots show, the data is not normalized or well organized, so we cannot give it to WEKA as input. We need all of the data in a single ARFF file to train WEKA. For this purpose, we use the following command in WEKA's command line interface:

java weka.core.converters.TextDirectoryLoader -dir F:/Spam_mails > F:/text_example.arff

The output ARFF file is shown below. The data is now normalized, so we pass it to WEKA for further pre-processing. Next we select and apply our classifier. Since we use every frequent word as a feature, we break each string into a word vector.

First we train on the full training set by selecting the "Use training set" option, which gives 98% accuracy. We then evaluate the model at different split percentages: 93% accuracy with a 66% split, 94% with an 80% split, and 89% with a 90% split.
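The word-vector plus Naive Bayes pipeline above can be sketched in miniature. The following toy multinomial Naive Bayes mirrors the idea of WEKA's string-to-word-vector pre-processing followed by Naive Bayes classification; the four training messages are invented for illustration and are far smaller than the real dataset.

```python
import math
from collections import Counter

# Invented miniature training set: (text, label) pairs.
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

# "Word vector" step: count words per class, and count class frequencies.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Pick the class maximizing log P(class) + sum of log P(word | class)."""
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # Log prior for this class.
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for word in text.split():
            # Laplace (add-one) smoothing so unseen words do not zero the score.
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("free money"))    # spam
print(predict("team meeting"))  # ham
```

The independence assumption appears in the sum over words: each word contributes to the score on its own, with no interaction terms between words.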
Next, we decided to test our model, so we built a test dataset from our own email accounts, as shown in the following screenshot. We then gave this test dataset to our trained model and obtained the predictions shown in the screenshot above. We repeated the same procedure with Naive Bayes, shown in the following snapshots; it gives different results with good accuracy.

Email spamming is a common technique, but it can do heavy damage to a user's privacy. Many anti-spam tools are currently available to fight spam mail, but text classification is one of the best ways to detect email spam. Its accuracy can be improved with a much larger dataset and by restricting our algorithms to ignore normal dictionary words and focus on frequently used spam words.
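Evaluating a trained model on a hand-made test set, as described above, can be sketched as follows. The classifier here is a trivial keyword rule and the four test emails are invented, both standing in for the real trained model and the real test dataset.

```python
# Sketch of scoring a classifier on a small hand-labeled test set.
# SPAM_WORDS and the messages below are invented for illustration only.

SPAM_WORDS = {"win", "free", "offer", "money"}

def rule_classify(text):
    """Stand-in classifier: any spam keyword present -> spam."""
    return "spam" if SPAM_WORDS & set(text.lower().split()) else "ham"

# A tiny "own email" test set with known labels.
test_set = [
    ("you win free money", "spam"),
    ("limited offer act now", "spam"),
    ("minutes from today's meeting", "ham"),
    ("are we still on for lunch", "ham"),
]

predictions = [rule_classify(text) for text, _ in test_set]
accuracy = sum(pred == label
               for pred, (_, label) in zip(predictions, test_set)) / len(test_set)
print(predictions)
print(f"accuracy = {accuracy:.2f}")
```

The same loop works for any classifier with a predict function, which is what makes a held-out, self-collected test set a quick sanity check on a trained model.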