Character-based, word-based, and vocabulary-richness features (a sketch computing several of these follows the list):
- Total number of characters (C)
- Ratio of alphabetic characters to C
- Ratio of digit characters to C
- Ratio of whitespace characters to C
- Frequency of each alphanumeric character / C (36 characters: A-Z, 0-9)
- Frequency of special characters (10 characters: *, _, +, =, %, $, @, -, \, /)
- Total number of words (M)
- Ratio of short words (two letters or fewer) to M
- Ratio of characters in words to C
- Average word length
- Average sentence length in characters
- Average sentence length in words
- Word-length frequency distribution: ratio of words of length n to M, for n from 1 to 15
- Type-token ratio: number of unique words / M
- Hapax legomena: frequency of once-occurring words
- Hapax dislegomena: frequency of twice-occurring words
- Yule’s K measure
- Simpson’s D measure
- Sichel’s S measure
- Brunet’s W measure
- Honoré's R measure
- Frequency of punctuation characters (18 characters: . , ; ? ! : ( ) - " « » < > [ ] { })
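To make these concrete, here is a minimal Python sketch computing several of the features above. The regex tokenizer and the exact special-character set are my assumptions, not part of the list; Yule's K and Honoré's R follow their standard definitions (N = total words, V = unique words, V_i = words occurring exactly i times).

```python
import re
import math
from collections import Counter

# Assumed special-character set; adapt to whatever you count in your corpus.
SPECIAL_CHARS = set("*_+=%$@-\\/")

def extract_features(text):
    C = len(text)
    # Crude tokenizer: runs of letters/apostrophes; swap in your own.
    words = re.findall(r"[A-Za-z']+", text.lower())
    M = len(words)
    counts = Counter(words)
    V = len(counts)                                    # unique words (types)
    V1 = sum(1 for n in counts.values() if n == 1)     # hapax legomena
    V2 = sum(1 for n in counts.values() if n == 2)     # hapax dislegomena

    features = {
        "C": C,
        "alpha_ratio": sum(ch.isalpha() for ch in text) / C if C else 0.0,
        "digit_ratio": sum(ch.isdigit() for ch in text) / C if C else 0.0,
        "space_ratio": sum(ch.isspace() for ch in text) / C if C else 0.0,
        "special_ratio": sum(ch in SPECIAL_CHARS for ch in text) / C if C else 0.0,
        "M": M,
        "short_word_ratio": sum(len(w) <= 2 for w in words) / M if M else 0.0,
        "avg_word_len": sum(map(len, words)) / M if M else 0.0,
        "type_token_ratio": V / M if M else 0.0,
        "hapax_legomena": V1 / M if M else 0.0,
        "hapax_dislegomena": V2 / M if M else 0.0,
    }
    # Yule's K: 10^4 * (sum_i i^2 * V_i - N) / N^2 (length-robust repeat rate).
    if M:
        freq_of_freq = Counter(counts.values())        # maps i -> V_i
        s2 = sum(i * i * Vi for i, Vi in freq_of_freq.items())
        features["yule_k"] = 1e4 * (s2 - M) / (M * M)
    # Honoré's R: 100 * log(N) / (1 - V_1 / V); undefined if every word is a hapax.
    if M and V and V1 < V:
        features["honore_r"] = 100 * math.log(M) / (1 - V1 / V)
    return features

print(extract_features("Free money!! Click here, click now to claim your free prize."))
```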
You could also add features based on formatting: the colors, fonts, and sizes used.
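For HTML mail, those formatting features have to be parsed out of the markup. A rough sketch using BeautifulSoup; the choice of tags and attributes here is an assumption, since real messages often carry formatting in CSS instead:

```python
from bs4 import BeautifulSoup

def formatting_features(html):
    soup = BeautifulSoup(html, "html.parser")
    font_tags = soup.find_all("font")
    # Distinct colors declared via <font color="..."> attributes.
    colors = {t.get("color") for t in font_tags if t.get("color")}
    styled = soup.find_all(style=True)  # any tag carrying an inline style
    return {
        "n_font_tags": len(font_tags),
        "n_distinct_font_colors": len(colors),
        "n_inline_styles": len(styled),
    }
```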
The inputs would need to be normalized, with the normalization statistics taken from your current pre-classified corpus.
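For instance, a z-score normalization, with the per-feature mean and standard deviation estimated from the corpus (and, once you split as below, from the training half only, so nothing leaks into the test set):

```python
import numpy as np

def fit_zscore(X):
    # Per-feature mean and standard deviation from the reference corpus.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # constant features: avoid division by zero
    return mu, sigma

def apply_zscore(X, mu, sigma):
    return (X - mu) / sigma
```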
I'd split the corpus into two groups, use one for training and the other for testing, never mixing them, at perhaps a 50/50 train/test ratio with similar spam/non-spam ratios in each group (see the sketch below).
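With scikit-learn this is one call; `stratify=y` keeps the spam/non-spam proportions roughly equal in both halves. The X and y below are placeholder data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 200 messages, 10 features each, 60 of them spam.
X = np.random.rand(200, 10)
y = np.array([1] * 60 + [0] * 140)

# test_size=0.5 gives the 50/50 split; stratify=y preserves the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42
)
```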