Let's say you have access to an email account with a history of received emails from the last few years (~10k emails), classified into 2 groups:

  • genuine email
  • spam

How would you approach the task of creating a neural network solution that could be used for spam detection - basically classifying any email either as spam or not spam?

Let's assume that the email fetching is already in place and we need to focus on the classification part only.

The main points which I would hope to get answered would be:

  1. Which parameters to choose as the input for the NN, and why?
  2. What structure of the NN would most likely work best for such a task?

Also, any resource recommendations or existing implementations (preferably in C#) are more than welcome.

Thank you

EDIT

  • I am set on using neural networks, because the main aim of the project is to test how the NN approach would work for spam detection
  • Also, it is a "toy problem", meant simply to explore the subject of neural networks and spam

1 Answer


As inputs, you could use character-based, word-based, and vocabulary-richness features:

  1. Total number of characters (C)
  2. Number of alphabetic characters / C (ratio of alphabetic characters)
  3. Number of digit characters / C
  4. Number of whitespace characters / C
  5. Frequency of each letter or digit / C (36 characters: A-Z, 0-9)
  6. Frequency of special characters (10 characters: *, _, +, =, %, $, @, ـ, \, /)
  7. Total number of words (M)
  8. Number of short words / M (words of two letters or fewer)
  9. Number of characters in words / C
  10. Average word length
  11. Average sentence length in characters
  12. Average sentence length in words
  13. Word length frequency distribution / M (ratio of words of length n, for n between 1 and 15)
  14. Type-token ratio (number of unique words / M)
  15. Hapax legomena (frequency of once-occurring words)
  16. Hapax dislegomena (frequency of twice-occurring words)
  17. Yule’s K measure
  18. Simpson’s D measure
  19. Sichel’s S measure
  20. Brunet’s W measure
  21. Honore’s R measure
  22. Frequency of punctuation (18 punctuation characters: . ، ; ? ! : ( ) – “ « » < > [ ] { })
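To make a few of these features concrete, here is a minimal sketch in Python (used only for brevity; the same logic ports directly to C#). The function name `extract_features`, the simple word regex, and the choice to divide the hapax counts by M are my own assumptions, not part of any standard. Yule's K is written in its commonly cited form, K = 10^4 * (sum_i i^2 * V_i - M) / M^2, where V_i is the number of distinct words that occur exactly i times.

```python
import re
from collections import Counter


def extract_features(text):
    """Return a dict with a small subset of the features listed above.

    Hypothetical helper: the tokenisation and the exact normalisations
    are illustrative assumptions, not a fixed definition.
    """
    c = len(text)                                        # feature 1: total characters (C)
    words = re.findall(r"[A-Za-z0-9']+", text.lower())   # naive word tokeniser (assumption)
    m = len(words)                                       # feature 7: total words (M)
    if c == 0 or m == 0:
        return None  # empty message: the caller decides how to handle it

    counts = Counter(words)              # word -> number of occurrences
    spectrum = Counter(counts.values())  # i -> number of distinct words occurring i times

    # Yule's K (common form): 1e4 * (sum_i i^2 * V_i - M) / M^2
    yule_k = 1e4 * (sum(i * i * v for i, v in spectrum.items()) - m) / (m * m)

    return {
        "alpha_ratio": sum(ch.isalpha() for ch in text) / c,      # feature 2
        "digit_ratio": sum(ch.isdigit() for ch in text) / c,      # feature 3
        "space_ratio": sum(ch.isspace() for ch in text) / c,      # feature 4
        "short_word_ratio": sum(len(w) <= 2 for w in words) / m,  # feature 8
        "avg_word_length": sum(len(w) for w in words) / m,        # feature 10
        "type_token_ratio": len(counts) / m,                      # feature 14
        "hapax_legomena": spectrum[1] / m,                        # feature 15 (divided by M: assumption)
        "hapax_dislegomena": spectrum[2] / m,                     # feature 16 (divided by M: assumption)
        "yule_k": yule_k,                                         # feature 17
    }
```

Each email then becomes one fixed-length vector of numbers, which is the form a feed-forward network expects as input.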

You could also add some more features based on the formatting used: colors, fonts, sizes, and so on.

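The inputs would need to be normalized using statistics computed from your current pre-classified corpus, so that features on very different scales (for example, total characters versus a ratio between 0 and 1) become comparable.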
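One possible reading of "normalized" is z-score scaling: subtract each feature's mean and divide by its standard deviation. The sketch below assumes that interpretation. The helper names `fit_scaler` and `transform` are hypothetical, and computing the statistics on the training rows only (then reusing them for any other data) is my own suggestion rather than something stated above.

```python
def fit_scaler(rows):
    """Compute the per-feature mean and standard deviation of the given rows."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = []
    for j in range(dims):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        # A constant feature has zero spread; fall back to 1.0 to avoid dividing by zero.
        stds.append(var ** 0.5 or 1.0)
    return means, stds


def transform(rows, means, stds):
    """Apply previously fitted statistics to any set of rows."""
    return [[(x - mu) / sd for x, mu, sd in zip(r, means, stds)] for r in rows]


# Usage: fit on labelled training vectors, then reuse the same statistics elsewhere.
train = [[1.0, 10.0], [3.0, 10.0]]
means, stds = fit_scaler(train)
scaled_train = transform(train, means, stds)
scaled_new = transform([[5.0, 12.0]], means, stds)
```

Min-max scaling to the range 0 to 1 would be an equally reasonable choice for a toy problem.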

I'd split the corpus into two groups, a training group and a testing group, and never mix them. A 50/50 train/test ratio could work, with a similar spam/non-spam ratio in each group.
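A split that keeps the spam/non-spam ratio similar in both groups can be done by shuffling each class separately and cutting each one at the same fraction. This is only a sketch: the function name `stratified_split`, the fixed seed, and the label strings are assumptions for illustration.

```python
import random


def stratified_split(labels, train_fraction=0.5, seed=0):
    """Return (train_idx, test_idx) with similar class ratios in each part."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                          # shuffle within one class
        cut = round(len(idx) * train_fraction)    # same fraction for every class
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)


# Usage with hypothetical labels: 4 spam and 6 genuine emails.
labels = ["spam"] * 4 + ["genuine"] * 6
train_idx, test_idx = stratified_split(labels)
```

The two index lists are disjoint, so an email used for training never appears in the test group.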
