0 votes
in Machine Learning by (19k points)

I am using Word2Vec on a dataset of roughly 11,000,000 tokens to compute word similarity (as part of synonym extraction for a downstream task), but I don't have a good sense of how many dimensions I should use. Does anyone have a good heuristic for the range of dimensions to consider based on the number of tokens/sentences?

1 Answer

0 votes
by (33.1k points)

A typical range is 100 to 300 dimensions. I would suggest at least 50 as a floor for acceptable accuracy; with fewer dimensions you start to lose the properties of high-dimensional spaces that make the embeddings useful. If training time is not a big concern for your application, I would go with 200 dimensions, which usually gives good features. The best accuracy tends to come at around 300 dimensions. Beyond 300, training becomes considerably slower.
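To get a feel for how cost grows with dimensionality, here is a rough sketch that counts the weights Word2Vec has to train at each candidate size. It assumes the standard layout of one input and one output embedding matrix, each of shape vocabulary size by dimensions, stored as float32. The vocabulary size of 100,000 is a made-up placeholder (the question only gives the token count), and the helper names are hypothetical, so treat the figures as illustrative only.

```python
def word2vec_param_count(vocab_size: int, dim: int) -> int:
    """Weights in the input and output embedding matrices (vocab_size x dim each)."""
    return 2 * vocab_size * dim


def float32_megabytes(param_count: int) -> float:
    """Approximate size in MB, at 4 bytes per float32 weight."""
    return param_count * 4 / 1_000_000


# Hypothetical vocabulary size: an assumption, not a figure from the question.
ASSUMED_VOCAB_SIZE = 100_000

# Candidate dimensions taken from the range discussed above.
for dim in (50, 100, 200, 300):
    params = word2vec_param_count(ASSUMED_VOCAB_SIZE, dim)
    print(f"{dim:>3}D: {params:>11,} weights, ~{float32_megabytes(params):.0f} MB")
```

The weight count (and with it the work per training update) grows linearly with the number of dimensions, so going from 100 to 300 roughly triples it. If you are using gensim 4.x, the dimensionality is set with the `vector_size` parameter of `Word2Vec`; a practical approach is to train at two or three sizes in the 100 to 300 range and compare them on your own synonym task.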

