I am thinking of training word2vec on large-scale data, more than 10 TB in size, from a web crawl dump.
I personally trained the C implementation on the GoogleNews-2012 dump (1.5 GB) on my iMac; it took about 3 hours to train and generate the vectors (I was impressed with the speed). I have not tried the Python implementation, though :( I read somewhere that generating 300-dimensional vectors on a Wikipedia dump (11 GB) takes about 9 days.
How can I speed up word2vec? Do I need to use distributed models, or what type of hardware do I need to finish within 2-3 days? I have an iMac with 8 GB of RAM.
Which one is faster: the gensim Python implementation or the C implementation?
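For context, the gensim route I would be comparing looks roughly like the sketch below (this assumes gensim's Word2Vec API; `corpus.txt` is just a placeholder for a whitespace-tokenized, one-sentence-per-line file, and the parameter name is `vector_size` in gensim 4.x but `size` in older versions):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream the corpus from disk: one sentence per line, tokens split on whitespace.
sentences = LineSentence("corpus.txt")  # placeholder path

model = Word2Vec(
    sentences,
    vector_size=300,  # 300-dimensional vectors (use size=300 on gensim < 4.0)
    window=5,
    min_count=5,
    workers=8,        # parallel worker threads: the main speed lever on a single machine
    sg=1,             # skip-gram; sg=0 would select CBOW
)

# Save in the same binary format the C tool produces.
model.wv.save_word2vec_format("vectors.bin", binary=True)
```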
I see that the word2vec implementation does not support GPU training.