+1 vote
in Machine Learning by (4.2k points)

I am thinking of training word2vec on large-scale data (more than 10 TB) from a web crawl dump.

I personally trained the C implementation on the GoogleNews-2012 dump (1.5 GB) on my iMac; it took about 3 hours to train and generate vectors (I was impressed with the speed). I did not try the Python implementation though :( I read somewhere that generating vectors of length 300 on the wiki dump (11 GB) takes about 9 days.

  1. How can I speed up word2vec? Do I need to use distributed models, or what kind of hardware do I need to do it within 2-3 days? I have an iMac with 8 GB of RAM.

  2. Which one is faster: the gensim Python implementation or the C implementation?

I see that the word2vec implementation does not support GPU training.

1 Answer

+1 vote
by (6.8k points)

There are a number of ways to train Word2Vec models at scale. The candidate solutions are distributed (and/or multi-threaded) or GPU-based. This is not an exhaustive list, but hopefully it gives you some ideas on how to proceed.

Distributed / Multi-threading options:

  • Gensim uses Cython where it matters, and is equal to, or not much slower than, the C implementation. Gensim's multi-threading works well, and using a machine with ample memory and a large number of cores significantly decreases vector-generation time. You may want to investigate Amazon EC2 16- or 32-core instances.
  • DeepDist can utilize gensim and Spark to distribute gensim workloads across a cluster. DeepDist also has some clever SGD optimizations that synchronize gradients across nodes. If you use multi-core machines as nodes, you can take advantage of both clustering and multi-threading.
GPU options:

  • A number of Word2Vec GPU implementations exist. Given the massive dataset size and limited GPU memory, you may have to consider a clustering strategy.
  • BIDMach is apparently very fast (documentation is, however, lacking, and admittedly I've struggled to get it working).
  • DL4J has a Word2Vec implementation, but the team has yet to implement cuBLAS gemm, and it's relatively slow vs. CPUs.
  • Keras is a Python deep-learning framework that utilizes Theano. While it doesn't implement word2vec as such, it does implement an embedding layer that can be used to create and query word vectors.

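The Keras embedding-layer route can be sketched as follows. Note this is a hypothetical illustration using the modern `tf.keras` API rather than the Theano backend mentioned above; the vocabulary size, dimensions, and CBOW-like model shape are all arbitrary choices, and a real setup would train on (context, target) pairs extracted from the corpus:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, embedding_dim = 1000, 50

# CBOW-like model: average the context-word embeddings, predict the target word.
model = tf.keras.Sequential([
    layers.Embedding(vocab_size, embedding_dim),
    layers.GlobalAveragePooling1D(),
    layers.Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Build the model by calling it once on a dummy batch of word indices.
_ = model(np.array([[1, 2, 3, 4]]))

# After training, the learned word vectors are the embedding layer's weight matrix.
vectors = model.layers[0].get_weights()[0]
print(vectors.shape)  # (1000, 50)
```

This gives you word vectors as a by-product of training, but you give up the heavily optimized negative-sampling loop that makes the C and gensim implementations fast.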
For a detailed view of this topic, check out the Machine Learning Certification by Intellipaat.
