How do I connect word2vec with the Minimum Spanning Tree (MST) algorithm?
I collected a dataset of 1.3 million documents and ran it through word2vec. Now I want to use an MST to get topic clusters for these documents.
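import pandas as pd
import gensim.models.word2vec as w2v
import networkx as nx
import matplotlib.pyplot as plt

# read_excel takes no encoding argument in current pandas; .xlsx files are already Unicode
df = pd.read_excel('history_of_groups_by_user.xlsx', header=None)
df = df.dropna(subset=[0])
# keep only text rows, dropping cells that were parsed as integers
df = pd.DataFrame([item for item in df[0].values if not isinstance(item, int)])

# one list of lowercase tokens per document
text = [doc.lower().split() for doc in df[0]]

model = w2v.Word2Vec(
    sentences=text,
    seed=42,
    vector_size=50,  # this parameter is called `size` in gensim versions before 4.0
    min_count=5,
    window=4,
    sample=1e-3)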
In general, Word2Vec converts each word into a vector, in your case of dimension 50. Next, you need to build a vector for each whole document, for example by averaging the vectors of all its words.
Pseudocode:
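import numpy as np

def build_vector(tokens):
    # Collect vectors per document; words rarer than min_count have no vector and are skipped
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    if not vectors:
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)

df['vector'] = df[0].apply(lambda t: build_vector(t.lower().split()))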
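Once every document has a vector, one way to get to the MST is to treat documents as nodes, use cosine distance as the edge weight, build the spanning tree, and cut its longest edges so that the remaining pieces are the clusters. The sketch below is only an illustration under those assumptions: the function names (`cosine_distance`, `minimum_spanning_tree`, `mst_clusters`) and the toy vectors are made up for the example, and it uses plain Python so it runs without extra packages. It compares every pair of documents, so for 1.3 million documents you would most likely need to work on a sample or on a sparse nearest-neighbour graph instead.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; close to 0 for vectors pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def minimum_spanning_tree(vectors):
    """Prim's algorithm on the complete graph of documents.

    Returns the tree as a list of (distance, i, j) edges.
    """
    n = len(vectors)
    in_tree = [False] * n
    # cheapest known edge into each node: (distance, parent index)
    best = [(math.inf, -1)] * n
    if n:
        best[0] = (0.0, -1)
    edges = []
    for _ in range(n):
        # pick the node outside the tree with the cheapest connecting edge
        i = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k][0])
        in_tree[i] = True
        if best[i][1] >= 0:
            edges.append((best[i][0], best[i][1], i))
        # relax the edges from the newly added node
        for j in range(n):
            if not in_tree[j]:
                d = cosine_distance(vectors[i], vectors[j])
                if d < best[j][0]:
                    best[j] = (d, i)
    return edges


def mst_clusters(vectors, n_clusters):
    """Cut the n_clusters - 1 longest MST edges and label the components."""
    edges = sorted(minimum_spanning_tree(vectors))
    kept = edges[:max(0, len(edges) - (n_clusters - 1))]
    parent = list(range(len(vectors)))

    def find(x):
        # union-find lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _, i, j in kept:
        parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(vectors))]


# toy document vectors: two point roughly along x, two roughly along y
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = mst_clusters(vectors, 2)
print(labels)  # [0, 0, 1, 1]
```

With the real data you would pass `list(df['vector'])` (or a sample of it) instead of the toy vectors. Since networkx is already imported in the question, `nx.minimum_spanning_tree` on a weighted graph could replace the hand-written Prim loop; the number of clusters, or a distance threshold for cutting edges, is something you have to choose yourself.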