How do I connect word2vec with the Minimum Spanning Tree (MST) algorithm?

nasdi, 2019-07-21 20:20:50

Python

I collected a dataset of 1.3 million documents and trained a word2vec model on it. Now I want to use an MST to get topic clusters for these documents.

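import pandas as pd
import gensim.models.word2vec as w2v
import networkx as nx
import matplotlib.pyplot as plt

# .xlsx files store text as Unicode, so read_excel takes no encoding argument
df = pd.read_excel('history_of_groups_by_user.xlsx', header=None)
df = df.dropna(subset=[0])
# Keep only string cells; anything else would break .lower() below
df = pd.DataFrame([item for item in df[0].values if isinstance(item, str)])

# Tokenize: lowercase each document and split it on whitespace
text = [doc.lower().split() for doc in df[0]]

model = w2v.Word2Vec(
    sentences=text,
    seed=42,
    vector_size=50,  # this parameter is called size in gensim < 4.0
    min_count=5,     # words seen fewer than 5 times get no vector
    window=4,
    sample=1e-3)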

1 answer(s)

Danil, 2019-07-22
@DanilBaibak

In general, Word2Vec maps each word to a vector, in your case of dimension 50. Next, you need to build a vector for each whole document, for example by averaging the vectors of all the words in it.
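A sketch that reuses model and the tokenized text list from the question:

import numpy as np

def build_vector(tokens):
    # Start a fresh list per document; skip tokens without a vector
    # (words rarer than min_count are not in the vocabulary)
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    if not vectors:
        # No known words at all: fall back to a zero vector (one possible choice)
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)

# text[i] holds the tokens of the document in row i of df
df['vector'] = [build_vector(tokens) for tokens in text]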

As a result, each document is represented by a 50-dimensional vector, which can be passed to any downstream algorithm, MST-based clustering included.
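To tie this back to the MST part of the question, here is a minimal sketch of one common approach: build a graph whose edge weights are cosine distances between document vectors, take its minimum spanning tree with networkx, and cut the heaviest tree edges so that each remaining connected component becomes a cluster. The function name mst_clusters, the n_clusters parameter, and the toy vectors are assumptions made for illustration; they are not part of the original code.

```python
import numpy as np
import networkx as nx


def mst_clusters(vectors, n_clusters):
    """Cluster row vectors by cutting the heaviest edges of their MST.

    Hypothetical helper for illustration; not taken from the thread.
    """
    vectors = np.asarray(vectors, dtype=float)
    n = len(vectors)

    # Cosine distance between every pair of rows
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    dist = 1.0 - unit @ unit.T

    # Complete weighted graph: O(n^2) edges, so only feasible for small samples
    graph = nx.Graph()
    graph.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            graph.add_edge(i, j, weight=dist[i, j])

    mst = nx.minimum_spanning_tree(graph, weight="weight")

    # A tree with k - 1 edges removed falls apart into k connected components
    heaviest = sorted(mst.edges(data="weight"), key=lambda e: e[2], reverse=True)
    mst.remove_edges_from([(u, v) for u, v, _ in heaviest[: n_clusters - 1]])

    labels = np.empty(n, dtype=int)
    for label, component in enumerate(nx.connected_components(mst)):
        labels[list(component)] = label
    return labels


# Toy data standing in for document vectors: two well-separated groups
toy_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = mst_clusters(toy_vectors, n_clusters=2)
```

With real data you would pass the stacked document vectors, for example np.vstack(df['vector']). Note that a complete graph over 1.3 million documents has roughly 8 × 10^11 edges, so in practice you would probably run this on a sample, or build the graph from each document's nearest neighbours instead of from all pairs.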