Good afternoon, community.
Has anyone already worked through the Word2Vec badge?
I wanted to get a taste of ML, so I started this trail: https://trailhead.salesforce.com/en/content/learn/trails/explore-deep-learning-for-nlp. The first module went fine, but I'm stuck on building the model for Word2Vec.
The first difficulty was with "Construct examples for each W2V variant"; the details are here: https://developer.salesforce.com/forums/ForumsMain?id=9062I000000g6UaQAI
And even now, although those examples do get built, one of the questions (9) has no matching answer option.
Now I'm on the "Define the Word2Vec models" step. I wrote roughly the following:
import torch
import torch.nn as nn

class Word2VecModel(nn.Module):
    def __init__(self, vocab_size, embedding_size=300):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        # TODO: Use randn to initialize a tensor for vocab_size vectors each
        # of size embedding_size
        vectors = torch.randn(vocab_size, embedding_size)
        self.vectors = nn.Parameter(vectors)

    def forward(self, batch, samples=None):
        inputs = batch[0]
        # TODO: Obtain the input vector portion (the first self.embedding_size//2
        # entries of the input vectors) of self.vectors for inputs
        input_vectors = self.vectors[0:self.embedding_size//2]
        print(input_vectors.size())
        # TODO: If this is a CBOW model,
        # compute a continuous bag-of-words over the input vectors
        if input_vectors.dim() == 3:
            input_vectors = input_vectors[0]
        if len(batch) == 2:  # Full Softmax
            # TODO: obtain the output portion of all vectors
            # HINT: you can index into the last dimension of tensors with any number
            # of dimensions by using the following notation tensor[..., idx]
            target_vectors = self.vectors[self.embedding_size//2:self.embedding_size]
            print(target_vectors.size())
            print("Target vectors")
            print(torch.transpose(target_vectors, 0, -1))
            # TODO: compute scores between input and output vectors
            # via matrix multiplication (you'll need to transpose output_vectors)
            scores = input_vectors * torch.transpose(target_vectors, -1, 0)
        else:  # Negative Sampling
            outputs = batch[1]
            # TODO: obtain the output vectors only for samples
            output_vectors = samples
            # TODO: compute scores between the input vectors and the output vectors
            # First, you'll need to expand the input vectors along dimension 1 using
            # unsqueeze() so that the input vectors are now of shape (batch_size, 1, embedding_size)
            input_vectors = torch.unsqueeze(input_vectors, 0)
            # Then you'll need to transpose the output_vectors along dimensions 1 and 2
            # so that the output_vectors is of shape (batch_size, embedding_size, k + 1)
            output_vectors = torch.transpose(output_vectors, 0, 1)
            # Now a matrix multiply should yield a tensor of size (batch_size, 1, k + 1)
            # and you can get the matrix of scores by calling squeeze() to get a tensor
            # of size (batch_size, k + 1), which should match the size of labels
            scores = torch.squeeze(input_vectors * output_vectors)
        return scores
It seems like the sizes, [150, 300] for one tensor and [300, 150] for the transposed one, should multiply fine, yet I get this error:
The size of tensor a (300) must match the size of tensor b (150) at non-singleton dimension 1
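To narrow it down, I tried a small standalone check of just that multiplication, with random placeholder tensors of the same shapes as above (not my real data). Element-wise * fails with the same broadcasting error, while torch.matmul accepts these shapes, so maybe that's where the problem is, but I'm not sure that's what the exercise expects:

import torch

# Placeholder tensors with the same shapes as in my model (random, not the real vectors)
a = torch.randn(150, 300)        # stands in for input_vectors
b = torch.randn(150, 300)        # stands in for target_vectors before transposing
bt = torch.transpose(b, 0, 1)    # shape [300, 150]

try:
    # * is element-wise with broadcasting: [150, 300] vs [300, 150] cannot broadcast
    scores = a * bt
except RuntimeError as e:
    print(e)  # the same "must match ... at non-singleton dimension 1" error

# A matrix product does accept these shapes and returns a [150, 150] tensor
scores = torch.matmul(a, bt)
print(scores.size())  # torch.Size([150, 150])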
I'm stuck.