Build a Model with Word2Vec

Good day, community.

Has anyone already made it through the Word2Vec badge?

I've been wanting to get a taste of ML, so I started this trail: https://trailhead.salesforce.com/en/content/learn/trails/explore-deep-learning-for-nlp. I finished the first module, but I've hit a snag building the model for Word2Vec.

The first difficulty was with the "Construct examples for each W2V variant" step; details are here: https://developer.salesforce.com/forums/ForumsMain?id=9062I000000g6UaQAI
And even now that those examples do get built, for one of the questions (9) none of the offered answers matches.
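
For reference, a toy sketch of how I understand the examples for the two variants (the sentence and window size here are my own assumptions, not the notebook's):

# Toy sketch: building training examples for both Word2Vec variants.
# Window size 2 is an assumption; the notebook may use a different value.
tokens = "the quick brown fox jumps over the lazy dog".split()
window = 2

cbow_examples = []      # (context words, center word)
skipgram_examples = []  # (center word, one context word)
for i, center in enumerate(tokens):
  context = [tokens[j]
             for j in range(max(0, i - window), min(len(tokens), i + window + 1))
             if j != i]
  cbow_examples.append((context, center))
  for c in context:
    skipgram_examples.append((center, c))

print(cbow_examples[2])      # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_examples[0])  # ('the', 'quick')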

And now I'm at the "Define the Word2Vec models" step. I wrote roughly this:

import torch
import torch.nn as nn

class Word2VecModel(nn.Module):
  
  def __init__(self, vocab_size, embedding_size=300):
    super().__init__()
    self.vocab_size = vocab_size
    self.embedding_size = embedding_size
    # TODO: Use randn to initialize a tensor for vocab_size vectors each
    # of size embedding_size
    vectors = torch.randn(vocab_size, embedding_size)
    self.vectors = nn.Parameter(vectors)
    
  def forward(self, batch, samples=None):
    inputs = batch[0]
    # TODO: Obtain the input vector portion (the first self.embedding_size//2
    # entries of the input vectors) of self.vectors for inputs
    input_vectors = self.vectors[0:self.embedding_size//2]
    print(input_vectors.size())
    # TODO: If this is a CBOW model, 
    # compute a continuous bag-of-words over the input vectors
    if input_vectors.dim() == 3:
      input_vectors = input_vectors[0]
    if len(batch) == 2: # Full Softmax
      # TODO: obtain the output portion of all vectors 
      # HINT: you can index into the last dimension of tensors with any number 
      # of dimensions by using the following notation tensor[..., idx]
      target_vectors = self.vectors[self.embedding_size//2:self.embedding_size]
      print(target_vectors.size())
      print("Target vectors")
      print(torch.transpose(target_vectors, 0, -1))
      # TODO: compute scores between input and output vectors
      # via matrix multiplication (you'll need to transpose output_vectors)
      scores = input_vectors * torch.transpose(target_vectors, -1, 0)
    else: # Negative Sampling
      outputs = batch[1]
      # TODO: obtain the output vectors only for samples
      output_vectors = samples
      # TODO: compute scores between the input vectors and the output vectors
      # First, you'll need to expand the input vectors along dimension 1 using 
      # unsqueeze() so that the input vectors are now of shape (batch_size, 1, embedding_size)
      input_vectors = torch.unsqueeze(input_vectors, 0)
      # Then you'll need to transpose the output_vectors along dimensions 1 and 2
      # so that the output_vectors is of shape (batch_size, embedding_size, k + 1)
      output_vectors = torch.transpose(output_vectors, 0, 1)
      # Now a matrix multiply should yield a tensor of size (batch_size, 1, k + 1)
      # and you can get the matrix of scores by calling squeeze() to get a tensor
      # of size (batch_size, k + 1), which should match the size of labels
      scores = torch.squeeze(input_vectors * output_vectors)
    return scores

It seems like tensors of size [150, 300] and [300, 150] (the transposed one) should multiply, but I get this error:
The size of tensor a (300) must match the size of tensor b (150) at non-singleton dimension 1
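
As far as I can tell, the message happens because * in PyTorch is elementwise multiplication with broadcasting, not matrix multiplication, and shapes [150, 300] and [300, 150] cannot be broadcast against each other. A quick snippet of my own to illustrate the difference:

import torch

a = torch.randn(150, 300)
b = torch.transpose(a, 0, 1)  # shape (300, 150)

# a * b attempts elementwise multiplication with broadcasting and raises
# the same "must match ... at non-singleton dimension 1" error.
# Matrix multiplication is torch.matmul (or the @ operator):
scores = torch.matmul(a, b)
print(scores.size())  # torch.Size([150, 150])

But even with matmul I am not sure my indexing of self.vectors is what the exercise intends.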

That's where I'm stuck.
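
In case it helps anyone comparing notes, below is a sketch of how I would now fill in the TODOs. It assumes each embedding_size vector is split into an input half (the first embedding_size//2 entries) and an output half (the rest); that is my reading of the hints, not an official solution:

import torch
import torch.nn as nn

class Word2VecModelSketch(nn.Module):
  """A guess at the intended solution; the input/output split of each
  vector is an assumption based on the TODO comments."""

  def __init__(self, vocab_size, embedding_size=300):
    super().__init__()
    self.vocab_size = vocab_size
    self.embedding_size = embedding_size
    self.vectors = nn.Parameter(torch.randn(vocab_size, embedding_size))

  def forward(self, batch, samples=None):
    inputs = batch[0]
    half = self.embedding_size // 2
    # Rows for the input word ids, then the input half of each vector.
    input_vectors = self.vectors[inputs][..., :half]
    # CBOW: inputs are (batch, context_size), so average over the context.
    if input_vectors.dim() == 3:
      input_vectors = input_vectors.mean(dim=1)
    if len(batch) == 2:  # full softmax
      # Output half of every vocabulary vector: (vocab_size, half).
      target_vectors = self.vectors[..., half:]
      # (batch, half) @ (half, vocab_size) -> (batch, vocab_size)
      scores = torch.matmul(input_vectors, torch.transpose(target_vectors, 0, 1))
    else:  # negative sampling
      # Output half of the sampled word ids' vectors: (batch, k + 1, half).
      output_vectors = self.vectors[samples][..., half:]
      # (batch, 1, half) @ (batch, half, k + 1) -> (batch, 1, k + 1) -> (batch, k + 1)
      scores = torch.matmul(input_vectors.unsqueeze(1),
                            torch.transpose(output_vectors, 1, 2)).squeeze(1)
    return scores

# Shape check with made-up sizes (32 examples, 5 negative samples):
model = Word2VecModelSketch(vocab_size=5000)
inputs = torch.randint(0, 5000, (32,))
samples = torch.randint(0, 5000, (32, 6))  # 1 target + 5 negatives
print(model((inputs, None, None), samples).size())  # torch.Size([32, 6])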
