Published Feb 23, 2021 by Yeoun Yi
When I took part in advertising contests, I googled for words that rhyme with or sound like my keywords in order to write memorable slogans.
Computers are still not as good at writing as most people. However, there is something they are particularly good at: memory!
They can find tons of sound-alike words in a second. I am trying to build a slogan-generating model that takes advantage of this capability of computers. Given a keyword such as 'cake', I want to find a sound-alike word such as 'bake' and write a slogan that contains both. It is difficult to use an auto-regressive model and force the result to contain certain keywords, because such models only predict the next token given the previous ones.
So I use a Masked Language Model instead, referring to Rhyme With AI.
But what I want is not slogans that rhyme; I want slogans with phonetic similarity that is not limited to the last token of the sentence, so that my slogans can have a more flexible structure.
To do this, I need to predict the number of <mask> tokens to insert between the keywords, and then fill the masks in. The filling step is a typical MaskedLM: given a sequence with <mask> tokens, the model has to predict the real token behind each <mask>.
The difference is that there are many <mask> tokens in this model, not just one.
To improve the performance, I added a refinement step: after predicting the multiple <mask> tokens, the model masks a random token (excluding the keywords) and predicts that token again.
With refinement, the performance improved a lot, especially the BLEU score. After generating multiple candidates, I want this model to evaluate the candidates and find the best one.
There are three criteria for this evaluation. For grammaticality, I used a pretrained CoLA model, and a separate pretrained model for sentiment analysis.
#advertising
#nlp
#generation
#languagemodel
1. Find Keywords
Given a keyword such as 'cake', I want to find sound-alike words, such as 'bake', and write a slogan that contains both of these words.
To do this, one more condition is actually needed: means-alike. 'cake' and 'snake' sound quite similar, but it would be very difficult to write a natural sentence using both of them, because their meanings are not related at all.
I used the Datamuse API to do this. There are 'means like' (ml) and 'sounds like' (sl) parameters in this API.
Below is the query I used to search for sound-alike & means-alike words for the keyword 'cake':
https://api.datamuse.com/words?ml=cake&sl=cake
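For example, a minimal query in Python could look like the sketch below (assuming the requests package; the helper name and the candidate limit are my own illustration, not code from the post):

import requests

def sound_and_meaning_alike(keyword, max_candidates=10):
    # ask Datamuse for words that both mean something like and sound like the keyword
    response = requests.get(
        "https://api.datamuse.com/words",
        params={"ml": keyword, "sl": keyword, "max": max_candidates},
    )
    response.raise_for_status()
    # each item looks like {"word": "...", "score": ...}; drop the keyword itself
    return [item["word"] for item in response.json() if item["word"] != keyword]

print(sound_and_meaning_alike("cake"))  # e.g. ['bake', ...]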
2. Generate Slogans
2.1. Mask Insertion
input: <s> bake cake </s> → output: [0,1,2]
Given input in the <s> keyword1 keyword2 </s> format, the Mask Insertion Model predicts how many <mask> tokens are needed between <s> and keyword1, between keyword1 and keyword2, and between keyword2 and </s>. In the example above, the output [0, 1, 2] means no mask before 'bake', one mask between 'bake' and 'cake', and two masks after 'cake'.
I referred to the Levenshtein Transformer to implement this and used a pretrained RoBERTa model from Hugging Face.
When predicting the number of <mask> tokens between <s> and keyword1, I concatenated the hidden states of the two tokens, fed them into linear layers, and used Cross Entropy Loss to compute the loss.

import torch
from torch.nn import CrossEntropyLoss
from transformers import RobertaModel
# in older transformers versions this module was transformers.modeling_roberta
from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class MaskClassifier(RobertaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config=config)
        self.roberta = RobertaModel(config)
        self.max_mask = 5
        self.hidden_size = config.hidden_size
        self.linear1 = torch.nn.Linear(2 * self.hidden_size, self.hidden_size)
        self.linear2 = torch.nn.Linear(self.hidden_size, self.max_mask + 1)
        self.softmax = torch.nn.Softmax(dim=1)
        self.init_weights()

    def forward(self, input_ids, attention_mask, token_type_ids, labels=None):
        # Feed input to RoBERTa; token_type_ids are not passed to the encoder,
        # they are only used below to locate where keyword2 starts and ends.
        outputs = self.roberta(input_ids, attention_mask)
        # last hidden state: (batch_size, sequence_length, hidden_size)
        org_hidden = outputs[0]
        batch_size = org_hidden.shape[0]
        # 4 slots (<s>, keyword1, keyword2, </s>), in case a keyword is tokenized into multiple tokens
        hidden = torch.zeros((batch_size, 4, self.hidden_size)).to(device)
        hidden[:, 0, :] = org_hidden[:, 0, :]  # <s> hidden state
        # index of the first 1 in token_type_ids (where keyword2 begins)
        first_1_idx = [list(t).index(1) for t in token_type_ids]
        # index right after the last 1 in token_type_ids (where </s> or <pad> begins)
        sec_0_idx = [len(t) - 1 - list(t)[::-1].index(1) + 1 for t in token_type_ids]
        for i in range(batch_size):
            hidden[i, 1, :] = torch.mean(org_hidden[i, 1:first_1_idx[i], :], dim=0)
            hidden[i, 2, :] = torch.mean(org_hidden[i, first_1_idx[i]:sec_0_idx[i], :], dim=0)
            hidden[i, 3, :] = torch.mean(org_hidden[i, sec_0_idx[i]:, :], dim=0)
        # logits: (batch_size, 3, max_mask + 1), one distribution of mask counts per gap
        logits = torch.zeros((batch_size, 3, self.max_mask + 1)).to(device)
        for i in range(3):
            # concatenate the hidden states of the two tokens surrounding gap i: (batch_size, 2*hidden_size)
            concated_h = torch.cat((hidden[:, i, :], hidden[:, i + 1, :]), dim=1)
            logits[:, i] = self.softmax(self.linear2(self.linear1(concated_h)))
        if labels is not None:
            labels = labels.to(device)
            criterion = CrossEntropyLoss()
            # logits.permute(0, 2, 1): (batch_size, max_mask + 1, 3); labels: (batch_size, 3)
            loss = criterion(logits.permute(0, 2, 1), labels)
            # tuple of (loss, prediction) when labels are provided
            return (loss, logits)
        else:
            return logits
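As a rough usage sketch (not from the original post: the build_inputs helper and the plain roberta-base checkpoint are my own illustration; in practice the fine-tuned MaskClassifier weights would be loaded), the model can be asked for the three gap sizes like this:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# a fine-tuned MaskClassifier checkpoint would normally be loaded here
model = MaskClassifier.from_pretrained("roberta-base").to(device).eval()

def build_inputs(keyword1, keyword2):
    # <s> keyword1 keyword2 </s>, with token_type_ids marking the keyword2 subwords with 1
    ids1 = tokenizer.encode(" " + keyword1, add_special_tokens=False)
    ids2 = tokenizer.encode(" " + keyword2, add_special_tokens=False)
    input_ids = [tokenizer.bos_token_id] + ids1 + ids2 + [tokenizer.eos_token_id]
    token_type_ids = [0] + [0] * len(ids1) + [1] * len(ids2) + [0]
    attention_mask = [1] * len(input_ids)
    as_batch = lambda x: torch.tensor([x]).to(device)
    return as_batch(input_ids), as_batch(attention_mask), as_batch(token_type_ids)

input_ids, attention_mask, token_type_ids = build_inputs("bake", "cake")
with torch.no_grad():
    logits = model(input_ids, attention_mask, token_type_ids)
# one predicted mask count per gap; a trained model should give something like [0, 1, 2]
print(logits.argmax(dim=2)[0].tolist())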
2.2. Masked Language Model
input: <s> bake <mask> cake <mask> <mask> </s> → output: <s> bake a cake with love </s>
import random
from transformers.models.roberta.modeling_roberta import RobertaLMHead

class MaskedLM(RobertaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config=config)
        self.roberta = RobertaModel(config)
        self.lm_head = RobertaLMHead(config)
        self.refinement_num = 3
        self.mask_id = 50264  # id of <mask> in the RoBERTa tokenizer
        self.init_weights()

    def forward(self, input_ids, attention_mask, labels=None):
        batch_size = input_ids.size(0)
        if labels is not None:
            labels = labels.to(device)
        outputs = self.roberta(input_ids, attention_mask)
        prediction_scores = self.lm_head(outputs[0])
        # logits, hidden_states, attentions
        outputs = (prediction_scores,) + outputs[2:]  # add hidden states and attentions if they are here
        if labels is not None:
            loss_fct = CrossEntropyLoss()
            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
            outputs = (masked_lm_loss,) + outputs  # loss, logits, hidden_states, attentions
        # predictions: (batch_size, sequence_length)
        predictions = prediction_scores.argmax(dim=2)

        # REFINEMENT: re-mask one random non-keyword token per example and predict it again
        for n in range(self.refinement_num):
            for b in range(batch_size):
                # keep keywords from changing (exclude <s>=0, <pad>=1, </s>=2 and <mask>)
                keyword_dict = {idx: iid for idx, iid in enumerate(input_ids[b])
                                if iid not in [0, 1, 2, self.mask_id]}
                for key in keyword_dict:
                    predictions[b][key] = keyword_dict[key]
                # re-mask a random predicted token, never a keyword
                remask_idx = self.find_refinable(predictions[b], list(keyword_dict.keys()), attention_mask[b])
                predictions[b][remask_idx] = self.mask_id
            outputs = self.roberta(predictions, attention_mask)
            prediction_scores = self.lm_head(outputs[0])
            predictions = prediction_scores.argmax(dim=2)
            outputs = (prediction_scores,) + outputs[2:]
            if labels is not None:
                masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
                outputs = (masked_lm_loss,) + outputs  # loss, logits, hidden_states, attentions
        return outputs

    def find_refinable(self, prediction, keyword_idx, attention_mask):
        # refinable positions: everything except keywords, <s>, </s> and <pad>
        refinable_idx = [idx for idx, token in enumerate(prediction)
                         if idx != 0 and idx not in keyword_idx and
                         (0 not in list(attention_mask) or idx < list(attention_mask).index(0) - 1)]
        return random.choice(refinable_idx)

    # needed for resize_token_embeddings
    def get_output_embeddings(self):
        return self.lm_head.decoder

    # needed for resize_token_embeddings
    def set_output_embeddings(self, new_embeddings):
        self.lm_head.decoder = new_embeddings
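Continuing the usage sketch above, the two models can be chained: the predicted mask counts are turned into a masked sequence and the MaskedLM fills in the blanks. The insert_masks helper below is my own illustration, and again a fine-tuned checkpoint would be used instead of plain roberta-base.

def insert_masks(keywords, mask_counts):
    # e.g. keywords=["bake", "cake"], mask_counts=[0, 1, 2]
    #  -> <s> bake <mask> cake <mask> <mask> </s>
    ids = [tokenizer.bos_token_id] + [tokenizer.mask_token_id] * mask_counts[0]
    for keyword, n_masks in zip(keywords, mask_counts[1:]):
        ids += tokenizer.encode(" " + keyword, add_special_tokens=False)
        ids += [tokenizer.mask_token_id] * n_masks
    ids.append(tokenizer.eos_token_id)
    return torch.tensor([ids]).to(device)

lm = MaskedLM.from_pretrained("roberta-base").to(device).eval()
input_ids = insert_masks(["bake", "cake"], [0, 1, 2])
attention_mask = torch.ones_like(input_ids)
with torch.no_grad():
    prediction_scores = lm(input_ids, attention_mask)[0]
slogan_ids = prediction_scores.argmax(dim=2)[0]
print(tokenizer.decode(slogan_ids, skip_special_tokens=True))  # e.g. "bake a cake with love"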
3. Evaluate Slogans
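As described above, the candidates are scored for grammaticality (with a pretrained CoLA model) and sentiment. The evaluation code is not reproduced here, so below is only a hedged sketch using off-the-shelf Hugging Face pipelines; the textattack/roberta-base-CoLA checkpoint name and the simple sum of scores are my own assumptions, not the post's exact setup.

from transformers import pipeline

# grammaticality: a classifier fine-tuned on CoLA (linguistic acceptability)
grammar_scorer = pipeline("text-classification", model="textattack/roberta-base-CoLA")
# sentiment: the default sentiment-analysis pipeline
sentiment_scorer = pipeline("sentiment-analysis")

def score_candidate(sentence):
    grammar = grammar_scorer(sentence)[0]
    sentiment = sentiment_scorer(sentence)[0]
    # for textattack CoLA checkpoints, LABEL_1 usually means "acceptable"
    grammar_score = grammar["score"] if grammar["label"] == "LABEL_1" else 1 - grammar["score"]
    sentiment_score = sentiment["score"] if sentiment["label"] == "POSITIVE" else 1 - sentiment["score"]
    return grammar_score + sentiment_score  # simple sum; the actual weighting may differ

candidates = ["bake a cake with love", "bake cake cake the the"]
print(max(candidates, key=score_candidate))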
© Yeoun Yi