Published Feb 23, 2021 by Yeoun Yi
When I took part in advertising contests, I googled for words that rhyme with or sound like my keywords in order to write memorable slogans.
Computers are still not as good at writing as most people. However, there is something they are particularly good at: memory!
They can find tons of sound-alike words in a second. I am trying to build a slogan-generating model that takes advantage of this capability of computers. Given a keyword such as 'cake', I want to find a sound-alike word such as 'bake' and write a slogan that contains both. It is difficult to use an auto-regressive model and force the result to contain certain keywords, because such models only predict the next token given the previous ones.
So I use a Masked Language Model instead, referring to Rhyme With AI.
But what I want is not slogans that rhyme; I want slogans with phonetic similarity that is not limited to the last token of the sentence, so that my slogans can have a more flexible structure.
To do this, I need to predict the number of <mask> tokens to insert between the keywords, and then fill the masks in. The filling step is a typical MaskedLM: given a sequence with <mask> tokens, the model has to predict the real token behind each <mask>.
The difference is that there are many <mask> tokens in this model, not just one.
To improve the performance, I added a refinement step: after predicting the multiple <mask> tokens, the model masks a random token (excluding the keywords) and predicts that token again.
With refinement, the performance improved a lot, especially the BLEU score. After generating multiple candidates, I want this model to evaluate the candidates and find the best one.
There are three criteria for this evaluation. For grammaticality, I used a pretrained CoLA model, and a separate pretrained model for sentiment analysis.
#advertising
#nlp
#generation
#languagemodel
1. Find Keywords
Given a keyword such as 'cake', I want to find sound-alike words, such as 'bake', and write a slogan that contains both of these words.
To do this, one more condition is actually needed: means-alike. 'cake' and 'snake' sound quite similar, but it would be very difficult to write a natural sentence using both of them, because their meanings are not related at all.
I used the Datamuse API to do this. There are 'means like' (ml) and 'sounds like' (sl) parameters in this API.
Below is the query I used to search for sound-alike & means-alike words for the keyword 'cake':
https://api.datamuse.com/words?ml=cake&sl=cake
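For example, a minimal query in Python could look like the sketch below (assuming the requests package; the helper name and the candidate limit are my own illustration, not code from the post):

import requests

def sound_and_meaning_alike(keyword, max_candidates=10):
    # ask Datamuse for words that both mean something like and sound like the keyword
    response = requests.get(
        "https://api.datamuse.com/words",
        params={"ml": keyword, "sl": keyword, "max": max_candidates},
    )
    response.raise_for_status()
    # each item looks like {"word": "...", "score": ...}; drop the keyword itself
    return [item["word"] for item in response.json() if item["word"] != keyword]

print(sound_and_meaning_alike("cake"))  # e.g. ['bake', ...]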
2. Generate Slogans
2.1. Mask Insertion
input: <s> bake cake </s> → output: [0,1,2]
Given input in the <s> keyword1 keyword2 </s> format, the Mask Insertion Model predicts how many <mask> tokens are needed between <s> and keyword1, between keyword1 and keyword2, and between keyword2 and </s>. In the example above, the output [0, 1, 2] means no mask before 'bake', one mask between 'bake' and 'cake', and two masks after 'cake'.
I referred to the Levenshtein Transformer to implement this and used a pretrained RoBERTa model from Hugging Face.
When predicting the number of <mask> tokens between <s> and keyword1, I concatenated the hidden states of the two tokens, fed them into linear layers, and used Cross Entropy Loss to compute the loss.

import torch
from torch.nn import CrossEntropyLoss
from transformers import RobertaModel
# in older transformers versions this module was transformers.modeling_roberta
from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class MaskClassifier(RobertaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config=config)
        self.roberta = RobertaModel(config)
        self.max_mask = 5
        self.hidden_size = config.hidden_size
        self.linear1 = torch.nn.Linear(2 * self.hidden_size, self.hidden_size)
        self.linear2 = torch.nn.Linear(self.hidden_size, self.max_mask + 1)
        self.softmax = torch.nn.Softmax(dim=1)
        self.init_weights()

    def forward(self, input_ids, attention_mask, token_type_ids, labels=None):
        # Feed input to RoBERTa; token_type_ids are not passed to the encoder,
        # they are only used below to locate where keyword2 starts and ends.
        outputs = self.roberta(input_ids, attention_mask)
        # last hidden state: (batch_size, sequence_length, hidden_size)
        org_hidden = outputs[0]
        batch_size = org_hidden.shape[0]
        # 4 slots (<s>, keyword1, keyword2, </s>), in case a keyword is tokenized into multiple tokens
        hidden = torch.zeros((batch_size, 4, self.hidden_size)).to(device)
        hidden[:, 0, :] = org_hidden[:, 0, :]  # <s> hidden state
        # index of the first 1 in token_type_ids (where keyword2 begins)
        first_1_idx = [list(t).index(1) for t in token_type_ids]
        # index right after the last 1 in token_type_ids (where </s> or <pad> begins)
        sec_0_idx = [len(t) - 1 - list(t)[::-1].index(1) + 1 for t in token_type_ids]
        for i in range(batch_size):
            hidden[i, 1, :] = torch.mean(org_hidden[i, 1:first_1_idx[i], :], dim=0)
            hidden[i, 2, :] = torch.mean(org_hidden[i, first_1_idx[i]:sec_0_idx[i], :], dim=0)
            hidden[i, 3, :] = torch.mean(org_hidden[i, sec_0_idx[i]:, :], dim=0)
        # logits: (batch_size, 3, max_mask + 1), one distribution of mask counts per gap
        logits = torch.zeros((batch_size, 3, self.max_mask + 1)).to(device)
        for i in range(3):
            # concatenate the hidden states of the two tokens surrounding gap i: (batch_size, 2*hidden_size)
            concated_h = torch.cat((hidden[:, i, :], hidden[:, i + 1, :]), dim=1)
            logits[:, i] = self.softmax(self.linear2(self.linear1(concated_h)))
        if labels is not None:
            labels = labels.to(device)
            criterion = CrossEntropyLoss()
            # logits.permute(0, 2, 1): (batch_size, max_mask + 1, 3); labels: (batch_size, 3)
            loss = criterion(logits.permute(0, 2, 1), labels)
            # tuple of (loss, prediction) when labels are provided
            return (loss, logits)
        else:
            return logits
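As a rough usage sketch (not from the original post: the build_inputs helper and the plain roberta-base checkpoint are my own illustration; in practice the fine-tuned MaskClassifier weights would be loaded), the model can be asked for the three gap sizes like this:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# a fine-tuned MaskClassifier checkpoint would normally be loaded here
model = MaskClassifier.from_pretrained("roberta-base").to(device).eval()

def build_inputs(keyword1, keyword2):
    # <s> keyword1 keyword2 </s>, with token_type_ids marking the keyword2 subwords with 1
    ids1 = tokenizer.encode(" " + keyword1, add_special_tokens=False)
    ids2 = tokenizer.encode(" " + keyword2, add_special_tokens=False)
    input_ids = [tokenizer.bos_token_id] + ids1 + ids2 + [tokenizer.eos_token_id]
    token_type_ids = [0] + [0] * len(ids1) + [1] * len(ids2) + [0]
    attention_mask = [1] * len(input_ids)
    as_batch = lambda x: torch.tensor([x]).to(device)
    return as_batch(input_ids), as_batch(attention_mask), as_batch(token_type_ids)

input_ids, attention_mask, token_type_ids = build_inputs("bake", "cake")
with torch.no_grad():
    logits = model(input_ids, attention_mask, token_type_ids)
# one predicted mask count per gap; a trained model should give something like [0, 1, 2]
print(logits.argmax(dim=2)[0].tolist())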
2.2. Masked Language Model
input: <s> bake <mask> cake <mask> <mask> </s> → output: <s> bake a cake with love </s>
import random
from transformers.models.roberta.modeling_roberta import RobertaLMHead

class MaskedLM(RobertaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config=config)
        self.roberta = RobertaModel(config)
        self.lm_head = RobertaLMHead(config)
        self.refinement_num = 3
        self.mask_id = 50264  # id of <mask> in the RoBERTa tokenizer
        self.init_weights()

    def forward(self, input_ids, attention_mask, labels=None):
        batch_size = input_ids.size(0)
        if labels is not None:
            labels = labels.to(device)
        outputs = self.roberta(input_ids, attention_mask)
        prediction_scores = self.lm_head(outputs[0])
        # logits, hidden_states, attentions
        outputs = (prediction_scores,) + outputs[2:]  # add hidden states and attentions if they are here
        if labels is not None:
            loss_fct = CrossEntropyLoss()
            masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
            outputs = (masked_lm_loss,) + outputs  # loss, logits, hidden_states, attentions
        # predictions: (batch_size, sequence_length)
        predictions = prediction_scores.argmax(dim=2)

        # REFINEMENT: re-mask one random non-keyword token per example and predict it again
        for n in range(self.refinement_num):
            for b in range(batch_size):
                # keep keywords from changing (exclude <s>=0, <pad>=1, </s>=2 and <mask>)
                keyword_dict = {idx: iid for idx, iid in enumerate(input_ids[b])
                                if iid not in [0, 1, 2, self.mask_id]}
                for key in keyword_dict:
                    predictions[b][key] = keyword_dict[key]
                # re-mask a random predicted token, never a keyword
                remask_idx = self.find_refinable(predictions[b], list(keyword_dict.keys()), attention_mask[b])
                predictions[b][remask_idx] = self.mask_id
            outputs = self.roberta(predictions, attention_mask)
            prediction_scores = self.lm_head(outputs[0])
            predictions = prediction_scores.argmax(dim=2)
            outputs = (prediction_scores,) + outputs[2:]
            if labels is not None:
                masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
                outputs = (masked_lm_loss,) + outputs  # loss, logits, hidden_states, attentions
        return outputs

    def find_refinable(self, prediction, keyword_idx, attention_mask):
        # refinable positions: everything except keywords, <s>, </s> and <pad>
        refinable_idx = [idx for idx, token in enumerate(prediction)
                         if idx != 0 and idx not in keyword_idx and
                         (0 not in list(attention_mask) or idx < list(attention_mask).index(0) - 1)]
        return random.choice(refinable_idx)

    # needed for resize_token_embeddings
    def get_output_embeddings(self):
        return self.lm_head.decoder

    # needed for resize_token_embeddings
    def set_output_embeddings(self, new_embeddings):
        self.lm_head.decoder = new_embeddings
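Continuing the usage sketch above, the two models can be chained: the predicted mask counts are turned into a masked sequence and the MaskedLM fills in the blanks. The insert_masks helper below is my own illustration, and again a fine-tuned checkpoint would be used instead of plain roberta-base.

def insert_masks(keywords, mask_counts):
    # e.g. keywords=["bake", "cake"], mask_counts=[0, 1, 2]
    #  -> <s> bake <mask> cake <mask> <mask> </s>
    ids = [tokenizer.bos_token_id] + [tokenizer.mask_token_id] * mask_counts[0]
    for keyword, n_masks in zip(keywords, mask_counts[1:]):
        ids += tokenizer.encode(" " + keyword, add_special_tokens=False)
        ids += [tokenizer.mask_token_id] * n_masks
    ids.append(tokenizer.eos_token_id)
    return torch.tensor([ids]).to(device)

lm = MaskedLM.from_pretrained("roberta-base").to(device).eval()
input_ids = insert_masks(["bake", "cake"], [0, 1, 2])
attention_mask = torch.ones_like(input_ids)
with torch.no_grad():
    prediction_scores = lm(input_ids, attention_mask)[0]
slogan_ids = prediction_scores.argmax(dim=2)[0]
print(tokenizer.decode(slogan_ids, skip_special_tokens=True))  # e.g. "bake a cake with love"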
3. Evaluate Slogans
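As described above, the candidates are scored for grammaticality (with a pretrained CoLA model) and sentiment. The evaluation code is not reproduced here, so below is only a hedged sketch using off-the-shelf Hugging Face pipelines; the textattack/roberta-base-CoLA checkpoint name and the simple sum of scores are my own assumptions, not the post's exact setup.

from transformers import pipeline

# grammaticality: a classifier fine-tuned on CoLA (linguistic acceptability)
grammar_scorer = pipeline("text-classification", model="textattack/roberta-base-CoLA")
# sentiment: the default sentiment-analysis pipeline
sentiment_scorer = pipeline("sentiment-analysis")

def score_candidate(sentence):
    grammar = grammar_scorer(sentence)[0]
    sentiment = sentiment_scorer(sentence)[0]
    # for textattack CoLA checkpoints, LABEL_1 usually means "acceptable"
    grammar_score = grammar["score"] if grammar["label"] == "LABEL_1" else 1 - grammar["score"]
    sentiment_score = sentiment["score"] if sentiment["label"] == "POSITIVE" else 1 - sentiment["score"]
    return grammar_score + sentiment_score  # simple sum; the actual weighting may differ

candidates = ["bake a cake with love", "bake cake cake the the"]
print(max(candidates, key=score_candidate))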
© Yeoun Yi