
To do list:

  1. It seems that, in order to calculate the relatedness score in the Ruber metric, part of the training data is held out for validation, so a large number of query-response pairs is needed to train the model and predict relatedness reliably. The datasets that the Ruber metric has been evaluated on contained 1.5 million and 480k response pairs respectively. We want to investigate whether we can train the unreferenced-score model on a different dataset (e.g. toronto_books_in_sentences) instead of Larry King, and then use it to evaluate the performance of the seq2seq model on the Larry King dataset. The reason is that our dataset, with only about 250k response pairs for training, is not large enough to split into separate training and validation portions.
  2. As figure 2 illustrates, in the Ruber metric the previous queries are not involved in calculating the unreferenced score, although the response should be related not only to the immediately preceding utterance but to the whole context. Our next step is therefore to include the history of previous utterances when computing this score (see the first sketch after this list).
  3. Human annotators tend to give high scores to generic responses such as “Yes”, “No”, or “I don’t know”, since such responses match many queries. We want to identify these responses and penalize them in the automatic evaluation process (a simple sketch follows the list).
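
For item 2, here is a minimal sketch of how the unreferenced scorer could take the dialogue history into account, assuming a PyTorch re-implementation. The overall shape (two recurrent encoders, a bilinear feature, and an MLP with a sigmoid output) follows the general Ruber unreferenced-score design, but the module name, dimensions, and the idea of feeding the history concatenated with the current query are assumptions for illustration, not the repository's actual code.

```python
# Sketch only: a context-aware unreferenced scorer (architecture details assumed).
import torch
import torch.nn as nn


class ContextUnreferencedScorer(nn.Module):
    """Unreferenced score that encodes the dialogue history together with
    the current query before scoring the candidate response."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # One encoder for the (history + query) side, one for the response.
        self.context_encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True,
                                      bidirectional=True)
        self.response_encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True,
                                       bidirectional=True)
        # Bilinear "quadratic" feature between the two encodings.
        self.bilinear = nn.Bilinear(2 * hidden_dim, 2 * hidden_dim, 1)
        self.mlp = nn.Sequential(
            nn.Linear(4 * hidden_dim + 1, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def encode(self, encoder, token_ids):
        # token_ids: (batch, seq_len) -> (batch, 2 * hidden_dim)
        _, h = encoder(self.embedding(token_ids))
        return torch.cat([h[0], h[1]], dim=-1)

    def forward(self, history_and_query_ids, response_ids):
        # history_and_query_ids: previous utterances concatenated with the
        # current query (e.g. separated by a special end-of-utterance token).
        c = self.encode(self.context_encoder, history_and_query_ids)
        r = self.encode(self.response_encoder, response_ids)
        quad = self.bilinear(c, r)                       # (batch, 1)
        features = torch.cat([c, quad, r], dim=-1)
        return self.mlp(features).squeeze(-1)            # score in (0, 1)


if __name__ == "__main__":
    scorer = ContextUnreferencedScorer(vocab_size=1000)
    context = torch.randint(1, 1000, (4, 30))   # history + query tokens
    response = torch.randint(1, 1000, (4, 12))  # candidate response tokens
    print(scorer(context, response))
```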
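
For item 3, this is a minimal sketch of one way the automatic score of a generic response could be down-weighted. The list of dull responses, the length heuristic, and the multiplicative penalty factor are all assumptions for illustration, not a settled design.

```python
# Sketch only: penalize generic responses during automatic evaluation
# (the generic-response list and the penalty factor are placeholder choices).
GENERIC_RESPONSES = {"yes", "no", "i don't know", "i am not sure", "okay"}


def penalize_generic(response: str, score: float, penalty: float = 0.5) -> float:
    """Down-weight the automatic score of a response that looks generic.

    A response is treated as generic if, after lowercasing and stripping
    punctuation, it matches a known dull response or is very short.
    """
    normalized = "".join(
        ch for ch in response.lower() if ch.isalnum() or ch in " '"
    ).strip()
    is_generic = normalized in GENERIC_RESPONSES or len(normalized.split()) <= 2
    return score * penalty if is_generic else score


if __name__ == "__main__":
    print(penalize_generic("I don't know.", 0.9))                           # penalized
    print(penalize_generic("The meeting moved to Friday afternoon.", 0.9))  # unchanged
```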