Van Son: "Suppose you are wondering whether or not you should have your child vaccinated. Language models could be used in applications that help provide an overview of all the claims surrounding the safety and effectiveness of vaccinations. Which sources argue in favour of vaccinations? Which ones advise against them? What are the underlying arguments?"
Limitations of existing datasets
In the study, Van Son first analysed several existing datasets and concluded that they do not work optimally: "For instance, some datasets are based on artificial text. Moreover, many datasets do not take into account the different perspectives that can be expressed in a text, even though this often occurs, for example in news reports or on social media."
New technique
Van Son therefore decided to look for ways to develop new datasets that are more representative of natural language use and take into account different points of view in a text. This led to the PANLI (Perspective-Aware Natural Language Inference) dataset. "This dataset was constructed from texts about vaccinations, where sentences were paired based on their meaning. Each sentence pair was then assessed by multiple individuals, who were given the task of determining the relationship between the two sentences. This involved distinguishing between the point of view of the author of the sentence and named sources. The final dataset reflects the different layers of subjectivity in both textual meaning and human interpretation, which may represent a major advance for application of language models in practice."
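To make the description above more concrete, here is a minimal Python sketch of how one record in a perspective-aware dataset might be represented. It is an illustration only, not the actual PANLI format: the class names `Judgment` and `SentencePair`, the field names, the example sentences, and the three-way label set (the labels commonly used in natural language inference) are all assumptions that do not come from the study.

```python
from collections import Counter
from dataclasses import dataclass, field

# Assumed label set (the usual NLI labels); the article does not list PANLI's labels.
LABELS = ("entailment", "contradiction", "neutral")


@dataclass
class Judgment:
    annotator: str    # hypothetical annotator id
    perspective: str  # "author", or the name of a source mentioned in the sentence
    label: str        # one of LABELS


@dataclass
class SentencePair:
    premise: str
    hypothesis: str
    judgments: list[Judgment] = field(default_factory=list)

    def label_distribution(self, perspective: str) -> dict[str, float]:
        """Share of annotators choosing each label for one perspective."""
        counts = Counter(
            j.label for j in self.judgments if j.perspective == perspective
        )
        total = sum(counts.values())
        if total == 0:
            return {}  # nobody judged this perspective
        return {label: counts[label] / total for label in LABELS}


# Invented example: the author reports a claim, the named source makes it.
pair = SentencePair(
    premise="Health officials say the vaccine is safe.",
    hypothesis="The vaccine is safe.",
    judgments=[
        Judgment("a1", "author", "neutral"),
        Judgment("a2", "author", "neutral"),
        Judgment("a3", "author", "entailment"),
        Judgment("a1", "health officials", "entailment"),
        Judgment("a2", "health officials", "entailment"),
        Judgment("a3", "health officials", "entailment"),
    ],
)

print(pair.label_distribution("author"))
print(pair.label_distribution("health officials"))
```

Keeping the full distribution of judgments per perspective, rather than a single majority label, is one way to preserve both layers of subjectivity the quote mentions: who holds the view expressed in the text, and how differently readers interpret it.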