Tens of thousands of articles
"Who is said to be responsible for the financial crisis?" was one of the research questions that DoCMA has examined. More than 50,000 articles from several European countries were included in the analysis - a scale that was unthinkable in the era before Big Data. "In another project, we examined how TTIP was discussed on Twitter - compared with the discourse taking place at the time in the traditional media," says Müller.
"The added value of this method," says Rahnenführer, "basically comes from comparison." That is because the results of text mining are often not particularly surprising at first glance; they frequently confirm what was previously only a gut feeling. This also applies to an analysis of reporting in 2015 and 2016 on the so-called refugee crisis. Three national German daily newspapers were compared with the newspaper "Junge Freiheit", the media mouthpiece of the New Right. The algorithm measures average interpretation and agenda-setting, not extreme fluctuations. "But when we looked at the media coverage individually, we recognized patterns," says Müller. For example, "Junge Freiheit" emphasized different aspects of the debate than the mainstream newspapers. Its reporting was more national and less European in focus, and it raised fewer economic issues than the other newspapers. "That was actually surprising - one might have guessed that right-wing discourse leans heavily on the costs of taking in refugees. Here you can see how the populist media systematically hide the more complicated aspects of a debate and limit themselves to simple narratives."
In addition to comparison, a further value of text mining lies in the opportunity to trace longer time series, or "topic careers". This makes it possible to document the development of a topic in the media without gaps over long periods of time. Mapping, describing and comparing topic careers - this is the task of the communication scientist and his team. Statistician Jörg Rahnenführer, for his part, hopes that the collaboration will create standards for the analysis of large amounts of data in communication science. This would make it possible to extract comprehensible and, above all, reliably reproducible findings from large collections of text - findings that do not depend on a researcher's mood on any given day.
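To make the idea of a "topic career" concrete, here is a minimal sketch of how the share of a topic could be averaged per month across a collection of articles. It assumes each article already comes with a publication date and a set of topic shares; the function name `topic_career`, the topic labels and all numbers are made up for illustration and are not taken from the DoCMA analyses.

```python
from collections import defaultdict
from datetime import date


def topic_career(articles, topic):
    """Average share of `topic` per (year, month), in chronological order."""
    shares = defaultdict(list)
    for published, topic_shares in articles:
        # Articles that never mention the topic count with a share of 0.
        shares[(published.year, published.month)].append(topic_shares.get(topic, 0.0))
    return {month: sum(v) / len(v) for month, v in sorted(shares.items())}


# Made-up example data: (publication date, topic shares of the article).
articles = [
    (date(2015, 8, 3), {"economy": 0.10, "borders": 0.60}),
    (date(2015, 8, 20), {"economy": 0.30, "borders": 0.40}),
    (date(2015, 9, 5), {"economy": 0.50, "borders": 0.20}),
]

career = topic_career(articles, "economy")
for (year, month), share in career.items():
    print(f"{year}-{month:02d}: {share:.2f}")
```

Computing the same series separately for each newspaper would then allow the kind of side-by-side comparison described above.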
In order to make such almost imperceptible shifts visible, it is necessary to analyze media reporting over long periods of time. Initially, a DVD with the SPIEGEL archive was used for this, but now the university library supports the researchers with constantly updated newspaper corpora. So the data is there. However, the tools to examine the multitude of available texts still had to be developed. Fortunately, Müller found not only a receptive audience among colleagues in statistics, but also a shared research interest. "We had previously been involved in data mining in the life sciences and had just started to deal with text analysis. We were looking for an interesting application - that is, for large data sets, and for a partner who wanted to develop new methods with us," says Jörg Rahnenführer. "For us it is the most intensive cooperation so far with a discipline outside the life sciences. What I find particularly interesting is the work process. We don't just hand a complex statistical model to colleagues who want to use the tool; instead, it goes back and forth between the disciplines and is constantly being adjusted."
So instead of data mining, text mining: the statisticians put computers to work to recognize reporting patterns and group texts into categories or topics. That sounds simple - but it is not, because the algorithms make their assignments based on probabilities. At the end of the calculation process, for example, the researchers receive the information that 87 percent of a text is about football. They then have to check the plausibility of the clusters automatically created by the algorithm. Once the computers have finished calculating, the tedious work begins for the researchers.
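What a probability-based assignment looks like can be illustrated with a deliberately small sketch. It uses a simple word-probability model in the style of naive Bayes, which is an assumption made here for illustration and not necessarily the method the DoCMA researchers use. The topics, word probabilities and the `UNSEEN` value are invented.

```python
import math

# Hypothetical word probabilities per topic (illustrative values only).
TOPIC_WORDS = {
    "football": {"goal": 0.30, "match": 0.25, "coach": 0.20, "bank": 0.01},
    "finance": {"bank": 0.35, "crisis": 0.30, "debt": 0.20, "goal": 0.02},
}
UNSEEN = 0.01  # assumed probability for words a topic does not list


def topic_probabilities(text):
    """Return one probability per topic for the text (uniform prior)."""
    words = text.lower().split()
    # Sum log-probabilities to avoid numerical underflow on long texts.
    log_scores = {
        topic: sum(math.log(probs.get(w, UNSEEN)) for w in words)
        for topic, probs in TOPIC_WORDS.items()
    }
    # Shift by the maximum before exponentiating, then normalize to sum to 1.
    top = max(log_scores.values())
    weights = {t: math.exp(s - top) for t, s in log_scores.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


def top_words(topic, n=3):
    """List a topic's most probable words for a human plausibility check."""
    probs = TOPIC_WORDS[topic]
    return sorted(probs, key=probs.get, reverse=True)[:n]


probs = topic_probabilities("the coach praised the goal after the match")
for topic, p in probs.items():
    print(f"{topic}: {p:.1%}  (top words: {', '.join(top_words(topic))})")
```

The list of top words is the part a researcher would actually read: if the words grouped under a topic do not belong together, the cluster is not plausible, however confident the numbers look.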